/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Reports of my death have been greatly overestimated.

Still trying to get done with some IRL work, but should be able to update some stuff soon.



Speech Synthesis general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us right?



The Taco Tron project:


No code available yet, hopefully they will release it.

>>5913 > (~ >50%) OK, that's probably an exaggeration, but it's certainly several hundreds of dialogue line examples in the whole series of 25 episodes.
>>5867
>Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character?
Perhaps, some VAs change their voice acting quite a bit between characters. It should give the model more data to work with. A big issue with 2B's voice is there aren't enough voice clips to cover every possible word, but I'm hoping this multi-speaker version will learn to fill in the blanks.
>>5913
Oh, that sucks and makes sense. Once I finish the next version of WaifuSynth I'll see if I can extend it to speech recognition because that's gonna be a big hassle for people trying to train their own waifu voices.
>>5915
Yeah, the VA for Chii is rather prolific WARNING: don't look her up, it will ruin everything haha! :^) and I thought that since Chii isn't really too diverse in her vocabulary (part of the storyline arc ofc), that perhaps the statistical modeling effect of AI might benefit if I can find another character she did that wasn't too far from Chii's 'normal' voice.
>multi-speaker fills in the blanks
That's good news. Here's hoping.
>auto voice recog
That would truly make this into an amazing all-in-one toolset Anon.
Open file (686.68 KB 1535x906 clipchan.png)
>>5916 Anyway, for now don't worry about resampling the clips. They should be the highest quality available before going into Spleeter. In Aegisub you can load up the video or audio, align the subtitles, type it in the proper line, and hit enter to proceed to the next one. When Clipchan is complete I'll make a video explaining the whole process.
>>5917 OK, thanks for the explanation. Sounds like I need to start over with this. Not sure what my timeline will be, probably somewhere around the Trump win.
Open file (168.92 KB 1024x1024 2B.jpg)
Open file (65.44 KB example.mp3)
Open file (33.03 KB example2.mp3)
Open file (17.94 KB example3.mp3)
For some reason I thought I uploaded the 2B voice model for WaifuSynth already but I didn't. You can get it now here: https://anonfiles.com/Hbe661i3p0/2b_v1_pt
>>5932 >2B CATS remake wehn? /robowaifu/ for great justice. This needs to happen.
>>5917 Welp, I just wasted an entire day trying to get Aegisub up and running with no success. Just as an offhand guess, I'm supposing you're not running it on Linux (but rather on W*ndows)?
>>5945 I quit using Windows over a decade ago. What problem are you having with it?
>>5945
It could be either an issue with FFMS:
>After upgrading my Linux distro, i ran Aegisub and got this error
>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing
>So i had to downgrade from ffms2 2.40 package to ffms2 2.23.1
https://github.com/Aegisub/Aegisub/issues/198
Or Wayland, Aegisub requires x11:
>GDK_BACKEND=x11 aegisub does not crash.
https://github.com/Aegisub/Aegisub/issues/180
>>5949
>>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing
That was exactly the error from the distro package manager install that started me down this long bunny trail. I never found that issue link in my searches. :/ Anyways, I went and downloaded the repo and tried to build from source, but then discovered I had to have wxWidgets as well, so I had to back out then build that from source (dev version took hours to finish, but at least it succeeded in the end). Afterwards, the Aegisub build failed with 'references not found' type errors. Too many to remember, and I tableflipped.exe closed the terminal in disgust after all those hours, so I can't recall exactly. Anyway thanks for the links. I'll try it again tomorrow.
>>5932 One thing I'm not perfectly clear on Anon, can WaifuSynth be used for other languages? For example, since animu is basically a Japanese art-form, can your system be used to create Japanese-speaking robowaifus? If so, would you mind explaining how any of us would go about setting something like that up please?
>>5861 OMFG anon this is awesome! Crafting 2B's perfect ass out of silicone will be challenging but this is all the motivation I need!
>>5648 Anon what happened to your gitlab profile? It is deleted, can you post your new one?
>>7560 Anyone downloaded this, at least for archiving reasons? This is also gone: https://anonfiles.com/Hbe661i3p0/2b_v1_pt from here >>5932
For singing, https://dreamtonics.com/en/synthesizerv/ has a free Eleanor, which is pretty fantastic. As with all vocaloid -type software, you have to git gud at phonemes.
>>7576 It is possible that he deleted all the contents and left robowaifu after the latest drama. He might be the anon who was involved in the latest one. If that is the case it's pretty unfortunate. I hope he recovers soon and comes back.
>>7594 Possible, but I hope this isn't it. Kind of radical. I tried to explain to him as reasonably as possible what the problem was. Whatever, I don't wanna get into that again. The more important point is: I think he gave us enough hints how to do this stuff. I'm not claiming that I could reproduce this clipchan program, but I had the same idea before I read it here. It's kind of obvious to take subtitles to harvest voices. Which means, there will be other implementations on the net doing that and explaining how to. We don't need someone to come to us or to be into anime or robowaifus, just take some other implementation from another place or have someone reproduce it based on the knowledge available.
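For anyone wanting to try the "harvest voices from subtitles" idea themselves, here is a minimal sketch of the first step: reading dialogue timings out of an Aegisub .ass file so each line can later be cut from the audio. This is not Clipchan's code. It assumes the standard v4+ event field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text); the names `Clip`, `parse_ass_time` and `parse_dialogue_lines` are made up for illustration.

```python
import re
from typing import List, NamedTuple


class Clip(NamedTuple):
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str     # transcript for this clip


def parse_ass_time(stamp: str) -> float:
    """Convert an ASS timestamp (H:MM:SS.cc) to seconds."""
    hours, minutes, seconds = stamp.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogue_lines(ass_text: str) -> List[Clip]:
    """Collect (start, end, text) for every Dialogue event.

    Assumes the standard field order; a file with a custom Format
    line would need that line parsed first.
    """
    clips = []
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue
        # Split at most 9 times so commas inside the text survive.
        fields = line[len("Dialogue:"):].split(",", 9)
        if len(fields) < 10:
            continue
        start = parse_ass_time(fields[1])
        end = parse_ass_time(fields[2])
        # Drop override tags like {\i1} and turn hard line breaks into spaces.
        text = re.sub(r"\{[^}]*\}", "", fields[9]).replace("\\N", " ").strip()
        if text and end > start:
            clips.append(Clip(start, end, text))
    return clips
```

Each `Clip` gives the time range to cut and the transcript to pair with it, which is the (audio, text) pair a TTS trainer needs.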
>>7594 What drama happened besides the migration? I've been too deep in my projects to browse like I used to.
Open file (40.97 KB example.mp3)
Open file (17.84 KB 863x454 example_eq.png)
Open file (18.51 KB 562x411 example_parameters.png)
I'll post this here for now, since it's definitely relevant. I was experimenting a little bit more with Deltavox RS and Audacity. It seems that there is no "one size fits all" solution when using Deltavox. In order to get a decent result, you have to experiment with different spellings, phonemes, energy, F0, bidirectional padding, and so on. In Audacity, I used a simple filter curve. I was able to get noticeably less tinny audio, which sounds less computer generated. I'm going to explore more options for editing the audio after it's been synthesized to improve its quality. I'll post again if I find anything interesting. I'll repost the links here since they're still relevant: Deltavox User Guide https://docs.google.com/document/d/1z9V4cDvatcA0gYcDacL5Bg-9nwdyV1vD5nsByL_a1wk/edit Download: https://mega.nz/file/CMBkzTpb#LDjrwHbK0YiKTz0YllofVuWg-De9wrmzXVwIn0EBiII
>>8150 Thanks. BTW, do you know if this is open source? Since QT dlls are included I presume this is C++ software. If both are true, then it's very likely I can rewrite this to be portable across platforms -- not just (((Wangblows))) and we can be running it on our RaspberryPis & other potatos. Thanks for all the great information Anon.
>>8244 And one based on Vocaloid: https://youtu.be/OPBba9ScdjU
Is the voice synthesis going to be for English voices or Japanese voices? Or does one also work for the other? It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that voice.
>>9110 >It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that her voice. More or less, that has already been achieved Anon. So hopefully more options along that line will soon become readily available for us all.
>>9111 Oh wow it already has? Where can I read about it/try it if possible?
>>9112 Our guy here called his WaifuSynth. ITT there are examples from the ponys, who have taken a highly-methodical approach for all the main characters in the MLP:FiM cartoon show.
I see. Though all the synths seem to be for English voices. I'm guessing the 2B, Rikka, Megumin, Rem, etc mentioned in >>5861 are referring to their English VA rather than the Japanese ones. Unless I'm missing out on something? (If so, then maybe it'd be best for me to make some time and read this whole thread.)
>>9118 AFAICT, the training approach is just a statistical system matching sounds to words based on examples. It should work for any human language I think -- though you would need to be fluent in the target language to QA the results ofc.
>>9119 Ohh, I see. One last thing: I wouldn't be wrong to assume that, since the dropping of "kokubunji", there is no one working on the voice for robowaifu?
>>9112
WaifuSynth: https://gitlab.com/robowaifudev/waifusynth
Clipchan: https://gitlab.com/robowaifudev/clipchan
There are better methods now like FastPitch and HiFiSinger. FastPitch is about 40x faster than Tacotron2/Waveglow (what WaifuSynth uses) and is less susceptible to generation errors but is still far from perfect. HiFiSinger uses three different GANs to make realistic speech and singing, and its 48kHz model outperforms the 24kHz ground truth but it still has room for improvement in 48kHz, although I suspect it could be near perfect by training a 96kHz model.
FastPitch: https://fastpitch.github.io/
HiFiSinger: https://speechresearch.github.io/hifisinger/
There's still a lot of research to be done before this stuff will be ready for production, namely imitating voices without training a new model, emotion/speech style control, and ironing out undesired realistic noises in generation. Probably in the next 2-3 years it will be easy to make any character sing given a song or read out any given text, and you won't have to go through the whole hassle of collecting audio clips and training models yourself.
>>9118
Making Japanese VAs speak English and English VAs speak Japanese should be possible but you will have to train a model that converts the input to phonemes, otherwise it will just garble and misread text. Training Tacotron2 takes a really long time so I'd recommend modifying FastPitch to use phonemes instead of characters. All you have to do is instead of inputting characters like 's a m u r a i', you input the IPA 's a m ɯ ɾ a i'. You can probably save a lot of time on training by initializing the embeddings of the IPA symbols to the character embeddings of a pretrained model, then train it on LJSpeech or another dataset until it sounds good, then fine-tune it on the desired character.
This paper reports that it only takes 20 minutes of audio to speak a new language using an IPA Tacotron2 model but they don't provide their trained model or code: https://arxiv.org/abs/2011.06392v1
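The embedding initialization trick can be sketched without any framework. The idea is to build the IPA symbol table by copying the vector of the closest pretrained character where one exists, and falling back to small random values otherwise. Everything here is illustrative: the dict-of-lists layout, the `ipa_to_char` mapping and the 0.02 standard deviation are assumptions, and in a real FastPitch checkpoint you would copy rows of the model's embedding weight tensor instead.

```python
import random
from typing import Dict, List, Sequence


def init_ipa_embeddings(
    char_emb: Dict[str, List[float]],   # pretrained character -> vector
    ipa_to_char: Dict[str, str],        # hypothetical "closest character" map
    ipa_symbols: Sequence[str],         # the new input vocabulary
    dim: int,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Warm-start an IPA embedding table from character embeddings."""
    rng = random.Random(seed)
    table = {}
    for sym in ipa_symbols:
        src = ipa_to_char.get(sym)
        if src is not None and src in char_emb:
            # Copy so later training of the IPA table cannot alias the original.
            table[sym] = list(char_emb[src])
        else:
            # No sensible counterpart: small random init (assumed scale).
            table[sym] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return table


if __name__ == "__main__":
    pretrained = {"s": [0.1, 0.2], "u": [0.3, 0.4], "r": [0.5, 0.6]}
    mapping = {"s": "s", "ɯ": "u", "ɾ": "r"}
    ipa_table = init_ipa_embeddings(pretrained, mapping, ["s", "ɯ", "ɾ", "ɕ"], dim=2)
    print(ipa_table["ɯ"])  # same values as the pretrained 'u' vector
```

Symbols that start from a related character's vector should need far less training to sound right than ones starting from noise, which is where the time saving would come from.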
>>9121 Also, you can use https://ichi.moe/ to convert Japanese subtitles from https://kitsunekko.net/dirlist.php?dir=subtitles%2Fjapanese%2F into romaji and then convert the romaji to IPA. Japanese IPA is straightforward since the syllables sound exactly the same as they are written, unlike English: https://en.wikipedia.org/wiki/Help:IPA/Japanese
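A rough sketch of the romaji to IPA step, assuming Hepburn-style romaji as input. The table only covers the handful of places where the romaji letter differs from the IPA symbol and is deliberately simplified: long vowels, doubled consonants, moraic n and palatalized consonants are not handled, so check the output against the Wikipedia chart before training on it. `ROMAJI_TO_IPA` and `romaji_to_ipa` are names made up for this example.

```python
# Simplified mapping: anything not listed is passed through unchanged.
ROMAJI_TO_IPA = {
    "sh": "ɕ",
    "ch": "tɕ",
    "ts": "ts",
    "j": "dʑ",
    "f": "ɸ",
    "y": "j",
    "r": "ɾ",
    "u": "ɯ",
}


def romaji_to_ipa(word: str) -> str:
    """Convert a romaji word to space-separated IPA symbols."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        # Try two-letter units (sh, ch, ts) before single letters.
        pair = word[i:i + 2]
        if pair in ROMAJI_TO_IPA:
            out.append(ROMAJI_TO_IPA[pair])
            i += 2
            continue
        ch = word[i]
        if ch.isalpha():
            out.append(ROMAJI_TO_IPA.get(ch, ch))
        # Spaces, apostrophes and other punctuation are skipped.
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(romaji_to_ipa("samurai"))
```

The output is space-separated so it can be fed to a model in the same 's a m ɯ ɾ a i' form mentioned earlier in the thread.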
>>9121 >>9122 !! Thanks so much Anon!
>>9121 Nice, thanks! I have a 'new' machine (well, old but still much better than my old notebook) pieced together that has an i3 and an Nvidia GT430 (or possibly a 750ti). Not too impressive I know, but I could use it to take a shot at setting up clipchan again. Mind giving me specific set up advice Anon? Like the OS to use, Python version to use, etc., etc. The more specific, the better. TIA.
>>9125 2 GB might not be enough to train FastPitch but you might squeeze by with gradient checkpointing and gradient accumulation to reduce memory usage. A 1 GB card will certainly be too little since the model parameters are 512MB and you need at least twice that to also store the gradient. If it doesn't work you could shrink the parameters down and train a lower quality model from scratch. However, it seems the GT430 supports CUDA 2.1 and the 750Ti supports 5.0. 2.x capability was removed in CUDA 9 and 6.x removed in CUDA 10. If you're lucky you might be able to get them to still work by compiling PyTorch and Tensorflow with the CUDA version you need, but I wouldn't bet on it. I'd recommend using at least Python 3.7 and Tensorflow==2.3.0 since Spleeter requires that specific version. If someone has a newer GPU with at least 6 GB they'll have to download a Tensorflow 2.3.0 wheel with CUDA 11.1 because Tensorflow only supported 10.1 until version 2.4. A Tensorflow 2.3.0 + CUDA 11.1 wheel for Python 3.8 is available here: https://github.com/davidenunes/tensorflow-wheels Again this is only necessary if you have a newer GPU with at least 6 GB. Spleeter will run fine on the CPU. I use Python 3.7 and don't feel like compiling a Tensorflow 2.3.0 wheel for it so I just modified Spleeter's requirements.txt to support tensorflow==2.4.0 and numpy>1.16.0,numpy<=1.19.5 and installed it from source. Spleeter will still work and output the clean voice clips but crash after finishing. This error can be patched by commenting out the del function in spleeter/separator.py:135 since Tensorflow 2.4 closes the session automatically. I'm using PyTorch 1.8.0 with CUDA 11.1 since it supports the full capabilities of my GPU. To use either PyTorch or Tensorflow easily you'll need at least a 4th generation i3 which has AVX2 support. Otherwise you'll have to look for community pip wheels compiled without AVX/AVX2 for your specific version of Python. 
Older versions of PyTorch are compatible with most deep learning models but lack the newer torchaudio which is an amazing library for processing and training on audio that will certainly start seeing some use soon. Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight.
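The memory argument above is just arithmetic, so here is a back-of-envelope sketch. It uses the 512 MB parameter figure from the post and counts one gradient copy. The `optimizer_states` argument is an assumption added for illustration (an Adam-style optimizer keeps two extra values per parameter; the post does not say which optimizer is used). Activations are not counted, so the result is only a lower bound.

```python
def training_lower_bound_mb(param_mb: float, optimizer_states: int = 0) -> float:
    """Weights + one gradient copy + optional per-parameter optimizer states."""
    return param_mb * (2 + optimizer_states)


def headroom_mb(card_mb: float, param_mb: float, optimizer_states: int = 0) -> float:
    """Memory left over for activations after the fixed costs."""
    return card_mb - training_lower_bound_mb(param_mb, optimizer_states)


if __name__ == "__main__":
    for card in (1024, 2048, 6144):
        plain = headroom_mb(card, 512)
        adam_like = headroom_mb(card, 512, optimizer_states=2)
        print(f"{card} MB card: {plain} MB left (no optimizer state), "
              f"{adam_like} MB left (two states per parameter)")
```

A 1 GB card has nothing left once weights and gradients are stored, and a 2 GB card gets tight as soon as the optimizer keeps its own state, which is why gradient checkpointing and gradient accumulation (smaller batches per step) are the first things to reach for.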
>>9135 Great, thanks for the specific details. Hmmm, from what you're saying it sounds like I still won't be able to run it even with the better (by comparison) hardware. I'm sure someday we'll be able to do this stuff on much more modest hardware. I'll just be patient and focus on other things till then. :^) >Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight. Don't rush on my account Anon. Maybe it will help others though.
Due to the 'rona my workspace is limited to a shitty laptop. What's the best (or rather, least worst) model one could conceivably run on a CPU?
>>9136
It works fine on better hardware. The problem is backwards compatibility is a foreign concept to Tensorflow so people end up locking their projects to old versions and creating a hellish nightmare of dependency conflicts. Also short clips don't take up too much memory. Only when processing 10 minute songs does it use up to 6 GB. To avoid this Clipchan processes each clip individually.
And Clipchan has been updated now to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, besides making it simpler to use and ideally creating a GUI for it. The most important options are -dza which cleans audio with Spleeter, speeds up subtitle processing, and auto-crops the audio clips. For Tacotron2 -r 22050 -c 1 are also needed to resample and mix stereo to mono (they require the -a option to have any effect right now.) If you don't have FFmpeg with libzvbi, then omit the -z option.
And some fresh Natsumi Moe voice clips from v0.3 ready for Tacotron2: https://files.catbox.moe/ipt13l.xz
Still a work in progress but there's about 10 minutes of usable audio there.
>>9149
Not sure, you won't get much of a speed up running FastPitch on CPU compared to Tacotron2. It's possible for fine-tuned models to be pruned and compressed down so they can run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compressing doesn't apply to training though, only works with inference.
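If you already have clips and only need the Tacotron2 format (22050 Hz, mono), the same resample and mixdown can be done by calling FFmpeg directly. This is a sketch of that one step, not Clipchan's implementation. It assumes `ffmpeg` is on your PATH; `-ar` and `-ac` are FFmpeg's sample rate and channel count options, and the file names are placeholders.

```python
import subprocess
from typing import List


def ffmpeg_resample_cmd(src: str, dst: str, rate: int = 22050, channels: int = 1) -> List[str]:
    """Build the FFmpeg argument list for resampling and downmixing one clip."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", src,            # input clip
        "-ar", str(rate),     # output sample rate in Hz
        "-ac", str(channels), # output channel count (1 = mono)
        dst,
    ]


def resample(src: str, dst: str, rate: int = 22050, channels: int = 1) -> None:
    """Run FFmpeg and raise CalledProcessError if it fails."""
    subprocess.run(ffmpeg_resample_cmd(src, dst, rate, channels), check=True)


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(" ".join(ffmpeg_resample_cmd("clip_0001.wav", "clip_0001_22k.wav")))
```

Passing the arguments as a list rather than a shell string means clip names with spaces or quotes in them don't need escaping.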
>>9150 Thanks for the Natsumi Moe clips Anon! A cute. I hope someday we can manage a basic Chii library to generate voices from. Sounds like certain interests are specifically trying to keep their tools from working with older systems -- even their own systems haha. Doesn't sound like I (and thousands more like me) will ever be able to use this tool at that rate. Maybe if someone creates some Docker or other kind of containers that are tuned for different hardware setups then we might be able to break free of this intentionally-created ratrace they intend us to run.
>>9121
>Cloning into 'fastpitch.github.io'...
>fatal: repository 'https://fastpitch.github.io/' not found
>Cloning into 'hifisinger'...
>fatal: repository 'https://speechresearch.github.io/hifisinger/' not found
Am I starting to get paranoid or are they really onto us?
>>9162 Those are demo pages, not repositories. FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch HiFiSinger (unofficial implementation): https://github.com/CODEJIN/HiFiSinger
>>9162 Those aren't git repositories Anon. Browse there and read the pages.
>>9163
>FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
Hmm, the only way I can actually get a clone to work is by going up in the tree a bit?
git clone --recursive https://github.com/NVIDIA/DeepLearningExamples.git
>>9165
Git 2.25.0 includes a new experimental sparse-checkout command:
git clone --filter=blob:none --sparse https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
git sparse-checkout init --cone
git sparse-checkout set PyTorch/SpeechSynthesis/FastPitch
>>9150 Yeah, I'm not going to do any training, just inference from released checkpoints. I did manage to get a FastSpeech2 model running with some pretty good results, although for some reason it adds garbled echoes after the generated speech.
>>9179 Ahh, didn't know about that one, thanks Anon.
Open file (156.72 KB 555x419 overview.webm)
Open file (50.42 KB 445x554 conformer.png)
A novel voice converter that outperforms FastSpeech2 and generates speech faster. Although it doesn't do speech synthesis from text it introduced a convolution-augmented Transformer that could easily be adapted into FastSpeech2 and FastPitch to improve the quality of synthesized speech. https://kan-bayashi.github.io/NonARSeq2SeqVC/
>>10159 Quality sounds excellent. Thanks Anon.
Hey, I am looking for the dev that did this work: https://gitlab.com/robowaifudev/waifusynth I am working on something similar, except with HiFi-GAN. I am looking for a collaborator on my project; it is explained more here: >>10377 The gist of it is, I am building a desktop wallpaper you can chat with. More on my thread.
>>9121 >>10383 >robowaifudev
