/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.






Speech Synthesis general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us, right?

en.wikipedia.org/wiki/Speech_synthesis
https://archive.is/xxMI4

research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html
https://archive.is/nQ6yt

The Tacotron project:

arxiv.org/abs/1703.10135
google.github.io/tacotron/
https://archive.is/PzKZd

No code available yet, hopefully they will release it.

github.com/google/tacotron/tree/master/demos
https://archive.is/gfKpg
>>9136 It works fine on better hardware. The problem is that backwards compatibility is a foreign concept to Tensorflow, so people end up locking their projects to old versions and creating a hellish nightmare of dependency conflicts. Also, short clips don't take up too much memory; only when processing 10-minute songs does it use up to 6 GB. To avoid this, Clipchan processes each clip individually.

Clipchan has been updated to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, besides making it simpler to use and ideally creating a GUI for it. The most important options are -dza, which cleans audio with Spleeter, speeds up subtitle processing, and auto-crops the audio clips. For Tacotron2, -r 22050 -c 1 are also needed to resample and mix stereo down to mono (they require the -a option to have any effect right now). If you don't have FFmpeg with libzvbi, omit the -z option.

Some fresh Natsumi Moe voice clips from v0.3, ready for Tacotron2: https://files.catbox.moe/ipt13l.xz
Still a work in progress, but there's about 10 minutes of usable audio there.

>>9149 Not sure, you won't get much of a speedup running FastPitch on CPU compared to Tacotron2. It's possible for fine-tuned models to be pruned and compressed down so they can run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compressing doesn't apply to training though, only to inference.
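If you're curious, the resample/mono step boils down to an FFmpeg call like this. A simplified Python sketch, not the actual Clipchan code, assuming ffmpeg is on your PATH:

import subprocess

def to_tacotron2_format(src, dst, rate=22050, channels=1):
    # -ar sets the sample rate, -ac mixes down to the given channel count
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", str(channels), dst],
        check=True,
    )

# e.g. to_tacotron2_format("clip_0001.wav", "clip_0001_22k_mono.wav")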
>>9150 Thanks for the Natsumi Moe clips Anon! A cute. I hope someday we can manage a basic Chii library to generate voices from. Sounds like certain interests are specifically trying to keep their tools from working with older systems -- even their own systems haha. Doesn't sound like I (and thousands more like me) will ever be able to use this tool at that rate. Maybe if someone creates some Docker or other kind of containers tuned for different hardware setups, then we might be able to break free of this intentionally-created ratrace they intend us to run.
>>9121
>Cloning into 'fastpitch.github.io'...
>fatal: repository 'https://fastpitch.github.io/' not found
>Cloning into 'hifisinger'...
>fatal: repository 'https://speechresearch.github.io/hifisinger/' not found
Am I starting to get paranoid, or are they really onto us?
>>9162 Those are demo pages, not repositories. FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch HiFiSinger (unofficial implementation): https://github.com/CODEJIN/HiFiSinger
>>9162 Those aren't git repositories Anon. Browse there and read the pages.
>>9163
>FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
Hmm, the only way I can actually get a clone to work is by going up in the tree a bit?
git clone --recursive https://github.com/NVIDIA/DeepLearningExamples.git
>>9165 Git 2.25.0 includes a new experimental sparse-checkout command:
git clone --filter=blob:none --sparse https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
git sparse-checkout init --cone
git sparse-checkout add PyTorch/SpeechSynthesis/FastPitch
>>9150 Yeah, I'm not going to do any training, just inference from released checkpoints. I did manage to get a FastSpeech2 model running with some pretty good results, although for some reason it adds garbled echoes after the generated speech.
>>9179 Ahh, didn't know about that one, thanks Anon.
Open file (156.72 KB 555x419 overview.webm)
Open file (50.42 KB 445x554 conformer.png)
A novel voice converter that outperforms FastSpeech2 and generates speech faster. Although it doesn't do speech synthesis from text, it introduces a convolution-augmented Transformer (Conformer) that could easily be adapted into FastSpeech2 and FastPitch to improve the quality of synthesized speech. https://kan-bayashi.github.io/NonARSeq2SeqVC/
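For anyone who wants to experiment with adapting it, here's a rough PyTorch sketch of a Conformer-style block: half-step feed-forwards sandwiching self-attention and a depthwise-convolution module. The dimensions and layer choices are placeholders and it skips the relative positional attention the paper uses, so treat it as a starting point rather than a faithful reimplementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Depthwise-convolution module with a residual connection."""
    def __init__(self, dim, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)  # expanded, then halved by GLU
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)        # conv layers expect (batch, dim, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.bn(self.depthwise(y)))  # silu == swish
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, dim=256, heads=4, ff_mult=4, dropout=0.1):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                 nn.SiLU(), nn.Dropout(dropout),
                                 nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, dropout=dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                        # residual applied inside
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# quick shape check: ConformerBlock()(torch.randn(2, 100, 256)).shape == (2, 100, 256)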
>>10159 Quality sounds excellent. Thanks Anon.
Hey, I'm looking for the dev that did this work: https://gitlab.com/robowaifudev/waifusynth
I'm working on something similar, except with HiFi-GAN, and I'm looking for a collaborator on my project. It's explained more here: >>10377
The gist of it is, I'm building a desktop wallpaper you can chat with; there's more on my thread.
>>9121 >>10383 >robowaifudev
Facebook made a great speech generator about a year ago: https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/ - It's not free software, but they described how it is built. Yannic Kilcher goes through the system and explains it here: https://www.youtube.com/watch?v=XvDzZwoQFcU One interesting feature is that it runs on a 4-core CPU (not the training, of course). On such a CPU it is faster than real-time, which means generating the audio takes less time than playing it would. Something like this might be very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.
>>10393
>Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.
It certainly would, if we can somehow obtain access to it or reproduce it, Anon. Thanks for the heads-up, and for the video link. It really helps to get the points across well for us.
youtube-dl --write-description --write-auto-sub --sub-lang="en" https://www.youtube.com/watch?v=XvDzZwoQFcU
Open file (111.71 KB 286x286 speechbrain.png)
Not sure if this has been posted before, but I came across this and immediately thought of some of the to-do list for Clipchan. https://speechbrain.github.io/index.html Seems like there was some discussion about emotion and speaker ID classifiers.
>>10458 Very cool Anon, thanks. It looks like it's a solid and open source system too, AFAICT.
The model link is dead. While I could train a new model, I'm looking to avoid that step right now because of other deadlines. I would love to include 2B in WaifuEngine though, so would anyone be willing to mirror the model or provide an updated link? Thanks
>>10499
ATTENTION ROBOWAIFUDEV
I'm pretty sure the model in question is your pre-trained one for 2B's WaifuSynth voice, i.e., https://anonfiles.com/Hbe661i3p0/2b_v1_pt
>via https://gitlab.com/robowaifudev/waifusynth
cf. (>>10498, >>10502)
>>10504 Links are both dead
>>10504 To clarify, the pretrained model links are both dead; the repo is still up.
Open file (14.93 KB 480x360 hqdefault.jpg)
Great! Now my waifu can sing me a lullaby so I can sleep well. The only problem is that I don't have the Vocaloid editor. Video demonstration: https://youtu.be/mxqcCDOzUpk Github: https://github.com/vanstorm9/AI-Vocaloid-Kit-V2
Open file (548.83 KB 720x540 lime_face_joy.png)
>>5521
>Cute robowaifu
Check
>Inspiring message to all weebineers everywhere
Check
>Epic music
Check
Best propaganda campaign. 10/10, would build robowaifu.
>>5529 >>5530
Damn it lads! You're bringing me closer to starting to sample Lime's VA heheheh (although I was hoping to use my voice to generate a somewhat convincing robowaifu, so as to minimise reliance on females).
>>11229 Forgot to add.
>>5532
>I don't know if that'll be enough. Chii didn't really talk much.
You're overcomplicating it. I think he meant create a TTS that outputs "Chii" regardless of what you put in ;) (although you could add different tonality and accents, which might be a more fun challenge).
>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file This model could be improved by training it without the pain sound effects. There are so many of them that they biased the model, which sometimes causes strange results when sentences start with A or H.
>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.
>>11474 Come join my doxcord server if you have time and PM me! Thanks for the model; you will likely see it used on the 2B "cosplay" waifu we may have in the game.
>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise.) This is what Replica Studios has been doing: https://replicastudios.com/
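A minimal sketch of the idea (the condition names and sizes here are placeholders, not what Replica actually uses): each extra condition gets its own embedding table and is summed into the token embeddings before the encoder.

import torch
import torch.nn as nn

class ConditionedInput(nn.Module):
    """Token embeddings plus learned embeddings for per-utterance conditions."""
    def __init__(self, vocab=148, dim=512, n_speakers=8, n_emotions=6):
        super().__init__()
        self.tokens = nn.Embedding(vocab, dim)
        self.speaker = nn.Embedding(n_speakers, dim)
        self.emotion = nn.Embedding(n_emotions, dim)

    def forward(self, text_ids, speaker_id, emotion_id):
        # text_ids: (batch, time); speaker_id, emotion_id: (batch,)
        x = self.tokens(text_ids)
        # broadcast the per-utterance conditions over every time step
        x = x + self.speaker(speaker_id)[:, None, :] + self.emotion(emotion_id)[:, None, :]
        return x  # goes into the encoder of Tacotron2 / FastPitch / etc.

# ids = torch.randint(0, 148, (2, 30))
# ConditionedInput()(ids, torch.tensor([0, 3]), torch.tensor([1, 5])).shape  # (2, 30, 512)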
>>11522 If you're interested, I'm looking for someone to take over the speech synthesis part of WaifuEngine. I got it working; however, working on it as a specialty takes me away from the rest of the application. For example, I want to train a new model using Glow-TTS but my time is limited, and I also have to work on the various other aspects of the project to get it off the ground. Right now our inference time using Tacotron2 isn't great unless you have a GPU. As for compensation on the project, so far I have been giving away coffee money as we have little resources haha; if the project gets bigger and more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4
>>11550 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU, and even on the CPU it's 2x faster than Tacotron2 is on the GPU. It takes about a week to train on a toaster card, and it already has stuff for detecting and changing pitch and speed, which is essential for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/
>>11550 I'd message you on Discord about this, but it could be useful info for the board. Essentially, I did use FastPitch originally. The issue is the teacher-student training methodology: you have to use Tacotron to bootstrap and predict durations for alignment. When you don't do that and just fine-tune FastPitch's LJS model, it fails to predict the durations. We can definitely try this method, I'm open to it; I guess in my time crunch I didn't bother. I'm optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future; it would be a one-Python-script change based on my architecture.
>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2
Some other ideas I'd like to R&D for voice synthesis in the future:
- anti-aliasing ReLUs or replacing them with swish
- adding gated linear units
- replacing the convolution layers with deeper residual layers
- trying a 2-layer LSTM in Tacotron2
- adding ReZero to the FastPitch transformers so they can be deeper and train faster (rough sketch below)
- training with different hyperparameters to improve the quality
- using RL and human feedback to improve the quality
- using GANs to refine output like HiFiSinger
- outputting at a higher resolution and downsampling
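The ReZero one is simple enough to show. A sketch of a ReZero-style transformer layer (not FastPitch's actual code): every residual branch is scaled by a learnable scalar initialized to zero, so each layer starts as an identity map and deeper stacks stay stable without LayerNorm.

import torch
import torch.nn as nn

class ReZeroTransformerLayer(nn.Module):
    def __init__(self, dim=384, heads=2, ff_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(ff_mult * dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # the ReZero residual gate, starts at 0

    def forward(self, x):                          # x: (batch, time, dim)
        x = x + self.alpha * self.attn(x, x, x, need_weights=False)[0]
        x = x + self.alpha * self.ff(x)
        return x

# ReZeroTransformerLayer()(torch.randn(2, 50, 384)).shape  # (2, 50, 384)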
>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages?
>Der Nordwind und die Sonne - German with American English accent
I can assure you: it doesn't work. Americans speaking German often (always) sound bad, but this is a level of its own. Absolutely bizarre.
>>11571 Yeah, I live around Chinese speakers with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. Where IPA is useful for generating waifu voices is that it improves pronunciation, reduces the training data needed, and solves the problem of heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess its pronunciation incorrectly, but with IPA the pronunciation is always spelled out, so the model can speak the word fluently without ever having seen it before. Also, in a multi-speaker model you can blend between speaker embeddings to create a new voice, and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left under the user's control to make a character voice sound more American, British or Japanese and so on.
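The blending part is literally just interpolating between speaker embedding vectors before they go into the model. A toy sketch with a placeholder embedding table:

import torch
import torch.nn as nn

# stand-in for a trained multi-speaker model's speaker embedding table
speaker_table = nn.Embedding(10, 256)

def blend_speakers(id_a, id_b, t=0.5):
    """Linear interpolation between two speaker embeddings (t=0 -> A, t=1 -> B)."""
    a = speaker_table(torch.tensor([id_a]))
    b = speaker_table(torch.tensor([id_b]))
    return (1 - t) * a + t * b  # use in place of a single speaker's embedding

# an "accent direction" found in latent space would be applied the same way:
# new_voice = speaker_table(torch.tensor([3])) + 0.4 * accent_direction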
>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regards to this: in English, foreign names are often changed in pronunciation because the name would otherwise sound "strange". The philosopher Kant would sound like the c-word for female private parts, so they pronounce it "Kaant". I wonder if the method helps with that as well.
>>11582 In that case it depends on what language you transliterate with. If necessary, names could be transliterated as they're supposed to be pronounced in their original language, or it could all be done in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
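Something like this is all it would take on the text side; phonemize_default is a stand-in for whatever G2P backend does the normal transliteration, and the IPA strings are only illustrative:

# user-defined exceptions checked before the default grapheme-to-phoneme step
OVERRIDES = {
    "manga": "maŋɡa",  # Japanese-style pronunciation instead of the anglicized one
    "kant": "kant",    # keep the German vowel
}

def phonemize_default(word):
    # stand-in for the real G2P backend, which would return IPA for arbitrary words
    return word

def to_ipa(text):
    return " ".join(OVERRIDES.get(w, phonemize_default(w)) for w in text.lower().split())

# to_ipa("I read manga")  ->  "i read maŋɡa" with this toy fallback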
Open file (18.23 KB 575x368 preview.png)
Open file (62.21 KB 912x423 aegisub.png)
Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub
>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.
>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.
>>11623 Okay, thanks. This could be useful for more than that, I guess. Maybe later for training the system on lip reading using YouTube, for example, or for training voice recognition in the first place. How much data do we need to emulate a particular voice?
>>11625 OK, thanks for the advice. I'll try and see if I can set it up in a VirtualBox instead or something. Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency-hell issue; I can even get it to build right up to link time, and then it fails.
>>11631 Finetuning a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multispeaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.
Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to test several vocoder alternatives to WaveGlow and arrive at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of 12x, which means it can synthesize 12 seconds of speech every second, or about 80 ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250 ms on CPU to generate from 2B's Tacotron2 mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end, and then the whole internet and their waifus are getting real-time waifus.
Multi-band MelGAN repo: https://github.com/rishikksh20/melgan
Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106
Original MelGAN paper: https://arxiv.org/abs/1910.06711
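Back-of-envelope version of how those numbers combine (pure arithmetic, ignoring glue overhead, which is why the measured figure comes out a bit lower than the naive estimate):

# RTF = seconds of audio generated per second of wall-clock time
rtf_fastpitch = 40.0  # text -> mel spectrogram
rtf_melgan = 20.0     # mel spectrogram -> waveform

# the stages run back to back, so their per-second costs add
seconds_per_audio_second = 1 / rtf_fastpitch + 1 / rtf_melgan  # 0.075 s
combined_rtf = 1 / seconds_per_audio_second                    # ~13.3x before overhead

print(f"~{combined_rtf:.1f}x real time, ~{1000 * seconds_per_audio_second:.0f} ms per second of speech")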
>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices are being 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress.
>>11647 Thanks for your work. I thought voice generation would take much more time. Good to know, since responses to someone talking should be fast.
Open file (153.25 KB 710x710 gawr kilcher.jpg)
>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU.
>Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress.
This definitely deserves more thought. If every person on the internet is able to do speech synthesis and there's a tsunami of voice-cloning characters, it's important people have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand, if people only straight up copy voices, it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though, the accelerator is stuck to the floor. In the meantime, while the hype builds, iteration can continue until the synthesis of Gawr Kilcher is realized. When people look closely though, they'll notice it's neither Yannic nor Gura but actually Rimuru and Stunk all along.
>>11647 Thanks for the information, Anon.
>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?
>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.
