/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

LynxChan updated to 2.5.7, let me know whether there are any issues (admin at j dot w).

Reports of my death have been greatly overestimiste.

Still trying to get done with some IRL work, but should be able to update some stuff soon.

#WEALWAYSWIN


Welcome to /robowaifu/, the exotic AI tavern where intrepid adventurers gather to swap loot & old war stories...

Speech Synthesis general Robowaifu Technician 09/13/2019 (Fri) 11:25:07 No.199
We want our robowaifus to speak to us right?

en.wikipedia.org/wiki/Speech_synthesis
https://archive.is/xxMI4

research.spa.aalto.fi/publications/theses/lemmetty_mst/contents.html
https://archive.is/nQ6yt

The Taco Tron project:

arxiv.org/abs/1703.10135
https://archive.is/PzKZd

No code available yet, hopefully they will release it.

https://archive.is/gfKpg
>>199
This has chatter on HN right now.
news.ycombinator.com/item?id=13992454
>>199
Inmoov project is worth checking out, I believe they already have code available for voice recognition, text to speech and a bunch of other stuff. The only thing the bot is missing is legs because they're still trying to figure out an affordable way to make them.

inmoov.fr/

https://www.invidio.us/watch?v=2sZOyCBbows
Edited last time by Chobitsu on 09/19/2019 (Thu) 12:29:40.
>>489
Thanks for the tip anon. If I find some good code or other cyber assets I find valuable I'll link them back here at some point.
>>489
That's a pretty cool robot anon.
>not using espeak
>>497
>Do you think that is healthy?
I lost it.
>>199
There seems to be a project to make good My Little Pony synthesized voices.
https://clyp.it/r0yypquc?token=e11965be1b6dce146eb61702006c285e
https://mlpol.net/mlpol/res/249436.html#249436
Their technology seems sound and the voice is good. They have assembled the files and resources for us to use and to train the talking bot so we can probably use the same technology to synthesize more AI voices. If any of you guys would like to put Twilight Sparkle in your wAIfu then this is a fantastic development. Even if you don't like MLP these resources can turn a library of voice lines such as in an anime or tv show into a synthesized voice which is pretty cool. Put your waifu into a wAIfu.
>>1563
Thanks for the tip anon, I'll try to check it out.
>>1563
>Cool. Good to know that soon we can make Sweetie Belle recite the 14 words.
I'm OK w/ this tbh.
>>1563
>spend 30+ minutes going through the cuckchan thread b/c AI work
welp, i have to say, love mlp fags or hate them, the level of autism on display in their 'pony preservation project' is impressive.
>>1563
Ponies and cuckchan aside, I am impressed. I'll have to read about how intensive the training is. I'm very interested in trying it out for myself. I wonder what would happen if you tried using SHODAN's voice lines.

There are only about 26 minutes worth of audio from SS2. Does anyone know if that's sufficient for training, or is more needed?

>>1570
I'll amend this post by saying that 26 minutes of audio is probably not sufficient. It sounds like there should be at least several hours for the best results. I think a better approach would be to train a neural network using voice clips from someone who sounds similar to SHODAN's original voice actress. The next step would be to create a program that takes voice audio and adds the distinctive audio "glitches" of SHODAN's voice. Then the voice clips generated by the NN could be fed through this program to "SHODANify" it. There might already be ways to do this quite easily with audio editing programs, I'm only thinking of creating an automated way to do it.
>>1571
>I'm only thinking of creating an automated way to do it.
Sounds like an interesting project idea. I'd imagine the original audio engineers for the game layered effects and filters in a traditional way. Figuring out both how to 'reverse engineer' the effect as well as how to automate it for the general case seems like an intricate process. Any plans to pursue this beyond conception stage?
>>1572
Right now, no. I do have an idea of exactly what kind of things I or someone else who's interested would need to do. Unfortunately, my daily schedule and life in general makes it difficult for me to make time for this kind of project. (College student with a job and family, friends, other hobbies, etc.) normalfag-tier I know
However, I'll say this:
the more I think about this and type out my ideas, the more practical I think it is. It's just a matter of investing some time and effort.

I know that Audacity supports python scripting for automating stuff, but I would have to learn about using it to apply the actual effects. If I can't manually create a SHODAN-like audio clip using Audacity, I won't understand enough to automate the process. I already have a general idea of what kind of effects are needed (shifting pitch, timbre, layering with delay, stuttering, etc.) and listening to the audio clips from the game will help me refine the process. Also, depending on what kind of result I can get with Audacity, I may want to consider other audio editing programs. Audacity is just the one that comes to mind, being FOSS.

Once I understand what kind of effects actually go into the audio, and how to apply them to get the best result, then I can start to play around with a Python script. Fortunately I have experience with Python, and I think this would be fairly straightforward. I'd have to read Audacity's documentation on scripting, which may or may not be sparse. Another tricky part to this is applying the audio effects in a somewhat random way so that there's some variation to the resulting clip and so that multiple clips don't all sound alike. I think there should be some underlying logic to how the effects are applied, but it might take me some time to puzzle out the best strategy (and of course I could probably always find some way to improve upon it).
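A rough sketch of what the "layering with delay" and "stuttering" parts could look like in plain Python, with the randomness driven by a seed so a clip can be reproduced. This is not how the original game audio was made and it doesn't use Audacity's scripting at all; the function names, the parameter ranges and the assumption that the clip is a list of float samples in [-1, 1] are all made up for illustration. Pitch and timbre shifting are left out.

```python
import random

def add_delay(samples, delay, gain=0.5):
    """Layer a delayed, quieter copy of the signal over itself."""
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += gain * s
    return out

def stutter(samples, start, length, repeats):
    """Repeat the slice samples[start:start+length] `repeats` extra times."""
    chunk = samples[start:start + length]
    return samples[:start + length] + chunk * repeats + samples[start + length:]

def glitchify(samples, rate=22050, seed=None):
    """Apply one random stutter and one random echo (hypothetical ranges)."""
    rng = random.Random(seed)
    out = list(samples)
    # Stutter a random chunk of roughly 20-100 ms, one to three extra times.
    length = min(len(out), rng.randint(rate // 50, rate // 10))
    start = rng.randint(0, len(out) - length)
    out = stutter(out, start, length, rng.randint(1, 3))
    # Layer an echo delayed by roughly 20-60 ms at reduced volume.
    out = add_delay(out, rng.randint(rate // 50, rate // 16),
                    gain=rng.uniform(0.3, 0.6))
    # Scale down only if the layering pushed the signal past full scale.
    peak = max((abs(s) for s in out), default=0.0)
    return [s / peak for s in out] if peak > 1.0 else out
```

Calling glitchify(clip, seed=1) twice gives the same result, while different seeds give different variations of the same clip, which is the "somewhat random" behaviour described above.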

Getting audio clips to pass through the script would be fairly trivial, I think. For starters, I would probably just use a pre-trained NN, or train one on my own using an available dataset. In a perfect world, we could use a NN trained with Terry Brosius' voice. However, I don't think there's very much audio available. She's done voice acting for a variety of games, but I believe many hours worth of audio are needed in order to effectively train a NN. Unless she happens to record a book on tape someday, I doubt that this will be possible/practical.

Question/request to any anon who's familiar with audio editing and manipulation, whether with Audacity or a different program:
Can you "SHODAN-ify" an audio clip by hand? And if so, can you teach me how?
If not, maybe you can point me towards some good resources to help me.
Also, any advice on creating scripts for Audacity (or a different program) would be welcome.

Question for anons who are more familiar with NN's for speech synthesis:
Would you reckon that we could train a NN with Terry Brosius' lines from other videogames, not just SS1/2?
If there's enough audio and interest in this idea, it would be a great benefit to have multiple anons working to transcribe and clip audio from different games. However, I wouldn't worry about this until after either me or some other anon can get a working SHODAN-ify script created.
>>1571
You only need 5 seconds of arbitrary audio to get pretty good results with the SV2TTS framework. It uses an encoder trained to perform speaker verification to create a speaker embedding vector from a short audio sample. Then a seq2seq network creates a mel spectrogram from some text and the speaker embedding, and a wavenet turns the mel spectrogram into an audio waveform. The encoder in the pretrained model was trained on noisy speech, but I don't know how well it would work with a reference as heavily distorted as SHODAN's voice.

GitHub: https://github.com/CorentinJ/Real-Time-Voice-Cloning
Paper: attached pdf
>>1582
>5 seconds
>quality results
well shit negro we're in business
I have a feeling that training directly with the distorted SHODAN audio will not work (although it might be worth trying), so instead I floated the idea of using Terry Brosius' regular voice. There's probably more than enough audio from other characters that she has voiced, based on this information.

pretty sure everything there has been discussed here already except this:
https://paintschainer.preferred.tech/index_en.html
>>1701
example 'AI' painting it did for me
>>1582
Was just gonna repost Tacotron. It's a really amazing voice synthesizer. It'd be interesting to see what it's capable of adapted with transformers instead of outdated LSTMs.
https://www.youtube.com/watch?v=0sR1rU3gLzQ
Audio samples: https://google.github.io/tacotron/publications/tacotron2/
Paper: https://arxiv.org/pdf/1712.05884.pdf
GitHub: https://github.com/NVIDIA/tacotron2
>>2355
Also a newer paper from this year can convert any voice to many and improved on SOTA in any-to-any conversion. I can imagine this being used to train a synthesized voice on the more subtle and emotional nuances of speech.
Demo: https://dunbar12138.github.io/projectpage/Audiovisual/
Paper: https://arxiv.org/pdf/2001.04463.pdf
GitHub: https://github.com/dunbar12138/Audiovisual-Synthesis
thanks for the papers anon, i'll try to get through them sometime soon. :^)
does anyone have a good idea how many minutes of audio it should take to train a good text-to-speech model from scratch with current machine learning techniques? I found no dataset containing child or childlike speech. so far there seems to be no academic interest in compiling one, and i really don't think another soul on this planet is pathetic or degenerate enough to make one. so here I am with about 500 ~15 minute long videos ripped from some family's youtube channel. youtube's machine generated subtitles are surprisingly accurate so far, but this is still a really daunting task to label speakers, proofread, and format, and i'm not sure how much of this is needed to get the job done right. also this feels incredibly skeevy, but unless one of you has seen a dataset that has what i'm looking for, it's something i have to do, even if the ethics of duplicating a real living child's voice for my own purpose is dubious at best.
>>2499 You might try asking the Anons working on the Pony Preservation Project. They are likely to be a better source of information on this atm. >>1563
>>2499 People have achieved near human-quality voices with the voice cloning toolkit corpus. It consists of 110 speakers, mostly in their 20's, reading 400 sentences each. https://datashare.is.ed.ac.uk/handle/10283/3443 You're not gonna get very far only using one voice though. You can probably find some children's speech datasets on Kaggle.
>>2508 when i looked on kaggle i was unable to find anything of that nature. as far as using a single voice, i'm not intentionally using only a single voice, but when i was looking into this, waveglow (https://github.com/NVIDIA/waveglow) appeared to be achieving decent quality results using a single person's voice. if i can find good samples of multiple voices i'd be interested in all of my different options but as of right now i'm stuck using data i can put together on my own.
>>2517 >if i can find good samples of multiple voices he linked you a very good one already. and i directed you to a group working with an entire 200-episode-show cast's worth of professional voice actors, including girls. can't find anything 'good' in those anon?
>>2518 i didn't mean to disregard anon's advice to seek out the mlp group. i did appreciate the referral. i don't know yet if that fits the bill or if i need to keep looking, but i will be looking into his suggestion.
>>2519 Haha no worries Anon! I just wanted to point out there is already a lot of content between those. The one is over 10 GB of highly-tagged audio sources, and the other is a growing mountain of autistically-driven creation, much of which is remarkably good. Good luck Anon.
I like you.
>>2640 Thanks Anon. We like you, too. Just the way you are tbh.
>>1582 What a fucking garbage software, I tried to use Totala narrator voice to try reading a few paragraphs and the program shits the bed. Using smaller text samples doesn't even clone the narrator voice at all, what the fuck man.
>>4144
Yeah, it's a little dated and wasn't really a pro-grade project anyway.
>"...13/11/19: I'm now working full time and I will not maintain this repo anymore. To anyone who reads this:
>If you just want to clone your voice, do check our demo on Resemble.AI - it will give much better results than this repo and will not require a complex setup. The free plan already allows you to do a lot. We have a plugin for Unity as well for game developers.
>If, for some reason, you want to spend hours setting up a python environment to run crappy old tensorflow code that will give you very average results, be my guest. One year after my thesis, there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day."
Might try the recommendation Anon? Please let us know how it works for you if you do, thanks. Good luck.
https://www.resemble.ai/
>>4145 >resemble.ai >you need to add your E-Mail account so that their pajeet tech scammer can spam it Yeah lets not get in there, its not looking too pretty.
>>4242 Heh, they are obviously for-profit and want to monetize the customers. As I implied elsewhere, if you want to have your cake and eat it too, then we'll need to roll our own here at /robowaifu/. Better crack those books, Anon! :^)
I just wanted to mention that there's another thread about voices: https://julay.world/robowaifu/last/156.html Maybe these could be merged? I'd like voices to resemble actresses with good voices or a mix of different ones to avoid trouble. There's enough audio and scripts from subtitles available. Is training a NN on using Audacity the right way? It would first need a network which could tell us how similar two voices are, then we could try to get closer and closer. We also have Festival available as free software, so voices from there could be the starting point. Maybe first think of a way how to tell it how close voices are, some generated others not, then it can learn.
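The "tell us how similar two voices are" part is roughly what a speaker verification encoder like the one described in >>1582 is for: it turns a clip into an embedding vector, and two clips can then be compared by the angle between their vectors. A minimal sketch of just the comparison step, assuming the embeddings already exist as plain lists of floats (the vectors in the usage note are made-up numbers, not output from any real model):

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    1.0 means the vectors point the same way, 0.0 means they are
    orthogonal. Zero vectors are treated as having similarity 0.0.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]) comes out close to 1, while two unrelated directions score near 0. Whether a given score means "close enough" would have to be tuned against real voices.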
>>4333 >Maybe these could be merged? Good idea, and maybe I'll work out a reasonably convenient way to do that at some point. However, Lynxchan software doesn't have a 'merge these threads' button, and the manual effort to do so is rather tedious (as you might imagine). >Is training a NN on using Audacity the right way? Near as I can tell, those are 'apples and oranges' in comparison. Audacity is a tool for human editing of audio files primarily. NNs are an entirely different type of thing. And you have some good ideas. Honestly, I would suggest you look into the Pony's efforts in this area. They have achieved some impressive results already based on extracting VA from the show. >>1563 But I hope we can manage something like your ideas, that would be great.
There's a program MorphVox Pro which can alter a male voice to female. I have no intention to use my own voice for that, but maybe it could be useful to change generated voices from eg Festival. More importantly it shows what's possible. The vid is even from 2015. https://youtu.be/CpVwl-FEzl4 Via https://dollforum.com/forum/viewtopic.php?f=6&t=130302&sid=44113180fc656eb7aa41381a0ce12d02
>>4345 There is a merge thread feature on 2.4 tho.
>>4622 That's good news. As long as Robi utilizes that version for the reboot, I'll explore that feature and that idea. Now please allow me to mark individual posts and provide a 'move this post to thread X' (in batches of many posts ofc).
>>199 I just found this speech synthesizer programming tutorial: https://www.youtube.com/watch?v=Jcymn3RGkF4
>>4659 Interesting, but the end result is useless unless you want a bad voice with a Finnish accent. Don't fall for his trick at the beginning when he talks like his voice is the synthesizer. Also, I don't get it. Why would everyone create their own? I just need one program where I can put in the data. Did you try his software: https://github.com/bisqwit/speech_synth_series/tree/master/ep4-speechsyn Is it even reproducible, or just messy unreadable code? Where is it better than eSpeak or Festival?
>>4659 Thanks Anon, appreciated. I'll have a look at it sometime over the next few days.
>>4659 Has anyone ideas how to get phonemes from voices, without manually cutting them out of soundfiles? There seem to be some methods, but it's difficult and complex. Not even sure if this helps: https://youtu.be/x1IAPgvKUmM There are voices available for sale and free ones anyways, might be easier to change those. But what's the best way to do that? That would be something getting us forward. Here some introduction to work with sound and signal processing in Python. Not sure if I should learn that at some point, but I like his approach to teaching and learning by doing projects: https://youtu.be/0ALKGR0I5MA The available software gets better every year, but not for free and often it needs the cloud. However, even if we don't get anything done here, at least something will be available, bc others want this stuff as well. Then again, cloud based stuff is quite useless. EmVoice One: https://youtu.be/Da2DAjKzeaQ and UTAU, Vocaloid, SynthV, Eleanor Forte are mentioned in the comments. Newscaster, wow: https://youtu.be/wHP3J01aEns
The best TTS is AI-based. Check out the demos. Google has some. Amazon has some. You don't hear them, typically, though. There are issues with performance & cost. My opinion, wait for generally-available AI TTS. Someone mentioned espeak. espeak is like the near exact opposite, however, you can speed up espeak way faster than other systems. But what I want is good singing. If you've ever pirated Vocaloid, you know it sucks so bad, not just in terms of English results, but in terms of the interface.
>>4694 The idea about eSpeak was, to use it as a base with a female voice, then have some software to change the voice output to make it better. You can combine software, one program doesn't need to do it all.
Someone mentioned Waveglow here >>5461 and it sounds good. Though the alternatives on their site https://nv-adlr.github.io/WaveGlow sound good as well. I wonder why all of these sound better than Hanson Robotics' Sophia in her recent videos. Maybe because Sophia's speech is generated live at the time she's talking. "In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable." https://nv-adlr.github.io/WaveGlow Can't upload the paper. There might be a block on uploads because of some spammer.
>>5467 No, Anon, you are the robots. https://streamable.com/u1ulrp >=== What hath God wrought? Use your robowaifu-powers only for good Anon. :^)
Edited last time by Chobitsu on 10/06/2020 (Tue) 21:17:15.
>>5474 Nice work Anon. Mind telling us how you did it?
>>5474 I hope this will be up to your standards, Anon. >
>>5474 I haven't looked much into the documentation and stuff, but how easy or hard is it to get it to use another voice pack and it sounding just as natural? For example, I've found some local vtubers who know how to talk like anime waifus in English. I figured I could bribe some of them to donate some voiced lines which can then be used as training data so that we can have some cutesy voices. It's cheaper than trying to figure out how to contact JP voice actresses who actually sound terrible in English. So my question is if it's possible to compile a list of the minimum lines a voice actress will need to get enough data to have an AI adopt her voice? I used to be active in game development so I think I am experienced enough in giving contracts and royalties etc. (even though my actual games were financial failures).
>>5480 Sounds amazing and absolutely sufficient. Some little indicators that she isn't human are even very welcome. Well, of course I'd like to use another voice, something cuter, younger, girlier, or more like Cameron. Also, how useful it is depends on how fast at least single words or short phrases can be created, and on what kind of hardware. I could imagine having a lot of phrases and sentences stored on an SSD and only fill in the blanks. Then maybe add another modulation system to the output (vid related).
>>5485 Haha, thanks but I'm just a humble webm encoder tending my wares. The credit is due to the actual author. >>5474 But yes, you and this Anon >>5484 have some good ideas for enhancements. I hope we can manage something good with synthesized voices, and something entirely open too.
>>5475
It's just WaveGlow out of the box. I wrote a quick Python script to sample it. You can get the code here: https://gitlab.com/kokubunji/say-something-godot-chan It requires an Nvidia GPU with CUDA cuDNN though and the dependencies are ridiculous, not to mention the 700 MB in models that must be downloaded.
>>5484
The pretrained model's dataset is 24 hours of 13100 audio clips. I haven't tried training it yet but I can't imagine that amount of data is necessary to finetune the model. I've finetuned GPT2 on my CPU with a few books and it completely changed the output.
>>5485
It's very fast. Even on my toaster GPU it generates about 16 words per second or 1000 words a minute. You could generate a 90,000 word audiobook in about an hour and a half.
>>5486
My frustration is how inaccessible and convoluted these models are. They can't be deployed to consumers, but I got some ideas for a lightweight and expressive speech synthesis that can be run on a CPU. If my voice box simulation is successful I should be able to use the data generated to create a synthesized voice with complete vocal expression. It's really unbelievable the direction research is heading, they just find more ways to throw money at a problem rather than better data. In a few years we might have barely any new AI at all, except what is made available to us through an API to select individuals.
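Quick sanity check of the throughput figures quoted above, nothing more than arithmetic. The only inputs are the 16 words per second and the 90,000 word audiobook from the post; the variable names are mine.

```python
words_per_second = 16
words_per_minute = words_per_second * 60          # 960, i.e. "about 1000"

book_words = 90_000
minutes_needed = book_words / words_per_minute    # 93.75 minutes
hours_needed = minutes_needed / 60                # 1.5625 hours

print(f"{words_per_minute} words/min, {minutes_needed} min, {hours_needed} h")
```

So "about an hour and a half" holds up: it works out to a little over 93 minutes at the stated rate.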
>>5493 >???? >PROFIT!! Kek.
>>5493 Thanks Kokubunji, nice repo instructions. Much appreciated.
>>5493 Chii a cute in the classroom. I like this idea. >16 words per second or 1000 words a minute Seems like that would be fast enough. I hope you figure out how to make your voice box run on a CPU, that would be great!
Full speech is tough even with modern software. But there is a trick I figured out a while ago. If you take robot speech and try to make a fluent dialog something is going to go wrong and ruin the illusion. However this only happens when your expectation is solid speech. Flip the script of expectations and build it to speak a foreign language you don't know and sprinkle in "pidgin" English. Your perspective goes from "this thing barely works" to "this thing is trying so hard". What was once an annoying glitch becomes cute effort as it tries its hardest. All it takes is prerecorded bursts of a foreign language mixed with a few awkward text to speech words.
>>5503 The actual simulation will be too slow to run in real-time but the data generated from it of the tongue position, jaw position, vocal posture and such can be used to train a neural network to synthesize a voice with the same parameters. By simulating a large variety of voices I hypothesize it could reverse engineer people's voices to some degree. However, the software I'm using cannot simulate sound vibrations from turbulent airflow (breathiness) or surfaces contacting (such as Ugandan Knuckles clicking), only resonance. I might be able to simulate breathiness though by modulating the sound with some randomness. Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game. It'll be better than nothing. The parameters should also make it possible to generate unique voices for random characters and customize waifu voices to one's liking.
>>5493 >RuntimeError: CUDA out of memory. Kek, so much for using Tacotron2. A 6 GB card isn't enough to train it.
>>5521 >geeks and robots lel'd >vocal posture Interesting. I don't think I've been familiar with that concept? >>5523 Are you running it 'dry' Anon? That is, no other resources especially vidya! contending for the GPU's memory resources?
>>5523 >A 6 GB card isn't enough to train it. AI Anon said 'Even on my toaster GPU' . Maybe there's some kinds of settings you need to tweak. Surely a toaster GPU has less than 6GB of RAM?
>>5525 Yeah, this card is fully dedicated to machine learning, not even attached to the monitor. >>5526 I found out the batch size parameter was hidden in hparams.py and it was set too high. It seems to be working with a batch size of 16. I'm surprised how fast it is. It'll only take about 40 minutes to train on 24 hours of audio clips. Now we just need a dataset of cute voice samples.
>>5520
>Your perspective goes from "this thing barely works" to "this thing is trying so hard".
Yes, I agree with that idea. Robowaifu naivete can actually be quite adorable, and it's an effective trope.
> pic related
>>5521
>Either way, converting text to the parameters that produce that sound should be far more efficient and embeddable in a game.
I see (well kind of, I think). If I understand correctly, the workload of the simulation is primarily used for generating these parameters? So if you pre-generate them ahead of time and store them somehow, then the second part where the parameters are read in then used to generate the actual waveforms should be computationally inexpensive. Is that approximately the idea anon?
chii has no knuckles...
<tfw ;~;
>helping anon find de whey
https://www.youtube.com/watch?v=IulR5PXiESk
>do it for princess chii anon!
>>5529
>Now we just need a dataset of cute voice samples.
I nominate Chii first. Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu?
>>5530 >Surely we could manage to create a clip library of all the Chii VA segments from the Chobits animu? I don't know if that'll be enough. Chii didn't really talk much. Each audio clip also needs text and the background noise has to be removed with something like Spleeter or DTLN. https://github.com/deezer/spleeter https://github.com/breizhn/DTLN It's worth a shot though. The average sentence in the dataset is about 20 words. Output seems to perform best around 10-15 word phrases. Keeping clips a sentence long would be best. I'm gonna try Rikka first since I have limited bandwidth and already have the entire show downloaded. I don't know how well it will handle Japanese though. We can probably automate audio clipping by using subtitles off https://kitsunekko.net/ and piping them through a denoising network. That way it's easy to train a whole bunch of characters.
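For the "automate audio clipping by using subtitles" idea, here is a hedged sketch of turning one subtitle event's start/end times into an ffmpeg command line. It only builds the argument list, it doesn't run anything, and it is not taken from any existing tool. The function names and file names are hypothetical; the ffmpeg options used (-i, -ss, -to, -map, -vn) are standard ones, and the 0:a:N stream specifier picks the N-th audio stream of the first input (counting from 0), which is how a second audio track on a dual-audio file could be selected.

```python
def ass_time_to_seconds(ts):
    """Convert an ASS timestamp like '0:01:31.87' (H:MM:SS.cc) to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def clip_command(src, start, end, out, audio_index=0):
    """Build an ffmpeg argument list that cuts [start, end] from one audio stream.

    -ss/-to are given after -i, so ffmpeg decodes from the beginning of the
    file and keeps only the requested span; -vn drops the video.
    """
    return [
        "ffmpeg", "-i", src,
        "-ss", f"{ass_time_to_seconds(start):.2f}",
        "-to", f"{ass_time_to_seconds(end):.2f}",
        "-map", f"0:a:{audio_index}",
        "-vn", out,
    ]

# Example: second audio stream (index 1) of a hypothetical dual-audio file.
cmd = clip_command("ep01.mkv", "0:01:31.87", "0:01:33.95", "clip_0001.m4a",
                   audio_index=1)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True). The denoising pass with Spleeter or DTLN would be a separate step on the clips afterwards.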
>>5530 Well, the idea of machine learning is to disentangle useful latent variables from high-dimensional data but without immense amounts of data it's exponentially difficult for backpropagation to separate them. For example, if you wanted to learn the x,y position of something on an image and control it but had a lack of data samples to train on, it may notice that the x and y values are correlated and become biased, so when you try to change the x dimension it causes the object to move diagonally instead because it failed to disentangle the latent variables. If the training data covers the latent space evenly and the model has access to all the underlying variables, it has a much easier time disentangling the data and can interpolate between the gaps without much confusion because it's getting input where that data belongs in the latent space. A smaller and simpler model can be used rather than a bulky slow one because it doesn't have to do all the guesswork of pulling all the dimensions apart trying to sort the data. >>5532 Done: https://gitlab.com/kokubunji/clipchan
>>5537 >clipchan error ModuleNotFoundError: No module named 'ass' > Apparently I need a Python dependency? The .ass file was extracted w/ ffmpeg.
>>5539 just in case it matters >
>>5537 >>5539 >>5540 Nvm, figured it out. > #1 Now I'm getting a 'file name too long' error. (probably some of the interstitial stuff, I can post the .ass text if you'd like. > #2 Also, it's extracting the first audio track (Japanese), but I want the second track (English). Any way to control this? Regardless, very cool work Kokubunji.
>>5541 sample clipchan results, btw. had to convert to .mp3 so I could post them here, but the originals are all .m4a >
>>5541
Found the Dialogue Event that broke things (the middle one)
Dialogue: 0,0:01:31.87,0:01:33.95,Chobits Dialogue,Comment,0,0,0,,I'm gonna go to Tokyo!
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP JP,,0,0,0,,{\fad(400,900)\be1}{\k15\1c&HDF6B7B&}Fu{\k21}ta{\k22}ri {\k23\1c&H4E4FDE&}ga {\k44\1c&HDE8162&}ki{\k45}tto {\k22\1c&HA1CA5D&}de{\k23}a{\k23}e{\k24}ru {\k43\1c&H226FCD&}you{\k48}na {\k20\1c&H56CED9&}ma{\k45}hou {\k26\1c&H7D79D7&}wo {\k47\1c&HDA90CB&}ka{\k48}ke{\k152}te.
Dialogue: 0,0:01:48.87,0:01:55.88,Chobits OP EN,,0,0,0,,{\fad(400,900)\be1}Casting a spell that will make sure they meet.
>>5537 Thanks for that detailed explanation. That helps, actually. Not sure how to word a cogent response, but the topic seems to make more sense to me now.
>>5532 Fair enough, good luck with Rikka Anon! :^) I'll try to sort out longer audio/subtitle clips of Chii's speech from all the episodes. I'd expect we should be able to find at least five minutes of this, just in case it might work.
>>5537 >>5541 >>5544 BTW, this is the name of the source file itself, just to be on the same page: 01. Chobits [BD 720p Hi10P AAC][dual-audio][kuchikirukia] [3DD90125].mkv The 'kuchikirukia' version seems to be the best quality one I've found over the years of the Chobits series, so I've just standardized on it.
>>5493 We just need the audio clip equivalent of "The quick brown fox jumps over the lazy dog." where there are enough use cases to build a speech pattern. >>5520 The fake foreign language option also sounds good. How about Klingon, or Hymnos (Reyvateil language in Ar Tonelico series)? Godspeed anon.
>>5539
Whoops, forgot to add requirements.txt. Anyone setting it up now can just do:
pip install -r requirements.txt
>>5541
I could output subtitles that are too long to a CSV file or something. Long audio clips need to be manually split up anyway. It seems the offending line is the opening with all the formatting code. You should be able to filter the events it clips to only character dialogue with:
--style "Chobits Dialogue"
But there may still be rare cases where English subtitles go over the 255 character limit. I'll start with scrubbing the formatting tags so openings can be clipped too. Also it might be useful to keep the raw audio clips and parameters used to generate them. Later when we clean them up we could use that data to train an AI to automatically crop and prepare clips.
>>5547
If there are too few, we can try augmenting the data by splitting up audio clips into shorter phrases to add to the dataset, as well as applying a random delay to the beginning, slightly changing the volume or increasing the tempo in Audacity.
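To make the style filter and tag scrubbing above concrete, here is a standalone sketch that parses Dialogue lines like the ones posted in >>5544, strips the {...} override blocks, and builds a filename short enough to stay under the 255 character limit. This is only an illustration, not clipchan's actual code; the helper names and the 200 character cap are assumptions. It relies on the usual ASS event layout of ten comma-separated fields (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), where only the last field may contain commas.

```python
import re

TAG_RE = re.compile(r"\{[^}]*\}")  # ASS override blocks such as {\k15\1c&H...&}

def parse_dialogue(line):
    """Parse one 'Dialogue:' line into a dict, or return None if it isn't one."""
    if not line.startswith("Dialogue:"):
        return None
    # Split at most 9 times so commas inside the text field are preserved.
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10:
        return None
    return {
        "start": fields[1],
        "end": fields[2],
        "style": fields[3],
        "text": TAG_RE.sub("", fields[9]).strip(),
    }

def safe_name(text, limit=200):
    """Turn subtitle text into a filename stem, truncated to `limit` characters."""
    cleaned = re.sub(r"[^\w\- ]", "", text).strip().replace(" ", "_")
    return cleaned[:limit]

def character_lines(lines, style):
    """Keep only parsed events whose Style field matches exactly."""
    events = (parse_dialogue(line) for line in lines)
    return [e for e in events if e is not None and e["style"] == style]
```

With the three events from >>5544, character_lines(lines, "Chobits Dialogue") keeps only the "I'm gonna go to Tokyo!" line, and the karaoke opening reduces to plain romaji once the tags are removed.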
Put up some quick instructions on cleaning audio clips with Spleeter and DTLN: https://gitlab.com/kokubunji/clipchan#remove-music-from-audio-clips It's pretty simple to use. DTLN is better at removing noise but the quality isn't as pleasant as Spleeter.
>>5550 >--style "Chobits Dialogue" Great!, that did the trick. Extracted 408 clips in about 2 minutes. > That's fine to get the Japanese clips (which, frankly I like Chii's VA's voice better tbh), but I'd like to get the English channel's clips too. Have any suggestions?
>>5555 >portentous digits tho Here are Chii's first 5 utterances to Hideki... > Ofc during the first episode her only words were cute variations of 'Chii'. :^)
>>5556 BTW, it's humorous to simply mpv * from inside the clip extract directory. Kind of like 'watching' the show on fast forward.
>>5554 Thank you. Yes, Spleeter seems to preserve the subtleties of the voice better. I wonder what the difference is? Regardless, I'll be post-processing the clip extracts from Chii's dialogue where needed. I don't have a specific time frame, but I plan to work my way through an episode or two here and there until I have the complete set. I'll probably post the completed set as a zip on Anonfiles when it's finished.
>>5555 I just pushed a bug fix and feature update. It should be able to process all subtitles now. Filenames that are too long are truncated and all needed subtitle text is written into filelist.txt in the output path. You can now inspect the subtitles with --inspect or -i before running and it will count how often the styles are used. The most used one is likely the character dialogue. >>5556 My heart can't handle this much cuteness at once. There's a lot of noise in them but some of them are still usable. >>5558 DTLN has a lower sampling rate than Spleeter and was designed for removing heavy background noise like air conditioners running. Good luck with it. If anyone doesn't have CUDA but wants to train a voice I don't mind training a character voice for them if they have the clips. We could probably train the voices on Kaggle or Google Colab too. If a lot of people become interested in the project one day that would be one way for them to get started.
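To give an idea of what counting styles involves, here is a minimal sketch that tallies the Style field of each Dialogue event in an .ass subtitle file. This is not Clipchan's actual code; count_styles is a made-up name, and it assumes the usual field order for events (Layer, Start, End, Style, Name, ...) rather than reading the Format line.

```python
from collections import Counter


def count_styles(ass_text):
    """Count how often each style name is used by Dialogue events."""
    counts = Counter()
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue
        # Split into at most 10 fields so commas inside the text stay intact.
        fields = line[len("Dialogue:"):].strip().split(",", 9)
        if len(fields) >= 4:
            counts[fields[3].strip()] += 1
    return counts
```

Calling count_styles(text).most_common(1) then gives the style that most likely holds the character dialogue, which is the kind of guess the post describes.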
>>5559 Great, thanks for the inspect flag and fixes. So, again, any way to specify the English language audio track for clip extraction (vs. the defaulted Japanese)?
>>5559 Haha, want an ASMR? create a playlist of all 28 Chii clips from ep01 and then mpv --playlist=01_chii.pls --loop-playlist=inf
>>5560 Sorry I missed that. Just pushed another update to select the audio track. Use --track 2 or -t 2 to extract the second audio track. Also added --quiet / -q to silence ffmpeg output unless there's an error and --guess / -g to automatically pick the most common style for extraction. Also a major update: the subtitles file is now optional and specified with --subtitles / -S. Subtitles can now be extracted directly from the video, and the subtitle track can be selected with -b if necessary. >>5568 Haha, that's a lot of Chii. It seems there's a bug though? The formatting tags shouldn't be showing in the filenames unless those are clips extracted from an early version.
>>5570 >Chii-levels > 9'000 IKR? >unless those are clips extracted from an early version. Yes, it's the older stuff I haven't redone it yet. I'll use the newer stuff for the final processing & edits, etc. BTW, there are still a few formatting things in the newer version. IIRC, '(/N)' (or something similar). Also, portable filenames (for instance that work on W*ndows) need some chars removed to work correctly. I dealt with this issue in BUMP. So, I'd say the !, ? and any other punctuation are good candidates for removal from the filenames for example.
>>5570 >Just pushed another update to select the audio track >Also major update: the subtitles file option is now optional Great! I'll try this over the weekend. Thanks for all the hard work Anon.
>>5571 There, added portable filenames. I noticed the recent version wasn't removing {} stuff again so I fixed that too. Now I just need to automate Spleeter and it should be good to go. >>5572 I don't even think of it as work I'm so excited for this. There's so much that can be potentially done with it from voicing waifus to redubbing anime. The memetic potential is infinite.
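For reference, a portable filename filter along the lines discussed above might look like the sketch below. It is only an illustration, not the code that went into Clipchan: portable_filename, the max_len default and the "untitled" fallback are all assumptions. It drops the characters Windows rejects, plus the exclamation mark suggested earlier in the thread.

```python
import re

# Characters Windows refuses in filenames, control characters, and '!'.
_UNSAFE = re.compile(r'[<>:"/\\|?*!\x00-\x1f]')


def portable_filename(text, max_len=200):
    """Turn a subtitle line into a name that is safe on common filesystems."""
    name = _UNSAFE.sub("", text)
    # Collapse whitespace runs and trim leading/trailing spaces and dots.
    name = re.sub(r"\s+", " ", name).strip(" .")
    # Truncate, then trim again in case the cut left a trailing space or dot.
    return name[:max_len].rstrip(" .") or "untitled"
```

The full, unmodified subtitle text still belongs in filelist.txt; only the filename needs to be sanitized.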
>>5570 Seem to be having trouble auto-pulling the subtitles. Here's a portion of the output showing which channel it's in: > #1 Here's the response I'm getting, trying the simplest approach: > #2 I'm sure I'm just flubbing it somehow. Maybe providing a specific example of grabbing auto-subtitles would help a bit ? I can generate them myself w/ ffmpeg, but I would much prefer using your method instead.
>>5575 >a portion of the ffmpeg output*
>>5575 >trying the simplest approach: Actually, I guess this is the simplest approach, but it breaks on me worse: >
>>5577 My bad, I forgot to push my code before going to bed, but I think inspect should have still worked with -b 3. I've updated debug mode -d to provide some more useful output, such as what it's running FFmpeg with. FFmpeg is getting an invalid argument list somehow. It may be due to your version of FFmpeg. Can you pull the latest version of Clipchan and try running these commands to see what they output now?
python ../clipchan.py -i 01.mkv -b 3 -d
python ../clipchan.py 01.mkv -d
ffmpeg -version
>>5581 Haha, no worries and thanks very much Anon. While you slept, I finished grabbing and sorting the basic clips for all Chii utterances in ep01-ep03, in both English and Japanese. This tool of yours is a remarkable time saver. Ofc all the clips will need explicit fine-tuning inside Audacity later, but your Clipchan is kind of revolutionary tbh. What a difference a day brings! :^)
>>5581 >Can you pull the latest version of Clipchan and try running these commands to see what they output now? Sure thing, here we go: python ../clipchan.py -i 01.mkv -b 3 -d > #1 python ../clipchan.py 01.mkv -d > #2 ffmpeg -version > #3
>>5583 Once it automates Spleeter, cropping and normalization it will be truly revolutionary. Every show will be fair game for effortless machine learning. >>5584 Your ffmpeg wasn't built with --enable-libzvbi. I pushed another update though that uses a different method to extract the subtitles from a video. Let me know if it works for you.
>>5581 >>5586 >Let me know if it works for you. Great, looks like your patch finds the subtitles stream now. > I simply installed the ffmpeg in the repo iirc. I can probably manage to build from their repo if you think it would be worth the trouble?
>>5587 >I simply installed the ffmpeg in the distro package repo iirc*
>>5587 It's fine, if it works now the dependency was unnecessary.
>>5589 Got you. Alright I'm off for a few hours at least. Cheers.
Spleeter is now automated in v0.2 but not fully tested yet. To try it put the Spleeter pretrained_models directory in the Clipchan directory and use Clipchan as usual plus --spleeter. Due to a bug in Spleeter, the terminal will reset after it completes to prevent the terminal from freezing. Next, to automate cropping and normalization I will make it look for the median point of silence in the padding area and crop it to 0.02 seconds before the next sound detected. This should be good enough. There are some alignment issues with my subtitles so I'm realigning them in Aegisub and reducing the default padding to 0.2s since it's not uncommon for subtitles to be 0.4s apart.
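A much-simplified sketch of the cropping idea described above: find the first and last samples that rise above a silence threshold and keep a 0.02 second margin around them. It skips the "median point of silence" step and is not Clipchan's implementation; crop_silence and the threshold value are assumptions, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
def crop_silence(samples, sample_rate, threshold=0.01, margin_s=0.02):
    """Trim leading/trailing silence, keeping margin_s seconds of padding."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []  # nothing above the threshold: the clip is all silence
    margin = int(round(margin_s * sample_rate))
    start = max(0, loud[0] - margin)
    end = min(len(samples), loud[-1] + 1 + margin)
    return samples[start:end]
```

A fixed amplitude threshold is the crudest possible detector; noisy clips would likely need an energy measure over short windows instead.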
>>5593 That sounds clever. Look forward to trying it out. Sounds like you're almost there Anon.
>>5574 > I noticed the recent version wasn't removing {} stuff again I found an explicit example of the newline char still being left in the filenames/dialogue text Dialogue: 0,0:05:36.45,0:05:39.03,Chobits Dialogue,Comment,0,0,0,,{\i1}What did I say in front\Nof such a beautiful lady? The '\N'
>>5597 >newline char that's still being left*
>>5595 Automated clipping and normalization is almost done. I think after this I'll try making a neural net that can detect which character is speaking. That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking. >>5597 Newlines are being removed from my subtitles. The only place they should appear is in the log file in debug mode. Try pulling the latest update and running the same command with -d and inspecting clipchan.log. It will show the reformatted text <> unedited subtitle text, something like this: [249] 00:20:2.960-00:20:5.980 (0.0) The magma of our souls burns with a mighty flame <> The magma of our souls\Nburns with a mighty flame!
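For anyone cleaning subtitle text by hand in the meantime, a small sketch of stripping the {} override blocks and the \N line breaks is below. It is an illustration only, not the code in Clipchan, and clean_subtitle is a hypothetical name. It assumes standard ASS conventions: override tags live inside curly braces, and \N, \n and \h mark line breaks and hard spaces.

```python
import re

_OVERRIDE = re.compile(r"\{[^}]*\}")  # {\i1}, {\an8\fs20}, ...
_BREAKS = re.compile(r"\\[Nnh]")      # \N, \n, \h


def clean_subtitle(text):
    """Remove ASS override blocks and line-break markers from subtitle text."""
    text = _OVERRIDE.sub("", text)
    text = _BREAKS.sub(" ", text)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Run on the Dialogue text quoted above, this yields "What did I say in front of such a beautiful lady?".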
>>5601 >That way an entire show can be fed in and sorted automatically using a few examples of the characters speaking. That sounds awesome. I think I'm going to put my tedious effort on hold and wait for your better approach Anon. :^) You know it occurs to me that you could probably do a 'two-pass' approach as well (at the cost of longer processing) that could sort of do all the cleanup, crops, speaker identification, etc., then feed that information back into a second pass sequence that should then be able to improve the accuracy of the crops and noise removal of tricky bits that have a lot going on audio-wise in a short time span, for example.
>>5603 Seems to me, this could also be used to improve a series' subtitles as well. Sort of an auto-gen for subtitles, that are actually timed very well, and also more accurate with the actual text. For example, the engrish-translations in some English subtitles often aren't right on-cue with the English VA scripts (even if often much more humorous/possibly more accurate to the original Japanese meanings/idioms). Seems like that might save having to go in and edit the filelist.txt entries by hand before passing them into the machine learning so the audio/text actually matches first.
>>5603 The cropping is perfect so long as the subtitles are aligned correctly. Sometimes sound effects slip through Spleeter but that can't be avoided. Speaker identification is going to require building a dataset first to train on. >>5605 Auto-aligning subtitles will be tricky. I could probably fix small misalignments with the same method I'm using to crop the audio clips by snapping them to the nearest sensible spot. I'd have to run Spleeter over the whole episode first which shouldn't be too big of a hit since it has to convert the clips anyway. I'll add this feature idea to the to-do list. Maybe two projects down the line someone will create some speech recognition for it that can generate subtitles.
Trying to find the instruction on your repo for removing music from clips. > #1 Discovered a minor naming issue w/ instructions. > #2 Then realized (afaict) a showstopper (for me at least) dependency. > #3 I suppose you can't do this w/o a Nvidia GPU then?
>>5613 >Trying to follow*
>>5613 Spleeter runs off CPU by default. You need to downgrade to Python 3.7 to install Tensorflow.
Trained on 2B's voice overnight without data augmentation and possibly too high a learning rate. It's not perfect and there's only 18 minutes of training data, but the results are pretty satisfying. The training data I used is available here: >>5620 Filelist.txt: https://pastebin.com/y3GyyBtR Once I fine tune it better I'll create a Google Colab so anyone can use it even without a GPU.
>>5615 >You need to downgrade to Python 3.7 to install Tensorflow. I have no idea how to do that tbh and I've fought trying everything I know how to do to get spleeter working but have repeatedly failed. I'll just focus on extracting and sorting out the clips for now since clipchan does that part well.
>>5626 Haha, wow that's pretty nice already. Great stuff Anon.
>>5626 >Google Colab Any chance of creating a mechanism to save pre-canned responses out locally? I mean audio files and some way to associate them with the input texts. It's one thing to use Google Colab intermittently as a generator for locally-stored content, it's another thing entirely to become wholly dependent on G*ogle for our waifu's daily operations.
>>5627 If you're on a Debian-based distro you can check which versions of Python are available with apt-cache policy python3 and to downgrade aptitude install python3=3.7.3-1 or whatever 3.7 version is available in your distro. Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. If that fails, Tensorflow 2 is compatible with Python 3.8 and I can try porting these old projects to Tensorflow 2. >>5629 Yeah, you can use the Levenshtein edit distance to find the closest match and play that pre-generated response. You could generate a whole library of words, phrases and sentences then stitch them together. If someone is really ambitious they could probably write some code for evaluating PyTorch models in OpenCL or simply port the code to the CPU. At the end of the day though if someone doesn't wanna be dependent on Google or Kaggle for compute they need to get a GPU.
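The Levenshtein lookup mentioned above can be sketched in a few lines of pure Python. This is just an illustration under assumptions: the library is a hypothetical dict mapping input text to the path of a pre-generated audio file, and closest_response is a made-up name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest_response(query, library):
    """Return the stored audio path whose text is nearest to the query."""
    key = min(library, key=lambda k: levenshtein(query.lower(), k.lower()))
    return library[key]
```

For a large library a linear scan like this gets slow, so a real setup would probably want an index or a distance cutoff, and a fallback when nothing is close enough.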
>>5630 >Just be careful it doesn't remove any packages and finds a resolution that downgrades packages as necessary. I have no idea how to do that. I did find a tool called downgrade and so I'll try to use that. Thanks.
>>5630 Great samples. Kek. >>5631 Be careful not to mess up your system by downgrading your regular Python or install all kinds of stuff: https://www.pythonforbeginners.com/basics/how-to-use-python-virtualenv
Past couple days have been hell tracking down strange bugs and trying to get this to work flawlessly but it's working well now. --auto-clean normalizes, removes silence and resamples clips to prepare them for machine learning and has a success rate of about 98%. It gives warnings for which clips need further attention. Most of the time these lines aren't usable anyway since they contain overlapping audio or other strong background noise. Also added another tool for quickly captioning audio clips called filelist.py. It goes through all the wave files in a folder and plays them, prompting you for what the line should be. Hopefully it's pretty straightforward to use. You will need to install playsound to use it: pip install playsound With that, Clipchan is pretty much done and ready for waifu datamining. Enjoy! https://gitlab.com/kokubunji/clipchan
>>5633 I see, thanks for the advice. >>5648 Thanks for all the hard work. Does --auto-clean rely on dependencies, or no?
>>5615 Ironically enough, I can install Tensorflow 2 just fine on my system. But even when I install TF 1.4, Spleeter refuses to recognize it, and I basically broke my system trying to downgrade to Python3.7 from 3.8 (thankfully I seemed to have recovered from that now). Even when I successfully installed TF1.4 on a RaspberryPi (Debian Buster-based, and already Python3.7), Spleeter still refused to recognize it and failed to install.
Here's as far as I've gotten:
spleeter 1.4.0 requires museval==0.3.0, but you'll have museval 0.3.1 which is incompatible.
spleeter 1.4.0 requires pandas==0.25.1, but you'll have pandas 1.1.3 which is incompatible.
spleeter 1.4.0 requires tensorflow==1.14.0, but you'll have tensorflow 2.3.1 which is incompatible.
>>5651 >TF 1.14*
>>5649 It depends on Numpy and Scipy at the moment, but I'm gonna remove the Scipy dependency and use the standard library's wave instead. >>5651 Figures. Downgrading is always a nightmare. I'll see if I can port it to Tensorflow 2. Fortunately there's some code to automate translating projects.
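As a rough idea of what reading clips with the standard library's wave module looks like, here is a sketch that loads a 16-bit PCM file and downmixes it to mono floats. It is not the code planned for Clipchan; read_wav_mono is a made-up name and the sketch only handles 16-bit samples.

```python
import array
import sys
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM wav file and return (sample_rate, mono_floats)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # wav data is little-endian
    # Average the channels of each frame and scale to -1.0..1.0.
    return rate, [sum(pcm[i:i + channels]) / (channels * 32768.0)
                  for i in range(0, len(pcm), channels)]
```

Resampling to another rate is a separate step that this sketch does not attempt.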
>>5677 It seems just a few days ago Spleeter 2.0 was released on PyPI that's compatible with Python 3.8:
pip install spleeter
I created a separate branch for Spleeter 2.0 and Tensorflow 2.3.0:
cd clipchan
git checkout python3.8
pip install -r requirements.txt
I've tested that it's compatible with the Spleeter 1.4 pretrained models. It seems people have already ported Tacotron2 and WaveGlow to Tensorflow 2 so I'll work on creating a Python 3.8 branch for WaifuSynth too.
>>5682 Great news! I'll give it a shot tonight.
>>5682 BTW (OT) what bearing is the advice to use the '-m' flag with pip? As in pip -m install foobar I've seen that advice (and examples) often, but I don't think I understand what difference it makes yet.
>>5682 Had to first remove the leftover 1.4 spleeter egg file from /usr/lib/python3.8/site-packages/ > #1 Things seemed to be going well, then it errored out with this > #2
>>5687 Apparently, you can specify a version number (but guys recommend against this approach for some reason?) > #1 I'm not sure if this means everything went ok now with pip install spleeter > #2 but I'll push ahead with checking out the 3.8 branch of clipchan...
Don't forget to git fetch first before checkout. > #1 Minor typo in requirements.txt > #2 Seems OK so far now > #3 I'll give it a test in a while and see how --auto-clean goes. Any specific example command you'd care to give me so I don't flub the increasingly complicated heh :^) clipchan flags?
Here's the command I used: python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --spleeter --auto-clean --track 2 Which produced ~400 .wav files for me > #1 but errored out on me right at the beginning of the spleeter post-processing phase. > #2 And the log file seems empty? > #3
>>5690 Also, I'm curious why filelist.txt is being written in the working directory instead of the target directory. Intentional? Seems like it used to be written into the same directory as the output .m4a files IIRC.
>>5686
python -m pip
The -m flag tells python to run a module, in this case pip, from that specific python installation. >>5687 Specifying certain versions leads to unnecessary dependency hell. Unfortunately, dependency hell is the reality because all these different machine learning libraries require specific versions or they break. >>5689 Thanks, fixed the typo. Most of the flags are there to give some flexibility and don't need to be changed. It should work great with the default settings. >>5690 If the log is empty I assume Spleeter failed to start. What happens when you run Spleeter by itself?
python -m spleeter separate -i clips/en/01/*.wav -o /tmp -n filename
>>5691 You can change where filelist.txt is written to with --filelist-output-path / -l or name it to something else like ep1_filelist.txt with --filelist / -f
Hmm, a surprise. I deleted all the output .wav files from the previous effort and decided to try again w/o the --spleeter flag python ../clipchan.py 01.mkv --output-path clips/en/01 --style "Chobits Dialogue" -b 3 --auto-clean --track 2 but had the same outcome? >
>>5692 > from that specific python installation. Hmm, not sure I really understand that (probably why I didn't get it before heh). But this is probably more apropos for our Python General thread? >>159 >Thanks, fixed the typo y/w. >You can change where filelist.txt is written to with Great, more flags to deal with haha. <pull request: default filelist.txt into the same output path :^)
>>5692 >What happens when you run Spleeter by itself? What should I pass as the filename? (sorry if this seems blatantly obvious to everyone else) >
>>5692 >>5695 When I dropped the unrecognized flag, spleeter unceremoniously crashes with no further output >
One other thing. I noticed a spleeter.sh file in the directory (no doubt left over from the 1.4 branch of clipchan). This bash script is explicitly for spleeter v1.4.0 it seems. Does there need to be a similar script for the newer spleeter v2? Just spitballing here tbh.
>>5694 You can have different versions of Python installed, each with their own pip module for managing packages. And sure, I'll make the output folder the default path. >>5693 Double check your command. It's not possible to reach that part of the program unless --spleeter or --spleeter-model is given. >>5695 That is a bug in Clipchan. It seems Spleeter changed its arguments and I didn't realize I was using 1.4 when I was testing because Python ignored the virtual environment. I will have to investigate how Spleeter 2.0 outputs its files. >>5696 However, this is caused by Tensorflow being built with optimizations not supported by your CPU. Likely the only option to work around this is to build Tensorflow 2.3.0 from source: https://www.tensorflow.org/install/source >>5697 Pip simplifies the installation of Spleeter. I'm going to change the other one to use pip as well.
Alright, I'm trying to build Tensorflow from source r/n. I'll start over completely from scratch with Clipchan if that goes well. Hopefully, the new 3.8 copy of Clipchan will go well then.
Well, unsurprisingly that was a miserable failure. Building Tensorflow appears well above my paygrade. After a week of fighting with this with no success I'm about done with it. Question to Kokubunji: if I simply sort the Chii vocals out of the raw .wav clips and then package them up for you in a zip somewhere, can you take it from there and do all the rest? It's obvious at this point I can't get Clipchan's no doubt marvelous features to work correctly.
>>5701 Yeah, Clipchan will automate the rest. I'll see if I can find a Tensorflow 2.3.0 pip wheel without AVX or build one for people to use. Most of the machine learning libraries have dropped support for legacy CPUs in a similar way researchers have dropped support for CPUs. Now that I think about it, a lot of the papers require massive GPU clusters or 100's of TPUs and they don't release their models or code. We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else?
>>5702 Thanks, I'll get on that and hopefully have it for you within the week. (I already got about 12 episodes done from before but have to redo with .wavs heh). My main issue is my lack of experience with python. I'm more interested in the underlying engines it's scripting that do the actual heavy-lifting. We'll need both skill areas, and we're doing OK in that regard IMO. Thanks for all the great innovation and also the hard work helping us get things working on lower-end hardware. You are much appreciated Anon. >We're already being left in the dust and if /robowaifu/ can't get AI to work, then what hope is there for everyone else? Ehh, we'll get things working, you obviously already have many things now. We're just on a limited budget as typical Anons vs. Big Tech has unlimited deep pockets. It was always going to be asymmetric for us. We just have to be clever about things. I'm not going to stop, I just have to recognize my current limits. We'll all learn more about things as we go along.
>>5702 > Dropped support for legacy CPUs Whhahaaaah, they do that?!? I've spent the last week or so thinking about building a server based on old Xeon CPUs and maybe also buying a Xeon Phi, as the external brain for my waifu... F... https://youtu.be/ZKkzEBtIoH8
>>5708 Yes, this is a common idea I think. It's a strong argument for us to succeed at figuring out a way to use old hardware effectively. There's faar more of that kind available the world over than the other. It's probably fundamentally less botnetted as well, also an important point.
Same anon as >>5708 Looked into it a bit. Pytorch for example seems to support CPUs via the Intel Math Kernel Library, which seems to have no limitations on which CPU it works with, except that the optimizations might not work on non-Intel CPUs: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html
>>5704 Yeah, getting this stuff to work needs to become as frictionless as possible for everyone to get involved. People don't have the time or energy for hitting their head on a wall for a week. There's about 1-2 years left before AI really takes off so it's definitely doable to get everything ready for people to use. I can imagine something like Transcoder translating PyTorch and Tensorflow code and models to mlpack in C++ which can build for any system, including embedded systems. >>5708 >>5712 Xeon and Xeon Phi should be fine. The public builds of PyTorch and Tensorflow have required AVX and AVX2 since sometime around 2018. The devs have said multiple times they aren't supposed to but the instructions keep slipping into the builds and they don't do anything about it. Sometimes they do but then a minor version later it's fucked again. They've effectively cut off millions of people from using or learning how to use neural networks on common hardware. And just a few years ago PyTorch worked on 32-bit systems but they dropped support for 32-bit builds too. In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. That would completely disentangle ourselves from Facebook and Google and be a huge step forward to keeping AI open.
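If you're not sure whether an old box has AVX at all, the sketch below pulls the flag list out of the text of /proc/cpuinfo. It assumes Linux, where x86 kernels list CPU features on a "flags" line and ARM kernels use "Features"; cpu_flags is a made-up helper name, not part of any library.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()  # no flags line found (not Linux, or an unusual format)

# Typical use on Linux (left as a comment so the sketch runs anywhere):
#   with open("/proc/cpuinfo") as f:
#       flags = cpu_flags(f.read())
#   print("avx" in flags, "avx2" in flags)
```

If "avx" is missing from the set, the stock pip wheels discussed above will most likely crash on import, matching the behaviour reported earlier in the thread.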
>>5718 >Transcoder Is it this project Anon? > Unsupervised Translation of Programming Languages >A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
>>5718 A week ago I wouldn't have recognized the term AVX, but yeah even ood Xeons have that, just not the newer versions of it, which would be much better. Still a bit concerning if old CPUs are getting ignored, especially for young students doing this on their own small budget. Still, it's amazing what hardware you can get for even 1k, and I figure the devs have to draw the line somewhere. Should have bought Bitcoin when you were in primary school or so, I guess.
>>5721 *old
>>5719 Yeah, that's it. The source code is available here: https://github.com/facebookresearch/TransCoder/ But this is getting off-topic from speech synthesis.
>>5718 >In a few months I'll definitely see if I can port Spleeter and WaifuSynth to mlpack. <Over 9'000!!! Haha not only are they cutting-edge AI library devs, they are shitposting memes within their own paper. Godspeed Anon.
>>5718 We're behind you, Anon! >>5730 >>5731 >>5732
Alright, I've tried my best to build a Tensorflow pip wheel without AVX and mostly succeeded but some AVX instructions are still slipping through inside pywrap, tf2xla and libtfkernel. On the Tensorflow community wheel page no one has succeeded yet in making an AVX-free build for 2.3.0 and I can't even begin to figure out where they forgot to include the optimization parameters in Google's 9 GB clusterfuck. So I've taken a look at the Spleeter model and it uses some algorithms not implemented in mlpack yet and the same is true for Tacotron2 and WaveGlow. I should be able to translate the functions but it's gonna be a lot of work, at least 6 months full-time. I'm gonna play around with mlpack more to see if it's worth the effort. The most off-putting thing is the lack of good GPU support but they're working on that. There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms.
>>5736 Well, no one could possibly criticize your creativity and efforts thus far Anon. You've already done amazing work, and I'll support you whatever your decision. If you choose to simply wait until the mlpack codebase improves, then perhaps anons can simply do what I'm doing; namely just produce waifu vocal clips and then put them out there for those with more hardware to process. This will be a hit-or-miss approach IMO, but it preserves the status quo. It would be the choice of a sensible man, no doubt. If you choose to press forward with the hard choice, then I'd be happy to do any simple tasks I can to help take some of the load off you. Profiling the specific performance characteristics of particular commands/datasets on either my old Intel or ARM hardware, for example. Certainly the template generics approach the mlpack team has taken is both a tried-and-true one, and the run-time performance thus far seems to be smoking hot on this tiny arm7hf hardware currently at my disposal. The high-resolution clock from the standard C++ library seems to work just fine on this old hardware, afaict. If you can give me a list of tests to perform, I'll be happy to participate in that small way in your big effort Anon. Again, regardless, you've already made remarkable strides, and you deserve applause for that alone. :^)
>>5736 >The most off putting thing is the lack of good GPU support but they're working on that. Interesting. Can you give us details on the efforts so far? >There might even be other people willing to help since it would be a significant contribution to mlpack to have all these signal processing algorithms. SYCL is a higher-level form of OpenCL, and entirely in standard C++. This implies some probability of a long-term, broad availability on a widely diverse set of hardware. If you've ever had experience with Microsoft's C++ AMP, you can think of this as a more sophisticated progression of that approach. https://sycl.tech/ CodePlay is taking a lead at heading up practical solutions to support this standard. The ISO C++ Standards Committee also has a study group SG14 (the GameDev & low latency ISO C++ working group), headed up by Michael Wong, the chairman of OpenMP. It seems to me this is probably the most likely path the committee will progress down towards direct, standard support for heterogeneous computing accelerators such as GPUs. Probably worth thinking about in the long-term though quite early to put all your eggs in that one basket just yet IMO. >Porting our robowaifus to mlpack This is likely to be a big, complex topic and probably deserves its own thread.
>>5698 Sorry, I don't know how I missed that filename Anon. That's exciting! So we're really going to have robowaifus then?
>>5861 >Chii I'm roughly halfway now through sorting out the new .wav versions of Chii's vocals. Should I continue or do you already have them? I estimate it will take me another couple of weeks total to finish up, then go through them all and trim them in Audacity, then make all the many edits needed in each filelist.txt to make the English VA's actual words. BTW, these .wav files are full 16-bit stereo files, but IIRC you mentioned something on the repo about 22.5K mono files instead? Should these be downsampled first after I sort them?
>>5863 >to match the English VA's actual words.*
>>5863 I haven't done Chii yet. It will take several days at least to train a new model from scratch, starting with the LJSpeech dataset. You shouldn't need to trim them in Audacity though. It only takes a minute to align the subs in Aegisub. A little bit of noise within 200ms of the start and end of clips will be found and clipped out automatically. The clips shouldn't be resampled before going into Spleeter. They will get resampled into mono 22050 Hz automatically after running Spleeter from the Clipchan master branch (but not the 3.8 branch since it's behind). When I wake up I'll update filelist.py to automatically create the filelist from a subtitle file. That'll make things a lot easier for English VAs instead of tapping > and fixing the punctuation. I originally made it for subbing audio clips without subtitles when I was going through 2B's, but in practice I've found it's a lot easier to create subs in Aegisub for things like audio from a YouTube video and then run Clipchan on them.
>>5866 > It only takes a minute to align the subs in Aegisub. Unfortunately I didn't do that ahead of time (like an idiot), and as I said I've already ripped the entire thing and I'm halfway through sorting them out (requires listening to every clip in order ofc--basically the time req'd to watch the entire episode). I can start the entire process over again for the fourth time haha if you think it would be more expedient. It would be really helpful if we had some sort of full tutorial video from you about the correct way to use Clipchan, start to finish, posted on bitchute or somewhere Anon. Regardless, I'll do what needs doing to get Chii's voice on her virtual waifu's avatar. Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character?
>>5866 BTW, I'm the anon who's on the 3.8 branch...
>>5863 By the way why do you need to edit filelist.txt to make the English VA's actual words? It's already automatically generated by Clipchan. There are English subtitles for Chobits here: https://kitsunekko.net/dirlist.php?dir=subtitles%2FChobits%2F
>>5907 Simply b/c many (~ >50%) of the English subs in my source widely diverge from the actual English voice track for Chii. The longer the sentence(s), usually the worse it becomes.
>>5913 > (~ >50%) OK, that's probably an exaggeration, but it's certainly several hundreds of dialogue line examples in the whole series of 25 episodes.
>>5867 >Also, I wonder if the other characters a VA does for other animus might also be helpful in contributing to a full & capable trained model for a character? Perhaps, some VAs change their voice acting quite a bit between characters. It should give the model more data to work with. A big issue with 2B's voice is there isn't enough voice clips to cover every possible word, but I'm hoping this multi-speaker version will learn to fill in the blanks. >>5913 Oh, that sucks and makes sense. Once I finish the next version of WaifuSynth I'll see if I can extend it to speech recognition because that's gonna be a big hassle for people trying to train their own waifu voices.
>>5915 Yeah, the VA for Chii is rather prolific WARNING: don't look her up, it will ruin everything haha! :^) and I thought that since Chii isn't really too diverse in her vocabulary (part of the storyline arc ofc), that perhaps the statistical modeling effect of AI might benefit if I can find another character she did that wasn't too far from Chii's 'normal' voice. >multi-speaker fills in the blanks That's good news. Here's hoping. >auto voice recog That would truly make this into an amazing all-in-one toolset Anon.
>>5916 Anyway, for now don't worry about resampling the clips. They should be the highest quality available before going into Spleeter. In Aegisub you can load up the video or audio, align the subtitles, type it in the proper line, and hit enter to proceed to the next one. When Clipchan is complete I'll make a video explaining the whole process.
>>5917 OK, thanks for the explanation. Sounds like I need to start over with this. Not sure what my timeline will be, probably somewhere around the Trump win.
For some reason I thought I uploaded the 2B voice model for WaifuSynth already but I didn't. You can get it now here: https://anonfiles.com/Hbe661i3p0/2b_v1_pt
>>5932 >2B CATS remake wehn? /robowaifu/ for great justice. This needs to happen.
>>5917 Welp, I just wasted an entire day trying to get Aegisub up and running with no success. Just as an offhand guess, I'm supposing you're not running it on Linux (but rather on W*ndows)?
>>5945 I quit using Windows over a decade ago. What problem are you having with it?
>>5945 It could be either an issue with FFMS: >After upgrading my Linux distro, i ran Aegisub and got this error >aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing >So i had to downgrade from ffms2 2.40 package to ffms2 2.23.1 https://github.com/Aegisub/Aegisub/issues/198 Or Wayland, Aegisub requires x11: >GDK_BACKEND=x11 aegisub does not crash. https://github.com/Aegisub/Aegisub/issues/180
>>5949 >>aegisub-3.2: symbol lookup error: aegisub-3.2: undefined symbol: FFMS_DoIndexing That was exactly the error from the distro package manager install that started me down this long bunny trail. I never found that issue link in my searches. :/ Anyways, I went and downloaded the repo and tried to build from source, but then discovered I had to have wxWidgets as well, so I had to back out then build that from source (dev version took hours to finish, but at least it succeeded in the end). Afterwards, the Aegisub build failed with 'references not found' type errors. Too many to remember, and I tableflipped.exe and closed the terminal in disgust after all those hours, so I can't recall exactly. Anyway thanks for the links. I'll try it again tomorrow.
>>5932 One thing I'm not perfectly clear on Anon, can WaifuSynth be used for other languages? For example, since animu is basically a Japanese art-form, can your system be used to create Japanese-speaking robowaifus? If so, would you mind explaining how any of us would go about setting something like that up please?
>>5861 OMFG anon this is awesome! Crafting 2B's perfect ass out of silicone will be challenging but this is all the motivation I need!
>>5648 Anon what happened to your gitlab profile? It is deleted, can you post your new one?
>>7560 Anyone downloaded this, at least for archiving reasons? This is also gone: https://anonfiles.com/Hbe661i3p0/2b_v1_pt from here >>5932
For singing, https://dreamtonics.com/en/synthesizerv/ has a free Eleanor, which is pretty fantastic. As with all vocaloid -type software, you have to git gud at phonemes.
>>7576 It is possible that he deleted all the contents and left robowaifu after the latest drama. He might be the anon who was involved in the latest one. If that is the case it's pretty unfortunate. I hope he recovers soon and comes back.
>>7594 Possible, but I hope this isn't it. Kind of radical. I tried to explain to him as reasonably as possible what the problem was. Whatever, I don't wanna get into that again. The more important point is: I think he gave us enough hints on how to do this stuff. I'm not claiming that I could reproduce this clipchan program, but I had the same idea before I read it here. It's kind of obvious to take subtitles to harvest voices. Which means there will be other implementations on the net doing that and explaining how to. We don't need someone to come to us or to be into anime or robowaifus; we can just take some other implementation from another place or have someone reproduce it based on the knowledge available.
>>7594 What drama happened besides the migration? I've been too deep in my projects to browse like I used to.
I'll post this here for now, since it's definitely relevant. I was experimenting a little bit more with Deltavox RS and Audacity. It seems that there is no "one size fits all" solution when using Deltavox. In order to get a decent result, you have to experiment with different spellings, phonemes, energy, F0, bidirectional padding, and so on. In Audacity, I used a simple filter curve. I was able to get noticeably less tinny audio, which sounds less computer generated. I'm going to explore more options for editing the audio after it's been synthesized to improve its quality. I'll post again if I find anything interesting. I'll repost the links here since they're still relevant: Deltavox User Guide https://docs.google.com/document/d/1z9V4cDvatcA0gYcDacL5Bg-9nwdyV1vD5nsByL_a1wk/edit Download: https://mega.nz/file/CMBkzTpb#LDjrwHbK0YiKTz0YllofVuWg-De9wrmzXVwIn0EBiII
>>8150 Thanks. BTW, do you know if this is open source? Since QT dlls are included I presume this is C++ software. If both are true, then it's very likely I can rewrite this to be portable across platforms -- not just (((Wangblows))) and we can be running it on our RaspberryPis & other potatos. Thanks for all the great information Anon.
>>8244 And one based on Vocaloid: https://youtu.be/OPBba9ScdjU
Is the voice synthesizer going to be for English voices or Japanese voices? Or does one also work for the other? It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that voice.
>>9110 >It would be pretty awesome if one could take voice samples of their favorite VA, vtuber, etc, put it through a voice synth A.I., and give their robowaifu that her voice. More or less, that has already been achieved Anon. So hopefully more options along that line will soon become readily available for us all.
>>9112 Our guy here called his WaifuSynth. ITT there are examples from the ponys, who have taken a highly-methodical approach for all the main characters in MLP:FiM cartoon show.
I see. Though all the synths seem to be for English voices. I'm guessing the 2B, Rikka, Megumin, Rem, etc mentioned in >>5861 are referring to their English VA rather than the Japanese ones. Unless I'm missing out on something? (If so, then maybe it'd be best for me to make some time and read this whole thread.)
>>9118 AFAICT, the training approach is just a statistical system matching sounds to words based on examples. It should work for any human language I think -- though you would need to be fluent in the target language to QA the results ofc.
>>9119 Ohh, I see. One last thing: I wouldn't be wrong to assume that, since the dropping of "kokubunji", there is no one working on the voice for robowaifu?
>>9112 WaifuSynth: https://gitlab.com/robowaifudev/waifusynth Clipchan: https://gitlab.com/robowaifudev/clipchan There are better methods now like FastPitch and HiFiSinger. FastPitch is about 40x faster than Tacotron2/Waveglow (what WaifuSynth uses) and is less susceptible to generation errors but is still far from perfect. HiFiSinger uses three different GANs to make realistic speech and singing, and its 48kHz model outperforms the 24kHz ground truth but it still has room for improvement in 48kHz, although I suspect it could be near perfect by training a 96kHz model. FastPitch: https://fastpitch.github.io/ HiFiSinger: https://speechresearch.github.io/hifisinger/ There's still a lot of research to be done before this stuff will be ready for production, namely imitating voices without training a new model, emotion/speech style control, and ironing out undesired realistic noises in generation. Probably in the next 2-3 years it will be easy to make any character sing given a song or read out any given text, and you won't have to go through the whole hassle of collecting audio clips and training models yourself. >>9118 Making Japanese VAs speak English and English VAs speak Japanese should be possible but you will have to train a model that converts the input to phonemes, otherwise it will just garble and misread text. Training Tacotron2 takes a really long time so I'd recommend modifying FastPitch to use phonemes instead of characters. All you have to do is instead of inputting characters like 's a m u r a i', you input the IPA 's a m ɯ ɾ a i'. You can probably save a lot of time on training by initializing the embeddings of the IPA symbols to the character embeddings of a pretrained model, then train it on LJSpeech or another dataset until it sounds good, then fine-tune it on the desired character. 
This paper reports that it only takes 20 minutes of audio to speak a new language using an IPA Tacotron2 model but they don't provide their trained model or code: https://arxiv.org/abs/2011.06392v1
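The embedding trick above (start each IPA symbol from the embedding of a similar character in a pretrained model) might be sketched like this. It is framework-agnostic and uses plain lists; with a real FastPitch checkpoint you would copy rows of the symbol embedding weight instead. The function name, the toy vocabularies, and the IPA-to-character mapping are all assumptions for illustration.

```python
import random

def init_ipa_embeddings(char_vocab, char_embeddings, ipa_vocab, ipa_to_char, dim, seed=0):
    """Build an IPA embedding table, reusing pretrained character rows where possible.

    ipa_to_char maps an IPA symbol to the character whose pretrained
    embedding is a reasonable starting point (a hand-made guess).
    """
    rng = random.Random(seed)
    char_index = {c: i for i, c in enumerate(char_vocab)}
    table = []
    for symbol in ipa_vocab:
        source = ipa_to_char.get(symbol)
        if source in char_index:
            # Copy the pretrained row so training starts from something sensible.
            table.append(list(char_embeddings[char_index[source]]))
        else:
            # No obvious counterpart: fall back to a small random init.
            table.append([rng.gauss(0.0, 0.1) for _ in range(dim)])
    return table

# Toy stand-ins for a pretrained character model (values are made up).
char_vocab = ["a", "i", "m", "r", "s", "u"]
char_embeddings = [[float(i)] * 4 for i in range(len(char_vocab))]

ipa_vocab = ["a", "i", "m", "ɯ", "ɾ", "s", "ʔ"]
ipa_to_char = {"a": "a", "i": "i", "m": "m", "ɯ": "u", "ɾ": "r", "s": "s"}

ipa_table = init_ipa_embeddings(char_vocab, char_embeddings, ipa_vocab, ipa_to_char, dim=4)
```

Symbols with no sensible character counterpart (the glottal stop here) just get a fresh random row and have to be learned during the LJSpeech pass.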
>>9121 Also, you can use https://ichi.moe/ to convert Japanese subtitles from https://kitsunekko.net/dirlist.php?dir=subtitles%2Fjapanese%2F into romaji and then convert the romaji to IPA. Japanese IPA is straightforward since the syllables sound exactly the same as they are written, unlike English: https://en.wikipedia.org/wiki/Help:IPA/Japanese
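A minimal sketch of the romaji to IPA step, assuming Hepburn-style romaji as input. The table is deliberately tiny and simplified: it ignores long vowels, the moraic n, gemination and other details covered on the Help:IPA/Japanese page, so treat it as a starting point only.

```python
# Simplified, illustrative mapping; anything not listed maps to itself.
ROMAJI_TO_IPA = {
    "sh": "ɕ", "ch": "tɕ", "ts": "ts",
    "j": "dʑ", "f": "ɸ", "r": "ɾ", "y": "j", "u": "ɯ",
}

def romaji_to_ipa(word):
    """Convert one romaji word into a list of IPA symbols (greedy, longest match first)."""
    symbols = []
    word = word.lower()
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in ROMAJI_TO_IPA:
            symbols.append(ROMAJI_TO_IPA[pair])
            i += 2
            continue
        ch = word[i]
        if ch.isalpha():
            symbols.append(ROMAJI_TO_IPA.get(ch, ch))
        i += 1  # skip apostrophes, hyphens and other non-letters
    return symbols

samurai = " ".join(romaji_to_ipa("samurai"))
```

Here `samurai` comes out as `s a m ɯ ɾ a i`, the same form as the model input described in >>9121.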
>>9121 >>9122 !! Thanks so much Anon!
>>9121 Nice, thanks! I have a 'new' machine (well, old but still much better than my old notebook) pieced together that has an i3 and an Nvidia GT430 (or possibly a 750ti). Not too impressive I know, but I could use it to take a shot at setting up clipchan again. Mind giving me specific set up advice Anon? Like the OS to use, Python version to use, etc., etc. The more specific, the better. TIA.
>>9125 2 GB might not be enough to train FastPitch but you might squeeze by with gradient checkpointing and gradient accumulation to reduce memory usage. A 1 GB card will certainly be too little since the model parameters are 512MB and you need at least twice that to also store the gradient. If it doesn't work you could shrink the parameters down and train a lower quality model from scratch. However, it seems the GT430 supports CUDA compute capability 2.1 and the 750Ti supports 5.0. Support for compute capability 2.x was removed in CUDA 9, and later CUDA releases have kept dropping older architectures. If you're lucky you might be able to get them to still work by compiling PyTorch and Tensorflow with the CUDA version you need, but I wouldn't bet on it. I'd recommend using at least Python 3.7 and Tensorflow==2.3.0 since Spleeter requires that specific version. If someone has a newer GPU with at least 6 GB they'll have to download a Tensorflow 2.3.0 wheel with CUDA 11.1 because Tensorflow only supported 10.1 until version 2.4. A Tensorflow 2.3.0 + CUDA 11.1 wheel for Python 3.8 is available here: https://github.com/davidenunes/tensorflow-wheels Again this is only necessary if you have a newer GPU with at least 6 GB. Spleeter will run fine on the CPU. I use Python 3.7 and don't feel like compiling a Tensorflow 2.3.0 wheel for it so I just modified Spleeter's requirements.txt to support tensorflow==2.4.0 and numpy>1.16.0,numpy<=1.19.5 and installed it from source. Spleeter will still work and output the clean voice clips but crash after finishing. This error can be patched by commenting out the del function in spleeter/separator.py:135 since Tensorflow 2.4 closes the session automatically. I'm using PyTorch 1.8.0 with CUDA 11.1 since it supports the full capabilities of my GPU. To use either PyTorch or Tensorflow easily you'll need at least a 4th generation i3 which has AVX2 support. Otherwise you'll have to look for community pip wheels compiled without AVX/AVX2 for your specific version of Python. 
Older versions of PyTorch are compatible with most deep learning models but lack the newer torchaudio which is an amazing library for processing and training on audio that will certainly start seeing some use soon. Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight.
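The memory arithmetic in this post can be written down as a back-of-the-envelope calculator. This is only a lower bound under stated assumptions: it counts weights, gradients and optional optimizer state copies, and ignores activations entirely (which is exactly what gradient checkpointing trades compute to shrink). The function names are made up, and the two extra copies for an Adam-style optimizer are an assumption about the training setup, not something stated in the post.

```python
def min_training_memory_mb(param_mb, optimizer_state_copies=0):
    """Lower bound: weights + gradients + optimizer state, activations NOT included."""
    return param_mb * (2 + optimizer_state_copies)

def accumulation_steps(target_batch, micro_batch):
    """How many small batches to accumulate gradients over to emulate one large batch."""
    return -(-target_batch // micro_batch)  # ceiling division

weights_and_grads = min_training_memory_mb(512)                        # 1024 MB
with_adam_state = min_training_memory_mb(512, optimizer_state_copies=2)  # 2048 MB
steps = accumulation_steps(target_batch=32, micro_batch=4)             # 8
```

So weights plus gradients alone already fill a 1 GB card, and with two optimizer moment buffers a 2 GB card is full before a single activation is stored, which is why it would only "squeeze by" at best.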
>>9135 Great, thanks for the specific details. Hmmm, from what you're saying it sounds like I still won't be able to run it even with the better (by comparison) hardware. I'm sure someday we'll be able to do this stuff on much more modest hardware. I'll just be patient and focus on other things till then. :^) >Lastly, Clipchan needs to be updated to support the latest Spleeter version which completely changed everything again, but that'll be an easy fix. I'll patch it tonight. Don't rush on my account Anon. Maybe it will help others though.
Due to the 'rona my workspace is limited to a shitty laptop. What's the best (or rather, least worst) model one could conceivably run on a CPU?
>>9136 It works fine on better hardware. The problem is backwards compatibility is a foreign concept to Tensorflow so people end up locking their projects to old versions and creating a hellish nightmare of dependency conflicts. Also short clips don't take up too much memory. Only when processing 10 minute songs does it use up to 6 GB. To avoid this Clipchan processes each clip individually. And Clipchan has been updated now to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, besides making it simpler to use and ideally creating a GUI for it. The most important options are -dza which cleans audio with Spleeter, speeds up subtitle processing, and auto-crops the audio clips. For Tacotron2 -r 22050 -c 1 are also needed to resample and mix stereo to mono (they require the -a option to have any effect right now.) If you don't have FFmpeg with libzvbi, then omit the -z option. And some fresh Natsumi Moe voice clips from v0.3 ready for Tacotron2: https://files.catbox.moe/ipt13l.xz Still a work in progress but there's about 10 minutes of usable audio there. >>9149 Not sure, you won't get much of a speed up running FastPitch on CPU compared to Tacotron2. It's possible for fine-tuned models to be pruned and compressed down so they can run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compressing doesn't apply to training though, only works with inference.
>>9150 Thanks for the Natsumi Moe clips Anon! A cute. I hope someday we can manage a basic Chii library to generate voices from. Sounds like certain interests are specifically trying to keep their tools from working with older systems -- even their own systems haha. Doesn't sound like I (and thousands more like me) will ever be able to use this tool at that rate. Maybe if someone creates Docker or other kinds of containers tuned for different hardware setups, then we might be able to break free of this intentionally-created ratrace they intend us to run.
>>9121 >Cloning into 'fastpitch.github.io'... >fatal: repository 'https://fastpitch.github.io/' not found >Cloning into 'hifisinger'... >fatal: repository 'https://speechresearch.github.io/hifisinger/' not found Am I starting to get paranoid, or are they really onto us?
>>9162 Those are demo pages, not repositories. FastPitch: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch HiFiSinger (unofficial implementation): https://github.com/CODEJIN/HiFiSinger
>>9162 Those aren't git repositories Anon. Browse there and read the pages.
>>9163 Hmm, the only way I can actually get a clone to work is by going up in the tree a bit? git clone --recursive https://github.com/NVIDIA/DeepLearningExamples.git
>>9165 Git 2.25.0 includes a new experimental sparse-checkout command:
git clone --filter=blob:none --sparse https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
git sparse-checkout init --cone
git sparse-checkout set PyTorch/SpeechSynthesis/FastPitch
>>9150 Yeah, I'm not going to do any training, just inference from released checkpoints. I did manage to get a FastSpeech2 model running with some pretty good results, although for some reason it adds garbled echoes after the generated speech.
>>9179 Ahh, didn't know about that one, thanks Anon.
A novel voice converter that outperforms FastSpeech2 and generates speech faster. Although it doesn't do speech synthesis from text it introduced a convolution-augmented Transformer that could easily be adapted into FastSpeech2 and FastPitch to improve the quality of synthesized speech. https://kan-bayashi.github.io/NonARSeq2SeqVC/
>>10159 Quality sounds excellent. Thanks Anon.
>>9121 >>10383 >robowaifudev
Facebook made a great speech generator about a year ago: https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/ - It's not free software, but they described how it is built. Yannic Kilcher goes through the system and explains it here: https://www.youtube.com/watch?v=XvDzZwoQFcU One interesting feature is that it runs on a 4-core CPU (not the training, of course). On such a CPU it is faster than real-time, which means generating the audio takes less time than playing it back. Something like this might be very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu.
>>10393 >Something like this might be something very useful to us, if we could run our own knock-off on a small SBC inside the robowaifu. It certainly would, if we can somehow obtain access to it or reproduce it, Anon. Thanks for the heads-up, and for the video link. It really helps to get the points across well for us. youtube-dl --write-description --write-auto-sub --sub-lang="en" https://www.youtube.com/watch?v=XvDzZwoQFcU
not sure if this has been posted before, but I came across this and immediately thought of some of the todo list for clipchan. https://speechbrain.github.io/index.html seems like there was some discussion about emotion and speaker ID classifiers.
>>10458 Very cool Anon, thanks. It looks like it's a solid and open source system too, AFAICT.
The model link is dead, while I can train a new model I am looking to avoid that step right now because of other deadlines, though I would love to include 2B in WaifuEngine, would anyone be willing to mirror or provide an updated link? Thanks
>>10499 ATTENTION ROBOWAIFUDEV I'm pretty sure the model in question is your pre-trained one for 2B's WaifuSynth voice, ie, https://anonfiles.com/Hbe661i3p0/2b_v1_pt >via https://gitlab.com/robowaifudev/waifusynth cf. (>>10498, >>10502)
>>10504 To clarify, the pretrained model links are both dead; the repo is still up.
Great! Now my waifu can sing a lullaby for me to sleep well. The only problem is that I don't have the Vocaloid editor. Video demonstration: https://youtu.be/mxqcCDOzUpk Github: https://github.com/vanstorm9/AI-Vocaloid-Kit-V2
>>5521 >Cute robowaifu Check >Inspiring message to all weebineers everywhere Check >Epic music Check Best propaganda campaign. 10/10, would build robowaifu. >>5529 >>5530 Damn it lads! You're bringing me closer to starting sampling Lime's VA heheheh (Although I was hoping to use my voice to generate a somewhat convincing robowaifu, so as to minimise reliance on females).
>>11229 Forgot to add. >>5532 >I don't know if that'll be enough. Chii didn't really talk much. You're overcomplicating it. I think he meant create a tts that outputs "Chii" regardless of what you put in ;) (Although you could add different tonality and accents, might be a more fun challenge).
>>10504 Sorry, been busy and haven't been active here lately. Updated the repo link: https://www.mediafire.com/file/vjz09k062m02qpi/2b_v1.pt/file This model could be improved by training it without the pain sound effects. There's so many of them it biased the model which causes strange results sometimes when sentences start with A or H.
>>11474 Thanks! Wonderful to see you, I hope all your endeavors are going well Anon.
>>11474 come join my doxcord server if you have time and pm me! thanks for the model, you will likely see it used on the 2B "cosplay" waifu, we may have in the game
>>11480 The link is expired. What would you like to talk about? I don't have a lot to add. You can do some pretty interesting stuff with voice synthesis by adding other embeddings to the input embedding, such as for the character in a multi-character model, emphasis, emotion, pitch, speed, and ambiance (to utilize training samples with background noise.) This is what Replica Studios has been doing: https://replicastudios.com/
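To make "adding other embeddings to the input embedding" concrete, here is a tiny framework-free sketch. Each conditioning vector (speaker, emotion, and so on) is simply summed onto every token embedding. The function name and the toy numbers are made up, and real systems may combine conditions differently (concatenation, for instance); nothing here is known to be how Replica Studios does it.

```python
def condition_input(token_embeddings, *condition_vectors):
    """Add each conditioning vector to every token embedding (element-wise sum)."""
    dim = len(token_embeddings[0])
    for vec in condition_vectors:
        if len(vec) != dim:
            raise ValueError("conditioning vectors must match the token embedding size")
    return [
        [tok[d] + sum(vec[d] for vec in condition_vectors) for d in range(dim)]
        for tok in token_embeddings
    ]

tokens = [[1.0, 0.0], [0.0, 1.0]]   # two input symbols, 2-dim toy embeddings
speaker = [0.5, 0.5]                # hypothetical "which character" vector
emotion = [0.0, -1.0]               # hypothetical "emotion" vector
conditioned = condition_input(tokens, speaker, emotion)
```

Because the condition is just another learned vector, the same trick extends to pitch, speed or an "ambiance" flag for noisy training clips.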
>>11522 If you are interested, I am looking for someone to take over the speech synthesis part of WaifuEngine, I got it working however, to work on it as a specialty takes me away from the rest of the application, like I want to train a new model using glowtts but my time is limited. I also have to work on the various other aspects of the project, to get it off the ground. Right now our inference time using tacotron2 isn't great unless you have a GPU. As for compensation on the project, so far I have been giving away coffee money as we have little resources haha, if the project gets bigger and more funding, I'd be willing to help the project contributors out. https:// discord.gg/ gBKGNJrev4
>>11536 In August I'll have some time to work on TTS stuff and do some R&D. I recommend using FastPitch. It's just as good as Tacotron2 but 15x faster on the GPU and 2x faster on the CPU than Tacotron2 is on the GPU. It takes about a week to train on a toaster card and also already has stuff for detecting and changing the pitch and speed, which is essential to control for producing more expressive voices with extra input embeddings. https://fastpitch.github.io/
>>11550 >related (>>9165, >>9179)
>>11550 I'd message you on discord about this, but this could be useful info for the board. Essentially I did use FastPitch originally; the issue is the teacher-student training methodology: you have to use Tacotron to bootstrap and predict durations for alignment. When you don't do that and just fine-tune the LJS FastPitch model, it fails to predict the durations. We can definitely try this method, I am open to it; I guess in my time crunch I didn't bother. I am optimizing for delivery so that we have a product people can use and enjoy. It should be very simple to update the models in the future; with my architecture it would be a change to one Python script.
>>11559 The 2B model I made was finetuned on the pretrained Tacotron2 model and only took about an hour. Automating preprocessing the training data won't be a big deal. And if a multi-speaker model is built for many different characters it would get faster and faster to finetune. I've been looking into Glow-TTS more and the automated duration and pitch prediction is a nice feature but the output quality seems even less expressive than Tacotron2. A key part of creating a cute female voice is having a large range in pitch variation. Also I've found a pretrained Tacotron2 model that uses IPA. It would be possible to train it on Japanese voices and make them talk in English, although it would take some extra time to adapt FastPitch to use IPA. Demo: https://stefantaubert.github.io/tacotron2/ GitHub: https://github.com/stefantaubert/tacotron2
Some other ideas I'd like to R&D for voice synthesis in the future: - anti-aliasing ReLUs or replacing them with swish - adding gated linear units - replacing the convolution layers with deeper residual layers - trying a 2-layer LSTM in Tacotron2 - adding ReZero to the FastPitch transformers so they can be deeper and train faster - training with different hyperparameters to improve the quality - using RL and human feedback to improve the quality - using GANs to refine output like HiFiSinger - outputting at a higher resolution and downsampling
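For readers who haven't met three of the building blocks in that list, here are scalar toy versions. Real layers apply these per element of a tensor and the ReZero weight is a learned parameter; the function names are just for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish activation: x * sigmoid(x), a smooth alternative to ReLU."""
    return x * sigmoid(x)

def glu(a, b):
    """Gated linear unit: one half of the features (a) gated by sigmoid of the other (b)."""
    return a * sigmoid(b)

def rezero_block(x, sublayer, alpha=0.0):
    """ReZero residual: x + alpha * sublayer(x), with alpha starting at zero."""
    return x + alpha * sublayer(x)

at_init = rezero_block(3.0, lambda v: v * v)               # block is the identity at init
after_training = rezero_block(3.0, lambda v: v * v, 0.5)   # sublayer now contributes
```

With alpha at zero every block starts as the identity, which is the property that lets deeper stacks train without warm-up tricks.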
>>11569 Thanks, but what's the point of this IPA? To let it talk correctly in other languages? >Der Nordwind und die Sonne - German with American English accent I can assure you: It doesn't work. Americans talking German often (always) sounds bad, but this is a level of its own. Absolutely bizarre.
>>11571 Yeah, I live around Chinese with thick accents and this takes it to the next level, kek. That's not really the motivation for using IPA though. This pilot study used transfer learning to intentionally create different accents, rather than copy the voice without the accent. How IPA is useful to generating waifu voices is it helps improve pronunciation, reduce needed training data, and solves the problem with heteronyms, words spelled the same but pronounced differently: https://jakubmarian.com/english-words-spelled-the-same-but-pronounced-differently/ When models without IPA have never seen a rare word in training, such as a technical word like synthesis, they will usually guess incorrectly how to pronounce it, but with IPA the pronunciation is always the same and it can speak the word fluently without ever having seen it before. Also in a multi-speaker model you can blend between speaker embeddings to create a new voice and it's possible to find interpretable directions in latent space. Finding one for accents should be possible, which could be left in control to the user's preferences to make a character voice sound more American, British or Japanese and so on.
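Blending speaker embeddings and nudging along a latent direction both come down to simple vector arithmetic. A sketch with plain lists follows; the embeddings and the "accent" direction are invented numbers, and finding a direction that actually means something is the hard part this does not cover.

```python
def blend_speakers(a, b, weight):
    """Linear interpolation: weight=0 returns speaker a, weight=1 returns speaker b."""
    if len(a) != len(b):
        raise ValueError("speaker embeddings must have the same size")
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]

def shift_along(embedding, direction, amount):
    """Move an embedding along a direction in latent space, e.g. an accent axis."""
    return [e + amount * d for e, d in zip(embedding, direction)]

speaker_a = [0.0, 2.0]          # toy 2-dim embeddings
speaker_b = [4.0, 0.0]
new_voice = blend_speakers(speaker_a, speaker_b, 0.25)   # mostly speaker_a

accent_axis = [0.0, 1.0]        # hypothetical interpretable direction
more_accent = shift_along([1.0, 1.0], accent_axis, 2.0)
```

The `amount` knob is what could be exposed to the user as a slider.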
>>11577 Ah, okay, this sounds pretty useful. One more problem comes to mind in regards to this. In English foreign names are often changed in pronunciation, because the name would sound "strange" otherwise. The philosopher Kant would sound like the c-word for female private parts. Therefore they pronounce it Kaant. I wonder if the method helps with that as well.
>>11582 In that case it depends what language you transliterate with. If necessary names could be transliterated as they're supposed to be pronounced in their original language, or it could all be in the same language. Exceptions could also be defined. For example, the way Americans pronounce manga is quite different from the Japanese. If someone wants their waifu to sound more like a weeb and pronounce it the Japanese way, they could enter the Japanese IPA definition for it to override the default transliteration.
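The override idea can be sketched as a default lexicon plus a user dictionary that wins whenever both define a word. The lexicon entries below are rough illustrative transcriptions, not taken from any real pronunciation dictionary, and the function name is made up.

```python
# Illustrative default entries only; a real lexicon would be far larger.
DEFAULT_LEXICON = {"manga": "m æ ŋ ɡ ə"}

def transliterate(text, default_lexicon, overrides=None):
    """Look each word up in the user overrides first, then the default lexicon."""
    overrides = overrides or {}
    out = []
    for word in text.lower().split():
        key = word.strip(".,!?")
        if key in overrides:
            out.append(overrides[key])
        elif key in default_lexicon:
            out.append(default_lexicon[key])
        else:
            out.append(key)  # unknown word: pass through for a later G2P step
    return out

default_reading = transliterate("I read manga.", DEFAULT_LEXICON)
weeb_reading = transliterate("I read manga.", DEFAULT_LEXICON, {"manga": "m a ŋ ɡ a"})
```

The same override table would also be a natural place to pin down names like Kant, one entry per exception.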
Finished creating a tool for automatically downloading subtitles and audio clips from Youtube videos, which can be reworked in Aegisub or another subtitle editor, then converted into a training set with Clipchan. https://gitlab.com/robowaifudev/alisub
>>11623 This sounds exciting Anon, thanks! >or another subtitle editor Can you recommend a good alternative Anon? I've never been able to successfully get Aegisub to run.
>>11624 Someone recommended SubtitleEdit but it's Windows only: https://nikse.dk/SubtitleEdit Subtitle Editor can display waveforms but it's far more difficult to use and I don't recommend it.
>>11623 Okay, thanks. This could be useful for more, I guess. Maybe later to train the system on lip reading using YouTube, for example. Or maybe for training voice recognition in the first place? How much data do we need to emulate a particular voice?
>>11625 OK, thanks for the advice. I'll try and see if I can set it up on a virtual box instead or something, Aegisub did look pretty easy to use (first time I've seen it in action, so thanks again). The problem is always a wxWidgets dependency hell issue. I can even get it to build, right up to link time.
>>11631 Finetuning a pretrained model you need about 20 minutes. Training a model from scratch takes about 12 hours. Multispeaker models trained on hundreds of voices can clone a voice with a few sentences but still need a lot of samples to capture all the nuances.
Been doing some work to get WaifuEngine's speech synthesis to run fast on the CPU and found that FastPitch has a real-time factor of 40x and WaveGlow 0.4x. This led me to testing several different vocoder alternatives to Waveglow and arriving at multi-band MelGAN with an RTF of 20x. So FastPitch+MelGAN has an RTF of 12x, which means it can synthesize 12 seconds of speech every second or 80ms to generate a second of speech. "Advancing robotics to a point where anime catgirl meidos in tiny miniskirts are a reality" took MelGAN 250ms on CPU to generate from 2B's Tacotron2 Mel spectrogram. Now I just gotta set up this shit so it's easy to train end-to-end and the whole internet and their waifus are getting real-time waifus. Multi-band MelGAN repo: https://github.com/rishikksh20/melgan Multi-band MelGAN paper: https://arxiv.org/abs/2005.05106 Original MelGAN paper: https://arxiv.org/abs/1910.06711
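How the per-stage real-time factors combine, as a quick sanity check. This assumes the two stages simply run one after the other with no extra overhead, so it gives an idealized upper bound; the measured 12x in the post is a little lower than the roughly 13.3x this yields, which is about what you'd expect once overhead is included (my guess, not something the post states).

```python
def combined_rtf(*stage_rtfs):
    """RTF of stages run back to back: their per-second costs add up."""
    return 1.0 / sum(1.0 / r for r in stage_rtfs)

def ms_per_audio_second(rtf):
    """Milliseconds of compute needed to produce one second of speech."""
    return 1000.0 / rtf

fastpitch_melgan = combined_rtf(40, 20)       # idealized: about 13.3x
fastpitch_waveglow = combined_rtf(40, 0.4)    # about 0.4x, slower than real-time
latency_ms = ms_per_audio_second(12)          # about 83 ms at the measured 12x
```

The slowest stage dominates: with WaveGlow at 0.4x the pipeline can never beat 0.4x no matter how fast FastPitch is, which is why swapping the vocoder mattered more than anything else.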
>>11636 Interesting, thanks, but I meant how many samples we need to fine-tune a voice. I also wonder if voices are being 'blended' that way. Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. >>11647 Thanks for your work. I thought voice generation would take much more time to do. Good to know. Responses to someone talking should be fast.
>>11648 I meant 20 minutes and 12 hours of samples. Finetuning with 20 minutes of samples takes about 1-2 hours on my budget GPU. >Maybe our waifus shouldn't sound too much like some specific proprietary character or real actress. This definitely deserves more thought. If every person on the internet will be able to do speech synthesis and there is a tsunami of voice cloning characters, it's important people are able to have creative freedom with it while the buzz is on. People's curiosity will further advance speech synthesis and diffuse into other areas of AI, including waifu tech. On the other hand if people only straight up copy voices then it would cause a media shitstorm and possibly turn people away, but that could also have its benefits. Whatever happens though the accelerator is stuck to the floor. In the meantime while the hype builds, iteration can continue on until the synthesis of Gawr Kilcher is realized. When people look closely though they'll notice it's neither Yannic nor Gura but actually Rimuru and Stunk all along.
>>11647 Thanks for the information, Anon.
>>11650 kek. i just noticed that logo. i wonder what based-boomer AJ would think of robowaifus. white race genocide, or crushing blow to feminazis and freedom to all men from oppression?
>>11677 He doesn't like them or AI in general. Said something once like people are going to stop having kids and masturbate with a piece of plastic all day and how the government is going to know everything about people through them and be able to manipulate them perfectly. He's not really wrong. Look how many people already give up all their data using Windows and Chrome.
>>8151 >>12193 A routine check on the Insights->Traffic page led me here. While the program itself is written with Qt, what actually makes the voices work (Voice.h and beyond) does not contain a single trace of Qt (well, almost, but what little there is is just error boxes). This is a deliberate design decision to allow the actual inference engine to be copied and ported anywhere with minimal trouble. For inference on embedded devices you probably want to use TFLite, which is on my list because I plan on Windows SAPI integration.
>>12257 Hello Anon, welcome. We're glad you're here. Thanks for any technical explanations, we have a number of engineers here. Please have a look around the board while you're here. If you have any questions, feel free to make a post on our current /meta thread (>>8492). If you decide you'd like to introduce yourself more fully, then we have an embassy thread for just that (>>2823). Regardless, thanks for stopping by!
