It works fine on better hardware. The problem is that backwards compatibility is a foreign concept to TensorFlow, so people end up pinning their projects to old versions and creating a hellish nightmare of dependency conflicts. Also, short clips don't take much memory; only when processing 10-minute songs does it use up to 6 GB. To avoid that, Clipchan processes each clip individually.
And Clipchan has now been updated to v0.3. I had to fix it anyway to help someone get voice clips. It's essentially finished, aside from making it simpler to use and, ideally, giving it a GUI.
The most important options are -dza, which together clean the audio with Spleeter, speed up subtitle processing, and auto-crop the audio clips. For Tacotron2, -r 22050 -c 1 are also needed to resample and downmix stereo to mono (right now those two only take effect when -a is also given). If you don't have FFmpeg built with libzvbi, omit the -z option.
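If you'd rather do the Tacotron2 resample/downmix step by hand, it's roughly equivalent to a plain FFmpeg call. A minimal sketch (the filenames are placeholders, and I'm assuming a stock ffmpeg on your PATH; -ar sets the sample rate, -ac the channel count):

```python
import subprocess

# Roughly what Clipchan's -r 22050 -c 1 does to each clip:
# resample to 22050 Hz and downmix stereo to mono. Paths are placeholders.
subprocess.run(
    ["ffmpeg", "-y", "-i", "clip.wav", "-ar", "22050", "-ac", "1", "clip_mono.wav"],
    check=True,
)
```

Handy for sanity-checking one clip before running the whole batch through Clipchan.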
And some fresh Natsumi Moe voice clips from v0.3 ready for Tacotron2: https://files.catbox.moe/ipt13l.xz
Still a work in progress, but there's about 10 minutes of usable audio there.
Not sure; you won't get much of a speedup running FastPitch on a CPU compared to Tacotron2. Fine-tuned models can be pruned and compressed down to run on mobile devices, but I'm not aware of anyone who has taken the time to do that. Pruning and compression only help with inference, though, not training.
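For anyone curious what pruning actually amounts to: here's a toy sketch of magnitude pruning in plain NumPy. This is not Clipchan or FastPitch code; the layer size and 50% ratio are made up for illustration, and a real deployment would use a framework's pruning utilities on the fine-tuned model.

```python
import numpy as np

# Magnitude pruning: after training, zero out the smallest-magnitude weights.
# The zeroed weights can then be stored sparsely, shrinking the model and
# speeding up inference. None of this helps during training, since the
# weights are still changing then.
rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256))          # stand-in for a trained layer
threshold = np.quantile(np.abs(weights), 0.5)  # cutoff at the median magnitude
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)
sparsity = float(np.mean(pruned == 0))         # roughly half the weights are now zero
```

In practice you'd also fine-tune briefly after pruning to recover any lost quality, which is part of why nobody has bothered doing it for these TTS models yet.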