TTS voices that sound nice?

Pantherina@feddit.de · 1 year ago

TTS voices that sound nice?

Ferk@lemmy.ml · 1 year ago

espeak’s default backend synthesizes speech without using real voice samples, so it doesn’t require downloading a huge package for each language. That is convenient in some cases, but the output sounds extremely robotic.

You can use MBROLA as a backend for espeak so that it uses recorded voice samples, and the result should be less jarring (it’d still be easy to tell it’s not a natural voice, but at least you’d be able to understand it better). There’s a tutorial on this here: https://github.com/espeak-ng/espeak-ng/blob/master/docs/mbrola.md
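
For anyone who wants to script this, here is a minimal Python sketch, assuming espeak-ng and at least one MBROLA voice are already installed as described in the tutorial. The voice name `mb-en1` and the helper names are illustrative assumptions, so swap in whatever voice you actually have.

```python
import shutil
import subprocess


def espeak_command(text, voice="mb-en1", wav_path=None):
    """Build an espeak-ng command line.

    'mb-en1' is an assumed MBROLA voice name; use one installed on your system.
    """
    cmd = ["espeak-ng", "-v", voice]
    if wav_path is not None:
        cmd += ["-w", wav_path]  # write a WAV file instead of playing audio
    cmd.append(text)
    return cmd


def speak(text, voice="mb-en1"):
    """Run espeak-ng if it is on PATH; return False when it is missing."""
    if shutil.which("espeak-ng") is None:
        return False
    subprocess.run(espeak_command(text, voice), check=True)
    return True
```

Calling `speak("Hello from MBROLA")` should play the sentence with the sample-based voice, and passing `wav_path` to `espeak_command` gives you a command that writes a file instead.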

Or you can try piper (https://github.com/rhasspy/piper); it’s one of the most natural-sounding TTS engines (here are some samples).
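
A rough sketch of driving piper from Python, assuming the `piper` binary is on your PATH and you have already downloaded a voice model. The model filename below is only an example, and the helper names are made up for illustration. Piper reads the text on standard input and writes a WAV file.

```python
import shutil
import subprocess


def piper_command(model_path, wav_path):
    """Build a piper command line; the text itself is passed on stdin."""
    return ["piper", "--model", model_path, "--output_file", wav_path]


def synthesize(text, model_path, wav_path):
    """Run piper if it is installed; return False when it is missing."""
    if shutil.which("piper") is None:
        return False
    subprocess.run(
        piper_command(model_path, wav_path),
        input=text.encode("utf-8"),  # piper reads the text to speak from stdin
        check=True,
    )
    return True
```

Something like `synthesize("Welcome to the world of speech synthesis!", "en_US-lessac-medium.onnx", "welcome.wav")` should then leave a WAV file you can play with any audio player.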

Pantherina@feddit.de · 1 year ago

Piper is pretty good, thank you!

woelkchen@lemmy.world · 1 year ago

In addition to the very insightful reply by Ferk: Sadly, most TTS development seems to be happening in online services these days. Google Neural TTS and Microsoft Azure TTS sound really great but require an online connection, an account, and possibly even paying (there’s a threshold until which it’s free, then it costs almost nothing, but almost nothing isn’t free).

Btw, I don’t know about the blind people you know, but the ones I know use such insanely fast TTS output that the “sounds nice” aspect isn’t really there in the first place. At least not to me.

Starfighter@discuss.tchncs.de · 1 year ago

The development of Piper is being driven by the Home Assistant Project. That probably makes it one of the larger OSS TTS projects. Hope may not be lost yet ;)

woelkchen@lemmy.world · 1 year ago

Hope may not be lost yet ;)

And then we’ll live in a TikTok TTS hellscape. 🤣

Pantherina@feddit.de · 1 year ago

Okay that is really interesting, so a TTS engine should be optimized to run very fast.

interdimensionalmeme@lemmy.ml · 1 year ago

Here is one of the best

https://github.com/neonbjb/tortoise-tts

ffhein@lemmy.world · 1 year ago

A few days ago I wrote down a couple of links to interesting TTS projects that I was going to look into whenever I have time, along with some brief notes.

https://github.com/coqui-ai/TTS TTS + XTTS, GPU inference? 3GB model.

https://github.com/rhasspy/piper Low resource, CPU inference. 50MB model.

https://github.com/p0p4k/vits2_pytorch GPU inference? 500MB model.

https://github.com/p0p4k/vits2_pytorch/discussions/27 Someone’s models for vits2.

mesamune@lemmy.world · 1 year ago

I’ve recently been made aware of the bark project, and especially the bark-ui project. It takes a long time to run, but it does work. Sometimes it makes cursed stuff too: https://social.rootaccess.org/@michaelc/111277260439738652

Pantherina@feddit.de · 1 year ago

Wow, that is crazy! 10 seconds sounds like an unnecessary flex though, wouldn’t something like 30 min / all sounds be best?

mesamune@lemmy.world · 1 year ago

I have a tiny laptop with the literal bare minimum to get this running haha. You’re probably right, but the models explode your memory pretty quickly.

I did get some really good audio out of this model after a while. I threw the first chapter of The Hobbit at it and it seemed to be doing OK. It’s better than espeak, and you only need to do it once to get audiobooks out.

Pantherina@feddit.de · 1 year ago

I have to check that! But wait, is that Windows only?

mesamune@lemmy.world · 1 year ago

I got it working on Ubuntu/PopOS.

iopq@lemmy.world · 1 year ago

https://github.com/mozilla/DeepSpeech the best quality available that I’ve seen

schnurrito@discuss.tchncs.de · 1 year ago

That says it is speech to text, not text to speech

grue@lemmy.world · 1 year ago

He probably meant https://commonvoice.mozilla.org

Pantherina@feddit.de · 1 year ago

Omg, is that the data from the Common Voice project? Nice!

iopq@lemmy.world · 1 year ago

So it is. That’s why some languages don’t have good support yet: not enough recordings.

Pantherina@feddit.de · 1 year ago

Also the last release is very old. Mozilla is weird, their pinned projects are often dead or outdated…

grumpyrico@lemmy.world · 1 year ago

Isn’t speech to text the opposite of tts?

TunaCowboy@lemmy.world · 1 year ago

Not sure if this is gonna be much better than the alternatives you’ve listed, but you can try adjusting pitch, rate, range, etc. with spd-say.
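
If you want to experiment with those knobs from a script, here is a small sketch, assuming Speech Dispatcher’s `spd-say` client is installed. As far as I know, rate, pitch and pitch range each take values from -100 to +100, and older builds may not have `--pitch-range`. The helper names are hypothetical.

```python
import shutil
import subprocess


def clamp(value, low=-100, high=100):
    """Keep a setting inside the -100..+100 range spd-say accepts."""
    return max(low, min(high, int(value)))


def spd_say_command(text, rate=None, pitch=None, pitch_range=None):
    """Build an spd-say command line, adding only the options that are set."""
    cmd = ["spd-say"]
    options = (("--rate", rate), ("--pitch", pitch), ("--pitch-range", pitch_range))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(clamp(value))]
    cmd.append(text)
    return cmd


def say(text, **settings):
    """Run spd-say if it is on PATH; return False when it is missing."""
    if shutil.which("spd-say") is None:
        return False
    subprocess.run(spd_say_command(text, **settings), check=True)
    return True
```

For example, `say("Testing a slower, slightly higher voice", rate=-20, pitch=10)` is a quick way to compare settings by ear.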

draeath@lemmy.sdf.org · 1 year ago

If you don’t mind doing some development work, needing online connectivity, and paying for usage, AWS’s Polly has some very good sounding TTS voices: https://aws.amazon.com/polly/
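
The development work can be fairly small. Below is a hedged sketch using boto3, the third-party AWS SDK for Python, assuming it is installed and AWS credentials are already configured. The voice `Joanna`, the MP3 output format and the helper names are just example choices, and every call is billed according to Polly’s pricing.

```python
def polly_request(text, voice_id="Joanna", output_format="mp3"):
    """Build the keyword arguments for Polly's synthesize_speech call."""
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize_to_file(text, path, voice_id="Joanna"):
    """Send the text to Polly and save the returned audio stream to a file."""
    import boto3  # third-party; imported here so the helper above works without it

    client = boto3.client("polly")
    response = client.synthesize_speech(**polly_request(text, voice_id))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling `synthesize_to_file("Hello from Polly", "hello.mp3")` needs network access, since the text is sent to AWS.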

Pantherina@feddit.de · 1 year ago

Hm, as a backup that could be okay, but it doesn’t work offline or without being a huge privacy problem…

frostycakes@beehaw.org · 1 year ago

Just please don’t use one of the kid voices for technical videos, a la babywogue and their GNOME development videos.

hungover_pilot@lemmy.world · 1 year ago

Samsung’s TTS engine for Android is the best I have found. I use it to listen to EPUB books.

demesisx@infosec.pub · 1 year ago

Not FOSS.

Pantherina@feddit.de · 1 year ago

Samsung sucks though… also software-wise. Like, I debloated many phones and Samsung is crazy.

https://github.com/trytomakeyouprivate/Android-Tipps/tree/main/debloat

So their TTS will probably work neither offline nor standalone.