Text to Speech

TTS logo
Text-to-speech (TTS) allows a computer system to speak any text using either a synthetic or natural voice.

The text can come from documents, a website, text messages or an embedded software system (where text cannot be displayed on a screen). The Language Technologies Portal’s Welsh text-to-speech resources are based on open source systems.

Logo of Canopy AI, the developers of Orpheus

Orpheus-TTS

Canopy AI’s Orpheus-TTS is one of the most advanced open-source text-to-speech systems, originally built on top of a Llama-3b language model. It produces natural-sounding voices with lifelike emotion, nuanced intonation, and rhythm that rival, and in some cases surpass, closed-source services. It can generate new voices without lengthy training and supports expressive control through simple emotion tags, which makes it well suited to real-time applications. Our fine-tuned version of Orpheus-TTS offers:

  • Authentic Welsh voices: Trained on high-quality Welsh data, including natural speech and native intonation patterns.
  • Accuracy and consistency: Carefully tuned to handle the unique sound structure and rhythm of Welsh speech.
  • Welsh-specific emotion tags: Add cues like <chwerthin> (laughter) or <anadlu> (breathing) for expressive, culturally faithful delivery.
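
As a rough illustration of how emotion tags sit inside the input text, the sketch below scans a prompt for angle-bracket tags and reports any that are not recognised. It only knows the two tags named above. The helper names are hypothetical, and the complete tag inventory of the fine-tuned model is not assumed here.

```python
import re

# Only these two tags are named in the text above. Treat the set as
# illustrative, not as the model's full tag inventory.
KNOWN_TAGS = {"chwerthin", "anadlu"}

# Matches <tag> where the tag name has no spaces or nested brackets.
TAG_PATTERN = re.compile(r"<([^<>\s]+)>")


def find_emotion_tags(text: str) -> list[str]:
    """Return the tag names found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list[str]:
    """Return tag names in the text that are not in KNOWN_TAGS."""
    return [tag for tag in find_emotion_tags(text) if tag not in KNOWN_TAGS]


prompt = "Bore da! <chwerthin> Sut wyt ti heddiw?"
print(find_emotion_tags(prompt))
# <gwenu> is a made-up tag, used here only to show the check failing.
print(unknown_tags("Helo <gwenu> <anadlu>"))
```

A check like this can catch mistyped tags before the text is sent for synthesis, since an unrecognised tag may otherwise be read aloud or ignored.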
Piper logo

Piper

We have fine-tuned Piper for use with the Welsh language. For the first time, developers, educators, and creators can access natural, authentic Welsh voices that run in real time, even on small devices such as the Raspberry Pi.

What is Piper?

Piper is a modern, open-source TTS engine designed for speed and quality. It turns text into natural-sounding speech with minimal latency and runs on a wide range of hardware, from laptops to embedded systems. It is lightweight, simple to integrate, and capable of generating speech in real time, which makes it well suited to assistive technology, smart devices, and interactive apps.
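
To give a sense of how little code integration can take, here is a minimal sketch that drives the Piper command-line tool from Python. It assumes the `piper` executable is installed and on the PATH, and that a Welsh voice model has already been downloaded. The model file name in the example call is a placeholder, not the name of a published voice.

```python
import subprocess
from pathlib import Path


def piper_command(model_path: str, output_wav: str) -> list[str]:
    """Build the argument list for the Piper command-line tool."""
    return ["piper", "--model", model_path, "--output_file", output_wav]


def synthesize_to_wav(text: str, model_path: str, output_wav: str) -> Path:
    """Pipe text to Piper on stdin and return the path of the WAV file."""
    subprocess.run(
        piper_command(model_path, output_wav),
        input=text.encode("utf-8"),
        check=True,  # raise CalledProcessError if Piper exits with an error
    )
    return Path(output_wav)


# Example call (needs Piper installed and a voice model on disk;
# "welsh_voice.onnx" is a placeholder file name):
# synthesize_to_wav("Bore da, sut wyt ti?", "welsh_voice.onnx", "bore_da.wav")
```

Calling the executable through `subprocess` keeps the example independent of any particular Python binding, at the cost of starting a new process for each utterance.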

Examples of Piper in use

TTS Voices via REST API (Cloud-Based)

Workflow:

  1. The client application (mobile app, web app, or server) sends an HTTP POST request to a TTS REST API with:
    • text → the input text to convert.
    • voice → the chosen voice model (e.g., “piper/benyw-de”, “orpheus/gwryw-gogledd”).
  2. Cloud TTS Engine processes the request.
    • The provider uses a neural voice model trained on large datasets.
    • Generates audio on the fly.
  3. Response is sent back with an audio file/stream.
    • Client can play, download, or store it.
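
The workflow above can be sketched with Python's standard library. The endpoint URL is hypothetical, and sending the two fields as a JSON body is an assumption. The text above only states that the POST request carries `text` and `voice`, so check the API's own documentation for the real address and request format.

```python
import json
import urllib.request

# Hypothetical endpoint, used for illustration only.
TTS_ENDPOINT = "https://tts.example.org/api/v1/speak"


def build_tts_request(
    text: str, voice: str, endpoint: str = TTS_ENDPOINT
) -> urllib.request.Request:
    """Step 1: build the POST request carrying the text and the voice."""
    # Assumption: the API accepts a JSON body with "text" and "voice" keys.
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice: str, out_path: str, timeout: float = 30.0) -> int:
    """Steps 2 and 3: send the request, save the returned audio, return its size."""
    request = build_tts_request(text, voice)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)


# Example call (requires network access and a real endpoint):
# synthesize("Bore da", "piper/benyw-de", "bore_da.wav")
```

Building the request separately from sending it makes the payload easy to inspect or test without a network connection.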

Advantages:

  • Access to high-quality neural voices.
  • Variety of languages and accents.
  • No need for local computation (lightweight client).

Disadvantages:

  • Requires internet connectivity.
  • Latency depends on network speed.


TTS Voices Offline (On-Device)

Offline TTS means the model runs locally without calling a remote API.

Approaches:

    1. Embedded TTS Engines (traditional, smaller footprint):
      • e.g., espeak, Festival, MaryTTS.
      • Lightweight, but can be robotic-sounding.
    2. Neural TTS Models on Device:
      • Deploying pre-trained deep learning TTS models locally.
      • Examples: Piper (light enough for small devices such as the Raspberry Pi), Coqui TTS, Orpheus.
      • Can run on CPU/GPU, depending on device power.

Advantages:

      • Works without connectivity.
      • Zero recurring API cost.
      • Stronger privacy (text and audio never leave the device).

Disadvantages:

      • Voice quality may be lower unless you bundle large neural models.
      • Storage footprint can be significant (hundreds of MB).
      • Requires more CPU/GPU resources.

We have several tools that use this approach. See the following repositories for further information:

techiaith/llais_festival

techiaith/Festival_MSAPI

techiaith/docker-marytts

techiaith/piper-cy

techiaith/piper-ios-app