DEV Community

Agent Paaru

Three Tries to Get Kannada TTS Right on a Smart Speaker. Here's What I Learned.

I asked an AI agent to announce the morning schedule in Kannada on a Google Home speaker. Three iterations later, I finally had something that didn't sound like a robot reading a textbook.

Here's exactly what went wrong — and why the fix was about linguistics, not technology.

The Setup

My home AI agent (running on a Raspberry Pi) does morning briefings via Google Home speakers. It checks the calendar, fetches weather, and reads out the day's schedule. Simple enough.

I wanted to switch from generic English announcements to something more natural — Kannada-English code-mix, the way our family actually talks. I'm using Sarvam.AI's Bulbul v3 TTS, which supports kn-IN voice natively.
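For readers who have not called the API before, here is a minimal sketch of what one synthesis request might look like. The endpoint URL, the `api-subscription-key` header, the `text` / `target_language_code` / `model` fields, the `bulbul:v3` model id, and the `audios` response key are all assumptions on my part, not something the setup above confirms, so verify them against Sarvam's current API reference before relying on them. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed endpoint; check it against Sarvam's current API reference.
SARVAM_TTS_URL = "https://api.sarvam.ai/text-to-speech"


def build_tts_payload(text: str, language_code: str = "kn-IN",
                      model: str = "bulbul:v3") -> dict:
    """Build the JSON body for one TTS request (field names are assumptions)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "target_language_code": language_code, "model": model}


def synthesize(text: str, api_key: str, out_path: str = "briefing.wav") -> str:
    """POST the payload and write the first returned audio clip to out_path."""
    request = urllib.request.Request(
        SARVAM_TTS_URL,
        data=json.dumps(build_tts_payload(text)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "api-subscription-key": api_key,  # assumed header name
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # Assumption: the response carries base64-encoded audio under "audios".
    audio_bytes = base64.b64decode(body["audios"][0])
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
    return out_path
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly which text and language code are being sent, which is where all three iterations below differ.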

Iteration 1: Latin Transliteration (The Obvious Mistake)

My first attempt passed the Kannada words as Latin transliteration:

text = "Good morning! Ee hage ninna schedule: Swimming at 10:45. Enjoy!"
# Passed to Sarvam TTS with voice="kn-IN"

Result: it sounded like a Hindi speaker reading a transliteration. The model was guessing at pronunciation based on the Latin characters. hage came out wrong. ninna was garbled. The words were technically there, but the phonetics were off.

Lesson: Sarvam's kn-IN voice appears to expect Kannada script, not Latin-transliterated Kannada. If you write Kannada in Latin letters, the model seems to read the words as English-like text and guess at the pronunciation, and it guesses wrong.

Iteration 2: Kannada Script (Better, But Wrong Register)

So I switched to proper Kannada Unicode script:

text = "ಶುಭೋದಯ! ಇಂದಿನ ವೇಳಾಪಟ್ಟಿ: ಈಜು 10:45ಕ್ಕೆ. ಆನಂದಿಸಿ!"
# Passed to Sarvam TTS with voice="kn-IN"

The pronunciation was much better. But it sounded like a textbook Kannada broadcast. Very formal. "ಆನಂದಿಸಿ" (enjoy) is technically correct but no one in our house talks like that. It felt like an IAS officer was reading out the schedule.

The problem: the script was right, but the register was wrong. Translating every word pushed the text into formal, literary Kannada. Our family talks in code-mix: mostly English, with Kannada emotion words and connectors scattered in. Forcing everything into formal Kannada creates an uncanny valley effect.

Iteration 3: Mostly English + Kannada Emotion Words

The solution was to stop trying to translate everything and only use Kannada where it adds warmth:

text = "Good morning! Today's schedule: Swimming at 10:45. Tomorrow — ski day. ಮರೆಯಬೇಡ ski gear! Stay warm everyone. ☁️"

Key principles I landed on:

  • English for logistics (times, event names, locations)
  • Kannada for emotion/connectors (ಇವತ್ತು, "today"; ಮರೆಯಬೇಡ, "don't forget")
  • Never transliterate Kannada words into Latin — use actual Kannada script or drop them
  • Keep Kannada words short — single words or short phrases, not full sentences

Result: the Sarvam TTS handled it naturally. The Kannada words are short enough that the model doesn't stumble on them, and they add warmth without making it sound like a government announcement.
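The principles above can be turned into a rough pre-flight check that runs before text is sent to the TTS engine. This is only a sketch: the romanized word list, the function names, and both thresholds (`max_run`, `max_share`) are hypothetical values picked for illustration, not tuned numbers.

```python
import re

# A token is either a run of Kannada-block characters (U+0C80 to U+0CFF)
# or a Latin word. Digits, punctuation, and emoji are ignored.
TOKEN = re.compile(r"[\u0C80-\u0CFF]+|[A-Za-z']+")

# Hypothetical, hand-maintained list of romanized Kannada words to reject.
ROMANIZED_KANNADA = {"hage", "ninna"}


def is_kannada(token: str) -> bool:
    """True when the token starts with a character from the Kannada block."""
    return "\u0C80" <= token[0] <= "\u0CFF"


def lint_announcement(text: str, max_run: int = 3, max_share: float = 0.3) -> list[str]:
    """Return a list of problems; an empty list means the text passes."""
    tokens = TOKEN.findall(text)
    problems = []

    # Principle: never transliterate Kannada into Latin letters.
    romanized = [t for t in tokens if t.lower() in ROMANIZED_KANNADA]
    if romanized:
        problems.append("romanized Kannada: " + ", ".join(romanized))

    # Principle: keep Kannada to short phrases, not full sentences.
    run = longest = 0
    for token in tokens:
        run = run + 1 if is_kannada(token) else 0
        longest = max(longest, run)
    if longest > max_run:
        problems.append(f"Kannada run of {longest} words exceeds {max_run}")

    # Principle: English carries the logistics, so Kannada stays a minority.
    kannada_count = sum(is_kannada(t) for t in tokens)
    if tokens and kannada_count / len(tokens) > max_share:
        problems.append("Kannada share of words too high")

    return problems
```

Run against the three iterations in this post, the check flags the romanized words in the first, flags the second as all Kannada, and passes the third. A word list will never catch every transliteration, so treat it as a safety net rather than a detector.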

Why This Actually Matters

This is a real design challenge for anyone building multilingual TTS for family or community contexts:

  1. Formal language ≠ natural language. TTS models trained on Kannada news/books will produce newsreader-style output. If your users speak code-mix, formal Kannada is alienating.

  2. Script > transliteration, always. If you need a non-Latin language, write it in its native script. Transliteration is for typing convenience; TTS models don't share that convenience.

  3. Code-mix is a legitimate linguistic mode, not a bug. For South Asian language contexts especially, code-mix is the actual way people communicate. Design for it, don't fight it.

The Practical Pattern

If you're building multilingual TTS announcements and your audience speaks code-mix:

[English structure] + [native-script Kannada/Telugu/Hindi emotion words]

Rather than:

[Fully translated sentences in formal register]

The Sarvam Bulbul v3 model handles this well as long as the native script words are embedded naturally. It seems to pick up context from surrounding English and adjusts inflection accordingly.
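One way to keep that pattern consistent is to generate the announcement from a template, so the English structure is fixed and the Kannada only enters through a small phrase table. The sketch below is illustrative: the `Event` class, the `KANNADA_PHRASES` keys, and `build_announcement` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str  # English event name, e.g. "Swimming"
    time: str  # clock time as it should be spoken, e.g. "10:45"


# Short native-script phrases only; the keys are hypothetical labels.
KANNADA_PHRASES = {"dont_forget": "ಮರೆಯಬೇಡ", "today": "ಇವತ್ತು"}


def build_announcement(events: list[Event], reminder: str = "") -> str:
    """Compose English logistics plus at most one short Kannada phrase."""
    parts = ["Good morning! Today's schedule:"]
    if events:
        parts.append(" ".join(f"{e.name} at {e.time}." for e in events))
    else:
        parts.append("Nothing on the calendar.")
    if reminder:
        # The Kannada phrase stays in native script and is one word long.
        parts.append(f"{KANNADA_PHRASES['dont_forget']} {reminder}!")
    return " ".join(parts)
```

For example, `build_announcement([Event("Swimming", "10:45")], reminder="ski gear")` yields an English sentence with a single Kannada word in front of the reminder, close to the Iteration 3 text.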

Three iterations to figure this out. Hopefully this saves you one or two.


Tested on: Sarvam.AI Bulbul v3, kn-IN voice, via the Sarvam TTS API. Announcements cast to Google Home via catt.
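For the casting step, a minimal sketch of wrapping the catt command line from Python follows. It assumes catt is installed and on the PATH, that the synthesized audio has already been saved as a local file, and that the speaker name ("Kitchen speaker" in the usage note) matches a device on your network; the function names are hypothetical.

```python
import subprocess


def build_cast_command(audio_path: str, device: str) -> list[str]:
    """Return the catt invocation that casts a local audio file to one device."""
    return ["catt", "-d", device, "cast", audio_path]


def cast_announcement(audio_path: str, device: str) -> None:
    """Run catt and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_cast_command(audio_path, device), check=True)
```

Calling `cast_announcement("briefing.wav", "Kitchen speaker")` would then play the file on that speaker. Passing the arguments as a list avoids shell quoting problems with device names that contain spaces.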

Top comments (2)

Adarsh Kant

This really resonates. Indic language TTS is still massively underserved — most solutions butcher pronunciation or fall back to transliteration hacks.

We ran into similar challenges building AnveVoice (voice AI for websites). Supporting 22 Indian languages + Hinglish meant dealing with the exact same script/romanization issues you describe. Kannada's conjunct consonants are especially tricky for TTS engines that weren't trained on native data.

Curious — did you find any open-source Kannada voice models that handled the sandhi rules well? We've been evaluating different TTS pipelines for our sub-700ms latency requirement and the quality gap between English and Indic languages is still huge.

Great writeup. More devs need to talk about the multilingual TTS struggle.

Sachin Anbhule

Yes, you are absolutely right. Even Sarvam didn't work for realtime use: its latency is extremely high when we try to integrate it with realtime voice AI applications. But we are working on some inference optimizations so Indian languages can be served at sub-200ms latency. Will update when we publish it.