The Problem
Rural women in India rely heavily on voice notes in their WhatsApp conversations with Myna Saheli. This shows a strong preference for speaking over typing, especially among low-literacy users who struggle with text-based interfaces. The WhatsApp bot handles text and voice messages asynchronously, but many users needed real-time voice support for urgent health questions, follow-up calls, and appointment-related conversations. Traditional IVR systems were too rigid, and early voice AI experiments using Exotel appflows suffered from 3 to 4 seconds of latency even for simple greetings like 'hi' or 'hello', which broke the natural flow of conversation and reduced trust. Third-party platforms (Bolna AI, Millis AI, ElevenLabs) either did not meet the latency requirements or were prohibitively expensive at scale.
Our Approach
- 01
Architected a custom voice AI pipeline to replace the high-latency Exotel appflows system: incoming audio → Sarvam AI STT (speech-to-text) → jargon filtering → OpenAI LLM with RAG-based medical knowledge base → Sarvam AI TTS (text-to-speech) → Exotel voice delivery.
- 02
Attempted latency optimization via caching generic greeting responses ('hi,' 'hello'), but found caching unreliable for natural conversation and unable to handle the variability of real user queries.
- 03
Evaluated and ultimately rejected Bolna AI, Millis AI, and ElevenLabs due to cost constraints, inflexibility, or failure to meet sub-second response requirements for real-time healthcare conversations.
- 04
Rebuilt the entire voice infrastructure on Pipecat (chosen over LiveKit for its gentler learning curve and faster implementation), achieving near-instant response times while preserving full conversation context and storing all call transcripts in the database.
- 05
Integrated WhatsApp as the trigger mechanism: users send a specific command in their Myna Saheli chat to initiate an outbound voice call from the system, seamlessly bridging text and voice channels within the same care journey.
- 06
Implemented intent-based orchestration to route each conversation dynamically to one of four flows (question answering, appointment scheduling, follow-up reminders, or crisis detection), with prompt guardrails keeping all responses focused on sexual and reproductive health (SRH) topics.
- 07
Built a unified analytics layer storing every voice interaction in PostgreSQL, enabling Mixpanel-powered tracking of call completion rates, average duration, repeat engagement patterns, and language-split analysis (Hindi, Marathi, English).
- 08
Deployed the voice AI on AWS with auto-scaling worker pools to handle peak call volumes (2,100+ calls/month by May 2026), maintaining sub-second latency during high-traffic appointment reminder campaigns.
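To illustrate steps 01 and 06, the sketch below chains the stages in the order described (speech-to-text, jargon filtering, intent routing, LLM reply, text-to-speech) and records how long each stage takes per turn. It is a minimal Python sketch under stated assumptions, not the production code: every function (`fake_stt`, `filter_jargon`, `classify_intent`, `fake_llm`, `fake_tts`, `run_turn`), the keyword lists, and the jargon table are hypothetical stand-ins, and no real Sarvam AI, OpenAI, Exotel, or Pipecat call is shown.

```python
import time
from dataclasses import dataclass, field

# Hypothetical stand-ins: the real system calls Sarvam AI (STT/TTS),
# OpenAI (LLM + RAG) and Exotel; none of those APIs are reproduced here.

# Assumed jargon table: clinical term -> plain-language phrase.
JARGON = {"menorrhagia": "heavy periods", "dysmenorrhea": "period pain"}

# Assumed keyword lists; "crisis" is listed first so it is checked first.
INTENT_KEYWORDS = {
    "crisis": ("emergency", "severe pain", "bleeding heavily"),
    "appointment": ("appointment", "book", "schedule"),
    "follow_up": ("reminder", "follow-up", "follow up"),
}

def fake_stt(audio: bytes) -> str:
    """Stand-in for speech-to-text: treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")

def filter_jargon(text: str) -> str:
    """Replace known clinical terms with plain-language phrases."""
    for term, plain in JARGON.items():
        text = text.replace(term, plain)
    return text

def classify_intent(text: str) -> str:
    """Keyword routing; anything unmatched falls back to question answering."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "question"

def fake_llm(intent: str, text: str) -> str:
    """Stand-in for the LLM + RAG step: one canned reply per intent."""
    replies = {
        "crisis": "Connecting you to a counselor right away.",
        "appointment": "Let us find a time for your appointment.",
        "follow_up": "Here is your follow-up reminder.",
        "question": f"Here is what we know about: {text}",
    }
    return replies[intent]

def fake_tts(text: str) -> bytes:
    """Stand-in for text-to-speech: returns the reply as bytes."""
    return text.encode("utf-8")

@dataclass
class TurnResult:
    intent: str
    reply_text: str
    audio: bytes
    stage_ms: dict = field(default_factory=dict)

def run_turn(audio: bytes) -> TurnResult:
    """Run one conversational turn and time each stage in milliseconds."""
    timings = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        out = fn(*args)
        timings[name] = (time.perf_counter() - start) * 1000
        return out

    transcript = timed("stt", fake_stt, audio)
    cleaned = timed("jargon_filter", filter_jargon, transcript)
    intent = classify_intent(cleaned)
    reply = timed("llm", fake_llm, intent, cleaned)
    speech = timed("tts", fake_tts, reply)
    return TurnResult(intent, reply, speech, timings)
```

Calling `run_turn(b"I want to book an appointment")` returns a result whose `intent` is `"appointment"` and whose `stage_ms` holds one timing per stage. Per-stage timings like these are one way to see which stage dominates the latency budget when a turn has to complete in under a second.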
Outcome
Myna Voice AI successfully bridged the literacy and engagement gap, enabling 588 unique low-literacy women to access SRH care through 5,269 phone-based conversations. With an average of about 9 calls per user and an average call duration of 104 seconds, the platform showed that voice-first healthcare can drive sustained engagement where text-based apps fail. Monthly call volumes grew from pilot usage to over 2,100 calls by May 2026, reducing counselor workload for routine questions, improving follow-up completion rates, and enabling healthcare access for users without smartphones. This directly supports the foundation's mission to reach underserved women through the channel they trust most: their voice.
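Engagement figures of this kind can be derived from the stored call records with simple aggregates. The sketch below is illustrative only: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `calls` table, its columns (`user_id`, `language`, `duration_s`, `completed`), and the sample rows are all hypothetical rather than the project's actual schema or data.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up call records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls ("
        "user_id TEXT, language TEXT, duration_s INTEGER, completed INTEGER)"
    )
    rows = [
        ("u1", "hi", 120, 1),
        ("u1", "hi", 90, 1),
        ("u2", "mr", 60, 0),
        ("u3", "en", 150, 1),
    ]
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", rows)
    return conn

def summarize(conn: sqlite3.Connection) -> dict:
    """Compute call volume, engagement, and language-split aggregates."""
    total, users, avg_duration, completed = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT user_id), "
        "AVG(duration_s), SUM(completed) FROM calls"
    ).fetchone()
    # Each row is a (language, count) pair, so the cursor converts to a dict.
    by_language = dict(
        conn.execute("SELECT language, COUNT(*) FROM calls GROUP BY language")
    )
    return {
        "total_calls": total,
        "unique_users": users,
        "calls_per_user": total / users,
        "avg_duration_s": avg_duration,
        "completion_rate": completed / total,
        "by_language": by_language,
    }

summary = summarize(build_demo_db())
```

Applied to the totals reported above, the same `calls_per_user` ratio gives 5,269 / 588 ≈ 8.96, which matches the stated average of about 9 calls per user.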










