Understand the two voice generation modes available for your AI assistants and when to use each one.
| | Pipeline mode | Speech-to-speech mode |
|---|---|---|
| Label in UI | Pipeline | Speech-to-speech |
| How it works | Speech-to-Text → LLM → Text-to-Speech | Direct speech-to-speech generation (no intermediate text) |
| Latency | ~800–1500 ms (depends on language & model) | ~300–600 ms (ultra low) |
| Best for | Complex reasoning, dynamic prompts, multi-sentence replies | Natural back-and-forth, short & reactive replies |
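
The trade-off between the two modes can be sketched as a small selection heuristic. This is only an illustration: `VoiceMode`, `choose_mode`, and the decision rule itself are hypothetical and not part of the product; only the labels and approximate latency ranges come from the comparison above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceMode:
    """Hypothetical record describing one voice generation mode."""
    label: str
    min_latency_ms: int  # approximate lower bound from the comparison above
    max_latency_ms: int  # approximate upper bound from the comparison above


PIPELINE = VoiceMode("Pipeline", 800, 1500)
SPEECH_TO_SPEECH = VoiceMode("Speech-to-speech", 300, 600)


def choose_mode(latency_budget_ms: int, needs_complex_reasoning: bool) -> VoiceMode:
    """Pick a mode from a latency budget and the kind of replies expected.

    Assumed rule (not an official recommendation): use Pipeline only when
    the assistant needs complex reasoning AND the caller can tolerate the
    upper end of Pipeline's latency range; otherwise use Speech-to-speech.
    """
    if needs_complex_reasoning and latency_budget_ms >= PIPELINE.max_latency_ms:
        return PIPELINE
    return SPEECH_TO_SPEECH


if __name__ == "__main__":
    # A support assistant that reasons over long prompts and can wait ~2 s.
    print(choose_mode(2000, needs_complex_reasoning=True).label)
    # A conversational assistant that must answer within ~500 ms.
    print(choose_mode(500, needs_complex_reasoning=False).label)
```

In practice the right choice also depends on language and model, so a rule like this is at most a starting point for testing both modes with your own assistant.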