ai-agents · Signal 85/100

AI intel digest

Voice AI: when is the "Her" moment? — Neil Zeghidour, CEO, Gradium AI

Neil Zeghidour, CEO of Gradium AI and co-creator of Moshi, argues that voice AI is stuck between two flawed architectures: cascaded systems that cannot achieve human-like latency, and speech-to-speech models that solve latency but remain half-duplex and lack utility.

2026-05-09 · 30 min read · 5,914 words · 12 facts · 0 assumptions
Start here

Executive summary

Neil Zeghidour, CEO of Gradium AI and co-creator of Moshi, argues that voice AI is stuck between two flawed architectures: cascaded systems (STT→LLM→TTS) that cannot achieve human-like latency, and speech-to-speech models that solve latency but remain half-duplex and lack utility. He demonstrates Gradium's voice cloning, a live travel-agent demo with latency-masking fillers, and Phoneon—a sub-100M parameter on-device TTS running on smartphone CPUs. The core claim: the "Her moment" requires solving full-duplex conversation, paralinguistic understanding, and cost scalability simultaneously, which remains an open science problem.

Sources mentioned

11 Labs — Referenced as "the best voice AI company in the world"; demo shown of a government AI helper (latency and naturalness still insufficient).
OpenAI — Advanced Voice Model mentioned as half-duplex, not full-duplex.
Cezanne — Voice model mentioned as impressive but still half-duplex.
NVIDIA — Published the Personal Plex model based on Moshi.
Eric Schmidt, Rodolphe Saadé, Xavier Niel — Philanthropist funders of the non-profit lab that spawned Gradium.
Kokoro — Existing on-device TTS model; lacks voice cloning.
Samuel — Unidentified speaker who presented on cascaded systems earlier in the event.
Joel (famous podcaster) — Voice cloned for the live demo.

Verdict

This video carries unique signal for AI tracking because Zeghidour speaks as both researcher (Moshi co-creator) and CEO (Gradium), giving him credibility across science and production economics. The core value is his explicit mapping of where each architectural approach fails and why partial solutions don't compose: full-duplex without intelligence is useless, intelligence without latency control is annoying, and naturalness without cost scalability is unshippable. Most voice AI commentary focuses on one dimension (usually latency or naturalness); this talk forces a three-dimensional analysis. The live demos—voice cloning, filler-based travel agent, half-duplex breakdown, Moshi full-duplex excerpt, Phoneon CPU TTS—provide concrete anchors for abstract claims. The explicit cost data (TTS burning startup funding, hyperscalers running voice mode at a loss) is rare in public talks. Worth watching for anyone building or investing in voice AI, or anyone who needs to calibrate "Her moment" claims against actual architectural constraints.

Count: 12 facts, 0 assumptions, 7 demonstrations (voice cloning, 11 Labs demo, rich mini gym-bro demo, live travel agent with fillers, half-duplex failure demo, Moshi full-duplex demo, Phoneon CPU demo)
Signal density: 85% — high ratio of concrete claims, live demonstrations, and explicit cost/latency data to filler or promotional content.

What matters

Signal points

  1. The 200ms biological ceiling makes cascaded systems a dead end for human-like conversation—this is a hard architectural limit, not an engineering gap.

  2. Tool call latency (500ms-4s) has overtaken TTS latency as the primary bottleneck, rendering incremental TTS optimization strategically misplaced.

  3. Full-duplex is solved (Moshi, ~2 years ago) but useless without intelligence, reliability, and observability—natural flow without utility is a demo, not a product.

  4. Paralinguistic understanding is the most underrated blocker: voice carries meaning that text strips out, and current training paradigms waste this signal.

  5. TTS cost is the hidden killer of consumer voice AI; hyperscalers run voice mode at a loss, and startups burn funding before reaching scale.

  6. On-device CPU inference (Phoneon, <100M params) is framed as an economic necessity, not just privacy theater—zero API cost enables sustainable consumer apps.

  7. The "Her moment" requires simultaneous advances in three domains: full-duplex architecture (solved), paralinguistic intelligence (unsolved science), and cost scalability (requires on-device inference).


Interpretation

Key ideas

1. Cascaded systems are architecturally incapable of achieving human-like conversation regardless of incremental latency improvements.

Why: Even optimal TTS exceeds 200ms alone, while humans execute the full pipeline in 200ms; tool calls add 500ms-4s unpredictably; the three-block pipeline cannot be compressed below biological limits.

Implication: Incremental engineering on STT→LLM→TTS stacks is a local optimum—breakthrough requires architecture change, not optimization.
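
A back-of-the-envelope budget makes the gap concrete. The sketch below is illustrative Python: only the ~200ms human budget, the ">200ms for TTS alone" figure, and the 500ms-4s tool-call range come from the talk; the STT and LLM numbers are hypothetical placeholders.

```python
# Illustrative latency budget for a cascaded (STT -> LLM -> TTS) voice agent.
# Only the ~200 ms human budget, the ">200 ms TTS" figure, and the
# 500 ms - 4 s tool-call range come from the talk; the STT and LLM
# numbers are hypothetical placeholders.

HUMAN_BUDGET_MS = 200  # full understand -> answer -> pronounce loop in humans

cascade_ms = {
    "stt_endpointing": 150,   # assumption: detect end of speech + transcribe
    "llm_first_token": 250,   # assumption: time to first token of the reply
    "tts_first_audio": 200,   # talk: "just the TTS is still more than 200 ms"
}

best_case = sum(cascade_ms.values())
worst_case = best_case + 4_000  # talk: tool calls add 500 ms to 4 s
print(f"cascade best case: {best_case} ms vs human budget: {HUMAN_BUDGET_MS} ms")
print(f"with a slow tool call: {worst_case} ms")
# Even with generous assumptions the pipeline sits ~3x above the human
# budget before any tool call is made.
```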

2. Speech-to-speech models solve cascaded latency but introduce a different failure mode: half-duplex rigidity breaks conversational flow.

Why: Half-duplex systems must choose between listening and speaking, making them unable to handle overlaps, interruptions, backchanneling, or environmental noise—behaviors that constitute up to 20% of natural conversation time.

Implication: Latency reduction without duplex capability trades one robotic artifact for another; full-duplex is a necessary but insufficient condition for naturalness.
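
The structural difference shows up directly in control flow. This is a minimal sketch, not Moshi's actual architecture: `listen_frame`, `speak_frame`, and `model` are invented stand-ins. The point is that a half-duplex loop must alternate modes, while a full-duplex design keeps inbound and outbound audio running concurrently, which is what makes backchanneling and interruption possible.

```python
# Half-duplex vs full-duplex control flow, schematically. All names are
# hypothetical stand-ins for real audio I/O and a real speech model.
import asyncio

async def half_duplex(listen_frame, speak_frame, model):
    # The loop must choose a mode: overlap is structurally impossible.
    while True:
        utterance = await listen_frame()   # while listening, cannot speak
        reply = model.respond(utterance)
        await speak_frame(reply)           # while speaking, cannot listen

async def full_duplex(listen_frame, speak_frame, model):
    # Two always-on streams: the model ingests and emits audio in
    # parallel, so backchannels ("mhm") and interruptions can overlap.
    async def inbound():
        while True:
            model.ingest(await listen_frame())

    async def outbound():
        while True:
            await speak_frame(model.next_frame())  # may be silence

    await asyncio.gather(inbound(), outbound())
```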

3. Paralinguistic information (tone, hesitation, discomfort, cultural signals) is preserved in audio-native models but useless without explicit training to exploit it.

Why: Speech-to-speech models technically contain paralinguistic data, but if trained on factual audio-instruct datasets, the model has no incentive to map vocal cues to semantic or emotional responses.

Implication: "Her" requires scientific advances in how models learn from paralinguistics, not just audio-native architecture—a data and training problem, not a model-class problem.

4. Cost scalability is a hidden wall behind the latency problem; natural-sounding voice at scale is economically unsustainable with current cloud TTS.

Why: Hyperscalers subsidize voice mode at a loss; startups exhaust funding on TTS costs; consumer adoption requires hours-daily usage that current pricing cannot support.

Implication: On-device inference (CPU, not GPU) is not merely a privacy feature but an economic prerequisite for consumer voice AI at scale.
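
A toy cost model shows why this is an economics argument rather than an engineering one. Every number below is a hypothetical placeholder (the talk gives no pricing); the structure is the point: cloud TTS cost scales linearly with engagement, on-device inference does not.

```python
# Back-of-the-envelope cloud-TTS cost for a consumer voice app.
# All figures are hypothetical placeholders, not quoted prices.

def monthly_tts_cost(users: int, minutes_per_user_per_day: float,
                     price_per_minute_usd: float) -> float:
    """Linear cost model: usage minutes times per-minute API price."""
    return users * minutes_per_user_per_day * 30 * price_per_minute_usd

cloud = monthly_tts_cost(100_000, 30, 0.02)  # assumed $0.02/min cloud TTS
local = monthly_tts_cost(100_000, 30, 0.0)   # on-device: zero marginal API cost
print(f"cloud TTS: ${cloud:,.0f}/month, on-device: ${local:,.0f}/month")
# At hours-daily usage, any nonzero per-minute price grows with engagement:
# the better the product retains users, the faster it burns money.
```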

5. Fillers are a practical engineering patch for tool-call latency, not a solution.

Why: By splitting the LLM into tool-calling and conversational threads, the system maintains natural speech during unpredictable API waits, then reintegrates results.

Implication: This pattern will become standard in production voice agents but does not address the fundamental unpredictability of tool-dependent intelligence.
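
A minimal sketch of the pattern, assuming an asyncio-style agent: one task fires the slow tool call while the conversational side keeps the floor with fillers, then folds the result back in. All names (`call_tool`, `speak`) are illustrative stand-ins, not Gradium's implementation.

```python
import asyncio

async def call_tool(query: str) -> str:
    await asyncio.sleep(2.0)  # stand-in for an unpredictable 500 ms - 4 s API
    return "3 flights found, cheapest departs 9:05"

async def speak(text: str) -> None:
    print(f"[agent] {text}")  # stand-in for streaming TTS playback

async def answer_with_fillers(query: str) -> None:
    tool_task = asyncio.create_task(call_tool(query))  # tool-calling thread
    fillers = ["Let me check that for you...", "Still looking, one moment..."]
    for filler in fillers:            # conversational thread keeps the floor
        if tool_task.done():
            break
        await speak(filler)
        await asyncio.sleep(1.0)      # pacing between fillers
    result = await tool_task          # reintegrate the tool result
    await speak(f"Okay: {result}.")

asyncio.run(answer_with_fillers("flights to Tokyo tomorrow"))
```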

Evidence

Key facts

Gradium AI trains speech-to-text, text-to-speech, and speech-to-speech models as building blocks for voice agents, not vertical solutions or orchestration.

HIGH

Evidence: We train voice models, speech-to-text, text-to-speech, speech-to-speech... We want to be main model provider for voice for everyone building voice agents and voice solutions. We are not working on orchestration, we are not working on specific verticals.

Gradium's voice cloning requires only 10 seconds of audio to replicate tone, pitch, accent, and vocal quirks.

HIGH

Evidence: You record like 10 seconds of your voice, that's it, and the system analyzes the tone, the pitch, the accent, all those little quirks that make your voice yours. Then boom, you type text and it talks back.

Gradium originated as a non-profit lab funded by Eric Schmidt, Rodolphe Saadé, and Xavier Niel, and created Moshi (first speech-to-speech conversational model), speech-to-speech translation, and Pocket TTS.

HIGH

Evidence: This is a spin-off from a non-profit lab we created 2 years ago with the funding from philanthropists including Eric Schmidt, Rodolphe Saadé and Xavier Niel... we developed Moshi which was the first speech-to-speech model for conversation, speech-to-speech translation, Pocket TTS most recently a CPU model.

Human conversational response latency is approximately 200ms total (comprehension + answer generation + pronunciation), while TTS alone in cascaded systems exceeds 200ms.

HIGH

Evidence: just the TTS is still more than 200 milliseconds. While in a human conversation, you need the entire stack of understanding, producing an answer, and pronouncing it to be around 200 milliseconds.

Tool call latency in voice agents ranges from 500ms to 4 seconds, making it the current primary bottleneck rather than TTS optimization.

HIGH

Evidence: you have a tool call or open router that is going to have a latency between 500 milliseconds and 4 seconds... I think now the main bottleneck is becoming the tool call.

Every speech-to-speech model except Moshi is half-duplex, meaning it cannot listen and speak simultaneously.

HIGH

Evidence: every single speech to speech model except Moshi is half duplex... even the best speech to speech model we could argue... maybe that's the advanced voice model of Open AI or Cezanne... is still half duplex.

In Japanese conversation, backchanneling (overlapping speech like "mhm") occupies up to 20% of speaking time and is culturally expected.

HIGH

Evidence: in Japanese it's a sign of politeness and that you are actively listening to do a lot of back channeling... you get up to 20% of the time that is overlapped between the people.


Moshi is approximately two years old and remains the only full-duplex speech-to-speech model; NVIDIA's Personal Plex model is based on it.

HIGH

Evidence: Moshi, I think we saw still the only full duplex model. Recently, Nvidia published the personal plex model based on it.

Moshi lacked utility—no tool calls, no observability, poor content filtering, and no paralinguistic exploitation despite capturing paralinguistic data.

HIGH

Evidence: it was not an agent. It had no tool call, no ability to do anything... very hard to detect if someone said something that should be not accepted... there was no real paralinguistic understanding.

Voice mode at major hyperscalers runs at a loss due to the cost of gigantic multimodal models.

HIGH

Evidence: the voice mode of most hyperscalers is run at a loss. It's a gigantic multimodal model, and they lose money every time you use it.

TTS is the dominant cost in voice AI applications, with some startups burning through fundraising on TTS bills before achieving user growth.

HIGH

Evidence: TTS is really what is going to consume most of the... I saw people burning their fundraising in TTS bills, and they don't even get the opportunity to get their user base to grow.

Phoneon is a sub-100M parameter TTS model that runs on smartphone CPUs with voice cloning capability, entering private beta.

HIGH

Evidence: Gradion Phonon... it's a very small model, less than 100 million parameter... on-device means it runs on a smartphone CPU... we open a private beta for this model.

Memorable lines

Quotes

just the TTS is still more than 200 milliseconds. While in a human conversation, you need the entire stack of understanding, producing an answer, and pronouncing it to be around 200 milliseconds. So that will not allow to have a conversation that sounds human.
we are fighting for latency of the TTS, trying to grab 10 milliseconds, 20 milliseconds, and then you have a tool call or open router that is going to have a latency between 500 milliseconds and 4 seconds.
every single speech to speech model except Moshi is half duplex... even the best... maybe that's the advanced voice model of Open AI or Cezanne... is still half duplex.
you get up to 20% of the time that is overlapped between the people, right? So that's what makes a conversation human.
this snippet here it's the AI understanding that the character is a bit uncomfortable. So, that's paralinguistic understanding... technically, that is in Moshi, that is in any speech-to-speech model because this information is not lost. However, if you don't exploit this information to make your model say relevant things, it's never going to exploit it.
I saw people burning their fundraising in TTS bills, and they don't even get the opportunity to get their user base to grow.
on-device means it runs on a smartphone CPU... less than 100 million parameter... you can use that to power any kind of voice application without paying a single cent of API fee.
I have a strong strict opposition to some of our competitors say that voice is a commodity now. I think it's completely false. Voice is very challenging. The last mile is going to be the most difficult to solve.