ai-agents · Signal 75/100

AI intel digest

Give Your Chat Agent a Voice — Luke Harries, Head of Growth, ElevenLabs

ElevenLabs announced Voice Engine, a new product (early preview, shipping in weeks) that wraps existing text-based chat agents with voice capabilities via a lightweight SDK.

2026-05-09 · 23 min read · 4,669 words · 10 facts · 0 assumptions
Start here

Executive summary

Summary

ElevenLabs announced Voice Engine, a new product (early preview, shipping in a couple of weeks) that wraps existing text-based chat agents with voice capabilities via a lightweight SDK, without requiring teams to rebuild their underlying agent logic. The speaker, Luke Harries, demonstrated converting a generic chat support agent to a voice agent in roughly one prompt using the server and client SDKs, Shadcn-based UI components, and a code-analysis skill. The engine bundles speech-to-text (Scribe), text-to-speech (V3), emotion-aware turn-taking, and semantic VAD, and unlocks omnichannel deployment (telephony, widgets, Zoom). The core pitch: chat agents will either add voice or become obsolete, and Voice Engine is the migration path for teams that have already invested in chat infrastructure.

The extracted key facts, key ideas, quotes, and signal points appear in the sections below.

Sources mentioned

Linear — cited as a company that made its home screen a chat interface.
PostHog — cited as a company that made its home screen a chat interface; also used as the example of a voice agent joining a Zoom call.
Gov.uk — cited as a government example adopting chat agents.
Revolut — cited as an ElevenLabs customer for customer support.
Scribe — ElevenLabs' speech-to-text model, described as "the most accurate model."
V3 — ElevenLabs' text-to-speech model.
Shadcn — UI component library whose style is used for the Voice Engine components.
Vercel — design style reference for the UI components.
ElevenLabs Voice Engine — the new product being announced.
ElevenLabs server SDK / client SDK — the developer tools demonstrated.

Verdict

This video is worth watching for AI product and infrastructure trackers because it delivers a concrete, shipping-soon product (Voice Engine) with a clear technical paradigm: voice as a separable, wrap-around primitive rather than a ground-up rebuild. The signal unique to this talk is the explicit SDK ergonomics and migration path for existing chat agents — most voice-agent discussions assume greenfield builds, while Harries focuses on preserving brownfield investment. The live demo, code snippets, and design-partner callout give it more credibility than a pure vision talk. The main limitation is that performance claims (e.g., "most accurate" STT) are unbenchmarked in the transcript, and the one-prompt migration is demonstrated on a generic agent, not a complex production system. Signal density is moderate-high: there is some repetition and applause filler, but the core announcement, architecture, and SDK details are dense and actionable.

Count

Facts: 10 · Assumptions: 0 (all claims are either directly stated or demonstrated; no unbacked speculation) · Demonstrations: 2 (live demo of chat-to-voice conversion; code snippet showing the voice engine wrapper)

Signal density: 75

What matters

Signal points

  1. Voice Engine ships in weeks as a standalone wrapper for existing chat agents, not a replacement platform.

  2. Migration cost is designed to be near-zero: one prompt to analyze the codebase and wrap, ~3 lines for the client widget.

  3. The engine bundles Scribe (STT), V3 (TTS), emotion-aware turn-taking, and semantic VAD into a single primitive.

  4. Tool calling is not handled by the wrapper; it proxies to the existing agent, preserving backend logic.

  5. Omnichannel unlock (telephony, widgets, Zoom) is automatic after wrapping, not requiring separate integrations.

  6. ElevenLabs is actively recruiting design partners, indicating the product is pre-GA and seeking validation.

  7. The speaker explicitly predicts chat agents that do not add voice will die, framing this as a binary strategic inflection.

Interpretation

Key ideas

1

Voice is not an incremental feature but a paradigm shift that makes chat feel dated.

Why: Speaker argues voice is faster, more accessible, and unlocks omnichannel interaction (Zoom, phone, screen readers) that text cannot.

Implication: AI product strategy should prioritize voice as a first-class interface, not a bolt-on.

2

The "voice engine" should be treated as a separable primitive, not bundled into a monolithic agent platform.

Why: Many teams already built and evaluated chat agents; forcing them to rebuild is friction. Extracting voice into a wrapper preserves existing investment.

Implication: The market will likely move toward higher-abstraction bundles (STT + TTS + turn-taking + VAD) as reusable infrastructure layers.

3

Developer experience for voice adoption must be trivial (one prompt, three lines of client code).

Why: Teams have already spent "a ton of time" on chat agents; the migration cost must approach zero to drive adoption.

Implication: Voice infrastructure winners will be judged by SDK ergonomics and migration friction, not just model quality.

4

Chat agents face a binary future: add voice or die.

Why: 2025 was the year of chat agents, but text-in/text-out is already feeling dated; the next upgrade is voice, not better prompts or RAG.

Implication: Existing chat-first products are at risk of rapid obsolescence if they do not integrate voice.

5

Omnichannel deployment (widget, telephony, SaaS integrations) should be automatic once voice is wrapped.

Why: The client SDK unlocks telephony and CCaaS "pretty much out the box" after wrapping the agent.

Implication: Voice infrastructure should be judged by downstream integration breadth, not just the core audio pipeline.

6

The community should move toward "higher abstraction bundles" instead of raw STT/TTS primitives.

Why: Developers currently assemble STT, TTS, turn-taking, and VAD manually; bundling these reduces repeated engineering.

Implication: Expect consolidation around bundled voice-agent SDKs and commoditization of the underlying raw models.
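The "higher abstraction bundle" idea can be made concrete at the interface level: instead of wiring STT, TTS, turn-taking, and VAD by hand, developers consume one primitive that owns the whole pipeline. The sketch below uses made-up local types as stand-ins — it is not any real SDK's interface, just an illustration of the bundling argument.

```typescript
// Illustrative only: what "bundling STT + TTS + turn-taking + VAD into one
// primitive" means at the interface level. All types here are made up.

// Today developers assemble four separate components manually:
interface STT { transcribe(audio: string): string }
interface TTS { speak(text: string): string }
interface TurnTaking { isUserDone(transcript: string): boolean }
interface VAD { isSpeech(audio: string): boolean }

// The bundled primitive: one object that owns the whole voice pipeline.
interface VoiceEngineBundle {
  // Returns synthesized audio, or null while the user is still speaking.
  process(audio: string, agent: (text: string) => string): string | null;
}

function makeBundle(stt: STT, tts: TTS, turns: TurnTaking, vad: VAD): VoiceEngineBundle {
  return {
    process(audio, agent) {
      if (!vad.isSpeech(audio)) return null;          // ignore non-speech input
      const transcript = stt.transcribe(audio);
      if (!turns.isUserDone(transcript)) return null; // wait for the turn to end
      return tts.speak(agent(transcript));            // only then hit the agent
    },
  };
}

// Trivial stand-in implementations so the sketch runs end to end.
const bundle = makeBundle(
  { transcribe: (a) => a.replace("audio:", "") },
  { speak: (t) => `audio:${t}` },
  { isUserDone: (t) => t.trim().endsWith("?") },
  { isSpeech: (a) => a.startsWith("audio:") },
);
```

The consolidation prediction follows from the shape of `makeBundle`: once the four components live behind one `process` call, the individual models underneath become swappable commodities.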

Evidence

Key facts

ElevenLabs is releasing a new product called Voice Engine in a couple of weeks.

HIGH

Evidence: I'm giving you an early preview of a new product which will be coming out in a couple of weeks

Voice Engine wraps existing chat agents without replacing their underlying logic.

HIGH

Evidence: we've basically taken this voice engine bit and wrapped it up into its own first class primitive, which makes it really easy for you to add and wrap any existing agent

The engine uses ElevenLabs' Scribe for speech-to-text and V3 for text-to-speech.

HIGH

Evidence: We combine the best models, so speech to text with scribe... as well as the text to speech models like V3

Turn-taking is emotion-context-aware and includes semantic VAD.

HIGH

Evidence: It's got this really advanced turn taking, which is emotion context aware. It can tell when you're pausing. It does the semantic VAD as well.

The server SDK involves creating a client, creating a voice engine, and attaching a wrapper to the existing chat agent.

HIGH

Evidence: you create your client, you then create your voice engine. And then you add this little wrapper to your existing chat agent where you basically attach it
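The three-step flow in this fact (create a client, create a voice engine, attach a wrapper to the existing chat agent) can be sketched roughly as follows. The SDK is not yet released, so every name here (`createClient`, `createVoiceEngine`, `wrap`) is a hypothetical stand-in modeled locally, not the real ElevenLabs API.

```typescript
// Hypothetical sketch of the server-SDK flow described in the talk.
// None of these names are the real ElevenLabs API — the SDK is unreleased,
// so the types below are local stand-ins that model the described shape.

type ChatAgent = (userText: string) => Promise<string>;

interface VoiceEngine {
  // Wraps a text-based chat agent: audio in -> STT -> agent -> TTS -> audio out.
  wrap(agent: ChatAgent): (audioIn: string) => Promise<string>;
}

interface Client {
  createVoiceEngine(opts: { stt: string; tts: string }): VoiceEngine;
}

// Step 1: create your client (stand-in for an SDK constructor).
function createClient(apiKey: string): Client {
  return {
    // Step 2: create your voice engine, choosing the bundled models.
    createVoiceEngine({ stt, tts }) {
      return {
        // Step 3: attach the wrapper to the existing chat agent.
        wrap(agent) {
          return async (audioIn) => {
            const transcript = `[${stt}] ${audioIn}`; // stand-in for Scribe STT
            const reply = await agent(transcript);    // existing chat logic, unchanged
            return `[${tts}] ${reply}`;               // stand-in for V3 TTS
          };
        },
      };
    },
  };
}

// The existing chat agent stays exactly as it was.
const supportAgent: ChatAgent = async (text) => `Answer to: ${text}`;

const client = createClient("demo-key");
const engine = client.createVoiceEngine({ stt: "scribe", tts: "v3" });
const voiceAgent = engine.wrap(supportAgent);
```

`voiceAgent` now takes "audio" (a string placeholder here) and returns "audio", while `supportAgent` is untouched — which is the migration story the talk emphasizes.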

The client SDK is approximately three lines of code to add a widget to a site.

HIGH

Evidence: It's basically three lines you can then add and you have a widget in your site

UI components are based on Shadcn and Vercel style.

HIGH

Evidence: we have a bunch of really beautiful well thought out UI components all based on the Shadcn and Vercel style


A live demo showed a generic chat support agent being converted to voice in about one prompt.

HIGH

Evidence: you can literally in about one prompt actually convert an existing chat agent to a voice agent. So I'll give you a quick demo now.

ElevenLabs is looking for design partners for early access.

HIGH

Evidence: We're also looking for some design partners. So if you want to be some of the first people to do this, we'd love to chat.

Tool calling is handled by the existing chat agent; the wrapper proxies to it.

HIGH

Evidence: your chat agent actually normally does the majority of tool calling. So it's actually already built out on the back end here. And so you can have this wrapper without needing to deal with any of the issues of tool calling.
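The tool-calling point can be made concrete: the wrapper only converts audio to text and back, so any tool calls happen inside the existing agent exactly as before. A minimal sketch, with local stand-ins rather than the real SDK:

```typescript
// Sketch: tool calling stays inside the existing chat agent; the voice
// wrapper is a pure audio<->text proxy. All names are local stand-ins.

type Tool = (args: string) => string;

// The existing chat agent already owns its tools and decides when to call them.
const tools: Record<string, Tool> = {
  lookupOrder: (id) => `Order ${id}: shipped`,
};

async function chatAgent(userText: string): Promise<string> {
  // Toy "tool-calling" logic standing in for the agent's real backend.
  const match = userText.match(/order (\w+)/i);
  if (match) return tools.lookupOrder(match[1]);
  return "How can I help?";
}

// The voice wrapper never sees or re-implements the tools: it just proxies.
function wrapWithVoice(agent: (text: string) => Promise<string>) {
  return async (audio: string): Promise<string> => {
    const transcript = audio;              // stand-in for STT
    const reply = await agent(transcript); // tool calls happen in here
    return `speech(${reply})`;             // stand-in for TTS
  };
}

const voiceProxy = wrapWithVoice(chatAgent);
```

Because `wrapWithVoice` treats the agent as an opaque text-in/text-out function, adding voice requires zero changes to the tool layer — which is the claim this fact records.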

Memorable lines

Quotes

"2025 was the year of the chat agents. And I think you either, like, died as a SaaS or you became AI first by adding a chat agent to your app." — Luke Harries
"Chat's cool, but it doesn't feel like you're building the future though. And I really think voice is this natural medium." — Luke Harries
"You can literally, in about one prompt, actually convert an existing chat agent to a voice agent." — Luke Harries
"I think these chat agents will either die as chat agents or start adding voice." — Luke Harries
"Your chat agent actually normally does the majority of tool calling. So it's actually already built out on the back end here. And so you can have this wrapper without needing to deal with any of the issues of tool calling." — Luke Harries
"We start kind of moving to these higher abstraction bundles instead of just the pure text-to-speech, speech-to-text." — Luke Harries