What will voice AI look like in the future? | Drew Ross | TEDxBoston
The speaker argues that voice AI is becoming the fundamental interface between humans and machines, evolving from rigid keyword recognition into a truly conversational partner capable of anticipating needs and executing real-world tasks. This transformation rests on advances such as emotional embeddings and low-latency compute, promising a future where interacting with technology is as natural as human conversation. Finally, the speaker challenges the audience to act as leaders, innovators, and builders who guide this technology responsibly.
## Speakers & Context
- Drew Ross (named in the talk title), speaking at TEDxBoston.
- Addresses the audience by asking them to close their eyes and think back to their first words, using this memory to frame the discussion on communication.
## Theses & Positions
- Voice is the *purest bridge* between human minds and the highest-bandwidth form of human communication, predating writing.
- Conversational AI is moving beyond mere word recognition to understanding true *meaning* and *intent*.
- Today's conversational AI still lacks true conversational intelligence (reading emotion, recognizing intent, taking turns naturally) and suffers from high latency.
- The future of AI interaction will involve anticipation and seamless, real-world task execution, moving beyond just speech to action.
- Voice-first interfaces allow humanity to adapt machine behavior to human behavior, rather than adapting human behavior to machines.
## Concepts & Definitions
- **Voice:** The purest bridge for human interaction, predating written language.
- **Conversational AI:** The current stage of AI interaction, surpassing earlier limited systems.
- **Conversational intelligence:** The ability to understand emotion, recognize intent, and take turns naturally in dialogue.
- **Emotional embeddings:** Advances enabling AI voices to capture subtle cues in pitch and rhythm and to adapt to a speaker's mood in real time (see the sketch after this list).
- **Voice-first UI:** An interface that relies primarily on voice commands, eliminating the need for typing or clicking.
- **Jarvis:** Used as an archetype for the ideal, always-present, helpful personal AI assistant.
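The "emotional embeddings" idea stays conceptual in the talk. As one concrete reading, the Python sketch below summarizes pitch and rhythm cues from an utterance into a fixed vector that a voice model could condition on. The feature set, the random projection, and the name `prosody_embedding` are illustrative assumptions (using `librosa` for audio analysis), not the speaker's actual method.

```python
# Minimal sketch of an "emotional embedding": prosodic features (pitch,
# rhythm/energy) summarized into a fixed vector a downstream voice model
# could condition on. Feature choices and the projection are illustrative.
import numpy as np
import librosa  # pip install librosa

def prosody_embedding(wav_path: str, dim: int = 16) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)

    # Pitch contour (fundamental frequency) via probabilistic YIN;
    # unvoiced frames come back as NaN and are dropped.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    if f0.size == 0:
        f0 = np.zeros(1)

    # Energy envelope as a crude rhythm/loudness proxy.
    rms = librosa.feature.rms(y=y)[0]

    feats = np.array([
        np.mean(f0), np.std(f0),    # pitch level and variability
        np.mean(rms), np.std(rms),  # loudness level and variability
        len(y) / sr,                # utterance duration in seconds
    ])

    # Fixed-seed random projection standing in for a learned encoder.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((dim, feats.size))
    return W @ feats
```

A learned system would replace the hand-picked statistics with a trained encoder, but the shape of the idea is the same: continuous acoustic cues in, one mood-bearing vector out.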
## Mechanisms & Processes
- **Evolution of AI Voice:** Progressed from first-generation assistants (Siri, Alexa) relying on "rigid commands [and] keyword recognition" to today's systems powered by *Large Language Models* (LLMs).
- **AI Improvement Pipeline:** Current gaps are being addressed through:
- Training on **synthetic speech data** to improve emotional ability at scale.
- Developing **low latency compute** and robust integrations.
- **Future Functionality:** The technology aims to enable administrative assistance (scheduling, note-taking, basic decision-making), personalized education (filtering the internet into study plans), and voice-based programming (developing entire products without touching a keyboard).
- **Real-World Integration:** Already evident in the **$300 billion call-center market**, where AI agents hold a conversation and then autonomously perform backend actions (sketched below).
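Here is a minimal Python sketch of that converse-then-act pattern. The backend hooks (`lookup_order`, `refund_order`) are hypothetical, and a keyword check stands in for the LLM's intent detection; none of these names come from the talk.

```python
# Converse-then-act: a dialogue turn is mapped to an intent, and some
# intents trigger backend actions. Keyword matching is a stub for an LLM.
from dataclasses import dataclass

@dataclass
class Turn:
    user_text: str
    reply: str
    action: str | None  # backend action taken, if any

def lookup_order(order_id: str) -> str:
    return f"Order {order_id}: shipped"           # stub for a real backend call

def refund_order(order_id: str) -> str:
    return f"Refund issued for order {order_id}"  # stub for a real backend call

def handle_turn(user_text: str, order_id: str) -> Turn:
    text = user_text.lower()
    if "refund" in text:                  # an LLM would do real intent detection
        return Turn(user_text, f"Done. {refund_order(order_id)}.", "refund_order")
    if "where" in text or "status" in text:
        return Turn(user_text, f"{lookup_order(order_id)}.", "lookup_order")
    return Turn(user_text, "Could you tell me more about your issue?", None)

print(handle_turn("Where is my package?", "A1001").reply)
print(handle_turn("I want a refund.", "A1001").reply)
```

The point of the pattern is that the conversation and the action share one loop: the same agent that talks to the caller also touches the order system, with no human hand-off in between.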
## Timeline & Sequence
- **Early Life:** The ability to speak marks the start of extraordinary communication.
- **Past Technology:** First generation included Siri and Alexa, built for an era of *rigid commands*.
- **Recent Years:** Witnessed an "explosive advance in text-based intelligence" driven by LLMs, with Transformer- and diffusion-based models revolutionizing how machines listen and speak.
- **Present:** Capable of holding real conversations with AI via platforms like *chat.com* and the *OpenAI Realtime API* (a connection sketch follows this list).
- **Future State:** Anticipation and seamless execution of tasks, making voice the fundamental interface for all technology.
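Since the Realtime API is named explicitly, here is a hedged, minimal connection sketch in Python. The endpoint, model name, headers, and event types (`gpt-4o-realtime-preview`, `response.create`, `response.text.delta`) follow the API's early beta and may have changed since; treat them as assumptions to verify against current OpenAI docs.

```python
# Hedged sketch: open a text-modality session against the OpenAI Realtime
# API referenced in the talk. Endpoint, model name, and event types follow
# the API's early beta and may have changed; verify against current docs.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets versions call this kwarg `extra_headers`.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # One-shot demo: request a text response instead of audio.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text"],
                "instructions": "Suggest one baking recipe, briefly.",
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```

A real voice session would stream microphone audio in with `input_audio_buffer` events and play audio deltas back out; the text round trip above is just the smallest demonstrable loop.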
## Named Entities
- **Siri** — Example of a first-generation voice assistant.
- **Alexa** — Example of a first-generation voice assistant.
- **OpenAI** — Maker of the Realtime API used for live spoken conversations.
- **chat.com** — Platform where real-time conversations can occur.
## Numbers & Data
- **$300 billion** — Estimated size of the call-center market where AI agents are being deployed.
## Examples & Cases
- **Basic AI failure:** Systems giving "a complete nonsense answer" or the inability to grasp *meaning* or *intent*.
- **Administrative Voice Assistance:** Potential to schedule meetings, take notes, and make basic choices without typing or clicking (see the sketch after this list).
- **AI Teachers:** Potential to adapt tone and pacing to individual student needs and filter knowledge into personalized plans.
- **Voice-based Programmers:** Potential to develop entire products without manual keyboard interaction.
- **Call Center Agents:** Current example where AI agents converse and then autonomously perform backend actions.
- **Critical Professions:** Hands-free access and voice control systems benefiting surgeons, first responders, and industrial workers.
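To make the hands-free administrative flow concrete, here is a minimal Python sketch: speech is transcribed, a meeting request is parsed out, and a calendar event is booked. `transcribe` is a stub where a real speech-to-text model (e.g. Whisper) would plug in, and `schedule_meeting` is a hypothetical calendar hook; none of these names come from the talk.

```python
# Voice-to-action pipeline sketch: transcribe -> parse intent -> act.
import re
from datetime import datetime

def transcribe(audio_path: str) -> str:
    # Stand-in for a real speech-to-text call on audio_path.
    return "schedule a meeting with Sam on 2025-03-14 at 15:00"

def parse_meeting(text: str) -> dict | None:
    # Toy parser; an LLM would handle free-form phrasing.
    m = re.search(r"meeting with (\w+) on (\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2})", text)
    if not m:
        return None
    who, day, hhmm = m.groups()
    return {"with": who, "when": datetime.fromisoformat(f"{day} {hhmm}")}

def schedule_meeting(event: dict) -> str:
    # Hypothetical calendar-backend hook.
    return f"Booked: {event['with']} at {event['when']:%Y-%m-%d %H:%M}"

text = transcribe("memo.wav")
event = parse_meeting(text)
print(schedule_meeting(event) if event else "No meeting request found.")
```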
## Tools, Tech & Products
- **Large Language Models (LLMs)** — Revolutionized capability in understanding and generating human language.
- **Transformer- and diffusion-based models** — Technologies enabling more humanlike voice output.
- **AI agents** — Capable of both conversation and autonomous backend action.
- **Voice-first UIs** — The target interface concept for future technology.
## Trade-offs & Alternatives
- **Keyboards and touch screens:** The established method of interacting with technology, which the speaker suggests is an outdated model.
- **Adapting behavior to machines vs. adapting machines to us:** The core philosophical trade-off being presented.
## Counterarguments & Caveats
- Conversational AI *lacks* conversational intelligence, specifically around emotion and intent.
- Latency from heavy computation behind the scenes can break conversational immersion (a back-of-the-envelope budget follows this list).
- Current AI interactions are limited to speech, meaning they often fail to take "any real world action."
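The latency caveat is easy to quantify: every stage of a voice pipeline adds delay before the user hears a reply, while gaps in human turn-taking average only about 200 ms. The Python sketch below tallies an illustrative budget; every stage timing is an assumption chosen for illustration, not a measurement from the talk.

```python
# Back-of-the-envelope latency budget for a voice pipeline, showing why
# "heavy computations" break immersion. Stage timings are illustrative.
stages_ms = {
    "endpointing (detect user stopped)": 300,
    "speech-to-text": 250,
    "LLM time-to-first-token": 400,
    "text-to-speech first audio": 150,
    "network round trips": 100,
}

total = sum(stages_ms.values())
for name, ms in stages_ms.items():
    print(f"{name:<35} {ms:>5} ms")
print(f"{'total':<35} {total:>5} ms  (human conversational norm ~200 ms)")
```

Even with generous per-stage numbers, the sum lands several times above the human norm, which is why the talk pairs low-latency compute with emotional understanding as the two gaps to close.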
## Methodology
- Analogy progression: Using the evolution from baby speech -> Siri -> LLMs -> Anticipatory AI to structure the argument.
- Identifying core functional gaps in current AI (latency, emotional understanding, physical action).
- Addressing future state through examples across professional sectors (medicine, education, industry).
## Conclusions & Recommendations
- The speaker recommends that leaders push for *reliable and ethical* voice AI.
- Innovators should create new solutions harnessing voice intelligence.
- Builders should engineer robust platforms to scale voice-first experiences.
- The goal is to make interacting with technology as natural as a conversation.
## Implications & Consequences
- **Workforce Transformation:** Voice AI will fundamentally transform how work is done across industries, from call centers to manufacturing.
- **Accessibility:** Voice-first UIs promise "unprecedented independence" for people with disabilities, and break down language barriers by letting people command systems in their native tongue.
- **Interface Paradigm Shift:** The shift signals the decline of keyboards and touch screens as primary interaction modalities.
## Verbatim Moments
- *"voice is the purest bridge between one human mind and another it's the highest bandwidth form of communication us humans have"*
- *"they don't truly grasp meaning intent remotion"*
- *"Today we can hold real conversations with AI you can go to chat.com open up open AI realtime API and talk to it about anything from baking recipes to business ideas and even for emotional support"*
- *"understanding emotion recognizing intent and smoothly taking turns in dialogue are things us humans take for granted it's second nature to us but AI struggles with them"*
- *"AI will be more than just virtual assistance it will become the fundamental way we interface with all technology like a personal Jarvis"*
- *"it will mean breaking down language barriers and all human technology interaction allowing people to command intelligent systems naturally in their native tongue"*
- *"the future of human computer interaction will not be seen it will be heard"*