Alexis Conneau thinks a lot about the movie “Her.” For the last several years, he’s obsessed over trying to turn the film’s fictional voice technology, Samantha, into a reality.
Conneau even uses a picture of Joaquin Phoenix’s character in the movie as his banner on X.
Conneau’s X banner (Image Credit: X)
With ChatGPT’s Advanced Voice Mode, a project Conneau started at OpenAI after doing similar work at Meta, he kind of did it. The AI system natively processes speech and talks back much like a human.
Now, he has a new startup, WaveForms AI, that’s trying to build something better.
Conneau spends a good chunk of time thinking about how to avoid the dystopia shown in that movie, he told TechCrunch in an interview. “Her” was a science fiction film about a world where people develop intimate relationships with AI systems, instead of other humans.
“The movie is a dystopia, right? It’s not a future we want,” said Conneau. “We want to bring that technology – which now exists and will exist – and we want to bring it for good. We want to do precisely the opposite of what the company in that movie does.”
Building the tech, minus the dystopia that comes with it, seems like a contradiction. But Conneau intends to build it anyway, and he’s convinced his new AI startup will help people “feel the AGI” with their ears.
On Monday, Conneau launched WaveForms AI, a new audio LLM company training its own foundation models. It aims to release AI audio products in 2025 that compete with offerings from OpenAI and Google, and it has raised $40 million in seed funding led by Andreessen Horowitz.
Conneau says Marc Andreessen – who previously wrote that AI should be part of every aspect of human life – has taken a personal interest in his endeavor.
It’s worth noting that Conneau’s obsession with the movie “Her” may have landed OpenAI in trouble at one point. Scarlett Johansson sent a legal threat to Sam Altman’s startup earlier this year, ultimately forcing OpenAI to take down one of ChatGPT’s voices that strongly resembled her character in the film. OpenAI denied ever trying to replicate her voice.
But it’s undeniable how much the movie has influenced Conneau. “Her” was clearly science fiction when it was released in 2013 — at the time, Apple’s Siri was quite new and very limited. But today, the technology feels scarily within reach.
AI companionship platforms like Character.AI reach millions of users weekly who just want to talk with their chatbots. The sector is emerging as a popular use case for generative AI — despite occasionally tragic and unsettling outcomes. You can imagine how someone who types with a chatbot all day would love the chance to speak with it too, especially using tech as convincing as ChatGPT’s Advanced Voice Mode.
The CEO of WaveForms AI is wary of the AI companionship space, and it’s not the core of his new company. While he thinks people will use WaveForms’ products in new ways – such as talking to an AI for 20 minutes in the car to learn about something – Conneau says he wants the company to be more “horizontal.”
“[WaveForms AI] can be that teacher that inspires, you know, maybe that teacher that you wouldn’t have in your life, at least, your physical life,” said the CEO.
In the future, he believes, talking to generative AI will be a more common way to interact with all kinds of technology, including your car and your computer. WaveForms aims to supply the “emotionally intelligent” AI that facilitates it all.
“I don’t believe in the future where human-to-AI interaction replaces human-to-human interaction,” said Conneau. “If anything, it’s going to be complementary.”
He says AI can learn from the mistakes of social media. For instance, he thinks AI shouldn’t optimize for “time spent on platform,” a common metric of success for social apps that can promote unhealthy habits, like doomscrolling. More broadly, he wants to make sure WaveForms’ AI is aligned with the best interests of humans, calling this “the most important work you could do.”
Conneau says OpenAI’s name for his project, “Advanced Voice Mode,” doesn’t really do justice to how different the technology is from ChatGPT’s regular voice mode.
The old voice mode was a somewhat hacked-together pipeline: it transcribed your voice into text, ran that text through GPT-4, and then converted the model’s reply back into speech. With Advanced Voice Mode, Conneau says, GPT-4o instead breaks the audio of your voice down into tokens (roughly three tokens per second of audio, apparently) and runs those tokens directly through an audio-specific transformer model. That, he explained, is what enables Advanced Voice Mode’s low latency.
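To put that token figure in perspective, here is a back-of-the-envelope calculation in Python. The roughly-three-tokens-per-second rate is Conneau’s figure from above; the 128,000-token context window is an assumption for illustration, not a confirmed GPT-4o spec.

```python
# Rough audio token budget, assuming ~3 tokens per second of speech
# (Conneau's figure) and an assumed 128k-token context window.
TOKENS_PER_SECOND = 3
ASSUMED_CONTEXT_WINDOW = 128_000  # illustrative assumption, not a published spec

def audio_tokens(seconds: float) -> int:
    """Approximate number of tokens a clip of this length becomes."""
    return round(seconds * TOKENS_PER_SECOND)

for minutes in (1, 20, 60):
    tokens = audio_tokens(minutes * 60)
    share = tokens / ASSUMED_CONTEXT_WINDOW
    print(f"{minutes:>2} min of audio ≈ {tokens:>5} tokens ({share:.1%} of context)")
```

At that rate, even an hour-long conversation consumes under a tenth of a large context window, which helps explain why running speech directly through a transformer, rather than bouncing it through transcription, is practical at all.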
One claim that gets thrown around a lot in discussions of AI audio models is that they can “understand emotions.” Much as text-based LLMs learn patterns from heaps of text documents, audio LLMs learn patterns from audio clips of humans talking. Humans label these clips as “sad” or “excited,” so the models come to recognize similar voice patterns in your speech and can even respond with emotional intonations of their own. It’s less that they “understand emotions” and more that they systematically recognize the audio qualities humans associate with those emotions.
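To make that concrete, here is a deliberately toy sketch in Python. The features, numbers, and labels are all invented for illustration; real audio models learn far richer representations from vastly more data, but the underlying mechanism is the same kind of pattern matching against human labels.

```python
import math

# Hypothetical human-labeled clips: (mean pitch in Hz, loudness 0-1) -> label.
# All values are made up for this example.
LABELED_CLIPS = [
    ((310.0, 0.85), "excited"),
    ((180.0, 0.30), "sad"),
    ((220.0, 0.55), "neutral"),
]

def classify(features: tuple[float, float]) -> str:
    """Return the label of the closest labeled example (1-nearest-neighbor)."""
    nearest = min(LABELED_CLIPS, key=lambda clip: math.dist(clip[0], features))
    return nearest[1]

# A high-pitched, loud input lands nearest the clip humans tagged "excited".
print(classify((300.0, 0.9)))  # -> excited
```

Label enough clips and a model will reliably echo the statistical signature of “excited” or “sad” without anything resembling comprehension, which is exactly the distinction drawn above.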
Making AI more personable, not smarter
Conneau is betting that today’s generative AI doesn’t need to get significantly smarter than GPT-4o to yield better products. Instead of improving the underlying intelligence of these models, as OpenAI is doing with o1, WaveForms is simply trying to make AI better to talk to.
“There will be a market of people [using generative AI] who will just choose the interaction that is the most enjoyable for them,” said Conneau.
That’s why the startup is confident it can develop its own foundation models — ideally smaller ones that will be less expensive and faster to run. That’s not a bad bet, given recent evidence that the old AI scaling laws are slowing down.
Conneau says his former co-worker at OpenAI, Ilya Sutskever, often talked to him about trying to “feel the AGI” – essentially, using a gut feeling to assess whether we’ve reached artificial general intelligence. The CEO of WaveForms is convinced that achieving AGI will be more of a feeling than the crossing of some benchmark, and that audio LLMs will be the key to that feeling.
“I think you’ll be able to feel the AGI a lot more when you can talk to it, when you can hear the AGI, when you can actually talk to the transformer itself,” said Conneau, repeating comments he made to Sutskever over dinner.
But as startups make AI better to talk to, they also take on a responsibility to ensure people don’t get addicted. Andreessen Horowitz general partner Martin Casado, who helped lead the investment in WaveForms, says it’s not necessarily a bad thing if people talk to AI more often.
“I can go talk to a random person on the internet, and that person can bully me, that person can take advantage of me… I can talk to a video game which could be arbitrarily violent, or I could talk to an AI,” said Casado in an interview with TechCrunch. “I think it’s an important question to study. I will not be surprised if it turns out that [talking to AI] is actually preferable.”
Some companies may consider a user developing a loving relationship with their AI a marker of success. But from a societal standpoint, it could also be seen as a marker of total failure, much like the dystopia “Her” depicted. That’s the tightrope WaveForms now has to walk.