The oft-quoted line is that people overestimate what can be done in two years and underestimate what can be done in ten. It’s been a year since DALL·E launched and a month since GPT-4 did. Before the next ten years are out, we’ll have this:
That was a five-year-old girl ("Jesse") talking to a cat AI ("Blue"). Blue is fully controlled by ML tooling. It has an animated, emotive digital face and a body that moves like you’d expect from a cartoon character. Blue’s purpose is to get Jesse to tell a story that it then turns into an animated book. This was Blue’s first time meeting Jesse. It will take 5-10 minutes for her to become comfortable with him, and then she’ll tell him things she hasn’t yet told mom, like what she wants to be when she grows up. It will all go into that session’s book, which she’ll co-create from an ML-generated starting point, and then mom and dad will read the book with her before bed.
If we’re anchored at point A today, then the lighthouse I’m steering towards is a communicative audiovisual AI that feels just like I’m talking to a [likely cartoon] character with a human behind the curtain. You can call Blue on the phone or open our app for a more interactive experience. Anyone can engage and not just for companionship, but for entertainment and creation and imaginative exploration and interactive learning.
Blue is just one character of many, each with their own personality. Those personalities can be thematic, such as a biblical figure, a Renaissance artist, or a soccer coach. Each would be favored by a different subset of people, especially when it comes to parents and their kids. A 1:1 session lends itself to interactive possibilities like digital drawings or story creation, while a group session lends itself to others, like gaming or a Socratic discussion.
I’m only going to list businesses that I think have tremendous $ potential. Happy to talk about others offline.
- The first one is gaming. Building NPCs that take a background as input and then generate convincing communicative experiences with no further work is a big deal. This is roughly Inworld’s business, but as far as I can tell they work mostly with big studios and require a lot of work on the developer [customer] side to make the character movement great. We could target anyone who wants to make a game or story and let them incorporate AIs that have a life of their own, without any developer interference past the first stage.
- Another is interviews for menial jobs. McDonald’s or Walmart could replace their phone and in-person screening interviews with this.
- Another is what Google tried with Duplex, which was arguably just too early. People would pay to handle the obnoxious and not fun conversations, such as dealing with Comcast, etc.
- Then there’s the companion route for kids, adults, or seniors. These would be three very different products, but each has a playbook with the chance to become a huge business. The kids one I understand the best, and the playbook there is to break even on mobile app revenue while building at least one character with high brand value. That is what turns it into a billion-dollar-plus business, because then there are lots of licensing opportunities. Along the way, you sell animated bedtime stories of the interactions, generated imagery, toys, etc., with the goal of turning at least one of the characters into a notable brand.
Today, we can imbue agents with response superpowers via LLMs. Very quickly, we’re seeing that this has opened up other possibilities with explorations like BabyAGI and AutoGPT. What happens in the world I’m positing, which has communicative audiovisual agents with whom it feels great to interact? Here are two directions:
- Every character in a private or business setting becomes an opportunity to connect. This is most apparent in museums and similar curated experiences, where every painting can come to life through your device [and eventually with just our tech, sans phone]. It can talk to you about what the painting represents and the life of the artist. It can also just have fun with you and be something different, where the museum controls what is input as the backstory. But museum here is a stand-in for any private business with control. Every face can come to life with a personality and purpose controlled by the business, which means there’s tremendous possibility to do something materially different on this new platform, which we would control.
- Every character in the public world becomes an opportunity to change how people interact with the world. With the tech we’d build here, anyone could draw a face on a piece of paper or on a wall and bring it to life through their phone; this would be their avatar, for which they could build a backstory and more. That includes street art, which would all come to life and be controllable by the street artist. Different from the above, this is public, which means that its memory is not person-specific but actually global, and that’s a manifestly different outcome. Who uses this, and what do they do with it? I’m not sure yet, but it is an exceptionally different world where you can walk up to a face on a wall and expect it to have something interesting to say or do. It’s closer to a guerrilla version of Pokémon Go, which is a huge business already.
A first go at this would entail an ASR system to transcribe the user’s speech, an LLM to generate the next utterance, and a TTS system to say it back as speech, plus very important engineering to make the speech feel natural and in line with how humans do conversation, not just iterative texting. Expanding this to video requires a controllable animation approach, such as FACS, JALI, or work from the first-order motion model (FOMM) lineage.
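To make the control flow concrete, here is a minimal sketch of that ASR → LLM → TTS loop. The three components are hypothetical stand-ins (`fake_asr`, `fake_llm`, `fake_tts` are toys of my own naming; any real ASR, LLM, or TTS engine could be plugged in behind the same callables), so this shows the turn structure rather than a real implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ConversationLoop:
    """One pluggable pipeline per character: audio in -> audio out."""
    asr: Callable[[bytes], str]       # user audio -> transcript
    llm: Callable[[List[str]], str]   # dialogue history -> next utterance
    tts: Callable[[str], bytes]       # utterance -> synthesized audio
    history: List[str] = field(default_factory=list)

    def turn(self, user_audio: bytes) -> bytes:
        """One conversational turn: transcribe, respond, synthesize."""
        transcript = self.asr(user_audio)
        self.history.append(f"user: {transcript}")
        reply = self.llm(self.history)
        self.history.append(f"character: {reply}")
        return self.tts(reply)

# Toy stand-ins so the sketch runs end to end (not real engines).
fake_asr = lambda audio: audio.decode()
fake_llm = lambda history: "Hi Jesse! What story should we tell today?"
fake_tts = lambda text: text.encode()

loop = ConversationLoop(asr=fake_asr, llm=fake_llm, tts=fake_tts)
audio_out = loop.turn(b"hi Blue")
```

The point of the shape is that each stage is swappable independently, which matters later when we replace the ASR stage entirely with a direct speech-to-speech path.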
Beyond this first go, there’s some cool research shit that we will want to do to make the experience feel seamless. For example, listen to how the cadence and intonation of Blue’s voice is excitatory for Jesse. Or how the laughter hits at the right time. Plus check out the hand movements that make Blue feel natural to her. It’s an incredibly personalized character animation that Jesse loves.
And of course there’s everything about memory, which is challenging in its own right. That needs to work for both private experiences and public experiences, which are each a very different setup. All of this is important. We also need to get the wav → face animation done really well. And we want to make the character aware of its environment, whether internally in the app or externally in the world. Eventually, we want to drop ASR and stay in an audio-only world, going directly from speech to speech.
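The private vs. public memory split above can be sketched simply: a private character remembers each person separately, while a public character (like a street-art face) keeps one global memory shared by everyone who visits it. The class and names below are hypothetical illustrations of that scoping, not a real memory system:

```python
from collections import defaultdict

class CharacterMemory:
    """Scoped memory: per-person for private characters, shared for public ones."""

    def __init__(self, public: bool):
        self.public = public
        self._store = defaultdict(list)  # scope key -> remembered facts

    def _key(self, person_id: str) -> str:
        # Public characters collapse all visitors into one shared scope.
        return "global" if self.public else person_id

    def remember(self, person_id: str, fact: str) -> None:
        self._store[self._key(person_id)].append(fact)

    def recall(self, person_id: str) -> list:
        return list(self._store[self._key(person_id)])

blue = CharacterMemory(public=False)   # private 1:1 companion
blue.remember("jesse", "wants to be an astronaut")

mural = CharacterMemory(public=True)   # street-art face with global memory
mural.remember("alice", "someone asked about the artist")
```

With this split, Blue recalls Jesse’s facts only for Jesse, while the mural recalls the same shared history no matter who walks up, which is exactly the “manifestly different outcome” of public characters.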
To do this complete justice, there is a copious amount of research to complete, some of which I highlight here. I have ideas on how to tackle every question in that doc that I think will work, plus there’s a lot of energy in the research community that will help expedite our progress in all areas. However, note that we don’t need to resolve every question to pull this off well, and each one has milestones we can hit along the way. This is the right time to start!
- We would have time to build the ideal solution, which will require machine learning work, some legs of which might take on the order of 1-3 years; as you know, ML is hard to predict.
- Along the way, we are held accountable for milestones that start with more straightforward solutions and trend towards the ideal one. We have a product goal, and we should strive to strike the right balance between long-term ideal solutions and short-term good-enough solutions.
- Those milestones coincide with other internal objectives so that what we do in this direction helps the team more broadly.
- A resourceful environment that cares about bringing interactive and creative AI into the world, not just assistants.
- Those resources include compute.
- They also include other people who are energized by these directions.