Part 1: Communicate with a Character

(April 10th, 2023)
Today, in the guise of large language models, we have excellent prompt-responding agents. They are phenomenal assistants, and I expect them to get much better.
However, we still have a lot of research to do in building a creative agent that is communicative. We know this because there is no current expectation that a small amount of clever engineering will produce an AI that convincingly plays Samantha, the in-ear AI character from Her. Conditioned on us having walking, dexterous robots, there is a similar story for Rosie from The Jetsons.
Below are 14 research questions towards building a creative agent that is more capable of communicating with humans. I think just one of them is reasonably answered, and that’s the first one, on what to say (as of GPT-3.5). The goal of this research would be to build an agent like Samantha or Rosie. It’s not just an assistant. It’s an agent with whom we want to communicate, and with whom it is easy not just to convey information but to actually share state, convey emotion, and reach deep understanding. This would be a creative agent that perhaps even has its own self-intent. It’s the difference between the fidelity of a text <prompt, response> loop and that of a high-fidelity audio[visual] communication.
There are just questions in this post, plus a few illustrative code sketches after the list. In the next post, I’ll add (hypothesized) solutions.
  1. Problem: What should I say? Given a conversation history (a sequence of utterances) involving another participant, what should be said next? This is the most solved of the problems, and it was solved only recently, with LLMs (a minimal sketch follows the list). Doing it directly in audio, without text in between, is another frontier.
  1. Problem: I don’t want to reply. How can we choose not to respond? People may tell the agent things that don’t require an answer: rhetorical questions, or statements that simply need no reply. Right now, every query gets a response, but eventually we have to move away from that because it’s unnatural for realistic communication (a response-gate sketch follows the list).
  1. Problem: I want to lead, not just respond. The current LLM paradigm is a <prompt, response> cycle, and it’s quite hard to shake the agent from doing that and have it prompt us instead. At a meta level, it works to tell the agent that it’s talking to another person and should be engaging; the agent will often make its own queries then (this prompt trick is sketched after the list). But that just means we’ve shifted the responsibility for the <prompt, response> to the human (or engineering) wizard running the show.
  1. Problem: How should I time my reply? How do we know when to respond? Sometimes an agent should interrupt the speaker. Other times, it should wait a few beats to let something settle (a turn-taking sketch follows the list).
  1. Problem: When should I stop speaking? How do I know when to stop talking? The default is to stop when I’m done saying what I wanted to say, but sometimes I should be interrupted, be OK with it, and maybe even never say what I was going to say (the same turn-taking sketch covers barge-in).
  1. Problem: Audio communication has more sounds than just the words. How do we know when to give acknowledgement feedback to the speaker? These take the form of mm-hms, nods, rights, yeps, and also sounds of doubt.
  1. Problem: Communication is connection as well as information exchange. How do we know when to have the agent exhibit personally connecting behaviors like speaking the name of the other person? That changes the tenor terrifically and adds emotion and connection.
  1. Problem: Statements can be questions, and questions can be statements. How can we know when to formulate the response as a question vs. as an imperative? People do this to hone their true meaning. An example would be guiding someone to an answer by framing it as a Socratic discussion rather than a commandment.
  1. Problem: People use follow-up questions to get on the same page. How do we know when we have enough information to formulate a coherent response in line with what the other participants have said? It already happens fairly frequently with LLMs that, because they must respond with information rather than with a follow-up, users must be very specific or endure a litany of “well, no, that’s not what I was looking for” experiences.
  1. Problem: A communicative agent can gauge what type of answer is sought, and otherwise ask. How do we know when the other agent (human) actually wants us to answer with solutions vs. with information vs. with understanding vs. with inquiry? This is important for agents to act with communicative intent and not just produce answers.
  1. Problem: I am talking with this party. How can we home in on who is speaking to the agent so that there aren’t problems in crowded areas? For example, we could spend the first few utterances building a signature of the person speaking in order to tune out others; if another person comes along and starts speaking, we would add that as a different signature and then ask the original person whether to include this new person in the conversation (a voice-signature sketch follows the list). Another version would be to turn on a video camera and look for recognition in nearby faces as to whom we should be speaking with.
  1. Problem: I feel a certain way and that feeling comes out in what I say. How do we add intonation that is properly expressive? In other words, how do I add the emphasis to connote excitement, curiosity, anger, etc.? This is important because it allows a lot more understanding to pass through the same channel (an SSML sketch follows the list).
  1. Problem: I feel a certain way and that feeling comes out in how my face acts. Conditioned on having a face, how do we respond to and interact with the speaker? This includes movements like raising eyebrows, shifting eyes, smiling, frowning, etc.
  1. Problem: I feel a certain way and that feeling comes out in how my body acts. Conditioned on having a body, how do we respond to and interact with the speaker? This includes expressing emotions through the body such as being shy or being very interested, moving away, and also hand movements.
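
To make a few of these concrete, here are some code sketches: problem setups, not the hypothesized solutions I’m saving for the next post. For the first problem, “what should I say?” is exactly the chat-completion interface. This minimal sketch assumes the 2023-era openai Python package (pre-1.0 API) and an OPENAI_API_KEY in the environment:

```python
# "What should I say?" as next-utterance prediction over a history.
import openai  # assumes the pre-1.0 openai package and OPENAI_API_KEY set

def next_utterance(history):
    """Given a conversation history (a list of {"role", "content"} dicts),
    return the agent's next utterance."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=history,
    )
    return response.choices[0].message.content

history = [{"role": "user", "content": "I had a rough day at work."}]
print(next_utterance(history))
```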
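
For the don’t-want-to-reply problem, one plausible shape is a gate in front of the reply model. Here the gate is faked with a second LLM call voting YES or NO; the prompt wording is invented for illustration, and a small fine-tuned classifier could serve instead. This reuses next_utterance from the sketch above.

```python
# A response gate: decide whether the last utterance needs a reply at all.
import openai  # same pre-1.0 client assumption as above

GATE_PROMPT = (
    "Does the last message require a reply, or is it rhetorical or a "
    "statement that needs no answer? Respond with only YES or NO."
)

def needs_reply(history):
    probe = history + [{"role": "system", "content": GATE_PROMPT}]
    verdict = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=probe
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def maybe_reply(history):
    # Staying silent is returning None: no utterance at all.
    return next_utterance(history) if needs_reply(history) else None
```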
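
The meta-level trick for leading rather than responding is, today, just a system prompt. A sketch, with the wording invented for illustration:

```python
# Pushing "leading" behavior into the system prompt rather than the agent.
import openai  # same pre-1.0 client assumption as above

LEAD_PROMPT = (
    "You are talking with another person, not answering queries. Be "
    "engaging: bring up topics yourself, ask your own questions, and do "
    "not wait to be prompted."
)

def lead(history):
    messages = [{"role": "system", "content": LEAD_PROMPT}] + history
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages
    )
    return response.choices[0].message.content
```

As the problem statement says, this only relocates the <prompt, response> responsibility; the agent still isn’t choosing to lead.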
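
For the two timing problems, here is turn-taking as a policy over a voice-activity stream. Everything here is assumed scaffolding: vad_frames and user_is_speaking would come from a real voice-activity detector (e.g. webrtcvad), and the pause threshold is a guess.

```python
# Turn-taking as a timing policy over (timestamp, is_speech) frames.
PAUSE_BEFORE_REPLY = 0.7  # seconds of silence before taking the turn (a guess)

def wait_for_turn(vad_frames):
    """Block until the speaker has been silent long enough to reply.
    vad_frames yields (timestamp_seconds, is_speech) pairs."""
    silence_started = None
    for timestamp, is_speech in vad_frames:
        if is_speech:
            silence_started = None
        elif silence_started is None:
            silence_started = timestamp
        elif timestamp - silence_started >= PAUSE_BEFORE_REPLY:
            return timestamp

def speak_interruptibly(chunks, play_chunk, user_is_speaking):
    """Play audio chunk by chunk, abandoning the rest on barge-in."""
    for chunk in chunks:
        if user_is_speaking():
            return False  # interrupted: maybe never say what I was going to say
        play_chunk(chunk)
    return True
```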
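
For the whom-am-I-talking-with problem, the voice-signature idea from the list sketches out like this. embed_voice is a stand-in for a real speaker-embedding model (resemblyzer’s VoiceEncoder is one option), and the similarity threshold is illustrative, not tuned.

```python
# Locking on to one speaker by enrolling a voice signature.
import numpy as np

SAME_SPEAKER_THRESHOLD = 0.75  # illustrative, not tuned

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerGate:
    def __init__(self, embed_voice):
        self.embed_voice = embed_voice  # audio -> fixed-size vector (assumed)
        self.signature = None

    def accept(self, utterance_audio):
        """True if this utterance matches the enrolled speaker; the first
        utterance enrolls the signature."""
        emb = self.embed_voice(utterance_audio)
        if self.signature is None:
            self.signature = emb
            return True
        return cosine(self.signature, emb) >= SAME_SPEAKER_THRESHOLD
```

A second signature, and the ask-the-original-person step, would layer on top of this in the obvious way.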
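
Finally, for intonation, the closest thing to a current answer is markup: SSML prosody tags, which mainstream TTS services (Amazon Polly, Google Cloud Text-to-Speech) accept. The rate/pitch values per emotion below are illustrative guesses.

```python
# Wrapping text in SSML prosody tags to connote an emotion.
PROSODY = {
    "excited": '<prosody rate="fast" pitch="+15%">{}</prosody>',
    "curious": '<prosody rate="medium" pitch="+5%">{}</prosody>',
    "angry":   '<prosody rate="fast" pitch="-5%" volume="loud">{}</prosody>',
}

def with_emotion(text, emotion):
    body = PROSODY.get(emotion, "{}").format(text)
    return "<speak>{}</speak>".format(body)

print(with_emotion("You finished the whole thing?", "excited"))
```

This is expression bolted on after the fact; the research question is how the feeling itself should drive the rendering.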