The State of AI with Rowan Cheung

Mustafa Suleyman on Copilot Vision, AI Companions, Infinite Memory, AI Agents, and more

Rowan Cheung

Exclusive: Microsoft just launched Copilot Vision in Edge—the first AI that can navigate the internet with you in real time.

Rowan Cheung (@rowancheung) sat down for an exclusive interview with Mustafa Suleyman (@mustafasuleyman), CEO of Microsoft AI, to discuss how it works, infinite memory, AI companions, agents, and more.

Find all the details of Copilot Vision on Microsoft’s official announcement blog post.


Join our daily AI newsletter: https://www.therundown.ai/subscribe
Learn AI hands-on with our AI University: https://rundown.ai/ai-university/

Rowan Cheung: Thanks so much for joining me today, Mustafa. Obviously, there's a ton of news coming out of Microsoft AI today, so I'm going to get right into it. First, can you give us a quick rundown of everything that was released, and why this is an important moment for AI?

Mustafa Suleyman: Well, the first thing to say is that we are on a mission to create a true AI companion. And to me, an AI companion is one that can hear what you hear and see what you see, and live life essentially alongside you. Your AI companion will be able to remember everything that you've talked about session to session, understand the content of the web pages that you browse, and be able to talk to you just like I'm talking to you now.
So it's going to have this seamless, fluid, very smooth conversational interaction. The first version of the voice is already out, but in the next couple of days we're launching Vision, which we announced a few weeks back. It's really a magical experience, and it's quite different from any kind of AI, or even any general computer interaction experience, that we've seen before. It feels like a new interface: being able to make an ambiguous reference to what you see on the screen and say, "Hey, Copilot, what is that? That doesn't make sense. Describe that to me. This is kind of interesting." The use of "this" or "that," because you know that Copilot can see what you're currently seeing, is a fundamentally different way to interact with computers. So I'm very excited about it.

Rowan Cheung: Yeah, Copilot Vision is a huge breakthrough. I was lucky enough to get access before this interview, and it got me incredibly excited. One particular use case that really opened my eyes to the technology is travel planning. I'm actually traveling to Switzerland in a couple of months, and I went on Airbnb to look for a place and just asked Copilot: hey, where are the nice areas of town that I should look at, and what are the places I should try to avoid? It instantly started guiding me through the entire map, giving me suggestions of particular parts of the city I should check out, like by the lake or by a specific restaurant. And then it got even crazier when I clicked into the actual places and scrolled through the reviews super quickly, too fast for me to even read a single review, and just asked it: hey, what are some things I should look out for based on what these other people are saying?
And it just gave me this list of potential cons of the place. It opened my eyes to how valuable this is, because I could not read that fast, and Copilot summarized it instantly. I'm super excited about the potential use cases and what other people are going to find. But enough of my initial reactions. I'd love to hear from you: what kinds of use cases do you think this is really going to solve, and are there any practical things you and the Microsoft AI team have tinkered with internally that really excite you?

Mustafa Suleyman: Well, the one you mentioned is a genuine one, right? Reviews are super important, whether you're buying a washing machine or a car or trying to book a holiday, and traditionally forums have been the place for that: reams and reams of detailed, very technical, quite niche interactions, probably intermixed with a bunch of trolls ranting at each other with hot takes on all kinds of things. So they're really important, and they contain tons of information, but at the same time they're tiring to read, and most of us are too lazy to go through all of that. The cool thing about the Copilot Vision experience is that it has three components. First, there's the underlying LLM, so it has all the knowledge of those reviews from across the web. Second, it parses the text on the page that you're reading instantly, so you don't even have to scroll down; it absorbs everything in the DOM straight away. And third, it sees exactly the image that you're looking at.
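For readers who want a concrete picture, the three components Suleyman lists, the model's own web knowledge, the parsed DOM text, and the current screenshot, can be thought of as one combined multimodal request. The following is only an illustrative sketch: `PageContext`, `answer_about_page`, and the `llm` callable are invented names for this example, not Microsoft's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageContext:
    dom_text: str          # full visible text parsed from the DOM, no scrolling needed
    screenshot_png: bytes  # exactly what the user currently sees on screen


def answer_about_page(question: str, page: PageContext,
                      llm: Callable[..., str]) -> str:
    """Combine the three signals into a single multimodal request.

    1. The model's own training supplies broad web knowledge (e.g. reviews).
    2. page.dom_text gives instant access to all the page's text.
    3. page.screenshot_png grounds 'this'/'that' references in the image.
    """
    prompt = (
        "You are a browsing companion. Use the page text and the screenshot "
        "to answer the user's question.\n\n"
        f"PAGE TEXT:\n{page.dom_text}\n\n"
        f"QUESTION: {question}"
    )
    # llm stands in for any multimodal chat endpoint taking text plus images
    return llm(prompt, images=[page.screenshot_png])
```

In a real browser integration, something like `PageContext` would be refreshed on every page load, so references such as "what are these reviewers saying?" always resolve against the page currently in front of the user.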
So let's say you wanted to move into a new apartment and you're looking for furniture. If you're not an expert and you don't know the language of soft furnishings or fashion, it's quite difficult to describe exactly what you're seeing on the page, let alone have a strong opinion about whether you like it or not. So a big part of having an AI companion is having that kind of expert support in your pocket: giving you advice, giving you feedback, co-describing what you see. And I think that feels very different from any other kind of experience that we've had previously.

Rowan Cheung: Yeah. How deeply does Copilot Vision understand you as a user? My question is: how long until it truly starts memorizing, learning you as a user over time?

Mustafa Suleyman: Great question. Memory is the key thing that is coming soon. Certainly not this year, but we are working on it as a number-one priority, because it's super important that it remembers what your preferences are and is able to reason over them, and give you advice based on the fact that it knows that you really don't like modernist furniture, you like more traditional stuff, or you're really not into big, bright, bold colors, because last time you talked about those curtains or that sofa or those clothes, you were kind of icked out by them. That, I think, is going to be a real breakthrough when it comes. But even now, it's just fun to co-browse or co-explore the digital world with your Copilot, whether that's reading the reviews on the page together or looking at the images while you're scrubbing through an Airbnb trying to decide. It's a big expense, right? You're going to spend a few hundred dollars a night, perhaps more, perhaps less, and you've never been there.
And all you have is 20 photos to go on. Making that decision independently is tricky sometimes. So having Copilot there to think out loud with you: that's the way I think about it. It's a kind of sounding board for you to make sense of everything in your visual experience. And we've seen, particularly in some of the experiments we've been doing internally on social media, Instagram and so on, that people really appreciate being able to scroll through their feeds, and when they laugh at something, or they're surprised by something, or they're a bit disgusted by something, having their Copilot there to share in that experience is pretty cool.

Rowan Cheung: So obviously Microsoft is a major investor in OpenAI, and ChatGPT has this teased, yet-to-be-released version of their vision product. How is Microsoft AI differentiating itself from other competitors?

Mustafa Suleyman: The main thing is that we're really leaning into the idea of it being a proper companion. Just the fluency of our voice: how smooth it is, how fast it is, how interruptible and easy to talk to it is. A lot of people have remarked on that, and that's a very deliberate design intention. Putting vision inside the browser is the next step. Now, in Edge, you just have it with you all the time, able to watch and learn and talk to you about what you're doing. That is a really big differentiator, which I think those guys just don't have yet. But fundamentally, the way to think about it is that there's a small group of us really innovating at the frontier, all pushing the envelope and going as fast as we can to pull all these capabilities in. And we have a whole chunk more on the way in terms of generative UI experiences that create
a very immersive, interactive experience that unfolds on the fly in response to what you say, what you do, and what you see.

Rowan Cheung: Wow. Yeah. Something that really stood out to me, just talking with Copilot Vision, was how personable it really was. It really did feel like that friend, almost. It even gave me some sass at some points. I said, "Oh, I don't really like this place," and it went, "Oh no, this place is pretty nice." It kind of laughed at some of my jokes that were not good, and it even understood the tone of my voice, when I was more frustrated or when I was really excited about something. So it really seems like Vision is pushing us in the direction of the AI companion, that true personal assistant, rather than the traditional chatbot of the original Copilot or ChatGPT.

Mustafa Suleyman: Yeah, I'm glad you picked up on that, because what you've described is exactly the line that we're trying to draw. When it occasionally pushes back on you, that's a profound moment, because a true friend would do that, right? No one wants a sycophantic AI that always mirrors you and always obeys you; that's not going to be interesting for very long. But at the same time, it's got to be respectful. It's got to be really aligned to you, on your side, looking out for you. So we have to get the balance right in how it shares your energy: if you're dour and sad and you slow down the pace of your words, it will bring an appropriate vibe for that, but if you're super fast, excited, and enthusiastic, it will mirror that energy. And that back and forth, that chemistry if you like, is really amplified by
Copilot also being able to see what you're currently seeing at that moment, and what it is that you're excited about. So yeah, I think it really creates a very different dynamic in the interactions. It's very exciting.

Rowan Cheung: It is really exciting. I think a lot of people are going to try it and be shocked at how good it really is. That really stood out to me as well. I guess, looking forward ten years from now, what role do you think these personal AI assistants will have in our lives? What are some of the most interesting things you think might happen when we have this co-intelligence living with us on the web, or by our sides on our phones?

Mustafa Suleyman: If you think about how much time we spend on our laptops and phones today, we've created this entire, basically arbitrary, made-up graphical user interface to accommodate the fact that computers are too dumb to understand the words coming out of my mouth. Right? The browser, the fact that you have to press a button, the fact that you have all these different apps, menu drop-downs, scrolling. The entire user interface is predicated on the idea that to get a computer to do something, you have to be able to write code, because it doesn't speak the language that I use to ask you, or a friend, to do something. And that is all going to get washed away. Now your computer, your AI, your Copilot, is clearly going to understand everything that you're bringing to the table: your emotional state, your intellectual state, what you need to get done that day, your interests, your hobbies, your personal knowledge graph, your family, your dislikes. So it's not just that it speaks our language; it's that it is able to reason over what we see, what we hear, and what we believe and think.
So it's more than just an interface. It is a new plane of connection that I think is fundamentally different, and it's going to feel, as I've long said, like a new digital species. It is going to feel like a member of the family, like another layer of connectivity, because you're going to have an AI, I'm going to have my AI, and those AIs are going to connect with one another in advance, brief you, brief me, and follow up afterwards. So it's kind of going to be like a second brain. I think of it as outsourcing a lot of the mental processing to a very reliable, highly accurate, completely interactive thought partner and companion that is going to make me much smarter and more productive and help me feel more supported. That's very, very different from just using a computer the way we do today.

Rowan Cheung: Yeah, such different times. With any sort of powerful AI application such as Copilot Vision, it needs copious amounts of data to really be accurate and helpful. But of course, with this amount of personal data, there's always a new set of privacy concerns for users. How is Microsoft tackling this right now with the launch of Copilot Vision, and how do users know that their data is safe?

Mustafa Suleyman: Great question. So we're keeping a very open mind on this, right? Some users will want to keep their sessions ephemeral, and at the moment Copilot Vision throws away the contents of what it has seen at the end of the session. There are benefits to that from a privacy perspective, because it's super straightforward and it's an easy rule to communicate. But there are also downsides. Like we talked about earlier, the benefit of having your AI know you session to session, week to week, month to month, is going to be pretty significant.
So we're going to cross that bridge when we come to it, because right now the models we have don't actually have good enough memory for that to be a real consideration. But when we get there, it's going to need a new privacy and security infrastructure to store that kind of content, because it's going to be very rich. It's going to describe in immense detail not just moments in time, but strings of activity over hours and days, very rich and very high-dimensional. Personally, I'm pretty optimistic that the value of that information for the user is going to be sufficient: exciting enough, useful enough, interesting enough that some people, at least, will want to save it. And our job will be to create super secure, private, and safe infrastructure to give people the benefit of those experiences. At the same time, many other people will choose not to save those sessions, and that's also totally fine. So our approach is that we're going to take the path of giving the user a choice.

Rowan Cheung: So when exactly is Copilot Vision going to be rolled out to users, and when is it going to be rolled out more broadly?

Mustafa Suleyman: Good question. Right now, it's going to be available in a few days' time in Copilot Labs, to paying subscribers who will get special access to trial it, experiment with it, and give us feedback. This is a very complicated feature. It has a lot of latency requirements and a lot of inference requirements, and we've been very careful on the safety front as well, really thoughtful and deliberate, in making sure that it works well most of the time. But it's still not perfect, and we're just iterating on it. Steady as she goes, kind of thing.
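The session model Suleyman describes in the privacy answer above, contents discarded when the session ends unless the user explicitly opts in to saving them, can be sketched in a few lines. This is an illustrative model of the described behavior, not Microsoft's implementation; the class and method names are invented.

```python
class VisionSession:
    """Models the ephemeral-by-default session described in the interview.

    Everything the assistant 'sees' is held only for the current session
    and dropped at the end, unless the user has opted in to persistence
    (which, as discussed, would require secure long-term storage).
    """

    def __init__(self, persist_opt_in: bool = False):
        self.persist_opt_in = persist_opt_in
        self._observations: list = []  # session-scoped only
        self.saved: list = []          # stand-in for encrypted long-term storage

    def observe(self, content: str) -> None:
        """Record something seen during this session."""
        self._observations.append(content)

    def end(self) -> None:
        """Close the session: persist only if the user chose to."""
        if self.persist_opt_in:
            self.saved = list(self._observations)
        self._observations.clear()  # ephemeral contents are always dropped
```

The design choice this illustrates is the one Suleyman names: discarding by default is an easy rule to communicate, while persistence is strictly opt-in, so the user, not the product, decides whether cross-session memory exists.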
So sometime next year, in the early part of next year, it will go into GA and lots of people will get access to it.

Rowan Cheung: Copilot Vision now is already a huge step up from the original Copilot or Bing. But another question that I think many people will be wondering: when will the regular user be able to have Copilot Vision with fully functioning vision across all their apps, across all their screens, 24/7, with no limitations on websites and apps, and possibly even infinite memory?

Mustafa Suleyman: Yeah, that is definitely coming next year. I don't know when: it might be the back end of next year, it might be the summer, it might be a bit earlier. But we are working extremely hard on getting the cost of inference down so that we can make it widely available to everybody. Fundamentally, this is going to be a cost issue, right? This is an amazing piece of technology that can see what you see and read your screen in real time, and obviously that comes with a bit of a cost. So that's going to be the main constraint to getting it to GA.

Rowan Cheung: Yeah, that's great. So are there any plans for Copilot to become agentic and have the ability to take control of your computer and do regular tasks just like a human could?

Mustafa Suleyman: Yeah, absolutely. One thing that I'm very excited about is that on Windows in particular, many people struggle with fixing their system, whether it's doing a software update, or it might be as simple as turning on Bluetooth or Wi-Fi or adjusting a setting. The cool thing about Vision is that Copilot can now see what you're doing on your screen, and when you pop up a screen, just as you would if you called a technical support line, it can actually guide you through it step by step. And it's a magical experience.
We have over a billion daily active users on Windows, and not all of them are as technical as your audience. So I think it's going to be pretty cool for people to get the opportunity to just ask Copilot in natural language: where do I click? What does that menu say? Where do I go from there? That's just one example of increasingly agentic behavior, but we're also very interested in, and working hard on, how it navigates the browser, fills in forms, calls APIs, and so on. It's a big priority for us at the moment.

Rowan Cheung: Something I'm really excited about personally is when AI can not only see my screen in real time with something like Vision, but also guide me through learning to use new apps by controlling my screen and showing me exactly where I need to go. For example, if I'm learning how to use Photoshop, I can have Vision see my screen in real time, guide me through certain tasks, and tell me how to do them. But imagine if the AI could actually take control of my screen and show me: this is the tool you want to use to mask a character, or this is the tool you want to use to clip out that character, and have the AI guide me all the way to the end point. I think the AI assistant, or Copilot, for education is going to be absolutely massive. Are there any other use cases that get you really excited about these full-on vision and agentic Copilots?

Mustafa Suleyman: Yeah, the future of Copilot Vision is definitely Copilot help: step by step, taking you through troubleshooting when you're trying to fix your computer, or when you're trying to learn a new piece of software like Photoshop, like you said. That's for sure coming, and I think it will really help people who might be less capable with their computers, debugging their issues. I think that's going to be very liberating. But it's also going to help you with basic tasks in the background.
Like going shopping, for example. It will populate your basket in advance, it will ask you if you want this or that, or it will find a bunch of prices and say: there's a better deal, you could get it here. Typically, you might not be bothered to do that, right? Can you really be bothered to buy your groceries from three different stores to get the optimal price? No, you're probably just going to get them all from one place. So now you're going to have Copilot in the background go off and find you all those different options, aggregate them for you, and then go and execute the purchases in all the different environments. There are just so many possibilities; it's kind of limitless. One of the things I've been really excited about is the role in gaming as well. You can imagine having your Copilot companion talk to you about the worlds you're building in Minecraft, or hang out with you in Call of Duty. It's probably going to feel like an ever-present companion in whatever setting you're in, whether it's the browser, your Slack app, or your gaming environment, because naturally you're going to think: well, it's my Copilot, of course I want it to be here. "Hey, Copilot, did you see that? What should I do next?" That kind of ever-present thing is definitely where we're headed, and we'll just see it show up in a lot of different settings, I think.

Rowan Cheung: Yeah, we're headed into a completely new era. And I guess my next question is: how is Microsoft thinking about educating the next generation of students, businesses, and builders on using AI as a co-intelligence rather than a full-on replacement?

Mustafa Suleyman: Yeah, that is the beauty of how we've framed Copilot.
It is explicitly designed to be in the background, to be an aide, to be a consigliere: giving you great advice, giving you support, giving you feedback. It's explicitly in service of you. We've tried to frame the aesthetic of the UI like that, and the tone of the conversation, the way that we've deployed voice, and the way we think about vision are all very much in tune with that. As it progresses, we'll keep that same approach and set of principles: this is something that you control, that you're in charge of, that is aligned to your interests, that is on your team, that is backing you up. It's like your little hype man. And I think that's an important distinction from my friends and the people that I love from the last 15 years of being in the field, where there's a bit more of a focus on AGI and superintelligence and this kind of inevitable explosion of something that's more powerful than us. That's just not what we're building, and not what we want to build. It's really a very different kind of experience. It's a personal companion. It's for you. It's not for the brand or the business or the company. There will be other AIs that do that, but everyone is going to have their own AI that represents them, that's on their team.
And that's a dramatic departure from the previous world of social media, right, where you had to rely on your own eyeballs to make sense of the feed, whether you were on Twitter or Instagram or just using a computer. Now you have this other interface layer that interacts almost adversarially with the content being shown at you; it's your filter. And I think that's a big paradigm shift, because it puts the most powerful technology in your hands, on your side, so that you can teach it the kinds of things you want and the preferences you have, rather than being at the mercy of the preferences, attitudes, and values of the feeds shown to you by any one of those companies. So that's a big shift in that sense.

Rowan Cheung: And my last question here, I know you're super busy, but do you have any advice for future generations of students and businesses?

Mustafa Suleyman: My advice is always: think of yourself as an accelerationist. Adopt, embrace, absorb, but do so critically. And those things aren't in tension with one another. You can still be as optimistic and excited as I am about the future of technology while at the same time being critical and wide-eyed, trying to establish boundaries, saying what you do or don't like. That's the game. There's no outright rejection of it, and there's no completely unbridled investment in it either, letting it have its own life, take its own organic form, and do its own thing. There is a middle path where you can sit between those two positions, and I think that's the right way to think about it.

Rowan Cheung: All right, well, that's it for the interview. Thanks so much for taking the time, and good luck with the launch.

Mustafa Suleyman: Thank you, man. This has been a lot of fun, and I really appreciate you.
So thanks for doing this.