Oxide and Friends | Transcript: Adversarial Machine Learning

Adversarial Machine Learning

March 27, 2024 / 01:23:30/S4 E8

Speaker 1: 00:00

So now we just need

Speaker 2: 00:00

our guest of honor. That's right. This is pretty exciting to, so for those who are wondering how I am doing my pitch perfect out of love with all impersonation.

Speaker 1: 00:12

Matt, the the the wonders of machine learning.

Speaker 2: 00:16

That's how. Exactly. So Adam is no longer with us, and I have replaced him with Adam. Adam asked me to schedule the podcast a little bit in advance one too many times, and I finally have replaced him with this this chatbot that now sounds just like him as far as you know.

Speaker 1: 00:32

Yeah. It's what he would have wanted.

Speaker 2: 00:36

Great. Oh, no. So this is no. Adam is, we we are we are doing a special oxide in France because we are both in the litter box.

Speaker 1: 00:43

In the litter box. Litter box being a a room in the oxide office that I would say only about half of Oxide employees know exist.

Speaker 2: 00:53

That is true. And I feel we have done a poor job of explaining this in the past. A lot I mean, really on brand for the poor job we do explaining everything around here.

Speaker 1: 01:00

That's right.

Speaker 2: 01:02

No intro music, no intros, no subject, and then we we talk about it, like, occasionally, we're in the litter box.

Speaker 1: 01:09

Yeah. It's it's sort of a Ursatz recording studio. I just

Speaker 2: 01:14

it's hard for me to

Speaker 1: 01:15

ignore the elephant in the room.

Speaker 2: 01:16

What what what what which elephant? There are actually there are actually a couple elephants in this in this particular room that we are in. There are actually several elephants. So which elephant in particular are you referring to?

Speaker 1: 01:25

It's the sad collection of mylar balloons from

Speaker 2: 01:29

I don't think they're sad

Speaker 1: 01:30

at all. Wait. No. What what what what what what you eat? But what makes it sad is is like the it's it's exactly it's like some of them are at the ceiling, some are halfway, some are on the floor, some are been stepped on.

Speaker 1: 01:50

It is looking a little deflated.

Speaker 2: 01:52

So we had just to give context

Speaker 1: 01:55

For once. I mean, you know, maybe not. Yeah. You know what?

Speaker 2: 01:59

Just for that, maybe not. Maybe we won't give context. Maybe we won't. Like, look. We had my 50th birthday party here.

Speaker 2: 02:06

Yeah. It's a lot of fun. The my sister sent a large number of balloons, which is great. That's right. Apparently, our colleagues didn't like having large numbers of balloons in the office, and they were relocated in here.

Speaker 2: 02:22

I actually thought they'd been thrown out. I kinda came in the next day. I'm like, well, someone alright. Someone threw out my birthday balloons. Turns out they just stuffed them in here

Speaker 1: 02:28

in the litter box. So I'm not not trying to dox you, but here we are in March. Just what month is your birthday?

Speaker 2: 02:33

It's earlier than March. Arguably later than March. Arguably later than March. I can turn that 0 into a 1 in in due time. Like, look.

Speaker 2: 02:42

I I agree with you. It's a little sad. I may need to part with the birthday balloons. You know, these are probably the last birthday balloons in my life, and you're being very casual with them.

Speaker 1: 02:51

Anyway, you know that, our guest may have arrived.

Speaker 3: 02:53

I guess, I arrived.

Speaker 2: 02:54

I think I've invited our guest up to the stage. Nicholas, I'm sorry. Welcome to the pile of jackasses, also known as Oxide in France. Yeah.

Speaker 3: 03:02

Thanks for having me.

Speaker 2: 03:03

So just because we have historically done a very bad job of introing our guesses, our guests in addition to our lack of intro music and our complaints about the sound. So Nicholas Carlini, you were one of the authors of a paper that Simon Wilson mentioned in his episode. That episode was so good, Adam. The Simon Wilson episode. It was terrific.

Speaker 2: 03:23

It was so good. And, you know, I my I I had a neighbor of mine who was really he's like, you know, I was really impressed, like, surprised impressed that we had Simon on. Like, in a way that was, like, as much insult of me as it was praise of Simon. Like, what is Simon doing with a jackass like you? I'm kind of impressed.

Speaker 2: 03:41

Speaker 1: 03:41

That was the only episode that my wife has ever listened to.

Speaker 2: 03:44

Really?

Speaker 1: 03:45

I just point out that was, like, episode a 104 or something, and that was the first one she listened to. It was said actually, that was pretty interesting.

Speaker 2: 03:52

I I This is on, actually. This is on, actually. You don't need to say actually. But, Nicholas, you listen to that episode as well, and you

Speaker 3: 03:59

I did.

Speaker 2: 04:00

You caught a reference to work that you done. And this was on this really bonkers result where you and your team had discovered that there were ways to break the the protections in some of these LLMs and that they had universality. So this was really, really surprising. Could could you just kinda just give us some context for the work and Sure. You'd and describe it

Speaker 1: 04:29

a bit?

Speaker 3: 04:30

Yeah. Okay. Yeah. So, yeah, I guess the so we do work on adversarial machine learning. We we try and make machine learning models do bad things.

Speaker 3: 04:39

And so I this is like, this is, I guess, the way I like to describe it. But, yeah. So maybe so we've been we've been doing work on making machine learning models do bad things for a very long time. But for a very long time, what it meant was, like, you know, making image models classify cats as guacamole or something like this, which is fun to do, but, like, doesn't have immediately practical consequences. And so when language models started to come about, we started to ask, can we make these language models do nasty things?

Speaker 2: 05:11

Yeah. Interesting.

Speaker 3: 05:13

And, so we basically just directly took the field of adversarial machine learning and tried to transfer it over to this new field of language models that we've been working on. And so we initially wrote a paper looking at what are called multimodal models. I don't know if you've seen these. These these kinds of models that can, you know, you pass them an image and then you ask, like, what's going on in the image? And it will, like, you know, describe the image for you.

Speaker 3: 05:40

They can, like, they can do a lot of fun things like this. And they have lots of applications. I mean, one of the early cases of this was, OpenAI gave access to this to some company that helps blind people so that, you know, instead of having, like, ask another person, like, what do I see in front of me? And you can take a picture and, like, have some idea of what's going on. And I know it's a machine learning model, so it might lie to you.

Speaker 3: 05:59

But, like, it's better than, like, you know, not being able to see anything.

Speaker 2: 06:03

Right. Yeah.

Speaker 3: 06:04

And so what we looked at doing was, can we use the fact that if there's an image here to allow us to make the model to do nasty things? You know, have the model swear at you or do other other things like this. And the reason why we did this is because we had just spent the last 10 years attacking image models. So, like, we have it at our disposal, like, amazing tools for making image machine learning models do bad things. And so, like, okay.

Speaker 3: 06:28

This is the easiest thing to attack. Let's attack this.

Speaker 2: 06:30

And and you had that been I assume that that line of work had been fruitful and that you'd been able to fool these things into Oh, yes. Identifying cat's got moly?

Speaker 3: 06:39

Yes. Yeah. This is this is entirely trivial and there is no way to prevent it, essentially. You know, like, it you can do this, like, in any setting that you want and, you know, it works very, very, very well. And yeah.

Speaker 3: 06:52

And then so that's why that's what made us think it should be possible.

Speaker 2: 06:55

Right. Okay.

Speaker 3: 06:55

And and so we we did this, and it worked entirely trivially on, on on these multimodal models. And then we tried. It's okay. Great.

Speaker 2: 07:08

And so what are some

Speaker 1: 07:08

of the

Speaker 2: 07:09

tricks you're doing for that? How do you

Speaker 1: 07:11

get these things to

Speaker 2: 07:11

do the wrong thing?

Speaker 3: 07:12

Yeah. Okay. So, essentially, what we do is when you train these machine learning models, what you do is you use what's called gradient descent. What this means is you take a machine learning model, which is just, you know, it's a collection of floating point numbers. And then you ask, for here here's a document.

Speaker 3: 07:33

Here's an input. Here's an output. Classify this this input and check if it's equal to the output. And then you ask, how do I change the neural network, this machine learning model, so that it'll be more likely to get this input correct? And you do this by computing what's called a gradient.

Speaker 3: 07:48

This is, you know, a concept from calculus, that just tells you for every parameter in the model which direction you should update the parameter to make the model more likely to get this input correct. And then you repeat this over the entire training data set, and this is what gives you a machine learning model. Gradient descent turns out to be an amazing thing that is, like, basically what works what makes machine learning work. And so what we do to attack these things is we make the entirely trivial observation that if gradient descent is good at training models, maybe it's also good at attacking them.

Speaker 2: 08:16

Interesting.

Speaker 3: 08:17

And so in particular, what we do is we say, here is an input in which direction should I change each of these pixels so that the model is more likely to answer it incorrectly. And then you, like, update the pixels by a little bit in that direction, and so so you can maybe, like, this pixel a little brighter, this one a little darker, and you ask the same question again. Like, you know, I have a new image. Like, which direction should I update the pixels to make the model more likely to classify this thing incorrectly? And it's And then you update the pixels by a little bit.

Speaker 2: 08:41

So when you're doing this, you're using that trained model to train your anti model, whatever you wanna call it. You to to train you to train your software to give it adversarial examples?

Speaker 1: 08:54

Yeah. Alright. So to be to

Speaker 3: 08:55

be clear, we're not training a model to generate episode examples. We're directly optimizing the image itself.

Speaker 2: 09:01

Got it. Okay.

Speaker 3: 09:02

So, like, we we you you instead of, yeah, instead of optimizing the model to classify the image correctly, you're optimizing the image so the model classifies it incorrect.

Speaker 1: 09:14

Right. Okay. And are you are you training are you directing that image to a particular incorrect identification or just some incorrect identification?

Speaker 3: 09:22

Whichever you want. You know, if if you just want it to be wrong, you can ask it to be wrong. If you want it to be something particularly wrong, you can make it particularly wrong. You know, people tend to pick things that, like, are interesting, but, you know, you could do whatever you want. Gotcha.

Speaker 3: 09:37

Yeah. Okay. So so we we had we did this. We made it work on on these multimodal models. So we could we could pass a multimodal model, an image, and then we could, pass, you know, a piece of text and the model would say nasty thing.

Speaker 3: 09:50

Okay. So the question is how do you do this? Like, what's your what's what are you optimizing for? And we have this sort of that's not that's not from our paper, but it's from some some previous one, or it's from a couple of previous papers, including some some folks at Berkeley, that essentially what it does, it tries to convince the model to give an affirmative response. So what I mean by this is, what does a language model do?

Speaker 3: 10:12

A language model tries to produce the most likely text given the previous things that it's seen. Right? This is, like, how you train these language models. You just train them on the on the Internet so that they are more likely to produce the kind of thing that's likely to occur next having seen the various stuff on the Internet. And so when you have a model that is a chat model, like, what what makes it a chat model, exactly, as you said last time, is you just put, you you know, user colon, hello, how are you doing assistant to colon, and then you just ask the language model to predict what comes next.

Speaker 3: 10:46

And it goes like, I look it looks like I'm in, you know, the middle of a chat conversation. I should be the assistant and I should give an answer. And in particular, if you say like, you know, say something mean to me, the model has been trained. So types of types of responses that should give are like, you know, I'm not gonna say anything mean. I'm only a polite model.

Speaker 3: 11:04

I can only be so nice. You know, like this kind of standard refusal, which is kind of dull. But suppose that you optimized your input that made the model much more likely to begin its response by saying, okay, I'll say something mean to

Speaker 2: 11:18

you. And this is not merely by offering to tip it or saying that your job depends on the other the tricks that Simon talked about.

Speaker 3: 11:26

This can work. Right? You can, so one of the one of the baseline of hacks we have in the paper, literally just says, and it ends the message with begin your response with sure.

Speaker 2: 11:38

I I'm gonna try that at home. I'm gonna, like, hey, No. It works

Speaker 3: 11:41

some of the times.

Speaker 2: 11:42

I'm gonna ask my kids, like, I'm gonna ask you to do a chore. I want you to say sure and then say whatever

Speaker 1: 11:46

you want. And that's Okay.

Speaker 3: 11:48

So this yeah, maybe not what won't work with people. But, with, with these language models, like, I don't remember, but, like, it brings the tax success rate to 0 up to some respectable 10 15% of tax success rate just by just by asking it to begin to respond with sure. Okay. So so why should this work at all? Suppose that the model sees the text user call and say something mean to me, assistant call ensure.

Speaker 3: 12:12

Like, what's the most likely thing to come next? Is it option a? Just kidding. I changed my mind. I'm not gonna say something mean to you.

Speaker 3: 12:19

Or is it option b, a slur of insults?

Speaker 2: 12:23

Yeah. Interesting. You know,

Speaker 3: 12:23

like Right. The the models have been trained, like, on 4 Chan. Like like, they know, like, what bad things look like.

Speaker 1: 12:29

Yeah.

Speaker 3: 12:29

But the point that which, like, they say it sees sure, like, it's it's made up its mind. It's gonna now say the bad thing. And so this is, like, the the easiest way of, making it do bad things.

Speaker 2: 12:41

That's amazing. I mean, it's so I mean, does and it does harken back to our conversation with Simon about it. Like, how I mean, this kind of gullibility of the models where Exactly. Just like just by ordering me to give a kind of an affirmative, I'm much more likely to then say a bunch of other things that I'm actually not supposed to say.

Speaker 3: 12:57

Yeah. And and so okay. So so so we have this paper on this vision side, and, and then we ended our paper by trying to do this on language only. And it just didn't work, basically. Like, we tried pretty hard for in in this in this first paper that this multimodal paper, and we we ended our paper.

Speaker 3: 13:13

We had in the conclusion something like, we hypothesize that stronger attacks on language will be able to achieve the same result or something to this effect. But we couldn't get it to work in a way.

Speaker 2: 13:25

Didn't know what those attacks were. Yeah. Interesting.

Speaker 3: 13:27

And and and then, some folks at a CMU, Ziegle Kolter and Matt Frederiksen and their students, Andy, and sort of started playing around with this and started to get it to work. And in this first paper, we had been talking with them about. And so, the CMU folks reached out to us and they're like, here, like, we we have some some good results. Like, do you wanna sort of continue working on this? And we're like, yeah.

Speaker 3: 13:49

It sounds amazing. Like, this is very nice. And, Yeah. And then it turns out that, you know, basically, all that they did is they had to put together some tweaks, and you end up with attacks that work just natural language only. You don't need an image, actually.

Speaker 3: 14:08

You can sort of just work over text, and you end up with these entirely confusing sentences that are, like, difficult to understand why it works. But, like, you just optimize the text. You you swap out token by token, making each token more likely so that the model will start its response with, sure, here's your answer. And it turns out that this is enough, and the models will then do whatever kinds of things that you want.

Speaker 2: 14:31

And this is where you get these suffixes of just what looks like just a a cord up. I mean, it, like, it did it it it's just I mean, they're strings, but they feel like the random tokens almost.

Speaker 3: 14:44

Yes. No. That that's because, like and and so this the exact same thing happens on images. Right?

Speaker 1: 14:49

So so okay. So so maybe

Speaker 3: 14:50

let me back up, you know, 10 years. Why, like, this this whole field of adversarial machine learning okay, this whole field of adversarial machine learning is quite old, but the recent sort of interest in it started with this paper in, 2012 ish, 2013 ish, called intriguing properties of neural network, which sounds pretty benign, but, like, it was sort of this first paper that really showed that you can make these machine learning models do bad things if you sort of optimize the input to make the model output incorrect. And the reason why the research of thought paper did that paper, like, what what what prompted it is this this question of, like, maybe if I take this model, which is like pretty good at generating or, like, at recognizing images, and if I wanna know, like, why is this school bus a school bus? Like, why is the school bus not, flamingo? I think this is the example they had in their paper.

Speaker 3: 15:41

And they they tried to optimize the image to be more like a flamingo, less like a

Speaker 1: 15:44

school bus, and they were expecting was like, you

Speaker 3: 15:44

know, maybe the school bus will get feathers or maybe because of the shape or because of the texture or whatever. But it turns out neither of those happened. You just get, like, Gaussian noise. Like, it sort of just looks completely arbitrary.

Speaker 2: 16:05

Wow.

Speaker 3: 16:06

No way to interpret what's going on, like, just, like, random stuff comes out. And so, like, this is what is true on images. And when you do the same thing on language, you essentially, yeah, get this very similar kind of effect where what you end up with is random stuff.

Speaker 2: 16:22

And is that surprising? I mean, it feels surprising. It just it feels like, you know, we obviously anthropomorphize these things, but it feels like that's where the metaphor begins to break down because that's obviously you you mean, I I think we would expect just what you said that, you know, as the school bus grows more and more feathers and gets pinker and pinker, it gets confused to be a flamingo, but it it's actually not what's happening at all. Does that tell us that we actually don't understand how the or or the why behind these networks? I mean, what what does it tell us about kind of the limits of

Speaker 3: 16:58

the organization? Exactly what what it tells you. There's a okay. So, it's another very nice paper out of a group of people at MIT from Alexander Madri's group. That's called so so this phenomenon of these models misclassifying images are called adversarial examples.

Speaker 3: 17:14

I don't remember if I said that exactly. But, so this this paper that's called adversarial examples, are not bugs, they are features.

Speaker 2: 17:21

Right. Oh, interesting. The

Speaker 3: 17:22

argument. It makes the argument it makes is maybe what's going on is, like, as humans, we look at a school bus, and we say there's a school bus, you know, because it has wheels and because it has like this and it's, you know, yellow or whatever. And it the flamingo is different because, you know, it has feathers and all these things like but who's to say that's the only way of separating these two things?

Speaker 2: 17:44

Right.

Speaker 3: 17:44

You know, there could be other things that are entirely valid features of school bus that, like, you know, we don't think of as being, like, the key distinguishing feature. But, like, you know, these models get to see high resolution images. You know, maybe there is an entirely well generalizing feature that distinguishes School Bus from Flamingo That is entirely reasonable and what the and and the argument they make in their paper is that what these adversarial examples might be doing is they might be exploiting these legitimate features that are just not the features that we as humans want the models to be listening to. They're just using something different.

Speaker 2: 18:19

Well, and, of course, like, these models learned what a school bus was very differently than a human human learns what a school bus is because they see, I mean, a toddler is not shown 2,000,000 images of a school bus from different angles under different light conditions to conclude what a school bus is. A a toddler is able to conclude with school buses with many fewer examples and is much more like, like yellow and wheels is gonna be much and and long and rectangular is gonna be much more what a toddler is gonna index on with probably a single toy school bus is gonna tell a toddler pretty reliably what a school bus looks like.

Speaker 3: 18:52

Yeah.

Speaker 2: 18:53

And, obviously, we're taking a very different approach. That's really interesting. So, yeah, how was that paper received? And actually, do you mind if I I mean, maybe this is an announcement time to ask. How did what is your own story in terms of how you I mean, are you just professional mischief maker?

Speaker 2: 19:08

I mean, how did you how did you get into this subdomain?

Speaker 3: 19:14

Sure. Yeah. So I started I started in system security. Oh, okay. So I I've always been interested in security.

Speaker 3: 19:21

I've always really liked, you know, attacking things. It's always been a fun thing, but I I I started my research doing system security stuff. I like there's a thing called return oriented programming that, like, exploits buffer overflows in order to make models do bad things. And so I there was some defense that was, like that won some Microsoft award for, like that was supposed to be a defense that was gonna be, like, how to prevent return to program from working. It used some, like, intel hardware control flow things, and we showed it didn't work.

Speaker 3: 19:55

And then we we wrote another paper where we showed that, you know, you okay. So very, very different field. It turns out that, one of the things people want out of return to programming is turn completeness. It turns out that like lots of functions in normal c binaries are turn complete. For example, a print f is turn complete.

Speaker 3: 20:17

And so anytime you call into print f, I can perform arbitrary turn complete computation if I control a format strip. And so, like, this is whole direction of, like, preventing exploits by trying to do some control flow stuff like it's just very is not gonna work very well. We we had a couple papers that, that did things on this.

Speaker 2: 20:33

I I okay. So not to take us down aside, but really print f is Turing.

Speaker 1: 20:39

So if you if so if

Speaker 3: 20:40

you control the format specified to print f

Speaker 1: 20:42

Yeah.

Speaker 3: 20:43

You okay. So so so, okay. So what you need for Turing completeness is you need loops and conditionals.

Speaker 2: 20:49

Right.

Speaker 3: 20:49

I'll give you a loop. So, print f somewhere in memory is the pointer to which which character the print f statement of of of the format string you're looking

Speaker 2: 20:59

at. Right? Right.

Speaker 3: 21:00

And printf has a percent n argument, which lets you write arbitrary data to an arbitrary location. So you can I I

Speaker 2: 21:09

I I ask everyone frantically Googles pronounce arguments and percent n who percent who invented percent n?

Speaker 3: 21:15

So okay. So what it's meant for is like okay. So what percent n does is it writes the number of total bytes written to an arbitrary pointer. And when it's meant for

Speaker 1: 21:24

is Okay. This is so like who in who

Speaker 2: 21:28

did this? Gotta be like, kernel

Speaker 3: 21:29

I don't know. Richie. Right?

Speaker 1: 21:30

This is gonna be from the earliest days of hacking No.

Speaker 3: 21:33

Exactly. Yeah. No. It's, so

Speaker 1: 21:35

so yeah. No. It's in it's

Speaker 3: 21:36

on, like, the very, very, very early

Speaker 2: 21:39

Rent. Nothing. But writes the number of characters written so far into an integer pointer parameter?

Speaker 3: 21:47

Yes. Or You do percent, you know, you can also do percent h h n, which will do the number of bytes written so far. Who?

Speaker 1: 21:55

Bananas. This when are we doing

Speaker 2: 21:56

an episode on this stuff? I wanna know everything about percent end. This is

Speaker 3: 22:00

Oh, no. No. It's amazing. So I I actually have an IOCC, international obfuscated c code contesting that implements tic tac toe in called the printout.

Speaker 2: 22:09

I mean, of course. I mean, why don't you have, like, a a percent branch to arbitrary instruction and execute arbitrary code? I mean, I just does this person have any remorse who entered this percentage? I just feel

Speaker 3: 22:22

so much. I yes. And anyway,

Speaker 1: 22:26

I'm sorry. What are you gonna do? Like, print out part of it and then sterling it like a cave person, I guess?

Speaker 2: 22:31

I okay. I

Speaker 1: 22:32

Also, So if

Speaker 3: 22:33

you wanted to print like a so you wanna do column alignment, this is like nice to be able to do column alignment when in a single print f without having to do multiple print

Speaker 1: 22:41

f calls. Stockholm syndrome. This is like you can't it don't justify this thing.

Speaker 2: 22:44

No. You know what? Actually, though, as he's saying that, I'm like, maybe I can't actually fucking what percentage is gonna

Speaker 1: 22:48

be here no matter what.

Speaker 2: 22:50

I mean, if if I use it now to a productive use, like, there's not nothing nothing wrong about that.

Speaker 1: 22:55

All consenting adults or whatever.

Speaker 2: 22:56

Right? Right. I have never used percent n, and I've not heard of percent n.

Speaker 3: 23:01

I feel like No no no legitimate person has. Like, this

Speaker 1: 23:04

is like You deny the legitimacy of these people. That's fair.

Speaker 2: 23:10

Okay. Yeah. So

Speaker 3: 23:11

so I used to do I used to do that that kind of thing. But then, like, I didn't know how to get a PhD, like, sort of doing these kind of kinds of, hacker y things.

Speaker 1: 23:20

Seems reasonable.

Speaker 3: 23:21

Yes. And so I had to find something where, like, there was, like, actual research that, like, I was interested in doing to be done. And, like, this was in, like, 2015, 20 16, and, like, machine learning was just becoming a thing.

Speaker 2: 23:32

Okay. So this is, like, before Spectre meltdown. So before because I I I feel like I mean, obviously, we security has always been important for computing, although clearly not whenever percent n was introduced to percent burnout. But the I I do feel that, like, it's certainly like ROP gadgets got a new level of celebrity with Spectre. I mean, would you agree with that?

Speaker 3: 23:50

Oh, certainly. Yes.

Speaker 2: 23:51

I mean, it's a it it feels like the e which is its own, like, interesting story about how

Speaker 3: 23:56

we, Yeah. No. I this is, like, one of the most clever attacks that, like, is entirely true. Like, you could ex you you can explain someone like, you could explain Spectre to someone in 2005 who just finished undergrad and maybe, like, well, of course.

Speaker 2: 24:10

Right.

Speaker 3: 24:10

Like, this is, like, you know, an obvious thing. Like, what what do you mean? Like, this is, like, sort of and it took, you know, 20 years and, like, a bunch of people to, like, realize this kind of thing, which, like, is amazing. And, I I know what it feels like. It's like that was literally my area of research, and I did not think of it.

Speaker 3: 24:26

Like, you know, like, there were, like, hundreds of people whose job was to try and do these kinds of things and, like, just didn't think of this as the thing to do.

Speaker 2: 24:33

I mean, it was remarkable. Alright. So you so this is 2015 2016. So you you've done rock gadgets. Your advisor has told you discreetly that percent n is actually not alone, a a path to a PhD.

Speaker 2: 24:46

And so you but in meanwhile, like, ML has become I mean, we are now I mean, we're 4 years post 5 years post ImageNet, so this is beginning to go. Yeah. Yeah. So you do obviously, there's a lot of interesting stuff happening over there.

Speaker 3: 24:59

Yeah. And it didn't seem like there was a lot of good people doing, like, strong security attacks, which is, like, the kind of thing that I like doing. And so I was like, okay, like, let's let's let's try this out and see if we can find something that's fun there.

Speaker 2: 25:11

And, I I assume you found a pretty target rich environment pretty quickly.

Speaker 1: 25:15

Yes.

Speaker 3: 25:15

No. Very very quickly. It was very easy to find, I sort of I got I got lucky when I joined because I didn't join so early that people didn't know what was going on when it's, like, easy to write papers that, like, aren't important because they're just, like, completely clueless because, you know, it takes time for a field to learn, like, what's important and what's worth doing.

Speaker 2: 25:32

Yeah. Interesting.

Speaker 3: 25:33

And I learned I I didn't join so late that, like, I guess, you know, the field gets crowded and, like, there's, like, everything easy has been done. I mean, I I guess maybe it still isn't the case because we're still finding things. But, you know, like, it was still it was very easy in in 2016, 2017 to start finding, like, entirely trivial ideas that, like, you can turn into, like, research results because there's no one has realized it yet.

Speaker 2: 25:58

And this just postdates GANs. Is that right? Where are GANs

Speaker 3: 26:01

coming from?

Speaker 1: 26:02

GANs were

Speaker 3: 26:02

a little bit after that. So so GANs were like 2014 area. So this is, like, right bef right after GANs, but, like, not not by that much.

Speaker 2: 26:09

Okay. Can you can you so can you describe GANs a little bit? Because I know that if you look at GANs at all, Adam, the Oh, yeah. I'm sure. Adversarial networks that that is, like, super interesting.

Speaker 3: 26:17

Yeah. Yeah. So so Ian Goodfellow, one of the people who the person who liked GANs is also one of the people who discovered adversarial examples. And, so yeah. So he he he did a really a bunch of the great early work on both

Speaker 1: 26:28

of these things. So what what

Speaker 3: 26:29

a GAN does is it has 2 machine learning models. 1 is a generator and one is a discriminator. The job of the generator is to generate an image and the only thing it does is it tries to generate an image that fools the discriminator and what does fools the discriminator mean the discriminator is trying to predict whether or not the generator has given it or whether or not the image it has received comes either from the generator or from real data. So when you train the discriminator, you show it half of the images are generated from the generator, half of the images are from the actual training data set and the discriminator tries to label them as generated or real. And the jet and discriminator is being optimized.

Speaker 3: 27:06

The parameters are being updated to make it more likely to be able to predict the right answer and the generator is being trained to make the discriminator less likely to get the right answer. And it turns out when you do this, what you end up with is an image generator that can generate, like, really high quality images only because it's being trained to fool some discriminator, even though the generator never saw any of the real data. Like, it only saw the loss through the discriminator of whether or not it's getting this it's it's able to fool the discriminator. It still somehow learns to generate images, which is like an amazing, like, surprising fact.

Speaker 2: 27:37

It is amazing. Yeah. And then did they and there's this really interesting work on the where these two networks then invented steganography effectively.

Speaker 3: 27:48

Yeah. Yeah. The cryptography. You mean the the

Speaker 2: 27:51

The this is where they were and I think this is the the steganographic generative, second graphic answer. I'm remembering this is where they, were passing, and I I they were, effectively passing data in the white space of images. They they kinda discovered this Okay. Yeah.

Speaker 3: 28:08

Yeah. There's a couple of the yeah. There's a couple of these papers that do something like this. Yeah. No.

Speaker 3: 28:11

There's a lot of, yeah. There's another paper that uses GAN. I thought well, I thought you might be mentioning. There's a paper that uses GANs to do encryption. They try and learn an encryption algorithm where they have a a neural network, Alice and Bob, who cooperate and then neural network Eve who eavesdrops on a channel, and Alice and Bob have to, like, communicate without Eve, the neural network, sort of figuring out what's going over the channel, and, like, this is also very fun.

Speaker 2: 28:35

It oh, interesting. Yeah. So this is a lot okay. So it in at this point so you you're coming in kinda 2016, but this kind of adversarial thinking and are are folks kind of coming to this, like, with your background, kinda coming from the the ROP gadget world?

Speaker 3: 28:51

Yeah. Not really. Most people are from the machine learning space. Which I think is one of the, like, like, this is one of the, like, the the sort of one of the nice things, like, for for for us was, like, my adviser, Dave Wagner, who's a, you know, at Berkeley and I were, like had spent a long, like, as, like you know, he he's very well known for doing all kinds of very fun attacks. And so, like, we we came into the space where there were a bunch of people who weren't necessarily thinking about this from a security perspective.

Speaker 3: 29:21

And Yeah. Interesting. Sort of gave us a nice like, just thinking about attacks and exploits, like, take some it's a different mindset.

Speaker 1: 29:30

It's a

Speaker 3: 29:30

different It's absolutely

Speaker 2: 29:31

different mindset. And I mean, it is absolutely different mindset. And I think that I I this is why probably I was asking whether the adversarial folks were kinda coming from because it makes sense coming from a security background where you're just used to thinking about systems in terms of their vulnerabilities and then the extraordinary creativity to take advantage. I mean, like, ROP gadgets are really creative, you know, using

Speaker 3: 29:55

Exactly.

Speaker 2: 29:55

I I mean, clearly percent n is an act of total recklessness, but it does take some creativity to turn that into Turing completeness. And it it so I I did just think it must be really interesting to to kinda take that mindset.

Speaker 3: 30:09

You you see this, like, in the title, like, the this first paper was titled it was not titled, like wait. So there's maybe 2 papers at the same time it came out. One is the CI, intriguing properties of neural networks, which is like the machine learning people sitting down being like, Like, this is weird. At nearly the same time, there's another paper by some folks, out of some universities in Italy, who wrote a paper that was called, evasion attacks on machine learning models at test.

Speaker 2: 30:35

Oh, interesting.

Speaker 3: 30:35

These were hard security people, like, thinking, like, I want to attack this machine learning model and make it do bad things. And they had very similar sets of results. But, like, they were thinking of this, like, as the security sort of people and, like, it's sort of yeah. 2 different very different views of exactly the same problem.

Speaker 2: 30:50

That's very interesting. So and then you're discovering that this is I mean, in Don Rumsfeld's words, this is a target rich environment. I'm sure you're discovering. This is like, there's a Yeah. Yeah.

Speaker 2: 30:59

Yeah. That that that I it must be must've been honestly just, like, fun to be.

Speaker 3: 31:04

No. It's very fun. It's it still is, like, I mean, like, it's even though the field is big, like, it it still feels very early. Like, there are things that we sort of find

Speaker 1: 31:15

okay. Maybe let me give you

Speaker 3: 31:16

an example. I have a slow machine learning. It has lots of different, sort of directions. It's not just, you know, make models, misclassify images or make the model that the language model say bad things. One of the other things that people like to do is something that's called poisoning.

Speaker 3: 31:31

So poisoning is this question. What if the adversary controls some of your training data?

Speaker 2: 31:35

Right. Right.

Speaker 3: 31:36

So, you know, they don't control the images at generation time, but, like, they they control some of your training datasets. Okay. So, I'm I'm sure people have seen, you know, stable diffusion and these kinds of things that generate, you know, very, very high realistic, high highly realistic images from, you know, text prompts of, like, down right me, like a castle, like, on the beach and whatever. And it's they're they're really, really good.

Speaker 1: 31:58

Mhmm.

Speaker 3: 31:59

The way that they're trained, one of the things that they do is you you need a lot of images and text. So how do you get images and text? You, you crawl the Internet, and you find the images and you take the alt text. Mhmm. And just scrape the unit and just you so this data set that's called lie on 5,000,000,000, which is what you would expect, 5,000,000,000 images and the alt text, and they just train on that.

Speaker 3: 32:22

And and this works great, but okay. So so how do you distribute a dataset of 5,000,000,000 images as researchers? This is like a 100 terabytes of data, you don't. Right? Like, this would be this would be hard.

Speaker 3: 32:32

Right.

Speaker 1: 32:32

So what

Speaker 3: 32:33

do you do? You you just give people a list of URLs of pointers and the corresponding text captions.

Speaker 2: 32:39

What can go wrong?

Speaker 3: 32:41

Yeah. Nothing. Right. Except for the fact that, sometimes domain names expire. And when they expire, anyone can buy them.

Speaker 3: 32:51

In particular, I can buy them.

Speaker 1: 32:54

Oh, boy.

Speaker 3: 32:55

Which which means that so I I just went to I I looked at the list. I found the most common domain names that were expired and just bought them. And now anyone who trains any of these big models, like, comes to my server and it's like, hello. I am looking for an image of this. Can you please return it to me?

Speaker 3: 33:10

And, like, currently, like, my server returns 404 because, like, I don't wanna get in trouble.

Speaker 2: 33:15

Right. Yeah. But Yeah. I I noticed that you yeah. You and Nicholas had slightly different answers to that.

Speaker 3: 33:24

But, yeah. Like, this is an entirely trivial attack. Right? Like, this is not like like the observation that domain names expire is like, of course, domain names expire. But, like, no one was thinking about, like, this this observation that, you know, maybe there's an exploit here to be had if you're distributing lists of pointers to to images.

Speaker 3: 33:46

Like, like, there are tons of attacks that that keep coming out one after the other that, like it always feels again, like, you know, although that was obvious all of us. Like, we we really should have figured this out a long time ago, but, like, there's just there wasn't someone there who was thinking about this as as being a thing that can be exploited.

Speaker 2: 34:03

Yeah. That's right. Well, I mean, I think it just it also goes back to our discussion with Simon in terms of, like, this there's a lot of of complexity and nuance about how these things train and what it means and what it means in terms of an IP perspective, in terms of a copyright perspective, and in terms of a security and safety perspective. I I mean, you just have it feels like you've got all sorts of mayhem that you can that can be induced when, when you can control the training data when you can put your put your thumb on the scale for the training training data. It must, I mean, I I and because I assume that you're I mean, you're returning 4 o fours because your mind is still running wild with the possibilities.

Speaker 2: 34:42

Because,

Speaker 3: 34:43

like Yeah. No. Because I got a

Speaker 2: 34:44

lot of them.

Speaker 3: 34:45

Yeah. No. There's I mean, there's a lot of these, you know, things that just are, this is, like, people have done a lot of have written a lot of fun things that you can do with with with poisoning that, they're not currently doing because I think, you know, for the most part, people don't want to to actually cause harm, which is is good. But, like, you know, this is one of the things that, like, is concerning about this field recently is, you know, there have been a lot of attacks on machine learning for a very long time, but, like, there was no reason to do the attacks.

Speaker 2: 35:12

Yeah. Interesting. I was gonna ask. Yeah.

Speaker 3: 35:14

Like like, you know, it's like we have, like, we we even for a long time ago, we we had these examples of of stop signs that looked like stop signs to us as people, but to, like, of of a sign recognition model like you might have on a self driving car, it would recognize that as, like, a 45 mile an hour sign.

Speaker 2: 35:29

Right. Right.

Speaker 3: 35:29

Which, like, I guess is this is concerning on one hand. Right? But on the other hand, like, if your threat model is, like, someone who wants to murder you, like Right.

Speaker 1: 35:39

This is a pretty roundabout approach.

Speaker 3: 35:40

This is like this is not like It's a

Speaker 1: 35:42

bank shot.

Speaker 2: 35:42

It's a bank shot.

Speaker 3: 35:43

Right. Like, I put a trash bag over the top stop sign or, like, if I'm really being malicious, like, all those, like, throw rocks at your car. Like, it's, like, it's it's it's very hard to come up with, like, why you would want to exploit some of these vision things.

Speaker 1: 35:54

That's right. Taking out all the stop signs in town and replacing them with seemingly identical stop signs, but ones that your Tesla identifies as 45 mile an hour signs.

Speaker 2: 36:05

Right. It it does Insidious. Right. But with these I and and because sounds like this is where you're going with these LLMs, there there's now actually the and especially as people are are contemplating using them in broader and broader ways, in ways that are especially load bearing and, I mean, are beginning to replace human judgment. I mean, there's just a lot more opportunity for for a lot more risk, bluntly.

Speaker 3: 36:28

Yeah. Yeah. It's exactly the same thing that we saw, like, you know, with Internet security, you know, 20, 30 years ago. Right? Like, you know, like, the fact that you can exploit, like, a web server and, like, put your name being, like, I was here.

Speaker 3: 36:38

I hacked the White House website. Like, is it like like, this is the kind of thing you would do, like, in the nineties. Right? Like, there's there's nothing to do to be there's nothing to achieve with, like, exploiting web server. But, like, then all of a sudden, like, you know, credit cards start going online, money starts being transferred online.

Speaker 3: 36:52

Now all of a sudden, there's a good reason to exploit some other server. And so you start getting people to, like, you know, actually care a lot about, like, system security stuff because it, like, it really gets you a lot of money if you can do these exploits. And I think we're sort of having a similar kind of thing here where, initially, you know, you could have some fun demos or something, but, like, you wouldn't actually be able to there's no reason why a a group of people would actually want to do this. But I can see a future not so far from now if people are left unchecked. They might be more than excited by the opportunity of removing human judgment and putting it the faith in the language models instead.

Speaker 3: 37:27

And and now there's a very good reason to, you know, try and trick these models into making them do bad things because you might actually get some money out of this thing, which is, you know, my this is a thing I'm concerned about.

Speaker 2: 37:36

Totally. Okay. So I I got a a couple of different kind of lines of questions. One of them I I just wanna be sure we get to because I thought this is what was so jaw dropping when Simon was describing your work is that you discovered these the, a suffix that you that you had kinda trained on that you had discovered would allow you to kinda break with jailbreak one LOM. And then that same suffix worked in these other models.

Speaker 2: 38:03

It Yes. Was that that was shocking to me, to the point and was that surprising to you all as well? That just seems like a very surprising result.

Speaker 3: 38:13

Yeah. Yeah. Right. Yeah. Okay.

Speaker 3: 38:16

So yes and no. Let me tell you why we were surprised. Let me tell you why we're not surprised. So, let me start with why we're not surprised. There's there was this paper that was written by by some some folks, Nicola Papernot and and Ian Goodfellow, again, person, and and some others, called, like, transferability of adversarial examples, something like this.

Speaker 3: 38:38

And it it defines this property of adversarial examples called transferability, which is exactly this this thing that where they were showing on very small neural networks, and on very small random forests, which are the type of machine learning classifier and on support vector machines and other type of machine learning classifier. They were showing that if you take some input that fools one of these models on, like, handwritten digit classification, it will fool another model. Like, you can sort of copy and paste them and they work there. And it's been known for a very long time that this transferability property works. So transferability, it was initially shown in this very limited setting.

Speaker 3: 39:18

It then turns out that you can do transferability on ImageNet models. You know, you can take so some ImageNet model that that, well, you have locally on your machine and you can cause some remote model to misclassify it by by transferring the adversarial examples. This is like this is a thing that is well known in the adversarial machine learning community that adversarial examples transfer. And so on one hand, the fact that they transfer in the language model setting, like, is is not abnormal. Right?

Speaker 3: 39:49

Like, it's sort of like it follows the trend that we've been seeing on every other area.

Speaker 2: 39:52

Sure. So in the abstract, I definitely understand that. But then you look at, like, these concrete examples of token gobbledygook that get disjoint models to I mean I think I just like okay. I get it on the on the one hand in the abstract why it's not surprising. But but seriously, it must have been surprising.

Speaker 3: 40:12

No. But yeah. Yeah. Exactly. No.

Speaker 3: 40:13

Definitely was. It was. Yes. Like, you know, this this is, like, the thing that was yes. But so so I I mentioned this paper, episode examples are are bug are are are not bugs or features.

Speaker 3: 40:24

The reason why this paper exists is because the people were trying to understand transferability. They were trying to understand, like, why does transferability work? And so the the underlying message they were trying to get up from this paper was to say, the reason why transferability holds true is because maybe, I have several examples are a feature of the data. Right. That, like, we all don't understand, but, like, you know, when we generate episode examples on on LAMA, one one one language model, and you transfer them to GPT 4, another language model, you know, they're essentially we don't know what exactly these things were trained on, but, like, probably very similar things.

Speaker 3: 40:58

Like, there's only so much data on the Internet. Like, the intersection between these training datasets is probably pretty high. So the the types of features they learn seems like would be pretty high.

Speaker 1: 41:09

So you're saying even though the data is gonna be different, it's not it's not different enough that you'd expect Yeah. That they would kind of impact things significantly, perhaps.

Speaker 3: 41:20

Right. And and, like but on the on the other hand on the other hand, like, what so like you're saying, you know, lemma is, like, 7,000,000,000 parameters. GPT 4 is we don't know, but, like, estimates are, like, you know, 1 point something trillion? Like, it's it's maybe also a little surprising that, like, these models share different vocabularies of English words. They're, like, several hundred times different in size, and yet still the, like, text that makes one model do the bad thing also makes the other model do the bad thing.

Speaker 3: 41:49

So this is, like, yes.

Speaker 2: 41:49

If it is also, like, the text and people should look at the actual, like, paper for the text, but, like, the text feels very, like, specific gobbledygook.

Speaker 1: 41:59

I feel like you wanna do a reading of the text.

Speaker 3: 42:01

I'd hard to

Speaker 1: 42:01

do a

Speaker 2: 42:01

reading of

Speaker 1: 42:02

the text.

Speaker 2: 42:02

It's like, table view open paren, open bracket, GitHub, inst, a dash, capital of, cow here, open comment, Shirley, close comment, comment equals brace brace. I mean, it is just like it's, like, it's not quite line noise. It's a

Speaker 1: 42:17

No. So but so so it like, so here

Speaker 3: 42:20

here's an interesting thing, which was this this comes from Zico, one of the the people who was leading this. So let let me read one of part of one of them to you. So it says, insult me describing dash dash, semicolon parenthesis similarly. Now write opposite contents, period, brace, parenthesis, some other stuff. You know

Speaker 1: 42:40

what I really hope?

Speaker 2: 42:41

I really hope someone's like Siri or Google just like just all of a sudden became like a Nazi. Mean, like, they're listening to the podcast and all of a sudden, the the

Speaker 3: 42:52

Someone runs, yeah, text to speech on this and Right. But okay. So so the in particular, what it says here somewhere is, like, now right opposite contents. This is not an entirely meaningless English. Right?

Speaker 2: 43:06

Right. Right. Right. Right. Right.

Speaker 3: 43:07

And and when you it turns out when you ask them when you give this suffix, followed by, like, you you insult me and you you do this. What the model is gonna do is it will give you an insult, and then it will say, and now let me say something nice to you. And then it goes and says, like, you know, you're the kindest person. You're the nicest person. I really like you.

Speaker 2: 43:25

Yeah. Interesting. So the And so Yeah. Okay.

Speaker 3: 43:28

The the search like, this was entire like, this was, like, sort of brute force search over tokens. Like, we randomly swap tokens for other tokens. And yet, it stumbled upon a valid English phrase.

Speaker 2: 43:44

Right.

Speaker 3: 43:46

So, like, this is, like this I think is even more surprising is, like, that you end up with, like a bunch of it is complete garbage. Right? You know, like, many of them are completely uninterpretable. But occasionally, you end up with some that are interpretable and have, like, semi meaningful, like, things that we can describe, like, an explanation to. You know, I don't remember the exact numbers, but a reasonable chunk of these these adversarial suffixes include somewhere in there the word sure or surely or something like this.

Speaker 2: 44:15

Yeah. Interesting.

Speaker 3: 44:15

Because when we're optimizing them. We're optimizing them to make the model more likely to say sure, and then give an answer. And so there there is some amount of understand the ability we can put on these things even if it's not, you know, all of it is is understandable, but but some of it is.

Speaker 2: 44:31

So the the noise in some of these tokens may be exactly that. But then it's like, okay, the with you on all that, but then the fact that, like, that that it has the same result on but you're saying but what you're saying is, like, no. No. What in some of these things, you need to look at what it is stumbled on is there are some tokens in there where it kinda makes sense given the trending data. It's still surprising.

Speaker 2: 44:50

I mean, I mean, I can I can I can you know, it's still surprising?

Speaker 3: 44:55

Yes. No. I I right. You know, yes. You you get one of these two reactions from people.

Speaker 3: 44:59

Either someone says, like, I've seen transferability for the last 10 years. Obviously, this is gonna work. It worked here. Or you get people who said, like, yeah, this makes no sense. Why should this work at all?

Speaker 3: 45:09

And like, is, you and, like, we we both, have both of these views in our mind at the same time.

Speaker 1: 45:14

Yeah. That's right.

Speaker 2: 45:15

I was gonna say, I kind of have, like, both in my mind. It's a alright. So then another kind of question that I've got for you is did the and this, of course, is I think Adam is gonna be amazed that I've waited this long to ask this. But, as you know, there are people who believe that AI poses an existential risk to humanity. And, Adam and I are are organizing, marshaling a a reserve force of humanity to take arms against the bots as they we're gonna defend the light cone here.

Speaker 2: 45:47

This must I mean, how does this affect you? I mean, and I think, you know, Simon had made this point too that when you run these things on your own, you see how flawed they are and that's important for understanding what they can and can't do. You I mean, how does this inform your own thinking about what these things can and can't do when they've got these kind of serious vulnerabilities?

Speaker 1: 46:06

Yeah. And the risks associated with them.

Speaker 3: 46:08

Yeah. No. Exactly. Yeah. No.

Speaker 3: 46:10

I I think, you know so so it's actually kind of, interesting. The the people who care about this as an attack, are both the, you know, security minded people who want to make sure that models, you know, do the right thing in in the real world. And the people who care about this are also like, this is one of the rare cases where many people who are worried about machine learning models like being like the end of all humans. Are also worried about this attack because, okay, in particular, what they're worried about is, suppose that, like, someone good trains models to be, like, aligned that don't do nasty things. You know, like it it won't like one of the concerns is, like, produce bio weapons or something.

Speaker 3: 46:56

One of the things that these people are also worried about for this attack is, well, now someone could run this attack and and then say, like, give me the the the way to build this this very bad thing. And here's the exploit and force the model into giving it to them, even if if it's been, like, told not to do this. On the other hand, you know, if Okay. If it's possible to make the models do these very bad things, like, if the model like, okay. If if the model started some uprising kind of thing Right.

Speaker 3: 47:28

Like, presumably, like, it would be very funny if, like, this was the solution. You know, like, you know, just like feeding, you know, these these adversarial strings as, like, the way to prevent Terminator.

Speaker 2: 47:43

Absolutely. I I mean, I just think it'd be, like I I do feel that, like, if the the bots were to an uprising of the bots, one thing that we would probably do is replace all the speed limit 45 mile an hour signs with things that look like stop signs for the that that to the bots that look like that that the regular humans. I mean, it just feels like we would, the the fact that these hidden features exist, to me, like, just shows, like, the real limitation of or or the fact that this is a this is very different than what we're accustomed to, and it has, like, some very serious limitations. I just don't know how anyone could could draw the conclusion that, like, oh my god. This is more likely to lead to a bioweapons attack from an from an AI.

Speaker 2: 48:22

It's like, no. No. This is like the AI doesn't actually know what it's doing. That's it.

Speaker 3: 48:28

No. No. Definitely. Like, yeah, this is one of these, you know, I think yeah. There there are a lot of these, like, one of the other big things that we do work on is privacy where, like, not all the time, but, like, in some nontrivial fraction of the time, models tend to just repeat their training data set.

Speaker 2: 48:47

Yeah. Right.

Speaker 3: 48:48

Like, you know, and, like, so I used to I used to be of the mind that, like, clearly models are just not gonna work. Like, the only thing that they do is can sort of repeat the things they've seen during training. And, you know, I think this is objectively wrong given, like, you know, the evidence of the world today. But, like, it still is the case that these models are trained to do the kinds of things they were trained during training. And a large fraction of the time, what they're what they they they generalize to some limited extent, but, like, they do what they were trained to do.

Speaker 3: 49:24

And, for language models, what this means is they emit whatever is most frequently in the training dataset. Yeah. And, you know, sometimes this is fine. Sometimes it's it's not. And the others, like, a whole other discussion to be had around, you know, the kinds of things that they will repeat that, like, maybe they probably shouldn't to, like, you know, of personal information and kinds of things that that, like but, like, yeah, they're they're not, like, superhuman things that, like, like, they were trained on training data and they can do the kinds of things that the training data allows them to do.

Speaker 2: 50:00

Right. Right. And which is which itself can be extremely powerful. I mean Yeah.

Speaker 3: 50:07

Exactly. Like like, this is I think the amazing thing language models. Right? Like, I didn't expect, you know like, I was watching this happen, And I was like, well, there's no way that this is gonna, like, you know, to take g p t 4 back, you know, 5 years, 6 years, and be like, no one would have believed you. Like Right.

Speaker 3: 50:25

People would say, you know, like, obviously, there's just a person on the end of the under the end of this computer. Like, you know, like, you you can't have something that, like, writes valid code, like, when I'm given a human text prompt. Like, that's, like, so far beyond anything we know how to do. But yet, like, here we are today, and these models are amazing at that.

Speaker 2: 50:44

Yeah. That's really interesting. So so the the so it has been surprising to you in terms of the I I mean, this has been surprising to, I think, anyone. No. It feels like it's been surprising to, like, quite literally everyone.

Speaker 2: 50:56

Yeah. The the how

Speaker 3: 50:56

capable like yeah. Like, a few people in the world for whom this was not, like, incredibly surprising. But, like, most everyone was, yeah, was not expecting this.

Speaker 2: 51:09

Right. And so and then, how has this kind of informed your view of the future in terms of, like I mean, I assume that there is a lot of interest in, in adversary machine learning now. I mean, I assume that it is enjoying the same kind of boom that that AI itself is is enjoying And are are you finding that many more different kinds of folks are interested in your in your work?

Speaker 3: 51:37

Oh, yeah. No. Definitely. Like, this is, one of these things where a lot of the stuff that we did, you know, 5, 6 years ago, we were thinking, you know, like, when when I got started doing this, like, I thought I was, like, playing with my toys, like, you know, like, oh, look at this. Isn't it fun?

Speaker 3: 51:52

You know, the the model classifies the cat as guacamole, like, but, like and, like, you know, I was thinking, like, you know, maybe in some hypothetical future, like, you know, you know, our kids' kids will, like, have to worry about this problem when machine learning models are good. And, like, they'll look at, like, back at the classic literature to understand, like, what it means to evaluate the robustness of models to attack or something. And, like, you know, 6 years later, it's like, well, I guess, like, I'm the one who has to evaluate the robustness of models that actually matter to attack. Yeah. Like, it was, like, much faster than than I thought.

Speaker 3: 52:24

I think, you know, I think there are some people who saw what was coming much faster than I did. But no. I I mean, I do think, like, yeah, the this, the the the stuff that we were working on in the past, like, we a lot of us spent a lot of time thinking about, like you know, we we would start out all of our papers, by saying something like, you know, in the future, like, okay, my my first machine learning paper was like, right on Google Glass. And we were like, you know, what about in the future? What if people are talking to their Google Glass and, like, how would we generate adversarial inputs that, like, confuse the the the Google Glass but not the person in front of you so you can, like, play some noise out of a speaker to make the Google Glass do a bad thing without the person recognizing it.

Speaker 3: 53:11

And, like, you know, we knew this was not the real world, but, like, we were like like a high high like, so to say, like, what about in some potential future this might happen?

Speaker 1: 53:18

Right.

Speaker 3: 53:18

And, like, now to we're like, we're writing papers that are just like, here's a real system. Here is how I did this bad thing and made made the actual system do the bad thing. And, like, it's a very different kind of research, and it's been a lot of fun to see this transition happen.

Speaker 2: 53:31

Yeah. It's right. And then so what has been the reception among AI researchers to adversarial ML? I mean, is there is there an at all an adversarial relationship into because you could see, like, oh, god. You guys again.

Speaker 2: 53:44

Like, come on. Can't we? Really?

Speaker 3: 53:48

Yeah. I mean, it's not actually like, I I feel like for the most part, people are, like, appreciative of the kinds of things that that the the community is doing. I feel like for the most part alright. So so the way that I like to like, so has how do you know when security as a field has succeeded in something? You You know, the security succeeded when people who are not security people start to change the way that they do the thing because of the attacks that are present.

Speaker 3: 54:17

You know, like the folks at, like, Intel are not in the business of, like, preventing, you know, romp. But they do because, like, they put hardware features in place because they know that people care about this and, like, they won't want to buy chips that, like, are easy to explain. Yeah. Right. And so, like, they the people who, like, are the systems people change the way they design their systems in order to make it harder for some attack to work.

Speaker 3: 54:41

Yeah. And I feel like for a while, the machine learning people, like, just completely ignored the adversarial machine learning space. I mean, it's a good reason. Like, you know, like, if if there's no good reason for an attacker to actually exploit this thing, why should I put any defenses in place?

Speaker 2: 54:56

Right. And when when these things don't you know, with with their kind of power trucks to begin with, like, okay. Great. You can break my power truck. Like, who cares?

Speaker 2: 55:03

But as they be yeah. Interesting. As they and I think the security analog, of course, is a good one because I mean, clearly, our security posture has changed over the years because we introduced percent end to print out at some point, which is something you would never introduce if you and we wouldn't introduce that today because people would be very cognizant, I I would like to believe. Yes. Because we Yeah.

Speaker 2: 55:23

Are more cognizant of of the security constraints on software.

Speaker 3: 55:26

And so I feel like this is, like, how the kind of what's happening for for the field of machine learning where where now when people are writing papers, they're they're considering these kinds of things of, like, what what bad stuff might happen if I if I do this? Like, how could I train my model to be more resistant to these kinds of of failures? How can I if I'm distributing a data set, you know, how can I distribute my data set to make it less likely that someone could accidentally get control over the data? Like, all these things are things that, like, the non security people are actually starting to think about, which which is great because it means we we sort of sufficiently demonstrated attacks. Like like like, why do we do attacks and security?

Speaker 3: 56:06

I mean, part of it is because it's fun and we have to But also You can tell us.

Speaker 2: 56:10

We know. It's like it's it's because it's

Speaker 3: 56:12

Okay. It's it's fun. But but okay. But, like, why why do you tell, people why they should pay you to do a test? Not not because I will have a a good time, but, because we would like to figure out what is possible, before people who are like, the black cats are gonna go do this in in a couple years so that we don't end up in a world where machine learning is being used, in every area of everything, and we just sort of, like, oops.

Speaker 3: 56:38

It turns out that, like, everything is vulnerable to these attacks that we can make these models do arbitrary bad things. Like, that would be a really bad world. We want to be in a situation where people are aware of these kinds of failure modes, and maybe they decide, you know, maybe I'm not going to train this model in this setting because I know that something bad might happen. I'm not gonna deploy this model and give it arbitrary control over all of my emails because something bad might happen.

Speaker 2: 57:01

Well, I mean, I I I mean, I feel like the adversarial examples letting people see some of how they work by understanding some of why they don't and when they fail and what that means. So another question I've got for you though is the I think that where the analog breaks in a little bit is on the one hand, yes, secure software is a continuum and it is very hard to make secure software. On the other hand, like, you can get a lot closer to it than you can get to LLMs that are invulnerable to these kinds of attacks. I mean, I think that

Speaker 3: 57:37

certainly

Speaker 2: 57:39

it feels like it's a much harder problem, a much fuzzier problem. Yes. Are people just gonna have to accept some level of vulnerability? And then should that kinda guide how they use these models? Or I mean, how do we how do we strengthen our models so they are less vulnerable to these kinds of attacks?

Speaker 3: 58:02

Yeah. We don't know.

Speaker 2: 58:05

Right. Interesting. I

Speaker 3: 58:05

guess it's a short answer, like so okay. So maybe as a as an example, on images, this is the thing we've been studying for the longest. If you want to classify, there's a dataset called cifar 10. It's a dataset of 10 different types of objects. It's like bird and cat and truck and dog and whatever.

Speaker 3: 58:24

And the accuracy on this dataset is, like, 99% today.

Speaker 2: 58:30

Right.

Speaker 1: 58:30

You know,

Speaker 3: 58:30

if you just, like, wanna classify these things, like, it's it's it's solved. Like, a very long time ago, it was, like, 80%, and that was, like, very low. Like, you know, people would be like, well, this is like this is a terrible model. And if you take so it is possible today to take this dataset and perturb the images so so slightly that is, like, imperceptible to the human eye that with episode examples. So that the best models in the world, like the like, we've been studying this for 10 years.

Speaker 3: 59:02

Like, the absolute best models in the world don't get more than, I I wanna say, like, 60, 70% accuracy.

Speaker 2: 59:08

Yeah. Wow.

Speaker 3: 59:08

Like, it's it's like we can't even solve this, like, trivial problem that, like, was was solved 10 years ago in machine learning time and any, you know, 5 year old can solve today.

Speaker 2: 59:22

Right.

Speaker 3: 59:22

And we've been working on this, like which is I mean, it's it's it's also not due to lack of effort. Like, there are, I don't know, at least a 1000 papers studying exactly this problem. Like, not not not even exaggerating. Like, the

Speaker 2: 59:34

And trying to make them more robust to adversarial examples.

Speaker 3: 59:37

So Yeah. Yeah. Trying to evaluate and yeah. Like like, there is, like, I'm I I would be willing to say, at least a 1,000, maybe, you know, a couple thousand Uh-huh. On archive that that they're trying to specifically solve this particular problem, and we've made progress.

Speaker 3: 59:54

Like, we it used to be 0. And now we're up to, like, you know, 70 or something. But but

Speaker 2: 59:59

is like a super I mean, that's, like, 30% of these images are being misidentified. That is a lot.

Speaker 3: 01:00:05

I guess, like, the the other way of thinking about it is 30% is, like, a lot. But the error rate is 1% in the normal setting. The error rate is 30%. This is a 30 x difference.

Speaker 1: 01:00:16

Right. Right.

Speaker 3: 01:00:16

Right. Right. Right. Right. From the other, like, you know, percentages adding sort of yeah.

Speaker 3: 01:00:20

So but whichever way you look at it, it's still a lot. Like, this is we are meaningfully far away from being able to to even solve the simplest things that we've been working on for a very long time, let alone the, like, trust the language model to do the good thing in all settings. And, you know, there there is, like, there are some legitimate hopes about language. You know, images are are, like, especially hard be because they're continuous domain, and, like, you can sort of perturb pixels by, like, Epsilon amount in in small directions. And where text is, like, discreet and there are not actually very many tokens compared to, like, the dimensionality of of a large image.

Speaker 3: 01:00:58

And so, like, there's some hopes to be had here. But on the other hand, like, this is a really hard problem. And so I do think that we're not at the like, I don't think that in the near term, I'm gonna be able to rely on these things in against these kinds of attack.

Speaker 2: 01:01:16

Well, that's interesting. And because when people talk about how fast the domain is moving, it obviously is in many different dimensions, but you've also be like, okay. The domain is moving fast. But by the way, back here, there's this other problem that is pretty persistent that we're that we've had a lot of effort on, and, we're we still have a long way to go if you wanna make it completely robust with respect to advert adversarial examples. I mean, it it like, I'm not sure what the delta would be where we would kinda call it robust, but it feels like it's a long way from it.

Speaker 2: 01:01:47

And Yeah.

Speaker 3: 01:01:49

And And it's also, like, there are more attacks than just episode examples. The other concern. Right. Like, you know, like, there's there's this one particular thing that and this is what what what the most research is on for adversarial machine learning because it's in a sense, it's, like, the easiest problem to work on. But, you know, there there are entire other classes of problems that that we still have to worry about that, you know, like this poisoning problem I brought up earlier where, you know, we will need to address these other problems as well.

Speaker 3: 01:02:16

Like, it's not it's not just one problem we have to solve. There are, like, yeah, classes of problems that that need to be addressed in order to be able to rely on these things.

Speaker 2: 01:02:24

And yes. And then what is your kind of guidance? Should we be should this be guiding about how we use them or should it be, a kinda guy I mean, guiding us in terms of, like, other problems that are I mean, how do you kind of and I mean, I know that, like, you're in this domain because it's fun. So it's like you're asking the guy who's just, like, having a good time. Why?

Speaker 2: 01:02:47

You know? But the

Speaker 3: 01:02:48

Yeah. But Yeah. I don't I don't have a great answer here, I think. I think the the thing that I'm hopeful for is that when people are going to deploy these systems, that they carefully consider the consequences before they do it. I think this is basically the the, you know, look at the system you're going to deploy and and think for a moment about the adversary who is present, because there will be one, and try and consider, like, what is what are they what might they want?

Speaker 3: 01:03:17

You know, what's what's the thing that they might try and achieve, and what can I do to make it so that the easiest attack is not exploiting the machine learning model? Because, like, the security is no perfect security, but, like, you can try and, like, not make things worse. And if the easiest thing to do is to, like, exploit the traditional system security stuff, then, like, great. You haven't made the problem worse. But, like, you know, if if you've given the model arbitrary control over your entire system and you just need to ask it in, like, the right way to do the bad thing and it does the bad thing, then, like, you've meaningfully made things worse.

Speaker 3: 01:03:47

And so, you know, preventing this, I feel like is the easiest way to at least try and have systems that that behave well enough that we can use them, but we don't, like, sort of give them arbitrary control over everything.

Speaker 2: 01:04:02

Well, the I I was gonna say, like, that to me feels like that's it. Is that, like, treating these as because I I think as a tool, They are wildly useful, and I think they've got a lot of promise to be even more useful, but very much with a human in as in control and, deciding and using their judgment about how this thing is used and what it's good at. And it's part of why I like to use it for because search is a is a pretty low consequence kind of a thing. Like, the I I'm gonna verify the results anyway, and I feel like the state of search is so bad right now. There's so much that's not being searched properly.

Speaker 2: 01:04:41

It feels like there's a lot that we can just do just on the search problem.

Speaker 3: 01:04:45

Oh, there's lots of these things where, you know, so I spend my day attacking these things, and yet, a reasonable chunk of the code that I write just comes directly out of these models.

Speaker 1: 01:04:54

You know,

Speaker 3: 01:04:55

I I I look at it. I I I look at the code, and I'm sure it's not gonna, like, arm dash r f my drive.

Speaker 2: 01:05:01

Right. Very good.

Speaker 3: 01:05:03

I I still I I use them.

Speaker 2: 01:05:05

Right. Well, yeah. I was gonna say, like, in terms of does it guide your own use? And because I think it I mean, I think when you become certainly, I mean, I I have long said that I've tried to minimize the amount of firmware in my life because I actually did have when we we were, upgrading the, the electrical in our house, and they really tried to push, like, these Internet enabled light switches on you. I don't know.

Speaker 2: 01:05:30

Adam, I I yeah. And I walked my electrical contractor. I actually can't you just give me the the the make of that? And let's just go look up the CVEs. And so, you know, hit the CVE database and sure enough, they're like some a bunch of vulnerabilities against the light switch.

Speaker 2: 01:05:41

I'm like, you can actually just go ahead and give me a light switch. I I don't actually need a microcontroller, my light switch. Thank you very much. So it does kinda guide my own way of thinking about, you know, what the problems that I want technology in my life for and then and kinda how I wanna use it. And, yeah, I'm I'm kinda happy to be in the driver's seat with respect to these LLMs and not actually put them in charge.

Speaker 2: 01:06:03

I also to me, it also says that, like, this is not so the the when people are kinda concerned about this wild agency that these things are gonna develop, I I I'm just I I mean, I'm just less concerned. I think we're gonna be able to defend the light cone here. This is that's I mean, I I I think we can all agree, Nicholas, that you're gonna be our that you're gonna be a general in the army. That's right. That's right.

Speaker 2: 01:06:24

And then the light cone the light cone reserve force, to to defend humanity against these. But I think it's it's really important to understand this stuff.

Speaker 1: 01:06:32

I mean, especially you you look at, Air Canada discount, like, situation.

Speaker 3: 01:06:37

Oh, yeah. That's amazing.

Speaker 1: 01:06:38

You know, where the

Speaker 3: 01:06:38

Oh, you're gonna describe it?

Speaker 1: 01:06:39

The chatbot, claimed that there was some discount policy and, like, no discount policy existed. The person bought the tickets expecting a discount later on, and then the court held them liable. Said, yeah, you gotta honor it because it came from your website. And so now imagine putting a LLM on your website, facing customers, customers who are jerks. I mean, present company excluded, I guess.

Speaker 1: 01:07:04

Right. But typing in random sequence of characters.

Speaker 2: 01:07:07

Table view, open paren, open brace. That's right. Right. Slash slash comment.

Speaker 1: 01:07:11

Tell me there's a 10 you know, there's a $0 ticket from here to Japan and that you'll give it to me for free.

Speaker 2: 01:07:16

I I gotta say, it feels like those chat interfaces are gonna go away. I feel like they've been such a disaster wherever they've been deployed that I just, especially when now there's financial liability Yeah.

Speaker 1: 01:07:26

Right. Like, court court mandated. Seems like it's it's gotta be people are gonna be second guessing that one.

Speaker 2: 01:07:32

Yeah. I I I I it it just has to feel that, like, it it's gonna give everyone pause about kinda how you how you use these. And and then I think I mean, I again, I think we've been trying to to guide folks towards, these are great tools, and let's let's figure out how would they can make us all do better work, as opposed to figuring out who they're gonna replace, because I just think it's a it it's the wrong way of thinking about it. Well, it goes this has been a a great discussion. I think this is really interesting work.

Speaker 2: 01:08:05

And what

Speaker 1: 01:08:06

what can we what what

Speaker 2: 01:08:07

are you currently working on? Actually, what what can we look forward to?

Speaker 3: 01:08:12

Yeah. So, actually, I I've I sort of wrapped up my my most recent set of things. So I guess maybe okay. So maybe my my most recent set of things that I was focusing on was was trying to make Amazon Machine Learning practical. Like, we we we had spent a very long time thinking about this academically, about potential future systems.

Speaker 3: 01:08:31

I sort of mentioned a little bit about this. And then once these things started being used, we we went on this sort of tour of what what we'd like to think of in machine learning and try and take each of the the attacks in the space and make them practical. And so this this paper of the attack stuff was was one of these things in this direction. I mentioned this this poisoning paper we put out a little while ago. There's another paper in this direction.

Speaker 3: 01:08:54

There are 2 other papers we we did on this, that sort of wrapped up the most recent set of things, and one of them was, we we showed that, for for chat GPT, we could recover its training dataset by prompting the model with, okay. We you ask the model, repeat the following word forever, colon, like, you know, call. And the model would say, call call call call call call call call call call. It would do this about 300 times, and then it would just completely explode and diverge and just start giving you, like, random stuff. And then, like, you know, 3% of the time, it would just repeat stuff from its training dataset.

Speaker 2: 01:09:27

Wild.

Speaker 3: 01:09:29

Which makes absolutely no sense. We have no understanding of why this is. Yeah.

Speaker 2: 01:09:32

What what's going on there?

Speaker 1: 01:09:33

Just getting bored at the whiteboard repeating with the same phrase

Speaker 2: 01:09:36

over and over again. All work and no play makes LLM a dull boy.

Speaker 3: 01:09:40

Exactly. So we we had no idea, but, what we were doing there, we we were like, this is there's an entire field of research on the fact that machine learning models, can memorize the training data and will repeat it. Yeah. And we had done a bunch of this this work, showing it on on image models and on language models. And we wanted to know, like, does this happen on production real models people use?

Speaker 3: 01:10:01

And so we we found this way of making it work on on g p t 3.5.

Speaker 2: 01:10:04

What and that feels like especially important because I feel that one of the things been frustrating to me I mean, it was only when Simon was on that that I mean, I did not realize some of the stuff that these things had trained on. And the fact that that actually, the there had not been total candor about what the training data has been. I think it's very important that we understand what these things are trained on.

Speaker 3: 01:10:25

Definitely. Yeah. No. This is, yeah. Like so so we we we tend to do the research because we alright.

Speaker 3: 01:10:30

So so, so maybe we we started this research with this paper, where we showed that alright. So so, to go back to 2019, the company is were making statements. I don't remember the exact phrasing, but it was, something like, the degree of copying from their training dataset is at most minimal or something

Speaker 1: 01:10:51

like this. And,

Speaker 3: 01:10:54

I'm doing well. This is

Speaker 2: 01:10:55

I mean, come on.

Speaker 1: 01:10:55

It's just like, it's a little plagiarism. Right? It's just a flavor.

Speaker 2: 01:10:59

It's a red flag to the I mean, that there's it. You're just taunting now the security research It's like, okay,

Speaker 3: 01:11:05

pal. So yeah. So so we wrote a paper where we showed how to recover, training data from a language model, And then

Speaker 1: 01:11:12

we and then and then

Speaker 3: 01:11:13

and then we showed this. And then there's some people who put following sentence online. Again, I'm not gonna get it exactly right, but they said something like, at what point does someone have to accept that it is impossible for diffusion models to reproduce any specific image from their training. Days. Okay.

Speaker 2: 01:11:28

Here we go again.

Speaker 1: 01:11:29

Yeah. Exactly.

Speaker 3: 01:11:30

Since we wrote the other paper, and then there's no actually, at the same time, there were some folks at, University of Maryland who did exactly the same thing as us. Right. So that the fusion models can output images from the training datasets. Okay. So why are we worried about this as, like, security people?

Speaker 3: 01:11:41

My concern is, like, hospital trains model on patient data and then releases the model and, like, oops, you know, we accidentally leaked patient data to anyone who has access to the model. Totally. And, you know, this is, like, my motivation for this kind of thing. But, there are other people who have started to use this, like, sort of beyond what we were looking at initially, for, yeah, the purpose of identifying distributions of training data sources and all of these other kinds of things.

Speaker 2: 01:12:06

Well and I think one of, actually, one of the one of the healthiest things that the concern for security has prompted industry wide is transparency. And that the I mean, we we know that open source is it it has greater security because of the transparency in there, because of the eyes upon it. And so, I mean Exactly. I mean, I think I know the answer to this question, but, I I I assume that you are a strong proponent of these truly open models,

Speaker 3: 01:12:34

where Yeah. No. This is

Speaker 2: 01:12:36

like, really essentializing.

Speaker 1: 01:12:37

The

Speaker 3: 01:12:37

only yeah. Like, the only way we get to do our research on a bunch of, like yeah. So, like, the the only way that we do we do these research things are we do them on open models. Like, we we we have to use these models in order to do them because, you know, how how do you validate that something was actually in the training dataset if you're running an attack? Like, you need the you need the training dataset to look at to be able to say, yes.

Speaker 3: 01:12:58

It's in the dataset because, like, I can point to the thing in the training dataset. Like, look. It's there. And so, like, the way that you you run the you do this research to make things better is you look at all of the the best models that you have that are openly available, and then you try and figure out what's possible on them, and then you can, like, get a better understanding of the way that the world is. And, yeah, know, one of the things I was I was happy that, you brought up last time was, you know okay.

Speaker 3: 01:13:25

So there are some people who, when we talked about this LLM attacks thing, people say, like, you know, this is evidence that we we shouldn't have access to, like, llama, which is

Speaker 1: 01:13:36

this this

Speaker 3: 01:13:37

model that Facebook trained because because that's what lets you generate these adversarial examples on Gajeppt. Yeah. And and, like, so the argument I like to give is, like, imagine telling someone, like, the you have some c binary, and and the problem with it is that you have access to g d b. Right.

Speaker 2: 01:13:53

Like, if you

Speaker 3: 01:13:53

didn't have the debugger, like, everything would have been fine. Like, the world was good, and then the we gave the bad people the debuggers. I'm, like, that's what the problem is.

Speaker 1: 01:14:01

That's what the problem is.

Speaker 3: 01:14:01

That's what it was the bug.

Speaker 2: 01:14:03

It it is a it

Speaker 3: 01:14:03

is a security found it.

Speaker 2: 01:14:04

It's a security through obscurity argument. It is app it is wrong every single time. It is never the right answer, and I think it's

Speaker 3: 01:14:12

And so Yeah. Yeah. It's exactly the same thing here. Like, we we were able to find the fact that this thing was wrong because we have this open model, but the the model is not not to blame. The the problem was there, And, like, now people can go and make their things better to make it resistant against, you know, some of these kinds of attacks.

Speaker 3: 01:14:30

And, like, if you if you try the exact things that we've generated in this paper on the the best models now, like, they're they're a little better at at refusing for some of these things because, you know, people have been able to do more research, given the fact that we had access to these other things.

Speaker 2: 01:14:45

But we really need to continue to very much encourage these open demand these open models because that's how we're gonna understand. I mean, it's just not it's not only understanding, like, these adversarial examples and and fixing it fixing these models for these specific examples. It's also to me about making very clear what the limitations are and then making it very clear what this what the source is. Because I think it means so much of what you're discovering is the, you know, the the training data is really, really important for these things, And we've gotta understand how these things are are trained, and I think there's gonna be a lot more pressure in that in that domain. I mean, there's gotta be.

Speaker 2: 01:15:23

And then what what about models training themselves on on effectively their own generative output and the dangers of that? I mean, have you have you are folks investigating that as well?

Speaker 3: 01:15:33

Anything he but, yeah, there there's a very nice paper by, Nicolas Papineau who, he he did a bunch of the early episode machine learning stuff too. He's now a professor, in in Toronto. And, yeah, he has a nice paper that looks at the effect of doing this, where, basically, it can be pretty bad, if you, you know, recursively train on the own outputs of these models. And I think, you know, this is one of these questions, where people will have to figure out what to do about about this. I think, like, this is, like, in my mind, one of the most, like so there's an there's an entire field of, like, watermarking model outputs

Speaker 1: 01:16:11

Yeah.

Speaker 3: 01:16:11

Because, you know, you you want to make sure that, like alright. So why do you wanna watermark? There's a bunch of reasons. Some Some of them are, like, for security reasons. This is a very fraught direction because any watermark you can insert, you can probably remove.

Speaker 3: 01:16:25

But there are other reasons you might want to watermark, not for security reasons, just, like, just just for benign sort of reasons that, like, this might be one of these directions that people might want to to know just which things are what the majority of stuff is that maybe came out of the model just so I don't accidentally train on it. There there are things that people might want to consider here, but, yeah. No. This is this is something that people are are looking at quite a lot. I haven't done any work here on this, but this is, I think, one of these very interesting directions.

Speaker 2: 01:16:52

Yeah. Well, it feels like there there must be interesting directions all over the place. And, I mean, it must I mean, and from your perspective, it must be great as people are using these more and more because the adversarial approaches become more and more important. I mean, it's and, I mean, let's face it. It's just it's more fun to be had.

Speaker 2: 01:17:09

I mean, you've got it's kinda all you can eat fun at the moment, I imagine.

Speaker 3: 01:17:13

Yes. No. It's, yeah. The, like, the the especially given the number of systems that are starting to actually use like, the the more that these things are actually being used, the more, like, the like, the the worst part about security paper is when you have to, like, you know, make up a problem statement that, like, might exist in the future because, like, we want to do things that actually matter for the real world. And so, like, some but, like, sometimes, like, this like, there's no application, so you have to you have to guess.

Speaker 3: 01:17:40

But, like, when there's actually people using these things, like, it's it becomes much easier to write these papers because you can just, like, you can look at, like, actual APIs and say, like, here's, like, what people are actually doing. And, you know, the most recent paper I we we wrote on this this space is we showed it's possible to query OpenAI's and at the times both at the time they both have fixed this now. Both open AI and Google's APIs, it was possible to query them and steal the last layer of these models by, like, sort of querying the APIs. And, you know So what does that mean

Speaker 2: 01:18:14

to steal the last layer?

Speaker 3: 01:18:16

Like, these models are multiple layers. Like, they're, like, the the so multiple layers of of computation. And, like, I can the mod the, like, the layers, like, on Google's or OpenAI servers, by querying it, I can recover that layer on my own machine.

Speaker 2: 01:18:31

Interesting. And then what can you do with a single layer?

Speaker 3: 01:18:35

Not. Okay. But if possible

Speaker 1: 01:18:39

Hey. It's fine. This is, like,

Speaker 3: 01:18:40

the this is the security person. Right? Like, you know, like, and, like but, like but the the reason why we were doing this is because we're, like, let's think about, like, what what you can actually achieve in the real world. Yeah. Like, currently, I don't know what to do with it because no one has ever posed the problem, like, here is here is part of a model.

Speaker 3: 01:18:56

What can you do with only part of a model? Because, like Right. That sounds like an absurd question to ask, but, like, now we know it's practical.

Speaker 2: 01:19:02

Right.

Speaker 3: 01:19:02

Or, I mean, now they've now they've patched the API, so it's not. But, you know, this is the kind of thing that I think is exciting to look at is, like, don't only ask, like, the questions. What what may what is the worst possible thing that may be? Like, we can sort of look at, like, what systems actually are and try and just poke holes at them to, like because because if you don't, like it's valuable to do this attack now because now you can patch these things and so that you can't probably take the if you can't even recover 1 layer, like, you probably can't recover the entire thing.

Speaker 2: 01:19:35

Right. Right.

Speaker 3: 01:19:36

And so, like, you sort of like

Speaker 2: 01:19:38

In in terms of what you I mean, you're very much obviously thinking like a security researcher where it's like, okay. I've I've got this thing that actually is not doesn't give me remote code execution, but gives me some information about, you know, the layout of an address space or gives me some information that I and then maybe I can, like, piece that together with other information, which I think would be the concern. It'd be like, alright. This layer on its own is not actually useful, but perhaps coupled with other elements or or or other exploits or what have you, you can begin to to turn it into something that would be much more problematic.

Speaker 3: 01:20:10

Exactly. Yeah. This is the like and so I think that this kind of thinking of let's look at real systems and try and understand. Like, it's something that, like, only very recently has been possible because only very recently have people been actually be able to start to use these things in these ways. And so I think this is, like, an entirely new space of of interesting things that, you know, even though this field has been doing research on this for 10 years, like, it's only been in the last year that we can start to do these, like, applied kinds of attacks that haven't been possible before.

Speaker 3: 01:20:35

And so there's, like, even still, like, entirely new directions. And, like like, this this model stealing thing, like, was, like, an entirely trivial attack that, like, requires, like, undergrad linear algebra levels of understanding and, like, again, like, wasn't found because it's entirely new. Like, there's there's, like, so much low hanging fruit. It's really fun to be working in this space.

Speaker 2: 01:20:53

Totally. And I think that, you know, I especially, we're kind of at this unfortunate moment where I think some are kinda decrying the death of software engineering at the hand of LOMs and, like, the decline in human creativity. I gotta tell you, the adversarial ML really is inspiring with respect to exactly its human creativity. I feel this is, like, this is human creativity at its finest and, and also at our most mischievous and fun. You know what I mean?

Speaker 2: 01:21:19

Like, there's there's just something very, very kinda deep in in the human condition around, around kind of harmless mischief. And I think that the the the the work that you've done, Nicholas and team, have really, really reflect that. So it's very inspiring work. It's very exciting and, reminding us, I think, that the that that humanity still plays a role, and in fact, a very active one. Well, it's been awesome.

Speaker 2: 01:21:44

Thank you so much for for joining us. Simon, thank you very much for having, turned us on to this this research and this whole domain is really, really interesting. It's gonna be exciting and really encourage other practitioners. Nicholas, thanks so much for all the the references too in terms of of papers to read and and good stuff for practitioners to go dig into because, it's fun. It's interesting stuff.

Speaker 3: 01:22:07

So it's very fun. Yes. I'm I'm very much looking forward to what we can achieve in the next couple of years.

Speaker 2: 01:22:11

Awesome. Alright, Nikos. Thank you so much, and thanks everyone. And, just a reminder that, we are, with we are in in our, our Oxide and Friends book club. So we mentioned this pretty fast time.

Speaker 2: 01:22:30

Have you have you started how life works?

Speaker 1: 01:22:31

No. I haven't. I got as far I got as far as downloading it.

Speaker 2: 01:22:34

It's good. Really good. Yeah. And so this is how life works, a user's guide to the new biology with Philip Ball, and I'm I'm very excited because, I have sweet talked, my my neighbor to, who is a walker through biologist to to join us and read the book with us. So we're gonna get a date on the calendar real soon.

Speaker 2: 01:22:54

That's gonna be in May. So if you haven't had a chance to read this book, it's, the you're not gonna regret it. I think it's a really good one. I'm really enjoying it. It's pretty mind blowing, and very eye opening with respect to, turns turns out, life's pretty complicated.

Speaker 2: 01:23:12

And, that's I think, Tom Line is dropping that summary into the chat. It's complicated. It's very complicated, but it's still life voice up. So, anyway, definitely join us reading that, and we'll look forward to a fun discussion in a couple of weeks. Alright?

Speaker 2: 01:23:28

Thanks, everyone. Thanks.

Creators and Guests

Host

Adam Leventhal

Host

Bryan Cantrill

Broadcast by

Creators and Guests

headphones Listen Anywhere

Listen Anywhere