Oxide and Friends | Transcript: Adventures in Data Corruption

Adventures in Data Corruption

July 10, 2025 / 01:43:36/S5 E23

Adam Leventhal: 00:00

Hello, Bryan.

Bryan Cantrill: 00:01

Hey. How are you?

Adam Leventhal: 00:02

Good. How are

Bryan Cantrill: 00:03

you doing? I'm doing well. I'm doing well. Every time I log into Discord no. You know, tautologically, every time I log into Discord, I am getting older, obviously.

Adam Leventhal: 00:15

It's the oldest you've ever been using Discord.

Bryan Cantrill: 00:18

Every time I've used Discord, I have older children. So that is just a statement of fact. Yeah. That said, it feels like the average age of Discord or at least the people in product of Discord is getting younger faster than I'm getting older.

Adam Leventhal: 00:32

It's They're still the same age. Yes. They do.

Bryan Cantrill: 00:35

But, like, some of these updates are like, am I it's just is this me? I mean, it just feels like I don't know.

Adam Leventhal: 00:42

I I Feel like if you click enough layers deep, I feel like I always get to a screen which is like, basically, like, I don't like what I just saw. Please don't show me this again. And and then it shows me it again, but it it seems to have enough buttons for the oldsters.

Bryan Cantrill: 00:58

Yeah. I just kinda feel like, you know, I always I I always tell, parents with the scout troop that you should never linger by the tent of a bunch of 15 year olds. One, it's invading the privacy. But two, you may hear some things that you can't unhear. And I I just frequently feel with Discord that I am accidentally lingering by a tent that I don't I'm like, why what is I just I don't know.

Bryan Cantrill: 01:21

I I I get a get a good it's Yeah. Creeps me out it. That's what I'm saying. Go like Discord. Like, Discord, we're not all teenagers.

Bryan Cantrill: 01:29

Like, we we there are some adults using this thing. Right? Maybe that's it. Are we the only ones? Is that possible?

Bryan Cantrill: 01:35

I mean, we I'm

Adam Leventhal: 01:37

not sure this is, the perfect intended demographic in use, but

Bryan Cantrill: 01:43

but

Adam Leventhal: 01:43

they are getting, like, you know, dozens of dollars out of us per year. So Dozens. Yeah. You know, maybe

Bryan Cantrill: 01:51

Okay. I didn't or okay. Maybe maybe maybe only one dozen. Yeah. A dozen dollars, that ought to go somewhere.

Bryan Cantrill: 01:57

Let's say

Adam Leventhal: 01:58

That's right. It's a market.

Bryan Cantrill: 02:01

This one's overdue. This today's topic is overdue.

Adam Leventhal: 02:05

I I It goes without saying I'm sorry. I'm so excited. I'm so excited because I didn't realize that the events we're about to talk about occurred, like, two years ago almost to the day.

Bryan Cantrill: 02:15

Two years ago almost to the day.

Adam Leventhal: 02:16

Yeah. And I yeah. Then two years ago plus one day, I started hassling you about having an episode about it.

Bryan Cantrill: 02:24

I don't know that it was two years ago minus a day. Okay. Okay. Because so part and we're gonna get to this this wild tale, but we were legitimately concerned that we had stumbled across a very serious microprocessor vulnerability.

Adam Leventhal: 02:44

Yeah. I guess it's fair. I mean, I think I think I asked about it.

Bryan Cantrill: 02:47

We had to follow AMD's PSIRT process to make sure that we were engaged in responsible disclosure. Yeah. But then they responsibly told us to go pound sand. Yeah. That's a feature.

Bryan Cantrill: 03:00

Right? Yeah. Exactly. So, I guess we could have done the podcast at that point. But then then then we needed the then we needed a cooling off period after that.

Bryan Cantrill: 03:06

So, you know, you you you know what I'm saying? You add all these things up, it adds up to about two years.

Adam Leventhal: 03:10

So here we are.

Adam Leventhal: 03:11

That's right.

Adam Leventhal: 03:12

That's right. Perfect.

Bryan Cantrill: 03:13

No. It is that obviously you did the same search that I had kind of mentioned to you this morning: if you look for TLB speculation in all of Oxide chat, or at least in our personal chat history, it is basically you over and over again suggesting that we do this. And me not even really dignifying it with a response, although occasionally laughing at it.

Adam Leventhal: 03:34

There well, I I do like, at one point, we were sort of like, what should we do? And I said, TLB speculation or OMDB? And you're

Bryan Cantrill: 03:41

like, oh, OMDB. I'm like OMDB. Other one. Why? The other one.

Adam Leventhal: 03:44

Why did I give them two choices? And then I switched to just trying to convince you it was your idea, but like a classic maneuver.

Bryan Cantrill: 03:51

Oh oh oh, that is classic maneuver. Is that what worked? Is that what finally got us here? Because, you know, I did I was so suddenly struck by this. We should do the episode on this.

Bryan Cantrill: 03:59

And I I know where that thought came from. Like, I am I tried to piece together where that thought came from, and

Adam Leventhal: 04:04

I I didn't know. Accepted you. Yeah.

Bryan Cantrill: 04:06

Exactly. It occurred to me completely independently of anything else. Yeah. Yeah. I well, I mean, of course, you're right.

Bryan Cantrill: 04:13

And There we go. I okay. So I think that and then there was, like, the the this this piece of reason and that for sure. But I think then then it was like, okay. I there's just a lot of context I've gotta go remind myself of and ramp up on.

Bryan Cantrill: 04:27

Yeah. But as I was ramping up on the context, I'm like, you know what? This really does merit a podcast. Like, I mean, how is it that you you know, actually, exactly. You know, people think that, like, we are chronic overshares, which we are, but there are actually some things that we have not had a podcast episode on.

Bryan Cantrill: 04:45

And the although there'll be there'll be one fewer such things after Yeah.

Adam Leventhal: 04:49

That's right.

Bryan Cantrill: 04:51

And I'm really glad that John and Rain are here, because I think part of what makes this so wild is that the symptoms of this are so far removed from the underlying root cause, and we just never would have imagined. And there were a couple times along the way where I think that we had pretty strong conventional wisdom about where the bug was emanating from, and we were all wrong. So, John, would you mind kicking us off? Because I think you're the first person to run across the ramifications of this. Is that right?

John Gallagher: 05:36

Yeah. Yeah. That's right. I I was gonna ask you if context is okay, but it sounds like context is exactly what you want.

Adam Leventhal: 05:42

Yeah. Exactly. It's it's strongly desired. Weird.

John Gallagher: 05:44

Okay. Yeah. So I I have I took notes because there's a lot there's a lot of stuff even before we get into the actual problem. Okay. So have we talked about the MUPDATE sled recovery process on the podcast?

Bryan Cantrill: 05:54

I was gonna say, I don't think we really have, and if we have, it merits review. I don't think we have. So I would actually like you to go into kinda what this thing is doing, actually: what the actual problem is that we're solving here.

John Gallagher: 06:06

Okay. Alright. So we have a rack, and let's say that we have a sled in it that we need to make sure has the latest version of all software. It is possible that that sled can't boot, like say one of its M.2 drives, the one hosting the OS, has died.

John Gallagher: 06:22

It's been replaced with a new one from the factory. Or even suppose the sled, like the software on the sled itself, is ancient and it has no hope of possibly talking to anything else in the rack. So we want to assume the sled itself is completely unable to help us and we need to bring it up to speed. So we have a sled recovery process. I'll probably call it a MUPDATE a bunch because that's what we call it internally for reasons that are probably not relevant here.

John Gallagher: 06:48

But the way that this works is we upload a TUF repository, which is a zip file containing sort of like all the software that we might possibly want in a given release. And it we we upload that to yeah. Go ahead.

Bryan Cantrill: 07:04

Yeah. Just to be clear, just because, you know, the kids do say, like, the highest praise that the youngs have for anything is tough, T-O-U-G-H. So this is TUF, T-U-F. I mean, it is tough in the kind of Gen Z sense, but it's also TUF, T-U-F, The Update Framework. So sorry.

Bryan Cantrill: 07:20

Yes. Forgive me the interjection.

John Gallagher: 07:21

No. That's good. It's hard to pronounce acronyms, I guess. So we upload a TUF repo, T-U-F, which is a zip file with some... It contains some hashes and metadata and signing and all this sort of thing. We upload that to one of the sleds that is functional.

John Gallagher: 07:37

So the sled that we're driving the process from, this goes through a service called Wicket. It has a very cool terminal interface. I know we've talked about that on the podcast before.

Bryan Cantrill: 07:48

Oh, yeah.

John Gallagher: 07:49

Wicket. Wicket. Wicket is great. So Wicket unpacks the zip file and then keeps the contents of it in memory for the duration of this sled recovery process, which is relevant to more of the context later. So the way this works is we've the sled in the rack that that we assume can't boot.

John Gallagher: 08:07

We do need to be able to talk to its service processor over the management network. That is a pretty safe assumption to make: either we can talk to the service processor or the sled is just completely dead. So when talking to the service processor, we can set a bit saying: when the OS boots, tell it, instead of looking on its local M.2 drives for the OS image, to ask you for it instead. And you, by extension, you service processor, are supposed to fetch it from me over the management network.

John Gallagher: 08:36

And I will serve it to you like a block at a time over UDP, over the management network. And you then turn around and stream it to the host CPU over a UART running at like three megabits. So this process is slow. Right? We don't want to... like, we did some sort of napkin math at the very beginning for the size of our, like, real OS images.

John Gallagher: 08:59

We're gonna be sitting there for hours waiting for this thing to be streamed over the UART. So instead, we send over a very stripped down recovery image or trampoline image. If we start posting links to code, you may see both of those names. That is sort of like the minimal thing required to bring up the networking stack and then run a program called Installinator. And Installinator's job is to ask the service processor... sorry, Bryan, did you?
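
The napkin math John mentions is easy to reproduce. A minimal sketch, assuming the UART delivers its full 3 Mbit/s line rate with standard 8N1 framing (10 bits on the wire per data byte) and no protocol overhead; the image sizes are hypothetical and do not come from the episode:

```python
def uart_transfer_seconds(size_bytes: int,
                          baud: int = 3_000_000,
                          bits_per_byte: int = 10) -> float:
    """Best-case time to push size_bytes through a UART.

    bits_per_byte=10 assumes 8N1 framing (1 start + 8 data + 1 stop bit).
    Real throughput will be lower once any framing protocol is added.
    """
    return size_bytes * bits_per_byte / baud


MIB = 1024 * 1024
GIB = 1024 * MIB

# Hypothetical sizes, chosen only to show the orders of magnitude involved.
small_image = 100 * MIB   # a stripped-down recovery image
full_image = 4 * GIB      # a full OS image

print(f"small: {uart_transfer_seconds(small_image) / 60:.1f} minutes")  # about 6 minutes
print(f"full:  {uart_transfer_seconds(full_image) / 3600:.1f} hours")   # about 4 hours
```

Even in this best case a multi-gigabyte image takes hours, which is the motivation for streaming only a small trampoline image over the UART and fetching everything else over the real network.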

Bryan Cantrill: 09:29

I was gonna say that one thing I love about this is you're talking about this as the recovery path, and I love the fact that we deliberately use the recovery path all the time. Because, you know, I think one of the many object lessons, not of this bug per se, but of this design, is that a recovery path that is unused will also be untested. So the great thing is that we know we can recover when we have a sled that has never had anything on it, because we are required to do this recovery path all the time.

John Gallagher: 09:58

That's right. I neglected to mention this, but this is the way that we did, and are still doing most of the time, updates to sleds, just like normal updates. We wanna update the software, we'll treat you like you're dead and go through the recovery process. At the time that this bug happened, I mean, we had started doing that, but it was pretty early in the life cycle of, like, actually using this process regularly. Yeah.

John Gallagher: 10:22

Alright. So we've put this sled into recovery mode. We're streaming in the stripped down OS image that contains the networking stack plus Installinator. It takes something on the order of fifteen minutes to stream in this very stripped down OS image. Once that thing boots up, Installinator starts up, and it then asks the SP, also over that same UART that it itself was streamed over, what are the hashes of the real OS image, like the full size one, and a hash of like a tarball of all the control plane software, right?

John Gallagher: 10:56

All the services that run that are not part of the operating system. Then it uses some other stuff from our networking folks, Maghemite, etcetera, to find peers on the Bootstrap network and ask them, hey, which of you have, like, an OS with hash blah blah blah and a control plane software tarball with hash blah blah blah? In practice, the only one that does is Wicket, the place where we uploaded the TUF repository because it's driving this whole process all along. So Installinator finds Wicket on the Bootstrap network, says please give me OS with hash blah blah blah, and Wicket sends it over. This is where we're on, like, the real network, so this takes seconds, right, regardless of the, you know, the fact that the real OS image is considerably bigger.

John Gallagher: 11:37

It does the same thing for the control plane tarball. It then writes the OS image to the boot partition of both the M.2 drives. It writes the control plane software to the right place on both the M.2 drives. And then it'll tell Wicket, you know, I'm done. I've written everything.

John Gallagher: 11:51

I'm recovered. And Wicket then powers the sled off, reboots it, you know, unsets the recovery bit and tells it to just boot normally. And at that point, we're on our merry way and we've got a recovered sled through this whole MUPDATE sled recovery path. If Installinator fails at any point along the way, it will tell Wicket that it has failed and Wicket basically stops at that point. And this is sort of important.

John Gallagher: 12:17

There are a bunch of ways that it can... exactly. Like, there are a bunch of ephemeral errors that, you know, don't cause a permanent failure. Like if we try a peer on the bootstrap network and it doesn't have the data, we just try the next peer, etcetera. Right? But there are a lot of ways that it can fail where Installinator just by design is like, I don't know what I'm supposed to do here.

John Gallagher: 12:34

I'm just gonna give up, report the best error message I can. And a support operator who's already involved, they have to be on the tech port anyway to have uploaded this TUF repo. They will have to look at this error and then log into the sled and see what's gone wrong. Okay. So that's background part one.
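
The failure policy John describes (move on to the next peer for ephemeral errors, give up with the best available error message otherwise) can be sketched as follows. This is an illustration in Python, not Installinator's actual code; `TransientPeerError`, `FatalInstallError`, `fetch_by_hash`, and the `fetch` callable are all hypothetical names introduced here:

```python
class TransientPeerError(Exception):
    """Hypothetical: this peer is unreachable or lacks the artifact."""


class FatalInstallError(Exception):
    """Hypothetical: nothing sensible left to try; report and stop."""


def fetch_by_hash(peers, wanted_sha256, fetch):
    """Ask each peer in turn for the artifact with the given hash.

    `fetch(peer, sha256)` is a caller-supplied function that returns the
    artifact bytes or raises TransientPeerError. Any other exception is
    treated as fatal and propagates unchanged.
    """
    errors = []
    for peer in peers:
        try:
            return fetch(peer, wanted_sha256)
        except TransientPeerError as err:
            # Ephemeral failure: remember it and move on to the next peer.
            errors.append(f"{peer}: {err}")
    # Every peer failed: give up with the most informative message we have.
    raise FatalInstallError(
        f"no peer served {wanted_sha256}; tried: " + "; ".join(errors)
    )
```

The design choice worth noticing is that the fatal path collects everything it tried, since an operator will be reading that message before logging into the sled.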

John Gallagher: 12:51

Is that the sled recovery process in five minutes? Is that

Bryan Cantrill: 12:53

Yeah, that's great. Yeah. Excellent. Okay.

John Gallagher: 12:56

All right. So background part two is that we started to use this process regularly to update all of the sleds in the rack that we have in the office in Emeryville. And we had a I think Sean ran into this. I'll post a link. So Sean was on the, we call it dog food duty that week where he was updating all of the software in the rack.

John Gallagher: 13:20

And what happened was Installinator had reported success and then Wicket rebooted the sled and the sled did not boot. And this is sort of weird because Wicket claimed the sled recovery process is complete. Installinator told me it was successful, but then the sled never came back up. And when Sean got on the console, the error message there indicated that the OS image was corrupt. So the OS image has a hash in it.

John Gallagher: 13:46

It's split into phase one and phase two. Details aren't super important, but the OS image sort of does a check at the very beginning. It says, did I find like the exact SHA-256 match of the image I expect to be here? And if not, I refuse to boot because I don't know what I'm supposed to do in that case. So I saw this and assumed, naturally, I think, that I had screwed something up in terms of writing to the disk, right?

John Gallagher: 14:08

So like the very first time we developed this feature, I haven't even opened the pull request yet. And I run it from a test the very first time. And like, Installinator finishes, Wicket powers it off and it comes back on and like, there's nothing there. Like the OS image didn't finish writing. There are no control plane zones at all.

John Gallagher: 14:27

I'm like, oh yeah, I'm an idiot. Like, I did all the writes and then I immediately powered the sled off without, like, fsyncing the writes. Right? So like all this was like still in development. So I

Bryan Cantrill: 14:37

Which you should not have to do, by the way. I mean, I think it speaks highly of you that your first thought is the problem is in my code. Adam, I don't know about your first thought. My first thought is like, oh god. The problem is, either we have data corruption...

Bryan Cantrill: 14:55

Somehow we've got data corruption in ZFS, which is bad. ZFS is not in the business of data corruption. And the only time I have seen data corruption in ZFS has been when it has been handed corrupt data, or we've got something much more pathological that's happening in the hardware or system software. So, John, my mind is going to, like, much, much darker places. But in the same vein: somehow there was corruption that happened between writing this data to disk and it arriving on stable storage.

John Gallagher: 15:31

That's right. So I went back and double checked. And when I was developing the feature, I'm like, I'm not a ZFS expert. Let me go ask someone who is. So I'd asked Robert, like, what should I do if I'm writing these images?

John Gallagher: 15:42

And then I'm gonna immediately, like the very next thing that happens is literally like the sled gets powered off. And Robert gave me a list of things to do, like fsyncing the drive, like flushing the disk cache via an ioctl, which I never would have thought to look for on my own. This is why we have people around like Robert that I can ask, which is awesome. But I went back and double checked and like, all of that is there. Like I did all the stuff Robert said to do.
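
For readers unfamiliar with the first item on Robert's list, here is a minimal sketch of a write that is pushed out of process and OS buffers before returning. It is written in Python for illustration and uses a hypothetical helper name, `write_durably`. The disk write-cache flush John mentions is a device-specific ioctl and is deliberately not shown, since the exact call is not given in the episode:

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the storage device.

    Without the fsync, a successful write() only means the data reached
    the OS's cache; an immediate power-off can lose it.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel to flush to the device
    # Note: a drive may still hold the data in its own volatile write
    # cache. Flushing that is a separate, platform-specific step.
```

This matters most in exactly the situation described here, where the very next thing that happens after the write is a power-off.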

John Gallagher: 16:04

So I did the thing that Dave has said to me probably a half dozen times in the last two years, which is like, if we have a problem and we don't have enough information to fix it, what other information do we wish we had? So like go add extra logging or details or whatever. So I opened a pull request and I was like, okay, Installinator gets this package of data from Wicket and then it writes it to disk. So there's the first time that we get it from a peer, and there's sort of like the last time, when we've written it and we're gonna claim success. I'll put checks at both ends of that, right?

John Gallagher: 16:39

So after we fetch data from Wicket, I'll just double check the SHA-256 hash, right? Like we ask for it by hash, so we're supposed to get a payload that exactly matches that hash. That's what Wicket is supposed to give us. So I put a check in at the very beginning, after we download the thing. And then at the end, after we write it to disk, I just go back and reread the disk and compute the hash again and make sure it matches the thing that we already checked that we pulled from Wicket, which, again, like, this whole process is based on SHA-256 hashes matching.
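
The two checks John describes, one right after download and one after rereading what was written, can be sketched like this. It is a Python illustration under stated assumptions, not the code from the pull request: `HashMismatch`, `verify`, and `install_artifact` are hypothetical names, and the destination is an ordinary file rather than a boot partition:

```python
import hashlib
import os


class HashMismatch(Exception):
    """Hypothetical error type: data did not match the requested hash."""


def verify(stage: str, data: bytes, expected_sha256: str) -> None:
    """Raise HashMismatch, naming the stage, if data has the wrong hash."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise HashMismatch(f"{stage}: expected {expected_sha256}, got {actual}")


def install_artifact(downloaded: bytes, expected_sha256: str, dest: str) -> None:
    # Check 1: is what we received the thing we asked for by hash?
    verify("after download", downloaded, expected_sha256)

    with open(dest, "wb") as f:
        f.write(downloaded)
        f.flush()
        os.fsync(f.fileno())

    # Check 2: does what we read back match what we wrote?
    with open(dest, "rb") as f:
        verify("after write", f.read(), expected_sha256)
```

The point of naming the stage in the error is diagnostic: the fix does not repair anything, but the next failure says which half of the pipeline to look at.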

Bryan Cantrill: 17:11

Yeah. So with the assumption that we will catch this corruption: we wrote something that we thought was good, then we read it back. I mean, my assumption is when we read it back, it's gonna be corrupt. But I do think, just to put a really fine point on this, because one of the most important lessons from this whole adventure starts kinda here: we did not have enough information to debug the problem. We had a corrupt file system, which is very troubling.

Bryan Cantrill: 17:40

We we had corrupt data. And too often, you kind of one can create for oneself a false dichotomy of I'm going to root cause it or I'm going to close it as I I wish this were not a bug. And there is actually a very important essential in this case and in many cases, middle ground, which is what information do I wish I had that I don't have? And how can I take a half measure? It's not satisfying because we're not fixing the problem.

Bryan Cantrill: 18:11

We're modifying the software without fixing the problem. But we are now going to hope that this bifurcates the problem space in some way. That we are now going to take an implicit failure and make it an explicit failure, or we're gonna have an assertion failure trip. We're gonna in some way, we're going to when we see this again, we're going to have more information. I think this is like one of the most important things about this.

Bryan Cantrill: 18:33

Sorry, John. Not to again, not to put too sharp a point on it, but this is really essential.

John Gallagher: 18:37

No, it's great. Like one of the one of the best things to me about working at Oxide is that when something goes wrong, especially if it's nondeterministic, like the sled thing that Sean ran into, right? He had updated like, I don't know, a dozen sleds. Two of them failed with this error. And when he re updated them, everything was fine.

John Gallagher: 18:55

And it's sort of tempting to just let those things go, right? Like, oh, if it's like a problem, but you just like try it again and everything's fine, just let it go. Like, I've certainly had coworkers in the past where like, that's enough to move on. Right?

Bryan Cantrill: 19:08

I wait. We can just I don't know what you're talking about that other thing that happened. Like, let's just unsee that.

John Gallagher: 19:13

Exactly. Like, it's it's just we hit some weird edge case. I'm sure it's fine. Right? But no one at Oxide has this attitude, which is fantastic.

John Gallagher: 19:20

Right? It's like, no, no, no, that should never happen.

Bryan Cantrill: 19:22

If it

John Gallagher: 19:22

did, we need to figure out why that happened. Right? Yeah. Yeah. Okay.

John Gallagher: 19:27

So that led to this PR. I'll just link to the PR. So in the initial version, I just put the checks in. And in PR review, Rain said, hey, if we try to fetch the thing from Wicket and fail, should we retry? And I said no.

John Gallagher: 19:46

And I put a comment, a block comment in the code. The PR is very short. I basically put my reasoning for why we should not retry. And that is that Wicket has unpacked this TUF repository that has its own, like, cryptographic checks on the contents. It's now holding all of those contents in memory.

John Gallagher: 20:06

We have been told the exact SHA-256 of a thing we're supposed to fetch from it. And we ask for it by that SHA-256. We're like, hey, Wicket, give me the OS image with this SHA-256. And it streams it to us over the Bootstrap network, right? At this point, we're not going over the management network and the UART, right?

John Gallagher: 20:23

This is just regular old TCP between like two closely connected computers going from its memory to my memory, and we've addressed the thing by hash. Like any way that this could go wrong is like super disturbing. Right? So I put this check in basically because I was afraid that Wicket was like serving the wrong artifact, right? Like our TUF repository has multiple OS images in it.

John Gallagher: 20:46

Maybe somehow, even though we're indexing by hash, maybe it just screwed up and unpacked the wrong thing and put it in the wrong place, whatever, right? Like that's my expectation is that if this one fails, it's because there's actually a bug in Wicket. It's giving the sled the wrong data in some nondeterministic case or something.

Bryan Cantrill: 21:02

Right. Also, it's great that you're thinking about like the failure modes in your own software. I mean, my mind is just running wild. I mean, the concern I mean, we built our own computer. Right?

Bryan Cantrill: 21:13

We trained these DIMMs. Like, this is

Adam Leventhal: 21:16

feel like anything could be. Like, spin the wheel. Right? Could be anything.

Bryan Cantrill: 21:19

Spin the wheel. And in particular, like, you know, it is ECC memory, and we hadn't had problems with memory corruption yet. But that is, like, where you get really, really terrifying, where we have something that is deeply wrong in the hardware or the lowest level of the system. Anyway, I'm so glad you did this because I'm pretty much braced, though, for this being something in the IO path that is corrupting this on the way down. It's kind of my thinking.

Adam Leventhal: 21:52

And, John, this is all TCP. Right? This isn't UDP?

John Gallagher: 21:55

Yes. So the management network runs exclusively on UDP. So, like, streaming this recovery Installinator image goes over UDP and then over UART. But at this point, Installinator is running on the host. It's brought up the real networking stack and it has found Wicket and is just talking TCP to Wicket to download this stuff.

Rain Paharia: 22:15

HTTP, TCP, like all

John Gallagher: 22:17

of Yeah, that's right. Yeah. So I put these checks in, I put the one in at the front and the one at the end, and I'm expecting the one at the end to fail, right? Sort of like, Bryan, you said earlier, I'm expecting we've got some kind of, you know, file system corruption. We wrote a thing, something got screwed up along the way, and then we read it back and the data's missing or has corruption or whatever.

John Gallagher: 22:39

So I merged this and then I, like my part in the story basically ends because I literally went on vacation for the very next week. So, I will I will hand it over to someone who was around that next week, I think.

Bryan Cantrill: 22:50

Your part of the story takes a hiatus, John. It does not end. I don't know how much of the footage, the security footage of us debugging, you ended up watching. But, John, you were very active in what we came to know as data corruption week, I'd like to say. So we saved the good bits for when you returned from vacation.

Bryan Cantrill: 23:13

But so so you disappear on vacation. And yeah. Then I guess, Rain, I I guess you're up next. Right? Because now John has disappeared, and we this thing that we assumed was happening very rarely, as it turns out, not happening as rarely as we thought.

Rain Paharia: 23:31

Yeah. I don't remember the exact statistics, but it would, like, happen to at least, I remember, like two or three sleds every time you did a recovery of, you know, the full rack. Like you would encounter two or three sleds which had, you know, some kind of data corruption going on. And, you know, the nice thing about what John added was that we'd kind of get an error message in Wicket. So we would get a message a bit earlier, right?

Rain Paharia: 24:00

So, you know, John didn't fix it, right? But what John did add was that you get an error message inside Wicket while, you know, Installinator is running, as opposed to getting an error message all the way at the end. So I think at this point, Bryan, did you pull me in? I think somebody was like, hey, you know, Installinator isn't downloading the right data. And I think at that point, you know, I was like, okay, I think I should, like, take a look at this.

Rain Paharia: 24:36

Bryan Cantrill: 24:36

So which itself was a major surprise. I think the assumption that more or less we had all had is that there was something happening between Installinator getting this data and actually getting it to stable storage. But Installinator's like, when it's getting the data, it's wrong. And we're like,

Rain Paharia: 25:01

Yeah. Okay. Yeah. So, right. So John added two separate checks for hashes.

Rain Paharia: 25:05

It was actually the first check that failed, not the second one. And yeah, I mean, it was, I think, just, you know, incredibly surprising. And what I remember is that there were definitely a lot of calls, but I think, you know, one of the first things is, like, could this be something inside Tokio or something inside Installinator?

Bryan Cantrill: 25:33

Because I think if you just say that, like, hey. There's corruption happening.

Rain Paharia: 25:37

Right.

Bryan Cantrill: 25:38

Like, when you've received this, it's corrupt. And our thought is like, alright. Well, we know that this image is a good image because other nodes are able to get this. So it's like, right, there's gotta be something in the... and I don't know why my thought goes, like, super dark on the ZFS path and less dark on the networking path.

Bryan Cantrill: 26:00

But my thought is like, okay. There's gotta be something in userland, in Tokio, as you're saying, Rain. There's gotta be something in there that is somehow corrupting data, which definitely seems strange. But I don't know.

Rain Paharia: 26:11

I think, you know, one of the things we decided to do pretty early is that we decided to add some instrumentation to Tokio to try and rule out Tokio as much as possible. Bryan, did you wanna talk about, like, some of the DTrace probes and stuff that we added?

Bryan Cantrill: 26:27

Yeah. Did we add that as part of this? I got I kind of and this is where it gets a bit murky for me too. Because, Rain, at some point

Rain Paharia: 26:36

Yeah.

Bryan Cantrill: 26:36

You're able to separate Installinator from this kind of difficult environment that it's running in. Right. Or one that's a little harder to debug, and you're able to get this just kind of reproducible on a Gimlet. And that was a major lurch towards getting this thing understandable.

Rain Paharia: 26:57

Yeah. Yeah. So so, you know, like, Installinator is this automated process, right? Like it's part of this bigger process and it just kind of is this autonomous thing that runs. So I think, you know, one of the first things we did was that we can actually run Installinator as like an independent binary.

Rain Paharia: 27:16

And so, one of the first things we did was, okay, all right, we have a server running wicketd on another host. So, we're getting something like the production use case that is failing. But then over here, we're logged in over SSH and we run Installinator by hand over SSH doing the same things that the autonomous Installinator would be doing. And we get to see these errors happen even with that. And like, you know, you run in a loop, like sometimes it works and sometimes it fails.

Rain Paharia: 27:55

It was failing like I think 10 to 15% of the time or so, which is Right. Know, is Which

Bryan Cantrill: 28:03

is a very, I mean, a very tough number because, on the one hand, it means you 100% do not have a product, and yet it is actually working most of the time. Like, whatever is happening is nondeterministic. And then, Rain, sometime in here, and I don't think I'm getting too ahead of things, but stop me if I am. We're like, okay. We know that there is no, I think, run of zeros in the actual TUF binary that we're getting.

Bryan Cantrill: 28:42

So we begin to modify the code to be like and and we know that, like, one of the the the we determined that, like, when we were seeing corruption, we're seeing the wrong thing. One of the manifestations of this, which is not like not all of them, but one of the manifestations is we were expecting to see data and we got zeros.

Rain Paharia: 29:01

Yeah. And and I think the way we figured that out is that we told Installinator to write So rather than to write it out to M.2 as an image, we write it out to a file on disk, right? And then we SCP the file over and then we did like a hex diff of the contents of what we actually want to see and what we were actually seeing. And like in all the cases we saw that there was like some weird run of zeros inside the image that Installinator was actually fetching.

Rain Paharia: 29:38

And, you know, we knew that the actual artifact because it's a gzipped tarball did not have any such run of zeros. So, you know, one of the things I did was that I think I went in and tweaked Tokio source code because, you know, at this point it's like, okay, you know, is the bug in Tokio? Is the bug in Installinator? Is the bug in like Rust async somewhere? Is the bug in like, you know, the network stack?

Rain Paharia: 30:00

Where is it? And so one of the things I ended up doing as part of this was that I went into Tokio source code in like the net library and I wrote a little loop, was like, you know, like if I think it was more than 32 bytes and like if the whole like thing is like 32 zeros or more, then just panic, right? And, you know, like seeing that and like actually being able to reproduce that was I think a major step because what that did was that because this was deep inside Tokio's internals, so this was data that we had just fetched off the wire, right? Right. Doesn't have any Tokio machinery.

Rain Paharia: 30:45

Like we've stripped it all down to literally like receive some data into a buffer and read that buffer and see if that buffer has a lot of zeros. And if it has a lot of zeros, then panic.
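
The check Rain describes can be sketched roughly as follows. This is a minimal illustration, not the actual patch to Tokio: the function names `find_zero_run` and `check_received` are hypothetical, and the threshold of 32 is taken from Rain's recollection ("32 zeros or more"). The idea is simply to scan bytes that have just come off the wire and panic if they contain a long run of zeros, printing the offending bytes first.

```rust
/// Threshold as recalled in the discussion; the constant in the real patch is not known.
const ZERO_RUN_THRESHOLD: usize = 32;

/// Returns (offset, length) of the first run of at least `threshold`
/// consecutive zero bytes in `buf`, or None if there is no such run.
fn find_zero_run(buf: &[u8], threshold: usize) -> Option<(usize, usize)> {
    let mut start = 0;
    let mut len = 0;
    for (i, &b) in buf.iter().enumerate() {
        if b == 0 {
            if len == 0 {
                start = i;
            }
            len += 1;
        } else {
            if len >= threshold {
                return Some((start, len));
            }
            len = 0;
        }
    }
    // A run may extend to the very end of the buffer.
    if len >= threshold {
        Some((start, len))
    } else {
        None
    }
}

/// Hypothetical hook: call on bytes immediately after they are read from the socket.
fn check_received(buf: &[u8]) {
    if let Some((offset, len)) = find_zero_run(buf, ZERO_RUN_THRESHOLD) {
        // Print what the suspicious bytes look like right now, then stop.
        eprintln!(
            "zero run at offset {offset} (len {len}): {:02x?}",
            &buf[offset..offset + len]
        );
        panic!("detected a run of {len} zeros at offset {offset}");
    }
}
```

Because the artifact being fetched is compressed, a legitimate buffer should essentially never trip this check, which is what makes a panic here meaningful.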

Bryan Cantrill: 30:58

Then panic. Part of the reason that you were pulling this closer and closer to pulling off the wire is because the software that we were running through to go from between the wire and actually Installinator, like, actually doing something with it, you I know because I I know you had written some or modified some of the software in there. Because I just remember at some point you being like, I like, how like, I I did I I did property based testing for this. Like, I don't see how this could be broken. Like, you were genuinely, like, frustrated at the gods of, like, I accept that this is a bug in my code.

Bryan Cantrill: 31:32

And I, like, I and I just know that feeling where, like, I'm beginning to break down because, like, I I think this code is correct, but it's obviously not correct. And so you're just, like, chipping away at it to be like, okay. It's gotta be in my code. Gotta be in this code. And then it gets to point where it's like, okay.

Bryan Cantrill: 31:46

Wait a minute. Now I'm like, before my codes my code hasn't executed yet. Yeah. And we're getting a run of zeros. And we

John Gallagher: 31:54

I I should add something here. I think this this was this was sort of important that that we put the checks on so earlier I said we pull two artifacts over the network. We pull in the OS image and we pull in this gzip tarball of all the control plane artifacts. We were seeing the control plane artifact fail at a much higher frequency than the OS image fail, even though it's not even it's not different code paths at all. It's just like the same code path with two different hashes.

John Gallagher: 32:20

But once we figured out the zero thing, the host OS image is not compressed at the point at which we're downloading it, and it actually does have long runs of zeros. So if we were seeing corruption that zeroed out buffers as they were coming in, like a lot of the time as we're pulling the OS image, that's actually fine because the real data was also all zeros. But because we were because we were also checking this gzip tarball control plane artifacts, that one failed much more often.

Bryan Cantrill: 32:49

Yeah. Right. Right. Well, so we get to the the point where, Rain, you write code that says, if we've got a run of zeros here, we're gonna panic. Yep.

Bryan Cantrill: 33:04

Right. And then at some point, we decide to also, remember, like, one of the moments when you really begin to feel the fabric of reality start to tear is you wrote code saying, if we've got a run of zeros here, we're gonna panic. And then we're gonna eprintln like the actual buffer. Yep. And then panic.

Bryan Cantrill: 33:29

Yep. And what we were seeing was that we would hit that code path and just say we detected a run of zeros. But then the eprintln would print the correct non corrupted data. And Yeah.

Adam Leventhal: 33:45

It was super crazy. This is about when I The the the the this

Bryan Cantrill: 33:48

when when we yeah. Yeah. This is when we begin to, like, slip. Just like slip. And you begin to realize, like, oh oh god.

Adam Leventhal: 34:00

And and I know you just said it, but to reiterate, the the code was literally, like, if zero, print value zero.

Bryan Cantrill: 34:07

Print the value.

Adam Leventhal: 34:08

Print the value, which like, you're like, hey, dummy. You just said it was zero. Like, why are you printing it? It's gotta be zero and it's not zero.

Bryan Cantrill: 34:16

Actually the it's good data now. Now it's good. Now it's been uncorrupted. It is it was corrupted. Now I have decorrupted it back to its original state.

Adam Leventhal: 34:26

And I just remember single stepping through that, and we're like, yep. We see it. We see it. We see it. Not a bunch of zeros.

Adam Leventhal: 34:31

It's not wait a minute. Why is it not a bunch of zeros? Like, we just checked that it was a bunch of zeros, and it was a second ago. It was just, you're right, tearing at the the fabric of reality.

Bryan Cantrill: 34:41

Fabric of reality. Okay. So the and the and actually and maybe it was before we got the eprintln. Because then the other thing that was very odd is like, okay. So we've got, like, why are we executing this conditional incorrectly.

Bryan Cantrill: 34:54

We're using DTrace a lot heavily, obviously. And in particular and also I would say that, you know, in in our strangest reference to a previous episode, Adam, We in all of this, we record an episode of Oxide and Friends on manufacturing the first rack.

Adam Leventhal: 35:11

That's right. That's right. Right. Like like, you have to leave the call and like

Bryan Cantrill: 35:16

We we gotta like we gotta take a quick break from data corruption week to go talk about manufacturing the rack and pretend like we still have a product, because by the way, I'm pretty sure we don't right now because all it does is corrupt data some fraction of the time. So let's just put on our bravest possible face. Let's just, you know, splash some cold water over those tears and, let's do it, Steve. Let's let's record the podcast everybody. So, I don't know.

Bryan Cantrill: 35:42

Have to go relisten to that podcast knowing that I was a part of me was like actually terrified that we had no company. They so that ends. Great podcast. Thank you for all the operational operations folks that, Eric Anderson, a bunch of other folks that that joined us for that. And then we get back on to continue to debug.

Bryan Cantrill: 35:59

And we just we just go further and further down the rabbit hole of just, like, nothing. Everything is weird. And, Adam, did you really listen to any of that, by the way? Because we get to the point where we're using DTrace all of the time to try to understand this. We are so one of the things that we're doing is like, okay.

Bryan Cantrill: 36:22

So there is some like, are we seeing like, we're we're snooping the wrong line. We got cache corruption. We've we've got if there's something, some CPU structure is somehow bit is incorrect. So what we would do is we would, stop the process after it hit that conditional. And then we would, we would stop it for some period of time and resume it.

Bryan Cantrill: 36:47

And it would then the eprintln would print the correct data. And you're like, what? And then you would go in and you would inspect it. If you would if you inspected the if you took a dump at that point when you hit that, you know, if it's zero, if you actually took an actual if the system aborted and panicked, you would see one thing. If you went in with tracemem, you would see something else.

Bryan Cantrill: 37:11

So it's like the we we begin so tracemem is a is a a a DTrace action that traces memory, and we begin to think that actually tracemem itself is wrong. Yeah. And the we like, maybe that's the problem. And then we have actually like two bugs here. We've got like a data corruption bug, and then we've got like a tracemem bug where it's showing that which it just nothing made sense in there.

Bryan Cantrill: 37:38

It just absolutely nothing made sense. And we are trying we're finding, like, okay. So what what what if we actually, like, when we hit this when we hit the corruption because the thing is, like, once you started hitting the corruption, you would continue to hit the corruption, which is very odd. If you didn't hit the corruption, you wouldn't hit the corruption. Once you started seeing it, you would continue to see it.

Bryan Cantrill: 38:00

And the the so that was kind of a a a strange artifact. We did the experiment of like, okay. When you hit when you hit this conditional, I detected the corruption. And the so instead of panicking, we continue to operate. Let's stop the process.

Bryan Cantrill: 38:18

Let's move it onto a a CPU that it's never been on before. So we're actually gonna create a processor set with a couple of cores over here. They've never run this process before. This process has started seeing the corruption, and it started kind of flickering with the with the with this data is, like, flickering between corrupt and and not corrupt. We'll move the process over to the CPUs that it's never executed on.

Bryan Cantrill: 38:42

And I'm like, okay, this is gonna be then the flickering will go away because there's some there there is some some state, some machine state that is incorrect. Yeah. And then but you would move it to CPUs it's never seen, and it would continue to flicker. And you're like, what's going on? What's my

Adam Leventhal: 38:59

And and if and if some of this feels like what was the hypothesis there? I think we have departed from, like, scientific method to sort of what if we randomly perturb the system and then see what happens.

Bryan Cantrill: 39:14

Okay. Because Actually, we are the the this is true. And so I think that we the the the the methodology I mean, we're definitely bewildered. Yes. I think we are and this so the an an adage of mine is that bugs can be psychotic or nonreproducible, but not both.

Bryan Cantrill: 39:33

And when I say psychotic, I don't mean just like difficult. I mean, like, when I say psychotic, I mean ripping at at the the fabric of reality. The fabric of reality that is created by the computer, by the operating system, it creates these abstractions that we view as kind of as bedrock abstractions. And when those start when when those start to break, that's a psychotic bug. You know, when you are when you have a thread that is executing on two CPUs at the same time, it's like, it can't be.

Adam Leventhal: 40:06

It's like Right.

Bryan Cantrill: 40:07

Well, no. It it I know no. It definitely shouldn't. That's right. But it's like like, you know, these and the abstraction is designed to prevent that.

Bryan Cantrill: 40:16

And if that does happen, it you begin to have you and Adam, we we used to call these can't happen panics. Right. You have a you have state in the program that, like, this state can't happen. Like, the the the program is it is not the the program's a victim here. And the the perpetrator is much deeper in the stack of reality.

Bryan Cantrill: 40:38

And those bugs don't have like, those bugs have to be reproducible. If those bugs are nonreproducible, you won't debug them. So you when you have one of these things that you just see once or is nonreproducible, the actual act needs to be just like, John, your first act of, you know, of of just, like, checking this stuff at different phases, see if we can get this to fail in a different way, see if we can get to be more reproducible. Oh, good news. It's happening all the time.

Bryan Cantrill: 41:05

Oh, okay. Great. Then, you know, being able to so we're trying to, like, tighten the cordon around this thing. Yeah. But as we're trying to tighten the the cordon, it's like, it keeps, like, spilling out and, like, showing up another it's just like, the fuck is going on?

Bryan Cantrill: 41:21

And at some point in this, Adam, you and Dave are like and this is I mean, know you you you I couldn't find the exact line. I know it's said there's, like, whatever, you know, ten hours of video of us Yeah.

Adam Leventhal: 41:32

I just found it. Yeah.

Bryan Cantrill: 41:33

Oh, did you find it? Yeah. Yeah.

Adam Leventhal: 41:35

Yeah. I think I think, you know, Dave and I I I I think we had, as as it's becoming clear, many hypotheses, like none of them satisfying in in any way. And I think this is this is before we stumbled onto the phraseology of, like, haunted virtual memory or anything like that. But Dave and I were wondering if this was a virtual memory bug of some description. And, like and and to be clear, the thought was exactly as deep as that, which should say not particularly deep at all.

Bryan Cantrill: 42:07

Not particularly deep. Yes. Not particularly deep. And you were like, we think it's a virtual memory bug. And I'm like

Adam Leventhal: 42:13

You're like How? You're like, what? Okay. You can create a bug with your mind. Like, describe the bug.

Adam Leventhal: 42:20

And and I mean, which is totally reasonable if a bit acerbic. You know? But but, like, the idea of being like, what? Okay. Like, let's say there's a VM bug.

Adam Leventhal: 42:30

Like, just describe anything about this VM bug. And, yeah, we were we were not we did not have a much more satisfying answer than that.

John Gallagher: 42:39

This sort of gets to what you said earlier about, like, you gave me credit for assuming there was a problem in my code. Right? Like, as soon as we got to the point where Rain injects the, like, if it's all zeros, print it, and it's not zeros anymore, that killed, like, every reasonable hypothesis that we had from that point moving forward. Right? Somebody would throw out a hypothesis and I'm like, that that can't be it because that that can't possibly explain what we're seeing there.

Bryan Cantrill: 43:02

That's right. That's right. And I think and and now, I mean, this is the other important thing is that when you begin you know, you're kind of trying to get this cordon around this problem and you're beginning to get a lot of the symptoms and the symptoms are very, very bizarre. And now anything has to explain I mean, you need to be able to explain all this stuff. And, you know, I think that one of the pathologies, I don't think we see it at Oxide, but I think one sees in the broader world, is where people find a bug that's not the bug.

Bryan Cantrill: 43:27

And, you know, in in this case, like, it's not even if you were to find another data corruption in the program, you'd be like, well, okay. That's great. Like, that's a bug. But that does not that can't be this bug because this bug has got these very, very odd symptoms of this this kind of flickering on and off of going oscillating between our actual correct data. This is what is so bizarre about this thing, which was unlike honestly anything I'd ever seen, where it's like you're seeing the correct data, and then it's like, nope.

Bryan Cantrill: 43:59

Now you're seeing like just wildly wrong data, like not even close to correct data. And then, oh, how there's your correct data again. It's back. It's like Yeah. What the fuck?

Bryan Cantrill: 44:06

And so you're like, you know, it's a VM bug. It's like, oh, the okay. Walk me through it. Like, the I mean, because, like, the the the I mean, it the the VM system is not a magician. It's not, like, responsible for it's not the memory system.

Bryan Cantrill: 44:21

It's it's the VM system. Oh, what it does is, like, it establishes mappings. It loads those into hardware translation, and then it gets out of the way. It isn't like, oh, let me just go, like, create a thread that, like, fucks everyone's page tables. Let me go do that.

Bryan Cantrill: 44:33

It's like and actually, to make it really funny, I'll oscillate between, like, memory I make up and their actual true memory. It's like, it doesn't do that. It does I mean, I guess you could envision a system that did, but, that ends I I think the closest I got to coherent close yeah. I think the closest

Adam Leventhal: 44:47

I got to coherent on that was like, what if there's some sort of page aliasing? Right? Like, the this, you know, physical page is like mapped in two different locations, and you're sort of fighting over it. But of course, that does not explain why why like, it explains corruption. Doesn't explain why, like, you would pull a rabbit out the hat and suddenly it would be the right data again.

Adam Leventhal: 45:07

So That's right. You know? Well

Bryan Cantrill: 45:10

and I think this so then this is where you do get to the thing. So, like, there are and we don't you don't have these issues on on our on and on this particular microprocessor. But there are if you have a virtually indexed, physically tagged cache, you can end up with aliases in that cache, right, where you can have and you can have the same you can and there are things that you need to go do to make sure that you don't actually have you can get data corruption if you don't manage those things properly. But the problem is, like, one, like, a VAC alias does not symptomatically resemble this at all. Two, when we move that process to a different CPU and it continued to flicker and if you take one that's, like, not flickering and you move that to a new CPU, it continues to not flicker.

Bryan Cantrill: 45:56

It's like, what? I I no. I just like again, this just makes nothing makes sense.

Adam Leventhal: 46:05

So we were starting to think we had it reproducible on, you know, pretty pretty readily. Right? Like, we we I think at this point Which itself is solace, actually.

Bryan Cantrill: 46:14

Yeah. Absolutely. Not not just solace from because now, like, my like, it's not just solace from the perspective of, okay, this is, we we do have a cordon around this thing, like, we're be but it's also, like, the another thing that is so strange about this is this is like so reproducible over here. If we had something again like my fear of like, you've got a DIMM issue where you've got like corruption in like the memory system of the the where we have, like, an SI issue. Right?

Bryan Cantrill: 46:43

It's like, it would not be this reproducible. Right. Or or if we had it's like, what's going why is this and the fact that it was, like, so strangely reproducible was solace to me that we were not and we're not seeing, like, core dumps all over the place. Right? Which is what we would see, you know, if you really screw up.

Bryan Cantrill: 47:00

If you have truly truly rampant data corruption, like, you will see this is when ls dumps core. You know? Right. And these are these are not good days. Exactly.

Adam Leventhal: 47:12

So so so we had reproducible. We but we had a we still had lots of theories. Like, we thought k p t so KPTI was potentially involved. We thought may as you said, maybe tracemem was lying to us, although we couldn't really get our heads around that one particularly.

Bryan Cantrill: 47:30

Well, and then you know, was like, okay, tracemem's lying to us. Like, I'm gonna I one point, you're like, I'm gonna go look at the code for tracemem.

Adam Leventhal: 47:36

And I'm what All six lines of it are whatever.

Bryan Cantrill: 47:39

I'm like, I don't know what you think you're gonna find. But like, Trey oh, yeah. Because I mean, it's it's tracing memory. It's really not like you're gonna find what you it's like, you're like, this is like six lines. I don't see how this like, Like, oh, yeah.

Bryan Cantrill: 47:52

Sorry. Did you miss the flickering code I put in the tracemem where I give you the wrong answers some fraction of the time where I roll the die and give the wrong answer? Yeah. Sorry.

Adam Leventhal: 48:00

Yeah. Was that a mistake? Should I not have done that?

Bryan Cantrill: 48:02

Exactly. And

Adam Leventhal: 48:05

and we also we still had not let go of the fact that or or the thought that it might just be I think we described it as sort of like a pedestrian user land bug. Like, basically, we've got multiple threads in this process that are somehow conspiring

Bryan Cantrill: 48:19

to fucking sell. I am gone. At the by the time we are flickering, I'm like, this is not like, this process is a 100% victim. I don't understand how yet. I just I just watched

Adam Leventhal: 48:30

you and Keith having a chat about that. We're we're we're, like, wrapping up for the day, and we're like, yeah. That's that's a real possibility. Maybe you're just maybe you're just

Rain Paharia: 48:37

I think one of the things we did see looking at some of this history was that, I think if we had pinned it to one specific CPU, we wouldn't be able to reproduce it. But then if you pin it to like, say eight out of the CPUs, you would be able to reproduce it. So I think that was that was like, you know, one of the concerns about that. But but, I mean, I don't I personally like yeah. I I I think I don't see how you get the flickering from there.

Rain Paharia: 49:05

Yeah. Like We we also

John Gallagher: 49:07

like, the this this thing that would pretty reliably reproduce did not reproduce on any other system. Like, we we ran it under Linux. We ran it under Mac OS and never saw it. Right?

Adam Leventhal: 49:16

In particular, not on illumos on stock here. This was just Yes. On the on the homegrown stuff.

John Gallagher: 49:22

That's right.

Adam Leventhal: 49:22

Which made it even more terrifying, of course.

Bryan Cantrill: 49:25

Yeah. Exactly. Which is yeah. I'm glad you find that relaxing, John. That definitely caused me shortness of breath.

Bryan Cantrill: 49:29

The the but, no, it it did make it harder. And I think, Adam, you're right that in that we I'm like, if you suppose that we have a there's a bug in tracemem and there's a bug in the program. So the the the flicker is somehow being caused by tracemem and the truth but then you're like, but why is the the program is actually detecting the corruption? Also, we're here because the program detected corruption.

Adam Leventhal: 49:53

That's right. And we just did a if zero print value, and it's not zeros. Right.

Bryan Cantrill: 49:57

It's not zeros.

Adam Leventhal: 49:59

Anyway, I I think it's safe to say we did not have a a solid hypothesis.

Bryan Cantrill: 50:03

Though and then I also love the fact that you could stop the process for like a minute and walk away, then resume it, and it would start flickering again. Yeah. So you could stop the process for a minute, just move it on to a CPU it's never been on, and it would flicker. And you're like, what is going on? Alright.

Bryan Cantrill: 50:18

So that is like but again, I think we we end that night being like, we're gonna get this thing. And whatever it is, it it this is super weird, but it is, it it the the fact that it's so isolated is actually and also, you know, the other thing that is actually that I took solace in, the operating system is not dying. Because the operating system itself is a stress test of many things. The operating system itself really cannot tolerate, like, having its memory corrupted.

Adam Leventhal: 50:44

Yeah. It's pretty sensitive to that really.

Bryan Cantrill: 50:46

Do you remember? One of our many dumb lunch conversations at Fishworks, we're like, if we start corrupting memory, how much memory can we just go corrupt before the operating system panics? And then and then we like we get started corrupting memory and of course, we didn't really have the patience to do this. And so we just like started corrupting memory randomly, nothing panics. We get a little bit bored.

Bryan Cantrill: 51:11

Like, okay, let's do something different. Let's just actually like let's change mutex_enter to like a no-op.

Adam Leventhal: 51:16

I do remember that. Yes.

Bryan Cantrill: 51:18

And then let's see how long it then I and then like you get like you write the no-op and like the system immediately blows up on the interrupt. It really does it it does not it

Adam Leventhal: 51:27

does not last. Mutual exclusion, like the the operating system takes pretty seriously.

Bryan Cantrill: 51:31

Pretty okay. Pretty adamant about mutual exclusion. Yeah. Exactly. Yeah.

Bryan Cantrill: 51:37

Exactly. But, so the fact that the operating system is not dying is also like this is a big program. It runs all the time. It very much relies on the the correctness of the hardware at the least. But goddamn, this is weird, whatever this is.

Bryan Cantrill: 51:52

So then we get to the next day. And, Adam, this is when and I I I'm I'm sure you went to to rewatch the Yeah. Us getting back together that next morning.

Adam Leventhal: 52:02

Yeah. I and we did and and not initially, but like, I think everyone everyone we had pulled in more folks at this point. We were trying lots of different theories on lots of different systems.

Bryan Cantrill: 52:12

John is definitely back from vacation because, John, I I I was Yeah. You were I don't know if you yeah. Well, John's like

John Gallagher: 52:20

Look. I got back from vacation, and Bryan is like, so while you were gone, we saw this comment that you wrote. Right? Which is like this the one that I put in there, why we should never retry this because the only possible explanation is something horrible has gone wrong. And Bryan basically said, hey, by the way, something horrible is going on.

John Gallagher: 52:36

Right?

Bryan Cantrill: 52:37

Oh my gosh. I forgot that. Yeah. Yeah. Well, welcome back from vacation.

Bryan Cantrill: 52:42

It's all broken. Unfortunately, you thought we but the so we're yeah. We're all kinda in there.

Adam Leventhal: 52:47

Yeah. And I I feel like, you know, there's this this aphorism that, like, all systems problems are either kernel problems or compiler problems. And I feel like in the back of my mind, I was trying to turn this into a spreadsheet problem. So I was, like, just doing sort of random analysis on the virtual addresses. And the weird thing that I kicked over was, like, all of the virtual addresses of the buffers that we were that were corrupted were really highly clustered.

Adam Leventhal: 53:15

In fact, we had this a heap that was almost, like, I think, like, two gigabytes in size, and all of the addresses came from a range that was, like, 600 k or something in size. So it's just a teeny fraction of the heap. And I had no idea what to make of that. I almost hesitated to bring it up just because it it was such a, like, I mean, clearly meaningful, but also like, yeah. So what?

Adam Leventhal: 53:42

Like, I have no idea what this could possibly mean.
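
The kind of "spreadsheet analysis" Adam describes can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `address_spread` and `same_aligned_region` are hypothetical, and the idea of testing for a naturally aligned region is an assumption layered on top of the later remark that it "would have been a two meg range." Given the virtual addresses of the buffers that showed corruption, the sketch reports how wide a range they span and whether they all fall inside one aligned region of a given size.

```rust
/// Lowest address, highest address, and the distance between them.
#[derive(Debug, PartialEq)]
struct Spread {
    min: u64,
    max: u64,
    span: u64,
}

/// Summarizes how tightly a set of addresses is clustered.
/// Returns None for an empty set.
fn address_spread(addrs: &[u64]) -> Option<Spread> {
    let min = *addrs.iter().min()?;
    let max = *addrs.iter().max()?;
    Some(Spread { min, max, span: max - min })
}

/// True if every address lies in the same naturally aligned region of
/// `align` bytes (for example, 2 MiB). Alignment here is an assumption.
fn same_aligned_region(addrs: &[u64], align: u64) -> bool {
    match addrs.split_first() {
        None => true,
        Some((first, rest)) => rest.iter().all(|a| a / align == first / align),
    }
}
```

Comparing `span` against the total heap size is what makes the clustering stand out: a few hundred kilobytes out of a heap measured in gigabytes is a tiny fraction.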

Bryan Cantrill: 53:45

Yeah. Yeah. And and my, my comment that this might be a VM bug was so, like, so poorly received that maybe I'll just keep my comments to myself.

Adam Leventhal: 53:52

My smart ideas to myself. Right?

Bryan Cantrill: 53:55

I was but no. So this I mean and you and I don't I I don't know. And actually, even relistening to it, I honestly can't tell if it's just like false modesty or if you genuinely don't know if it's relevant. But, you know, I think we get I I think you can just live in that kind of ambiguity.

Adam Leventhal: 54:09

That was definitely not false modesty. That was definitely like, I am a little embarrassed that I was just, like, looking at numbers. Like, I felt like I was doing astrology on virtual addresses.

Bryan Cantrill: 54:24

You know, these are all Sagittarius. Have you just used this? There's one that's a Virgo. Never mind.

Adam Leventhal: 54:30

Actually, Scratch that. Oh, what a Sagittarius thing to say.

Bryan Cantrill: 54:33

Yeah. Exactly. That's right. No. But I am like, that is definitely relevant.

Bryan Cantrill: 54:38

I think that is definitely relevant. And oh, okay. That and you look at it and it's like, okay. They're not like they're not so tightly clustered that they're I mean, this is part of what made this is like, it's not like they're in the same cache line by a long shot. They're within the same 100 k range.

Bryan Cantrill: 54:54

Turn turns out it was actually like it would have been a two meg range.

Adam Leventhal: 54:57

Right.

Bryan Cantrill: 54:58

Spoiler. Spoiler alert. But we but but 600 k when you've got a multi gigabyte heap is like, okay. That is that is definitely interesting. And this is like, okay.

Bryan Cantrill: 55:12

This is and we we started calling it the haunted VA region. Yes. And and it's like, we have a haunted VA region. And this is where so there are couple of things that are kind of interesting.

Adam Leventhal: 55:22

And to be clear, VA meaning virtual address.

Bryan Cantrill: 55:24

Virtual address. Yeah. And the thing so one thing that to me was really about that. First of all, the and we've said this in the past, but debugging is something that really scales very well, kind of surprisingly well, actually. Yeah.

Bryan Cantrill: 55:37

And and then it's also, like, really amenable to remote work. You know, the you know, we've talked about it, you know, in our our RTO or GTFO episode in terms of, like, the importance of being remote at Oxide. But the Us, the fact that we are and we're obviously recording it as well. But, like, the fact that you can have people be in the conversation at zero cost to the conversation. You know?

Bryan Cantrill: 56:02

And there's no whereas like in an office, you're like, okay. Like, now, like, you've 30 people gathering around, like, one person, which inevitably is gonna be like, do you really use Dvorak? You know, or something like that. You know I mean? Like, there's there's gonna be and you're gonna be like, you know, actually, there's a faster way to use that.

Bryan Cantrill: 56:16

And you just like it just and it feels like inevitably there's, like, a level of, like, backseat driving that's just in the monkey brain. Yeah. Well, there's I mean,

Adam Leventhal: 56:22

so much value yeah. I mean, it's just the recording it because then I remember Keith joined in, but late, but then he was able to catch up by, like, watching part of the recording. Yeah. I mean, the the the ergonomics of it, as you're saying, you know, while I'm watching you type into the shared screen, I can be off, you know, piping, you know, my my astrology, you know, results, you know, through through, you know, sed or whatever.

Bryan Cantrill: 56:44

Yeah. That's right. And I think and then you can also it it parallelizes very well because people can just go off and like, I'm just gonna kinda quietly go off and explore this idea that I've got. And I like, I'm kinda getting all of the data, but I'm not and there's no kind of drag induced from that. And then so once we are at that point of like, oh, wow, this VA region is interesting, then John and Dave both kind of in parallel and talking to one another, but begin to do work on different ways of reproducing this.

Bryan Cantrill: 57:11

I think Dave was actually writing maybe even a C program. John, you were doing this ballooning where you're like, okay, let's take this thing, and I'm gonna inject a balloon, and I'm gonna force its VA range. In other words, when it allocates in the heap, I want to be able to force it to allocate in a certain address range.

John Gallagher: 57:33

Yeah. I'd completely forgotten about that. I had just run across a commit. Maybe this is jumping ahead a little bit, but we ended up actually committing a variant of that balloon exercise to Installinator itself as a workaround until we could get this thing really fixed.

Bryan Cantrill: 57:47

Oh, wow.

John Gallagher: 57:48

Do you remember this? So this

Bryan Cantrill: 57:49

I actually forgot. I'd forgotten that. Wow.

John Gallagher: 57:51

So this commit still exists. Basically, we just allocated a Vec of two gigabytes on the heap at the very top of main as a way of sort of forcing us past the haunted VA region. And we merged this as soon as we realized that's what the problem was. It was, I don't know how long it's going to take to actually fix it, but we can just avoid this region completely by sticking a vector there that we never touch and guaranteeing everything else is above it.
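As a rough illustration of the balloon John describes, here is a minimal Rust sketch. It is not the actual Installinator commit: the names `BALLOON_BYTES` and `inflate_balloon` are hypothetical, and whether later allocations really land above the balloon depends on the allocator and operating system, which this sketch does not check.

```rust
use std::hint::black_box;

/// Size of the balloon mentioned in the conversation: two gigabytes.
/// (Hypothetical name; the real commit may differ.)
pub const BALLOON_BYTES: usize = 2 << 30;

/// Reserve `bytes` of heap address space without ever writing to it.
///
/// The returned Vec has to stay alive for as long as the region should
/// remain occupied; dropping it gives the range back to the allocator.
pub fn inflate_balloon(bytes: usize) -> Vec<u8> {
    // with_capacity allocates but leaves the length at zero, so the
    // memory is reserved and never touched.
    let balloon: Vec<u8> = Vec::with_capacity(bytes);
    // Best-effort hint so the optimizer treats the allocation as used.
    // The conversation mentions printing an element for the same reason.
    black_box(balloon.as_ptr());
    balloon
}
```

In use, the call would be the first statement of `main`, with the result bound to a variable that lives until the program exits, for example `let _balloon = inflate_balloon(BALLOON_BYTES);`.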

Adam Leventhal: 58:20

Rewatching the recording, I love how quickly we latched onto the term balloon, because it's such a fucking stupid idea of, like, hey, why don't we just create a Vec? And then I think you had to, like, print out the last element of it, because we were sort of worried about the compiler outsmarting us. And, yeah, balloon kind of sanitizes what is kind of a dumb idea that we all came to and thought was the best idea we had.

Bryan Cantrill: 58:47

Yeah. And, actually, I just rewatched that. I think, John, you came up with the term balloon. But certainly, you know, VMware's got the term balloon

Adam Leventhal: 58:57

Yes.

Bryan Cantrill: 58:57

To force a guest. Yeah. To force a guest to get rid of memory. And Adam, I don't know if you just rewatched that recording today. When we got to the haunted VA region, and again, Keith is on there.

Bryan Cantrill: 59:13

You may have caught me saying, this looks like Rich, and I stop. I'm like, I won't say the rest of it. That's a reference to Richmond 16, which is a bug that we had at Joyent. Different symptoms, but it kind of rhymes, in that there was a PA region that would get stepped on, minutes after boot.

Bryan Cantrill: 59:41

And this is actually Okay.

Rain Paharia: 59:42

PA is physical address?

Bryan Cantrill: 59:44

Physical address. And you can argue that the headwaters of Oxide can be found in Richmond 16, because we believe it was the BMC that was doing this. So the BMC was just stepping on our memory a long time after we had booted. A long time being, like, you know, seconds, minutes. It's just like, actually, I'm still using that memory, I think.

Bryan Cantrill: 01:00:09

And it was one of these things where we were chasing this thing for a long time before we realized, like, oh, it's actually hitting the same region, the same physical region. And we ultimately ended up just being like, okay, we're just not gonna allocate out of that region. So I stopped short, because Richmond 16 cast a long shadow at Joyent, as did OS 1028, which was another data corruption bug we had, one that ultimately could manifest itself as on-disk data corruption because we pushed the corrupt data to the disk. So we're beginning to get all these flashbacks. But when we were talking about the... oh, we also did, I don't know.

Bryan Cantrill: 01:00:51

Again, I don't know, Adam, if you watched all that recording. We did have an interesting, like, Scooby Doo discussion there

Adam Leventhal: 01:00:58

about Yes. We did. Yes.

Bryan Cantrill: 01:01:00

And I and I I I feel that this bears reemphasis because I feel that Scooby Doo on the one hand has left the generational chasm. My kids all are like are Scooby Doo scholars, which I appreciate. They definitely I mean, kids watch Scooby Doo, which you can with with you can drop a Scooby Doo reference to basically any living human at this point. I because even the boomers gonna get the because it was on he was on prime Scooby Doo is ultimately like the fact that combines us.

Adam Leventhal: 01:01:29

Third nation Shakespeare.

Bryan Cantrill: 01:01:31

Yes. It is. It is. It It really is. It really is.

Bryan Cantrill: 01:01:34

But with this very important shift where Scooby Doo and I feel it it does represent a shift in the zeitgeist. When Scooby Doo when we were kids, you know, there is there's the ghost at the amusement park. And as it turns out, it's like, actually at the end, it's, you know, it's it's the crooked operator who's trying to scare off the kids or whatever. You know? Always like I feel like real estate scams.

Bryan Cantrill: 01:01:55

Is that what? We're like real estate. That's right. It's like 30%. Don't you think that's right?

Bryan Cantrill: 01:01:58

I feel it's like the but there there was always a a a rational answer. And at some point between then and when my kids started watching the remade Scooby Doos, like, no. The answer is it was a ghost. That's the conclusion. I'm like, what?

Bryan Cantrill: 01:02:18

Where's the end? It's like, no. No. Dad, it was a ghost. I'm like Who

Adam Leventhal: 01:02:21

was the ghost?

Bryan Cantrill: 01:02:22

Acceptable. Who was the ghost?

Adam Leventhal: 01:02:24

Part the post credit scene.

Bryan Cantrill: 01:02:25

Right? Where's the crooked real estate deal? This is obviously and I'm like, no. This is a real problem. This is like a decline in science the scientific education in this country where it's like, no.

Bryan Cantrill: 01:02:36

Now it's just like ghosts happen. And I'm like, you

Adam Leventhal: 01:02:38

know Sometimes the answer is ghost.

Bryan Cantrill: 01:02:40

Straight line to, like, anti vaxxers from this. I gotta tell you. I mean, I just, you know, not to not to turn into one of those podcasts, but anyway, it's a problem. Love it. The I I I'm old school Scooby Doo.

Bryan Cantrill: 01:02:50

We don't the answer is not a ghost. The answer is a crook, and we're gonna find the crook. But in the meantime, we have a haunted VA region. Until until we actually find the crook and unmask them right now, it really does feel like a a lot like a haunted VA region. And then while we're looking at the number, Keith is like, okay.

Bryan Cantrill: 01:03:15

Wait a minute. I actually recognize that number, in terms of the actual hex address. And the hex address is where we map the RAM disk when we boot. Yeah.

Bryan Cantrill: 01:03:29

So we end up with a mapping that we use to map the RAM disk that we intend to discard. This is a temporary mapping. And Keith discovers pretty quickly that, oh, there's a bug. We don't correctly discard this mapping. And if we do correctly discard the mapping, then the problem goes away.

Bryan Cantrill: 01:04:00

And on the one hand, you're like, problem solved. And you're like, what? No. No. It's like the ghost the ghost can't be the answer.

Bryan Cantrill: 01:04:09

That's actually that's still having a ghost be the answer.

Adam Leventhal: 01:04:11

Because it's it's not a mapping, like, in this process. Right? Like, it's a it's a mapping in the kernel.

Bryan Cantrill: 01:04:19

It's a mapping in the kernel. It's a mapping in the kernel. And it is it like, that is it's a kernel mapping, and this is a process. Like, things It's a process. Right?

Bryan Cantrill: 01:04:34

Yeah. And it's like and so it's like, oh, well, it's a stowaway mapping that's getting from the kernel to the process. And then it's like and then flickering?

Adam Leventhal: 01:04:43

And then leaving and then coming back.

Bryan Cantrill: 01:04:45

Leaving and then coming back? It's like, that's not making sense. But we do have a kind of important breakthrough in that, okay, if the mapping is present, this problem happens. And at this point, by the way, we have another kind of important breakthrough.

Bryan Cantrill: 01:05:14

It's like, okay, we've got the haunted VA region. We don't need Installinator at all. We don't need any of this stuff. And in particular, I'm like, and I had kind of forgotten that we had done this, I can just use DTrace to show the bug, because the copyin action copies memory in from a user region into the kernel, and we can point it into the haunted VA region, which we don't expect to be mapped in a process.

Bryan Cantrill: 01:05:47

So we would not expect that copyin to succeed. We would expect to get an error. And just an artifact Yeah.

Adam Leventhal: 01:05:54

Effectively like a safe SEGV, almost, you can imagine, from DTrace.

Bryan Cantrill: 01:05:59

That's right.

Adam Leventhal: 01:06:00

Reading memory that's not mapped, but it's like a soft error rather than, like, crashing the process.

Bryan Cantrill: 01:06:05

Yeah. And something that's extremely important in DTrace: when you are doing a copyin from a process, for example, you're just tracing memory. If you give it an address that is incorrect, obviously we can't die. We can't panic the system, clearly. So what DTrace does is DTrace gives you an error.

Bryan Cantrill: 01:06:22

It gives you an explicit error. Like, you tried to do this and that address is not there. But your DTrace program continues to run. The program's not aborted. What is aborted is the clause that you're executing.

Bryan Cantrill: 01:06:33

So if you have an error, we're not gonna execute this clause any further, because we've had this error. But we're gonna allow the enabling to continue to operate. And so what that allowed us to do is like, alright, we're going to attempt this copyin, kind of a bogus copyin, and that should fail all the time. And then we're gonna have an action after that that actually does whatever we want to record that we got here, you know, aggregating by process name and what have you.
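The detection trick above relies on an error aborting only the current clause while the rest of the program keeps running. The following is a toy Rust model of that control flow, not DTrace itself: `Proc`, `copyin`, `clause`, and `run` are hypothetical names, and the real experiment used a D program whose exact text is not reproduced in the conversation.

```rust
use std::collections::BTreeMap;

/// Error returned by the simulated copyin when the address is unmapped.
#[derive(Debug, PartialEq)]
pub enum CopyinError {
    BadAddress(u64),
}

/// Hypothetical stand-in for the process a probe fired in: its name, and
/// whether the haunted address happens to translate at that moment.
pub struct Proc {
    pub name: &'static str,
    pub haunted_mapped: bool,
}

/// Simulated copyin: succeeds only if the address currently translates.
fn copyin(p: &Proc, addr: u64) -> Result<u8, CopyinError> {
    if p.haunted_mapped {
        Ok(0)
    } else {
        Err(CopyinError::BadAddress(addr))
    }
}

/// One clause: the bogus copyin first, then the recording action.
/// An error from copyin aborts the rest of this clause and nothing more.
fn clause(
    p: &Proc,
    addr: u64,
    agg: &mut BTreeMap<&'static str, u64>,
) -> Result<(), CopyinError> {
    copyin(p, addr)?;
    // Only reached when the copyin unexpectedly succeeded.
    *agg.entry(p.name).or_insert(0) += 1;
    Ok(())
}

/// Run the clause for every firing; errors are counted and tracing continues.
pub fn run(firings: &[Proc], addr: u64) -> (BTreeMap<&'static str, u64>, u64) {
    let mut agg = BTreeMap::new();
    let mut errors = 0;
    for p in firings {
        if clause(p, addr, &mut agg).is_err() {
            errors += 1;
        }
    }
    (agg, errors)
}
```

Any process name that shows up in the aggregation is one where the copy that should always fail went through at least once.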

Bryan Cantrill: 01:07:02

And then you you run that, you're like, I I I think I know what I'm gonna see here, and it's very scary. And sure enough, you run this, and you're like, oh, yeah. All these processes, like, oh, yeah. Like, you know, config d and all these other processes that are running on the system, like, yeah, they all they all have this little bonus haunted VA region some fraction of the time as well.

Adam Leventhal: 01:07:24

Right. Like, so in other words, like, if they had just constructed a pointer into that area, like, sometimes if they did a load or even a store, it might have worked.

Bryan Cantrill: 01:07:33

It might have worked. Okay. Yeah. That's the other thing we realized. We were poisoning this region by storing to it.

Bryan Cantrill: 01:07:41

And we begin to realize that, like, oh, if we store to it, another process will see the same corruption over time. And you're like, what the fuck? Okay. And Keith was like, I can't remember if this memory is being deallocated or not.

Bryan Cantrill: 01:08:00

Like, I think it must not be being deallocated, because otherwise we'd be just kind of corrupting random memory all over the place. So we now can see this in every process. And then we can also see that, okay, if I go into that page table and I clear the accessed bit in the kernel's page table, my DTrace program that is doing the copyin is setting the accessed bit. And this is like, okay. This is both good and terrifying.
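The accessed-bit experiment can be sketched in a few lines of Rust. The bit positions are the x86-64 page table entry layout (bit 0 is Present, bit 5 is Accessed). The function names, and the `hardware_walk` stand-in for what the processor does when it uses an entry, are hypothetical and only model the idea.

```rust
/// x86-64 page table entry flag: the entry is present.
pub const PTE_PRESENT: u64 = 1 << 0;
/// x86-64 page table entry flag: hardware sets this when it uses the entry.
pub const PTE_ACCESSED: u64 = 1 << 5;

/// What the debugger does by hand: clear the accessed bit.
pub fn clear_accessed(pte: u64) -> u64 {
    pte & !PTE_ACCESSED
}

/// Check whether the accessed bit is currently set.
pub fn was_accessed(pte: u64) -> bool {
    (pte & PTE_ACCESSED) != 0
}

/// Toy model of a hardware page walk that uses a present entry:
/// the walker marks the entry as accessed.
pub fn hardware_walk(pte: u64) -> u64 {
    if (pte & PTE_PRESENT) != 0 {
        pte | PTE_ACCESSED
    } else {
        pte
    }
}
```

If the bit is cleared, the experiment runs, and the bit is set again, something walked that entry, which is the "good and terrifying" observation in the conversation.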

Bryan Cantrill: 01:08:39

Good because we're beginning to connect some dots here and, like, there's blood in the water. We're definitely gonna get this thing. Bad in that this page table is, like, over here. And when we are doing the copyin, we are on a CR3, which is kind of the base of the page tables, but that is gonna be pointing into the user address space. It somehow is accessing this other page table.

Bryan Cantrill: 01:09:11

And the other kind of interesting data point that we had across this is like it definitely you needed a context switch to actually to flicker.

Adam Leventhal: 01:09:20

That's right. The the oscillation between good and bad data required a context switch between them.

Bryan Cantrill: 01:09:27

That's right. You had to, like, go in and out of the kernel to see this. And you're like, okay. So then you you you begin to add all this up, and I become increasingly convinced that, like, this has gotta be speculation. And we have gotta be there's gotta be something that we are doing where we and in particular, this has gotta be so the and it should be said that this address and I'm gonna gonna upload the the doc that we sent ultimately to AMD, and that will go into to to the the details of and I don't know that he even yeah.

Bryan Cantrill: 01:10:13

There's actually a portion of the team say proprietary company. That's fine. But this was the doc that we submitted to them, and this kind of describes the entire problem. But the thing that is important to know is because of kernel page table isolation. You kind of mentioned this in passing, Adam, but this is really important. So KPTI had to be invented by us all at the tip of a bayonet due to Meltdown.

Bryan Cantrill: 01:10:41

So Meltdown: people remember Spectre and Meltdown were the annus horribilis of 2018. This was discovered in early 2018; January 1, I believe, 2018 is when that was kind of publicly disclosed. Meltdown and Spectre, these were both bad. Meltdown was extremely bad. What Meltdown allowed you to do is get the CPU to execute speculatively based on addresses that you weren't allowed to read as a user process.

Bryan Cantrill: 01:11:14

And you could then execute conditionally based on that speculation, and then by using a timing attack, figure out what was in cache and what was not in cache. And that in turn allowed you, by kind of changing the way you do this based on every bit, to actually begin to map cache lines that you can't read, which is very scary. And Meltdown was very, very bad because there's no, like, workaround. The CPU's busted. The Intel CPUs, just flat out busted.

Bryan Cantrill: 01:11:54

Fortunately, they discovered that you could prevent this by having true kernel page table isolation. So in other words, it used to be that a user application and the kernel would be pointed to from the same CR3, and then there's a bit that would basically determine whether this is a user mapping or not. But as it turns out, that bit is like, okay. Well, no. Apparently, we're gonna ignore that bit.

Bryan Cantrill: 01:12:20

So what we're gonna actually do is we're gonna move you. We're gonna have a user CR3 that is for user mappings only; it can't see any kernel mappings. And then we have a kernel user CR3.

Bryan Cantrill: 01:12:37

That is a CR3 that includes both kernel mappings and user mappings, because the kernel needs to be able to see user mappings to be able to do copyins and copyouts and so on. The thing that's wild is that our stray mapping, our stowaway mapping, is in fact not in either the user CR3 or the kernel user CR3. It's actually in a different CR3 that we call, in this document, the kernel only CR3. But this is a very, very small CR3. And this was basically done, I mean, this is kind of funny.

Bryan Cantrill: 01:13:17

We actually did this as kind of a protection mechanism, kind of belt and suspenders against future possible attacks: we wanted to switch you on to a kernel only CR3 before we switch on to the kernel user CR3. And it becomes pretty clear that when we are in that kernel only CR3, for that very brief period where we do have the stowaway mapping, the processor is executing speculatively. And in particular, it is loading a page translation to the stowaway mapping, and that is being loaded into the TLB. And this is where we get to, like, the difference of opinion, whether that should be allowed or not. But that gets loaded into the TLB.

Bryan Cantrill: 01:14:12

Then later, when we were to do the DTrace copyin, for example, that would hit in the TLB. And we would use the wrong mapping instead of the right mapping. First of all, this explains the flickering, because it depends on whether you have the wrong mapping in the TLB or not. And whether you have the wrong mapping in the TLB depends on whether it happened to speculatively execute when you were in the state where you had this kernel only CR3. So apologies if that explanation is
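The mechanism described here can be pictured with a deliberately simplified Rust model. It assumes a single untagged TLB whose entries survive a CR3 switch; that is a modeling shortcut, and the real mechanism is the subject of the document mentioned in the conversation, which is not reproduced here. The `Cpu` type and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// The three page table roots described in the conversation.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Cr3 {
    User,
    KernelUser,
    KernelOnly,
}

/// Toy CPU: per-CR3 page tables plus one shared, untagged TLB.
pub struct Cpu {
    pub tables: HashMap<Cr3, HashMap<u64, u64>>,
    pub tlb: HashMap<u64, u64>,
    pub cr3: Cr3,
}

impl Cpu {
    pub fn new() -> Self {
        Cpu { tables: HashMap::new(), tlb: HashMap::new(), cr3: Cr3::User }
    }

    /// Install a VA -> PA mapping in the page table rooted at `cr3`.
    pub fn map(&mut self, cr3: Cr3, va: u64, pa: u64) {
        self.tables.entry(cr3).or_default().insert(va, pa);
    }

    /// Model assumption: switching CR3 leaves existing TLB entries alone.
    pub fn switch(&mut self, cr3: Cr3) {
        self.cr3 = cr3;
    }

    /// Walk the page table for the current CR3 only.
    fn walk(&self, va: u64) -> Option<u64> {
        self.tables.get(&self.cr3)?.get(&va).copied()
    }

    /// Speculative access: fills the TLB if the current table maps `va`.
    pub fn speculate(&mut self, va: u64) {
        if let Some(pa) = self.walk(va) {
            self.tlb.insert(va, pa);
        }
    }

    /// Architectural access: a TLB hit wins, otherwise walk and fill.
    pub fn translate(&mut self, va: u64) -> Option<u64> {
        if let Some(&pa) = self.tlb.get(&va) {
            return Some(pa);
        }
        let pa = self.walk(va)?;
        self.tlb.insert(va, pa);
        Some(pa)
    }

    /// Drop every cached translation.
    pub fn flush(&mut self) {
        self.tlb.clear();
    }
}
```

In this model, a translation under the kernel user CR3 fails as expected unless a speculative access happened while the kernel only CR3 was active, in which case it returns the stowaway physical address until the TLB is flushed. That on-and-off behavior is the flicker.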

Adam Leventhal: 01:14:54

Oh, that's great.

Bryan Cantrill: 01:14:55

Well, you know, it's actually funny because at some point, Adam, you asked Keith to give a two minute explanation for

Adam Leventhal: 01:15:02

Yes.

Bryan Cantrill: 01:15:02

How the RAM disk is loaded. And Keith gives an extraordinary two minute explanation and then apologizes for how poor

Adam Leventhal: 01:15:09

it is.

Bryan Cantrill: 01:15:09

It's like, I'm really sorry. That was like that was like a discombobulated mess. I'm like, that was that was just like a it was scripted. But I I and so my explanation is nowhere near as good, but the the the the the document goes into a lot of details there. I will say that this is, like, major deja vu for me because now I'm like, wow.

Bryan Cantrill: 01:15:27

This is really reminding me of something way, way, way back in the day. So in the late nineties, I had a really pathological bug. I had one of these psychotic bugs. It's both, psychotic. It was non reproducible.

Bryan Cantrill: 01:15:41

Got it to the point with enough reproducibility, I was able to determine what was going on. And what was going on was that the the chip was not honoring a TLB shoot down. And in particular, the way we do a TLB shoot down in the operating system, which is to say like, okay, I should I probably should I probably should mention what the TLB is. Yeah. Yeah.

Bryan Cantrill: 01:16:01

Sorry. Exactly. Sorry about that. So the TLB is the translation lookaside buffer. When you have a virtual address and you're gonna turn it into a physical address, those page tables are gonna be stored in memory.

Bryan Cantrill: 01:16:14

So you need to do multiple memory operations to determine what that physical address is. And you need to do that translation every time you have a memory operation. So it's really, really important that that translation is cached. And the TLB, which is a very old construct and surely dates, it's gotta be like an IBM model 85 thing or model I don't know. I should go look up the actual origins of the TLB.
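To make the "multiple memory operations" concrete: with x86-64 four-level paging and 4 KiB pages (an assumption here, since the conversation does not name the paging mode), a 48-bit virtual address splits into four 9-bit table indices and a 12-bit page offset, and each index costs one memory read during the walk. The struct and function names in this Rust sketch are hypothetical.

```rust
/// Indices used at each level of a four-level x86-64 page walk.
#[derive(Debug, PartialEq)]
pub struct WalkIndices {
    pub pml4: usize,
    pub pdpt: usize,
    pub pd: usize,
    pub pt: usize,
    pub offset: usize,
}

/// Split a virtual address into its per-level indices.
/// Bits 47:39, 38:30, 29:21 and 20:12 index the four tables;
/// bits 11:0 are the offset within the 4 KiB page.
pub fn split_va(va: u64) -> WalkIndices {
    WalkIndices {
        pml4: ((va >> 39) & 0x1ff) as usize,
        pdpt: ((va >> 30) & 0x1ff) as usize,
        pd: ((va >> 21) & 0x1ff) as usize,
        pt: ((va >> 12) & 0x1ff) as usize,
        offset: (va & 0xfff) as usize,
    }
}
```

Four dependent loads before the data access can even start is why the cached translation matters so much.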

Bryan Cantrill: 01:16:42

But the translation lookaside buffer dates back to the earliest machines with virtual memory, because you have to cache this. Like, you can't not have a TLB. But because the TLB is a cache, wherever we have caching, you immediately have coherence questions. And so you have to be sure that when you are taking down a mapping, you are also taking down all of the cached associations between the virtual address and the physical address.

Bryan Cantrill: 01:17:16

And we call that a TLB shootdown in a multiprocessor system. Because this translation could be in a bunch of different CPUs, and we need to be like, yo, you need to actually get rid of this mapping. You need to invalidate this mapping. And microprocessors have different ways, but on x86, it's called invlpg. It's an instruction that you execute with an address, and that would actually shoot the mapping down.

Bryan Cantrill: 01:17:46

And what I realized is that invlpg, in some very rare circumstances, was not actually shooting down the translation. In particular, on the next instruction, it would actually still be in the TLB. Again, under rare conditions. And what I further realized was that this had to do with the fact that the mapping itself was still valid. Because in the operating system, whether the mapping is valid or not shouldn't matter when you shoot it down, because I know I'm not gonna execute on any more CPUs.

Bryan Cantrill: 01:18:30

No one is gonna be loading through this mapping or they would die. So whether the mapping is valid when I shoot it down, or I shoot it down and then rip down the mapping, shouldn't actually matter. But as it turns out, it really did matter. Because if you did the shootdown with the mapping still being valid, as it turns out, the microprocessor would speculatively execute, and it would speculatively load whatever was in EAX. And whatever is in EAX is actually the address you're trying to shoot down.

Bryan Cantrill: 01:18:59

So you would literally do the invlpg instruction. That would shoot it out of the TLB, and then it would speculatively execute, and it would load through EAX and load it right back into the TLB.
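The ordering problem in this story can be shown with a toy Rust model. It is a sketch under stated assumptions: the `Mmu` type and function names are hypothetical, and the speculative reload is modeled as always happening right after the invalidation, whereas the conversation describes it as occurring only under rare conditions.

```rust
use std::collections::HashMap;

/// Toy MMU: one page table and one TLB, both keyed by virtual address.
pub struct Mmu {
    pub page_table: HashMap<u64, u64>,
    pub tlb: HashMap<u64, u64>,
}

impl Mmu {
    /// Start with `va` mapped and already cached in the TLB.
    pub fn with_mapping(va: u64, pa: u64) -> Self {
        let mut page_table = HashMap::new();
        let mut tlb = HashMap::new();
        page_table.insert(va, pa);
        tlb.insert(va, pa);
        Mmu { page_table, tlb }
    }

    /// Invalidate `va`, then model a speculative load through the same
    /// address, which refills the TLB if the page table still maps it.
    pub fn invalidate_then_speculate(&mut self, va: u64) {
        self.tlb.remove(&va);
        if let Some(&pa) = self.page_table.get(&va) {
            self.tlb.insert(va, pa);
        }
    }
}

/// Order from the story: shoot down first, tear the mapping down after.
pub fn shootdown_while_valid(mmu: &mut Mmu, va: u64) {
    mmu.invalidate_then_speculate(va);
    mmu.page_table.remove(&va);
}

/// Alternative order: tear the mapping down first, then shoot down.
pub fn teardown_then_shootdown(mmu: &mut Mmu, va: u64) {
    mmu.page_table.remove(&va);
    mmu.invalidate_then_speculate(va);
}
```

With the first ordering the model ends with a stale TLB entry for a mapping that no longer exists; with the second, the speculative load has nothing valid to reload.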

Adam Leventhal: 01:19:08

And that load through EAX, that speculative load, was sort of just for funsies. Just like, I don't know. I've got something that looks like a pointer. That's right. People love dereferencing pointers.

Bryan Cantrill: 01:19:18

Let's give it a shot. Let's give it a shot. And this was one of these where in hindsight, I I I was I was being given a small bite of the apple of the tree of knowledge of good and evil because I'm like, wow. That is, like, really aggressive speculation. And this is in, like, 1999 maybe.

Bryan Cantrill: 01:19:39

And this is just because the the kind of the dark side of speculation is that, like, speculative execution is really required to get around the memory wall for these microprocessors. And kind of like in

Adam Leventhal: 01:19:49

a way wall being the disparity in speed between the

Bryan Cantrill: 01:19:52

the memory. Yeah. Exactly. If you don't actually hit in cache, we're just gonna spend all of our time blocked on memory. And we got around the memory wall a couple different ways, but one of them was we got very, very good at speculatively executing.

Bryan Cantrill: 01:20:07

And this is one of these where I'm like, oh my god, I don't think SPARC is doing that. And of course, it's like, no, SPARC's not doing that, pal. That's the problem.

Bryan Cantrill: 01:20:15

That's why SPARC sucks.

Adam Leventhal: 01:20:16

Have you noticed how slow SPARC is? Right?

Bryan Cantrill: 01:20:18

A 100%. At one point, Jon Masters is like, I wanna get a hold of some old SPARC machines because I bet I can find a lot of speculative execution.

Adam Leventhal: 01:20:25

I was in there.

Bryan Cantrill: 01:20:26

And I'm like, no, unfortunately. Good luck. Unfortunately, I don't think we were. I wish. Maybe if we had speculatively executed a little bit better in the microprocessor, we'd have been a little more competitive.

Bryan Cantrill: 01:20:39

So with Intel, I get this bug totally. I present this to Intel. And Intel's like, well, you really shouldn't be doing the shootdown with the mapping valid. And by the way, you're the only operating system that does it that way. And I was super sad because I'm like, okay.

Bryan Cantrill: 01:20:59

Well, I spent a lot of time debugging this and am being basically victim blamed for it. And then I'm like, alright, this is fine. Give me a note in the architecture manual. I just want like a paragraph. And then I had this plan.

Bryan Cantrill: 01:21:13

Do you remember the shirts that we had made, that Roger had, with the paragraph on it from the SPARC architecture? No. No. Okay. So this is like super deep lore.

Bryan Cantrill: 01:21:22

And actually, this predates me. We need to get someone to get... So there was a /proc bug that Roger could not find for years. And there was a footnote in the SPARC architecture manual that basically explained this data corruption issue that he was affected by. I don't know if it was data corruption, but it was basically a correctness issue. And an engineer, I'm trying to remember the name of the engineer at SMCC who ultimately found this.

Bryan Cantrill: 01:21:46

But Roger was so elated to find this, and to find that it was present in the architecture manual, that either he made or someone made for him shirts that had that footnote on them. Roger being, yes, Roger Faulkner, the late Roger Faulkner, the inventor along with Ron Gomes of /proc. And I had one of those shirts. This is one of these shirts that I'd probably, like, catch the kids wearing. Every once in a while, I'd be like, you know, you're wearing like, sorry.

Bryan Cantrill: 01:22:13

You're, you know, you're wearing like the shirt. The Magna Carta. That's right. No. Exactly.

Bryan Cantrill: 01:22:19

That's right. You just wrote someone's phone number on the back of the Magna Carta, by the way. But I've got that shirt, I hope, still somewhere. So I have the same idea.

Bryan Cantrill: 01:22:32

I'm like, they will do the architecture note, and then, yes, I will accept the fact that our operating system was, I guess, incorrect. I don't know why, but we will correct it, and then I will have this t shirt. At least I'll have the t shirt. And I'm like, it seems reasonable to clarify in the architecture manual that at the time that an invlpg happens, in order for that invlpg to be correct, there must be no valid translation in the page tables.

Bryan Cantrill: 01:22:59

Like, that feels like a very reasonable right? Does that seem reasonable? I mean, another way I'm saying it, And they're like, no. I'm not gonna say that. I'm like, how about, like an email or something?

Bryan Cantrill: 01:23:13

Like an erratum, or could you maybe send a clarifying bulletin? They're like, no. No. Don't think we're doing that. Like, okay.

Bryan Cantrill: 01:23:22

Alright. Well, this has been fun. So that was that I'm like, wow, this bug is really, really similar. And so how did we ultimately determine that speculation was involved? Fortunately, we do there are ways to disable and these are like super deep in the innards of the CPU.

Bryan Cantrill: 01:23:41

It's not something that they really expose, but you actually have MSRs that you can use to disable certain speculative features, which is actually really valuable. And the and I really appreciate our a huge shout out to our folks, to Will Forman, Paul Grimes, a bunch of other folks at AMD have been working really hard to get us the necessary documentation. So we actually had the ability. Like, we actually know what these bits mean. I'm like, we can go disable the and I first of all, I I I tried I think I tried to disable, like, all speculative loads.

Bryan Cantrill: 01:24:17

The or I I did something it was like a big hammer. I think I did maybe I I I I That's what

Adam Leventhal: 01:24:23

you said on that demo day video. You disabled all speculation.

Bryan Cantrill: 01:24:26

Yes. Very big hammer as it turns out. Like, machine slows to a crawl. They're like, oh, okay. So let's not okay.

Bryan Cantrill: 01:24:35

Never mind. Let's not do that. So you gotta be a little bit more fine tuned, but fortunately, you can disable speculative TLB loads and reloads. In other words, the TLB being loaded speculatively. Which, interestingly, dramatically reduced it but didn't eliminate it, which is like, oh god. Okay. Interesting.

Bryan Cantrill: 01:25:06

But you can also do it on the instruction side, and if you disable both, that completely eliminated it. And you're like, oh, wow. Interesting. So it's like something is also interpreting it on the I side, on the instruction side.

Adam Leventhal: 01:25:20

Speculating on the I side.

Bryan Cantrill: 01:25:22

Right? Speculating on the I side. Yeah. It's like, wow. This is amazing.

Bryan Cantrill: 01:25:25

And we did manage to reproduce this. This is on Milan, which is what we were developing on at the time. But once we knew exactly what this is, it's like, okay, we can go reproduce this anywhere. We don't actually need to be on a Gimlet. We just need to create the stowaway mapping, and then we can actually see this. And so we were able to reproduce it on a Gimlet, and we were able to reproduce it on a Milan in an Ethanol-X, kind of their stock system.

Bryan Cantrill: 01:25:52

Then we were also able to reproduce it on a Genoa, which was their newest CPU at the time. And the first concern was that this is potentially very bad, because we are speculating across this protection boundary. Like, is there any way for a user... because one thing that's really important is that when we were copying out data, and when we were using copyin in DTrace, those copies are actually happening in the kernel. They're not actually happening in userland. So part of the reason we were observing this when it first came off the wire is because it is the kernel that had copied it out to us. So if that user process attempted to load from that same VA, and this is why we saw the pattern we saw.

Bryan Cantrill: 01:26:45

If the user process would load from that same VA, what it would see would actually be the correct data. But it was the incorrect data, the corrupted data, that was copied into it. So you're like, okay.

Bryan Cantrill: 01:27:08

Is there a way, and this was kind of our question to AMD, for this to be seen in userland? Our kind of assumption was no. But is there a way for this errant speculation to be abused in some way, to undermine KPTI, was our other kind of concern. AMD's answer was pretty similar to Intel's answer from a generation and a half ago. It's basically like, nah.

Bryan Cantrill: 01:27:44

It's fine. Nah. Nah. And then now I will say this about AMD. And I'm like, okay.

Bryan Cantrill: 01:27:50

So what about like a note in the because as it turns out, we are, I mean, again, we're used to this, but we're apparently the only operating system that has because the issue is that we're going between two CR3s in the kernel. And this wouldn't be an issue if we only actually had the more conventional thing of so like, okay. So what about a clarifying note that if you do this, you should be aware that speculation can occur across this? And actually to AMD's credit, they're like, yeah, sure. Send us the paragraph.

Bryan Cantrill: 01:28:30

And I just haven't done that yet. I probably should. I probably should have said so, like, AMD is much more receptive to the paragraph, and I really owe them the paragraph. So I think it's the other reason I've been reticent to do this as a podcast episode. Feel like

Adam Leventhal: 01:28:41

I've got So instead, we're giving them this.

Bryan Cantrill: 01:28:44

We're giving them this. We're giving them the gift of a podcast episode. And, you know, I don't know. I guess this is just what makes me like an older and wiser engineer is that I'm no longer you know, I've grown out of printing errata on a t shirt and wearing it on my chest. Like, that's a it's a phase as it turns out.

Bryan Cantrill: 01:29:00

That's like a phase that you grow out of.

Adam Leventhal: 01:29:02

So It's good to hear.

Bryan Cantrill: 01:29:03

Parents of kids who are printing out architecture manuals as t shirts, like, just just be patient with the developing It will

Adam Leventhal: 01:29:13

You know, one of the things we were talking about the other day, Bryan, is like, what if we had not hit this in Installinator?

Bryan Cantrill: 01:29:19

Oh, okay.

Adam Leventhal: 01:29:19

Like, like, what if this you know, we had never used this virtual address range? Because because there's no reason why we needed to have. Like, what would have happened? Like, what

Bryan Cantrill: 01:29:31

would have all happened? Just gets very close to what if the Confederacy wins the Civil War, I feel. I mean, in terms of just like it's like, it's just not good. A 100% not good.

Adam Leventhal: 01:29:42

Short answer, not good.

Bryan Cantrill: 01:29:44

Long answer, not good. Long answer, of much historical fiction. And, you know, The Man in the High Castle, but for data corruption. I mean, it's hard to know where we would have seen this.

Bryan Cantrill: 01:30:01

Yeah. I mean, we would have seen it. It feels like I mean, the thing that I find chilling is that we would have seen it in a way that we wouldn't even have known how to bifurcate it. You know what I mean? It's like we could have seen it.

Bryan Cantrill: 01:30:16

And it's also, like, corrupted. I don't know. I and then, oh, here's another question. Like, had we seen it before?

Adam Leventhal: 01:30:23

And because I think there were, like, some tough to understand issues with Cockroach. And I know we've talked about Cockroach debugging a couple times. Yeah. But, you know, not impossible that that thing had a big heap that could have wandered into the haunted VA range.

Bryan Cantrill: 01:30:44

Yeah. I mean, the good news is that you do have you can eliminate this very quickly. That is like, if you don't have this VA mapped, not this problem. Because if, like, if you don't have the VA mapped, it means that you were in the business of just, fishing in addresses that you didn't have mapped. And it's like, what do what do you have?

Bryan Cantrill: 01:31:04

Six eight v hand? I mean, they just That's like kind of what you do for it. Is this what you do for kicks? Like, you know, don't wanna kink shame a process, but like, if this is kind of what you do for kicks, like that's but so you have to have this thing mapped realistically for this to be and now that said, I mean, unfortunately, the corruption happens at such a distance. It's like if you had it mapped in the I mean, it's like, yeah, it gets ugly.
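The quick elimination described here (no mapping in the affected VA range means it is not this problem) amounts to an interval-overlap check. A minimal sketch follows; the `Mapping` type and the range bounds are hypothetical placeholders, not the actual addresses involved.

```rust
/// A mapped region of a process address space: [start, start + len).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
struct Mapping {
    start: u64,
    len: u64,
}

/// True if the mapping overlaps the half-open range [lo, hi).
fn overlaps(m: &Mapping, lo: u64, hi: u64) -> bool {
    // saturating_add avoids overflow for mappings near the top of the space.
    let end = m.start.saturating_add(m.len);
    m.len > 0 && m.start < hi && lo < end
}

/// True if any mapping of the process touches [lo, hi).
fn touches_range(mappings: &[Mapping], lo: u64, hi: u64) -> bool {
    mappings.iter().any(|m| overlaps(m, lo, hi))
}

fn main() {
    // Placeholder bounds; the real affected range is not given here.
    let (lo, hi) = (0x4000_0000u64, 0x4000_2000u64);
    let mappings = [
        Mapping { start: 0x1000, len: 0x2000 },
        Mapping { start: 0x4000_1000, len: 0x1000 },
    ];
    if touches_range(&mappings, lo, hi) {
        println!("process maps the suspect range; cannot rule it out");
    } else {
        println!("suspect range unmapped; not this problem");
    }
}
```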

Bryan Cantrill: 01:31:30

Yeah. Gets ugly. Good that we caught it. Yeah. So I feel like, I mean, bunch of lessons.

Bryan Cantrill: 01:31:36

I do feel like the the one of the most important ones to me is, John, the the very important half measure that you took. And because the other thing that you did that I thought was really good that I I I can struggle to do is you did that pretty quickly. Do you know what I mean? Like, you didn't like, well, let's like analyze this for six months. It's like, no.

Bryan Cantrill: 01:32:00

No. Let's just like, this is scary enough that we just need to because that was pretty quick as I recall.

John Gallagher: 01:32:06

Yeah. Well, I mean, sort of like necessity is the mother of invention kind of thing. I mentioned I was going on vacation, right? I actually merged that like Saturday morning before our flight, right? It was like, I'm going out of town.

John Gallagher: 01:32:19

Like I screwed something up in writing the OS image. Let me at least put some stuff in here to save everybody else the time of like, it's corrupt and you don't find out until after. Sort of like Rain said, right? Like it's much better to get an error during Installinator than to find out like a sled won't boot after. I'm like, I'm just gonna save everybody some time and fail this thing so at least they can like know to retry it faster.
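The half measure described here, failing during installation instead of discovering a corrupt image at boot, can be sketched as a verify-after-write step with a bounded retry. This is only an illustration under stated assumptions: the function names are hypothetical, the transcript does not say how the real check was implemented, and `DefaultHasher` stands in for the cryptographic hash a real installer would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::Path;

/// Stand-in checksum (assumption: a real installer would use a
/// cryptographic hash rather than the standard library's hasher).
fn checksum(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(data);
    h.finish()
}

/// Write `image` to `path`, read it back, and fail if the checksum of what
/// was read does not match `expected`.
fn write_and_verify(path: &Path, image: &[u8], expected: u64) -> io::Result<()> {
    fs::write(path, image)?;
    let read_back = fs::read(path)?;
    if checksum(&read_back) != expected {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "image checksum mismatch after write",
        ));
    }
    Ok(())
}

/// Retry the write a bounded number of times so a failure surfaces quickly.
fn install_with_retry(
    path: &Path,
    image: &[u8],
    expected: u64,
    attempts: u32,
) -> io::Result<()> {
    let mut last = None;
    for _ in 0..attempts {
        match write_and_verify(path, image, expected) {
            Ok(()) => return Ok(()),
            Err(e) => last = Some(e),
        }
    }
    Err(last.unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "no attempts made")))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("image_verify_sketch.bin");
    let image = vec![0xA5u8; 4096];
    let expected = checksum(&image);
    install_with_retry(&path, &image, expected, 3)?;
    fs::remove_file(&path)?;
    Ok(())
}
```

The point of the design is the one made above: an error at install time tells the operator to retry immediately, where a silent success would only be found out when the sled fails to boot.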

John Gallagher: 01:32:40

So it I I if I hadn't been going out of town, would I have still done that? I don't know. Right? That was two years ago.

Bryan Cantrill: 01:32:45

But Right.

John Gallagher: 01:32:46

Lucky break.

Bryan Cantrill: 01:32:47

Well, it was a lucky break. It was a very lucky break. And then I think as we began and then I so I think that to me is a big lesson. I think another big lesson, I think we saw this a lot, it's like this really benefited from the parallelization, and lots and lots of people contributed to the debugging of this. Yeah.

Bryan Cantrill: 01:33:09

And, you know, and sometimes it's like, I just was listening to the recording. Like, you know, I really appreciate it. Like, Niels is trying to reproduce this on a non-Oxide system as, like, our reproducer. Like, non-Oxide being, like, x86 running Helios, but not a and that's a really valuable endeavor. Right?

Bryan Cantrill: 01:33:25

And even though it didn't and because it's like, if that yields something, it's really important. Other people are like, can I reproduce this in a different VA range? Because that's really important if we can do that. So I think that the parallelization you get with debugging, it's kind of surprising to me that debugging is so parallelizable, but it really is. Especially when you have an existential bug.

Bryan Cantrill: 01:33:48

It's always you know, that's

Adam Leventhal: 01:33:50

Yeah. No. But you're right. The way that we debug, the way we do these group debug sessions, is very different. Not something we did before the pandemic.

Adam Leventhal: 01:33:59

Not something I had done before the pandemic. Maybe you did at Joyent. But I don't know. It's a great way

Bryan Cantrill: 01:34:05

to do it. It's a great way to do it. I mean, I think it happened a little bit via chat and less over Meet, but it it's great. Yeah. I think it's I also love that you get, just watching the recordings of this.

Bryan Cantrill: 01:34:18

And, you know, because I think another lesson from this is I really appreciated the recordings. It was great to be able to kind of, walk through all this. Oh, someone's asking like, where did we land on this? What was the solution? So sorry.

Bryan Cantrill: 01:34:27

And I did kind of gloss over it. The solution is there was a bug in the operating system, and I realized, unfortunately, it got moved from Omicron to stlouis, which means that it is not publicly available, but I'll get this bug publicly out there so people can look at it. We'll get it in the show notes. There was a bug in the operating system that kept this errant mapping. The mapping should have been flushed completely.

Bryan Cantrill: 01:34:49

So that mapping, that stowaway mapping, should not have existed. So the fact that there was a stowaway mapping, that is actually the bug. And so the kernel-only CR3 really should be very empty. And it shouldn't actually be a problem that the chip can speculate across this. So that is the actual problem.

Bryan Cantrill: 01:35:14

And then with this kind of, like, note of, like, if you have something in your kernel-only CR3 and you've got another CR3 that is also kernel accessible, you can actually get speculation across these two. So if you have a mapping in the former that's not in the latter, you should be aware that you can load translations, which again is now a nonissue for us, but that was a grave issue for us. So that's the actual proper fix. And then, John, I think we got rid of the balloon, I trust.
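As a toy model of the condition described here, each CR3 can be treated as a map from virtual page to physical frame, and a stowaway is any page mapped in the kernel-only table that the other table does not map identically. This is an illustration only: `PageTable`, `stowaways`, and the page numbers are hypothetical, and real page tables and the hardware's speculative behavior are far more involved.

```rust
use std::collections::BTreeMap;

/// Toy page table: virtual page number -> physical frame number.
type PageTable = BTreeMap<u64, u64>;

/// Virtual pages mapped in `kernel_only` that are either absent from
/// `other` or mapped there to a different frame.
fn stowaways(kernel_only: &PageTable, other: &PageTable) -> Vec<u64> {
    kernel_only
        .iter()
        .filter(|(vpn, pfn)| other.get(*vpn) != Some(*pfn))
        .map(|(vpn, _)| *vpn)
        .collect()
}

fn main() {
    let mut kernel_only = PageTable::new();
    kernel_only.insert(0x10, 0xAAA); // shared kernel text, same in both tables
    kernel_only.insert(0x20, 0xBBB); // stale mapping that was never flushed

    let mut other = PageTable::new();
    other.insert(0x10, 0xAAA);
    other.insert(0x20, 0xCCC); // same VA, different physical frame

    let stale = stowaways(&kernel_only, &other);
    println!("stowaway pages: {:x?}", stale);

    // The fix as described: the stale mapping should not exist at all.
    for vpn in &stale {
        kernel_only.remove(vpn);
    }
    assert!(stowaways(&kernel_only, &other).is_empty());
}
```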

Bryan Cantrill: 01:35:46

Yeah.

John Gallagher: 01:35:46

Yeah. It lasted forty eight hours.

Bryan Cantrill: 01:35:48

Right. Yeah. Right.

Rain Paharia: 01:35:49

For me, I think, you know, it was really nice to be able to rule a lot of things out because again, like, you know, coming in, it's like, oh, all of a sudden you're seeing this corrupt data. Like it could be so many things, right? But being able to use DTrace and like, you know, instrument Tokio and just being able to narrow it down to that like little section where we've eliminated almost all of our userland stack. We are like literally receiving bytes from the buffer. I think the other bit that was, I think is really helpful just kind of looking back at the issue was, I mean, this does use some unsafe code.

Rain Paharia: 01:36:26

This has, it uses uninitialized buffers and so on. So we did try to add a little bit of like, can we like force the buffer to be initialized and like, that didn't help. But it was really nice that because this is Rust and like, I think if this were in C, I feel like it would be so much harder to actually track what's going on, at least for me.
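For readers unfamiliar with the distinction being drawn, here is a minimal sketch of filling a zero-initialized buffer versus an uninitialized one in Rust. It is not the Installinator code; both function names are hypothetical, and the uninitialized variant copies from a slice purely to keep the example self-contained.

```rust
use std::io::{self, Read};
use std::mem::MaybeUninit;

/// Safe variant: zero the buffer first, then let `read` overwrite it.
fn read_initialized<R: Read>(src: &mut R, cap: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; cap];
    let n = src.read(&mut buf)?;
    buf.truncate(n);
    Ok(buf)
}

/// Unsafe variant: skip the zeroing and expose only the bytes that were
/// actually written into the spare capacity.
fn read_uninitialized(src: &[u8], cap: usize) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(cap);
    let n = src.len().min(cap);
    let spare: &mut [MaybeUninit<u8>] = buf.spare_capacity_mut();
    for (slot, byte) in spare[..n].iter_mut().zip(&src[..n]) {
        slot.write(*byte);
    }
    // SAFETY: the first `n` bytes of the spare capacity were initialized
    // by the loop above, and `n <= cap <= buf.capacity()`.
    unsafe { buf.set_len(n) };
    buf
}

fn main() -> io::Result<()> {
    let mut src: &[u8] = b"hello";
    let a = read_initialized(&mut src, 16)?;
    let b = read_uninitialized(b"hello", 16);
    assert_eq!(a, b);
    Ok(())
}
```

The `unsafe` block is confined to one line with a stated invariant, which is the narrowing of scope mentioned in the discussion: when the output is wrong, there is a small, explicit place to audit.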

Bryan Cantrill: 01:36:47

Totally. Yeah. Yeah. Yeah.

Rain Paharia: 01:36:49

And I think, you know, Rust is kind of really narrows down the scope of unsafe code in that sense, which is really satisfying.

Bryan Cantrill: 01:36:56

Yeah. That's a very good point, Rain, that I think that just in general, the the the more robust one's foundation is, the like, there we don't expect odd behavior. So when we do have odd behavior, it's kind of like we need to stop everything and really understand this. And I I feel like, you know, we've had a bunch of these. You can, Adam, ring the chime as you see fit or not.

Bryan Cantrill: 01:37:19

Like, for the the debugging odyssey episode, which is another one where and, know, I feel we've done enough of these that we don't have really second thoughts about it. But if one did have any second thoughts about it, you're like, oh god. No. These are we've so often, you get this thing that has and this was like, this was very clearly of serious consequence. So, like, there would be no even internal debate about whether this is the right thing.

Bryan Cantrill: 01:37:47

But sometimes you get something where it's like, well, no. This doesn't feel to be of serious consequence. Like, in Dave's case, it's like, we are actually, you know, yes. You know, when I run this, I'm gonna get these random seg faults. We're not actually seeing it in production.

Bryan Cantrill: 01:37:59

Turns out it was random data corruption that could have bit us at any time. And I think it is the the but, Rain, to your point, when your when your foundation is is kind of sandy and then well, it's kinda like this thing kinda dies for a bunch of different reasons. And I don't know. If we stopped and debugged all the data corruption, we wouldn't do anything else. Yeah.

Bryan Cantrill: 01:38:24

And it just limits the scope of what you're looking at and you've got just so, I think that was also really, really important to

Rain Paharia: 01:38:31

I think there's only a certain amount of budget that, you know, folks have, like just a certain amount of patience folks have for like debugging these issues, right? And, you know, I feel like in many other environments that budget gets eaten up by And I've been in roles where like, yeah, you know, we'll just restart it, whatever. Yeah, corruption happens. But no, it does not just randomly happen. It's worth getting to the bottom of it.

Rain Paharia: 01:38:59

And like, yeah, I mean, ever since we fixed this issue, I don't think we've really seen any instances of corruption. Right? Certainly not in Installinator, but in general, like, I don't think we've seen that.

Bryan Cantrill: 01:39:10

No. Of course, I hope you're frantically knocking on wood because you're Yeah. You're just like actually summoning a tsunami of corruption. But No. The the the that's exactly right.

Bryan Cantrill: 01:39:20

And I think it is just the and the more solid your foundation, the the more you you know that this investment is something that's that's important. And so it's it's gonna pay off. So I think it's like when the and this is why, like, I mean, this is where you have this kind of, like, exploding tech debt where, well, everything is, like, broken. So now we've got a lot of these kind of mysterious behaviors and we can't stop and investigate them. It's like, okay, what if we built it the right way from the beginning?

Bryan Cantrill: 01:39:49

And we we which which is why it's really important to build it the right way from the beginning. And this yeah. It was and and then the conclusion from AMD is and obviously, is something we're talking about. This is not a security issue. So which is great.

Bryan Cantrill: 01:40:04

And as someone's saying in the chat, like, could you because these things both use PCID zero, and if we used a different PCID, yes, we could. The presumption is, I can't remember if we actually tested it or not, but the presumption is that this would not speculate across that. The fact that these are PCID zero on both of these. So but, yeah, this is a fun one. And I love the kind of, like, if it hadn't been for Meltdown, ironically, we wouldn't have had this because the stowaway mapping would have been in, like we would have hit that earlier.

Bryan Cantrill: 01:40:46

The stowaway mapping was able to stow away because it was in the kernel CR3. The kernel-only CR3. So this needed a bunch of things to kinda come together. Yeah. And the net effect was, well, terrifying.

Adam Leventhal: 01:40:58

Yeah. Well, Bryan, I'm so glad we did this one. I had a great time watching most of the video of it. I found it like the slowest paced episode of House I've ever watched, but still gripping.

Bryan Cantrill: 01:41:10

Yeah. I would so the video is funny also. Thought yeah. It was very funny to watch the video. There were a couple times because we would all just be on there in silence wondering what the fuck is going on.

Bryan Cantrill: 01:41:20

I'd be like, did I pause this thing accidentally? Oh, I go, nope. Nope. There we are. We're all just like staring at our screens.

Bryan Cantrill: 01:41:25

Staring into the middle distance. Yeah. It's pretty little distance. And then also you also feel like you begin to, you know, obviously, you know all these things because you have to deal with me, but me only dealing with me in the kind of, like, my past me, there's some things that are, like, very annoying. Like, there were a couple times I'm like, you know what?

Bryan Cantrill: 01:41:41

Silence. I'm like, wait. Wait. Who would say something like that? What do we get?

Bryan Cantrill: 01:41:45

What? Like, just say it it after, like, something to be honest, we're like, I don't know what. I mean, obviously, like, do you want us? Like, this is a bit. Like, are we supposed to it's like a call and response.

Adam Leventhal: 01:41:53

How big was it? Right.

Bryan Cantrill: 01:41:55

I don't know. It's like, could you mute yourself if you just wanna talk to yourself, please? But it was it was great to get in the, you know, I I great, terrific work all around really did require a bunch of folks to to work together on. And a reminder of like, man, these systems are they they really do fabricate a kind of reality. These things that are bedrock, you know, ultimately, they they are all artificial at some level.

Bryan Cantrill: 01:42:27

So it was fun. Hopefully, this this met your expectations, Adam.

Adam Leventhal: 01:42:32

Absolutely. Glad only two years in coming, but but it was awesome.

Bryan Cantrill: 01:42:36

Only two years in coming. Oh, that is great. And I don't I I and now we are we are left with with Morris Chang and Sarawan Williams as the

Adam Leventhal: 01:42:44

That's right.

Bryan Cantrill: 01:42:44

Next up. Next up. I think, you know, it's not the reason I was delaying on this one. It's like I still well, I wanted to, you know, buy myself a little more time before I can get Morris Chang on here. But Yeah.

Bryan Cantrill: 01:42:55

Alrighty. Well, thank you everyone. Thank you especially thank you John and Rain for both your hard work on this problem, and thanks for for joining us again. Reminiscing. It was fun, and thanks.

Bryan Cantrill: 01:43:04

And in absentia, did a lot of work on this, and obviously Keith and Adam Ewer. I thought it was funny because I did think that you underestimated how much going back and rewatch that, I hope you appreciated that like No.

Adam Leventhal: 01:43:15

I did. I was actually

Bryan Cantrill: 01:43:16

right in there slinging. Yeah.

Adam Leventhal: 01:43:17

Yeah. I'm showing up to work almost every day. Yeah. No.

Bryan Cantrill: 01:43:19

For sure. Almost every day. Exactly. Yeah. And I assure you we're not about to go debug a company ending data corruption problem at the end of this podcast.

Bryan Cantrill: 01:43:27

Just like not as far as you know. I guess you have to stay tuned in two years to know

Adam Leventhal: 01:43:30

if we are or not. That's right.

Bryan Cantrill: 01:43:33

Awesome. Thanks everyone. Talk to you next time.

Creators and Guests

Host

Adam Leventhal

Host

Bryan Cantrill
