A Debugging Odyssey

Oxide colleague, Dave Pacheco, joins Adam and Bryan to talk about an epic debugging journey. Everyone had something to learn from the struggle to find random data corruption in the Go allocator--Dave included!
Speaker 1:

I'm back. How do I sound?

Speaker 2:

Sound good.

Speaker 1:

I sound better?

Speaker 2:

Yes. Much better.

Speaker 1:

Are you just saying that, or do I actually sound better?

Speaker 2:

No. No. No. No.

Speaker 2:

You sound better. Okay.

Speaker 1:

You sound better. But no. You sound better. I mean, you look good. I mean, you look fine.

Speaker 2:

No. I mean, for you, you sound great. Yes. Absolutely. What are our expectations here?

Speaker 1:

Right. Look. We're late. Let's go. Get in the car.

Speaker 1:

Alright. Fine.

Speaker 2:

Yeah.

Speaker 1:

Well, Dave, thanks for, thanks for joining us.

Speaker 3:

Yeah. Glad to be here. Longtime listener, usually on delay. I usually catch the YouTube recordings, but it's nice to actually be here.

Speaker 1:

Great to have you here live, and for an actually live recording. I mean, this odyssey is extraordinary, and if folks haven't seen it, the gist is that Dave will take you through this path. But, Dave, maybe you could take it from the beginning of this thing. Like, where did all of this start? Because I feel like it started, like, 6 months ago.

Speaker 1:

Is that right?

Speaker 3:

Yeah. That's about right. I went back and looked, and, it started while I was testing a different change to omicron, which is our control plane. And, I was running the test suite, and I just got a test flake failure, like, a spurious failure that was actually a different problem. Well, I don't know that it was a different root cause.

Speaker 3:

But, so I guess the background is: our control plane uses CockroachDB very heavily. And so as part of our test suite, we spin up and spin down these transient instances of CockroachDB so that all these tests run in what they think is their own database with, you know, a clean slate and everything, which is great. Works great most of the time. So this happens probably, like, 160 times during a run of our test suite.
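
For readers who want a concrete picture of that pattern, here is a minimal Go sketch of a per-test throwaway CockroachDB. This is not omicron's actual harness; the flag names are the stock `cockroach` CLI ones, and the port and timeout values are arbitrary.

```go
// Sketch of a per-test transient CockroachDB: each test gets its own
// single-node, insecure instance backed by a temporary store that is
// destroyed when the test finishes.
package cockroachtest

import (
	"net"
	"os"
	"os/exec"
	"testing"
	"time"
)

func startTransientCockroach(t *testing.T) (addr string) {
	t.Helper()
	dir := t.TempDir()       // clean slate, removed automatically after the test
	addr = "127.0.0.1:26299" // illustrative; a real harness would pick a free port

	cmd := exec.Command("cockroach", "start-single-node",
		"--insecure",
		"--store="+dir,
		"--listen-addr="+addr,
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		t.Fatalf("starting cockroach: %v", err)
	}
	t.Cleanup(func() { _ = cmd.Process.Kill(); _ = cmd.Wait() })

	// Poll until the SQL port accepts connections.
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		if c, err := net.Dial("tcp", addr); err == nil {
			c.Close()
			return addr
		}
		time.Sleep(50 * time.Millisecond)
	}
	t.Fatal("cockroach did not become ready in time")
	return ""
}
```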

Speaker 3:

And, occasionally, while running the test suite and in this case, I found a case where a test failed because CockroachDB exited very shortly after it came up. It died on a Go runtime error from this illumos-specific code calling port_getn and getting EINVAL from port_getn, which is not really supposed to happen. Right? EINVAL basically means you gave bad arguments to the syscall, and, you know, we spent a little while looking at the code, and it was not clear how that could happen.

Speaker 1:

And then, Dave, when you first saw this, are you you know, you had this, like, initial urge or maybe you don't. But certainly, I had this initial urge to just, like, can I unsee this? Like, maybe this just didn't happen. Yes.

Speaker 3:

There there is definitely that. It was but there's also this urge that's like, if I run this again, I'm gonna see it again. Right? I I have to see this again. This can't just be the the only time this ever happened or going to happen.

Speaker 3:

Right? You know what I mean? And you've yeah. You've had that experience as well.

Speaker 1:

Well, and so how reproducible was this right out of the chute, when you were seeing these?

Speaker 3:

Well, so that's the thing. So then I started running the test suite in a loop, and I started running into all these other problems. So I I was basically like, surely if I run this again, I'll see it again. Right? There's nothing special about this one time that I ran it.

Speaker 3:

But I should also be clear that my change that I was actually testing had nothing apparently to do with anything in this neighborhood.

Speaker 1:

You it always starts this way. Does it like, I've got this innocuous change. I just wanna make sure that nothing breaks. Oh my god. Strange, crazy error message from the animal brain of the runtime.

Speaker 1:

Right. And so it begins. Like, okay.

Speaker 3:

So then I run it again in a loop. And it's like different strange error from the internals of the runtime. And I feel like, you know, there there were a couple of phases to debugging this problem. And in this first phase, I was kicking off a lot of these, loops of the test suite. And every couple of days or whatever, I would wind up getting a different panic from inside the go run time.

Speaker 1:

Oh, man.

Speaker 3:

It which was which was tough.

Speaker 1:

And and you're sitting

Speaker 3:

because you're sitting there

Speaker 1:

Several dimensions of tough.

Speaker 3:

Yes. Yep.

Speaker 1:

So what are some of those dimensions of tough?

Speaker 3:

Well, so one of the problems that I had, which I didn't really appreciate until I'd gotten pretty far along, was that my sort of bookkeeping of all these different failure modes was not very rigorous. You know? So I'd see another bug and I'd file another bug sometimes. And if I saw another instance of one, it'd be like, oh, I guess that was another instance of this one I've already seen or whatever. And I wound up filing, like, 5 or 6 bugs.

Speaker 3:

But as I'd been working on this kind of in the background for a couple of months, I was kinda like, man, I kind of wish I had been a little bit more diligent about organizing these. Because I wanted to ask questions like, you know, we'd be blowing some assertion based on the number of things allocated. Well, not to jump too far ahead, but I wanna know, like, is it the same size thing that we're dying on every time? And I hadn't really collected that data very rigorously, because I didn't know this was going to be so hard to reproduce, or that there were going to be 4 or 5 different failure modes that were very close, but not exactly the same.

Speaker 2:

So, Dave, it should be said that your lack of rigor, your claimed lack of rigor, I think was like, other folks on the team were bumping into this. Right? Like, every once in a while, there would be some CI run, and it'd be like, well, that has nothing to do with this. And I think to some degree, and I don't think this initially speaks well of us, or maybe I'll just throw myself under the bus.

Speaker 2:

It speaks poorly of me, and me alone. You just say, okay. Look. This is a flaky test that sometimes flakes, and that's bad. But, you know, onward.

Speaker 2:

Yeah.

Speaker 2:

And you had taken the time to catalog these.

Speaker 4:

It's a pretty serious kind of flake, though.

Speaker 1:

It is. But this is shades of ECONNRESET on running ak test. Do you remember this?

Speaker 2:

Absolutely.

Speaker 1:

And so at Fishworks, back in the day at Sun, where Adam, Dave, and I all worked together, we had a test suite, and there were these errors that would crop up every once in a while that were not that reproducible. And Dave's like, what's going on with all these ECONNRESETs?

Speaker 1:

And all of us were just like,

Speaker 2:

I don't know.

Speaker 1:

It's like, was it serious? Is it like, don't we have a test suite so we understand the failures? And, Dave, you spent a long I mean, this surely

Speaker 3:

spent a long time on that one.

Speaker 1:

This issue must have reminded you of that issue at some level.

Speaker 4:

Sometimes it's network weather.

Speaker 3:

So it is very similar, in that what was happening in that test suite was we had this management server, and we weren't restarting it, as I recall. It would run for the duration of the test suite, but we had hundreds or maybe thousands of tests that would spin up and make RPC calls against it. And very occasionally, like, 3 or 4 of the tests would fail with this ECONNREFUSED, I think it was.

Speaker 1:

I think it was ECONN

Speaker 2:

No. No.

Speaker 1:

I think you're right.

Speaker 4:

I think

Speaker 1:

it was ECONNREFUSED.

Speaker 3:

Yeah. So it wasn't most of the time, but it was definitely enough that it was annoying. Like, if you were running this test suite as part of putting back a change, there's a good chance you were gonna run into this, and you would, like, rerun those tests and be like, okay, I guess it's fine. And it wasn't the same tests every time, so it was very hard.

Speaker 3:

It was very similar to this one, in that it's very hard to develop the instrumentation you want, because you have no idea when this is gonna happen. So, yeah. It is very similar.

Speaker 1:

Well, and it Oh, but what

Speaker 3:

One funny thing about this: I actually don't know that we were hitting this one in CI at all. I couldn't find any record of that, even for me. Like, I don't think I had seen it in CI and filed a bug. I saw this on my test machine. And a big theme of the debugging process for this one that is really not in the write-up, because I didn't get a chance to talk about all the blind alleys in the write-up, was which machines saw this or didn't. That was kind of a big question for a long time.

Speaker 3:

Like, are we only seeing this on AMD machines? Is there a reason for that? Is it Helios specific? We're not really sure about that. Have we ever seen it in CI and all that stuff?

Speaker 2:

And Helios is our illumos distribution. You know, illumos, born of OpenSolaris, born of Solaris, just for folks who might not be familiar.

Speaker 5:

Great. So

Speaker 1:

Right. So you're seeing this, and you've rerun the test suite. Now you're seeing a bunch of new failures. And I mean, you're just like, oh, brother. I'm just trying to get this other change back.

Speaker 3:

You know what I mean? Well, I think I satisfied myself. You know? I did the thing where I reran the whole test suite, and there was no reason to suspect it was related. So I got that change back, and I kind of put this on the back burner.

Speaker 3:

I mean, what I thought was I'll try to reproduce this by running it in the background. And and going back to, like, what's tough about running into all these different failure modes is, like, it's easy to run into a failure mode and be like, okay. Maybe I'll add some more instrumentation for this case. And then you run the loop again, and you hit a different failure mode. And you're like, I have no more information about the thing that I wanted, but now I have a new problem.

Speaker 2:

Right.

Speaker 1:

That's

Speaker 3:

And then repeat that and then you're just like it feels like you're spinning a little bit because you are.

Speaker 1:

Yeah. Okay. So then how do you proceed from there? How does it kind of unfold from there?

Speaker 3:

Let's see. I can't I'm trying to remember what the prompt was, but I think I don't remember if I started hitting this again, but at some point, I think it was late October, I felt like part of my problem was the fact that I was doing this in the background. And so I wasn't really holding a lot of state about this problem. You know? It could potentially be months before I would pick it up again.

Speaker 3:

So I filed the first bug May 27th, going back and looking. This is bug 1130 in the omicron repo. Yeah. Because apparently I have a pathological need to debug CockroachDB memory corruption problems even when I've got the other one figured out. But at some point, I was just like, I need to focus on this and actually be more diligent and rigorous about the way or at least about the way I spend time on it.

Speaker 3:

So I started taking more careful notes and all this stuff. But at this point, I was still doing what I've been calling this sort of heuristic approach to debugging, which I think a lot of people are familiar with, where you're like, well, what's changed? Or is there something common to the systems that are hitting it? Is there something else that feels like a promising lead? I keep thinking of them as leads, you know.

Speaker 3:

It's like, you know, there's the detective analogy for debugging in general, but there's just something especially fuzzy about the idea that, oh, I heard that this might be a problem in this area, so I'm gonna go spend some time there. But there were so, so, so many of these possibilities that obviously all turned out to be blind alleys. And in retrospect, some of them were totally nonsensical. So, for example, one of the first things I did so I suspected memory corruption early on, because the actual failure mode that I wrote about was blowing an assertion inside the Go runtime about the number of things that were allocated from this data structure. And it was definitely, like, you know, I expected this to be 50 and it was 25 or something.

Speaker 3:

It was like a very bizarre invariant violation in the memory subsystem. So I was like, well, if it's data corruption in memory, maybe I'll try libumem. I'll see if I can get libumem going. It's got all these tools for identifying these problems early. And now I know that libumem wasn't even on the scene. I mean, not even sorry.

Speaker 3:

The system memory allocator was not even on the scene. malloc wasn't there. So all the time I spent on that, which is not a huge amount of time, was just a waste, because I didn't actually look at what was going on. And then, similarly, you know, I can't remember if it's Go I guess it's CockroachDB that is using jemalloc.

Speaker 3:

And so I got that test suite going to see if there was some problem with that on illumos that was causing corruption. And that was, you know, fine, but these are just shots in the dark. Right? There was not a lot of strong evidence to indict any of those things. And I'm not sure if those were mistakes or not, but there were a lot of these, and the thing that I felt really helped a lot was when Robert gave me the kick I needed, you know.

Speaker 3:

One of these blind alleys was whether this was AMD specific, and I started looking at AMD processor errata. And Robert helped me look through that, and he's like, I'm happy to help you do this. This is great, but this doesn't seem like a likely explanation, based on the fact that you're seeing this on a couple of different generations, and it just doesn't really feel like that. So, you know, maybe look at the error message. He said it a lot more nicely than that, but it was kinda like, why don't we look at this failure mode a little bit more deeply?

Speaker 3:

And that's when I started actually digging into the code that was causing this in the Go memory allocator. Which, you know, have you guys experienced this when you're debugging something, where you're kind of avoiding the big expensive step because you're hoping that there's gonna be a cheaper thing that's gonna turn out to show the answer?

Speaker 2:

I mean, absolutely. Yeah. It's like you're looking where the streetlight is instead of where we dropped our keys. Like, I feel like

Speaker 4:

it's an easy way to feel like you're optimizing.

Speaker 2:

Really?

Speaker 3:

Yeah. You're like, I'm gonna check Yeah. It is the streetlight thing. It's like there's 15 streetlights over here, so I'm just gonna exhaustively check all of those, even though the keys were dropped somewhere

Speaker 2:

else. Well, and in particular on this one, Dave, I think that the sort of hairy thing that you're avoiding was, like, imbibing this incredibly complex system of Go memory allocation.

Speaker 3:

That's exactly right.

Speaker 2:

Maybe not just because it was complicated, but also feeling like you know, the kind of risk-reward. If in order to debug this, I need to have a complete proficiency and fluency in this subsystem, then, actually, I don't have the 6 months or 9 months or 12 months or whatever it's gonna take to develop that level of fluency. So I better look

Speaker 3:

elsewhere. Yeah. And it is one of the hard things about a problem like this: figuring out which of these things is worth spending time on. Right? I think you described it at some point as a balance of yeah.

Speaker 3:

As probability of success for each of these. But, ultimately, I think that was really I think that was probably the best way to get where I got. So I mean, I guess you could argue differently, but Yeah.

Speaker 1:

I guess

Speaker 3:

that's helpful for me.

Speaker 1:

So Robert is here. Robert, when you were having that discussion with Dave I mean, Dave is saying that you phrased it very nicely, but as he's asking you for errata details, how are you vectoring him in a different direction?

Speaker 5:

Well, I think the biggest challenge is that, as Dave was saying, we're kind of looking across Zen 1, 2, and 3. So three different microarchitectures. And it's not that errata that carry from generation to generation don't exist, but more often than not, if it's that bad, it's that bad. On the other hand, the biggest problem with errata is that they're often even worse to pin down, because, as the vendors like to say, a complex series of microarchitectural steps occurred. Like like what?

Speaker 5:

And they're like,

Speaker 2:

well, I

Speaker 5:

don't know. Like, you tell us. And that's kinda all you get. So the biggest challenge with that is just and I think the reason I probably looked towards the actual error message was trying to understand how we can work backwards.

Speaker 1:

Why are we dying? You know, we are doing something that ultimately the microprocessor is forbidding us from doing. And in this case well, there are several different manifestations of it, Dave but ultimately we're doing an errant memory operation, effectively, and starting to work backwards. So, Dave, it sounds like that's the path you started down?

Speaker 3:

Yeah. Yes. Absolutely. And even then, it took a while, and there were a lot more blind alleys in that direction. I learned a lot about Go.

Speaker 3:

I've still only written, like, 30 lines of Go maybe in my life, and most of those for this problem. But I know quite a lot about the Go runtime now, the way their memory allocator works and stuff.

Speaker 4:

It's a familiar feeling.

Speaker 3:

Yeah. I feel like I've become an associate member in this club that, like, a bunch of you were already in. And with more senior memberships.

Speaker 1:

Yeah. I feel that I have definitely debugged the Go runtime more than I've written Go, for sure. And it is a complicated runtime. Dave, at any point I mean, you must also be feeling like, wait a minute, I'm working for a company that basically has Rust in its name.

Speaker 1:

I mean, we're doing effectively I mean, we've got what I guess ClickHouse is in C++, the operating system kernel is in C, and CockroachDB is in Go.

Speaker 4:

Not just Go, though. It's also, like, 300 megabytes of, like, C and C++ libraries.

Speaker 1:

Oh, interesting. It is it is Interesting.

Speaker 2:

Okay. So that yeah.

Speaker 4:

Well, this is, like, a huge part of, I think, where our initial wheel-spinning fear came from as well: when we ported CockroachDB, it's not like it builds for illumos out of the box. So, like, we add

Speaker 1:

a bunch of

Speaker 4:

patches to do that. And then, unlike most Go software, it has some substantial native code stuff jammed in the side, which is not memory safe in the same way. So, like, there's a lot going on in the process. It's a huge binary. It's like hundreds of megs.

Speaker 1:

So, Dave, there are a lot of usual suspects here, it sounds like. A big, complicated system, but one that's also extremely important for us. I mean, Dave, I know you're beginning to, like, question, does this make sense to continue to debug? But it really Yeah.

Speaker 1:

I really felt like it did, because this is such a basic error that we seem to be seeing.

Speaker 3:

Yeah. I feel like I had individual conversations with you, Bryan, and you, Adam, about this. And it's kind of this question of, like, boy, this is pretty rare. Like, is this worth diving into? And we were like, wow.

Speaker 3:

But it is memory corruption, and it is the database.

Speaker 2:

Right. Right. It is memory corruption that is going to be persisted forever, potentially corrupting everything. So you sort of put the stakes in those terms. You're like, maybe it's worth a little more investigation.

Speaker 1:

Yeah. And I kinda feel like this is how we know that, apparently, our future selves have not invented time travel, because I feel we would have traveled back in time and slapped ourselves: do not question this for a moment. Because, Adam and certainly, Dave, you and I wish we had done more with Postgres in particular. I just feel I wish we had dug deeper earlier in our odyssey with Postgres, and then maybe we would have been in less pain when it was murdering us in production.

Speaker 1:

So I mean, it felt like the right idea to dig into this, but I'm sure it also felt like, boy, I hope this turns out to be something that's relevant.

Speaker 3:

Yeah. It's definitely something I struggled with. You know? Not every single day, but seeing all of my colleagues working on things that were very obviously urgent and important for the company, and I'm off in the corner on this thing, making pretty much no visible progress for kind of a while, on something that might turn out to be nothing but could be a really big deal. And, you know, given the nature of the bug in the end, I think it's pretty important that we did.

Speaker 3:

Even if it hadn't been that serious, I think it would have been good that we confirmed it wasn't that serious. But it kind of is. Right? I mean, this absolutely could've been causing database corruption.

Speaker 1:

Oh, this is I mean, yeah, we should get through the story, because this ended up being as serious as it could have possibly been, I think, more or less.

Speaker 3:

Yeah. It's pretty accurate. I hadn't actually thought of it that way. Like, what would be worse?

Speaker 3:

Pretty bad.

Speaker 1:

It's pretty bad. It's pretty bad because I think it is as bad as it could possibly be, in part because I felt that this was likely gonna be confined to the Go runtime, and I don't know what your kind of gut was on that. And, I mean, there's a level at which I guess it is, because the Go runtime definitely likes to push certain system facilities harder than others. But the root cause of this ended up being generic across, I mean, arbitrary runtimes for sure.

Speaker 3:

Yeah. And the impact is basically, like, anywhere in the program, certainly in Go anyway, as far as I can tell.

Speaker 1:

So you wanna start working back from the actual cause of failure, but you are also in a system that is pretty hard to understand and pretty opaque. So how did you proceed on that?

Speaker 3:

I mean, at some point around that time, I think I had that conversation with Robert, and I kinda bit the bullet. I was like, alright, I'm gonna learn everything I can about this assertion failure and the data structures associated with it. So, basically, the Go memory allocator. I found some blog posts about it that were helpful.

Speaker 3:

And then I just spent a while reading a lot of code so that I had a better working understanding not necessarily, like, fluency, even, to be able to modify it with any confidence, but to be able to reason about what that assertion meant, and why it was a problem that that invariant was violated, and how that could possibly happen.

Speaker 1:

Well, that's a really interesting kind of inflection point that you're describing, where I feel like you're going from "I want this problem to go away" to "I'm going to understand everything about the Go memory allocator. I'm gonna understand everything about Go's GC." I mean, it just feels like you're taking a lot of agency over the problem, even though you actually don't have any guarantee that this understanding is gonna give you any insight into the problem necessarily. But "I'm going to understand this system much, much better," that is what you can guarantee as you venture in here, and being much more deliberate about it, which I think is actually important. Because I feel

Speaker 1:

I definitely go to that same inflection point, where it's like, I just want this problem to go away, and then the realization of, like, this is not going away. I actually need to attack it.

Speaker 4:

It's that step where you put down the other thing you're carrying. Yes. Use both hands. That's right. That's right.

Speaker 3:

Yeah. Or maybe going from denial to acceptance or something like that. That's how it feels a little bit to me.

Speaker 2:

Well, it's not just that, but it's the blindness of the investment. Right? Like, as Bryan was saying, and as is often the case with debugging, you're almost never charging down the right path.

Speaker 2:

You're charging down a path.

Speaker 3:

Yeah.

Speaker 2:

And the best you can do is kinda foreclose that path and have that kind of mindset. And, you know, we like to debug in environments where you can go exhaust some particular hypothesis and then pop back out and start on a new one, but this was one where real deep research was required.

Speaker 1:

I like the phraseology too, in terms of thinking of it as an investment, because the sure dividend from this investment is understanding of the Go runtime. That is what we're absolutely guaranteed gonna get out of it. And I think you also have to come to grips with the fact, you have to accept, that that dividend is actually valuable. We are relying on Cockroach as our system of record for the control plane.

Speaker 1:

Understanding this thing better is a good use of time in the abstract.

Speaker 2:

Dave, at this point, were you still running the full test suite? Because at some point, you switched to looking at just running the cockroach version command, which was also crashing, like, one time out of every 300, which was kind of astounding. Right? It was at that point that you showed me that, and I thought, how is any of this working, ever? Like, how is anyone running anything in Go, or any of these Cockroach programs, if it can't get out of bed to tell you what version it is?

Speaker 3:

Yeah. That's totally right. And, you know, obviously there's the common technique of trying to make simpler reproductions. And it was challenging with this, because I had a bunch of different workloads that could reproduce it with sort of different properties. So cockroach version would reproduce it, but it would take upwards of a day and, like, tens of thousands of iterations.

Speaker 3:

And I eventually found that there was a subset of the omicron test suite that I could run that would reproduce it in about 3 minutes. So it'd be, like, 6 iterations of that fraction of the test suite. So that was pretty reliable and much faster, but it was a lot harder to instrument, because there were a zillion things going on in a bunch of different cockroach processes in parallel and stuff like that. And then at some point, I think James was trying to dig into this and was wondering, does Go actually pass its own test suite on illumos, and ran that in a loop and found that it produced a lot of the same failure modes. So that was another option on the table for, in principle, isolating more stuff, but there were some other complexities about trying to debug the problem in Go's own suite.

Speaker 3:

So figuring out what the right workload to use to debug this was also kind of tricky.

Speaker 1:

And then, in terms of the tactics, as you're understanding the way the runtime works and the GC works and the memory allocator works, one of the things that I love that you did along the way is effectively adding your own type definitions so you could print them out post mortem. You wanna describe that technique a little bit?

Speaker 3:

Yeah. Totally. And this is another example of how the problem is fractal, in terms of wanting this thing to go away versus investing in it. Because I found myself thinking, it would be really nice if I could look at these data structures, but MDB, our debugger, doesn't know anything about them. I don't know what's involved in that, and then I kinda put it off.

Speaker 3:

And that question came up enough times that I started to wonder, well, wait a minute, how hard would it be to do this? And at some point, I remembered we had this ::typedef command within MDB that allows you to give it, like, C code, essentially, that describes data structures, and just teach it about a type that it doesn't otherwise know about. And so I just wrote some C structs that looked like the Go structs. I mean, the one nice thing about Go is that there are some ways in which it's pretty simple, and this is one of those ways, and it's pretty easy to figure out how that thing is gonna look in memory.

Speaker 3:

And so I was able to write a C definition for that. And then I was able to poke at those structures, and I was like, oh, jeez, why didn't I do this a while ago? It's incredibly useful.
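
As a hedged illustration of the "figure out how the Go struct looks in memory" step, a small Go program like the one below can dump the sizes and offsets you would then mirror in the C definition fed to ::typedef. The struct here is a made-up stand-in, not the actual runtime.mspan, whose layout depends on the Go version and has to be read from the runtime source.

```go
// Dump the in-memory layout of a Go struct so it can be mirrored as a C
// struct for mdb's ::typedef. spanLike is a stand-in, not runtime.mspan.
package main

import (
	"fmt"
	"unsafe"
)

type spanLike struct {
	next       uintptr // pointer-sized fields stay pointer-sized in the C mirror
	prev       uintptr
	startAddr  uintptr
	npages     uintptr
	allocCount uint16
	spanclass  uint8
	state      uint8
	elemsize   uintptr
	allocBits  uintptr
}

func main() {
	var s spanLike
	fmt.Printf("sizeof(spanLike)    = %d\n", unsafe.Sizeof(s))
	fmt.Printf("offsetof allocCount = %d\n", unsafe.Offsetof(s.allocCount))
	fmt.Printf("offsetof elemsize   = %d\n", unsafe.Offsetof(s.elemsize))
	fmt.Printf("offsetof allocBits  = %d\n", unsafe.Offsetof(s.allocBits))
	// The C struct handed to ::typedef needs to reproduce these offsets,
	// including any padding the Go compiler inserted.
}
```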

Speaker 1:

Well, and I think it gets to another one of those sure dividends when you are really investing in debugging, the tooling dividend. Not just the gaining of understanding of the system, but, hey, I'm building some tooling such that when we do have another problem here, we can do it faster and better.

Speaker 3:

Yeah. That's a good point. And a lot of that stuff, I think, is in the DWARF as well. And I know we have had branches, I think, that haven't landed that would allow MDB to directly pull stuff out of the DWARF. So that might be another useful area of investigation that I think Robert knows a lot more about than I do.

Speaker 5:

Yeah. And actually, to add to that, Dave, one thing that is satisfying from my side is that ::typedef is a thing I added to debug stuff, I don't know, I feel like maybe almost a decade ago. But it's exciting to see that stuff come back and continue to pay dividends, just to extend the point about tools. Like, I did that to debug something.

Speaker 5:

I don't remember what. It was something in 2012. Maybe KVM, something in KVM was broken.

Speaker 4:

Yeah. Because you didn't have there was no CTF in the binaries

Speaker 2:

at the time. Right?

Speaker 3:

You had

Speaker 4:

to side load it in so that you could recover some of the debugging information you needed, I think.

Speaker 1:

I don't know. CTF is

Speaker 3:

I there was something

Speaker 5:

there was something before that. Yeah. There

Speaker 4:

was it was QEMU. Right? Like, it didn't have the

Speaker 5:

Well, that was when I traded. That's when I made the mistake of well, not the mistake of creating ::typedef -r in exchange for some Python code. But,

Speaker 1:

God. That was delicious. What do we call that? Is that Robert's volley? Is that kugos gamble?

Speaker 2:

What is the story?

Speaker 1:

This was yeah.

Speaker 5:

I needed some Python code written to help deal with fixing some build-related issues, and Josh wanted the ability to read some

Speaker 3:

I think this is shortly

Speaker 5:

after doing ::typedef, which basically lets you phrase a C structure you basically type it out like you would in a C declaration, which is great until it gets painful. But then there's CTF, which has kind of auxiliary debugging information. And I said, hey, what if we just, like, read that in from a file?

Speaker 4:

An ELF file, critically?

Speaker 5:

ELF, or just on its own. So just the raw CTF data.

Speaker 2:

Oh, okay. But

Speaker 5:

and at the time, there was definitely some Python thing I needed to fix some build issues with how we were building stuff at Joyent.

Speaker 4:

The exception list thing. The exception list needed not to depend on Mercurial, so we could throw that out.

Speaker 1:

And my recollection of this was that you really did not wanna write the Python and have to ramp up on that. And you two basically traded problems, and Josh was done with the Python in, like, 10 minutes. That's my recollection. Yeah.

Speaker 4:

Because it was not. It was just, like, a regex that deleted some stuff.

Speaker 3:

I thought it was

Speaker 5:

the other way around: that I was done in basically, like, 15, 20 minutes, and then Josh was angry at me

Speaker 2:

for a while.

Speaker 1:

It was, you know

Speaker 4:

It's possible that I was done in 10 minutes and angry for

Speaker 1:

a while. Okay.

Speaker 4:

But, like, that doesn't I mean, 10 minutes of Python is still 10 minutes of Python. Yeah.

Speaker 6:

Right. So it

Speaker 1:

could feel like years. Alright. So this very useful thing had been added years ago when Robert was debugging a problem. I should also add, Robert, the tab completion in the debugger is due to you, and I use tab completion all the time. Dave, I don't know if you're a big tab completion user in the debugger.

Speaker 3:

Oh, yeah. All the time.

Speaker 1:

It's very nice. When you wanna print out particular structures that it knows about, it's very nice to be able to tab complete them.

Speaker 5:

And credit credit for that is also due to

Speaker 1:

Matt, Matt Abner. Right? You guys did that at a hackathon back in the day. So you're using this, and it gives you kind of a higher-level look, Dave, at what is actually happening. And are you beginning to kinda home in on this thing at this point?

Speaker 3:

It was still a little ways off. I think this was a stepping stone to getting a DTrace script going that would then print out those same parts of the data structure, so that I could trace it, so that I could have data points from earlier in the program's execution to know when this thing was going badly. We haven't really talked about the failure mode, even, or the bug, but the failure mode of this thing was basically that you have this data structure, part of the Go runtime, that describes a block of memory that it's allocating objects from. Its bookkeeping in the data structure says that it's full, and then it does an assertion that the number of allocations from it equals the number of things that are in it, and it's not.

Speaker 3:

So you get this block of memory that contains, like, 54 items, and it's full, but it's only allocated 27 things from it. And so the sort of obvious question is, well, how many things were actually allocated from it? But in order to answer that, you need the history of the allocations that came out of the span. And so I wanted a DTrace script that would trace all the allocations and also trace the GC sweep operations and show me what the allocation count was at all those points, so I could figure out, is this getting corrupted, or was it always wrong, or, you know, what the heck's going on, basically?
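
To make that invariant concrete, here is a toy Go model of the bookkeeping being described, assuming nothing about the real runtime.mspan beyond what Dave says here: a slot count, a running allocation count, and a bitmap of allocated slots. The blown assertion was, in effect, the popcount check below failing.

```go
// A toy model of span bookkeeping: allocCount is supposed to equal the
// number of set bits in allocBits. This is not the real Go runtime code.
package main

import (
	"fmt"
	"math/bits"
)

type toySpan struct {
	nelems     uint16   // slots in this span (e.g. 54, as in the core file)
	allocCount uint16   // the runtime's running count of live slots
	allocBits  []uint64 // bitmap: bit i set => slot i is allocated
}

// countAlloc recomputes the live-slot count from the bitmap.
func (s *toySpan) countAlloc() uint16 {
	var n int
	for _, w := range s.allocBits {
		n += bits.OnesCount64(w)
	}
	return uint16(n)
}

func main() {
	s := &toySpan{nelems: 54, allocCount: 54, allocBits: make([]uint64, 1)}
	// Corruption, or bits that were never zeroed to begin with, makes the
	// bitmap disagree with allocCount:
	s.allocBits[0] = 0x0000000007FFFFFF // only 27 bits set

	if got := s.countAlloc(); got != s.allocCount {
		fmt.Printf("invariant blown: allocCount=%d but bitmap says %d\n",
			s.allocCount, got)
	}
}
```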

Speaker 3:

Well, and

Speaker 1:

I think this is a Did

Speaker 3:

that make sense?

Speaker 1:

Yeah. And it's a good example of something that we've used a lot that I don't know if we've talked about as much, which is: you've used postmortem debugging, and you use the symptoms from the failure to formulate a question around runtime instrumentation. And now you're changing to a pretty different tactic, where we're going to instrument this thing as it runs, record a colossal amount of information, and then use that information as effectively a genie that can see into the arbitrary past when we fail again.

Speaker 1:

So

Speaker 3:

That's right. And the and the the analysis that I wrote up in that bug is based on a combination of those, and both of those pieces were absolutely essential. Like, not just the postmortem that got us to the dynamic tracing, but also the actual data in the core file from the actual failure that we also had dynamic tracing from. It was all needed to figure out what was going on. And even then, it still took me a long time to actually, like, understand what I was looking at.

Speaker 1:

Yeah. And I think it's a good concrete example of how, when you're stuck debugging a problem to go back to that contact lens fallacy, or the streetlight fallacy, where you're looking where you can, or you're kind of using the information that you've got you kinda wanna step back and ask, what is the information that I wish I had? And brainstorm on that a little bit.

Speaker 1:

And then, once you have your wish list of information that you don't have, you can go solve the concrete problem of how do I go get that information that I wish I had, which you definitely did here.

Speaker 3:

Yeah. I think that's a really, really important point, sort of hammered home with this and with the other stuff that we've been dealing with at work last week. It's asking yourself, what information do I wish I had, and then is there a way to get it?

Speaker 3:

And it's not always easy. I mean, even once I knew exactly what I wanted, which was somewhat hard to begin with, and knew how I could get it say, tracing these things with DTrace, and teaching DTrace about these data structures, and all this stuff then you have other problems, like, well, is the amount of tracing going to chase the problem away to begin with? Or am I gonna start running into drops, because I need a larger buffer size or a faster switch rate, because I'm just tracing too much stuff, basically? So it sort of continues to be hard.

Speaker 3:

You know what I mean?

Speaker 1:

What I do love about that, though, is I just love it when a computer is doing work in my absence. You know what I mean?

Speaker 3:

You've said that before. I think about that a lot

Speaker 4:

too.

Speaker 3:

It is pretty satisfying. It's satisfying to start the test suite overnight and come back in the morning. Here's the data you asked for.

Speaker 1:

It's like, well, I'm well rested. I've been sleeping, computer. Have you been? Yeah. It's very nice. And then you also, again, kind of change that disposition from I want this to go away to I'm attacking this.

Speaker 1:

When you've seen it in that overnight run, it's really exciting. It's like, great, we saw it. I didn't chase it away, and maybe I've got the information now that I'm looking for.

Speaker 3:

Yeah. Exactly. And then, of course, when you find a new failure mode, that's the worst. Yes.

Speaker 1:

Yes. It's like, no. I'm not debugging you right now. What is this?

Speaker 3:

Yeah. And honestly, that was another challenge with this. There were both false positives, like cases where my stupid batch loop for running the test suite in a loop would fail on its own, and so, you know, it stopped at 2 AM without having found a failure. And then also just new failure modes.

Speaker 1:

Those are the worst. When you feel that you've left something in the loop and realize that it, like, literally didn't make it to the first iteration of the loop, and the second you walked away from the keyboard, it stopped. Like, oh, god. Damn it.

Speaker 3:

So along these lines, I found I really wished I had a tool that would basically run a command in a loop under nohup, send standard out and standard error to well-known places, and, like, record the environment, the working directory, and get it all right. Because a lot of these things were super error prone, and I just kept getting them wrong. You know, I'd be running the wrong cockroach binary, a non-instrumented one or something like that. And, does anyone know if this exists? I know parallel can sort of do this, but it's not.

Speaker 3:

It's a little awkward, and I'm not sure it can quite do it. Or if other people run into this problem, I guess that's the other question.

Speaker 1:

So can you describe the problem again?

Speaker 3:

So I have a case like this where I want to run a program until it fails, say. I have standard out, I have standard error from it, and I want to keep maybe all the failures in this sort of organized way. So I want something that will basically just keep running it reliably and, like, also run under nohup. Like, what I ended up cobbling together is this, you know: nohup, run this thing, redirect this to this file and that to that file and the DTrace output to that file. And, still, you get all these false positives, and then you just have all this junk all over the place.

Speaker 3:

And it just felt like a mess. I want, like, run this until it fails no matter what, keep the output, put it here, and let me know when it's failed, or something like that. Is this making any sense?

Speaker 4:

I think you probably just need

Speaker 6:

to write a program.

Speaker 3:

Yeah. Well yeah.

Speaker 2:

Totally makes sense. And I also understand why, why it doesn't exist because, like, you need it and then you don't. And then

Speaker 4:

It's a pretty specific pretty specific set of requirements too, I feel like.

Speaker 1:

One thing I no.

Speaker 2:

You know,

Speaker 3:

I feel like I've run into this a lot: any kind of reproducible problem that's not reproducible every time.

Speaker 2:

I mean Yeah. And it seems easy enough to write in the moment, and then you you write it wrong 6 times.

Speaker 3:

That's exactly right. I wrote it wrong a hundred times.
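
For what it's worth, a minimal version of the tool Dave is wishing for might look like the Go sketch below. The name, the output layout, and the behavior are made up for illustration; this is not an existing utility.

```go
// rerun: run a command over and over until it fails, keeping each
// iteration's stdout/stderr plus a record of the environment and working
// directory. Illustrative sketch only.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"strings"
	"syscall"
	"time"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: rerun <outdir> <cmd> [args...]")
		os.Exit(2)
	}
	outdir, argv := os.Args[1], os.Args[2:]
	if err := os.MkdirAll(outdir, 0755); err != nil {
		panic(err)
	}
	signal.Ignore(syscall.SIGHUP) // survive the terminal going away, nohup-style

	// Record the things that are easy to get wrong between runs.
	cwd, _ := os.Getwd()
	meta := fmt.Sprintf("cmd: %s\ncwd: %s\nstarted: %s\n\nenv:\n%s\n",
		strings.Join(argv, " "), cwd, time.Now().Format(time.RFC3339),
		strings.Join(os.Environ(), "\n"))
	_ = os.WriteFile(filepath.Join(outdir, "meta.txt"), []byte(meta), 0644)

	for i := 0; ; i++ {
		iterDir := filepath.Join(outdir, fmt.Sprintf("iter-%06d", i))
		_ = os.MkdirAll(iterDir, 0755)
		stdout, _ := os.Create(filepath.Join(iterDir, "stdout"))
		stderr, _ := os.Create(filepath.Join(iterDir, "stderr"))

		cmd := exec.Command(argv[0], argv[1:]...)
		cmd.Stdout, cmd.Stderr = stdout, stderr
		err := cmd.Run()
		stdout.Close()
		stderr.Close()

		if err != nil {
			fmt.Printf("iteration %d failed: %v (output kept in %s)\n", i, err, iterDir)
			return
		}
		_ = os.RemoveAll(iterDir) // successful iterations aren't interesting
	}
}
```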

Speaker 1:

Dave, this has not solved your problem, but I had a similar kind of issue around a test where I actually needed to be attached to a console, but effectively nohup'd. I wanna be attached with a TTY, but nohup. And this command, dtach, d-t-a-c-h, is one of these, like, commands that was just done a long time ago. It's been written. It's done.

Speaker 1:

It was really, really valuable for debugging, and I had another issue that necessitated a reboot loop. Which is not wholly dissimilar to what you want, but it's not exact; this is more coming off of these screen-like things than running things in a loop with parallel or whatever. So, did you have that feeling when you're like, I'm definitely getting closer, I'm iterating in on this thing?

Speaker 1:

Had that started at this point?

Speaker 3:

No. Well, it had, but I was disappointed so many times along the way. I think the most promising lead I thought I had, that turned out to be just completely wrong, was, like, two weeks ago: Andy Fiddaman, also at Oxide, filed an illumos bug where, in setcontext, we were not setting FSBASE. I can't remember if it was only on AMD or not, but we weren't setting FSBASE in setcontext, which meant that if you called getcontext so you're saving the current state of the thread to resume it later and then you set context from a different thread, you would get the first thread's thread-local data

Speaker 1:

Jesus.

Speaker 3:

In the second thread. I know. I know. Right?

Speaker 1:

I just can't

Speaker 4:

Which it must be said is not a common thing that people actually do. Like No.

Speaker 2:

It's a really weird thing to do. It's a really weird thing to, like, take a signal or to getcontext

Speaker 4:

And put it over here

Speaker 1:

and use it and

Speaker 4:

use it later.

Speaker 2:

Yeah. And to, like, then sort of teleport yourself into a different thread. We did look at that a bunch, Dave, and I think, you know, had some hypotheses about how that would manifest, and and it didn't kick over anything that was a smoking gun, obviously.

Speaker 1:

I know that Go is the victim here, and it's unfair to blame Go. But Jesus Christ, they push signals hard. They they they love signals.

Speaker 2:

Dave, you should talk about the I can't ever remember, you've told me this a dozen times, what signal they use to poke themselves.

Speaker 3:

SIGURG, SIGURG, which I believe is an old signal that means there's urgent data waiting on a socket, I think. And there's a whole comment in the source base, which I'm sure is sound, I didn't spend a lot of time thinking about it, about why they use that one: it's, like, always available but not likely to be used, because it's not that useful, because you don't know what socket the urgent data came in on. So it's like, okay, someone wants to talk to you urgently.

Speaker 1:

And they I

Speaker 6:

mean, and to some extent, they have no choice except to use signals because that's what the platform gives you.

Speaker 4:

Yeah. If you want async preemption, or you want, like, the thread equivalent of an IPI, signals are really all you have.

Speaker 1:

Yeah. Yeah. Anyway Yeah.

Speaker 3:

Because the problem they're trying to solve is there's a goroutine on CPU running Go code, and it's been running too long. We wanna stop it running, and we don't wanna rely on its cooperation. So with those constraints, I think you're kinda backed into a corner.
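
As a rough sketch of that constraint: the goroutine below has no function calls and therefore no cooperative preemption points, so the only way the runtime can stop it for, say, a garbage-collection stop-the-world is to interrupt it with a signal (SIGURG on modern Go). The GODEBUG=asyncpreemptoff=1 knob disables that mechanism; the version-specific behavior noted in the comments is from memory and should be treated as approximate.

```go
// A goroutine with no preemption points, and a GC that has to stop it.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	go func() {
		x := 1
		for { // tight loop: no calls, no allocation, no cooperative yield
			x += x & 1
		}
	}()

	time.Sleep(100 * time.Millisecond)
	start := time.Now()
	// A GC cycle needs stop-the-world pauses, so the runtime must preempt
	// the spinning goroutine. With signal-based async preemption this
	// returns promptly; with it disabled (GODEBUG=asyncpreemptoff=1), or on
	// much older Go, a loop like this could stall the stop-the-world.
	runtime.GC()
	fmt.Printf("GC finished in %v\n", time.Since(start))
}
```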

Speaker 1:

Yeah. Tools together.

Speaker 4:

That said, like, the Go people were the UNIX people. So it's really

Speaker 2:

I

Speaker 4:

mean, really, like, signals are a Unix thing. So I feel like that is just the long shadow of that decision catching up with them.

Speaker 3:

So part of the reason I found that particular bug, the FSBASE bug, compelling, though, is that, first of all, it was found in Go. Andy pointed me to a bug, I think it was in gccgo, from, like, 10 years ago; it was found in that on Solaris at the time. And I also had spent a bunch of time in the allocator code being like, the bookkeeping here is just not that complicated, and I don't really see how it could go wrong unless there were multiple threads operating on the same thing without any kind of synchronization.

Speaker 3:

That could definitely cause things to go weird, and that's about as specific as I got. So I found this thing, and I was like, wow, that explains some of the system-specificness of it. But, obviously, it was completely wrong. It was just nope, not at all that problem.

Speaker 3:

And there were a couple of things like that where I kinda thought this was super promising. I also found sort of a meta thing here about Go. I'd always heard of Go described as, compared to C, pretty memory safe. But I found myself exploring all the ways in which there are all these rules about programming in Go that, if you violate them, can cause basically arbitrarily bad memory corruption to happen. Like the pointer passing rules in particular, and the use of unsafe, and, like, if you cast an unsafe.Pointer to a uintptr, then the GC loses track of it.

Speaker 3:

So if you don't have another reference to it, then that thing can just get cleaned up

Speaker 1:

Yeah.

Speaker 3:

Even though you actually are still using it. And so, you know, I found bugs. I found a comment from someone saying the event port code is rife with these unsafe casts. And I was like, oh my god. It's gotta be in here.

Speaker 3:

There's gotta be one of these that's responsible for us collecting some pointer too early, and then all hell breaks loose. Now, that turned out to be a total red herring too. So there were a bunch of times where I thought I was super close, but the time I thought it the most was last Tuesday, which is when I actually did end up nailing it. You know, I tried not to let my hopes get up, because of all these times I thought I was close that were just nope, not even in the ballpark.
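
A minimal sketch of the rule being described, the unsafe.Pointer-to-uintptr cast that hides an object from the GC. This is just the general Go rule illustrated in isolation; it is not the actual event port code, which turned out not to be the culprit.

```go
// Storing a pointer only as a uintptr hides it from the garbage collector.
package main

import (
	"fmt"
	"runtime"
	"unsafe"
)

type buf struct{ data [64]byte }

func main() {
	b := &buf{}
	b.data[0] = 42

	// After this, the only "reference" is an integer the GC knows nothing about.
	addr := uintptr(unsafe.Pointer(b))
	b = nil

	// With no real *buf reachable, the collector is free to reclaim the object.
	runtime.GC()

	// Converting back and dereferencing violates the unsafe.Pointer rules:
	// it may appear to work, print garbage, or corrupt whatever lives here now.
	p := (*buf)(unsafe.Pointer(addr))
	fmt.Println(p.data[0])
}
```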

Speaker 3:

So Just so it's something So

Speaker 1:

what happened last Tuesday?

Speaker 3:

So Tuesday, I was chatting with Ben Naecker about it, and I was describing one of the failure modes in some detail. And I was about to say something that I thought was true that I realized wasn't true. And I was like, wait a minute, there's another explanation for how this observation could have been true. So in particular, what I found and I don't know if we want to get into the super nitty-gritty details, but I think I can give a good summary quickly is that we have this data structure called the span that describes a bunch of memory that might be allocated, and there are these allocation bits.

Speaker 3:

It's a bitmap that describes which sequential buffers in that thing are currently allocated. And so we run into this failure mode where we think we have allocated all 54 of them or I think it's 56 of them but we've only actually allocated 27 things. And so I was like, well, how many are allocated? There are kind of a couple different ways to look at it.

Speaker 3:

And I thought I'd look at the alloc bits, and they were exactly inverted from what you would expect. That is, the zero bits were all the things that we actually had allocated. Right. And the other thing that was weird about that was, as far as I understood the code, the code doesn't actually set that bit when it allocates something. It only sets that bit when it goes and sweeps the span as part of GC, and it sets the bits to basically whatever it found as part of GC.

Speaker 3:

So these bits should have been all zero, but instead they matched something that was so close to what they would have been if you had swept the span, except it was exactly inverted. And I was describing this to Ben, and I was like, wait and I don't remember if it was him or me, actually if those bits were set that way before all this happened, that would explain the allocations. That was the key insight about it. I had been assuming that these bits that were supposed to have been zeroed were corrupted at some point, or were being maintained correctly by a different code path I hadn't seen, and I didn't see how they were supposed to be inverted or whatever. It's like, no.

Speaker 3:

They were just wrong to begin with, and there was a random wrongness, and that drove the weird allocation pattern.

Speaker 2:

And to be clear, what we expected them to be was zero. Like, it was newly allocated, you know, fresh from the allocator, all ostensibly zeroed.

Speaker 3:

That's exactly right. Except that when I saw that they weren't, and that they matched that pattern, I was like, there must be some code path I haven't found that maintains them. Because what are the odds that it exactly matches the allocation pattern? But, of course, the odds are 100%, because it drove the allocation pattern.
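
A toy illustration of that insight (again, not the real allocator): if the allocation bitmap starts out as garbage rather than zeros, a free-slot scan only hands out the slots whose bits are clear, so the set of slots actually allocated is exactly the complement of the garbage, which is why the bits in the core file looked precisely inverted from the allocations.

```go
// If allocBits starts as garbage instead of zeros, only the clear bits get
// handed out, so the allocations end up being the complement of the garbage.
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	const nelems = 54
	var allocBits uint64 = 0xD15EA5E50B5C0DED // stand-in for random garbage
	mask := uint64(1)<<nelems - 1

	var handedOut []int
	for i := 0; i < nelems; i++ {
		if allocBits&(1<<i) == 0 { // free-slot scan: clear bit means available
			handedOut = append(handedOut, i)
		}
	}

	pre := bits.OnesCount64(allocBits & mask)
	fmt.Printf("pre-set garbage bits: %d of %d\n", pre, nelems)
	fmt.Printf("slots handed out:     %d (exactly the complement)\n", len(handedOut))
}
```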

Speaker 1:

Yeah. Interesting. So, in other words, it was the implicit assumption that zeroed memory had been zeroed. That's right. And so, you know, you talked to a lot of different folks about this problem, and I feel you did a very good job of and I find this another important tactic talking to other people when you're stuck on a problem, not only for the insight that they can offer, but also because it forces you to repeat your own understanding of the problem, and allows you to potentially find a new way of thinking about it just in the course of describing it.

Speaker 3:

Yeah. Totally. And, I mean, that's exactly what happened here. And I do wish I had done that a lot more along the way. But, as we were talking about earlier, I had some doubt about whether it was a good thing to be spending a ton of time on, and I was reluctant to drag a lot of other people into this black hole that I was in.

Speaker 3:

So I was sort of trying to tamp it down. You know, I remember spending some time with you, Bryan, and with you, Adam. And I got, like, an hour with, like, 10 different people, and all of it was very helpful. Robert and Josh and Jordan and Ben.

Speaker 3:

And then Sean put me in touch with a former colleague of his from Google who works on this area, the runtime, who was also baffled by this core file, which makes me feel better. I was like, can you imagine a situation in which we're allocating non sequentially from one of these spans? And he was basically like, no. That's very strange.

Speaker 1:

That's vindicating. Alright. So you've got the idea that, like, wait a minute, maybe it's not zeroed initially. And then what did that allow you to do? What did that potential lead let you go investigate?

Speaker 3:

Then it was kind of the standard: okay, I've got some corrupt memory somehow. How did it happen? I was still assuming at that point that something might have corrupted it.

Speaker 3:

And so then the question is, what does it look like? Does it look like ASCII? That might give you a clue about what had written over it. Is it a pointer or something else that's valid? Maybe you could find the subsystem that had scribbled over it.

Speaker 3:

And it was neither ASCII nor a pointer, and then I grepped the core file for that bit pattern, wondering, does it appear anywhere else? And it appeared 2,000 times in the core file. And I was like, well, that's nuts, because this is an 8-byte bitmap that is supposed to represent what is allocated from this span. Why on earth would that be the same as any other 8-byte chunk of memory anywhere, ever?

Speaker 2:

I mean, that's like the dropping-the-Kobayashi-coffee-cup kind of moment. Right? I mean, that would have been really alarming.

Speaker 3:

Yes. Exactly, that was my reaction. I was like, oh my god, this is really significant.

Speaker 3:

I still didn't really internalize what it meant, but I knew it was really significant. And I think that's the point where I was really hot on the trail, and I was like, okay, where does this memory come from? It comes from this function. And I had heard something about how, under some conditions on some systems, some registers that start with X or Y that I had never otherwise heard of aren't restored properly.

Speaker 3:

And that's it was pretty quick after that point, I think, to get there.

Speaker 1:

And then okay. Yeah. Well, so we said "there." We should describe what "there" is.

Speaker 1:

But then, I mean, it feels like at that point you have a promising and interesting hypothesis, but still a long way to go to connect it to all of the failure modes that you've seen.

Speaker 3:

Yeah. So at that point, I was looking for where this chunk of bits comes from. It either comes from anonymously mmapped memory, which should be zeroed, and I trusted that, although I don't know why I trusted that given what I later found, or there's a function they use to essentially bzero a chunk of memory. This is a Go-specific function in the Go runtime.
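
A tiny C illustration of those two sources of "zeroed" memory, as a hedged sketch: fresh anonymous mappings, which the kernel hands back zero-filled, versus explicitly clearing memory you already have, with plain memset standing in for the runtime's own wide-vector clearing routine (which is not reproduced here).

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 4096;

	/* Source 1: a fresh anonymous mapping is guaranteed to be zero. */
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	assert(p != MAP_FAILED);
	for (size_t i = 0; i < len; i++)
		assert(p[i] == 0);

	/* Source 2: explicitly clear memory that is being reused. */
	unsigned char buf[64];
	memset(buf, 0xff, sizeof (buf));	/* dirty it first */
	memset(buf, 0, sizeof (buf));		/* stand-in for the runtime's clear */
	for (size_t i = 0; i < sizeof (buf); i++)
		assert(buf[i] == 0);

	printf("both paths produced zeroed memory\n");
	(void) munmap(p, len);
	return (0);
}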

Speaker 3:

There's like at first, I've I looked at it. I was like, wow. This is really complicated. I'm sure it's not this, so I guess it's fine. I looked at this, like, a couple months earlier and I was like, okay.

Speaker 3:

Whatever. But I took a closer look at it and I saw its use of these registers, and it sort of clicked, or, you know, rang some bells with these other failure modes. And I think at that point, I asked Robert, is there some way to look at these registers in MDB? Because I couldn't find a way to do that. And at that point, I misread the code and thought it was using XMM15.

Speaker 3:

That's actually for a different architecture, sorry, different CPU features. And it was while we were looking at that, it was like, well, there's no way to print that, but do you also want the YMM registers, I think Robert said. I was like, oh, yeah. This code doesn't talk about the YMM registers. Are we saving those?

Speaker 3:

Is it possible we don't save those? And, Robert, that's, I think, what I don't know. Robert, what was your reaction when I asked that?

Speaker 5:

I'm trying to remember the the draw the kind of moment by moment reaction, but I think slowly getting to an increased sense of dread.

Speaker 4:

You you were pretty sad when we talked about it. I assume within 8 hours of this conversation occurring. And but I also feel like it wasn't like a lock yet. I feel like you were like, it could be this. Like, no.

Speaker 4:

It's that. It's the a 100% that. Because David and I had talked about signal handling, like, a week earlier, and I think I had shown you the list of other signal handling bugs that I've been involved in. Also, I was surprised to find that the async preemption stuff was turned on at all.

Speaker 3:

Yeah. Given That was what you conveyed to me in that conversation.

Speaker 4:

Because there was, like, Go community skepticism that that was gonna be sound on any platform other than Linux. And I vaguely remember, like, reading some message from possibly even Russ Cox about that not being turned on. But, I mean, that must have been years ago at this point, so I guess it's probably on everywhere now. But certainly, when you turned it off, it was not reproducible. Right?

Speaker 3:

That's true. And, you know, I dismissed not dismissed, but I deprioritized that data point, because, I mean, it looks like a concurrency bug, so I didn't know how much of that was just because of the ordering of things.

Speaker 1:

I think the the so I think you're I think you're right.

Speaker 4:

When we spoke

Speaker 1:

I mean, we I think it's like

Speaker 4:

When when we spoke about the signals though, you were worried about the other thing, the unsafety thing where if you put a go pointer in c memory or something

Speaker 1:

Yeah.

Speaker 4:

It can get confused.

Speaker 3:

Yeah. The rules around, like, what pointers you can store in what parts of memory. Like, can you pass a Go pointer to C? I think the answer is yes, as long as there are no other Go pointers pointed to by that Go memory. Yeah.

Speaker 3:

There's a lot.

Speaker 4:

We put a bunch of stuff on the stack between when the kernel vectors for a signal and when we get to the Go signal handler. Like, we do our own stack frames, which is unusual compared to other platforms. Mostly, their signal handling machinery is, I think, entirely in the kernel, generally, whereas a decent chunk of ours is actually in libc.

Speaker 3:

Josh, I loved your summary. At the very beginning of the conversation, you were like, I'm sure we're not doing anything unsound, but I'm sure we're doing things that Go is not expecting or something like that. Right. Yeah. This was a whole avenue of investigation.

Speaker 3:

You know? It's been covered in the bug report, but it's like there's a lot of ways that that could have gone wrong, but turned out not to be that. But,

Speaker 1:

But I think you were right to minimize the data point of, like, okay, if I change this other big flag, if we run with an effectively different configuration around not having async preemption, I don't see it. Hard to know what to do with that. It's kind of like, okay, you know, I don't see this if I have fewer CPUs enabled.

Speaker 1:

It's like, well, alright.

Speaker 2:

You know,

Speaker 1:

we're just gonna, like, ship with fewer CPUs. I mean, at some point, you have to actually understand the problem. So I think you were wise to just say, I'm gonna file that away as a data point, and we'll see. Yeah. That's it.

Speaker 1:

We'll see if if when we get to the end of this thing, it can that data point ends up being relevant. I mean, it did. It it helped bolster, I think, the the findings.

Speaker 3:

Yeah. And that is basically how I ended up using it. And, Michael from Google had also said that Linux had a similar sounding bug. I I don't know if the failure modes were quite the same, but they had a similar situation where, under some conditions, I think it was, like, if the first page of the signal stack was not faulted in, then YMM was not preserved across the signal handler. And it had been very hard to debug is what he has said.

Speaker 3:

So that was another thing that probably was what kind of put it in my mind, once I got to the code that was using these registers, thinking, oh, there could be a similar bug going on here. But, directly, it didn't seem super useful. I mean, maybe I should have, and that's why I was saying at the beginning, like, arguably, I could have gone down that path a lot sooner. But it just did seem so hard to know what to do with it directly.

Speaker 3:

Yeah.

Speaker 1:

I feel that you were wise to think, you gotta go to the problem and debug the problem. Because, I mean, I think that's kinda in the category of, like, making it go away. And it's like,

Speaker 3:

not that

Speaker 1:

there can be utility in that, especially when you're bisecting a problem. But when you've hit a big switch that completely changes a program's behavior, it's very hard to conclude much from that.

Speaker 3:

Yeah. And I did kinda feel like if I follow the data and just keep going, I should eventually get there anyway. Right? Like, it it's not wrong to follow the data that I have.

Speaker 1:

Yeah. And so you got this very interesting hypothesis. How did you explore it? Because I actually love the D script that you wrote here to actually hit this a little bit harder.

Speaker 3:

You mean when I tried to reproduce it? Yeah. Well, so before I did that, I think I wrote a C program to test this by just, you know, having main write to YMM0, and then a signal handler under certain conditions would whack it, and then see what happened in main if we took a signal at that point. Oh, and I did use a D script at that point to raise a signal at exactly the right moment.
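
For flavor, here is a minimal sketch of that kind of test, though not Dave's actual program: it parks a known pattern in %ymm0, waits for an asynchronously delivered SIGALRM whose handler deliberately dirties %ymm0, and then checks whether the register came back intact. On a correctly behaving system it must; with the bug described here, the upper bits would not. The alarm() is a stand-in for the D script that raised the signal at a precise instruction, and the choice of SIGALRM and the one-second delay is arbitrary.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

static void
handler(int sig)
{
	(void) sig;
	/* Deliberately dirty %ymm0, which a signal handler is allowed to do. */
	__asm__ volatile ("vpcmpeqd %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
	got_signal = 1;
}

int
main(void)
{
	uint8_t before[32], after[32];
	struct sigaction sa;

	memset(before, 0x5a, sizeof (before));
	memset(&sa, 0, sizeof (sa));
	sa.sa_handler = handler;
	(void) sigaction(SIGALRM, &sa, NULL);
	(void) alarm(1);

	/* Park a known pattern in %ymm0, then spin (no calls) until the signal. */
	__asm__ volatile ("vmovdqu %0, %%ymm0"
	    : : "m" (*(const uint8_t (*)[32])before) : "xmm0");
	while (!got_signal)
		;
	__asm__ volatile ("vmovdqu %%ymm0, %0"
	    : "=m" (*(uint8_t (*)[32])after));

	printf("ymm0 %s preserved across the signal\n",
	    memcmp(before, after, sizeof (before)) == 0 ? "was" : "was NOT");
	return (0);
}

The important property is that no function calls sit between parking the value and reading it back, so the only thing that can disturb %ymm0 is the signal delivery itself.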

Speaker 5:

Yes.

Speaker 3:

And saw that, yes, indeed. We are not saving you know, we're not preserving this correctly. And it was about in parallel with that that Robert confirmed by looking at the code that it didn't seem like we were saving that correctly.

Speaker 1:

Which, Robert, you you may have dread about, but I was elated. I mean, that is, like, that's great. This is great news.

Speaker 4:

It's the best

Speaker 2:

possible outcome.

Speaker 1:

Possible outcome. It's true.

Speaker 4:

So we can fix it. We can fix it, and and we don't have to, like, take a patch to another consolidation to do so. Yeah.

Speaker 2:

And we don't need to convince anyone that they need to fix a thing that they claim doesn't exist except for on us, and so forth. I mean, it sucks that, like, we got it wrong, but it's great that it's right in our hands.

Speaker 4:

Having had the other, like, the mirror image version of that problem where it's, like, definitely someone else's problem to fix and I want, like, it's, this is this is better.

Speaker 1:

Yeah. Yes. Then how did you,

Speaker 3:

Then, yeah. So then I tried to reproduce the very problem. And, you know, Brian, you've described this sort of walk-off home run feeling you get when you think you understand a problem and you trigger it. Something that's been very hard to reproduce, but knowing now what you think the problem is, you try to trigger it very precisely and you trigger exactly that failure, and how exciting and awesome that is when that happens. I was really hoping for that here, because I was like, well, I can raise a signal at exactly the point in this function, you know, in the very call path that I think is causing this problem.

Speaker 1:

Yeah. Like, bases loaded.

Speaker 3:

It should blow up.

Speaker 1:

Bases loaded, bottom of the 9th. Right guy, right spot.

Speaker 3:

Right. And yeah, Casey has struck out. It basically didn't reproduce. Well, it was more readily reproducible, but not enough to really be that satisfying.

Speaker 3:

So, like, you know, with the Cockroach version, like I said, it could take tens of thousands of iterations. And, sorry, I had this D script that would raise a signal at exactly the right moment in the right call path. With that enabled, it would reproduce it after, like, 30 iterations, which is, like, way better, and, you know, you could've done some statistics to convince yourself that it really was making a difference, but it was also, like, 29 times in a row that it didn't reproduce it.

Speaker 3:

I was, like, not feeling as good about it as I wanted to. But then I also broadened the D script so that it would just raise a signal every time in this function that's basically their bzero. Right after they initialize YMM0 to 0, it would raise a signal. And, like, 50% of the time, the program would die, and most of the failures were a totally different failure that I hadn't seen before. It was exactly the sort of thing you would expect.

Speaker 3:

It's like, you tried to initialize something that was already initialized. And it's like, yeah, well, this definitely could cause that, if you'd bzero'd something and now it wasn't zero.

Speaker 1:

You said that raise is a destructive DTrace action. Destructive in that it changes the state of the system. Adam, you added raise way back in the day. Did you

Speaker 3:

I love raise. It

Speaker 1:

is one of those things, man. When you need it, it is really, really nice.

Speaker 4:

It's like a Turing-completeness-like property that the system has because of raise. Like, there's all kinds of things that DTrace doesn't do by itself, but, like, because you have, like, raise(SIGSTOP)

Speaker 1:

Yes.

Speaker 4:

Like, you can implement them later as bolt-ons, basically. Like, if you didn't have raise, then you would not have been able to do almost any of those things.

Speaker 3:

Also, instruction tracing, which, if I had ever used it before, it was very, very rarely.

Speaker 2:

Mhmm.

Speaker 3:

People know the PID provider can trace the entry and return of every function in userland. I mean, you can pick whatever functions you want and instrument entry and return, but it's pretty rare that I've instrumented individual instructions. That was necessary here, and it was

Speaker 1:

And this is good.

Speaker 2:

Well, that's cool, dude. I didn't know you used that there. Yeah. Both facilities were made 20 years ago, and I don't know that it was with a particular goal in mind. Like, often, you know, we built pieces of DTrace based on the problem we were actually trying to solve.

Speaker 2:

But, like, with instruction tracing, it was sort of like, well, you know, we can do this too. Should we do it too? And I was like, sure. Let's do it too. And I don't Brian, I don't think there was a precipitating use case for raise other than it being nifty.

Speaker 4:

Well, it's pretty nifty.

Speaker 2:

There we go. If you

Speaker 1:

if you if

Speaker 2:

you feel like I've used this,

Speaker 3:

and I used the instruction.

Speaker 1:

Instruction tracing, it feels like, is not actually that well known. The fact that you can trace an arbitrary instruction is, again, one of those things that, boy, when you need it, you really need it. And, Adam, I remember you instrumenting every instruction in Firebird.

Speaker 2:

Yeah. No. In fact, in the early days of writing the PID provider, I would instrument every instruction because

Speaker 3:

that didn't work. Definitely did not work.

Speaker 2:

So, and and I think that people run into that facility more often than not by accident because they go to

Speaker 4:

We have lots of cards.

Speaker 2:

Well, they try to instrument, you know, every entry and return, but just leave off the entry and return. So instead of getting thousands of probes, they get, like, a billion probes, and it takes a little bit longer to create those and runs the system out of memory and stuff like that.

Speaker 1:

But it was, load bearing in this case, Dave.

Speaker 3:

Yeah. And I had also used it earlier, in one of the first D scripts that I ran to get more instrumentation from this failure mode. I really wanted to trace when the GC was freeing a pointer, and there's a function for that in Go, but it was inlined. So it was in the middle of, like, the sweep function. But fortunately, that was easy to find and then trace.

Speaker 2:

I And

Speaker 3:

And then I was able to grab the arguments, like, what would have been the argument to that function, obviously sitting in some register. It's definitely not that exotic, I guess, but I hadn't really done a lot of this. Like, oh, I'm gonna instrument this inlined function and print its argument out of this register.

Speaker 2:

So, Dave, I think this maybe gets to something you had alluded to in a conversation with me, talking about, like, DWARF integration to be able to say, you know, trace this function that doesn't technically exist anymore and trace its arguments, which aren't where you expect them to be. But that kind of integration, you know, it's not something we ever got to, but I think Robert has certainly built some bridges in the

Speaker 1:

it's super important for Rust where the Oh my goodness. Yes.

Speaker 2:

Where everything is one function. Right?

Speaker 4:

The the whole program is

Speaker 3:

And it and it's, like And

Speaker 1:

And you want that, I mean, for good reason. You end up with a much higher performing artifact as a result. So you really do want it, but it poses a challenge. And rampant inlining is a huge challenge. And this is, like, nested inlining, and it's gnarly, because reconnecting the binary to the code that the programmer thought they wrote can be a real challenge.

Speaker 5:

Yeah. I I have there's actually I don't think it's too bad to go do. Just need some time. But, we already have prototypes where we can, transform the dwarf unwinding information into d. So that actually is a way to get at all the local variables in that context.

Speaker 5:

And then from there, I already have sketched out in my head what, basically, instead of trying to do a PID provider based on the instructions that you have there, doing that based on the source would look like. So basically, being able to say this file, this line, and then, like, an offset from the line.

Speaker 3:

Yeah. Yeah.

Speaker 5:

And you kind of have to transform that into the entry and return, you know, given how the program is laid out. But I think that could lead to a pretty useful way to phrase it, because then you can go and say, here's what I wanna instrument logically in the source code over here, and here's how it transforms to that. Because, for better or for worse, Rust takes advantage of the fact that it doesn't use a standard ABI. So you kind of have to rely on DWARF to get access to arguments and other things. But then with those two combined, I think you get something pretty interesting, presuming you

Speaker 1:

have Yeah. And thankfully, the DWARF really is complete for Rust, which is, I mean, god.

Speaker 4:

It's only it's pretty complete for

Speaker 2:

It's great. Go Yeah.

Speaker 1:

It's great.

Speaker 6:

You know, like, the

Speaker 4:

despite hesitance, I think, to use debuggers, perhaps, they do actually have the information in there at least for, like, some kind of reflection thing, I think. So

Speaker 6:

I don't know.

Speaker 3:

For that and also for frame pointers and also a reasonably standard calling convention.

Speaker 1:

Yes.

Speaker 6:

Terribly against the idea of debuggers, for what it's worth. But that actually raises an interesting point, which I think is that one of the fascinating things in this entire odyssey was the primacy of the use of tooling. And I think we've kind of alluded to that, but not stated it directly. But without proper tooling, I think this would have been an incredibly difficult thing to go and figure out and solve.

Speaker 2:

It's a great point, Dave. I mean, Dan. But you're also talking about Dave, who is an expert in this tooling, who is also learning more about this tooling motivated by this bug. And that's one of the remarkable things: Dave, you're kind of phoning a friend with, like, you know, 7 or 8 colleagues, and each of them was not just teaching you about a different part of the system, a system that you're very familiar with, but also, like, instructing your use of these tools that you're also very familiar with.

Speaker 3:

Crap, baby. I

Speaker 4:

feel like we I feel like we all have, like,

Speaker 3:

a desk drawer full of,

Speaker 4:

terrible lengths of wire that we have acquired during, like, past periods of discomfort.

Speaker 3:

But that's

Speaker 4:

you're kind of just surveying, like, can you give me your wire drawer, and your wire drawer? And it definitely helps.

Speaker 6:

The thing that's interesting, though, is that that's qualitatively different than some of the recent social media flexes you've seen, with people being like, I don't use debuggers because I don't need to. It's like, well, that's a bit more of a self-own than you

Speaker 2:

guys. That's right.

Speaker 6:

Perhaps realize.

Speaker 1:

Yes, absolutely. And I think also, Dan, especially around, like, the EEs, where it's not like a position where you're like, I think we should use a scope for this or I think we should use a logic analyzer. It's like, no, these things are undebuggable without the tooling. So the tooling is really, really important, and it certainly was essential for this problem. So yeah.

Speaker 3:

Brian, I think it goes back to your point about about focusing on what question you wish you had the answer to instead of the questions that are just easy to answer because those might not be that informative. And then you have to end up doing this torturous logic to figure out what it means. You know, you had some piece of data from something that was easy to collect, but it's not clear what it means. But you can kind of maybe try to infer what it means, and you just get twisted in knots.

Speaker 1:

Yeah. And so how did you because I love the tactic that you used, once you really could home in on this thing, to show that this was the same problem.

Speaker 3:

Well, this is where, yeah. So I had first tried what we talked about, raising a signal at the right moment and seeing if the same problem happened, and I couldn't get that line of approach to really give me the satisfaction I wanted that I'd solved this problem. And there was also this question of, well, even if it did, it doesn't mean you know, I started off saying I ran the test suite in a loop and found a million different problems, but they don't necessarily all have to be the same problem. There could've been a lot of problems here. So I said, well, what if we fix this?

Speaker 3:

I had a binary that was exactly the same except it used the XMM you know, basically, it didn't use the registers that we didn't preserve. This function has a bunch of different modes depending on what's available on the CPU in terms of registers and instructions. And so you can tell it well, you can't tell it, but with enough force, you can tell it to use whatever mode you want. And I figured I could get this thing to not use those instructions and then see how long the test suite runs without having a problem. And remember, this was reliably reproducing after, like, 3 minutes on my machine.

Speaker 3:

So if it goes for kind of a while, then I think we can have good confidence that this is the problem and the only problem. And I thought about just doing a new build, first of the Go runtime and then a Cockroach using that Go runtime. But I was worried that there would

Speaker 1:

be some

Speaker 3:

other sort

Speaker 1:

of determination.

Speaker 3:

I would still have some doubt that there was some other factor. So I was like, well, I'm just gonna patch the binary, which I also had never done before, and I did not know the MDB operator for doing that. Thanks to Keith for that one. And so I

Speaker 1:

may use that one a bit.

Speaker 3:

Yep. There's a step in the function that checks a bit that Go has previously set based on CPUID and, I forget, XSETBV or, sorry, XGETBV, the thing that tells you what the OS supports. So it does this early in startup and sets a bit, and then checks it in this function. And if that bit is nonzero, then it jumps to a place where it uses those registers. And so I just knocked out the jump so that it would go straight to the XMM code.
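
As a hedged sketch of the mechanism being described (this is not Go's actual code, just the general shape of that kind of startup check): the CPU has to advertise AVX via CPUID, and XGETBV has to report that the OS has enabled saving of the XMM and YMM state, before it is safe to take the vector path.

#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

static int
avx_usable(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (0);
	/* CPUID.1:ECX bit 27 is OSXSAVE, bit 28 is AVX. */
	if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
		return (0);

	/* XGETBV with %ecx = 0 reads XCR0: which state the OS saves/restores. */
	uint32_t xcr0_lo, xcr0_hi;
	__asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
	return ((xcr0_lo & 0x6) == 0x6);	/* XMM (bit 1) and YMM (bit 2) */
}

int
main(void)
{
	printf("AVX usable: %s\n", avx_usable() ? "yes" : "no");
	return (0);
}

Knocking out the jump in the binary, as described here, effectively forces the "no" branch of a check of this general shape.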

Speaker 3:

And so it was a very small, targeted change to the binary that had a pretty well understood impact on the code, and sure enough, that thing ran for hours and hours. It ended up dying on a different problem, which I've since worked around, and now I'm running it again. And it's been running for almost 48 hours without an issue, so I'm feeling pretty good that there's not something else horrible lurking here.

Speaker 1:

Great. So the Or,

Speaker 3:

I mean, I guess, I shouldn't say that out loud.

Speaker 1:

And, I mean, I think it may be worth, Robert, talking a little bit about that, because the fix is somewhat complicated. And I would also say that, I mean, Dave, you saw that this was a bug that had been seen in many different forms, had many different manifestations. Interestingly, not just on us, on other operating systems as well. I would dare say that we are not the only operating system that has had this particular pathology, which I thought was really interesting.

Speaker 3:

Yeah. Well, there's a couple of things there. One is that the specific messages for this failure mode have definitely been seen a lot. Like, if you search for them, you see them in the Go issue tracker, and not all of them have been resolved. So one of the questions I had was, like, is this actually illumos-specific?

Speaker 3:

I don't know. Like, it kinda seems like it is, but then you wouldn't expect to see all these open bugs about it. And then the specific problem, like, the underlying cause, also seems to have happened at least on Linux and possibly on other systems. And Yeah.

Speaker 2:

Well, and as you say, Brian, like, on its face, it sounds like a simple problem. Like, you weren't saving some registers, so save those registers. Like, what's your problem? But it turns out to be much more complicated than that, and an interesting history just of how much state there is to save. And there seems to have been some cat and mouse game that people constantly fall behind on, in terms of CPUs having more and more state to save and operating systems

Speaker 1:

Interesting history, and an interesting future too, because I think this is a problem that's getting harder for the operating system, not easier. So, Robert, do you wanna give some context there? Because it's a pretty interesting problem. Sure. Yeah.

Speaker 5:

So, as we've kind of talked about, the main issue is that when you take a signal, and this kind of goes back to I think this is a System V-ism, with the signal stack and, you know, these getcontext and setcontext routines, you could kind of know all the registers that were there and modify their state. So the challenge is that you can't just use signal handling to, like, handle the signal and, like, move on with life: you can actually change the interrupted state that gets returned to. So that kind of starts to bake all of these structures into the ABI of all these systems, you know, whether it's Linux, BSD, us, someone else, that's all in there. And it's not super common, but people will do this to, like, change the actual register state around. You know, hey, I got this signal. You can even think of the classic JVM.

Speaker 5:

Like, we got a SIGSEGV and we need to change, you know, where RIP is: stop executing that and jump somewhere else. I don't know that they actually do that internally anymore, but, like, that's kind of the origin of what people do with signals and why you have all this register state visible in the context. And unfortunately, on AMD64, the original state was this FXSAVE state, which is 512 bytes of XMM state and, you know, the x87 floating point stack and all those fun things that you have. But then when Intel introduced the YMM or AVX instruction set in Sandy Bridge, that state starts to explode, and state keeps getting added.

Speaker 5:

So, you know, with AVX-512, all of a sudden now you have 2K of register state, because we have 32 512-bit registers. Then Intel's matrix operations, which they're adding in Sapphire Rapids, are like another 8K of register state. And the semantics of signal handlers are that you have your own register context to do whatever you want. So even if you just ignore the problem of the ABI and try to, you know, make sure you don't break anyone who's expecting to modify this, even though the floating point state hasn't been modified too much, the bigger challenge these days is actually just not overflowing someone's signal stack.

Speaker 2:

Right.

Speaker 5:

You know, everyone for a long time said, hey, use the 2K signal stack. That's great. You'll be fine. And now the actual state that someone needs to spill is over 2K.
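
A small C sketch of that sizing concern, illustrative only (the exact numbers vary by OS, libc, and CPU features): it sets up an alternate signal stack using the system's own constants and prints the size of the ucontext_t a handler is handed, which is only part of the story, since the FPU/XSAVE area can add considerably more.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

static void
handler(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;

	(void) sig;
	(void) si;
	/* printf isn't async-signal-safe, but this toy raises synchronously. */
	printf("in handler: ucontext at %p, sizeof (ucontext_t) = %zu\n",
	    (void *)uc, sizeof (ucontext_t));
}

int
main(void)
{
	stack_t ss;
	struct sigaction sa;

	/* An alternate signal stack sized by the system's own constant. */
	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;
	if (ss.ss_sp == NULL || sigaltstack(&ss, NULL) != 0) {
		perror("sigaltstack");
		return (1);
	}

	memset(&sa, 0, sizeof (sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
	(void) sigaction(SIGUSR1, &sa, NULL);

	printf("MINSIGSTKSZ = %ld, SIGSTKSZ = %ld\n",
	    (long)MINSIGSTKSZ, (long)SIGSTKSZ);
	(void) raise(SIGUSR1);

	free(ss.ss_sp);
	return (0);
}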

Speaker 1:

And then you also need to you want to not spill the those registers if they are not in use. So you've got complexity there. And then you also have vulnerabilities where, that that attribute is effectively being exploited in in speculative speculative or side channel attacks.

Speaker 5:

Yeah. So solving it well is kinda nuanced. And I think, actually, AMX, which is forthcoming, makes for an interesting trend, where it will be the first unit not to be enabled by default in a long time. Applications actually have to opt in to using it, because all of a sudden, imagine every time you're switching threads, you're paying for, you know, 8K of bcopy there. Or they just want people to promise, like, yes, my signal stack will be able to support you spilling, you know, 10 or 16K onto it when I take a signal.

Speaker 5:

So it and I think in the the matrix and you know, the Intel's AMX is a whole bunch of matrix tiles that are like 1 k each. So I don't think you're gonna see that in a lot of general purpose computing per se, at least not

Speaker 1:

far from now. I fully expect, as GPUs and CPUs continue to converge, which I think is not an unreasonable kind of prognostication, certainly Intel and AMD wanna do that, we can expect this problem to actually get gnarlier, because you're gonna have these registers that are designed for kind of one type of software. But in your case, Dave, with the bzero, they're being used for a different kind of software.

Speaker 1:

So, well, actually, it's a register, and, yeah, I'm not gonna use the floating point aspect of it. I'm going to use it as a register to be able to stage in and out of RAM or what have you.

Speaker 3:

I do wonder if, like

Speaker 4:

because we virtualize all these registers today, ultimately. Right? This signal handling mechanism is part of, like, the saving and restoring and switching from present thread is all, like, like, in an attempt to give you the appearance that nothing has happened and that the register is just yours and they're just regular registers. But, like like, if this starts if this gets up to being, like, a bag of, you know, random, accelerator crap, basically. Like, I don't know that

Speaker 3:

Lies will continue. Spilling it.

Speaker 1:

Lies will continue. Is lies.

Speaker 4:

Like, I just it just doesn't seem it's like at least with the GPU right now. Right? Like, it's a separate resource that is managed apart from the LWP, which, like, has some benefits, I think.

Speaker 3:

I don't know.

Speaker 1:

So, Dave, when you did get there, did you get that walk-off home run feeling? I mean, you must have at some point through this whole thing, but was there a single moment that gave you that kind of primal feeling?

Speaker 3:

I think the closest, like, discrete moment was the C program that showed that we were not preserving YMM. Because that can explain everything. Nothing up to that point could explain all the data that I had, and that really could. And actually, like, something a lot like it had to explain it. So that felt pretty good.

Speaker 1:

It must have also felt great to have so many of these different manifestations cleaned up by the same problem. And because you're thinking, like, I'm looking I mean, there are Yeah. These manifestations are not necessarily all that similar. I mean, some of them are similar, but there's a there's a lot. I mean, and, of course, in hindsight now, it's like, yes.

Speaker 1:

When you when you scramble someone's registers, there'll be lots of different kinds of failure modes.

Speaker 3:

Yeah. Totally. The 3 that I ended up focusing the most on were all in the Go memory subsystem. And the one we've been talking about was this case where it blew an assertion thinking the thing was full when it was only, like, half allocated. And that one, you you kind of had to get further unlucky after hitting this problem in order to hit it because you had to have reached the point where it would do this assertion, which it wouldn't normally do.

Speaker 3:

If it did a GC sweep before then, you might hit one of the other failure modes, and that's why you'd sometimes get more than one failure mode. And there's a third failure mode that would happen during sweep, depending exactly on what the state of that thing was and exactly how it was corrupted, basically. Then we also had some SEGVs. We had this port_getn returning with an apparently impossible errno. We've also had a couple of Cockroach bugs where the Cockroach folks were like, this can't happen.

Speaker 3:

This is like this is, like, memory corruption of the worst kind. And at the time, we're, like, Okay. Well, not our highest problem right now. But, like, now looking back on that, I think we can say there's this could have certainly caused that.

Speaker 1:

Yes. And I mean, I would go one stronger: had we left this unfixed, this would have had manifestations for us in production.

Speaker 3:

Yes. I think that's right. I I think probably

Speaker 2:

Calamitous ones. Absolutely. Right. Right. The in

Speaker 4:

The OS-1028 of databases.

Speaker 3:

Yeah. Because we've seen things where again, I haven't gone back and checked that all of these are explainable by this problem, although it's hard to imagine a problem that couldn't be explained by this one, given, you know, zeroed memory that isn't zero. But we've had, like, a bunch of stuff where you're just, like, doing a select from Cockroach and it's, like, I blew my own internal unique index on ID trying to insert an ID into this table.

Speaker 1:

What? You're, like, what?

Speaker 3:

That error message doesn't even have anything to do with what I asked you to do. You know what I mean? But it's, like, very deep inside Cockroach being confused. We've had a couple of things like that already. Like, yeah, there's no way, to me, that even if none of these were caused by this problem, we weren't gonna have a serious problem in production because of it. And

Speaker 1:

a problem that would have been without any guarantee of reproducibility. And when Josh is making reference to a really gnarly bug we had years ago where we had, memory corruption, a kernel memory corruption issue where we would die in, like, wildly different manifestations. And we in a smaller number of cases, that data corruption actually, it was ZFS that was the target of that data corruption. ZFS was absolutely it's a victim. This is not a ZFS bug at all.

Speaker 1:

But the data corruption ended up then being on a metaslab on disk, and that's the gift that keeps on giving, even after you've fixed the actual data corruption. And Dave, I can easily imagine that we could have had this issue where we ended up being corrupt on disk.

Speaker 3:

Yeah. And even if you did know exactly what it was, how would you know which of the pages of the database were were were corrupt? How would you know what was supposed to be 0s? How would we fix it? How would we even assess the blast radius?

Speaker 3:

So yeah.

Speaker 4:

I mean, what what I think happens is you ultimately end up closing hundreds of bugs later with it was probably this.

Speaker 3:

Oh, sure. But, I mean, what do we do with the production system that seems to have hit this?

Speaker 2:

Apologize.

Speaker 4:

I mean, there's really nothing else you can do. Right? The data is gone. Like, the the right data is not there.

Speaker 1:

When you've got data corruption, you have these failure modes that are wild, that are completely outside of the bounds of the system, and that are nonreproducible, as data corruption often is. It is wholly dissatisfying to not get that completely debugged, because you know that that thing is lurking out there for you, presumably under a less debuggable future failure mode. Dave, I mean, I know this took a long time. This was heroic effort on your part to keep grinding on this, because I feel like if you had stepped away from this, no one would have faulted you for it. It would have been understandable for you to be like, look, we've got other things to do. But I'm really glad that you stayed at

Speaker 3:

it. Yeah.

Speaker 2:

Hear, hear. It's gonna be so vindicating not only to solve it, but to solve it in a way that shows just how damaging it could have been. Because I don't think we recognized specifically that. Like, we had some intuition that this could be real bad.

Speaker 2:

But now that you've determined the actual problem, could have been real, real, real bad.

Speaker 1:

And, well, I mean, as you just said earlier, Adam, I did not think that this was gonna be an issue beyond Go itself. I felt like the odds were this was an issue with Go on illumos, on Helios. Like, that felt definitely plausible, maybe likely. Though, as you say, Dave, the thing that is interesting is, like, this failure mode has been seen on lots of other systems. Like, we are not the only ones, and I think even on the ticket, someone had pointed you to a Linux bug that the symptoms were very reminiscent of.

Speaker 1:

And, you know, on some other systems where you don't necessarily have someone digging all the way into this, you do wonder if these other issues aren't aren't lurking out there.

Speaker 3:

Yeah, totally. I mean, certainly in retrospect, time well spent. Happy to

Speaker 1:

do it. Very, very, very glad. And were there, you know, any other kinda higher order bits coming out of it? I mean, obviously, you're glad that you spent the time. That felt very vindicating.

Speaker 1:

Were there other other lessons coming out of it?

Speaker 3:

I'm like

Speaker 2:

Other than saving the company and our customers from data corruption, are we still active, for Brian's that.

Speaker 1:

Okay.

Speaker 3:

So I mean, I mean

Speaker 1:

I just think that, like I mean, not words in your mouth, but I think that that, you know, that conversation with Robert where you began to really attack the actual manifestations of the problem, I think that's an that was an important inflection point on this.

Speaker 3:

Yes. Very much. Totally. And, there there's another thing that I haven't figured out how to distill it, but I just ran into so many, many problems along the way. We talked about a couple of these.

Speaker 3:

But, like, you know, I wanted to search the address space for something, you know, this bit pattern. And we have this ugrep command for that, but it doesn't work if libumem isn't loaded. And so Adam gave me an incantation that would basically, like, dump all of it to a text file, and then I can grep that, which is fine. I mean, it's good. There's just, like, so many things.

Speaker 3:

Another example is I wanted to try to reproduce this on AMD in AWS, because at one point, I was like, is this my machine? Like, are bits of memory being flipped in my machine? I'll just, like, provision an AWS machine. But I couldn't, because we had some trouble with the newer AMD instances in AWS. And so there's, like, a meta point here around deciding that it's important enough that you're gonna keep going even though there's all these things, which is hard when, Adam, as you said, you're like, I don't know what the probability is that this path is gonna be that important. Like, I wanna grep this address space, but I don't know if this is really important enough to spend the time. That one's not that time consuming, but there's just a lot of other stuff.

Speaker 3:

Or, like, teaching MDB about the Go structure. It's like I wanna print this. How badly do I wanna print this? For a while, I was, like, not badly enough, and then, eventually, I was like, okay. I'm gonna figure out how to do this.

Speaker 2:

Dave, I think they're there there there's so much that people can learn from your experience here, and I think that if they take away nothing else, it's just the level of detail that you put into the write up.

Speaker 6:

Really good.

Speaker 2:

Because it's that artifact that lets everyone benefit from from this experience. And and to some degree, I mean, obviously, fixing diagnosing and fixing the bug is is hugely beneficial. But the the write up that you did just carries it beyond that because, like, you're gonna influence folks on the team, folks on on, you know, the podcast, but then just folks in the world who want to, you know, have that joy of debugging, and discover that. And that that artifact really comes along.

Speaker 1:

Yeah. I and I was thinking back to, you know, Lukeman. We had the the the terrific episode. What was it? Or in the in the earlier in the year, Adam, with Lukeman and Jordan on their write ups.

Speaker 1:

And, you know, I just think it's so important. It's so important pedagogically. But, also, Dave, it's important to kind of for your own thinking, I'm sure. But, I loved seeing all the the the techniques and the different techniques. I mean, Dave, you you described this as a bingo card at one point, and it really does feel like you've got you've got blackout on the debugging bingo card.

Speaker 3:

Yeah. I definitely felt like I pulled out everything I knew when at one point or another. And, yeah, I think the write up's really important. And part of it is because I feel like I invested all this time, like, we've gotta be able to leverage that somehow. So I wanna I want this information to be available, not just, like, if people see a similar problem in the future or people are wondering, like, how do I do this particular thing.

Speaker 3:

But also for myself looking back: every time I've looked back at a bug that I've written an analysis for, no matter how detailed I thought I was, there's always stuff that I wish I had put in. I'm like, wait, I know there was something else here that I've forgotten that I didn't write down. And so my write-ups have gotten longer and longer, but I just feel like it's a way to make that pay off.

Speaker 1:

And I do think that the bit that we've gotten into here, the despair when you are not actually feeling like you're making progress on it, is something that I mean, you know, problems worthy of attack prove their worth by fighting back. Right? That's Piet Hein. And I feel like this one fought back a lot.

Speaker 1:

This one was definitely, was heading you off at every corner, and I think it's easy for, like, it's hard to persist on a problem when you have despair. It's really hard. But, they're hard problems because of that. That's what makes them hard. Yeah.

Speaker 4:

Absolutely. The xHCI despair. That was definitely a source of despair.

Speaker 6:

There's another sort of benefit to doing a write-up with this level of detail, which is, it's a little bit soft is the way one might describe it, but it helps set a culture that says, hey, it's okay to attack really hard problems. And it is okay to sort of go down into the well of despair and be like, oh my God, I'm not making any progress, what is going to happen here? That is something that I think we encourage at Oxide.

Speaker 6:

We encourage the development of tooling. And Brian, as you've said several times, like, it's never time that's wasted. It's always time well spent. And I've been in lots of other organizations where people are like, that's not important. Don't do that.

Speaker 6:

And, you know, I think if you have these artifacts that people can look at and say, oh, hey, look, you know, it took all this time, and, like, really ran this down to ground, and figured it out, and found this actually pretty severe bug in the operating system that could have been completely deadly.

Speaker 1:

Yeah. You make it even more important. And, you know, this is the advantage of it being open source and everything being out there. Hopefully, if someone in a, you know, different environment, different operating system, different programming environment, saw a problem that only reproduced under certain conditions, under a test run or what have you, but was disconcerting, and wanted to be able to investigate it, they'd be able to point to this kind of odyssey and say, hey, this can be really, really important.

Speaker 1:

And, you know, my big belief is that these things may be giving you their last your last opportunity to really debug it. And the next time, maybe in the field, maybe with a corrupt database, it may be much more difficult to debug. So seize those opportunities, when you got them, which Dave, you definitely did on this one. Great work. Really, really, really good work.

Speaker 1:

I'm sure that must have been a huge relief.

Speaker 3:

Thanks. Yeah.

Speaker 1:

Yeah. And it was a village, too. Obviously, Robert, you know, playing kind of a clutch role at the end, and Keith and Adam and everyone else along the way helping

Speaker 3:

you out. So Totally. Totally.

Speaker 1:

Good stuff. A great debugging yarn. So, well, Dave, thanks again for joining us. Really, really appreciate it. Yeah.

Speaker 1:

Thanks for having me. We are. So, Adam, what are we what are we thinking about for next week? Are we.

Speaker 2:

I don't know if we have a topic, but I I think we said that we're gonna

Speaker 1:

do it next week, but not the week after. So we will do it the day after Christmas, but not the day after New Year's, effectively, because I think you're out. Right? And then we've got predictions for the next week. I feel we may wanna do some looking back at our predictions from 2020.

Speaker 2:

Oh, yeah. For sure. For sure. We're gonna have a little bit of a clip show on that one, for sure.

Speaker 1:

Is this have you relistened to the predictions episode recently?

Speaker 2:

No. I haven't.

Speaker 4:

Yes? Have you?

Speaker 1:

And you are your your web 3 prediction is really, really good, I felt.

Speaker 2:

I think I I actually I did go back and I looked at the show notes on on that.

Speaker 1:

No. You should be looking at that one. It totally reminds me of my iPhone prediction. This prediction that was very correct, but also very wrong in that I was dismissing my own prediction. You were like, I think no one is gonna even talk about web 3.

Speaker 1:

It's gonna be a term that people won't even remember. But you also said, this is my heart, not my head, making this prediction. Like, alright. Adam's heart nailed it. Alright.

Speaker 1:

So we'll, we'll see everyone, next week. And, have a great holiday. We'll, we'll see you in a week.

Speaker 2:

And and Dave, last, depressing thought, that the next time there is a Go, memory subsystem bug, unfortunately, you're gonna be the one I turn to.

Speaker 3:

Help me out.

Speaker 2:

Alright. Thanks, everyone. Great work, Dave.
