Futurelock
Bryan Cantrill:How are you?
Adam Leventhal:I'm doing very well, Bryan. How are you?
Bryan Cantrill:I'm doing well. You know, we were very concerned, or rather the Internet was concerned, that we were scheduling last week's podcast over the World Series. How cruel. And that they were gonna have to miss the World Series game.
Adam Leventhal:Yeah. Miss that whole game.
Bryan Cantrill:I actually, I guess it was more that they were gonna have to miss Oxide and Friends to watch the World Series game. As it turns out, they would have only missed the first third of that game. Yeah. Because I tuned into that game in the tenth inning and watched for three and a half hours.
Adam Leventhal:So, yeah, I mean, we ended at 6:30, and that game went until almost midnight. So there was still plenty of game left.
Bryan Cantrill:No. It was amazing. Yeah. And did you watch the game? I assume you
Adam Leventhal:I listened to it on the radio because, I don't know why, because I'm a psychopath, I guess.
Bryan Cantrill:Did you watch the entire, I, it's my understanding that people really love it when we go deep into baseball. So did you watch the rest of the World Series? Did you
Adam Leventhal:My wife's birthday conflicted with the World Series, so I prioritized that. But then we watched game seven as a family, and it was fun. Except for Canada. So I'm sorry, Canada.
Bryan Cantrill:Why do they not send the runner, Adam? You're a baseball scholar. Can you explain to me why? I mean
Adam Leventhal:Kiner-Falefa on that play at the plate?
Bryan Cantrill:No, no, no, no. When you got runners at first and third with one away and you've got perk up. Why would you not send the runner from first to second
Adam Leventhal:to get out
Bryan Cantrill:of the double play? So I was just like, get out of the double play. What are you doing?
Adam Leventhal:Because
Bryan Cantrill:I think there's
Adam Leventhal:too much risk of him being thrown out because I I don't think he's a great runner.
Bryan Cantrill:Get someone in who can. Like, seriously, if you cannot take that bag in that situation, you just should never be on a base path. You know what I mean? I don't know.
Adam Leventhal:I mean, it's it's like it's
Bryan Cantrill:kinda high. One out and a runner on third. You really think they're gonna gun you down at second?
Adam Leventhal:Yes. I do think that. I mean, look. Having just
Bryan Cantrill:Who's on the track team in the dugout? Get him out there. I don't know. I just thought I
Adam Leventhal:I think they've all been in the game. I mean, at that point, you'd kinda gone
Bryan Cantrill:through the people who knew how to play baseball. With the benefit of hindsight, though. I mean, it was a GIDP. Grounded into the double play.
Adam Leventhal:Yeah. God. What a brutal way to lose. I mean
Bryan Cantrill:It's harder to do with a runner on second and first base open. Dave, you're
Adam Leventhal:If, look, if you're auditioning for your next role as a Major League Baseball
Bryan Cantrill:coach Yeah.
Adam Leventhal:How's it going? I mean, you had literally been a baseball coach for many years.
Bryan Cantrill:I think I know you podcast.
Adam Leventhal:I know you would have sent that runner in a heartbeat on your Double-A, on your Grasshoppers team or whatever.
Bryan Cantrill:Well, you know, listen. The catcher was just gonna sail it out into the outfield. Like, why would you not?
Adam Leventhal:Yeah. No. I guess that makes sense.
Bryan Cantrill:No. That's the classic play where you have the guy on first, like, walk to second because they're trying to bait the throw so you can send the runner home.
Adam Leventhal:Okay. Good. I know. I know. I know it worked in your little league.
Adam Leventhal:Why would it not work against... I guess that makes sense.
Bryan Cantrill:I gotta say, send the runner. Yeah. Well, that guy's gotta have an arm. Send the runner. Get someone on the track team.
Bryan Cantrill:That's all I
Bryan Cantrill:gotta say.
Bryan Cantrill:But it was an amazing World Series, also with many amazing pitching narratives. I'm obviously biased. Will Klein, the pitcher for the Dodgers in that 18-inning game. I mean, you had to root for the Dodgers in that game based on Will Klein's performance. It was amazing. So there you go. I think I've said my piece.
Adam Leventhal:There we go. When would you say the last time we talked about async on Oxide and Friends was? Like, maybe you've already looked it up.
Bryan Cantrill:You know, I can't recall, and I meant to look it up and probably should have. But did we ever discuss it in March and April here, or did we merely talk about the need to talk about it here, as Rain unmutes herself? It's like
Rain Paharia:Yeah. I think we only talked about the need to talk about it here. And then I was like, hey, this is not gonna happen here. Let me go do it at RustConf.
Rain Paharia:And then I went So
Bryan Cantrill:Oh, Rain. Oh, I'm... Oh, wow. Oh, man. I'm so sorry. You're just like, well, the podcast doesn't want this.
Bryan Cantrill:So no, that's not it at all. Boy. Okay. Yeah. I feel bad.
Bryan Cantrill:For which, obviously, Rain, I'm sorry. Let me just get that out there.
Rain Paharia:Oh, no. You're good. You're good.
Rain Paharia:It's all in jest.
Bryan Cantrill:Yeah. But we obviously should have talked about the async cancellation issue, which, Rain, I mean, really, for the betterment of humanity, if that's what inspired you to give your RustConf talk, because it's a really extraordinary RustConf talk and a great blog entry that accompanies it on that issue. And Adam, was that the last time? I guess the last time being that time we didn't discuss it. What was the last time?
Bryan Cantrill:So it was the
Adam Leventhal:last... So we talked about it in, like, June.
Bryan Cantrill:In June with the other issue, when we were talking about statemaps.
Adam Leventhal:Yes. Exactly right.
Bryan Cantrill:Yeah. And so we are back again with another one, and Dave, I don't know how we want to pick this up, or John. So once again, we find ourselves just trying to ship a product over here. We're really not trying to find, like, new pathologies in the language, I don't think. I think we're just trying to ship a product, something that's robust and works. But we have found ourselves here again, where we have found something that is a pretty deep pathology.
Bryan Cantrill:But John, do you want to kick us off with kind of how we got here? Because, you know, the story that you told, I thought, was really compelling and interesting in terms of what the symptoms were. When did this start? How did this odyssey start?
John Gallagher:Sure. I should probably hand it over to Dave pretty quickly. So what happened was, we have been testing live update, right? Which I think we've talked about some on the podcast. So this is, like, the entire Oxide rack, which is a distributed system, updating itself while the control plane remains online.
John Gallagher:We've been trying to get a bunch of repetitions of that in play on what we call our dogfood rack, the rack we host at the office. So Angela, a coworker of ours, had kicked off another update. These take several hours to run because the whole thing stays up; it just updates one thing at a time in the background. And she had kicked one off over the weekend.
John Gallagher:I think this was three weeks ago, maybe, I don't know, two weeks ago. And she filed an issue saying a couple hours after the update, one of the Nexus instances (Nexus is our main API server that's reachable from the outside), one of the three, was unresponsive. And I think I should hand it over to Dave at this point, because Dave, I think you actually got on on Sunday to just kind of poke around a little bit and see how bad things looked. Is that right?
Dave Pacheco:Yeah, that's right. I'm not sure why I did, but I just happened to be checking chat, and I was like, maybe this will be something easy. And so I just basically spent a couple of minutes looking for the easy things, and that was disappointing. So Angela had reported that this Nexus instance had hung when you tried to make requests to it.
Dave Pacheco:That was easy to reproduce. So then the obvious thing is I tried to make different types of requests to it, especially ones that I thought could fail very quickly. So, like, something that's not even a valid API, a URL for which we don't have a handler, and it was returning, you know, error responses for that very quickly. So that's pretty informative. The process is not totally stuck, but it's stuck doing something.
Dave Pacheco:And then I managed to find an API endpoint that did something nontrivial that also succeeded, which was sort of the first hint that, like, okay, something specific in this thing is hung. And then the question is, what is it? I don't know how much I actually did that day, but the next day at some point we kind of dove in some more. Should we just like keep going with
Bryan Cantrill:Yeah.
Dave Pacheco:Right now?
Bryan Cantrill:Yeah. I think it is helpful to kind of take this from the outside in in terms of debugging this, because I do think one thing that's really important to understand is the nature of this pathology, which is pretty hard to debug, actually. I mean, not to give away any spoilers, but let's just say Ghidra becomes involved at some point. So, yes. What's next?
Bryan Cantrill:So you got a Nexus instance that appears to be hung and is hanging on certain kinds of operations.
Dave Pacheco:Right. So the natural next thing is to start digging into the logs, and I see the last log entry that completed, which I think was an authorization. We successfully authorized some trivial thing within the process. We were trying to see if we could query the database, and we agreed. We said, okay, yes, this user can query the database.
Dave Pacheco:And then we got stuck. So that suggests that we were about to go do something with the database. So I did the sort of obvious thing and looked for stuck database transactions or connections. Like, you know, I looked into the database, CockroachDB, and did a query that was like, what are the current running queries from this particular host? And it was nothing.
Dave Pacheco:It hadn't made any queries in several hours at that point. And it wasn't currently stuck in a transaction. So that kind of ruled out the obvious things, like we somehow got stuck on some database lock and everything behind that. And that's when, from my perspective, we had to bust out the big guns, the much more time-consuming and error-prone debugging, which is where I busted out the pid provider tracing of the Tokio task activity and the Dropshot activity.
Dave Pacheco:And this was kind of based on some of the, so back in June, when we had debugged a different problem, it was not a hang, but it was sort of similar, where something would hang for, like, ten minutes or something like that, right? That was the problem we were debugging back in June. That one turned out to be a problem where the Tokio scheduler was not picking up a runnable task when it should have, but it did eventually pick it up and did eventually complete it. Things were just getting stuck in the meantime.
Dave Pacheco:And so the sort of obvious thing, if you're like, well, I don't know what this thing is getting stuck on, is, well, I'm going to start tracing everything I can in the process between a point where I know it ran and a point where it stopped running. And I was able to use the probes that Eliza had added after that bug from back in June to have a much simpler way of tracking activity by Tokio task. And from there, we were able to see what was going on, which was that the thing was basically going to sleep. What were we able to tell at that point? I think we were able to tell that the task was going to sleep trying to claim a database connection from qorb, sending a message on an mpsc channel in qorb.
Dave Pacheco:So, basically, qorb, go ahead.
Bryan Cantrill:Qorb, q-o-r-b, which is doing connection pooling, that Sean wrote, right? Inspired by the Cueball work at Joyent. And then, just to fill in a couple of the things you went by: the instrumentation that you're using is the instrumentation that we used after this episode.
Bryan Cantrill:I mean, both, I guess, metaphorically and literally, because we, of course, did a podcast episode on it. This is the When Async Attacks episode. If Adam hasn't already rung the chime by now, well, he will ring it here. One hopes. There's just future Adam, not present Adam, which has
Adam Leventhal:no powers. Understood. Yeah. Yeah. Exactly.
Adam Leventhal:Exactly.
Bryan Cantrill:The notes, future Adam. Which, by the time people are listening to this, is actually past Adam, but, you know, a little bit
Adam Leventhal:Very confusing.
Bryan Cantrill:But so, Dave, you are using, and we just talked about using the pid provider in Tokio and then the probes that Eliza added, you're just trying to, like, you're kind
John Gallagher:of like,
Bryan Cantrill:what the hell is this thing?
John Gallagher:Right? That thing is kinda like, what
Dave Pacheco:are the instructions that are executing? Right. Tell me anything about what's going on, basically.
John Gallagher:We should say, yeah, there are also probes in qorb itself, many of which were added after the debugging of the June issue. And they were good, but just more confusing, right? Like, the information we were getting
Bryan Cantrill:from qorb
John Gallagher:made no sense to us. Right? So qorb is sort of like your standard actor pattern, right? When you create a qorb pool, it spawns a Tokio task that is just receiving off of a Tokio mpsc channel, and it's running a big loop.
John Gallagher:Right? Like, polling for messages on a channel and some other, like, internal bookkeeping stuff. And with the probes that Sean had added, we were seeing qorb's internal bookkeeping fire every sixty seconds, just like it's supposed to. So, like, normally, if you can't talk to an actor, the assumption is that the actor has gone out to lunch, right? That's the normal pathology we've seen in the past.
John Gallagher:It's like, all these things are queued up trying to talk to this actor because the actor isn't pulling messages out. But qorb was running. Like, we saw, every sixty seconds, it does its bookkeeping, it takes like a hundred milliseconds, and then it goes back to sleep. Which, like, I think multiple times as we were looking at the code, we were like, I don't understand where we could possibly be stuck, based on, like, all of the senders seem to get to the point where they're trying to claim, which means they're sending a message to qorb. And qorb on the other end is just happily running along in the background, never returning claims, but still doing its internal bookkeeping every sixty seconds.
John Gallagher:That's about as far as we were at this point, I think.
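The actor pattern John describes here, a single task that owns the pool state, pulls requests off a channel, and does periodic bookkeeping when idle, can be sketched roughly as follows. This is a toy model using std threads and channels rather than qorb's actual Tokio implementation, and all names are illustrative:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Illustrative request type: a claim carries a reply channel for the
// "connection" the actor hands back (here just a counter).
enum Request {
    Claim(mpsc::Sender<u32>),
}

// Spawn a toy actor: one thread owning the state, looping on the channel.
fn spawn_actor() -> mpsc::SyncSender<Request> {
    // Bounded channel, like the size-one channel in the story.
    let (tx, rx) = mpsc::sync_channel::<Request>(1);
    thread::spawn(move || {
        let mut next_conn = 0u32;
        loop {
            // Wait up to the "bookkeeping interval" for a message.
            match rx.recv_timeout(Duration::from_millis(10)) {
                Ok(Request::Claim(reply)) => {
                    next_conn += 1;
                    let _ = reply.send(next_conn); // hand out a claim
                }
                Err(mpsc::RecvTimeoutError::Timeout) => {
                    // Periodic internal bookkeeping would go here.
                }
                Err(mpsc::RecvTimeoutError::Disconnected) => break,
            }
        }
    });
    tx
}

fn main() {
    let actor = spawn_actor();
    let (reply_tx, reply_rx) = mpsc::channel();
    actor.send(Request::Claim(reply_tx)).unwrap();
    println!("claim {}", reply_rx.recv().unwrap()); // prints "claim 1"
}
```

The point of the shape: the actor stays responsive to its bookkeeping timer even when no messages arrive, which is exactly the behavior the probes showed.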
Bryan Cantrill:Which doesn't make a lot of sense. But these are also, like, really important data points. And this is what I love, when you kind of, like, gather just facts about the system that seemingly don't make sense. You know that any future hypothesis has to accommodate these facts. It's actually helpful to have them, even though I'm not sure.
Bryan Cantrill:Did it feel helpful at the time, or did it just feel... John, I remember at one point you mentioned one of the hypotheses was that we were actually observing, like, the wrong qorb. Like, the qorb that we were actually observing was one that was much more functional; we were somehow observing the wrong thing. I mean, you're beginning to ask yourself, because things are not making sense.
John Gallagher:A hundred percent. Like, what you said about facts that reject hypotheses: nearly every hypothesis we came up with was rejected based on the facts we had, right? We were like, we have to at least be getting here, but we have evidence that we're not. So, like, any theory that involves getting stuck at this point along the way, you get to throw out, because we can see from the evidence that we're not even making it that far.
Dave Pacheco:We were able to tell that we were not stuck on CPU. We also looked at CPU utilization; it wasn't like the cores were pegged. We could also see the thread that we expected to be running these things was going to sleep and not picking up other tasks.
Dave Pacheco:So it's not like we didn't have a Tokio task to run this thing. Right? We could tell that we were idle and Tokio didn't have anything to run. That, like, ruled out a bunch of theories. And as John said, we were staring at the code, and it's like, there's not that much explanation for what's going on at this point.
John Gallagher:Yeah, one thing to clarify is that everyone trying to claim a connection was going through an mpsc send call, right? So the handle to qorb, which gets cloned all over the place, is basically just a wrapper around an mpsc sender. The channel is of size one, which is relevant later in the story, but it's not immediately obvious at this point why. For all of the senders, the only way anything goes into that channel is the normal send function, which is just send(message).await, and it blocks forever until it's able to send on that channel. So the only two theories we had at this point, spoiler, neither of these was right.
John Gallagher:The only two theories we have are that, yeah, one is that we're observing a different qorb pool. Like, have we somehow spawned two database connection pools, and we're seeing, like, one working, but the other one is
Dave Pacheco:Which, to be fair, we believed from code inspection was impossible
John Gallagher:That's right. The other theory, which I think all along I claimed was statistically impossible, was that somehow the Tokio mpsc channel itself had gotten broken in some way. Which, I mean, we are far from the only user of that, and we use it hundreds of thousands of times a day in a thousand different contexts, right? So the idea that somehow we've tripped over some weird edge case inside the channel itself seemed impossible to me, but I had no other theories at this point about what could induce this behavior.
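For reference, the bounded-channel behavior being described, a small channel where a send blocks until there is room, can be seen with std's sync_channel standing in for Tokio's mpsc. This is a minimal sketch of the semantics only, not qorb's actual code:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // Capacity 1, like the channel in the story.
    let (tx, rx) = sync_channel::<&str>(1);
    tx.send("first").unwrap(); // fills the single slot

    let tx2 = tx.clone();
    let sender = thread::spawn(move || {
        // This blocks until the receiver makes room, which is the
        // behavior all 180,000 stuck senders were exhibiting.
        tx2.send("second").unwrap();
    });

    thread::sleep(Duration::from_millis(50)); // let the sender block
    assert_eq!(rx.recv().unwrap(), "first"); // frees the slot
    sender.join().unwrap();
    assert_eq!(rx.recv().unwrap(), "second");
    println!("ok");
}
```

The mystery in the story is precisely that this normal flow broke down: senders blocked as if the channel were full, while the receiver saw it as empty.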
Bryan Cantrill:I mean, Eliza is saying in chat that this is about when she got her subpoena. I mean, it must feel a little bit like Murder on the Orient Express, because the four of you, Dave, John, Eliza, and Sean, are all debugging this together, and all of your software is kind of intermixed in here. And obviously, like, you know, blameless postmortem, of course. But meanwhile, each of you individually must be thinking, like, shit, is this, how can we put this carefully, my code?
Bryan Cantrill:Or did it feel like force majeure? Did it feel like it was transcending any one of you? Not to personalize it.
Dave Pacheco:I feel like it was still very unclear at this point
Bryan Cantrill:because Okay.
Dave Pacheco:Because it seemed so impossible, you know? And I kept, I wanted it to be in some code that we controlled more easily. But we had a couple of conversations where we were like, is there any way this could not be a bug? Like, how could it not be a bug in the Tokio channel?
Dave Pacheco:Like, I don't think any of us was convinced it was, but we could not come up with a way in which it wasn't, based on just looking at the, like, there's only two or three code paths that touch this channel. What we were able to determine conclusively from the tracing was that the senders were going to sleep on the sending side, and the receiver side was frequently polling that thing and finding no messages. It's like, how is that possible? And if you know the answer, it's like, well, actually, I guess it's kind of obvious, but it was very not obvious at the time.
Bryan Cantrill:Very not obvious. Yeah.
Sean Klein:There's a bit of an oversimplification of, like, our mental model here, which is: if you have an mpsc channel and you have a receiver that is awake, alive, pulling values off of that channel, and is not getting stuck, then senders should be able to send, right? Like, things will get through the sending side as long as they're pushing things through. So there are details that I'm leaving out here intentionally, but with that mental model, we see the receiver is waking up to do other work, like, that task is waking up, and the channel appears empty. So that really felt like a major, like, violation of expectation: the channel seems empty, and no one can put anything
Bryan Cantrill:to it. Right. Right. Right. And it's, Yeah.
Bryan Cantrill:John, go ahead.
John Gallagher:Well, the thing we decided to do here was, like, can we collect more information about the state of the channel? So I think it's about at this point that Dave collected a core from the process at the point it was calling receive. So, like, from the perspective of the thread that happened to be running the qorb actor task, at, like, the first instruction inside receive, collect a core there, and we can load it in the debugger. And then
Bryan Cantrill:And can we just pause for a heartbeat, because this is a technique that we have used a decent amount on our most vexing problems, but I think it's a somewhat unusual technique. Dave, can you just describe a little bit about how you were collecting a core dump from a process that was not dead?
Dave Pacheco:Yeah. So this was DTrace, pid provider, for the win here. So I guess backing up a little bit, I didn't really explain what I had traced earlier. When I created that trace, I was using some DTrace probes that we've created with USDT, from Dropshot around HTTP request handling, and from qorb around claiming database connections and stuff.
Dave Pacheco:And also the DTrace pid provider, which is capable of instrumenting most instructions in userland processes, and also function boundaries, so entry and return. Is that a fair summary? And so I was trying to trace everything in Tokio and everything in Nexus, which is our program, but not everything in the whole process, because there were way too many symbols in the process, and that was too much to trace. From that trace, we were able to see the Tokio functions that were getting called, which included the receive.
Dave Pacheco:And then I was able to use that to trace, so I created a new DTrace enabling that would trace entry to that function. And I stopped the process there using the stop action, which is a destructive action in DTrace. And I ran gcore to save a core file at that point, and then prun to run the process again.
Bryan Cantrill:So this is where we're gonna get the process into a state that is interesting to us by using dynamic instrumentation, and then we're gonna use gcore to grab a core dump of it. And now we've got all of that state in the system, and we can actually go study this thing offline. And I mean "study" maybe literally in a sense, John, because at this point you've got a core, but the core is of limited utility, in part because of some of the ways Rust operates.
John Gallagher:Yeah. That's right. I don't know if anybody's had the pleasure of trying to read disassembled Rust, but it's challenging. So at this point, I spent the better part of a day with Ghidra. So I loaded Nexus into Ghidra, which took, I mean, Ghidra was a champ, it took like forty-five minutes to just do its initial analysis of the binary.
John Gallagher:But it's sort of like source-assisted reverse engineering, I guess, is maybe the right phrase for it. Like, it feels like reverse engineering because there's been so much inlining going on, even though I have all the source to our program and the Rust standard library and Tokio, etcetera. So I think there's some walkthrough in the issue that is linked from the RFD. I don't know if anybody's ever done reverse engineering, like on malware or whatever, but it's a very unglamorous task. You just sort of spend hours staring at assembly, trying things.
John Gallagher:Eventually I'm able to convince myself that I can track that the particular bits of the channel are in these registers at these points in time, and that the things we care about are at these offsets from these registers. I guess at this point we're, like, really digging into the internals of the channel just to understand what it is. So an mpsc channel, and this is gonna be a little oversimplified, but not a ton, is a semaphore and a linked list of blocks that hold the actual messages. So from the receiving side, the thing we could look at was, what's the current state of the semaphore? And it said there were no permits available, which is consistent with all the senders blocking, but maybe is a little surprising if we expected, like, someone to be able to send into it. And there were also no messages in it.
John Gallagher:So, like, we can look at the current head of the linked list. That block will hold 32 messages, and we could see that we were, like, 26 or 27 messages into that block, and we saw all the old messages, and we see space for the new messages. And, like, at this point, we're like, well, this confirms everything we've already seen, which is that from the receiver's point of view, this channel is empty. There are no messages here. It is calling receive. The arguments all look right.
John Gallagher:Nothing looks corrupted. It's just an empty channel.
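A toy model of the structure John describes, a bounded channel built from a permit count (the semaphore) plus a message queue, makes the observed state easier to reason about. Capacity is conserved: available permits, queued messages, and permits held by in-flight senders always sum to the capacity, so zero permits and zero messages means some sender is holding a permit without having enqueued anything. This is a simplified sketch, not Tokio's actual block-based implementation, and all names are illustrative:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Toy bounded channel: a semaphore (permit count) plus a message queue.
struct ToyChannel<T> {
    capacity: usize,
    // (permits available, queued messages), guarded together.
    state: Mutex<(usize, VecDeque<T>)>,
    cv: Condvar,
}

impl<T> ToyChannel<T> {
    fn new(capacity: usize) -> Self {
        ToyChannel {
            capacity,
            state: Mutex::new((capacity, VecDeque::new())),
            cv: Condvar::new(),
        }
    }

    // Acquire a permit, blocking while none are available.
    fn acquire_permit(&self) {
        let mut s = self.state.lock().unwrap();
        while s.0 == 0 {
            s = self.cv.wait(s).unwrap();
        }
        s.0 -= 1;
    }

    // Enqueue a message using a previously acquired permit.
    fn send_with_permit(&self, msg: T) {
        self.state.lock().unwrap().1.push_back(msg);
    }

    // Receive a message and return its permit to the channel.
    fn recv(&self) -> Option<T> {
        let mut s = self.state.lock().unwrap();
        let msg = s.1.pop_front();
        if msg.is_some() {
            s.0 += 1;
            self.cv.notify_one();
        }
        msg
    }

    // Conservation check: permits + messages + permits held in flight
    // should always equal the capacity.
    fn invariant_ok(&self, permits_in_flight: usize) -> bool {
        let s = self.state.lock().unwrap();
        s.0 + s.1.len() + permits_in_flight == self.capacity
    }
}

fn main() {
    let ch = ToyChannel::new(1);
    ch.acquire_permit();
    // A sender that stops right here, permit held but no message sent,
    // leaves the channel with zero permits and zero messages: receivers
    // see it as empty while every other sender blocks in acquire_permit().
    let (permits, msgs) = {
        let s = ch.state.lock().unwrap();
        (s.0, s.1.len())
    };
    assert_eq!((permits, msgs), (0, 0));
    assert!(ch.invariant_ok(1)); // the missing capacity is the held permit
    ch.send_with_permit("claim");
    assert_eq!(ch.recv(), Some("claim"));
    println!("ok");
}
```

In this model, the debugger findings (no permits, no messages, receiver polling an empty queue) are consistent only with a permit being held somewhere outside the channel, which is where the theories at the end of this section point.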
Bryan Cantrill:Which, on the one hand, okay, this is, like, vexing, because there's still obviously a huge mystery here, but we've got a lot of things that are at least lining up, right? This is not rampant data corruption, which could be maybe some of the other fears that one might have. It's like, this program is in a state that we do not understand, but we have some self-consistency here.
Bryan Cantrill:I mean, I I assume that that was of some solace. Maybe Yeah. It's easier to say that after the fact. Exactly. Yeah.
John Gallagher:I think that's right.
Bryan Cantrill:Yeah. Like, yeah. Alright.
John Gallagher:I mean, there was definitely a part of me that was hoping I would look at the, like, current permit count in the semaphore and see, you know, just garbage. Right? Like, in some ways, that would be more of a relief. Right?
Bryan Cantrill:Oh, well, I'm glad that you would have been very relieved by that. Meanwhile, the rest of
John Gallagher:us Yeah.
Rain Paharia:Yeah. Well, don't
Bryan Cantrill:us who would like a reliable computer would have been terrified by it. But
John Gallagher:Play back the recording. I don't think I said very relieved. I think
Bryan Cantrill:I would have said Early. Exactly. Yeah. Fair. Fair.
Bryan Cantrill:Fair. Exactly. Fair.
John Gallagher:It would at least explain the symptoms that we had no theories for. Right? Like, we seem to have an empty channel with 180,000 tasks trying to send to it. Like, this doesn't make any sense. Right?
Bryan Cantrill:Yeah. Think this is So I think
Bryan Cantrill:Go ahead, Eliza.
Eliza Weisman:Oh, this is when I think John started asking me some questions about the semaphore that started to make me feel really uncomfortable.
Bryan Cantrill:Right. Exactly. Like, I feel I need my lawyer present for these questions. Why are you asking me such a pointed question? Right.
Eliza Weisman:And we did some stuff. One, we did, like, a lot of just walking through the semaphore code and looking for things that looked like potential issues that might have caused this weird behavior. You know, it's hard to do that, because the last time this code in Tokio changed substantially, I checked the blame, was when I rewrote the semaphore five years ago. And so it's been like that for five years. And if there was a bug in it that caused it to, you know, fail to distribute permits in the way that it's supposed to, somebody should have seen that before this happened to us.
Eliza Weisman:Right? So it feels kind of implausible, but
Bryan Cantrill:But I love, Eliza, I just love the trepidation in your voice, where you are clearly telling yourself this in the mirror, because you also must know that, like, well, it's definitely possible that someone would see this pathology and not be able to dig enough to, I mean, it feels like it's extremely unlikely, but not impossible?
Eliza Weisman:This is about where I think some combination of Dave and John and I started coming up with, like, scenarios in which we thought something might misbehave, that I think involved, like, possible races around acquiring and dropping permits. And I wrote some, like, Loom tests. Loom is a concurrency model checker that Carl, Carl Lerche, actually wrote while he was rewriting the Tokio scheduler at about the same time. It essentially does, like, a simulation of possible interleavings permitted by the C++ memory model, which is the memory model that Rust's atomics implement. And it can be very useful for finding this kind of, like, weird deadlock where maybe some particular ordering of events results in something happening.
Eliza Weisman:And so I started trying to write some tests that I hoped would trigger this kind of hang, based on some of the scenarios we were dreaming up. But we really could not get anything to break, which was pretty unfortunate.
Bryan Cantrill:So what's the next breakthrough at this point? I mean, first of all, I love that: like, okay, let's go back to some of the tooling that we've got, in terms of Loom, to go see if we can induce this in a way that we can detect it.
John Gallagher:Yeah. So the next thing we did, and this was another fool's errand in hindsight, was collect a core from one of the senders. Right? This didn't go quite as smoothly. I don't know, Dave, if you wanna talk about that.
John Gallagher:We ended up accidentally killing the process in the steps to collect this core.
Dave Pacheco:Yeah. So I was going to mention, there's this context that I don't think we've actually said, or maybe it was implied, that we have never seen this problem before, and we had no idea what triggered it
Bryan Cantrill:at all.
Dave Pacheco:And it's a complicated system. It's a complicated control plane. It does a zillion different things. So we had no idea how to reproduce it. All we had was this one precious process.
Dave Pacheco:And that's why we were doing this sort of production-focused debugging, in terms of using DTrace and tracing these individual things and trying to get core files and disassembling and trying to pick the state out, because we had no other way to get more information except from the live system that had hit this. And then, yeah, I don't know how much it's worth going into the digression, but I tried to do the same thing on the send side, and I forget exactly what problem I ran into. I called in Adam for some help, and then I did something I shouldn't have done. I think I tried to use mdb with nostop, so that it wouldn't stop the process, and set a breakpoint, and then exited.
Dave Pacheco:And Yeah.
Adam Leventhal:You wanna guess how that one went, Bryan?
Bryan Cantrill:Not well. Yeah.
Dave Pacheco:The good news is we got a core file.
Adam Leventhal:I think you can guess what instruction it was on. Yeah.
Dave Pacheco:We got a core file in precisely the context we wanted a core file.
Bryan Cantrill:There you go. Right. That's the good news. The bad news is there'll be no future core files because this core file is actually the last core file from this process. This process has now died.
Dave Pacheco:That's
Bryan Cantrill:right. Okay. So now, with the killing of that, have we now lost the problem?
Dave Pacheco:We did, yeah. But John, do you want to talk about what you found from that one?
John Gallagher:Sure. I mean, remember, earlier we had two theories. One is, like, the internal state of the channel is corrupted somehow; from the receiving half, everything seems fine. The other theory we had was maybe we've somehow spawned two pools.
John Gallagher:So we're like, if we collect from the sending side, we can at least see if they're talking on the same channel. And we were able to confirm that. So from this core collected on the sending side, during the death of the process, we confirmed that it was the same channel, and that the channel state looked the same from the sending side. Like, there are no permits available, which explains why all the senders are blocked. So I have a comment from the issue that basically, after we did the analysis of both of these cores, the internal state, according to both cores, is consistent with all of the evidence we had seen from the DTrace probes, which is that the semaphore has no permits available, therefore all the senders are blocked.
John Gallagher:And also the channel has no messages in it, which explains why qorb is continuing to process but not handing out any claims. And I wrote down in the issue what our theories were at this point. One was a sender had grabbed a permit and leaked it without inserting a message at all. Second was a sender had grabbed a permit and was still holding on to the permit, but wasn't inserting a message for some reason. A third one was the receiver, which is the qorb actor, had gotten a message out, but had failed to put the permit back into the channel somehow.
John Gallagher:Or the fourth one was... this gets into the details of how the channel is implemented, which actually is relevant to the way we ended up tracing the problem here. So this channel has size one, which means it nominally has a semaphore with one permit in it in its sort of initial state. Once senders have lined up to send into the channel, when the receiver pulls a message out and goes to put a permit back into the channel, it doesn't start by putting the permit back into the semaphore directly. Instead, it goes to the first sender waiting in line. Right?
John Gallagher:So this channel ensures fairness. Like, if you have 10 people call send at once on a channel of size one, the first one puts their message in, and the second one blocks. Once the receiver pulls the message out, it's gonna go give the permit directly to the sender that was second in line, instead of putting it back in the semaphore and letting everybody race forward or something like that. So the last of our theories was: did we actually give the permit back to someone but then fail to wake them up? Like, is this a missed wakeup of some kind?
John Gallagher:So those those were the four theories we had at this point.
Dave Pacheco:Is that the Yeah. The second one was the sender grabbed a permit and is still holding it, but not inserting a message.
John Gallagher:That's right.
Dave Pacheco:Is that the is that the same as the fourth one, or is there a different thing you were thinking there?
John Gallagher:I think we were wondering... I think it's the same pathology, but a different reason. Right? Is it that someone got the permit but is failing to insert, or that we gave it to someone but failed to wake them up? In either case, the sender is holding the permit but is not inserting a message.
John Gallagher:It's just a question of why.
Bryan Cantrill:And I think, John, I put this in the chat, but you can find the theory that you just described by searching for "one more wild theory", which was what you added as you were trying to go through the different things that could explain this. And that last one you described is in the kind of wild-theory bucket. And so now what? So okay. We've seen the problem, and we've got some ideas of what this could be.
Bryan Cantrill:And now what's the next step to go investigate this? Has the system righted itself, or has the system just killed itself completely at this point, in terms of the system in dogfood?
Dave Pacheco:So at the time, only one of three Nexus processes had gotten into this state, and that one was totally stuck, but the other two were working fine. The rest of the system was basically fine. And so that process died, but it came back, and I think was fine. I don't remember if
Bryan Cantrill:It was fine. Okay. Got it.
Dave Pacheco:I think it was fine.
Bryan Cantrill:So we are now in the position of like, okay, we've got to now figure out, do we take kind of a half measure in the system to debug this? We know we're gonna, we believe we're gonna see this again. We hope we're gonna see it again.
Dave Pacheco:Well, and also, I don't think we'd given up on the data we had. I think we were still trying to figure out what we could learn from this core file that we did get. And I think it was when we were discussing, having just gone through and figured out what the channel state was, that we started talking about these four theories and theorizing about what could have happened. Right? Am I remembering that right?
John Gallagher:Yeah, I think I told things slightly out of order. The last thing we did, which I described earlier, was actually look at the contents of the linked list storing the messages. So each of the requests to qorb has a claim ID, which is just an integer, like a thread-local incrementing integer.
John Gallagher:So we can actually see from the message block that, like, I think there were 26 old messages, and we actually saw their claim IDs incrementing by one. And then we saw that there were like five empty slots waiting for new messages. That was sort of the last thing we did. It was like the last hope: I think maybe even Eliza said, like, she didn't actually write the message block code, so maybe there was a bug in the way the message blocks were handled, you know?
John Gallagher:So we were digging through, trying to figure out if there was any evidence at all that, like, the sender had actually put a message in there, or the accounting of the bitmask of which slot was available had gone off in some way. And we basically ruled that out from the cores that we'd collected.
Eliza Weisman:This is where John did some very cool debugging, because we realized that these linked-list blocks are actually reused and are potentially uninitialized. So they might look like they had valid data in them even if it's not actually from that pass around the ring buffer. And John was able to figure out that there was this counter that's being incremented, and so we could actually see, like, where the edge of the most recent lap around the ring buffer was. But, of course, all of this was, like, normal and fine.
Bryan Cantrill:And Right. Right.
Eliza Weisman:And then we're just sort of sitting on this this call scratching our heads, and that's where we that's where we've ended up.
Bryan Cantrill:So at that point you knew how the system was acting in a lot of ways, but you didn't have a lot of constraints on what the system was doing. Did you have to see it again to debug it?
Dave Pacheco:No, we didn't see it again. What I remember, at some point, was Sean asking the first leading question that brought us to the answer.
Bryan Cantrill:I actually do wanna make this, and I need to talk with you to figure out how, but I think we can, because we actually recorded this. You recorded this because you were recording the debugging session. Yeah. And it is actually remarkable. It is literally a passing of the baton: it's like Sean, Eliza, John, Dave.
Bryan Cantrill:And it is really, really remarkable. So yeah. Sorry, Dave, you wanna describe kind of the thought process there?
Dave Pacheco:Does anyone else have the transcript? I remember Sean asking some... Oh, go ahead, Sean.
John Gallagher:Yeah. Well, I was gonna say, I think we didn't get the automated transcript, but for the purposes of the demo talk last week, I went back and manually transcribed just this, like, thirty-second snippet. Do you want me to just read what I have?
Bryan Cantrill:Yeah, you should. Because it was amazing. Yeah.
Eliza Weisman:We should each say our parts.
John Gallagher:There, I think Sean, if you wanna start, we'll walk through it.
Bryan Cantrill:And action.
Sean Klein:All right, sure. So I was asking, we were sort of at our wit's end with the existing theories of something being wrong with the Tokio channel, so we sort of took a step back and asked: how could this situation happen in our existing code base? Would this be by a caller polling send, waiting there, and then not polling again, but also not dropping to relinquish the permit?
Dave Pacheco:At this point I say something that's not quite in the transcript: I don't see how that can happen. And I think it's because we'd looked at the code in Tokio's send and saw that it waits for a permit, calls await to do that, and then immediately sends the message. So there's no opportunity for it to not send the message once it has the permit.
Sean Klein:And then I was asking if there was any possible construct that we had set up, such as using a tokio::select statement, where on one arm we're trying to use qorb to get a database connection, and in a different arm we do something else, but for whatever reason we don't leave the function, and we don't drop the first held future in the select.
Eliza Weisman:Well, in order for that to happen, the select would have to select on the thing that's doing the send, but use a mutably borrowed future. I may have said "&mut", that sounds like the type of thing I would have said, so that it doesn't get dropped if it doesn't complete. And then the other arm of the select, the one that we actually go into, would basically just have to have a loop that never returns in it.
John Gallagher:And I said, what if that other arm is itself trying to acquire a database connection? And then if you post the recording, you get to see me staring like an idiot at the screen for about fifteen seconds solid.
Bryan Cantrill:It is about fifteen seconds. Actually, I don't feel this transcript is complete, because I think this is where, Dave, you in here say: did John just blow up all of our brains?
Dave Pacheco:Something like that. John just dropped a bomb.
Bryan Cantrill:John dropped a bomb. That's what you said. Right. Exactly. Yeah.
Bryan Cantrill:Yeah. So then: fifteen seconds of stunned silence, which unfortunately, due to Google Meet, we only have John, like, kind of staring into space.
John Gallagher:I'm well, worse
Adam Leventhal:than I
Dave Pacheco:am blankly into space.
John Gallagher:Yeah. But it's so much worse: I'm, like, staring up at the ceiling. So you're, like, looking right at my... it's awful. Right? It's terrible.
John Gallagher:So at that point, like, after the fifteen seconds, I'd sort of tried to put together everything the four of us had just said, and said: okay, you would have to have a select where one arm is selecting on an &mut future that is trying to acquire a database connection; that future, like, got into the send at that point and then returned Pending because the channel was full. And then we go into the other arm; for some reason that other arm becomes ready, and within the body of the other arm we try to acquire a database connection again, which means we try to send on the same channel, and now we're behind our own other arm in line in this fair channel.
Bryan Cantrill:Yeah.
John Gallagher:And at that point, Dave, like, without missing a beat, says, you just described the code that I'm looking at.
Dave Pacheco:Yeah. Well, I mean, I didn't follow this in real time. I didn't understand what you guys were describing in terms of the failure mode, but I recognized the code pattern you were describing. I was pulling it up, and then you got really concrete in terms of, like, the select arm does this, and it's a &mut future. And I was like, I'm looking at that code right now.
Dave Pacheco:Tell me now how this can cause this problem. I still didn't understand it at this point.
Bryan Cantrill:So then at this point, do you go... because I've actually not watched the recording from that point on. But John, at what point do you all say, hey, if this is the case, we might be able to reproduce this somewhere smaller, like, outside of Omicron, which would obviously be great?
John Gallagher:I would have to look at the recording, but I feel like it was almost immediate. Yeah.
Bryan Cantrill:Yeah. I think yeah.
Eliza Weisman:You were particularly intent, or it was either you or Sean. Somebody left that meeting, like, feeling really worked up about the need to write a minimal reproducer for this. I think it was either you or Sean for sure.
John Gallagher:Yeah. It was I think suggested
Dave Pacheco:exactly how it should be reproducible. Like, it should be really easy and 100% reproducible. Which it was, right?
John Gallagher:Yeah, I think I said: unless somebody else is dying to, it would be cathartic for me, after spending the last sixteen working hours in Ghidra, to write, like, a minimal reproducer for this. That would be very nice. So I left the meeting... I think I had a reproducer up while we were still in the meeting, even. It only took, like, five minutes once we understood what was going on.
Bryan Cantrill:And then it reproduced the problem.
John Gallagher:Yeah, immediately.
Bryan Cantrill:Dave, this is what you and I have called the software engineering equivalent of the walk-off home run, to go back to baseball. And that is just, like, ball game. I mean, that must have felt great, John.
Dave Pacheco:In particular, you're referring to the reproduction?
Bryan Cantrill:The reproduction. I've been staring at this. We've seen this bug once. I've been staring at it. There's so much I now understand about this. Wait a minute.
Bryan Cantrill:If it's this, I should be able to reproduce it this way. And then going to reproduce it that way and seeing the exact same symptoms, it's just, I mean, that's a ball game. That's it.
Bryan Cantrill:So, you know, the ball sails into the night sky, that's it, rounding the base paths. I mean, John, it must've felt great.
John Gallagher:A hundred percent. Followed up very quickly by: oh god, how many other places have we done this?
Bryan Cantrill:Right. Well, listen, have the celebration at home plate a little bit before you get to the... yeah. Right. And this is where you kind of get to the analog of, like, oh, shit. This is the async problem of, like, okay.
Bryan Cantrill:So maybe now it's worth describing a little bit, Dave, what you put in the RFD about why we feel this could be pervasive. Because there are a bunch of different things that have to line up to hit this.
Dave Pacheco:Right. So should we sort of re summarize, like, the crux of the problem here?
Bryan Cantrill:Yeah. That'd be great. I think it would be great to do that.
Dave Pacheco:The problem is basically that we had a Tokio task that was running a couple of different futures using tokio::select. One of those futures was going off and doing a database operation. So that would use this MPSC channel of size one, although the size is kind of a red herring. And there may have been a transient point where that channel was full, because the receiver hadn't picked up the message from something else that was trying to get a connection to the database. The tokio::select is also selecting on a different branch, which I think was a timeout.
Dave Pacheco:And then in the arm of that other branch, it would go do something else with the database. What it was actually doing was collecting a support bundle, and in that other arm, it was checking whether that support bundle request had been canceled, so it could stop doing this thing. So you basically had it poll one future that wound up putting itself in the queue to send a message on this channel, and then poll a different future that resolved, which was the timeout, and then enter a branch that would go and poll a different future, which would also try to send a message on the channel. But now the only thing that task is polling is that second future, and it's actually blocked on itself, because it's blocked on the first future, the one that currently has the permit.
Dave Pacheco:So this can happen... I mean, there are a lot of ways this can happen in principle. But where it would happen most likely is in a situation where you've got a task that is polling multiple futures concurrently, or it has polled multiple futures and is now only polling a subset of those futures, and that subset depends on the previous futures, the ones it's no longer polling, completing. Does that explain it?
Bryan Cantrill:Yeah. And I think this is where the task-versus-future distinction is really, really important. And load bearing, obviously.
Dave Pacheco:Yeah. That's a good point. So tasks I sort of think of as the Rust async analog to threads, which is, like, not quite right, but, like, you can go spawn tasks, and they're individual schedulable entities in, like, the Tokio runtime. But futures are the unit of asynchronous work. So every task has sort of one top-level future, I think we can say.
Dave Pacheco:But that future may itself concurrently poll a bunch of other futures. So you can have concurrency without parallelism, without multiple tasks, in this way, using select, and there are other constructs that let you do that.
Eliza Weisman:Yeah. The way I often speak of this is, I like to say that if you understand the difference between concurrency and parallelism, a future is the unit of concurrency in async Rust, but a task, in Tokio, is the unit of parallelism. And that's really important to understanding, like, this bug and also how to avoid it.
Bryan Cantrill:That's right. And, like, why would you not just do everything in a task? It's like, well, they're a little heavier weight. It's heavier weight to create a task. There are also reasonable spots where you'd wanna actually have concurrency without parallelism. You don't wanna have the cost of parallelism, in terms of memory or potentially in terms of CPU, but you actually still wanna have concurrency.
Adam Leventhal:Well, not only that, but cancellation can be a feature. Right? Like, in some circumstances, you want futures because, like, you want the feature of cancellation, as opposed to the curse of cancellation. Like, if you're trying to time out a call or whatever, it's a very convenient way to build that kind of state machine. So Yeah.
Adam Leventhal:I think it's not as crisp as, like, the heavyweight or lightweight of it. It's a lot about the intention of the code.
Bryan Cantrill:Yeah. Interesting. And in this case, I mean, Adam, the particular spot where we hit this is kind of exactly what you're talking about. We have a timeout. We actually want to time this thing out if it takes too long.
Bryan Cantrill:So it's actually quite reasonable for that not to be in its own task. That's right.
John Gallagher:So I think on that debugging call, like, after we had the sort of crystal-clear reproducer, a lot of the subsequent thirty or forty minutes is us, sort of on the spot, having just found this specific problem, trying to do, let's call it, a blameful postmortem on a particular construct. Right? Like, is the problem that we used a channel of size one,
Bryan Cantrill:is the problem
John Gallagher:that we used, like, select with a mutable future? I think we were pretty uncomfortable with select, sort of generally, because we'd had problems with select and cancellation, right? On the podcast we haven't had yet about async cancellation, select is a big source of that too. So when this happened, like, when Dave's RFD got picked up by Hacker News and other websites, a lot of people landed on select. So I think, Dave, you and I, pretty soon at that meeting, were like, I don't think... this isn't specific to select.
John Gallagher:Like we can rewrite this to without using Select at all, using any of these other concurrency within a future without parallelism constructs, like futures unordered or even futures ordered. All of those, anything that lets you pull multiple futures, well, not anything, but like there's sort of like a specific set of circumstances that is not specific to Select that can induce
Eliza Weisman:In fact, most of the prior art for discussion of this specific category of issue in async Rust is a blog post by Without Boats, which deals specifically with FuturesUnordered and how to write this type of bug with FuturesUnordered.
Bryan Cantrill:Right. You've got other ways of getting into this exact issue. And no matter how you get into it, it's really not easy to debug. I mean, I've always said that deadlock in a traditional threaded system is kind of a straightforward problem to debug, because, as our colleague Cliff Biffle is fond of saying, the program counter is actually a very powerful piece of state. And when you have traditional deadlock, you know, A-B/B-A deadlocks between two threads, the system is stopped, and it stopped in a way that, if you can understand where these two threads are, you can understand the lock-ordering violation.
Bryan Cantrill:It's like it's actually not that hard to debug traditional deadlocks. Live locks are harder to debug because the system is moving, but it's not making progress. And where you are, you However, you are constructing the live lock, you are not able to make progress because these two actors or multiple actors, the actors in the lowercase a thread sense are not able to satisfy a condition that allows them to collect would move forward. But I feel like this is something different. I felt, I felt like this is like a different kind of pathology because there is when you are locked up on where you've got one future depending on the the kind of a future being executed on another arm like this or a future that is that's gonna be executed by the same task.
Bryan Cantrill:However that is, whether it's on the select arm or what have you. You don't have there's not a great way in the system to be like, hey, by the way, can I like what are all of the futures that exist in the system? It's like, it's just a random way of answering that question?
Adam Leventhal:Well, can I pause for a second there, Bryan? Because I think you said something very important about, like, deadlocks being easy to debug. But I would just observe that, like, we've made deadlocks easy to debug. Yes. That's right.
Adam Leventhal:We
Bryan Cantrill:Yeah. Yeah.
Adam Leventhal:Or, you know, we've spent a lot of time with this multithreaded model, where it's got this nice property that, like, a thread can be at one place at one time. And as Dave and I and others have discussed, with tasks and futures, you don't have a stack, you have a tree. You have a tree of the multitudes that any task contains, of all the places in code that it logically represents. Also, like, in MDB and debugging, and the work that we've done for many years, you can walk up to any lock and say, who owns you?
Adam Leventhal:Yeah. What what what's your state? And there there's a lot of complexity introduced by asynchronous, but then there's also a lot of miss that that kind of experience and tooling and observability that's also absent.
Eliza Weisman:Yeah. And a case I would make about this issue is that I don't think this is a fundamentally novel kind of bug. It's a fundamentally novel-to-async-Rust way of writing that bug. And I think that that distinction is actually pretty important. This is kind of a normal non-reentrant mutex deadlock.
Eliza Weisman:It's just that the way in which you are contending that mutex twice on the same thread in two different places is very unobvious to the programmer. The issue here, the unique thing about what is future lock is that it's a combination of language constructs that conceals where you have accidentally written a actually like pretty traditional non reentrant mutex deadlock.
Bryan Cantrill:That's a very good point, Eliza. The way
Eliza Weisman:that async hides it from you, that's what makes this interesting and worth discussing.
Bryan Cantrill:I think that's a very good point.
Dave Pacheco:I think that sort of misses... there's this mental model, I think, that people have, and that the traditional presentation of async Rust tries to provide, which is that tasks are sort of like threads, and you can sort of write sequential code with await, and it's kind of a lot like threads. But this is a failure mode that, at least, I don't see how it's the same as anything that can happen in a threaded system. The closest thing I can think of is: this is sort of like if the kernel scheduler woke up a thread and then didn't put it on a run queue; took a runnable thread that was just woken up and didn't put it on a run queue. That's not really a thing that happens.
Dave Pacheco:I mean, only because that software is so mature now. But, like, that's a weird thing to be able to introduce in the userland analog.
Eliza Weisman:I'm going to push back on that a little bit. And the reason I'm going to push back on that is because I think, well, you know, if you write only straight-line async/await code, and you spawn a bunch of futures that just do sort of straight-line sequential await, await, await, and you treat tasks as being the same as threads, you will never have this bug. This bug requires some primitive that combines multiple futures together in the same task and allows them to be awaited concurrently, such as a select or a FuturesUnordered or something like that. So if you really treat tasks as being exactly the same as threads, you actually can't have this problem. It's once you get into "I'm trying to select over this large set of multiple concurrent notifications from different sources" that you have this problem.
Eliza Weisman:And that starts to look like... what's the threaded equivalent of that? Is it, like, signals? Because signals are pretty messed up, right? You're now outside of the land of normal straight-line threaded code.
Bryan Cantrill:Oh, I actually do have an analog for you in kind of the threaded world. And I think, I mean, part of the challenge here is that the parallelism has now become load bearing. Because, as you point out, if you actually have true parallelism in addition to the concurrency, you don't hit this. And it reminds me, Adam, of, you know, my favorite whipping boy, the M:N scheduling model. The M:N scheduling model is like, man, come on.
Bryan Cantrill:It's been like thirty years, dude. Like, you coming after me again? It's like, yes. You know, just to be I think no point.
Adam Leventhal:Fair to call it green threads in like the modern prevalence?
Bryan Cantrill:Mean, for sure.
Adam Leventhal:People aren't confused.
Bryan Cantrill:Yeah. Fair fair to call it green threads. And the and the but there was this and the and the the I think it was I think it was Smaller's and Co. That had originally coined the term MNET scheduling model. But the first implemented in Solaris back in the day, imitating mother systems and bad idea basically.
Bryan Cantrill:It had a lot of problems. Problems that it had was that if you had because and it because you could have a program that appeared to be correct, but the the the parallelism that you would get from kernel schedule entities, LWPs, lightweight processes. The it's only a lightweight process that could actually execute on the CPU or the kernel. And if you had if all of your threads became blocked in the kernel on an LWP, runnable thread at user land, a runnable thread that was not kernel schedulable, and all of your existing LWPs are actually blocked in the kernel. So the process should be runnable, the threads should be runnable, you think you're runnable from a program perspective, you're runnable, but if you are if those things are blocked in the kernel with the dependency on the thing that should be runnable but can't run, you have this problem.
Bryan Cantrill:So it and this was a I I mean, it'd be I I interesting to get into the history of this. This was kind of discovered. I don't know if I you know, the the people who really know the history of this, I mean, sadly Roger who's passed away, but the this must have been discovered, Adam, sometime in the I mean, is obviously Solaris two dot o. It's kind of in the the late it started in the late eighties and it's in the it ships in kind of the early nineties, '90, like '92, '93. And it's a disaster.
Bryan Cantrill:And it's it's sometime in Solaris two dot three time frame where I think this is discovered. And this is it's bad because you and it that you have a correct program that can actually work if they're not actually creating a LWP for every thread. And so they came up with a complete clue for this and Eliza cheese, your I mean, signals are involved. They came up with a signal called sig waiting and it would drop. It'd be kernel would realize the kernel this is terrible.
Bryan Cantrill:I I I just don't shoot the messenger on this one. The the the the kernel will be like, oh, you've got all of your threads are are blocked in the kernel right now. I'm gonna deliver you this signal called sig waiting. It was also even the name is like, who's waiting for what? I mean, are we we're waiting for a proper implementation of threads.
Bryan Cantrill:Is that what we're waiting for?
Adam Leventhal:That's like a department of truth on that one. It's a department of truth.
Bryan Cantrill:And then so you would is you would get dropped this sig waiting and then the and it would the the the default operation for for Lipsy would be to create an LWP and then it would potentially be able to run this thing that was runnable. And I mean, unsurprisingly, this made and actually part of by the work I did again as an undergraduate, this is a long time ago now. This is this is actually quite literally thirty years ago. But we had a really hard time getting like reproducible performance on this thing. And it took a little while to figure out it's like, oh, that's because like every once in a while your LWPs would like happen to get blocked in the kernel and you would get a sig waiting dropped on you and then be able to use parallelism that you actually always had in your program because you're like, oh, wow, I've gone faster now because I've got this this sig waiting.
Bryan Cantrill:It was was bad. And but it was it it it was one of the end of course, the signal is still still present. I and does I think it's now basically SIG user three effectively and hopefully not mimicked by other systems. But it was it it was it is similar to what you're what you're describing. And I I I think that, like, I I understand that, you know, kind of the thrust of what you're saying, but I also feel that like from a programmers perspective, it's like, it feels like a correct program that is that that the the run time is kind of doing dirty because it's the the the abstraction that's been provided is not is not it's not behaving by the principle of least surprise for sure.
Adam Leventhal:That's right. And in a signal handler, like, you know you're in a weird spot. It's not like you would go in your signal handler and start, I don't know, executing arbitrary complex code and grabbing locks and whatever. So I think the analogy sort of makes sense in that you've got a thread that's effectively in two places at once, but it breaks down a little bit in that people are very constrained about what they should be doing and what they typically do.
John Gallagher:In terms of, like, programmer feeling, right? The point at which, on the debugging call, when I described this construct that we'd have to have, and Dave was like, I'm looking at that, that felt like a miracle to me. Because there are, I mean, literally hundreds of places in Nexus that wanna claim database connections. Right?
John Gallagher:So if we boil this down to just its reproducer, it's like: you have a channel and a select with a very particular construct, where one arm wants to send on that channel and another arm tries to send on that same channel inside its body. But in the actual code we have, those things were far separated and spread across at least three or four different functions. And it could have easily been significantly more than that. Right? Like, the channel is hidden deep inside of qorb.
John Gallagher:The fact that we're trying to claim database connections is hidden inside these other methods. Like, from the select, it's not at all obvious that it's sending on a channel at all, ever, let alone that it's got two different places where it's sending on the same channel. Right? It's a very non-local-reasoning kind of problem to look at.
Dave Pacheco:And it's also worth mentioning, I think, that that is a good thing. Like, that's abstraction, as far as those other layers are concerned. Right? It's like, I've just got a method where you claim a database connection. It's async.
Dave Pacheco:I happen to do it with a channel inside, but that's not really your business. And then above that, we have the data store, which is our interface for making all kinds of queries to the database, and it's taking care of the connection management for you. So you make some query, it gets a connection for it, makes the query, gives the connection back. Fine. Again, there's another abstraction there.
Dave Pacheco:And then we have a couple of layers above that that this code is using to go collect a bunch of data that would be useful, some of which came from the database. So that's part of why we spent a lot of time debating what we did here that was wrong. You know, like, which of these patterns was wrong? Is it having a borrowed future in select? Is it having an await in a branch of select? Is it using a channel with capacity one?
Dave Pacheco:Is it using channels at all? Is it writing software? Like, it was not that clear.
Adam Leventhal:It's the writing software.
Dave Pacheco:I think we kind of went around on it.
Bryan Cantrill:It was original sin. Right.
Dave Pacheco:And, you know, something you said a long time ago: when engineers are struggling to come to consensus and really disagreeing, the answer is not actually clear. Like, we just don't have enough information to make that determination. And I think it's sort of similar here, that none of these things is the bad thing. And that's part of what's a sort of gut-punch moment for us, where we're like, what? What do we do?
Dave Pacheco:How do we not have this problem again?
Eliza Weisman:Yeah, I think the answer
Dave Pacheco:is like all of these things together, and we just have to be careful when we see any of them is kind of what it boils down to.
Eliza Weisman:Yeah. What really hurts about this is that no individual piece of this is wrong. No one is at fault. And this is where a recent Bryan Cantrill Bluesky post came from, our discussion of
Bryan Cantrill:this. Right.
Eliza Weisman:Where I said, you know, all postmortems are blameless, but this one is especially, or more blameless than others. And Bryan immediately had to post that on Bluesky.
Bryan Cantrill:Which everyone assumed by the way was a reference to the AWS outage, which I thought was kind of funny. I'm like, no, no, no. That's not the AWS outage at all. Yeah.
Eliza Weisman:And, like, we spent a long time, myself included, in our discussions of this, really trying to find a way for Tokio to have been at fault here. Right? Both in our investigation of the bug, but also in our sort of postmortem discussion. We spent a lot of time thinking, like, could the library have provided an abstraction that is less leaky? Could it have documented behavior more clearly?
Eliza Weisman:Could it have done something to stop you from doing this? And the answer is, well, not easily. It potentially could have not allowed you to use a mutably borrowed future in a select arm at all, but there's actually a great deal of legitimate code that is safely doing that, and that wants to do that in order to express something that you actually have to do that to express. You have to, you know, await some future while you also wait for some other thing, and when that thing happens, you go and do something and then you go back to continuing to drive that other future forwards.
Eliza Weisman:And that's actually a thing that's very useful to be able to express in select, and Tokio could go out of its way to prevent you from doing that, but that's really a consequence of the Rust standard library permitting a mutably borrowed future to be polled. And so it's kind of like all of these pieces individually are correct. The library has done very little wrong in the way that it communicates stuff to the user. We found maybe a small area where the documentation could be made clearer, which I think Sean did go and fix, but overall, no one has really done anything wrong here. And that's what makes it so sad: you end up in this really bad situation through every individual piece working correctly.
Adam Leventhal:I don't know if this is the right time to pile on, but I would also say that the opposite of what you're describing, Eliza, is what I like, and I think what we like, about Rust. Right?
Bryan Cantrill:Yeah. Yeah.
Adam Leventhal:Exactly. Yeah. There is so much about the structure and specificity of Rust that keeps you from needing to, like, hold these constraints in your mind, like you do with C++ memory management. There, it is upon the ingenuity of the programmer to document these things, to write about these things.
Adam Leventhal:You don't have sort of local reasoning, and instead you have to uphold these constraints without particular support. And I think this is where I, perhaps wrongly, always feel betrayed by these async Rust idiosyncrasies, where I just feel like the thing that I liked about Rust, you know, the rug gets pulled out from under me.
Bryan Cantrill:You had higher expectations for your Rust.
Adam Leventhal:I'd say. I'm not mad. I'm just disappointed.
Bryan Cantrill:I'm not mad. Just disappointed. Yeah, I know. I know C's a mess.
Bryan Cantrill:We know C's a mess. Well, so, you mentioned C++. You know, we were kind of comparing it to RFD 397 and kind of the async cancellation issue. Ring the chime for a future podcast episode, I guess, Adam. I'm not sure what that sounds like.
Bryan Cantrill:Okay. Sorry to put work on you there, Adam. I mean, you had a very good point about, like, yeah, but you know what? Let me tell you about all the problems we don't have in this system. It's like, yes. You know, I gave a talk a couple years ago, Zebras All the Way Down, about the adage that medical residents are told: if you hear hoofbeats, think horses, not zebras.
Bryan Cantrill:But in software, if you're building on reliable systems, actually what's left are the zebras, and this is kind of a zebra that's left.
Dave Pacheco:Yeah. I don't remember exactly how I phrased it when we were talking, but it was something about what's left. Rust has taken so many of the types of runtime problems that we used to have in past lives in Node.js and C and stuff like that, problems that were crashes or surprising dynamic behavior, and made those compile-time failures, and that's great. And that means that what we have left are these doozies.
Dave Pacheco:And I feel like we've had, like, three of them now that have been really rough.
Bryan Cantrill:They have. Okay. But I feel this one has been less rough. Do you feel that way or no? Am I the only one who feels, I guess
Dave Pacheco:it's you know, the part
Bryan Cantrill:that I'm speaking to the stands, so I guess it's a little easy. Why don't they send the runner? Why don't they send the runner? I just don't understand.
Dave Pacheco:I mean, I look back and I'm like, look, this took us like two days. It felt like so much longer than that. And there was so much of those days where I was thinking, oh my God, how are we going to figure this out? I have no idea. Like, this is completely contradictory.
Dave Pacheco:I have no possible explanation for any of this and no way to gather more information about any of this. We're doomed. It's all over. And then we figured it out. And so, like, the cancellation one, that was definitely a much more prevalent problem.
Dave Pacheco:That is, our discovery of the impact of cancellation in our code base, that was a much bigger version of a sort of similar thing. And the other one I was thinking of was the one over the summer that we talked about earlier, that Tokio LIFO slot thing.
Bryan Cantrill:Yeah. Yeah.
Dave Pacheco:And, like, that was at least a very narrow problem that we fixed, and it's done.
Eliza Weisman:And we were on the LIFO slot bug for over a week. This one was a bit shorter than that. It was.
John Gallagher:Well, only because we reuse the tooling we built during that week.
Dave Pacheco:That's a good point.
Bryan Cantrill:That is a very good point. Yes. Yeah. And I actually did not realize that you all had done this from one instance of the bug. Honestly, I thought we had seen this again.
Bryan Cantrill:I did not realize that we'd only seen this once, which is just really impressive, I gotta tell you. And Sean is saying it in the chat, but it also really bears emphasis: we got really lucky with this, because it happened in dogfood, in our dogfood rack. Which, you know, Sean, you were an early advocate of getting our software up and running for ourselves as quickly as we could. And that's just been, I mean, so huge so many times over, that we were able to have this in front of us.
Bryan Cantrill:And then being able to actually debug it, and honestly, having the wherewithal on this one to kind of debug it on the first shot. So I think it could have been much worse. I guess one question is: do we think we have seen this in the past and not diagnosed it correctly?
Dave Pacheco:So I don't think so. That's a good question, because I don't remember a hang like that. But there's another element of this story that is interesting to me, which is that we mentioned at the beginning we're on the run-up to our first release candidate for shipping live update. And one of the things we were doing, sort of jamming in at the end: I asked Sean the week before, like, hey, can you add to our existing support bundle facility the ability to collect a bunch of information about the updates? Because that way
Bryan Cantrill:That's low risk.
Dave Pacheco:It seems low risk. It seems so low risk.
Eliza Weisman:Like, if
Dave Pacheco:I looked at that PR, I would have been so confident that that
Bryan Cantrill:was Oh, absolutely. This is great. This is gonna help us debug systems in the future. This is great.
Dave Pacheco:Yeah. And then like that Sunday, like three days after it integrated was when we hit it. So it's not a coincidence that we hit it in tests or
Sean Klein:Yeah.
Dave Pacheco:And another reason to think we hadn't: I think there was use of the database before, but that code path that was added ended up using the database much more heavily, for many, many more queries than before. So it made it much more likely that this would happen. And that part feels pretty relevant.
Rain Paharia:Yeah. I have to say, I've definitely hit this bug, and I did not realize
Bryan Cantrill:what was going on, yeah.
Rain Paharia:I've definitely hit this bug a couple of years ago, and I was just scratching my head, and then I rewrote it and it was fixed. Well, it turned out it just didn't tickle this bug anymore.
Bryan Cantrill:Well, it's the kind of bug that you could hit and be like, oh, I hit it, I didn't understand what was going on, I restarted it and everything worked. So, like, I don't know, and never hit it again.
Eliza Weisman:I definitely have seen similar pathologies in a past life, in my previous job at Buoyant. We had written at least one very similar bug, and I did come away from it with some skepticism of the use of selects in loops with mutably borrowed futures. But I also came away from it thinking that's still a useful tool that is worth having, which I still believe. And I didn't come away from it with a name, or nearly as well-defined a checklist of: these are precisely the pieces that have to come together to cause the bug. Which is actually what I think is so interesting about both this bug and the investigation and the writing that we've done, or that Dave has done, in RFD 609: we have a very specific checklist of, you have to have all of these features in order to have the bug. If you have some of these features but you're missing one of them, then the code is actually safe. And that is the one good thing I can say about this.
Eliza Weisman:It takes a bunch of stuff happening at the same time, and that's actually pretty well understood now. You can certainly still shoot yourself in the foot with those factors. It's like the fire triangle, right? Where you have to have, like, fuel, and what's the third thing?
Bryan Cantrill:Oxygen. Yeah, you've got oxygen. Yeah. Fuel and
Eliza Weisman:Yeah. It is certainly still a footgun, in that some of the vertices of the fire triangle can be hidden very deeply from the code that you're looking at. But we do at least understand that it actually requires all of these factors at the same time, and if it lacks any one of them, it's safe.
Adam Leventhal:I love this, Eliza, but it's safe, dot dot dot, for now. Because, like, one of those, you know, the spark or the oxygen
Bryan Cantrill:Hey. Safety begins with you. You know?
Adam Leventhal:That's right. But the spark or the oxygen can be added nonlocally, in ways where it's not clear that you have now made this toxic brew.
Eliza Weisman:But knowing that gives you an opportunity to program more defensively, which I've seen Sean do. Like, Sean has done some refactoring of code in qorb, which was not a victim of this bug, to make it harder for subsequent changes to that code to introduce this type of bug inadvertently.
Adam Leventhal:It sucks to have to do this. But
Sean Klein:There's a little bit of an irony here too. I mean, we should talk at some point about the cancellation issue that we previously encountered, but one of the pieces of advice coming out of the cancellation issue was, well, you can mutably borrow a future. You can just not drop it, to keep it alive for longer, to avoid cancellation issues if you need to. Which, I'm not saying that's contradictory advice with the thing that we've experienced with futurelock, but I think there's careful threading of the needle to be done here, where
Bryan Cantrill:Yeah.
Sean Klein:Like, a very easy way to avoid futurelock is to not borrow futures at all: you drop them immediately, or you spawn everything on distinct tasks. And I think it's an interesting aspect of this, that if you take a step in any particular direction, you're opening yourself up to new pathologies that you need to guard against in a different way. There is no silver bullet here. It's possible to program things correctly, but you have to be defensive at each angle: cancellation safety versus futurelock safety versus, if you spawn a bunch of tasks, making sure you're coordinating between them correctly and you don't leak them or anything like that.
Rain Paharia:Right. So, you know, using mutably borrowed futures is one of the things you do to avoid cancellation issues. But in this case, having a cancellation was actually one of the ways that you would avoid the futurelock bug. A different way of avoiding a futurelock bug is to spawn a task. So there's kind of both sides of those things.
Rain Paharia:As Sean said, you can take a step in any direction and it'll be fine, but there's kind of an easy place to end up where, oh, actually, you have this bug.
Adam Leventhal:Okay. Rain, at this point, you know, Sean and I were talking with someone on social media, and I just want to read something that was written, which is, they say: I suppose my gripe is with the frame that this is somehow a Rust or Rust async problem, when it just seems like a knuckleheaded coder's problem. I've done dumb things exactly like this. Never thought to blame the language or the async runtime. And I think that the
Bryan Cantrill:It's the wicked child. The wicked child is here.
Adam Leventhal:It's funny, Bryan, that's exactly what I was thinking when I saw this. Like, of the four children asking their questions, clearly the wicked child. And I just felt like, you know, I get it. I get why it's easy to blame the folks who've experienced it. But exactly as you're saying, Rain and Sean, this is not obvious.
Adam Leventhal:This is also a correct pattern for many types of behavior. And there's almost literally nothing in terms of documentation describing cancellation at all or its impacts, and then very, very little that might help one in this situation.
Rain Paharia:I think, yeah, I mean, for what it's worth, as someone who prides themselves on being a toolmaker, I think it is never the human's fault. I have very much the customer-is-always-right kind of thing: if something goes wrong, then the tool's at fault. I think it is genuinely unhealthy to have a different attitude here. Just my opinion, but that's the kind of attitude I have.
Bryan Cantrill:Well, yeah. And I think, Adam, you're absolutely right about the documentation, which is why I felt like, Dave, you writing RFD 397 on async cancellation was so important. And then, Rain, you on RFD 400, and then, Rain, your talk at RustConf. And so, Dave, maybe you wanna talk a little bit about the RFD, because this RFD is really, really good.
Bryan Cantrill:609. Very well written. Very thorough. It's so well written that a wicked child on the internet could say that it's all easy, because it seems easy when Dave has explained it; it's a really terrific explanation of this problem. And so, Dave, you wanna talk a little bit about the RFD? I hope I wasn't pushing too hard on the nomenclature here, but I really did think we needed it, to help other programmers be aware of this.
Bryan Cantrill:Had to name it.
Dave Pacheco:Yeah, that's right. So I'd say there was a similar feeling when we had found this problem to the feeling we had after 397, which I affectionately call cancel-gate, where we're like, we need to tell everyone about this, at least at Oxide, because everyone needs to be aware of it, because it is so subtle. The thing the two have in common is that the failure mode is really bad, huge impact in a production system. It's undebuggable once it happens. There's no way for the compiler to help you avoid this problem.
Dave Pacheco:You just have to avoid it somehow at programming time. So the only thing to really do is socialize the problem and try to come up with some guidelines for people to avoid it. And so we did that with 397, and we're trying to do that again here. And yeah, I tried to write it to also capture a lot of the discussions that we'd had about where the problem was and what you do about it. A lot of the FAQs are kind of along those lines.
Dave Pacheco:So it starts with the
Bryan Cantrill:The anti-patterns. The anti-patterns are really like, oh, hey, wise ass. Oh, do you have an idea? Why don't you go consult the FAQ? Your idea may be in here.
Dave Pacheco:Yeah. I mean, this might be just sort of me putting my thumb on the scale on some of the discussions that we've been having, and kind of seeing if people would push back on it, and they mostly didn't. So hopefully that's all right. But yeah, the RFD basically describes the problem itself, and an example of the problem. I kind of tried to come up with, and John and I spent a while on this, the minimum reproducer that would be clear, where a person would look at this code and be like, that looks fine.
Dave Pacheco:That's not going to deadlock. And then, of course, it does deadlock, and the RFD tries to explain it, and then talks about the different other ways you could hit it. And then, given all those different ways, whether it's tokio select or a Stream or whatever it is, how you avoid it. That's kind of the structure of it. And the idea is, yeah, it'd be a PSA: we should all be aware of this.
Dave Pacheco:And I didn't have a name for it. I didn't it wasn't that clear to me that we should. You know, you made that case and that made sense to me. And I'm really glad that we do have a name for this class of problem. But I was a little reluctant to sort of like coin a term for this.
Dave Pacheco:I'm like, this is just a mistake we made. I mean, it's an easy mistake to make, and it's important to talk about and everything. But the previous title was something like "one task, many futures, and a deadlock", something like that. It was not a summary of the problem.
Bryan Cantrill:A little, you know, Planes, Trains and Automobiles. But yeah, I felt we could do something a little tighter, and I felt like it would be really terrific. And then you also passed your minimum reproducer to some LLMs to see what they thought of it, didn't you?
Dave Pacheco:Yes. That's right. That's right. I'd forgotten I did that. I gave the minimum reproducer to ChatGPT, like, free ChatGPT, and it completely nailed the problem.
Dave Pacheco:It it it described it exactly.
Adam Leventhal:Wait. Seriously? You gave it the code, and it's like, this is gonna deadlock?
Dave Pacheco:Yes.
Adam Leventhal:Wow. Crazy.
Dave Pacheco:I was pretty surprised by that.
Adam Leventhal:Did it tell you, like, what a handsome question you had just asked it?
Dave Pacheco:Well, of course, it did. And it was a handsome question.
Bryan Cantrill:It was right about that, actually. Yeah.
Dave Pacheco:Then I asked it some follow-ups. I can't remember what it was. Bryan, do you remember?
John Gallagher:I asked
Dave Pacheco:for some follow ups and it totally flubbed all of those.
Bryan Cantrill:And was that GPT-5? I can't remember which one you were using for that. But
Dave Pacheco:I'm not really sure. The free thing, whatever it is. Yeah.
Bryan Cantrill:And it's like, look, I am ultimately just a stochastic parrot. I just, you know, lucky-guessed that that thing deadlocked. I don't know, it looked like a deadlock.
Bryan Cantrill:I don't know. It was a great question, is what I was trying to tell you. And then, when I said this is a really good RFD, well, I actually don't think that ended up badly, because I think the discussion on Hacker News was pretty good.
Bryan Cantrill:I did submit it to Hacker News, and it immediately just went nowhere, because there was a lot going on. So I brought it to the attention of the second-chance pool, which I do rarely, maybe once a year, but I felt like this is something where I could make the case that we actually want to get some real attention on it. And then once it was above the fold on Hacker News, it caught fire on its own. And there was a lot of extra discussion that was pretty good, I thought. Dave, what did you think?
Bryan Cantrill:Did you
Dave Pacheco:There was a lot of good discussion. There was a lot of less-than-good discussion, but
Bryan Cantrill:There's also a world-class own in there, I would like to say, Dave. I mean, you really outdid yourself. Adam, how closely did you read the comments?
Adam Leventhal:Not closely enough, apparently.
Bryan Cantrill:I mean, Dave, absolute killer when someone made the mistake. Dave was basically, you know, very cheerful, and, you know, really a lot of great comments. But yeah, among the less helpful comments, there was a comment deriding the names of the variables in the minimum reproducer.
Bryan Cantrill:That these were
Adam Leventhal:What an astute code review.
Bryan Cantrill:Yeah. Very astute code review. Thank you. They quote Dave saying it's really important to understand what's happening here, and the comment is, well, then maybe you should take a moment to pick more descriptive identifiers than future1, future2, future3, do_stuff, and do_async_thing.
Bryan Cantrill:The coding style is atrocious. To which Dave replies: is it possible these names are intentionally chosen and actually do carry meaning? Dave, I love you asking the question.
Dave Pacheco:I'm embarrassed, because I don't feel like this was a high moment of mine, but I was annoyed, and I knew that the best thing for me to do in that situation was to do nothing.
Bryan Cantrill:Oh, no. No. It wasn't. No. It wasn't.
Dave Pacheco:And I was like, look, I could explain this, and it's kind of annoying to explain. It's important that the futures are fungible, that they're opaque and you don't know what's behind them, because that's the nature of the problem: the actual implementation of the future that causes it to futurelock is way far away from what you're seeing. That's why they're called future1, future2, and future3. But that's a lot more to explain.
Dave Pacheco:And like the question wasn't really in good faith anyway. So I just asked it as a question because it was easier for me.
Bryan Cantrill:Well, it was really terrific. And you should know that you brought joy into my life, in that I was just stewing on that, looking at the comment like, oh, God, and then I saw your reply. It was just really great. And then someone else, actually, I mean, this is the great thing about the internet.
Bryan Cantrill:Someone else comes along and even, like, points to the actual code: hey, pal, if you want the real names, go look at the actual code this is referring to. It's all open source. So I thought that was a very helpful and productive comment. But Dave, I thought your comment was very good, and I thought it was the better angels of our nature, is what I have to say.
Dave Pacheco:Thank you. It was, I mean, better than the drafts, I guess.
Bryan Cantrill:Yeah. Exactly. That's right. Better than what you deleted, for sure. For sure.
Bryan Cantrill:But I thought the tenor of the comments was pretty good on there. I mean, Eliza, I know you were in there swinging when people were saying, you know, well, this is a tokio select issue. You kinda pointed out, like, no. No. No.
Bryan Cantrill:Go read this part of the RFD. This can happen with other things as well.
Rain Paharia:was lucky. Person. It was the same person who said that Dave's coding style was atrocious. Oh. It's like, oh, you're the last yeah.
Bryan Cantrill:That's what I was revealing.
Adam Leventhal:Well, I would say there was a similar comment from someone on the Rust async committee, basically fixating on the use of tokio select and saying, like, basically, educating people about the appropriate use of select is a lost cause.
Bryan Cantrill:Oh, that's fun.
Adam Leventhal:Yeah. Exactly. Which seems both unhelpful and beside the point. But other than that, a useful comment.
Bryan Cantrill:But, hey, I'd like to point out that this ended up being the number-one story on Hacker News: 438 points, 243 comments, and Oxide's compensation model did not come up. So whoever had the under on that one, you're a winner. That was
Dave Pacheco:a victory.
Bryan Cantrill:That's actually a victory. That is a bit of a
Adam Leventhal:What do you expect from a bunch of guys who all get paid the same amount? Exactly.
Bryan Cantrill:That's really great stuff. I mean, extraordinary teamwork. And Dave, your kind of summary of it, and honestly, I do wholeheartedly believe this: this, to me, is a tremendous validation of Rust. There are always gonna be subtle issues in a system, and yes, there are lots of ways that async could be better.
Bryan Cantrill:But, you know, when we originally had the async cancellation issue, you kinda compare it to memory safety issues in C and C++, where you can make all these same arguments that really require the programmer to be perfect, and it's so easy to introduce those pathologies. And of course our code is not tautologically bug-free in Rust, but Rust makes it a lot harder to introduce these things. And I think you've made it harder to introduce this kind of bug in the future. And I love your open questions in the RFD about, like, hey, maybe we can get a clippy win in here, a clippy warning about this kind of construct.
Eliza Weisman:I would like to make one very brief statement in defense facing gross, which is that we have been doing this for, what, five years now as a as Oxide in particular. Yeah. And this code base has I just ran account. It has 500,000 lines of rust in it. Wow.
Eliza Weisman:And it took us five years and 500,000 lines of Rust to hit this bug for the first time.
Bryan Cantrill:Yeah. No. I agree. I think that's actually the right disposition. And Dave, you were also, and I don't think I quite appreciated this, you were like, boy, every day I would be looking at a core dump.
Bryan Cantrill:This is, back in the old country, the kind of problem that could have been caught today by the compiler. And I think that's a good reminder.
Dave Pacheco:Yeah, for sure.
Bryan Cantrill:So Rust haters should not be viewing this as fuel for their argument, because it really does not fuel their argument. This does not, I don't think, change our disposition towards Rust at all. Rust async haters, more complicated. But otherwise, that's right.
Rain Paharia:I think one of the things in general that would really help is better observability for some of this stuff. And I think some of these problems are inherently harder to solve because futures are passive and so on. But I think there are many things we could do here, maybe at the Tokio layer, maybe within MDB, which I hear stands for "modular debugger," so you could add modules to it. I don't know. But it would be interesting to explore that kind of thing, right? Like, you mentioned that deadlocks are relatively easy to debug, and, Adam, something you said has stuck in my head, which is that we made them easy to debug.
Rain Paharia:Right? Yeah. So, you know, can we go there?
Adam Leventhal:Rain, I couldn't agree more. I think I count myself now as a Rust async hater, because I think they let the implementation and the evangelism of it so outstrip the documentation. And I don't mean, like, Oxide's generated documentation about these pitfalls; I mean the official Rust documentation. And they let it far outstrip the tooling and the need for tooling. And instead, I think we've accidentally put people in a situation where the promise of async was: use it, don't sweat it, it's like threads but better. But then once you've written 500,000 lines of code, you start finding some of the pitfalls. It'd be nice to have those pitfalls upfront, the lessons for how to avoid them and use things appropriately, the times when programmers do have to uphold abstractions in their minds and the compiler is not gonna help you.
Adam Leventhal:And then the tools to help you when you encounter these problems.
Bryan Cantrill:Yeah. For sure. Well, at least on futurelock, there is actually reason to believe that Clippy might be able to help us out, which would be amazing. But again, great work all around.
Bryan Cantrill:It was definitely exciting to watch, and I really loved watching it. Thank you again, Adam, for your reading, your reenactment of debugging. You can imagine, centuries from now, we're gonna have debugging reenactors that put on authentic twenty twenty-five garb and
Adam Leventhal:And do table reads of
Bryan Cantrill:of debugging Rust async problems. So, you know, debugging futurelock, I think that was really terrific. And it was a lot of fun to watch that moment of great realization. And John, congratulations again on that. Just that feeling of reproducing it is so overwhelming, I feel.
Bryan Cantrill:It is really one of the unique links in software engineering to be able to go do that. And to be able to do it from like, again, the single occurrence of this is just remarkable. Definitely. What's that?
John Gallagher:I definitely slept better that night than I had the two previous nights. Yeah.
Bryan Cantrill:That is awesome. Well, again, great work. I'm hoping we can reduce your need for GEARDRA in the future. Amazing work on that, but that is definitely where that tooling gap lives. So it'd be great to go address some of that.
Bryan Cantrill:But really terrific stuff, and a great way to use the tooling that Sean and Eliza developed for our previous problem to help hem this one in a little bit, because I think it ended up helping to constrain it. And, you know, what did you say in the chat, poor qorb, poor old qorb? What that actually reminds me of is, back in the day, Jeff Bonwick, at Sun, rewrote the memory allocator. This is the slab allocator that was then adopted by a lot of other systems, including Linux. And that memory allocator, the kernel memory allocator, has extraordinary debugging support to debug memory corruption problems. And part of the reason that Jeff had to do that is because, when he rewrote the allocator, he now owned every bug in the system, because it would die
Bryan Cantrill:I mean, especially in a memory unsafe system, like you seemingly you die in the memory allocator. You corrupt the memory allocator state. So he ended up having to do a bunch that was really defensive to be able to quickly exonerate the memory allocator. And, you know, over the years you got to the point where it's like, okay, we actually know it's not the memory allocator. And Sean, just to your you have one that we've got a lot more debugging in Corb.
Bryan Cantrill:And as you say, like Corb is actually you've done some refactoring to make it more defensive. And so I think each of these issues makes Corb stronger and better even though Corb itself is not to blame. And I think it also gives us confidence in Corb going forward. Poor Corb, but you know, Dave, it's Corp Corp is it's serving its purpose well and and really establishing itself as infrastructure that we depend on. Alright.
Bryan Cantrill:Thank you again, everybody. So, Adam, we are out next week, correct? And we are gonna be back in two weeks, I believe.
Adam Leventhal:Right. For
Bryan Cantrill:what would be Founder vs Investor, but is actually founder, founder versus investor, investor. Is that right? Yeah. It's gonna be a scrum of founders and investors. So if folks have not gotten a chance, now is the time to get a quick read of the book; this is by Elizabeth Zalman and Jerry Neumann.
Bryan Cantrill:We had Jerry on a couple weeks ago or eight months ago now maybe, but and Liz and Jerry are both gonna be here. It's good fun. I I've got some hot takes in the book. I don't know if that'd be made it all the way through, but we've got some Yeah.
Adam Leventhal:I guess I'm not ... I'm just hoping I'll be able to get a word in. There's a lot of confidence that everyone is bringing to the show.
Bryan Cantrill:A lot of confidence. Confidence will not be in short supply. And it should be fun, so join us then. And we will look forward to a podcast episode on async cancellation.
Bryan Cantrill:Clearly we needed to go do that rain. I'm so sorry that we did not force you to go to Russ conference debt, but we'll, we'll definitely look forward to a future episode on that.
Rain Paharia:Yeah. Sounds good.
Bryan Cantrill:Awesome. Thanks everyone. See you next time.