Rebooting a datacenter: A decade later

Adam Leventhal:

Bryan is on the way. Bryan is here.

Bryan Cantrill:

I am here. How are you? Welcome.

Adam Leventhal:

I'm good. How are you?

Bryan Cantrill:

I've been well.

Adam Leventhal:

Do anything fun for the Memorial Day weekend?

Bryan Cantrill:

For the Memorial Day weekend. You know, actually, it's a little tie-in to some past Oxide and Friends episodes: I was watching the Ballers. Oh, awesome. Yeah.

Bryan Cantrill:

They have not started here in Oakland; they opened up with an extended road trip.

Bryan Cantrill:

It was good baseball. It was fun to watch.

Adam Leventhal:

You've been following them on the road. So were you at the Yolo High Wheelers game?

Bryan Cantrill:

No, I was not. They are in Montana. So I was watching; they've got a streaming provider. Oh.

Bryan Cantrill:

Oh. Oh. Oh. Oh. Yeah.

Bryan Cantrill:

I was watching it online. Yeah. I was watching it, you know, and, man, a road trip for

Adam Leventhal:

the Ballers, like, a level of dedication that is truly impressive.

Bryan Cantrill:

I would 100% do that. It was really fun. I was watching baseball with the boys, which is great. Nice. And it's fun.

Bryan Cantrill:

You know? They won their first two games, unfortunately lost their next four, but these last two games were very close. And it was great. It was high-quality baseball. It was all just a lot of fun.

Bryan Cantrill:

And, you know, I loved the conversation we had with those two, with Paul and Bryan; that was great. It was just great to be able to support a baseball team without complications.

Adam Leventhal:

Full-throated endorsement of a team. That sounds great

Bryan Cantrill:

to say. The full-throated endorsement of a team has been very nice. And then the flip side of that is, like, then I started stewing over the A's again and, you know, had to have an extended tweet thread yesterday. So there you go. Nice.

Bryan Cantrill:

But I'm excited for this. Yeah. And so I'm hoping that if there are other Joyeurs, they will identify themselves, raise their hands, so we can get some people up on stage. I know Josh is here, and I believe Robert is gonna join us as well.

Adam Leventhal:

He is here, and he is on his way up. Welcome, Robert.

Bryan Cantrill:

And, you know, I don't know. With Discord, like, I just don't even know who's who.

Adam Leventhal:

I'm sure T San is one of your people.

Bryan Cantrill:

I would assume. I actually don't know. So There we go. We're gonna find out. We're gonna find out.

Adam Leventhal:

So can you kick us off here, Bryan? I don't know that I remembered this talk that you posted, and I would not have known that it was the 10th anniversary of this, of this, oh...

Josh Clulow:

yeah. Yeah.

Adam Leventhal:

I'm sure it's a date that was etched in your mind.

Bryan Cantrill:

For sure.

Josh Clulow:

To be fair, he was like, what's the anniversary today? And I'm like, I don't know. I was there, but, you know. And then you were like, it Googles really well. And so I spent, like, 10 minutes Googling dates, and I'm like, nope. It doesn't.

Bryan Cantrill:

You did not spend 10 minutes.

Josh Clulow:

I actually did. I did.

Bryan Cantrill:

If you... don't make me go get

Josh Clulow:

the time stamps.

Bryan Cantrill:

Go get them right now. Go get them and bring them to...

Josh Clulow:

Maybe it was, like, 3 minutes, but

Bryan Cantrill:

Okay. There we go. Okay. Adam, if you don't mind, I need to bargain a little bit more here. I've already lopped off 90% of this, and I think I can get more.

Bryan Cantrill:

But okay, a bit opaque, fair enough. And, I mean, admittedly, I knew the answer, Josh, so I was being a bit unfair. But, actually, I was reminded of this, and I think I might have, you know, I might have forgotten.

Bryan Cantrill:

It's not like I observe this date every year, but I might have forgotten if it had not been for Steve. Steve mentioned it because he was in the hospital with Abby, so his oldest just turned 10. And...

Adam Leventhal:

At the risk of breaking tradition, do you wanna mention what it is we're talking about?

Bryan Cantrill:

Oh, boy. I don't know. You've got all these newfangled ideas. I mean, it feels like it's a slippery slope to intro music and then high-quality audio, and I just don't know. Like, what will the Internet talk about if we... yeah.

Bryan Cantrill:

Okay. So let me give you some context, and then we'll kind of dovetail that into how Steve learned about this. It was exactly 10 years ago today. Josh, you remember where you were. Robert, I think you remember where you were.

Bryan Cantrill:

I think we've got Brian here. We've got a couple other folks here. You all remember where you were, because we had a massive outage, a very scary outage. It was very broad. It ended up being very shallow, fortunately, but we errantly rebooted an entire data center.

Bryan Cantrill:

And, in fact, we rebooted our most important data center. I mean, they're all important, we love all of our children, but US East 1 is definitely the most important child. And we rebooted US East 1.

Bryan Cantrill:

The US primary was US West 1.

Bryan Cantrill:

Oh, fair enough. That's true.

Bryan Cantrill:

East 1 was the largest by far.

Bryan Cantrill:

It was the largest by far. And in particular, a bunch of our own infrastructure was in US East 1. And, Brian, the website was in East 1, right? We had, like, joyent.com.

Bryan Cantrill:

Don't worry, it's highly available, because it's on 2 different nodes in US East 1. We had a bunch of our own infrastructure. And all of a sudden, our own infrastructure... it's like, you know, Adam, you have that moment of, like, did the Wi-Fi just go out? Like, what happened?

Bryan Cantrill:

You know, you're just like, this is funky. And, like, your first thought is there's something wrong with the Wi-Fi. And I just remember looking up, not really knowing what was going on, and then Mark Cavage saying, yep, we rebooted US East 1. And he's like, get in the chat.

Bryan Cantrill:

I mean, I was in chat. And actually, Jabber: Brian, where was Jabber? Jabber was not in US East 1. This is actually something that was

Bryan Cantrill:

It was in west.

Bryan Cantrill:

Very fortuitous. Yeah.

Brian Bennett:

It was in West. And we're lucky, because... can you imagine?

Bryan Cantrill:

Oh, god. Why am

Adam Leventhal:

I even... Did you have a backup plan? Because I've heard that, for a while, at Slack, their backup plan was Skype. Like, if everything went down, they

Bryan Cantrill:

didn't have Skype.

Brian Bennett:

I mean, there was the old Jabber. There was the era of dueling Jabber servers. So that might have been the backup plan: you know, you get a fraction, 5% of the company, that still had access to that Jabber server.

Josh Clulow:

Was it also in West, though?

Bryan Cantrill:

But a large portion of us were also on IRC, so I think we might've

Bryan Cantrill:

I think you're right, Brian. I think we would've gotten to IRC. On the one hand, it's too terrible to contemplate, Adam. On the other hand, I think Brian's right.

Bryan Cantrill:

I think we would have self-assembled pretty quickly on Freenode. Yeah. I mean, we would have gone ahead

Bryan Cantrill:

to ask, like, what the hell just happened? Are you guys still alive? Like, did San Francisco get nuked or something? You know? I know.

Josh Clulow:

We also had Google shit though. Like, we had Gmail.

Bryan Cantrill:

We did. And so we would have sent an email.

Josh Clulow:

And a bunch of other Google stuff. No. But, like, I really feel that this problem is, like, the smallest possible problem

Bryan Cantrill:

that we could have had on that day. It's like, I

Josh Clulow:

I think yeah. I think

Bryan Cantrill:

it would

Bryan Cantrill:

have figured out

Josh Clulow:

how to contact the other 20 people that we needed. So we did, Josh; we quickly

Bryan Cantrill:

get on to the, the conference chat.

Bryan Cantrill:

Yeah. We did. Yeah. And, Josh, I think you're ultimately right in that, like, we would have... There was just some

Josh Clulow:

channel that would have reached many of us. Like, it was a pretty small company at the time.

Bryan Cantrill:

It was. Okay, so, I mean, each of us is gonna remember that first moment of events pretty differently, I imagine. Maybe not. But, like, I honestly kinda whited out.

Bryan Cantrill:

I mean, Adam, I've said this before, but, like, this is, I think, the closest I've come to feeling like I'm about to pass out. It just felt like everything fired at once, and for what felt like minutes, but I think was only a small number of seconds, every synapse fired. And I just couldn't think for a very brief moment. And this is where Robert might chime in, like, no. No.

Bryan Cantrill:

You were catatonic for, like, 6 minutes. We actually gave up on you. I told someone else to call a paramedic, and I focused on the outage. No one knows. Like, you had a distant look in your eyes.

Bryan Cantrill:

I don't know, Robert, for you, because what I remember is in that kind of brief catatonia. So, just to actually describe our architecture a little bit, and part of the reason why this was so terrifying: we PXE booted everything, Adam. So everything booted over the network, which felt like a really good idea. If only for one thing...

Bryan Cantrill:

Until there was

Adam Leventhal:

nothing on the network?

Bryan Cantrill:

Well, I

Bryan Cantrill:

mean Yeah.

Brian Bennett:

You just need the one thing to be on the network. That's it. Just one.

Bryan Cantrill:

Just one. And then, of course, like, well, actually, it's not gonna be one thing. It's gonna

Bryan Cantrill:

evolve. Yeah. But the head node is on 3 servers, and the database is on 3 servers. And if you only have one node of the quorum, then you can't bring up the booter agent that is gonna allow your second database server to boot up. And I actually don't know how we overcame that.

Bryan Cantrill:

Obviously, I mean, obviously we did, but I don't know who took care of that. I don't know...

Josh Clulow:

It was a cache file, as I recall, of stuff from the database that we were able to

Bryan Cantrill:

Okay.

Josh Clulow:

Convince it to use to boot at least a couple of other machines eventually.

Bryan Cantrill:

Yeah. So we

Bryan Cantrill:

Once you get that second node online, then everything else will come up.

Bryan Cantrill:

And, Robert, what's your recollection of this? Because this is where, like, one of the things early on... the head node clearly completely rebooted.

Bryan Cantrill:

Yeah. If I

Bryan Cantrill:

recall correctly. I've been trying to remember

Brian Bennett:

everything, and I have forgotten more than I remembered. But yeah. So, I mean, everything honored the reboot command. And by the time we realized that we had sent reboot to everything, it was too late to, you know,

Bryan Cantrill:

Yeah. So, important context, Adam, and maybe this is what you were asking for, like, 10 minutes ago: an operator intended to reboot a small number of machines. Like, 10? Like, 10.

Adam Leventhal:

And of how many? Like, what's the...

Josh Clulow:

Of hundreds. Six or seven hundred, I feel like? Yes.

Bryan Cantrill:

Something like that. Hundreds, as in many hundreds. Not, like, hundreds of thousands, but definitely hundreds. Like, a lot of machines. But those machines were not in service.

Bryan Cantrill:

So, like, it didn't matter; you could reboot them as many times as you want, it didn't matter. Instead... because they were running an error-prone command that I wrote, just to get that out of the way. What you should run is: the command specifies the nodes as a comma-delimited list via an argument, and then the command you wish to run. And what happened with this error-prone command is that the operator elided the -n argument.

Bryan Cantrill:

So the command is like: okay, you wanna run the following command on all nodes, where that command is a comma-separated list of UUIDs that is

Josh Clulow:

Whatever that is. Yeah. Whatever that is.

Bryan Cantrill:

I don't know. Some command. Look. These Unix commands aren't always named in the most intuitive way. This is a new one.

Bryan Cantrill:

I don't know. Haven't heard of this comma-delimited-list command. That's a weird command. And then followed by the next command in this list: reboot. And that is what we're gonna execute everywhere.

Bryan Cantrill:

So it was like

Bryan Cantrill:

It wasn't

Bryan Cantrill:

It was truly like the classic rm * .o. And it's like...
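
[Ed.: a rough sketch of the failure mode as described; sdc-oneachnode is the real tool name, but the UUIDs and exact flag syntax here are hypothetical reconstructions.]

    # Intended: run "reboot" on just the named, out-of-service nodes
    sdc-oneachnode -n UUID1,UUID2 reboot

    # As typed: with -n elided, nothing scoped the run, and "reboot"
    # was ultimately executed on every node in the datacenter
    sdc-oneachnode UUID1,UUID2 reboot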

Josh Clulow:

I think it was also... I mean, Mr. Bennett, I think, went back to look at some of the code the other day. Yeah. Yeah. And it used, like, a totally bespoke option parsing.

Bryan Cantrill:

Okay. Okay. Okay. It was not... settle down. Settle down.

Bryan Cantrill:

But, like... oh, man. I have never encountered someone, in a constructive way... It was gigantic. It was like: if it's this, then we're gonna do this thing. If it's this, we're gonna do this thing. If it's this, we're gonna do this thing. And one of the things in there was, for something like -n or -t, which was, like, the timeout that you're going to wait for compute nodes to respond to this command that you sent.

Bryan Cantrill:

It would only evaluate

Josh Clulow:

An important aside.

Bryan Cantrill:

If it's

Bryan Cantrill:

prefixed with a dash or two dashes, then that is, like, the arg signifier. Otherwise, we're gonna do all this other stuff. And so it didn't do the validation, because, like, the list of compute nodes didn't have that -n prefix on it, and so it just kind of ignored doing command syntax validation in general. But if it was just, like, sdc-oneachnode, list of nodes, reboot, then that would have been concatenated as a single command. Because what it would do is take an arbitrary number of arguments and just, like, join by space as a single command and send the whole thing.

Bryan Cantrill:

Interesting. What did you say?

Bryan Cantrill:

So why didn't it try to execute, like, hostname reboot, which would then return an error, hostname: command not found?

Bryan Cantrill:

Right.

Bryan Cantrill:

Right. And so I was trying to figure that out, and it looks like what happened was he probably also included, like, list of hosts, -t 1, because: I'm rebooting these things, don't wait for this thing to come back. Like, I know that it's gonna hang. If I just say reboot, it's gonna hang waiting for reboot to exit and for this whole thing to return.

Bryan Cantrill:

But it's not gonna do that. So I'm just gonna say -t 1. That way, I don't have to wait, and it's gonna return immediately. Which then, you know, if you do, like, list of hosts, some additional valid flags, and then the command you wanna run, it kicks it back into the argument validation mode. And then the bare words that you had earlier just kinda get tossed. Just...

Josh Clulow:

it just it, they're gone now. Yep.

Brian Bennett:

Yeah. Well,

Bryan Cantrill:

I I think that's what happened. I was I was trying to dig into it the other day. Yeah.

Josh Clulow:

They should use getopts. Just do that, and just, like, don't ever make your own option parsing, I feel like.
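
[Ed.: a minimal sketch of the safer approach Josh is describing, using the shell's standard getopts builtin; the tool name and flags here are hypothetical.]

    # Strict option parsing: unknown flags fail loudly instead of being
    # silently reinterpreted, and a missing -n refuses to run anywhere.
    nodes=""
    timeout=60
    while getopts "n:t:" opt; do
        case "$opt" in
            n) nodes="$OPTARG" ;;
            t) timeout="$OPTARG" ;;
            *) echo "usage: oneachnode -n uuid,... [-t secs] command" >&2
               exit 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    # Refuse the dangerous default instead of silently running everywhere.
    [ -n "$nodes" ] || { echo "error: -n is required" >&2; exit 2; }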

Adam Leventhal:

You know, Bryan, it sounds like 10 years ago you had a very spirited, blameless postmortem.

Bryan Cantrill:

Did you did? Yeah.

Adam Leventhal:

I agree. Saving up the blame. I mean,

Bryan Cantrill:

like, I think we're saving up the blame for me, apparently. I... yeah. This is... I

Josh Clulow:

honestly didn't realize you had written that command. So

Bryan Cantrill:

Well, it's a very blameful postmortem.

Josh Clulow:

There you go.

Bryan Cantrill:

Yeah. The idea of, of all people on God's green earth, Mr. Clulow criticizing me for writing my own bespoke options parser... it's a little rich. It's a little rich.

Josh Clulow:

I've never written a bespoke options parser. I used the getopts library.

Bryan Cantrill:

Oh, fair enough. The getopts library.

Adam Leventhal:

Josh, good answer. Wait until your lawyer is present.

Bryan Cantrill:

Yep. Dave.

Bryan Cantrill:

I love it. A postmortem court of law. Dave had to write

Brian Bennett:

node-getopt. So it's not like... that's a chicken-and-egg problem.

Bryan Cantrill:

It was a bit of a chicken-and-egg problem. And now I actually

Steve Tuck:

do want

Bryan Cantrill:

to go through the SCCS history, if we...

Bryan Cantrill:

I guess

Josh Clulow:

you'd spin. You mentioned the

Bryan Cantrill:

options parser, I guess.

Josh Clulow:

You mentioned the timeout thing that I'd totally forgotten until this moment. Yeah. Setting the timeout to 1 so you just ignore all the failures. This reminds me of another problem that exists in this interface: do you recall exit code 113? Yes.

Bryan Cantrill:

I was gonna bring that up.

Adam Leventhal:

Oh, god.

Bryan Cantrill:

This is... hey, Adam, I do actually need you. Adam, can I retain you for the period of this postmortem? I'm gonna need... sorry. There's another felony that I didn't tell you about, that I...

Josh Clulow:

The agent... so the agent that would run the commands: the tool that we were using was called sdc-oneachnode, right? And then it would use RabbitMQ to speak to the ur agent, which is very hard to pronounce in a way that Americans can parse. The "ur," it's like "you are," like the... not... right. ZFS.

Bryan Cantrill:

ZFS. I wanna take the car to ZFS.

Josh Clulow:

Park the car and... but it did not have a very expressive protocol, I feel, for executing commands or doing things. Oh.

Adam Leventhal:

Does this make, like, curl piped to sudo look relatively safe?

Bryan Cantrill:

Yes. 100%. It's really

Bryan Cantrill:

it's brutal,

Josh Clulow:

at least, I feel like.

Bryan Cantrill:

It's even worse than that, because... it's very bad. On the Triton control plane, you can actually just send, like, CNAPI, server UUID, slash execute, and then pass it a command to the API, no authentication, and it'll just run that on whatever, you know, you specify.
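
[Ed.: a hedged sketch of the CNAPI endpoint being described; the URL shape and body are reconstructed from memory of later Triton docs and may not match any particular version.]

    # On the (trusted) admin network, no auth required:
    curl -X POST "http://cnapi.example.com/servers/$SERVER_UUID/execute" \
        -H 'Content-Type: application/json' \
        -d '{"script": "#!/bin/sh\nuname -a"}'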

Bryan Cantrill:

Mistakes remain.

Bryan Cantrill:

It's one of the reasons why access to it is, like, extremely controlled.

Josh Clulow:

If it exited, though, 113... if any program that you ran exited 113, yeah, the ur agent would interpret that as a request to reboot the computer.
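
[Ed.: the shell-level shape of that convention, as a hedged sketch; the real ur agent was a service, not this script, and run of the operator's command is a stand-in.]

    # Run the operator-supplied command, then check its exit status.
    sh -c "$user_command"
    if [ $? -eq 113 ]; then
        # In-band signal: exit code 113 means "please reboot this machine."
        reboot
    fi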

Bryan Cantrill:

Do you know the origin of 113?

Josh Clulow:

You know

Bryan Cantrill:

I was gonna look it up,

Bryan Cantrill:

No... before... this is, I mean... so, no. For sure, the origin for 113 is not unlike... Adam, what is the port number that we attach to for the Fishworks appliance? 215? And what was the significance of 215?

Adam Leventhal:

That was February 15th. I think that was the day that we, like, left the kernel group, or told all our managers or whatever.

Bryan Cantrill:

Okay. So, 113? Any guesses on 113, Josh?

Josh Clulow:

I guess the month has to be first.

Bryan Cantrill:

That's right. That's

Bryan Cantrill:

Yeah.

Bryan Cantrill:

Yes. I know. I'm so sorry.

Josh Clulow:

The 13th of January or

Bryan Cantrill:

the 13th of March? January. January 13th, I believe, 2011. When Orlando was like, I need a hack for this, a very temporary hack.

Josh Clulow:

And... Very, extremely temporary.

Bryan Cantrill:

This one is 100% on me. And, I believe... did I suggest this? I'm hoping the chat history is lost. I quite likely suggested this. The specific exit code is definitely down to me.

Bryan Cantrill:

That is definitely my suggestion. So my fingerprints are on that revolver for sure.

Josh Clulow:

I definitely There

Bryan Cantrill:

you go. Exit code 113: reboot. Hey. And you know what?

Josh Clulow:

I wrote a very long Jira issue about this at some point, I feel like. And it's possible that I determined, at the time of writing that issue, that you could use colors in the text.

Bryan Cantrill:

Colors in Jira, you mean?

Josh Clulow:

Yeah. Like, in the text. I feel like I may have painted a rainbow of SpongeBob text or something like...

Adam Leventhal:

Such a rich commentary that it required multiple colors to fully express the dialogue and nuance of this issue.

Bryan Cantrill:

He's like a mantis shrimp, developing, like, new cones in his eyes to see the colors of rage that he had over the 113 exit code.

Josh Clulow:

About in-band signaling. About in-band signaling. Yeah.

Bryan Cantrill:

So this is all new to me.

Bryan Cantrill:

Oh, sorry, Brian. Back here. Yes.

Bryan Cantrill:

This is all new to me, because I actually tried to drill into this before and figure out, like... it's such a seemingly random... like, where did this number come from? You know?

Bryan Cantrill:

That's where it came from.

Bryan Cantrill:

And I was very curious: why 113? The source code, if what you're saying is true, and I don't doubt it, the source code completely lies about what it is. Really? It says... yeah.

Bryan Cantrill:

Reboot if the exit code is 113, or 1/13. So there's the clue right there. But it says it is the 30th prime number and also an Eisenstein prime. I was told on more than one occasion that the 30th prime number and the Eisenstein prime is why it was picked. This, like, January 13th date, I think

Bryan Cantrill:

that was a joke.

Bryan Cantrill:

is entirely new to me.

Bryan Cantrill:

I think that was a joke. It does feel like it could be the 30th prime, though. Is it the 30th prime? Probably is.

Bryan Cantrill:

I don't know. I didn't check.

Josh Clulow:

There's really no risk to check.

Brian Bennett:

Wikipedia claims it is. So either you've conveniently edited it ahead of this or, you know

Bryan Cantrill:

Smart. Listen, I had 10 years to do it, if I did it.

Bryan Cantrill:

Well, it is also, verifiably,

Brian Bennett:

an Eisenstein prime, so, you know.

Bryan Cantrill:

Yeah. The git blame for this is 11 years ago. So, you know, it hasn't changed in preparation for this.

Brian Bennett:

You also only really have 128 numbers to pick from, and it really can't be 0. So...

Bryan Cantrill:

It cannot be 0. Yeah, you don't actually have that many numbers to pick from. And I think we were trying to pick...

Brian Bennett:

Okay. A u8. It's like an int8.

Josh Clulow:

I think it has to be... I think it's an i8. Right?

Bryan Cantrill:

And it has to be more than... I would like to point out that even ChatGPT is a bit at a loss about the significance of 113. It knows that it's prime; it's like, "it's prime," and then it's like, "cultural references: in some cultures, numbers can have specific meanings." Moving on.

Bryan Cantrill:

I have no idea. Like... so I would like to point out that, you know, I don't see a lot of programs exiting spuriously with a status code of 113. Of course, now I'm sensing a bunch

Bryan Cantrill:

of stuff that I missed. I thought, or someone had said, you know, it was picked because it's prime, it's a very large prime; it's very unlikely that any random software would have it as the exit code for any error status.

Bryan Cantrill:

That is also true.

Bryan Cantrill:

I took that at face value. Like, that seemed fine. Yeah. Yeah. That's all true.

Bryan Cantrill:

That all makes sense. But I'm very curious about what happened on January 13th now, because this is

Bryan Cantrill:

We wrote it. I'd never heard of it.

Bryan Cantrill:

That's what

Bryan Cantrill:

it was. No, no, that literally is when it was. That's what

Bryan Cantrill:

it was? Yeah. That was the day? Okay.

Bryan Cantrill:

That was

Bryan Cantrill:

the day. So then you found some other justification to put into the code, into the comments, you know, other than, like, "I just need to reboot this, and I'm supposed to delete this in 10 hours." But that never happened, obviously, because...

Bryan Cantrill:

It was definitely never gonna be deleted in 10 hours. The shelf life of that thing was definitely gonna be... so this is in early 2011, and we were furiously working on SmartDataCenter, and this is like a problem we just needed to knock down, and, like, we'll come back to it in the indefinite future. Narrator's voice: they never got back to it. Also, I'd like to point out that, like, everyone can find the 113 distasteful; it actually had nothing to do with this outage.

Bryan Cantrill:

The outage, I would like to point out. It is merely from the same mind.

Bryan Cantrill:

Just that that was

Adam Leventhal:

the quality of this.

Bryan Cantrill:

I don't know that he used the 113, because I think I learned about it after that event, someone saying: by the way, you can do that and not have to send the timeout. Because if you say exit 113, the shell will actually exit, and ur succeeds and replies. It returns back to sdc-oneachnode, and sdc-oneachnode will say: this is the result of what you ran on that host. But if you just say reboot or, like, init 6 or whatever, it's waiting for that command to return. But then the ur agent is shutting down, because it's going through the shutdown process, and you never get a response.
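
[Ed.: a hedged illustration of the difference being described; the flag syntax is a reconstruction, not the literal invocation.]

    # Hangs: the agent waits for "reboot" to exit while the machine shuts
    # down, so the response never comes back (hence the -t timeout workaround).
    sdc-oneachnode -n UUID1 reboot

    # Returns cleanly: the shell exits 113 immediately, the agent replies
    # with the result, and only then does the 113 convention trigger the reboot.
    sdc-oneachnode -n UUID1 'exit 113'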

Josh Clulow:

This is why I brought it up. Right?

Bryan Cantrill:

It's like because we had

Josh Clulow:

we actually had a wacky other button for

Bryan Cantrill:

for this Right. For this thing.

Bryan Cantrill:

And I

Bryan Cantrill:

I know that it was common for people in support to use reboot and -t with a timeout, because it was going to take forever to respond, or to time out and fail. So part of their SOP was to do it that way. And, I don't know, I guess this 113 was, like, an engineering secret or something; it wasn't well advertised.

Josh Clulow:

Clearly, like, the cultural memory forgot about it between

Bryan Cantrill:

We 1

Josh Clulow:

113, and I feel like people forgot about it, because

Bryan Cantrill:

I know a bunch of people that would not forget about it, or let me forget.

Bryan Cantrill:

I use it all the time now, you know. There you go. Anytime I need to reboot a computer, that's how I do it. I'd say so. That's the safe way.

Bryan Cantrill:

So we had errantly rebooted everything, Adam, due to a combination of... robust infrastructure.

Josh Clulow:

Human factors. Human factors.

Bryan Cantrill:

Well, I

Josh Clulow:

do think human factors.

Bryan Cantrill:

I mean, on the one hand, a mistyped command. On the other hand, obviously, a command that was very, very brittle. Yes. Living in a sea... Before we

Adam Leventhal:

move on, not to pile on the command, but

Bryan Cantrill:

You're my lawyer. You're my lawyer. I would've...

Adam Leventhal:

Your honor, I mean, I would've screwed this up, like, the first time I ran the command. So it is remarkable to me that this command was run, apparently, so many times

Bryan Cantrill:

without managing

Adam Leventhal:

to reboot the entire data center.

Josh Clulow:

A rusty, tetanus-coated scalpel, yeah, sitting on a desk in a preschool was not...

Bryan Cantrill:

like Okay.

Bryan Cantrill:

Well, let me tell you something about that rusty scalpel: that thing looks so obviously dangerous that you wield it carefully. And the thing is, I...

Josh Clulow:

I I We've all heard that one before though.

Bryan Cantrill:

Right? I mean,

Josh Clulow:

that's come on.

Bryan Cantrill:

I think the prevailing thought was: you need to be careful with this thing.

Bryan Cantrill:

Well, it was not a point of principle that, like, oh, by the way, this thing should be totally brittle, it should blow up our entire business if anyone mistypes anything.

Bryan Cantrill:

Right. I mean, that was not in the design criteria, for sure.

Josh Clulow:

No.

Bryan Cantrill:

But yeah. I think "wield it with care" was kinda like a given.

Bryan Cantrill:

One of the things that happened early in this outage, and I think one of the things that actually, honestly, Adam, set the tenor for the outage: so the operator has done this errantly and obviously does not feel good about themselves. But, right, more or less everybody was very supportive immediately. Because a couple of folks said, you know, I've almost done this exact same thing before. Which is actually a pretty productive thing to say when you've got an outage that has this kind of operator component to it.

Bryan Cantrill:

And to have your fellow operators be like: I have almost made this exact same mistake, by the way. Like, any one of us could have made this mistake. And I think it was actually helpful. Our disposition anyway was always gonna be, let's work together to get this fixed, but I think it helped set the disposition with the operator of, like: okay, I actually have a job tomorrow.

Bryan Cantrill:

On top of the anxieties of getting this system back on its feet, I don't need to have the existential anxiety of: do I work here or not? I think we took that off the table implicitly pretty immediately, and hopefully explicitly as well. But I think that...

Josh Clulow:

I think explicitly. I mean, I remember, yeah, a bunch of things being said, like: I don't think anyone could make the person feel worse than they feel themselves right now.

Bryan Cantrill:

Yes. Yeah.

Bryan Cantrill:

Like yeah.

Bryan Cantrill:

Actually, it was funny: I was going through my email from that day, which was kind of mesmerizing, because, yeah, there are a whole bunch of things going on, and we're kind of kicking around the draft postmortems and so on. And, you know, the operator sent me a personal note later that evening, just being very appreciative of the way we all handled it. And I wouldn't have handled it any other way, but I do feel that that's an element of this.

Bryan Cantrill:

That, you know, we did a bunch of things wrong, in that we had a system that had this failure mode it was exposed to. But I think we handled it right from a social perspective, really setting the bit early that, like, we all need to work together to get this thing resolved. The thing that I was very worried about, though, that we were hitting on earlier, Adam, is that we did not necessarily know what the implicit dependencies were that the system had upon itself. When you have a system that you kind of evolve over time, where you upgrade it in situ, you've got a SaaS thing that you're upgrading all the time, and you're not really doing a fresh install in your DCs. And I just feel like we don't necessarily know what's gonna happen.

Josh Clulow:

And this is what terrifies me about, like, you know, Google, right? It's like, when have they ever turned the whole thing off, yes, and back on again? And the answer, of course, is almost certainly, like, never.

Josh Clulow:

Right? I mean, clearly they will do drills in certain areas at certain times, you know, and try and recreate the salient aspects of the failure that they're trying to defend against. But, like, when has it ever actually happened? And I think the answer, generally, with these incredibly large and complex systems, is that it has never happened, that they've been completely cold started. Every computer in the world turned off for 20 minutes and then turned back on.

Josh Clulow:

Right?

Bryan Cantrill:

Right. I mean,

Josh Clulow:

that's Right. That's on call.

Bryan Cantrill:

What what

Bryan Cantrill:

what does

Bryan Cantrill:

that mean? Yeah. Totally. And, I mean, it's scary, because you're about to... like, alright, we're about to learn something about the system. And so, Robert, I'm trying to put together what happened next, because in my memory, one of the earliest things is that we know that systems are unable to boot.

Bryan Cantrill:

The head node must have rebooted because one of the earliest things I remember is us realizing that the zone for TFTP was maxed out on CPU. I remember you making that discovery. Am I remembering that correctly?

Brian Bennett:

Yes. I think we had kinda, like, 2 or 3 different splits at that point. You have a few of us basically making sure that... so everything booted from PXE except this one system, which booted via USB. Everything was booting a ramdisk, regardless. So we have some of us on that, and I think then we split off.

Brian Bennett:

And I wanna say Mark is then also waiting on Manatee, and looking at getting Manatee and Moray up in, like, a one-node-write mode initially.

Josh Clulow:

Manatee is a Postgres chain-replication thing.

Bryan Cantrill:

Yeah. That's right.

Brian Bennett:

But But, yeah. So I

Steve Tuck:

think Yeah. We're seeing

Bryan Cantrill:

That's probably what it was. It was Mark who put it back in one-node-write mode. Probably.

Brian Bennett:

I mean, I'm just trying to because that one copy was on that server itself.

Bryan Cantrill:

But we

Brian Bennett:

were trying to get everything booted, and I think we were noticing things were timing out. And yeah, Bryan, you're right. I don't remember how, but I suspect we were just looking at good old mpstat and

Bryan Cantrill:

Yes.

Brian Bennett:

The LAT column. And you have a box that has all these zones. Whenever they came up, they would all have a thundering herd as everything started up. You know, probably 50-plus Node.js processes got out of bed and all fought for CPU at once, during an era where the server CPU sweet spot was, like, 10 cores.

Bryan Cantrill:

Right.

Brian Bennett:

Yeah. Right. So probably probably a Sandy Bridge. So Sandy Bridge era. Yeah.

Brian Bennett:

Yes. Or maybe a Westmere. Maybe Westmere-EP.

Bryan Cantrill:

Yeah.

Josh Clulow:

I feel like, critically, it was a Node process as well. We wrote our own TFTP server, right? Yes.

Bryan Cantrill:

We did. That's right.

Brian Bennett:

Booter was its own TFTP server. And it

Josh Clulow:

And then, presumably it was single-threaded, so it would have been one core, right?

Bryan Cantrill:

It was one... and it was also super limited on DRAM. I recall that's, like, a 256-meg zone. Yeah. Robert, it was small.

Brian Bennett:

Pretty aggressive on both the zone memory and CPU caps on just about everything. And Booter was definitely sized for the sunny day where, you know, like...

Bryan Cantrill:

Intermittent booting.

Brian Bennett:

Like, once every couple minutes, you know, or, really, like, once a week or once a month. Not, really, waking up

Bryan Cantrill:

Everybody all at once.

Brian Bennett:

Everyone's saying hello. Combine that with UDP, you know... I think this would have been probably before we rolled out HTTP booting. Yeah. HTTP booting. So...

Josh Clulow:

100%. It was one of the reasons we did it. Right.

Brian Bennett:

So yeah. So we wake up, and you have this zone that probably has, like, a CPU cap of, I would guess, at best one CPU core, maybe.

Bryan Cantrill:

For sure. I think it was actually less

Bryan Cantrill:

than I

Brian Bennett:

like, like, one thread and, like Yeah.

Bryan Cantrill:

Yeah. Yeah.

Josh Clulow:

It was, again, a single-threaded Node thing. So I think, like, even if it had a bunch of threads, it wasn't gonna go any faster.

Bryan Cantrill:

No, no, it was gonna go much faster, because... that's exactly what we saw: it went much faster.

Brian Bennett:

So it was it was something from that.

Bryan Cantrill:

So, yeah, it literally couldn't get CPU. And I remember that Robert made this discovery. Again, this is, like, super early. We are on the head node within, kind of, seconds, tens of seconds, minutes of the head node rebooting.

Bryan Cantrill:

Yeah. We know that other things aren't making progress. And, Robert, the reason I remember that so viscerally, and remember that the timing was very early, is that it kind of knocked me out of my panic state a little bit. You're just like, oh my God, nothing is booting. And I kind of feel like, you know, you're in the final exam where you've forgotten all the things that you've studied. You've just forgotten how to do anything.

Bryan Cantrill:

And you were like: oh, wait a minute, look at the high latency. We need to give this thing more breathing room. And we gave it more breathing room, and, like, a whole bunch of nodes started booting. And we're like, okay.

Bryan Cantrill:

This is the... and I think we had

Josh Clulow:

to turn some of them off, though. I feel like ops started

Brian Bennett:

we did

Josh Clulow:

powering a lot of machines off, right, to clear the air a little.

Bryan Cantrill:

Yeah. I remember that also. We did

Brian Bennett:

several things. So, like, the first thing we do is we put that process into the real-time scheduling...

Bryan Cantrill:

A scheduling class. Yeah. Exactly. Yeah. Like,

Brian Bennett:

which definitely got rid of all its LAT. I don't know, but you can probably describe the RT scheduling class better.

Bryan Cantrill:

Yeah. I mean, basically, you know, it is now the highest-priority thing in the system, and we removed its CPU caps, and things started working much, much better. We went from not being able to boot anybody to servers beginning to come up. And then somewhere in here... so the other kind of aspect that's happening is, like, US East is all down. Like, all of our customers are down.
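
[Ed.: roughly what that fix amounts to on SmartOS, as a hedged sketch; the pid and zone name are hypothetical, and the exact priocntl/prctl invocations are from memory.]

    # Move the starving booter process into the real-time scheduling class...
    priocntl -s -c RT -i pid 12345

    # ...and raise the zone's CPU cap (the value is in percent of a CPU).
    prctl -n zone.cpu-cap -v 3200 -r -i zone booter0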

Bryan Cantrill:

Our website is down, so we don't have the ability to communicate with people to let them know they're down. And at some point in here, I remember telling Brent, who was running ops at the time: I know Steve is out on paternity leave, but we've gotta call Steve. You've gotta call Steve, actually. Because Steve needs to know that every single one of his customers is adversely affected by this. Some are gonna be completely down if they're only in US East, and some are merely gonna have some significant fraction of their infrastructure down.

Bryan Cantrill:

But, like, Steve needs to know, so he's not finding out about it from a customer. And then, Robert, the other thing that we were very concerned about was that we were gonna have systems that simply were not going to come up. And as I recall, we dispatched at least Keith, but I thought maybe you and Keith, on figuring out how we needed to bring up machines by hand if we had to. Am I remembering that correctly?

Brian Bennett:

You probably are, and I'm trying to, like,

Bryan Cantrill:

Can't dispute

Brian Bennett:

it? I definitely... no, I mean, everything about that rings true. And I just can't remember, like... we definitely... potentially PERCs, maybe? And so the part would, or...

Bryan Cantrill:

I think that was the... well, we definitely had the Broadcom firmware issue on a bunch of these nodes, which is part of what ended up being the long tail.

Brian Bennett:

Oh, the bn... oh, right. bnx screwing us.

Bryan Cantrill:

Yes.

Bryan Cantrill:

Yeah. If not for that, everything would have been up, like, much sooner.

Josh Clulow:

Remember Would

Bryan Cantrill:

have been up much sooner.

Bryan Cantrill:

Yeah. A lot of people were working on just, like, connecting to IPMI and manually rebooting things and babysitting it until they didn't get hit by that bnx error.

Brian Bennett:

Also, I do think that when the TFTP timed out, the BIOS of that era did not handle it very well.

Bryan Cantrill:

Oh, interesting.

Bryan Cantrill:

Yeah. I think

Adam Leventhal:

it was the

Steve Tuck:

what was this

Brian Bennett:

I don't think it was, like, a permanent, like, let's keep retrying the boot order.

Josh Clulow:

You know, like, no. Like, we're gonna

Brian Bennett:

We did this, then, like, we went to the local disk and tried to find something. Didn't find anything.

Bryan Cantrill:

And then it was: do you wanna hit

Brian Bennett:

F1 to, like... do you wanna hit F1 to try again? Yeah. So...

Adam Leventhal:

what was this Broadcom bug?

Bryan Cantrill:

The bug.

Bryan Cantrill:

I couldn't find the bug number for this. I went looking for it.

Brian Bennett:

My theory on this: so, at the time, there was a closed driver for bnx, which was the NetXtreme II NIC line that got sold to QLogic eventually, now of Marvell, through someone else I'm blanking on, as all these firms have continued to buy each other up and wear each other's skin. So...

Bryan Cantrill:

Robert, you may have a career as, like, a driver genealogist. You know? You could, like, help people trace their drivers' roots.

Adam Leventhal:

A 23andMe for drivers.

Bryan Cantrill:

23andMe, but for drivers. Yeah. Where they, like, you know, tell you...

Brian Bennett:

I'll be lucky if I have 23 customers, I think.

Bryan Cantrill:

That's right. This driver is actually the descendant of a driver that was a baron in France. Anyway, sorry. Go ahead.

Brian Bennett:

Yeah. Well, I think, unfortunately, this driver had descended from someone who didn't understand how to correctly enable and disable interrupts, as I found years later, when Supermicro's website inadvertently just had the source code for the driver sitting there.

Bryan Cantrill:

That's helpful.

Brian Bennett:

It was helpful.

Bryan Cantrill:

It's, like: is this open source, or is this source-available? Is this just source on your website? What is this?

Josh Clulow:

It's open now. It's open now. Exactly.

Brian Bennett:

Yeah. And then the company... I actually made a deal. It's like: if I promise to never bother you about this again, can I just treat it as open source? And they said yes.

Bryan Cantrill:

So that

Josh Clulow:

was Did Yeah.

Brian Bennett:

But I found, in the GLDv3, like, mc_stop or mc_start interface, it's like, how about I turn off

Steve Tuck:

and how about, like, a

Brian Bennett:

turn off interrupts? And, oh, interesting. And, like, part of the problem is that, for those who haven't had to deal with the intricacies of PCIe, devices both have MSI interrupts and then also, like, a way to control interrupt enablement at a lower level. On most devices, you basically turn on MSIs and MSI-X, like, when the driver attaches, and maybe when the thing gets reset, and then you toggle, like, in-band, device-specific bits that control interrupt generation. And so, in my theory, this always manifested itself as, effectively: you'd be trying to assign an IP address on the link, and it would just, like, give up halfway through.

Bryan Cantrill:

Yeah. Right. It

Brian Bennett:

was, like, net eighty something.

Bryan Cantrill:

And so we had a bunch of machines that hit that, and so they were struggling to boot. We were... Steve, is your audio working? Steve's trying to get some...

Steve Tuck:

Is my audio working?

Bryan Cantrill:

Your audio is working. You're here.

Steve Tuck:

This is just such a hard topic to talk about. I had to take a moment. Exactly. It brings it all flooding back.

Bryan Cantrill:

But, you know, your therapist said that this would be a productive session for you.

Bryan Cantrill:

This is

Bryan Cantrill:

an important step.

Steve Tuck:

Yeah... so, changing therapists.

Bryan Cantrill:

Okay. So, Abby, your oldest...

Steve Tuck:

had just been born. Abby. Yeah. She had been born. Happy 10th birthday.

Steve Tuck:

Yeah. Thank you. Thank you. I'll pass it along. She and Susanna have in common that their birthdays are remembered more for big events in Joyent history than for the birth dates

Bryan Cantrill:

themselves. Right.

Steve Tuck:

Abby's for this, and then, of course, Susanna's for when the CEO of Joyent was fired.

Bryan Cantrill:

Yes. Yeah. When I was describing that we're gonna talk about this because of Abby's birthday, she's like, oh, like, Abby and I are kinda like twins, because, like, that one guy was fired. And then, Steve, you'll appreciate this. She's like, did Steve fire him?

Bryan Cantrill:

Did Steve fire that guy? And I'm like, no, no, Steve and I both worked for him. He was the big boss.

Bryan Cantrill:

It was the board that fired him.

Josh Clulow:

The hand of God reached down and fired that guy. Yeah. That's right. Yeah. And then she kinda got

Bryan Cantrill:

a little wide-eyed. She got a little wide-eyed, actually. She was like, wow. That's... wow. A bigger boss than Steve.

Bryan Cantrill:

That's exciting.

Steve Tuck:

He must have been old. Right.

Bryan Cantrill:

So

Steve Tuck:

Yeah. No. I was in the hospital. Abby was born on the 22nd, but she was still there because she was a little jaundiced. She was getting some extra fake light.

Steve Tuck:

And I got a call... Brent called me, and I did not have his number in my phone, because he had just gotten there. I mean, he had only been at, at Joyent for, I don't know,

Bryan Cantrill:

a month? A month. Yeah. He had just gotten there.

Bryan Cantrill:

That's right.

Bryan Cantrill:

So we

Steve Tuck:

probably got, like, a call from

Adam Leventhal:

Somebody has to call Steve. Somebody does. Brent, like... what's your name again? You do it.

Bryan Cantrill:

You do it. No, I actually, you know... Doesn't he... you're

Josh Clulow:

a director of something?

Bryan Cantrill:

Yeah. No, no. He actually... I mean, ultimately, he was very new to the role, but he was running operations. This was definitely, like, in his purview.

Bryan Cantrill:

And, oh, I was very direct about: you should call Steve. Actually, you know, that's funny. Steve, I'd forgotten how new he was, because the thing that I remember is that he was very eager to help in any way he could. He was great. I mean, again, I think everyone through this was really good.

Bryan Cantrill:

And I think he was like: yes, I will call Steve. Great idea. So you just see a call come through...

Steve Tuck:

Yeah. An unknown number. But I did recognize it as a Missouri area code, oddly, which I think goes back to one of the things that I gained from my earliest days of my career, working in Dell's personal computer call centers. You start to understand area codes and where they're from. But: saw Missouri, knew he was from Missouri, and answered.

Steve Tuck:

And he says, hey, Steve, I'm sorry to... or, I think he said, are you busy? And I said, yeah, kind of. I've got to run down for, like, another UV light appointment here in a few minutes, but what's up?

Steve Tuck:

And he said, well, all of US East 1 is down. I was like...

Bryan Cantrill:

Okay. Well, I'm throwing up on myself now.

Steve Tuck:

Oh, I mean, it didn't quite sink in. It took, like, a second or 2 to sink in, and then I had the same feeling that I had when Abby had just, you know, arrived in the world: I was kind of dazed and stunned and looking for something to kinda brace myself with.

Bryan Cantrill:

Right. You're welcome. Thankfully, you're already in a hospital. Yes. Yes.

Steve Tuck:

So I was in a good spot for that. But then his next words were much more troubling, which is: and we're not sure when it's gonna come back.

Bryan Cantrill:

Truth. That is the truth.

Steve Tuck:

He didn't say if it would come back, which was good.

Bryan Cantrill:

That would have been even truthier.

Josh Clulow:

I mean, a meteor didn't hit it. So...

Bryan Cantrill:

Right. I think you might recall, and I don't know if this is true, because I was not in San Francisco; I was working remotely. But someone said something like: some people are working on recovering, and other people are working on arranging flights to Virginia.

Bryan Cantrill:

Yeah. Probably. So my disposition on this was: I did not wanna have happy ears about this at all. Let's assume that we are in minute 1 of a multi-day outage. If we knew that this was gonna last multiple days, how would that affect our actions right now? I definitely remember thinking that, in part because I'd seen this very good presentation from Mark Imbriaco.

Bryan Cantrill:

Steve, you remember, and I'm not sure if you watched that presentation at Surge, but

Steve Tuck:

Really good. I was just thinking about that. Yeah.

Bryan Cantrill:

Yeah. He gave a great talk at Surge. Unfortunately, it was not recorded, but it was about a Heroku outage that was a 66-hour... and... 96. Was it 96?

Bryan Cantrill:

96. Yeah. It was.

Josh Clulow:

That was a lot of hours.

Bryan Cantrill:

It's a lot. And sleep management became a major, major issue for them, you know, because he kept hoping that the outage was gonna be done in the next hour. And one of the things that he took away from that was: I wish I had better planned for "let's assume this goes on for much, much longer," and gotten people off the keyboard and sleeping, so I had refreshed people coming back.

Josh Clulow:

How long did it go on? Because I always feel like, when I look back, it was like: oh, this thing? Oh, this is already...

Bryan Cantrill:

Hours, right? Not even, like, 6 hours. No. No.

Bryan Cantrill:

No. It was even less than that. But just because we were taking the system into a completely new state, I feel it could have been much, much longer.

Bryan Cantrill:

Oh, yeah. I mean, we could have, like, started returning exit code 113s from things that, you know, don't normally return it, and we could've... we just could've.

Bryan Cantrill:

Yeah. It

Bryan Cantrill:

It could've snowballed in so many different dimensions.

Bryan Cantrill:

I remember having looked at the tickets for the incident itself quite a ways back, and it was only about 90 minutes. Yeah. It was short. We had most everything up.

Bryan Cantrill:

And, yeah, like we mentioned earlier, the long pole in the tent was just that bnx issue. We had a few stragglers because of that. But otherwise, it would have been, like, well under an hour to bring everything back up. Because the main thing was, like: the head node rebooted on its own.

Bryan Cantrill:

Good. Great. Put Manatee in one-node-write mode. Okay, now it is serviceable, and then Booter is just, like, constrained.

Bryan Cantrill:

And once you unleashed Booter, then pretty much everything could come up.

Bryan Cantrill:

Yeah.

Bryan Cantrill:

And that series of events of realizing, okay, what's the next thing blocking this? That could have easily been, like, only 10, 15 minutes to get to the point where Booter is unleashed.

Bryan Cantrill:

That's right. I think of that time. Yeah. Yeah. Yeah.

Bryan Cantrill:

It's short.

Bryan Cantrill:

And then after that, it's just, like, all these things that are hanging at boot time because of the damn bnx driver or firmware, whatever it was.

Bryan Cantrill:

And so, I mean, we just got lucky in, like, many, many dimensions. It could have been much, much worse. I definitely remember the moment that... and Robert, I know you were also physically there, because in my memory, Ben Rockwood, and maybe Shane, but certainly Ben, were working on failing the website over to US West. And, Brian, maybe you were involved in that as well. We needed to get the website up so we could tell people that we had not actually been nuked. Because, like, joyent.com is down during this whole time,

Bryan Cantrill:

Adam.

Bryan Cantrill:

I mean, like, you go to joyent.com, and it's just like, yeah, not found, 404. Like, that...

Josh Clulow:

this is I remember

Bryan Cantrill:

somebody was working on it. I remember Troy and Elijah working on doing the reboots, and everything just came back up well before. Like, there were people working on moving the website, and then it was like, oh, we might never need to. I remember that moment.

Bryan Cantrill:

Oh, no, I remember that moment in a clear way. Because I remember someone saying... yeah, I think you're right. I think it was gonna be Troy and Ben and then Elijah.

Bryan Cantrill:

But... working on that. And I remember thinking, okay, great, joyent.com is up. Someone said, like, joyent.com is up. And I'm like, great.

Bryan Cantrill:

They failed that over successfully to US West. And Ben was right next to me, and Ben's like, we haven't done that yet. It was like, we're about to do that, but we haven't done it yet. We're gonna turn the key in, like, 2 minutes. And I'm like, wait a minute.

Bryan Cantrill:

If you have not failed it over, and it's up, that means, oh my God. We're gonna

Josh Clulow:

It's Christmas.

Bryan Cantrill:

It's Christmas. We're... we're booting everything. I definitely remember being elated about that. Because, Josh, we were all... I remember, because we were physically all together when that was happening. And I remember that moment of just, like, oh my god.

Bryan Cantrill:

We're gonna live. Thank god. Because, like, I think you... you're not...

Josh Clulow:

Huddled in that tiny corner of the giant floor of One EC that we were, for some reason, renting at that time. Yeah.

Bryan Cantrill:

That's right. We were on the couches there, realizing that we were gonna live and that we now definitely had enough to work with, where we were gonna be able to bring the system up. Robert, one thing I was trying to remember is... I guess the head node itself was able to boot off of stable storage. I feel like there had been work done, that I at least was personally unaware of, that ended up being load-bearing here with respect to our ability to recover from this. Is that right?

Bryan Cantrill:

I know I'm asking you to go into the crypt here. In terms of, like, our ability to actually... Well, the fact that

Bryan Cantrill:

the head node will always reboot. Yeah. Okay. All of that comes up. But if it can't contact the peer Manatee instances and the peer Binder instances, ZooKeeper...

Bryan Cantrill:

Yeah.

Bryan Cantrill:

Then the database can't come up, and all of those services will throw errors saying that they can't connect to Moray. And that's where you're gonna get stuck. And so that's why it's, like: how did we get past that? Right. You need quorum, and if you can trick Manatee into going into one-node-write mode, that does it.
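
[Ed.: a loudly hedged reconstruction of that trick from later SDC operator docs, not the runbook used that day; the sapiadm/sdc-sapi invocation may differ across versions.]

    # Flip the manatee SAPI service into one-node-write mode, so the
    # surviving Postgres can take writes without its quorum peers.
    sapiadm update $(sdc-sapi /services?name=manatee | json -Ha uuid) \
        metadata.ONE_NODE_WRITE_MODE=true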

Bryan Cantrill:

And that was the thing. I had forgotten about that. And that's

Bryan Cantrill:

Did Mark do that? Is that how we got there? Robert, do you remember? How did we get

Bryan Cantrill:

I would put good

Brian Bennett:

odds on it. So that Yeah. That we did that. And I think the booter cache,

Bryan Cantrill:

which I

Brian Bennett:

don't think was always there.

Bryan Cantrill:

That's the only reason we have the booter cache. The booter

Josh Clulow:

cache did not work at the time.

Bryan Cantrill:

Right? Because this is one of the problems that

Josh Clulow:

we had was that it was full of files, and it could have worked, except that, unfortunately, I think because it had started up again from cold and had never been able to talk to the database, it was a teapot, basically. And I think it was not looking at its cache because it's like, well, I'm waiting for the database, basically.

Bryan Cantrill:

Even though I have quite

Josh Clulow:

a lot of information about, you know, stuff that I could be handing out.

Bryan Cantrill:

I remember asking you a couple of years later, like, what would happen in this case? Like, how would we boot it up from cold if all 3 nodes were offline? And you said something to the effect of, okay, first we'll boot the head node, and then we'll punch booter in the face until it allows everything else to boot.

Josh Clulow:

It does sound like something I would say. It really does

Bryan Cantrill:

sound like I don't know

Bryan Cantrill:

that you're gonna be able to deny that one. This sounds like something that you would say, Josh, I'm afraid. I think even your own lawyer is gonna say this is something that you would say.

Josh Clulow:

Well, my lawyer's like, yeah, he says that to me all the time. But yeah. Absolutely.

Bryan Cantrill:

Okay. So we're beginning to like, we know we're gonna live. We've now got this long shadow of bnx, and dealing with this DHCP issue and being able to get leases there, and ultimately we talked about punching things in the face having to get these nodes kinda bounced a couple of times to get them all the way up. Steve, does someone call you back at some point to let you know how it's going? It occurs to me that I definitely asked Brent to call you during the outage.

Bryan Cantrill:

I don't recall asking Brent to call you back.

Josh Clulow:

I probably just went to the pub.

Steve Tuck:

Yeah. Yeah. The the the hospital bar. No. I did not get a callback from Brent.

Bryan Cantrill:

Right. Sorry.

Steve Tuck:

I feel as though I did get some text updates, though.

Bryan Cantrill:

Okay. Alright. So we let you know that we're

Steve Tuck:

I immediately started texting a couple of, like, ModCloth, Wanelo, Voxer, you know, some of our key customers, just to make sure that they knew that we kinda knew what was going on and that we were gonna be in touch with them frequently. You mentioned that Mark Imbriaco talk, and one of the things that stuck with me in that was the whole talk was about crisis communications when Yeah. You know nothing for 96 hours. And in their case, Heroku is running their entire business on AWS. Their customers are running a huge component of their business on Heroku, and their customers, of course, are like, I'm down.

Steve Tuck:

When are you coming back? And Heroku's reaching out to AWS. We're down. When are you coming back? And hearing nothing.

Bryan Cantrill:

Yeah.

Steve Tuck:

So do you reach out to your customers and tell them every 15 minutes, we know nothing? Every 5 minutes, we know nothing? When you're literally getting 0 incremental information about the status of things. And I think the thing that we had control over, which was great, is that we did have internal information about what was happening. We were the AWS in this picture, and we were gonna go share that information with our customers as quickly as possible.

Steve Tuck:

Even if we didn't know what was next, at least them hearing and understanding that people were working on it some of what we've talked about already today, just drip feeding that information to them because these were largely technologists, and they could see that we were on a path somewhere. But,

Bryan Cantrill:

yeah, all all

Steve Tuck:

I can recall is just sending text messages to, like, 4 or 5 customers. More information coming, update you in the next, like, 30 minutes. Had no idea if we were gonna know anything else in 30 minutes.

Bryan Cantrill:

So, Steve, as I was going through my email from that day, I've got the email that Izzy sent out to our customers, which has the complete timeline. And it's, like, a really great timeline. Izzy, of course, is now a colleague at Oxide, so it's been fun to work with her again. But I was just reminded of how I think it was very important for all of us to be as transparent as we possibly could be Yes.

Bryan Cantrill:

With our customers. And I think that we, you know, I

Steve Tuck:

think we And, again, that was the thing from that Mark Imbriaco talk. The most difficult part was the lack of transparency from AWS. Yeah. Which just left them with nothing to share back.

Bryan Cantrill:

And, to be fair to AWS, that was very early AWS. They handled subsequent outages differently, in terms of their own transparency, and they obviously had fewer of them over time. But that was during an era where because, you know, in the talk that big EBS outage? That was in 2011. Oh, yeah.

Bryan Cantrill:

Okay. Yeah. And one of the points that I made in the talk so I gave a talk, the GOTO Chicago talk, in 2017, and I wanted to talk about this incident because it was, I think, a searing incident for all of us, and it could have been much, much, much worse. But to me, it was also an object lesson in the dangers of a system that seemed to work. We had this kind of semiautomated system that was automated in certain regards, but was still very manual in other regards, and very prone to human failure.

Bryan Cantrill:

And, you know, systems go through this in their life, this kind of era where you're now being broadly used, or you're being used in a way that's kind of beyond the capabilities of the system. And I feel like that was definitely the moment. That was definitely true in 2014 for the software. It was being used a bit beyond itself, and I feel we learned a lot from this. And we made a lot of improvements to the system.

Bryan Cantrill:

Do you wanna Josh, you wanna talk about some of the specific things that we I mean, you mentioned a couple of things. I know that Dave had a bunch of things that we found in Manta. Certainly, we may have changed our options parsing a little bit in SDC. I

Josh Clulow:

think so. I mean, I was just looking through my fortunately, I wasn't journaling in quite the same way I do now back then. So I only have, like, dot points from the scrum channel back then, and it definitely does not reflect the severity of the day. When I look at the list of things, they're just all like, oh, yeah. Probably the strongest thing here is, oh, replacement thoughts discussion for the day, which we then didn't do. But I did file a bunch of I remember thinking that a safety culture was not particularly evident in the oneachnode tool at the time.

Bryan Cantrill:

I'm right here. I can hear you.

Josh Clulow:

I'm doing my best. The as an aside, I feel like the vendor-name-status.com pattern was not super prevalent back then?

Bryan Cantrill:

Yeah. I think that that's right.

Josh Clulow:

Because I I feel like

Bryan Cantrill:

Yes.

Josh Clulow:

We did not have, like, a joyentstatus.com.

Brian Bennett:

We did

Bryan Cantrill:

not have joyentstatus.com.

Josh Clulow:

But we did go and make one afterwards. We made one. Yeah. Because I think that's definitely something if you have SaaS or infrastructure stuff, it is good to have a well-established, easy-to-find bulletin board to put these customer updates on.

Josh Clulow:

Right? Like, when everything else is screwed up.

Bryan Cantrill:

Right. And then you have to actually operate those transparently. Nothing is more enraging to a customer than when it's like, this thing is all green, and it is all down, and you're not communicating.

Josh Clulow:

Like, literally every time I go to githubstatus.com, and it's like, everything's great. That's like, it's really nice.

Bryan Cantrill:

Do you think GitHub is

Josh Clulow:

that I've never seen before. Like, come on.

Bryan Cantrill:

You think it would be like don't you think you should be asking me that question?

Bryan Cantrill:

Why do

Bryan Cantrill:

you think I'm here, GitHub status?

Bryan Cantrill:

Do you

Bryan Cantrill:

think I'm here? Because clearly, I'm here because I think there might be an issue.

Josh Clulow:

They're actually just using, like, the rate at which people view the status page to determine if they should look into something, I think.

Bryan Cantrill:

That would be kinda funny. They were like, hey, we're all green, but there's been a huge uptick in people asking us what's going on. So we're actually

Bryan Cantrill:

we're not gonna

Josh Clulow:

received more visitors to the site in the last 90 minutes than we have in the previous 2 months. What could

Adam Leventhal:

think, what could be wrong? Google tracks the flu by people searching for various symptoms.

Bryan Cantrill:

Yes. Totally. Oh, that's funny. Yeah. Funny.

Bryan Cantrill:

Yeah. That would be interesting. Like, our status is green, but we're now less certain because so many people are asking the question. So but, it's

Steve Tuck:

On redundancy in one of the AWS outages, they had their status page Yeah. In only one location, and that location was down.

Bryan Cantrill:

Which is what would have happened to us. That was our perspective. Yeah. That was

Josh Clulow:

our No.

Steve Tuck:

I know. But I'm just saying, you know, other others have done it.

Josh Clulow:

Wait. Yeah. Yes. I mean, unless you specifically sit down and game out, like, what would happen if we turned off this entire data center, I feel like you could miss a lot of things like this, at least the first time.

Bryan Cantrill:

For sure. And, well, yeah. I mean, how did it color our thinking going forward, too? Because I feel like it had a pretty significant impact on the way we thought about the system.

Bryan Cantrill:

Well, like,

Josh Clulow:

with the website, for instance, I think we took the DNS seriously.

Bryan Cantrill:

I spent, like, a month getting that GSLB

Josh Clulow:

Yeah. That thing.

Bryan Cantrill:

Like spanning West 1 and East 1, like making sure that that worked. I was on that project for a long time.

Josh Clulow:

And that was a DNS-based, like, failover thing. Right? Where you had rather short TTLs for,

Bryan Cantrill:

a fronting record and,

Josh Clulow:

and so on, for multiple locations.

Bryan Cantrill:

Yeah. Yeah. I had DNS servers on both sides, and Yeah. They would each look up your IP address, figure out if it was a shorter route for you to go to east or to west, and return the IP of east or west.

Josh Clulow:

I think it would also try to tell if the other one was working, right, as well.

Bryan Cantrill:

Yeah. It had health checks to the other side. And if the health check failed over there, then it would stop returning that server as a valid destination.
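
For illustration, a minimal sketch of the answer-selection logic just described. Everything here is invented for the sketch (the region names, addresses, geo heuristic, and health flags); the real system answered live DNS queries, with short TTLs so failover took effect quickly.

```rust
use std::net::IpAddr;

// Hypothetical region endpoints; names, addresses, and the geo heuristic
// below are all invented for this sketch.
struct Region {
    name: &'static str,
    vip: IpAddr,
    healthy: bool, // fed by cross-site health checks in the real system
}

// Crude stand-in for real geo/latency logic.
fn client_is_east(client: IpAddr) -> bool {
    matches!(client, IpAddr::V4(v4) if v4.octets()[0] < 128)
}

// Pick the A record to return for this client: prefer the closer region,
// but never hand back a region that is failing health checks.
fn answer(client: IpAddr, east: &Region, west: &Region) -> Option<IpAddr> {
    let (first, second) = if client_is_east(client) { (east, west) } else { (west, east) };
    [first, second].into_iter().find(|r| r.healthy).map(|r| r.vip)
}

fn main() {
    let east = Region { name: "us-east-1", vip: "192.0.2.10".parse().unwrap(), healthy: true };
    let west = Region { name: "us-west-1", vip: "198.51.100.10".parse().unwrap(), healthy: false };
    // West is failing its health check, so even a west-coast client gets east:
    let a = answer("200.0.0.1".parse().unwrap(), &east, &west);
    assert_eq!(a, Some(east.vip));
    println!("answering with {} ({:?})", east.name, a);
}
```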

Josh Clulow:

That definitely improved things.

Bryan Cantrill:

And, Brian, this was a consequence of this outage. This outage was a wake-up call in that regard. Like, we obviously need yeah. Yeah. Right.

Bryan Cantrill:

We need to have something. Yeah. I I I also think

Bryan Cantrill:

moving the website over to West 1 never did get completed. But Right. Somebody had, like, created instances, and they were starting to bring it up. And I think at some point someone was like, well, we got it halfway over there. Why don't we just make sure both of them are on, just in case this happens again?

Bryan Cantrill:

Yeah. Yeah. Yeah. So I dropped the postmortem in the chat. I thought that this was a good postmortem.

Bryan Cantrill:

This was one of our first it would be the first of a couple. It's not the last postmortem, by a long shot. But, just going through my mail, I think that Mark Cavage took a big swing at this, and then I contributed to it. Brent contributed to it. Some other folks contributed to it.

Bryan Cantrill:

But we got this out in a pretty timely fashion. I'm not sure when exactly this was published, but I had a draft in my inbox, basically, later that day. This was definitely a top priority to get out. I think we did a good job of being transparent about what actually happened, and being as detailed as we could be about the timeline. I also personally like the fact, and this is why you can't really have lawyers read this stuff too closely, that we apologized, which I actually think is really important.

Bryan Cantrill:

And I feel we did a good job of expressing that we understood the magnitude of this. So, Steve, one of the things that I think felt frustrating in that whole Heroku and AWS outage is, like, do you guys understand that, like, my business is off right now? I think we really wanted to express to people that we know that infrastructure is important, that you're a customer because you want us to get this right, and we wanted to express that we understand the severity of that. I think we did a good job of that.

Steve Tuck:

Well, I mean, again, it's just that helplessness when you have no way to report back to your customers, your team, your management, what the current status is, what's happening, with any kind of conviction or confidence. And I think, again, one of those things coming out of that great talk from Imbriaco is, like, you actually do err on the side of communicating more frequently. Yeah. You do stay in communication with folks and just say, hey, we're still working on it.

Steve Tuck:

You know, we've got the entire team working on this thing. You'll know as we get things that are meaningful, but we'll keep checking in. And it just allows people not to go to the place of, like, well, this is not gonna get resolved. And you obviously want to give people all the information as you know it, but I remember him talking about how it had been 30 minutes, 45 minutes of being down with, like, no communication. And often companies are struggling with, do I wait to tell them until I know something, until there's something that's actionable?

Bryan Cantrill:

Right.

Steve Tuck:

And if you wait hours, days, while there's things that are not actionable, it actually does much more damage than these frequent updates that don't have quite as material information.

Bryan Cantrill:

Totally. And I think that that's something

Bryan Cantrill:

that people are speculating and

Bryan Cantrill:

Yes.

Bryan Cantrill:

Yeah. Those terrible rumors that come out based on 0 information people are going to latch onto those and assume that they're real, and that stuff will haunt you forever.

Josh Clulow:

Well, also, like, if 6 hours goes by or something I mean, I don't know. If it's 6 hours with nothing, I'm already beginning to try and resuscitate my software somewhere else. Right? And once that process completes, then you're not really a customer anymore, which is I

Bryan Cantrill:

feel like I want to avoid

Josh Clulow:

it. Yeah. Definitely.

Bryan Cantrill:

And so then Brian, were there other things that we changed operationally coming out of this? There are definitely a bunch of other things I wanna talk about on the insurance side, but there

Bryan Cantrill:

were there other Yeah. Well, so one thing is, this happened before Triton was open sourced. And as a consequence, the ticket for the rearchitecture of the sdc-oneachnode tool has never been made public until now. And I'm Oh. Dropping this stuff.

Bryan Cantrill:

Oh,

Bryan Cantrill:

Wow. Look at that. What a what a reveal.

Bryan Cantrill:

To look at this. Yeah.

Josh Clulow:

HEAD-2006 or 7. Right? Is that

Bryan Cantrill:

HEAD something? This is the rearchitecture. So this was after this event

Bryan Cantrill:

Yeah.

Bryan Cantrill:

This was all of the things there was a series of tickets that were created as, like, subcomponents of this. But there was HEAD-2006, sdc-oneachnode should not default to all nodes. That was a big one. Yeah. HEAD-2007, that's what I meant.

Bryan Cantrill:

Require command

Josh Clulow:

Not years, but second not days. Argument.

Bryan Cantrill:

Oh, yeah. Yeah. Yeah. Yeah. HEAD-2006.

Bryan Cantrill:

That's right. And then HEAD-2008, sdc-oneachnode should validate the node list before execution.

Josh Clulow:

Agreed.

Bryan Cantrill:

And those three things each one of those played a part in the tool misinterpreting the syntax as written. And I think Yeah. Any one of those would have prevented this from happening. They were all necessary.

Bryan Cantrill:

Well and I think this is a very important point. Right? And thanks for making this available for everyone to look at. Because I think one thing that we did well after this event and it definitely informed, and continues to inform, the way we do engineering is that you really wanna take this as an opportunity.

Bryan Cantrill:

This was an accident. It could have been much worse. It's an opportunity to improve so many different things. The system has now revealed new things about itself that we may have known kind of in the abstract, and now we know really viscerally, or in some cases it's like, oh boy, we didn't realize that was kind of an emergent behavior. And really taking the opportunity to enumerate all of them, and attack all of them, and get all of them resolved is really, really important.

Bryan Cantrill:

And I think I

Josh Clulow:

feel like I definitely wrote this ticket. It says created by former user. But

Bryan Cantrill:

Yeah. That's because the Jira has changed hands a couple of times. And Yeah. So everybody is in there as the former user. And for the most part, I can't tell who it was unless I recognize the style of writing, and this is definitely you.

Josh Clulow:

The code is totally a relic from another time, using such arcane constructs as sys.pump. God, I wow. That's a

Bryan Cantrill:

Sys dot pump.

Josh Clulow:

It is. Dot I don't remember

Bryan Cantrill:

this stuff. Dot pump.

Josh Clulow:

But it sounds like the kind of thing I've chosen to forget.

Bryan Cantrill:

Yes. It is a relic from another time. This is from early this is from 2011. I mean, this is from you.

Josh Clulow:

Pre 0.4, I feel like. Yes.

Bryan Cantrill:

Yeah. 0.4. This is in the

Brian Bennett:

It's a 0.4-ever kinda case.

Bryan Cantrill:

Yeah. This is the 0.4-ever era. Yeah. And

Brian Bennett:

And also, Josh, how do you forget sys.pump when you have your presentation on streams with, you know, magical tools?

Josh Clulow:

Yeah. That is a good presentation.

Bryan Cantrill:

You gotta drop a link to that. I'm not sure I've seen that.

Josh Clulow:

It's actually not on the it's actually not public.

Bryan Cantrill:

Well, then make it public. Brian's still with us. Which one? We make things public right

Bryan Cantrill:

now. Alright.

Josh Clulow:

Alright. Yeah. Yeah.

Bryan Cantrill:

What's the number?

Steve Tuck:

It's no. No.

Josh Clulow:

It's a it's a video.

Bryan Cantrill:

I do think that, Brian, it would be great to make HEAD-2006, 2007, and 2008 public if we can. I mean, I think it'd be interesting to

Bryan Cantrill:

Yeah.

Bryan Cantrill:

The but this is old. The other thing that's very funny is, because I'm in my email from that day, right? I've just got all my email from that particular day, and much of my email on the day is due to this outage.

Bryan Cantrill:

Right. It's not all of my email, though. The other thing that is happening is that I've got several other things that are cooking. One is a customer who was trying to get spun up on Joyent, and they seemed to not care about the fact that we rebooted the data center. It didn't come up in those conversations. Like, okay.

Bryan Cantrill:

Fine. There were 2 other things going on. One is that TJ Fontaine is sending a dispatch from the Node.js front, where there's unruliness, and things are really beginning to heat up ahead of the Node.js community fracture. Of course, this is May of 2014.

Bryan Cantrill:

It would be in December of 2014 that you would have io.js. Right? So it was kind of weird to go back to that moment in history where it's like, oh, right. Yeah.

Bryan Cantrill:

Like, the tensions are continuing to build, and ultimately that's gonna lead to this fracture of io.js, and our own kind of reflection back on that. I can drop in the talk that I gave also in 2017, actually looking back on Node. And then the other thing and, Steve, I thought you'd find this particularly funny. The other thing that I've got is, at the same time, one of our VCs had identified a CEO that we should hire.

Bryan Cantrill:

And we had a CEO at the time, and I've got actually a bunch of communication back and forth with him on this day as you can imagine.

Steve Tuck:

Is that May 2024?

Bryan Cantrill:

May 2014.

Steve Tuck:

I mean Yeah. Or or 2014. Yeah.

Josh Clulow:

Yep. One of those.

Bryan Cantrill:

And in particular, Steve, what I've got is mail from Charles saying, hey and I've kind of made reference to this in the past, but yeah. May 24, we're gonna keep the current CEO. Thanks. I don't know. Yeah.

Bryan Cantrill:

Yeah. Not May 2024, but May 2014, I've got mail from Charles saying, hey, Scott wants to meet with you he really enjoyed his conversation with you and wants to have a much longer conversation. And I want you to come down to our offices to have that conversation. And ultimately, we would hire Scott to replace the CEO, to replace Henry.

Bryan Cantrill:

And that is happening on this day of all days. And I have a mail back to Charles that is pretty short, both in length and in tone, of like, this is gonna be a big waste of my time. Because at the time, I'm like, we do not need another CEO. We have a CEO. Why are you doing this?

Brian Bennett:

And Last thing we need.

Bryan Cantrill:

Last thing we need. And so I've got a mail back to him being like, yeah, sorry. This is, like, kind of the wrong day. So, why is this important?

Bryan Cantrill:

Can I get some additional context on that? I thought it was kinda funny that that was happening at the same moment. You had these other big things that were going on.

Josh Clulow:

Like, this was around the time you were getting a lot of haircuts.

Bryan Cantrill:

This thing called haircuts?

Josh Clulow:

Yeah. What is this? I feel like I feel like I

Bryan Cantrill:

So did I just stumble onto something?

Josh Clulow:

You had to go talk to the new CEO at some point.

Bryan Cantrill:

Oh, I see.

Josh Clulow:

Oh, you had a lot of

Bryan Cantrill:

Oh, god. For a moment, I'm like, is there something that people have all been remarking about the haircuts that I have gotten?

Josh Clulow:

Because you know what, I

Bryan Cantrill:

actually did have a barber at that time that was very hit or miss, and he and I would get into these in-depth conversations, and I would distract him, and he would do a terrible job cutting my hair. So I have this latent belief that there's a chat that consists of everybody except for me, all talking about my terrible haircut. Sorry, Josh. That was, No.

Josh Clulow:

That's, No.

Bryan Cantrill:

What you're referring to is the fact that Scott was actually kind of vetting the company at the time. And I was having to invent these reasons to, like, duck out of the office in the middle of the day to go meet with him while he was supposed to be meeting with our current CEO. And, yes, it was awkward. It was

Josh Clulow:

looking very sharp.

Bryan Cantrill:

That's right. Frequently looking sharp. That's right. But, yeah. Oh, and sorry.

Bryan Cantrill:

So we have made Brian, thank you. You've made those available so folks can kinda see Josh proposing: I brought in a new flag, --allnodes, if you wanna run on all nodes. That's a good

Brian Bennett:

idea. There's a

Josh Clulow:

lot of leaning-forward italics in these bugs. That's, yeah, like this italics is in lieu of leaping across the table at the reader, basically. The original command definitely Yes. Possibly.

Josh Clulow:

The, I think that was a big safety point, right, was that the command would fail open pretty well. Like Yeah. Like, if you gave it trash, it was pretty easy to get it into a position where it would just do the most destructive thing kind of by default.

Bryan Cantrill:

Yes. Yes.

Josh Clulow:

Without flags it was like, I'm gonna do this on every computer unless you tell me not to. And it was pretty easy to, like, forget to, or to be misinterpreted when trying to tell it not to do that. And it would definitely do it pretty quickly as well. Because, architecturally, there was a RabbitMQ server that had a topic in it, and the agent that ran on every compute node the management agent, basically, that would execute the job was just listening to this topic. And if it saw something that matched its hostname, or didn't include any kind of specifier at all, it would just do it straight away and reply.

Josh Clulow:

And so the rearchitecture of the client end of this ultimately moved away from all of that. Like, before, if you specified, like, 17 hostnames that didn't exist, it would just try and do those. And if there was a timeout for some of them, like, you misspelled one of them, it would still execute on the other ones. And you were kind of, like, halfway through an operation, and you had to pick up the pieces based on the output and figure out what to do next. So I created a separate discovery phase for the command. So, like, for starters, there's an allnodes flag now.

Josh Clulow:

So unless you specify the allnodes flag or a specific list of nodes, you haven't been specific enough, and we won't do anything. And if you do specify a list of nodes, then we'll make sure that we've heard a broadcast from all of them in the discovery phase before doing anything on any of them, which felt much safer. Because then, like
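
A minimal sketch of that discovery-then-execute pattern, under assumptions: the `Target` and `plan` names are invented for illustration, and this is not sdc-oneachnode's actual code.

```rust
use std::collections::BTreeSet;

// Either the operator explicitly asked for everything, or they named nodes.
enum Target {
    AllNodes,                // must be requested explicitly (e.g. --allnodes)
    Nodes(BTreeSet<String>), // explicit list, validated before execution
}

fn plan(target: Target, discovered: &BTreeSet<String>) -> Result<BTreeSet<String>, String> {
    match target {
        // "All" is a deliberate choice, never a default.
        Target::AllNodes => Ok(discovered.clone()),
        Target::Nodes(wanted) => {
            // Refuse to run anywhere unless every named node answered discovery.
            let missing: Vec<_> = wanted.difference(discovered).cloned().collect();
            if missing.is_empty() {
                Ok(wanted)
            } else {
                Err(format!("unknown or unresponsive nodes: {missing:?}; doing nothing"))
            }
        }
    }
}

fn main() {
    let discovered: BTreeSet<String> =
        ["cn0", "cn1", "cn2"].into_iter().map(String::from).collect();
    // A typo'd hostname aborts the whole operation rather than running on a subset:
    let wanted: BTreeSet<String> = ["cn1", "cn9"].into_iter().map(String::from).collect();
    assert!(plan(Target::Nodes(wanted), &discovered).is_err());
    println!("{:?}", plan(Target::AllNodes, &discovered));
}
```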

Bryan Cantrill:

Much safer. Yeah.

Josh Clulow:

Yeah. Because, like, this, like so many other Unix things, is a stringly typed sort of interface, but there's a pretty limited subset of strings that are actually valid hostnames in most data centers. So, like, you can check them. So we did start doing

Bryan Cantrill:

that. And we also have, in CNAPI, like, the discrete list of all hosts. Like, it cannot be outside of that list.

Josh Clulow:

Right. I don't know that we actually consulted the database, on the thesis that the database might be down. The looseness of this tool definitely assists in some circumstances. Right? Because if the whole control plane is down except for this thing, you can still get quite a lot done in trying to fix it, which was valuable.

Bryan Cantrill:

That it was valuable.

Josh Clulow:

So, like, the fact that it doesn't actually as I recall, it does not consult the database so much as it just broadcasts and then checks the things it gets back to see if they match what you said should be there. And if they are, then we can proceed.

Bryan Cantrill:

Yeah. There have been quite a few cases where a compute node was hung, or apparently hung not getting any kind of response from it whatsoever, you know, can't SSH to it. Like, maybe it pings, maybe it doesn't. But it responds to sdc-oneachnode, and we can, like, dump the process list and see what's consuming memory, like, is the disk full. You know, there's a lot that we can do with this thing, because that agent was still running and actually able to, like, still fork.

Bryan Cantrill:

Yeah.

Adam Leventhal:

So I'm curious, guys, you know, 10 years on: do you find yourself thinking about this incident as you're building stuff, and how does it inform

Josh Clulow:

Oh, yeah.

Adam Leventhal:

The the things that you're building today?

Bryan Cantrill:

Like, pilot. This incident lives rent free in my head.

Josh Clulow:

Pilot that is to say, pilot, the tool that we have used to poke at Oxide rack stuff in the pre-control-plane era, sort of as an engineering tool definitely has a separate discovery phase, for instance, and then an execution phase. Because there's a lot of broadcasting and stuff in there as well. That's definitely based on my experience with the original oneachnode tool, for instance.

Bryan Cantrill:

Yeah. And I also feel, Adam, like, we do not actually PXE boot in the Oxide rack. Right.

Josh Clulow:

Yeah. Definitely not. Good.

Bryan Cantrill:

And it feels because PXE booting seems so great. And Yeah. You only have

Josh Clulow:

to update one copy of the whatever and It's

Adam Leventhal:

a stateless service. Nobody likes state. Stateless. Right on.

Josh Clulow:

Right. Except that, like, it turns out it's not actually that hot. Like, we kept the RAM disk part, and we just put the RAM disk on disks in all of the computers.

Bryan Cantrill:

And I feel, Josh I feel that is 100% a ramification of this outage.

Josh Clulow:

Definitely. That's like that fashionable office chair that is made entirely out of scar tissue. Like, that's yeah.

Brian Bennett:

Yeah. I think there are other things that happened that also motivated getting away from PXE. But

Josh Clulow:

Well, so we would have had to have invented a PXE thing as well, but, like yeah.

Bryan Cantrill:

Yeah. So, Robert, what were some of those other things? Because, I mean, for me, this incident definitely looms the largest, but it doesn't surprise me that there were other things that helped inform that from your perspective.

Bryan Cantrill:

I

Brian Bennett:

mean, just all the kind of big gems that we had. I mean, in that era of Triton and SDC, we had a big single L2 broadcast domain for PXE. Not that you can't configure switches to deal with that, or deal with, like, switches that don't deal well with LACP link aggregation and failing back.

Bryan Cantrill:

But Yeah. That was

Brian Bennett:

kind of a huge class.

Bryan Cantrill:

Properly

Brian Bennett:

securing PXE is not impossible, but difficult.

Josh Clulow:

Pretty difficult.

Brian Bennett:

There's a bunch of stuff that Alex Wilson, Josh, and I were looking at in the context of, you know, how we would use YubiKeys, how we would try to get some of the certs into the UEFI cert store or into iPXE. Yeah. But it's messy. It's not impossible.

Bryan Cantrill:

It's messy.

Josh Clulow:

And it honestly isn't clear that there is a complete implementation of all of those things that actually works in a way that isn't just also, like, by the way, I'll accidentally accept all kinds of signed code that isn't the code that you thought. Like, it's very difficult to even just test that stuff.

Bryan Cantrill:

Totally. And as a consequence, as Josh mentioned, we still have a RAM disk, but we don't actually download it over the network. It has actually been dropped down onto one of the local drives, onto the M.2s, and then we Yeah. Boot off of that, which, again, to me is a very direct hit.

Josh Clulow:

As well. Like, I mean, it's quite snappy, and the whole rack can cold boot itself, which is nice.

Bryan Cantrill:

Okay. So this is another thing. I just think, in general, the rack cold booting is very important to us. That failure mode is very important to us. Another thing that I feel is a consequence, at least in some aspect, of this outage that this outage at least helped inform is the presence of the recovery path in the Oxide rack, where we can lay an image down when we don't have an image on the M.

Bryan Cantrill:

2. And the fact that we use that recovery path for updating the system to me, it's not that you can draw that directly back to this outage alone, but I feel like the kinds of things that we learned in this outage that wisdom's kinda reflected there. It's important that if you do have a recovery path, you use it for something. Yeah. I mean, Josh, is that So

Josh Clulow:

we definitely don't use it for everything. Right. But we do use it. We exercise it vastly more often than we ever exercised turning off 600 machines and PXE booting them all at once.

Josh Clulow:

Right? Like, because we That's right. had never exercised that. And to be honest, even after we did this, we didn't exercise it again, because it was just too hard to arrange that kind of environment as an automated test or something. There's just a lot of computers involved in that specific failure mode, and it wasn't clear how much effort we should spend on, like, building a TFTP load generator or something.

Bryan Cantrill:

Well, I think another thing that's been important and, again, I don't know if you would make this a direct descendant of this exact outage is that we have many different ways of simulating different aspects of the system. There's not one way of simulating the entire system, and I think that's important, because it allows us to simulate a bunch of different kinds of failure modes without necessitating a full rack. Because part of the fundamental challenge here is that we had no way of having an environment that really looked like that production environment in terms of scope in terms of, you know, hundreds of machines in development. And so you need a different way of being able to simulate aspects of that.

Bryan Cantrill:

And I feel that that that this incident was very educational in that regard.

Josh Clulow:

Yeah. And back pressure is another thing. Because, I mean, really the biggest failure in the boot path during this incident, I think, was the TFTP stuff. Right? Because there's really no back pressure in that mechanism, because TFTP is from, I don't know, 1978 or whatever, and not very good. TCP is also from around then, but built with some understanding of what is going to happen in networks that are not behaving well.

Bryan Cantrill:

Yeah. Yeah. Yeah.

Josh Clulow:

And, like, congestion control is a first-class part of TCP, and that's why it works. Like, that's why we're still using it.
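
To put rough, illustrative numbers on that (the image size here is an assumption, not the actual platform image): classic TFTP moves one 512-byte block per lockstep request/ACK round trip, so a 150 MB boot image is on the order of 300,000 round trips per machine. With 600 machines PXE booting at once, that's roughly 180 million serialized UDP exchanges converging on a single booter, with no congestion window and no backoff beyond crude retransmit timers, whereas HTTP over TCP moves the same bytes with congestion control doing the pacing.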

Bryan Cantrill:

Right. And congestion control is one of those things that people without congestion don't care about. It's like, why would you care about this? Just use UDP. It's like when

Josh Clulow:

And when you're PXE booting something, generally, you're PXE booting the one machine that you're setting up right now, and then you're gonna do the next one. It's pretty rare. Certainly, I think, in 2014 it was rare to see, by the way, we're booting hundreds of computers all at once, all the time, with PXE on one network. It was uncommon, I think, for people to have that level of automation routinely in their stack. And because we had organically grown the size of the deployment up to that size over years like, a few machines at a time, basically.

Josh Clulow:

Yeah. Like Yes. In batches, and they were quite reliable. They stayed up for quarters or years at a time unless you explicitly rebooted them, for the most part. So there was just never an option

Bryan Cantrill:

to fix it. The other place we got to that scale was with Samsung Cloud, where we had enough compute nodes that even if we rebooted, like, 50 a day, we wouldn't be able to reboot every single server within a year.

Bryan Cantrill:

Yeah.

Bryan Cantrill:

And so we And and it was just a whole bunch every day. But even that volume wasn't enough to, like, overload booter.

Josh Clulow:

Well, and we had HTTP booting by that point as well. So, like, the

Bryan Cantrill:

Which would have been the

Bryan Cantrill:

default by then.

Josh Clulow:

Yeah. By 2015? Definitely. It was

Bryan Cantrill:

I know that it was available. There's a SAPI setting for it. And I know that, at one point, the default was switched, but we didn't want that to just change on people when they upgraded without them knowing. So if you had an install that was from before HTTP booting, and you hadn't set the flag, you were going to stay that way.

Bryan Cantrill:

Right.

Bryan Cantrill:

And so I know that, at least in JPC, we had to explicitly enable that after it had been available for a while.

Bryan Cantrill:

And certainly, though, the presence of HTTP booting is a very direct consequence of this outage.

Josh Clulow:

Yeah. It was like, what can we do instead of TFTP that will actually work if 200 to 1000 computers try to boot at once? And that thing is: nginx will serve the images, because it is quite capable, and we will use TCP to get them to the computers, because it coexists with other connections concurrently. And you asked earlier what lessons we carry with us, and I feel like the whole Oxide computer is a lesson related to PXE stuff, in that we are not using the firmware from other people as much as we can. Because we were willing to try and support SmartDataCenter, and then later Triton, on anything that the customer had, we had to deal with a lot of shitty PXE ROM stuff.

Josh Clulow:

Stuff that didn't work right. And so we tried to work around that by chain loading to iPXE, which could at least do the HTTP thing pretty reliably, but even that wasn't a slam dunk, because it had to work with the NIC that was in whatever box it was. There were just a lot of challenges. And so, by not using the extant BIOS or EFI firmware in the system, or the extant PXE ROM on the NIC, or indeed any of the firmware or existing PXE mechanism and protocol stuff at all, we've managed to sidestep a lot of those problems.

Bryan Cantrill:

Totally. And all of that is a consequence of, as you say, the furniture made of scar tissue. And Right. I mean, we had a bunch of it that we were accumulating over the years. And this is an important chapter, because we also had you know, there was the Broadcom issue, whether that was a driver issue or a firmware issue, and that Broadcom issue did cast a long shadow for me personally.

Bryan Cantrill:

Like, this was not the only Broadcom issue we had, by a long shot. And Right. It was pretty frustrating, and it definitely informed, you know, Robert, when Robert did our RFD on NIC selection, which he very famously and appropriately titled The Four NICs of the Apocalypse. Is that one public, by the way, Robert? Or should it be?

Bryan Cantrill:

Maybe it shouldn't be. I don't know. But, you know, in the Four NICs of the Apocalypse, everyone agreed that Broadcom is War, because it definitely feels like

Josh Clulow:

Who was Pestilence?

Bryan Cantrill:

Intel. Intel is Pestilence. Yeah. That's accurate. Yeah.

Bryan Cantrill:

We stand by it. Yeah. We went with Famine, by the way. Chelsio we decided that Chelsio, valued partner, was also Famine.

Josh Clulow:

Maybe this is not the

Bryan Cantrill:

way we're supposed to share that. Maybe this one wasn't supposed to be public. Oh, well.

Josh Clulow:

Famine keeps you lean.

Bryan Cantrill:

Famine does keep you lean. Exactly. Nothing wrong with a little elevated cortisol due to the inability to feed yourself. But I think that, for me, the Broadcom issue cast a long shadow. And as you say, Josh, it's a very interesting point: there were many things over the years that helped inform what we did and did not wanna rely on to build a reliable system.

Bryan Cantrill:

And this was one of those wanting to be able to control our own fate completely, which, of course, we've done to a much, much greater degree at Oxide.

Josh Clulow:

And I feel like we also spend a lot of time thinking about the safety of the API operations we create for operators, customers, people that buy the rack. I think we put a fair amount of thinking into, like, what does it mean to create an operation that has no validation and lets you reboot everything? Like

Bryan Cantrill:

Okay. And look, this is cheap, and I'm not trying to get out of responsibility, but, like, JavaScript wasn't helping out with a lot of this stuff. And No. No. I do.

Bryan Cantrill:

I mean, I just mean,

Josh Clulow:

like, as a company,

Bryan Cantrill:

you know, it's like you put it in. But, honestly, I think there were also many things over the years, of which this is, like this is a minor note on it. But,

Josh Clulow:

the wishy-washiness of the type system definitely does not help you be rigorous with, like even things like the exhaustiveness of an enum check, right, is not really a thing that exists unless you do it by hand.

Bryan Cantrill:

Right. Right.

Josh Clulow:

And I think And enums are really just strings, to be honest, and, like Yep. Yep. Heaven help you if you misspell one of those or or miss one or whatever. Yeah.

Bryan Cantrill:

I mean, the truth is it would actually be much harder to make this kind of mistake in Rust, using clap or your fine getopts crate not to take anything away from getopts rather than a custom option parser. If you were building it in Rust with one of those components, it's much easier to make correct software; you'd have to go a bit more out of your way to get this wrong. And that, for me, is one of many things that helped inform the collective decision that we wanna actually build the future in Rust. And, Adam, I think people might reasonably wonder.

Bryan Cantrill:

It's like, well, why did you do a new control plane at Oxide? We got a lot of Triton working over the years, so it's not a condemnation of Triton. But we also saw an opportunity to go really fundamentally address some of these issues at a deeper level. And maybe not have 113 as special-cased test code. Right.
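
For illustration, a sketch of the option-parsing point using the clap crate; the flag and field names here are invented, and this is not the actual tool. The idea is that "run everywhere" becomes an explicit, machine-checked choice rather than the absence of an argument.

```rust
// Sketch only; names are invented. Assumes clap 4 with the derive feature:
// [dependencies] clap = { version = "4", features = ["derive"] }
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Run on every compute node; it must be asked for explicitly.
    #[arg(long, conflicts_with = "nodes")]
    allnodes: bool,

    /// Target node; repeat the flag once per host.
    #[arg(long = "node", required_unless_present = "allnodes")]
    nodes: Vec<String>,

    /// Command to execute on the selected nodes.
    #[arg(required = true)]
    command: Vec<String>,
}

fn main() {
    let cli = Cli::parse();
    // By the time we get here, the parser has already rejected "no targets",
    // so there is no code path where garbage input silently means "everywhere".
    if cli.allnodes {
        println!("running {:?} on ALL nodes", cli.command);
    } else {
        println!("running {:?} on {:?}", cli.command, cli.nodes);
    }
}
```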

Josh Clulow:

Inband right. Inband signaling is definitely, like, my nemesis. Definitely. It's like, anytime you see someone's like, alright, so this string means this.

Josh Clulow:

Yeah. And if you put the value all in this string, it means something else. It's like, okay. Well, no. Stop. Don't do that.

Josh Clulow:

Have Right. an enum or something. Right? Or an object with 2 fields, or whatever. Don't take the string value and jam a couple of special string values in there.

Josh Clulow:

It's just not great. It leads

Bryan Cantrill:

to problems.
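
A minimal sketch of that contrast, with invented type and field names:

```rust
// Replacing in-band magic strings with a real sum type.
enum Selector {
    All,
    Hostname(String),
}

// Stringly-typed version: "" and "all" silently mean "every machine".
fn stringly(selector: &str, my_hostname: &str) -> bool {
    selector.is_empty() || selector == "all" || selector == my_hostname
}

// Typed version: there is no string you can garble your way into `All` with.
fn typed(selector: &Selector, my_hostname: &str) -> bool {
    match selector {
        Selector::All => true,
        Selector::Hostname(h) => h.as_str() == my_hostname,
    }
}

fn main() {
    // A truncated or garbled message becomes "run everywhere":
    assert!(stringly("", "cn7"));
    // ...whereas the typed selector can only match or not match:
    assert!(!typed(&Selector::Hostname("cn3".into()), "cn7"));
    assert!(typed(&Selector::All, "cn7"));
}
```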

Bryan Cantrill:

And so I think the net net of this incident was that it was a near miss in every dimension. I think that, like,

Josh Clulow:

it was 90 minutes, and no one lost any data.

Bryan Cantrill:

90 minutes. No one lost any data. I mean, Steve, from a customer perspective, I recall customers being like, actually, we appreciate the communication. I don't recall losing customers over it. I was very worried we would, but I don't think that we did.

Bryan Cantrill:

I I I think that

Steve Tuck:

No. No. I don't think we did. If anything and again, I was gonna say, one thing we carried forward for sure was being overly transparent in the midst of issues.

Bryan Cantrill:

Yeah.

Steve Tuck:

And, I mean, we even saw this with our very first install of Oxide last year. When we would hit bugs, we would invite the customer in to kinda go shoulder to shoulder and join the internal troubleshooting calls. And they remarked at the end of the week just how refreshing that was and how different it was, and that, for them, it really speaks to a longer-lasting partnership. And so I think we gained more customer confidence. Well, along with a bunch of questions about, like, how will this not happen again?

Bryan Cantrill:

But Right.

Steve Tuck:

Transparency won out for sure.

Bryan Cantrill:

Yeah. Absolutely. And I and I I

Bryan Cantrill:

just need to remember I mean, one of the customers sent a message saying how impressed they were that we could have this level of an outage and actually recover that quickly. And that was, like, one of the reasons that they felt good Staggeringly large outages. Giant.

Bryan Cantrill:

Yeah. Like us. Yeah. No one is more surprised to hear it than we are.

Bryan Cantrill:

If this was at AWS, it would have been a multi-day outage. Like, there's no way AWS could, like, reboot a data center like that.

Steve Tuck:

I almost wanna say Constantine or someone wrote a public post. Yeah. There was something public, I recall, to that effect. Maybe it was just that inbound email from the customer, but, yeah, they appreciated it.

Bryan Cantrill:

Well, I mean, we had a bunch of these over the years, and it really helped cement our own belief that you have much more to gain by being transparent, and being transparent when you don't know everything. And I think that's really hard for companies to do, because they don't wanna seem like they don't understand their own system. But being transparent about what you do and don't understand, and what you're doing to improve that understanding, and also understanding the gravity of it it all served to reinforce a bunch of things. Again, it could have been much, much worse.

Bryan Cantrill:

And, Steve, it's something that we will associate with Abby's birthday. Every birthday.

Steve Tuck:

Exactly. 15th anniversary, 20th anniversary.

Bryan Cantrill:

That's right. Well, this has been a lot of fun. Maybe a little bit traumatic, but, as you say, Brian, this is a day that lives rent free, I think, in a lot of our heads. The one last note I would put on this: I think people were a bit surprised although it was never a question for us they're like, wow.

Bryan Cantrill:

Like, that operator the operator still works there. Like, Absolutely. And I think the way we handled it in that regard, people found refreshing, and I think revealing of our own culture. One of the things I found in my email was Brent mailing the team, kind of flabbergasted, you know, being new at Joyent. He's like, I was extraordinarily impressed by how everybody jumped in together to fix this.

Bryan Cantrill:

Of course, like, fine, dude. Of course we did. We had no business doing anything else. No.

Josh Clulow:

No. Like, we'll be back when the data center is back up. We're going back.

Bryan Cantrill:

That's right. That's right. We're gonna let you know. Of course we're always gonna jump in together and debug it together. But good stuff, exciting stuff.

Bryan Cantrill:

Happy 10th again to Abby, and thank you again. Brian, thank you so much for joining us and Robert and Josh Totally. and Steve, of course and for reminiscing a bit. And, Brian, thank you very much for making those tickets public. It's great for people to be able to go look at those.

Bryan Cantrill:

And it's one of the Jira tickets whose numbers Josh retains. So you know it's consequential. Alright. Thanks, everyone. Take care.
