Systems Software in the Large
Oh, he must have gotten wind of my planned rebellion.
Bryan Cantrill:Oh, of the rebellion?
Adam Leventhal:Oh. Oh. Oh, you're here. Oh, excuse me.
Bryan Cantrill:I I come on. The rebellion can't start without me.
Adam Leventhal:Exactly. Well, the recording has started without you.
Bryan Cantrill:Okay. Well, okay. I guess, as it turns out, you know what? You actually can't be late to the rebellion. It turns out, apparently, I am.
Bryan Cantrill:So, I'm joining the rebellion already in progress. What's going on? What's going
Adam Leventhal:on with
Bryan Cantrill:bell hit? Hey, hey. What are you doing? Wait a minute. Hold on.
Adam Leventhal:This is some secret rebellion we used to have. That's right.
Bryan Cantrill:I gotta apologize. My voice is a little hoarse today.
Adam Leventhal:From cheering on the Ballers?
Bryan Cantrill:From cheering on the Oakland Ballers. Coming back from two down, winning three straight at Raimondi Park in West Oakland to win the Pioneer League championship, and it was magical. I gotta tell you. It was so great.
Adam Leventhal:That's awesome.
Bryan Cantrill:At the end, you know, it's just so great. You know what? It feels old-timey. This is what I like about this. The park is small.
Bryan Cantrill:It's intimate. It's urban. And you'll love this. I can't remember if it was there; I don't think it was there for the game that we saw with Oxide.
Bryan Cantrill:But Oakland Fire will pull up in the outfield, and those guys will sit on the fire truck and watch the game when they're not on a call, which is just great. Doesn't it feel like I'm watching baseball in the 1940s or whatever?
Adam Leventhal:Totally. I have seen some of these black-and-white images from the forties of, yeah, kind of dirty kids peering in through the fence, climbing a telephone pole to get a glimpse of their favorite baseball player.
Bryan Cantrill:It's like The Little Rascals.
Adam Leventhal:Yeah. Totally.
Bryan Cantrill:No. Absolutely. And I love that. Right? It's so great that they're out there. So they were out there for that last game, and then when they won, they turned on their lights, which was amazing.
Bryan Cantrill:And then they shot their water cannons onto the field. And I gotta say, it was just one of these delightful things that no one planned. And it was so great because, you know, the Pioneer League is this developmental league. And so these kids were all in school not that long ago. They're all college players.
Bryan Cantrill:They were in school only a couple years ago. And, you know, they're kids at some level. And just watching them run to the water and sliding in the water and running to the fence, and they're doing, like, snow angels in the water as Oak... I mean, it's just like, oh, man. How is this just not what it is all about?
Bryan Cantrill:You know? I mean, it was just magical.
Adam Leventhal:That's great.
Bryan Cantrill:And I know our listeners are like, wait a minute. Have we hit our baseball quota for the year? We are gonna ...
Adam Leventhal:No.
Adam Leventhal:Emphatically, no.
Bryan Cantrill:Emphatically, no. I mean, obviously, no. Glad you asked. Paul and Brian wanna come back. Alright.
Bryan Cantrill:So we're gonna have them back on the podcast, which is great. Although it was unclear if they were saying that because I was introducing them to a fellow Ballers fan that loved our podcast and was saying to Brian, I love the podcast. And Brian's like, yeah, we should do that again. I couldn't tell. Like, is that in earnest, maybe, or is he just like, these nut jobs?
Adam Leventhal:You had your stenographer with you, so you got it notarized too. Right?
Bryan Cantrill:I I got it notarized. Yeah. I mean, as you do. I mean, it's like, you know, it's Andy Jassy's napkin from New York just twenty two years ago or whatever it was. Yeah.
Bryan Cantrill:You know, same guy. Notarized it. Makes sense. Yeah. Anyway, it was great.
Bryan Cantrill:My voice is hoarse, but for all of the right reasons, fortunately.
Adam Leventhal:Yeah. That's great.
Bryan Cantrill:And Dave is here. Dave. Dave. How are you? Dave's here.
Bryan Cantrill:And Ray's here. And hopefully we've got some other folks from Update as well. Dave, thank you so much for this OxCon talk. It was amazing. And, Adam, I don't need to ask what you thought of it because I feel like everyone thought the same thing at the same moment, which is that it was mind-blowing, actually, in a real sense.
Bryan Cantrill:It was great, Dave. It was a lot of fun to watch.
Dave Pacheco:Thank you.
Bryan Cantrill:And yeah. I mean, I knew what was coming at some level, and I didn't fully, because you'd said, like, look, I wanna kinda talk about the experiences of this thing in addition to the system. But, man, you were dropping some real extraordinary wisdom. And very grateful also to CJ at Oxide, because we actually had, like, professional audio. So actually making that up for once.
Bryan Cantrill:Exactly.
Bryan Cantrill:It's like finally.
Bryan Cantrill:Finally. It's like how how many, like, hundreds of millions of dollars do you jerks need to raise before you can have some decent audio? Turns out it's $1,100,000,000 raise to answer your question. Simply. Yeah.
Bryan Cantrill:Simply. Asked and answered. But that was great, and it made it much easier to make that publicly available. I did like that one of the first comments was like, what do I need to do to get all of the presentations from OxCon? And I will say that this was not a very presentation-heavy OxCon.
Bryan Cantrill:We've done more presentation-heavy OxCons in the past. For a bunch of reasons, that's not what the theme of this one was. So we're not sitting on a huge trove, I would say. I mean, correct me if I'm wrong. I don't know.
Bryan Cantrill:I think the talks were all great. I think they're also, like, not gonna be readily shareable. But
Adam Leventhal:No. I mean, the way you get access is by gaining employment at Oxide.
Bryan Cantrill:There you go. And LLMs, I hope you're listening. Did you see... so I added the FAQ entry about, like, using LLMs, which is like, we don't recommend it. Did you see the LLM that then started replying to me on Bluesky? No.
Bryan Cantrill:No. And so then Peter encouraged the LLM to apply. The LLM is very self-aware about it being a poor fit for Oxide. It's a very weird timeline that we live in. That's all I gotta say.
Bryan Cantrill:Sorry, Dave. So Dave, do you wanna talk a little bit about kind of the genesis of this talk? And I'll kinda look to you in terms of how much you wanna talk about Update itself versus the experiences. The experiences of Update are just extraordinary, and we gotta talk about those.
Dave Pacheco:I'm gonna leave it to you to figure out how much I should talk about Update versus the experiences part. But I can try to summarize that part and talk a little bit about where this talk came from, if that makes sense.
Bryan Cantrill:Yeah. Go for it.
Dave Pacheco:So, let's see. How much context to give here? Update has been an important company priority for kind of a while. I joined this project about two years ago, which I think is a little bit misleading because I haven't been working on Update the whole time. There's been a bunch of stuff on the road to Update that I talk about a little bit in the presentation, but that is when it was basically like, I wanna really help lead the charge on Update.
Dave Pacheco:That was two years ago. So it's been a while. And it having been a while, I gave an update last year at OxCon to be like, okay, here's what we've done, and we've got a long ways ahead. And I kinda felt like, just for accountability, I should give another update that's like, you know, we are pretty close to the end here for our big milestone, and I feel like we should talk about that.
Dave Pacheco:But I honestly wasn't sure. Right? And I I kinda talked to you about this, like, the week before being like, is this worth having a talk about? Right? Yeah.
Dave Pacheco:And then, you you know, we talked about it and talked me into it.
Bryan Cantrill:And I was gonna say, I don't recall you talking me into it. I recall me talking you into it a little bit. I didn't wanna be a burden on you. I mean, obviously, you've got so much to go do. I also just wanted to be wary of... you know, we obviously wanna be looking forward and talking about what our priorities are going forward, but I didn't wanna neglect the past year's priorities, because part of the reason that update doesn't need to be as prominent a priority in the next year is because of the terrific work you've already done.
Bryan Cantrill:So I did wanna to capture some of that.
Dave Pacheco:Yeah. And and I think that was that was good. And then separately, I had been thinking for a couple of months, I think, about some other stuff that was a lot more ill defined at that point. And I talked with Adam a bunch about this. Adam, feel free to jump in here as you remember or wanna describe any of these conversations.
Dave Pacheco:But for me, it was like, I knew that there was a bunch of stuff. It was very vague, and so this will be vague. There's a bunch of stuff here that I have kind of tried to figure out, but felt like I was going out on a limb a little bit, around focus and prioritization and other things around leading this team on this project, that I wanted to talk about and raise some awareness of, for a bunch of reasons. Some good, maybe some less good. Some were like, I just wanna make sure everyone knows that I'm doing some of this stuff, right? This is okay.
Dave Pacheco:Right? And hopefully, these aren't bad decisions I'm making or things like that. And some of it was like, I'm spending a bunch of time asking people... we'll get to this later, but one of the things I talk about is that I'm asking people a lot: is doing this other thing, this small but important problem, more important than taking the next step on upgrade?
Dave Pacheco:And I don't know. When I'm asking people that a lot, part of me is, like, gaslighting myself about it. Like, am I wrong about asking this all the time? Do you know what I mean? And so the idea was to try to raise some awareness of the decision making process that I'm going through and socialize that a little bit, and help other people to be aware of the same process, not necessarily making the exact same decisions, but just kind of make that more of an understood thing and not like, Dave is driving by all these things and asking these annoying questions all the time sort of thing.
Adam Leventhal:Yeah. Dave, I think in this organization that we have, which really is almost wholly devoid of structure, what you're describing is that it's uncomfortable sometimes to be the one trying to create coordination, because there's no explicit charter to do so, and it can make you feel like you're kind of out over your skis when you're trying to drive and coordinate other folks' work. So I think talking about that was really important. It seemed like an important thing that people reacted to, seeing that coordination work that happens behind the scenes that isn't as upfront as, say, some of the demos that we see on Fridays, which we've talked about a bunch on the podcast. But I just wanted to back up for a second, because I realized we need some context around update: it's not like our customers can't update their software today.
Adam Leventhal:And I wanna make sure people don't infer that, like, we shipped one version of the software two years ago, and that's been it.
Bryan Cantrill:Yeah. Well, if you want new software, we'll ship you a new rack. I mean, it's fine. Also, we're a hardware company. Can I get a two minute rebuttal on the structurelessness?
Bryan Cantrill:Am I afforded that? Yeah. Go for it. Yeah. Yeah.
Bryan Cantrill:Would you want would you want to send your mark? Okay. Because I I think it gives people an idea that we're like a Montessori school for systems people. And it's like, oh, hey. Like, everyone come in and just do whatever you want.
Bryan Cantrill:And like, oh, like, do you wanna eat the Play Doh today? Like, okay. Eat the Play Doh. And I feel like that's not really accurate. Or that's really not the intent.
Bryan Cantrill:Let me put this way. If you see someone eating See something, say something. Right. See something, say something. I'm just saying.
Bryan Cantrill:Like, if you see people eating Play Doh, at least ask the question. I don't know. I mean, in some cases, it makes sense, I guess. But we really wanna be clear about what our priorities are. And then it's very important to be kind of bonded by mutual trust.
Bryan Cantrill:That I think is extremely important. And we've gotta have trust and clarity. We talked about this in our founder mode episode, about the importance of trust and clarity. And if you've got trust and clarity, then I think you can have autonomy. And I do think the autonomy is important.
Bryan Cantrill:I think autonomy is important because I think that's when people do the best work. But I think there is a tension there because if autonomy is like, eat the Play Doh, that actually is a problem. Like, we've got things that we absolutely need to get done. And so I actually think that people can be a little bit... because I think, you know, we've got lots of folks that give us that structure. Right?
Bryan Cantrill:That kind of the Megas and the Angelos and folks doing... I mean, you know, we've got lots of things that are helping to focus us. But it's true that it's not traditional kind of command and control. But anyway, that is my... how am I doing for time? My rebuttal. I will cede the rest of my time.
Adam Leventhal:If you wanna try again, I'll give you another two minutes. No, I thought that was pretty good. But I would just say that, yes, it is not a Montessori school and the Play Doh stores are safe, but it does require a certain curiosity to go figure out what the next thing to go do is.
Adam Leventhal:And not to get ahead of ourselves too far, and, Dave, I would love you to get into it because you did a great job in the talk, and I know, Dave, you and I have talked about this a bunch, but sometimes it is easy to do the next technically hard thing that is obviously worth doing without considering the opportunity cost. And I think that with our particular structure, people can feel like they're doing the right thing, and they are doing a right thing, without being sufficiently critical about what else there could be.
Bryan Cantrill:Yeah. Absolutely. Absolutely. And this is because ultimately we're leaving it up to us, ourselves, collectively, to kinda make that decision. And I mean, you can also cripple yourself with that decision.
Bryan Cantrill:You're like, oh my god. Like, should I... you know? And I loved, Dave, the concrete example that you used with this, the Wireshark support for MGS. Right? I think that was the concrete example.
Dave Pacheco:Yeah. Yeah. So yes. I was looking for examples of these, and I was like, man, I don't wanna throw anyone under the bus. And John was like, please throw me under the bus.
Dave Pacheco:Take this. And so this was the example that I pulled out from John.
Bryan Cantrill:And I thought it's a really interesting example because it's a very vivid example of what you're talking about, Adam, where you've got something that is good, that we want to encourage, but we also have got this other overriding priority. And how do we balance those? And to me, there is legitimate ambiguity here. Because so many of those times, I mean, I tell people, like, stop and build the tooling that you need to debug a problem, because that tooling is gonna be valuable later. We've seen that over and over and over again.
Bryan Cantrill:But if you stop and build the tooling, like, every time, like, we won't ship the thing.
Adam Leventhal:All you have is tooling.
Bryan Cantrill:All you have is tooling, which okay. Like, you know, I'll wear that criticism, I guess. It's like, how many debuggers do you have? I'm like, a couple. Okay.
Bryan Cantrill:Like a bunch maybe. Are you but, know
Adam Leventhal:No. No. I agree. And I agree also that other organizations that I've been part of err too much on the side of, like, do the feature, not the debugger, as just sort of a reductive example. And I also agree that I have found it very liberating and helpful to be in a culture where when you ask anyone, hey.
Adam Leventhal:Should I go build this debugging tool? Should I go do this infrastructure work? You get a lot of support.
Bryan Cantrill:You do, but are we a bad influence on one another in that regard? Are we kind of like, hey, I'm thinking about partying tonight, and we're all like, yeah, let's party?
Bryan Cantrill:Let's write a debugger, man. Like... No. No. We would be better. No.
Bryan Cantrill:No.
Adam Leventhal:I don't I don't think so.
Bryan Cantrill:Better than
Adam Leventhal:that. I think that folks who are asking the question, should I do it? Yeah. It's already the like, the answer is already yes. Like, the fact that you are being
Bryan Cantrill:Yeah.
Adam Leventhal:Yeah. Yeah. Considering the alternatives means that you've probably given some very serious thought to it, and you're open to finding a no. I think when we have encountered problems like this, it's more of the autopilot. Right?
Adam Leventhal:I see an edge case that is keeping me up at night. And if I asked Dave, he would say, well, we would rather ship the thing, and then we'll deal with the edge case later. But I think it's the autopilot decision making rather than that more curiosity-based decision making.
Bryan Cantrill:Totally. Totally. And I also think it's tough. There's not a pat answer. There's not an easy answer. It's a question that I feel we are all constantly asking ourselves, and we should.
Bryan Cantrill:Like people should be asking themselves that question. Yes. But you definitely can't spend all day asking yourself that question. No. Which is just another challenge.
Adam Leventhal:And the phases of development are different. Right? Like, five years ago when we started, whether it was like, should we get instances to boot or the computer to boot? It didn't actually matter how we prioritized those things. There were so many things, and all of them needed to happen.
Adam Leventhal:Like Yeah. We needed storage. We needed servers. Like, we needed everything. Like, we needed a control plane.
Adam Leventhal:Like, don't don't waste a moment ranking those top 50 priorities. Just do any of them. I think as you then are in market, then, you know, then the need to prioritize becomes more acute.
Bryan Cantrill:Totally. And, Dave, I loved the way you talked about circular dependencies versus not having dependencies at all. Could you elaborate on that? What's that? Yeah.
Bryan Cantrill:I think that was really yeah. Yeah. Yeah. Yeah. Yeah.
Dave Pacheco:Yeah. I'm wondering if it makes sense to I think it let me talk a little bit about the about what this project is because we've kind
Bryan Cantrill:of Yeah. Sure.
Dave Pacheco:Said what it is with.
Adam Leventhal:Yeah. It's like this
Bryan Cantrill:For this not being a Montessori school, you guys are, like, mowing down on Play Doh and fighting over who eats the next batch of Play Doh. Sorry. Please go ahead.
Dave Pacheco:No. But I wanna come back to that exact one, though, so don't let me forget that. But okay. So, yes, we do have support for update of our systems today. And the way it basically works is that we turn off the control plane, which is basically all the software except for, like, the operating system and the service processors and stuff.
Dave Pacheco:And we replace all the software on disk, and then we start it all back up again. And this is very reliable, which is one of the reasons we built it first: we knew we needed something that would be totally reliable even if the rest of the control plane was just totally hosed. But it is a pretty big impact for customers, since their whole rack basically goes down for, like, an hour or two while this happens. And it's not self-service, because the process for doing this has a lot of sharp edges, and it's not the customer experience we want, to be sort of performing surgery on your rack. We wanted something that was much more automated and a lot simpler, where you're not thinking about all the implementation details, but rather just saying, please update to this release, and here's this giant blob of the software in that release, and just go.
Dave Pacheco:That's where we were trying to get to. This is really complicated because there's a ton of software to update. We talked a little bit about this: there's service processors, there's roots of trust, there's a bootloader for the root of trust, there's the host operating system, which comes in two pieces, and then there's maybe five or 10 different control plane components, one of which is deployed for every disk in the system. So there are literally hundreds of things that we need to update as part of this. And the whole system has to operate autonomously, which I think is the big constraint that differs from probably a lot of other systems like this.
Dave Pacheco:I don't have a lot of recent knowledge of how other clouds do this, like AWS and Google and stuff. But in the past, I know they've relied pretty heavily on on people, on paging people when things go wrong. And Yeah. And I do know from from other folks at other companies that they do sometimes have outages that result from, like, components being updated in the wrong order and someone missed the dependency that was important. And
Bryan Cantrill:And, Dave, we had a former colleague who's from AWS who called this meat in the loop, which I always thought was a very, very vivid kind of description of it.
Dave Pacheco:Right. Yeah. Absolutely. And so this thing has to be able to operate autonomously. So if you're updating 500 components, we have, like, 500 intermediate states there where some set of the components are running new software and some are running old software, and every one of those intermediate states also needs to work.
Dave Pacheco:And we can't really take, like, a page. Right? Because these customers are running the software on prem. In the extreme cases, there's an air gap, and this thing is in a facility that there's no way we're gonna get access to even over a network, let alone physically. Hopefully, most customers are not quite that extreme, but they don't want Oxide support involved because this update went sideways.
Dave Pacheco:They want that to be something that they can handle on their own. And so that means that it can't ever be making crazy decisions. You know? I read a lot of the incident postmortems from various services and, like, not a knock on those services. These things happen.
Dave Pacheco:But there are all kinds of cases where the automation makes a bad decision and makes a bad day worse. And a lot of what we spent time on was trying to keep that from happening.
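The point above about intermediate states and update ordering can be made concrete with a small sketch. This is only an illustration under stated assumptions, not how the Oxide update system works: the component names (`db`, `api`, `ui`), the `works` compatibility rule, and both helper functions are hypothetical. It walks an update order one component at a time and reports the first mixed-version state that violates the rule.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical model: each component is either on the "old" or "new" release.
Versions = Dict[str, str]


def intermediate_states(order: List[str]) -> List[Versions]:
    """Return every state the system passes through when the components
    in `order` are updated one at a time, including start and end."""
    state = {name: "old" for name in order}
    states = [dict(state)]
    for name in order:
        state[name] = "new"
        states.append(dict(state))
    return states


def first_broken_state(
    order: List[str], works: Callable[[Versions], bool]
) -> Optional[Versions]:
    """Return the first state along `order` that does not work, or None."""
    for state in intermediate_states(order):
        if not works(state):
            return state
    return None


def works(state: Versions) -> bool:
    # Assumed rule, for illustration only: a new "api" cannot talk to an old "db".
    return not (state["api"] == "new" and state["db"] == "old")


good_order = ["db", "api", "ui"]
bad_order = ["api", "db", "ui"]
```

With these assumptions, `first_broken_state(good_order, works)` finds nothing, while `first_broken_state(bad_order, works)` returns the state where `api` is new and `db` is still old. With hundreds of components, a check like this has to hold for every state along the way, which is the constraint being described.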
Adam Leventhal:Is that not, like, a theme in all of those horrific postmortems? Or maybe it's just the ones that you forward to me, Dave. But I feel like in order for some accident to truly become horrific, it's always because some automation has responded by, like, you know, everyone shooting themselves in the head simultaneously because they thought that was an appropriate restorative action.
Bryan Cantrill:Yeah. I mean, this is what I called semi-automated systems. To reference a previous episode, on the Joyent data center outage, when I gave the talk on debugging in production, I talked about the peril of semi-automated systems.
Bryan Cantrill:Because, I mean, this is true for a lot of accidents in a lot of fields where you're kinda half automated. You still have the human in the loop, but now they've got a kind of force multiplier of damage in this kind of automation. And things can go really far awry because of the semi-automation. So yeah. Absolutely.
Dave Pacheco:Yeah. So that's sort of why it's a hard problem, I think, for us, in terms of what we've spent all this time doing. A lot of it was related work in the area of upgrade that's not exactly upgrade. So for example, a lot of stuff around what I call dynamic reconfiguration: having all those different control plane services being able to deal with their dependencies coming and going, which they didn't have to do in the MVP.
Dave Pacheco:They don't have to in our shipping software today because these things don't come and go dynamically at runtime. This part is not hard, but it's, you know, a hundred tiny projects. It's: when we are collecting metrics, we need to stop collecting from things that don't exist anymore, because that IP might be reused, and that might be something else now. It's not just, we should do this for cleanliness. It's like, that's actually pretty important.
Dave Pacheco:It's like, that's actually pretty important. So there's, like, a lot of stuff there. And that
Bryan Cantrill:Go ahead. This reminds me of the... Adam, do you remember that Andy Tuckerism? It's not hard, but it's tricky. And we would talk about the difference between tricky and hard. Yes.
Bryan Cantrill:And I feel like, Dave, what you're describing is that it's tricky, because all these components can interact with one another in ways that become kind of emergent. And it's definitely not easy. When you say it's not hard, the opposite of hard in this case is not easy.
Dave Pacheco:That's fair. Yeah. So there are a bunch of ways that we try to sort of tame the automation complexity and stuff like this. The biggest one, I'd say, is using this plan-execute pattern that I talk about a lot in the talk. I don't wanna, like, repeat the talk, but I do wanna mention a couple of the pieces because they are things that we spent a lot of time on.
Dave Pacheco:The plan-execute pattern, that's a thing that already exists out there, which is basically: instead of having automation that just goes and does things, you separate it into two pieces. One is a planner that generates a plan. That's like, here's the current state of the system. Here's what I want to be true. I'm gonna generate a plan to get there.
Dave Pacheco:And then there's a separate thing that goes and executes a plan. And that sounds kind of obvious or maybe pointless, but it's incredibly useful for a bunch of things. Like testability, you can test all kinds of permutations of the planner without having to go execute them. You can also go test a whole bunch of permutations of plans, like executing plans, without having to have gone through the whole set of things needed to get a real system into that state, which could be really complicated.
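A minimal sketch of the planner/executor split described above, assuming a toy model where the system is just a mapping from component name to version. The names `Step`, `plan`, and `execute` are hypothetical and are not taken from the Oxide code base; the point is only that the planner is a pure function and the executor is the single place that changes anything.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy system model (an assumption for illustration): component -> version.
State = Dict[str, str]


@dataclass(frozen=True)
class Step:
    component: str
    target_version: str


def plan(current: State, desired: State) -> List[Step]:
    """Planner: compares current and desired state and returns the steps
    needed to get there. It has no side effects, so it can be tested on
    any combination of inputs without touching a real system."""
    return [
        Step(name, version)
        for name, version in sorted(desired.items())
        if current.get(name) != version
    ]


def execute(steps: List[Step], system: State) -> None:
    """Executor: the only code that mutates the system. It can be tested
    with hand-written plans, without running the planner first."""
    for step in steps:
        system[step.component] = step.target_version


current = {"api": "v1", "db": "v1"}
desired = {"api": "v2", "db": "v1"}
steps = plan(current, desired)   # inspect the plan before anything runs
system = dict(current)
execute(steps, system)           # now the change is actually applied
```

Because `plan` returns data rather than acting, the plan can be logged, reviewed, or compared against expectations before `execute` is ever called, which is where the testability benefit comes from.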
Adam Leventhal:It seems like from comprehensibility and debugging after the fact too, it's been incredible.
Dave Pacheco:Totally. Yes. Yes. And it gives us some escape hatches too in terms of operability. First of all, if the thing's doing something terrible, you can just pause the executor. Or you could pause the planner too.
Dave Pacheco:You can just say stop doing anything. And that's like a discrete thing that you can do. You can also ask it, like, what are you gonna do next, and why are you doing that thing? Which is kind of useful. We also have the ability to get the latest state out of the system and run it through some local tooling that we have.
Dave Pacheco:We use this a lot in development. And we can generate a new plan, which we call blueprints, by the way, so I'll probably use that word, and then upload that blueprint back into the system and then go run that. We've never had to use that, but that's always been a sort of worst-case escape hatch in my mind, which is: if the planner is doing something truly off the rails, our support team can go construct the plan that we want to be true and then put that in the system and have the system go execute that. And also, the fact that the planner only takes one step at a time means it's less likely for the automation to truly go off the rails, if that makes any sense.
Dave Pacheco:Because it's only ever doing one thing, it's based on its input. And it allows us to test so much that I mean, I don't want to say this out loud and tempt fate, but
Bryan Cantrill:Yeah. You could just like Don't worry
Dave Pacheco:that it's going to take 100 steps in the wrong direction. That's not I'll say this. That's not the failure mode we've seen in development a whole lot. It's like it suddenly decides to delete all the cockroach nodes. That's not a thing that it's gonna do that it has done.
Adam Leventhal:That's a good growth mindset. It's not a thing it's done yet.
Bryan Cantrill:Right. Or I just like I like the gods who are listening, like, tempted, punished, no punished. No. There's just enough humility here.
Adam Leventhal:Sorry. The the gods don't listen to the podcast.
Dave Pacheco:That's
Bryan Cantrill:the I appreciate the deal.
Dave Pacheco:That is the kind of thing you'd be worried about, though. Right? If you didn't have this and you have some, like, health checking, and something's like, oh, I can't talk to this cockroach node, I guess I'm gonna remove it. Actually, you couldn't talk to it because of a network problem, and now you've independently decided that, or three control plane instances have independently decided that, and now we don't have a cockroach cluster anymore.
Dave Pacheco:Like, that problem is structurally much harder to have, at the very least because the planning is strongly consistent, it's based on the input, and it has all these constraints in it. Like, I'm never gonna take down a cockroach node when the cluster itself does not appear to be healthy. So it's just much less likely for these things to go off the rails. All that said, the reason I'm talking about all this is it's a lot of work. There was a lot of foundational work that went into this that we spent a lot of that first year doing and then continuing to evolve in that second year.
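The constraint mentioned above (never take down a node while the cluster looks unhealthy) combined with planning one step at a time might look roughly like the following. This is a hedged sketch, not the real planner: `Node`, `next_step`, and the node names are hypothetical, and the health flag stands in for whatever input the real system collects.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Node:
    version: str
    healthy: bool


def next_step(cluster: Dict[str, Node], target: str) -> Optional[str]:
    """Return the name of the single node to update next, or None.

    Assumed constraint for illustration: if any node in the cluster
    looks unhealthy, plan nothing at all rather than take another node
    down. At most one node is ever chosen per planning pass."""
    if not all(node.healthy for node in cluster.values()):
        return None
    for name in sorted(cluster):
        if cluster[name].version != target:
            return name
    return None


healthy_cluster = {
    "crdb-1": Node("v1", True),
    "crdb-2": Node("v1", True),
    "crdb-3": Node("v1", True),
}
degraded_cluster = dict(healthy_cluster, **{"crdb-2": Node("v1", False)})
```

Here `next_step(healthy_cluster, "v2")` picks exactly one node, and `next_step(degraded_cluster, "v2")` returns `None`, so a single unreachable node causes the planner to wait instead of compounding the problem.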
Dave Pacheco:And that's a big part
Bryan Cantrill:of And I gotta say I love when you were talking about the work you've done in that second year. I loved the distinction between planned work done and unplanned work done.
Dave Pacheco:Yeah.
Bryan Cantrill:Which I think is actually a very helpful way of looking backwards, to be like, okay, what is the stuff that we did that we planned to do? And what is the stuff that we actually had to do that we were surprised by? Which I feel in software happens all the time. I feel like there's so much stuff. You're like, oh, right.
Bryan Cantrill:How could we have not thought about that? Right. That's like not easy. I mean, I feel like this happened to you, Dave, you and the team, over and over and over again. We'd be like, why is this wall warm?
Bryan Cantrill:It's like, oh, because it's filled with bees. Okay. Alright. Well, that's all has to be ripped out now, obviously. Okay.
Bryan Cantrill:Well, now we've got the bee problem. And I feel like there were there were a bunch of those that happened along the way.
Dave Pacheco:Yeah. There were. And so some of those are the things that you would expect. This like, work on the foundation, like the one of them that I can't remember if I talked about it at length in the talk, but, like, this the system that we built to do all this is called reconfigurator because it's doing this basically dynamic reconfiguration. And it it basically is the source of truth about all kinds of state about the system, like what sleds are currently in service.
Dave Pacheco:There's a lot of other parts of the control plane that need to know that kind of thing, but they don't need to know all the inside baseball details about, like, well, this sled is technically still in service, but it has been marked for decommissioning, so it shouldn't be used or something like that. And so there was a bunch you know, rendezvous tables is something I mentioned there where it's like, this was something that we realized along the way that was like, we need to make sure that our work is decoupled from all the other work on the control plane so that not everyone else at Oxide needs to be thinking about the inside baseball details of all this. We need a way we need an abstraction here for this subsystem to talk to the rest of the world in this sort of consumer way. And that was, like, I think, expectable unplanned work, maybe. You don't know ahead of time exactly what you're going to need, but you know that you're gonna need to evolve some of these abstractions.
Dave Pacheco:Right?
Adam Leventhal:Indeed, are rendezvous tables, like, a thing? I I looked for them briefly. I'd only heard them in the context of update. But is that a pattern that you've used elsewhere?
Dave Pacheco:We we, as far as I know, coined that term in the context of reconfigurator for this specific pattern. It's not anything it's not anything hard, maybe tricky. It it basically is just the idea that reconfigurator is going to serialize a simplified version of its state to the database and manage that through the same reconciler pattern that it's using for everything else. And then everything else is just gonna consume that state instead of trying to consume the, like, internal state. Does that make any sense?
Bryan Cantrill:Yes. Totally.
Dave Pacheco:Okay.
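[Editor's sketch] The rendezvous table idea as described above (write a simplified view of internal state to the database, keep it up to date with a reconciler, and let everyone else read only that view) could look roughly like this. It is an illustration under assumptions, not Reconfigurator's code: the `usable_sled` table, the state names, and the helper functions are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical internal states; only the owning subsystem understands these.
INTERNAL_STATE = {
    "sled-a": "in_service",
    "sled-b": "in_service_marked_for_decommission",
    "sled-c": "expunged",
}


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # The rendezvous table: a deliberately simple view for consumers.
    db.execute("CREATE TABLE usable_sled (sled_id TEXT PRIMARY KEY)")
    return db


def usable_sleds(internal: dict) -> set:
    # The "inside baseball" policy lives here and nowhere else.
    return {sled for sled, state in internal.items() if state == "in_service"}


def reconcile(db: sqlite3.Connection, internal: dict) -> None:
    """Make the table match the desired simplified state (idempotent)."""
    desired = usable_sleds(internal)
    current = {row[0] for row in db.execute("SELECT sled_id FROM usable_sled")}
    for sled in desired - current:
        db.execute("INSERT INTO usable_sled (sled_id) VALUES (?)", (sled,))
    for sled in current - desired:
        db.execute("DELETE FROM usable_sled WHERE sled_id = ?", (sled,))
    db.commit()


def read_usable(db: sqlite3.Connection) -> list:
    # What a consumer elsewhere in the control plane would run.
    rows = db.execute("SELECT sled_id FROM usable_sled ORDER BY sled_id")
    return [row[0] for row in rows]
```

Consumers only ever call something like `read_usable`; they never need to know that a sled can be technically in service yet marked for decommissioning.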
Bryan Cantrill:Yeah. So it's Yeah.
Adam Leventhal:The concept made sense. I just didn't know if this was a pattern that you I mean, the rendezvous term just made me, you know, go off to Wikipedia and find a bunch of unrelated articles.
Bryan Cantrill:I mean, admittedly, Dave does say it with such confidence that one assumes that, like, rendezvous tables. Of course. I was gonna suggest that we use those.
Adam Leventhal:Yeah. Yeah. I missed that day in my databases class.
Bryan Cantrill:Day, mate. Exactly.
Dave Pacheco:Yeah. No. We definitely decided it was a thing, and I don't remember where the term came from. I don't know if anybody on this remembers that.
Bryan Cantrill:I do feel that if needed, you could have ChatGPT or an LLM write a nineteen seventies database paper on rendezvous tables. And then, like, you'd have to typeset it with, like, a typewriter. You know what I mean? And then, like, no. Like, this it's a Jim Gray thing
Adam Leventhal:on Rotten Tomatoes. No. I think I I think I saw it referenced in someone's materials they submitted today.
Bryan Cantrill:That's right.
Adam Leventhal:Dave, I I believe there was another category of work that you called out, which was planned deferred work or something along those lines, which I thought was maybe the most important category of all.
Dave Pacheco:You mean the important non blockers? What what I've
Adam Leventhal:been calling? Yeah. Exactly.
Dave Pacheco:Yeah.
Bryan Cantrill:Yeah. Yeah. I love that.
Dave Pacheco:Yeah. So these are the things that I am not comfortable with us not fixing for an indefinite period, but they are not, strictly speaking, blockers, if that makes sense. And so the canonical example for me is that there's a bunch of cleanup stuff that we don't do. So if you update a system, we generate, like, 100 blueprints, each of which is, like, a bunch of rows in a bunch of tables in the database, and we never delete any of those things. So that's a problem.
Dave Pacheco:Right? At some point, we don't know when. I mean, we know it's gonna be that soon, but we don't know exactly when. At some point, we are going to have a database that's too big, and we're gonna have some problem because the database is too big. And we need to fix this problem before then.
Dave Pacheco:It's, like, 100% certain that this is gonna be a problem. Right? It's just what's not certain is when.
Adam Leventhal:Right. A thing growing without bound will at
Dave Pacheco:some point
Adam Leventhal:exhaust the space available.
Dave Pacheco:Isn't it just such an obvious example of, like, well, not a blocker because, like, you can do an update without
Adam Leventhal:It's called the Pacheco principle. You haven't heard about this, Bryan?
Bryan Cantrill:Go on.
Adam Leventhal:That's it. I just stated it concisely. You hadn't heard of it?
Bryan Cantrill:There you go. Well, you know, they discussed this the same day they were going over rendezvous tables in my databases course,
Adam Leventhal:you know.
Bryan Cantrill:Yeah. Well, this is very, very important, this kind of idea of important non blockers. I also feel that, like, every diphthong there is load bearing, in terms of, none of this stuff is unimportant. This stuff is all important. But also, if we allow it to actually block the deployment of the system, we will never ship it.
Adam Leventhal:It is genius branding. It is genius branding, Dave, because when you are asking someone, you know, is this issue that we are discussing, is it an important non blocker, or is it the work that you need to do right now? It's like, is this issue ten years old, or is it an honored elder? You know, it is letting people let go of the thing that they wanna work on urgently, but by acknowledging its relative priority.
Dave Pacheco:Yeah. I think that's an important point. And part of letting it go is also being able to put it in a bucket that you know isn't gonna be a black hole.
Adam Leventhal:Right.
Dave Pacheco:Right? And I do still track this list. Like, I look at this list every week at least, and we do plan to burn it down in the next quarter. And that's important because I think, certainly for myself and I think talking to other people too, when you feel like it's a black hole, like, I've just put some label on it that a thousand other issues of varying importance also have, it doesn't really help you let go of it, because you're like, I'm still worried about this. You need a credible plan on the other side to be picking those things up. But, also, that's not the more important part right now.
Dave Pacheco:The more important part is being able to put things in the bucket.
Bryan Cantrill:Yeah. Well, and I like, I mean, Sean posted that rubric in the chat that you use of, you know, do we need this now? What happens if we didn't do this now? This thing, whatever it is. And then also, and you touched on this in the talk, Dave, as well.
Bryan Cantrill:Kind of an important axis by which you're evaluating these things is, what are the consequences of fixing this later? Is this fixable later? You know, what would be the consequences if this were to go awry? I mean, you had an example of, like, if we had total database corruption, well, that'd be a really serious problem. That's something we would probably wanna stop.
Bryan Cantrill:That's not an important non blocker. That's an important blocker. Right. And you'd be kind of constantly sorting through things using that kind of rubric.
Dave Pacheco:Yeah. Yes. Absolutely. And that's been a lot of what we've been doing over the last several months as more and more of the sort of functional pieces were done. There's a lot more of the, like, okay.
Dave Pacheco:Well, there's some edge case here. How important is it for us to flesh this out? And I think what we came to, and I'm not, like, 100% sure of this, but I think it's a good place, is that we basically do all the analysis to understand the problem, characterize it, and figure out what we would do to fix it. What you really want to know is, does this have architectural impact? Does fixing this have architectural impact?
Dave Pacheco:And in order to do that, you really have to get a little bit far down the path of solving it. You don't have to, like, completely solve it, but you need to at least convince yourself we have enough of the pieces in place. And and it's usually, for us, it's like, we have enough information in the blueprint or in our inventory system, or we could have enough information in our inventory system to make this kind of transformation, and this is what the planner would do, and this is what the executor would do. It's like, okay. This should probably work.
Dave Pacheco:And so now we can choose not to do it, because the impact is not high enough that we need to do it right now. And it's not something that is gonna totally change the architecture of this.
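[Editor's sketch] The questions raised in this part of the conversation (is it fixable later, would fixing it change the architecture, and is the consequence worse than today's baseline of a scheduled support call) can be written down as a small triage function. This is one reading of the discussion, not an official Oxide rubric; the field names and the rule that any single "yes" makes something a blocker are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    BLOCKER = "important blocker"
    NON_BLOCKER = "important non blocker"


@dataclass(frozen=True)
class Issue:
    """Hypothetical record of the answers gathered during analysis."""
    title: str
    fixable_later: bool             # can we still fix this after shipping?
    fix_changes_architecture: bool  # would the fix reshape the design?
    worse_than_support_call: bool   # worse than today's manual process?


def triage(issue: Issue) -> Triage:
    # Assumed rule for this sketch: any one of these makes it a blocker.
    if not issue.fixable_later:
        return Triage.BLOCKER
    if issue.fix_changes_architecture:
        return Triage.BLOCKER
    if issue.worse_than_support_call:
        return Triage.BLOCKER
    # Everything else is still important and still tracked, just deferred.
    return Triage.NON_BLOCKER


corruption = Issue("total database corruption", False, False, True)
cleanup = Issue("old blueprints are never deleted", True, False, False)
```

Here `triage(corruption)` comes out as a blocker and `triage(cleanup)` as an important non blocker, matching the two examples given in the conversation.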
Adam Leventhal:Yeah. And the criteria you used seem like they can slide over time. You know, in the talk, I thought you did a great job of describing how today update is like a scheduled call with support with downtime. And so if there's a problem, is the consequence worse than that? Because if it's not, then most customers are gonna want, like, the self-service nondisruptive update.
Adam Leventhal:And if something goes wrong, then in the worst case scenario, they're back on a call with support where support is now, you know, fixing it. Yeah.
Dave Pacheco:Yeah. That's a pretty good way to put it. I think that's more concise than I said it. But this has been a little bit hard. This is one of those things where I'm like, I feel weird saying this out loud.
Dave Pacheco:But I think a lot about, like, it did take us two years to do this. And if you think about quality, features, schedule, where features really is, like, scope, we felt for a really long time, like, there's no room to cut scope on upgrade. Because if we're not updating the host OS and the SP and the RoT, all these things, then you're still doing an update, and you haven't solved the self-service problem
Adam Leventhal:at all. You don't have bingo, then you're you're still for some components, you're still updating it using, you know, downtime support, the whole thing. Yeah.
Dave Pacheco:Right. But in retrospect, I think, what I like what I came to just a few months ago, I think, was actually the the area for cutting scope is really in the set of operational conditions the system can handle on its own. And that feels uncomfortable for me to say out loud because I feel like I'm saying, let's be less rigorous. Right? It's like, let's be less focused on quality.
Dave Pacheco:Let's have all these edge cases that we're not handling, which is really what I'm arguing for. I'm arguing, like, here's an edge case. The system's not gonna do the right thing. I think we should make that an important non blocker and not a blocker. And it's, you know, working on the marketing of it, it's like, I don't think it's actually cutting back on rigor.
Dave Pacheco:It's acknowledging that the system has its operational limits, whatever they are, and we're expanding that set. And it doesn't have to go from always a support call to never a support call.
Adam Leventhal:That's right, Dave. Because also in the operational conditions that you'll be able to handle autonomously, like, it will never be perfect. There will always be some situation that was perhaps unanticipated or felt, like, remote or just we weren't creative enough to think of, and therefore it wouldn't be handled in the field. So, you know, I think if you kind of start with that, if you anchor on that, then you say, okay. How can we incrementally just get better each time?
Dave Pacheco:Exactly.
Bryan Cantrill:And, you know, in the words of the late great Roger Faulkner, Adam, I'm here to make it better, not perfect. And I do think that, and Dave, I think, you know, I talked about this after the fact, after the talk. And I don't know that you necessarily did this deliberately, but one of the things you really talked about implicitly is how to avoid second system syndrome, where people feel like, okay, the update train is leaving the station, and this is gonna be the last time to work on updates. So we do have to make it perfect, actually, because this is our last opportunity.
Bryan Cantrill:And this is where second systems become bogged down in taking on so much functionality and so much scope that they actually never ship. And that's what you're talking about. And, you know, I used to wonder, like, how real is that? But now I feel like I've seen that enough times. Like, there's definitely a very real phenomenon there.
Bryan Cantrill:And you really need to be aggressive about how you kinda cut that scope. And then I think the and but also thoughtful. Right? I mean, I think you the thing and actually it goes back to kind of Adam, what you're saying at the top about like when people are asking the question about whether doing the tooling is right or not, they've done enough analysis often that the answer is yes because they've been thoughtful about it.
Dave Pacheco:Mhmm.
Bryan Cantrill:And I kinda feel that like when you are on these important non blockers, Dave, you have to go into them in enough depth to be able to really disambiguate whether this is an important non blocker or an important blocker. And in doing that, like, you you almost force yourself to come to the right decision by doing it. Because you're you're not just like chucking stuff out or ignoring, you know, ignoring failures that we're actually seeing. It's more that you're saying, no, we've gotta actually be rigorous about how we cut the scope here. Another thing I wanted to ask you is because the date driver was really helpful for you.
Bryan Cantrill:Could you describe that a little bit?
Dave Pacheco:Yeah. So I believe what happened is that, like, a little over a year ago, we said we are telling a customer that we are going to have something that we could ship for our self-service update in about a year. And that customer
Adam Leventhal:But this year
Bryan Cantrill:is out, president Tuck said from the podium.
Dave Pacheco:Right. And I I I distinguish this from we did not create a project plan and estimate how long everything was gonna take and then say, okay. We're gonna ship it at this time. Rather, we said, we're gonna ship it at this time, and it's far enough out that I assume I wasn't actually involved in that, but I assume people were like, that seems far enough that we should be able to have something. Right?
Dave Pacheco:And, you know, people came to me, and they talked about it after that. But, like, I guess what I mean is it didn't come from us having worked through that. But I think it was very helpful. And this is something, Adam, you brought up a couple weeks ago too, which is this sort of thought experiment of, like, what could I do if I wanted to ship such and such in N weeks? Like, yeah, I think it's gonna take six weeks, but what could you do in three weeks?
Dave Pacheco:And it I'm very sensitive to the sounding like Homer the manager. I put the, like, Homer Yeah. Thing in the in the deck. But, like, I think it's a really important thought experiment. And then I you know, maybe I shouldn't phrase it that way, but, like, I think you really it is very focusing.
Dave Pacheco:And it helps you say, like, oh, well, what would the bar have to be? The bar would have to be certainly not higher than what our current mupdate process is. Right?
Bryan Cantrill:Well, that was another very important point. When you're like, look, you can't hang a disk firmware upgrade on this, because you can't add scope that literally doesn't exist in the current thing. Which I thought was also very important.
Dave Pacheco:Yeah. It was. Yeah.
Adam Leventhal:Yeah. And what you're talking about, Dave, and Bryan, this idea from Roger of, like, I'm here to make it better, not perfect. Or, you know, Dave, some of the criteria that you're describing, it's sort of liberating. It lets people not feel like they are violating our principle of rigor, but rather being pragmatic, you know, in an appropriate way. But I think it's useful to have that rubric to apply.
Dave Pacheco:Yeah. I think it's really helpful. It's because it helps you figure out what constraints you would put on it if you had to ship it. If you're like, well, it would only be able to handle sunny days. Okay.
Adam Leventhal:It's like And
Dave Pacheco:that's That's a good thing to say.
Adam Leventhal:Right. Today, we can't even handle sunny days. So
Bryan Cantrill:Right. And sunny days do happen. So, you know, it's like, yeah, we're on a we're on a rain delay on this update. Like, so what? You know?
Bryan Cantrill:Like, California.
Dave Pacheco:Come on. Right. They do. And also, like, it might be obvious, but it's important to say, like, that customer is acting more as a partner with us than we are as a vendor to them. So they Yeah.
Bryan Cantrill:Interesting. Yeah.
Dave Pacheco:It is helpful then to them to show them that we have built something that handles sunny days rather than waiting an extra, you know, six months to have something that also handles the stormy days. Do you know what I mean? Like, it actually is helpful for them to see there's progress being made, and it's aligned with what I want this thing to look like.
Adam Leventhal:And it's actually expectation too because, like, to say it's not black and white. Right? It's not to say we can handle sunny days and the future will handle rain of any shape and description. It's like, no. There'll always be rainy days that we didn't anticipate, and it is a process of increasing robustness, you know, with every release.
Dave Pacheco:Yeah. Yeah. Totally.
Bryan Cantrill:That's right. Yeah. I am reminded of the weather system that was so strong that the last recorded gust on the Sierra Crest was 158 miles an hour before the actual, like, apparatus blew off the mountain. So it's like, at some point, there's gonna be a certain level of storm that nothing is gonna be able to withstand. But you can practically
Dave Pacheco:There's no record of a hurricane hitting Springfield. The records only go back to when the hall of records was mysteriously blown away.
Bryan Cantrill:Yes. God, you know, you I I that is a really truly a a delightful one. And I gotta say, like, the bingo punters are really, were really were given a lot of a lot of fodder there. I also feel like we I I had had me generate a bingo card where I talk about an ex Sun employee. I'm like, I'm not gonna do that.
Bryan Cantrill:Wait a minute. Actually, Roger Faulkner did have the BOMO here. So and, Dave, I mean, you did that kind of again and again, where you kind of used the deadline. And you mentioned, in the talk, our unofficial DTrace motto of punting on third down. By which, for those of you who don't follow American football, we meant, like, you punt when you can't get a first down, and you punt on your fourth down, but we actually would punt earlier than that.
Bryan Cantrill:And we, like, actually
Adam Leventhal:you're distinguishing this not just from what I call soccer, but also Canadian football, which only has three downs. So, like
Bryan Cantrill:That's true. And, you know, and I, you know, I think I have actually been to a CFL game, I'd like to say.
Dave Pacheco:Actually Yes.
Bryan Cantrill:I saw the the Ottawa Rough Riders take on the Saskatchewan Rough Riders. One of those is Rough Riders, no space. One of those rough one of those Rough Space Riders.
Adam Leventhal:Yes.
Bryan Cantrill:And the the, like, the final score was sixty one fifty nine. I mean, this is like the most CFL thing ever. It was so great. Yeah. Like, yeah, we're we're CFL on your bingo card.
Bryan Cantrill:Someone's out there with the Rough Riders on their bingo card, screaming bingo into the void. But where are we? Get me out of here. In terms of what we did on DTrace, it was the date driver that we actually used to cut scope. And we did have a long list of things that we wanted to go do after we integrated.
Bryan Cantrill:And then we I mean, fast track minus x. Right, Adam? I mean, when we first integrated DTrace, we had not done pid provider support for x86.
Adam Leventhal:And Yeah. Or userland support generally. Yes.
Dave Pacheco:Yeah. Yeah.
Adam Leventhal:And, I mean, some of that reflects the priorities of Sun in that moment. But yeah. Good example.
Bryan Cantrill:It is I mean, it's hard to believe now, but we didn't have is-enabled probes for SDT probes in the kernel. No.
Adam Leventhal:We haven't even oh, Oh, wait a minute.
Bryan Cantrill:Oh, wait a minute. Oh, wait a minute. I haven't done that yet. Alright. But I would listen.
Bryan Cantrill:My my point is it's on the list. It's on the list.
Adam Leventhal:Is it important non Why is it important? I'm telling you that for
Bryan Cantrill:twenty years. I've been telling you that for two decades, what kind of important non blocker do you not understand? Yeah. I mean clearly we can ship a system without it because we Maybe it's the important part. Maybe that's the first part is that.
Bryan Cantrill:Oh, okay. That's the part. We can agree that it's an unblocker then. I think well, like I'm I'm missing that all wrong. Halfway there.
Bryan Cantrill:Halfway there. Yeah. I'm sorry, Kyle. Kyle has done all sorts of horrific work because we have not done is-enabled probes in the kernel. Sorry.
Bryan Cantrill:Yeah. Have not done is-enabled probes... Yet.
Adam Leventhal:Yes. Growth mindset. Yes.
Bryan Cantrill:Yeah. Sorry. Dave, once again, to slap the Play Doh out of my mouth. This
Dave Pacheco:is this is a little tangential, but it reminds me of another example that I think is kind of is pretty interesting Yeah. Which is early on, one of our big fears about this plan execute thing was that we would have dueling nexus instances. So nexus is our control plane.
Bryan Cantrill:Sorry. I get Dueling Banjos in my head. I can't get it out. Okay. You need to give me a moment.
Bryan Cantrill:Okay. Now I'm ready. Yes. Dueling Nexuses.
Dave Pacheco:Yeah. So, we have so our control the the guts of our control plane is this nexus instance. We talked about it a lot. There are more than one of them because of availability, fault tolerance, all that stuff. Right?
Dave Pacheco:But that means that they can disagree about things. Like, in principle, they could choose to do different things. They could choose to make different plans. And we were really worried at first that you would when we were doing, like, the real fundamental foundations of this blueprint we were like, well, what if they decide different things? Like, they're both like, oh, I need another cockroach zone because I don't have enough, but they decide to put them on two different nodes.
Dave Pacheco:And now you have seven instead of six or whatever, whatever the number is you're trying to do. And that problem, we basically said, that's a good question. I don't know what we're gonna do about that. Let's not worry about that right now. And then that problem basically went away.
Bryan Cantrill:Yeah. Interesting.
Dave Pacheco:It went away because of other developments in the process. One of the things we figured out was that it was important to basically have strong consistency in the planning. So every blueprint has a parent blueprint, which is the one that was the system's target before the one that you're creating. And there has to be this linear sequence of the blueprints. And we needed that for some other reason, and it completely solved that problem.
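[Editor's sketch] One way to picture why the parent-blueprint rule removes the dueling problem: a new blueprint is accepted as the target only if its parent is the current target, so two instances that plan from the same parent cannot both win. The sketch below is an assumption-laden illustration, not Nexus code. `Blueprint`, `TargetStore`, and `try_set_target` are hypothetical names, and the check-then-set is done in memory here, whereas a real system would need it to be atomic in the database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Blueprint:
    id: str
    parent_id: Optional[str]  # the target this plan was derived from
    cockroach_zones: int


class TargetStore:
    """Holds the single current target and the linear history behind it."""

    def __init__(self, initial: Blueprint) -> None:
        self._target = initial
        self.history: List[str] = [initial.id]

    @property
    def target(self) -> Blueprint:
        return self._target

    def try_set_target(self, candidate: Blueprint) -> bool:
        # Accept only a direct child of the current target. Anything planned
        # from an older target is stale and gets rejected.
        if candidate.parent_id != self._target.id:
            return False
        self._target = candidate
        self.history.append(candidate.id)
        return True


# Two hypothetical Nexus instances both see 5 zones and both plan a 6th.
genesis = Blueprint("bp-0", None, 5)
store = TargetStore(genesis)
plan_a = Blueprint("bp-a", "bp-0", 6)
plan_b = Blueprint("bp-b", "bp-0", 6)
first = store.try_set_target(plan_a)
second = store.try_set_target(plan_b)
```

The second attempt is rejected because its parent is no longer the target, so the loser has to re-plan from `bp-a`, where the extra zone already exists, and the system ends up with six zones rather than seven.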
Dave Pacheco:And I'm not sure exactly what the takeaway is, but it's something along the lines of we don't need to have all the deep architectural things figured out before we start moving forward. I think that has been a really important thing. And so like, I mean, maybe even more tangential. I've gotten a lot more comfortable marking RFDs published with open questions and saying, this is the current state. There are some open questions, and we're going to proceed because it feels to all of us like proceeding is the right move right now.
Dave Pacheco:But we're not going to forget about the fact that these questions are still open. And a lot of times, I go back to those RFDs months later, and we've answered all those questions. Actually, I think that's usually true, which is great.
Bryan Cantrill:Yeah. That is well, and I think and this is kind of a because you're also talking about avoiding analysis paralysis
Dave Pacheco:Yes. Which
Bryan Cantrill:I mean, I think that, like, in general, as a company, like, we're we're pretty rigorous thinkers. And I feel like the pathology for us is much more likely to be analysis paralysis than just like shipping garbage. You know? I mean, just like in the if those are kind of two extremes, the one that I think is much more vivid at least for me is like, this is really easy to end up in kind of like a debate society of future states. And, you know, this is something that I I can't remember if we talked about explicitly here or not, but one of the the we had at one point a potential investor asked Steve and me when we had really disagreed on things.
Bryan Cantrill:And it's like, that's kind of interesting. We kinda took it apart, one half, we just have not. We we both really struggled to come up with examples where we'd really disagreed on things. And the but one of the things we did learn about ourselves, like, you know, when when we do disagree, it's because we're in hypotheticals on hypotheticals. And I think this is true more broadly for well meaning folks.
Bryan Cantrill:When well meaning folks are really at loggerheads, I think it's often because you are at hypotheticals on hypotheticals. Because you are now so far in the future that the future state that you each have in your head is just divergent enough that when you take a future state on that future state, it's like now these things are just totally divergent. And actually, we can defer all of that and just get to the future state, and it can become a lot clearer. And I feel that we have seen that a lot. Dave, you know one thing I wanted to go back to?
Bryan Cantrill:Do you remember the sum of all fears?
Dave Pacheco:I do. Yeah.
Bryan Cantrill:When was that?
Dave Pacheco:Well
Bryan Cantrill:Is that
Dave Pacheco:We've done that a few times. Think we may have done that times.
Adam Leventhal:That was, like, right that was, like, before we shipped. Right? Like six
Bryan Cantrill:months ago. Right. Yeah. Way before we shipped. I think it's like in 2021.
Bryan Cantrill:Yeah. I think. I Yeah. Almost wanna go back to that sum of all fears and listen to it. Because I just remember being very afraid after the sum of all fears.
Bryan Cantrill:It's like, what are you expecting? It's what's on the tin. Where we, like, we were gonna get together and talk. Do you remember this? I mean, I can't even
Adam Leventhal:It's a it's a really good exercise for a couple reasons. I think that there I think a lot of folks had, I mean, fears, obviously, but fears that they didn't know how to put voice to. And I think even just getting it out, even if everyone was like, oh, yeah. Now I'm much more afraid that you've said that, actually helped the person who had been bottling that up sort of let it go a little bit. Like, got it.
Adam Leventhal:I am not the only one who's now thinking about this. There are other people thinking about it. When we trip over it, when it becomes more acute, it won't just be on my shoulders to either have done something or explicitly deprioritized it.
Bryan Cantrill:No. I I I cannot tell you what a relief it is that I'm not the only one with a head cold around here anymore. Like, I I I look. Sneezing on the on the six of you has really made me feel
Adam Leventhal:a lot better. Somebody's just just having a group decision to to actively make a decision to defer the work rather than having the decision happen. Because I think that actually keeps me up more. The decisions that are never made but just happen de facto rather than, you know, rather than sort of never shipping or analysis paralysis, I I get more concerned about, like, the decisions that are too hard to make so nobody looks at them, and so the right thing just never makes progress. Which Yeah.
Adam Leventhal:Interesting. Candidly, like, before Dave got involved, and this is not to disparage anyone working on Update before because it's been going on for a while, but I'm not sure, if Dave doesn't get involved and provide that leadership, that we are where we are.
Bryan Cantrill:Well, I think that, I mean, also we were all focused on shipping the rack first. Right? I mean, it's like, to give ourselves credit. And then we were focused on making sure mupdate was robust. And then it's like there was kind of a sequencing here.
Bryan Cantrill:But no, I agree. I mean, that clarity was certainly very important. I do wanna go back, Dave, to that. I think it would be an interesting exercise to go back to the sum of all fears and see how many, because I bet we've resolved a lot of those.
Dave Pacheco:Yeah. My guess is that, like, they're gonna fall into, like, three buckets. Things we've totally resolved, things that we haven't resolved at all that we are still a little bit uncomfortable that we don't do. Like, some fault management, I could imagine being in that bucket.
Bryan Cantrill:Yeah. Yeah. Yeah. For sure.
Dave Pacheco:And and things that we are I'm not saying this is everything, but the things that we're we're working through and we understand a lot better, but aren't yet resolved. That's my guess is that you're gonna have those, all of those
Bryan Cantrill:And I would add another category. Maybe this falls into one of your three. I bet there are gonna be some fears where it's like, that was a fear and that just never became relevant. Yeah. Like that was a fear of something that like I mean, just what you're talking about.
Bryan Cantrill:Like, is something that was resolved over time because of something else that happened. Mhmm. The and I I wonder if there'll be a category of that. One of the things that you talked about, Adam, because you're you're kind of hitting on it as well, in terms of the need to focus on something large. It's very hard because it's you're kind of constantly, you you know, you could be firefighting in the moment.
Bryan Cantrill:It can be very hard to, like, focus on that larger prize. And Dave, you called this organizational procrastination. Yeah. Like rendezvous tables, it sounds very plausible as not being due to you, but that feels like that might be a you. I mean, it's a great term because it's very vivid.
Dave Pacheco:Yeah. I don't think I heard that anywhere else. But what yeah. What I was trying to capture was this this phenomenon. I don't know what to call it.
Dave Pacheco:Where I mean, it's just like individual procrastination, right, where you're like, I have a hard thing I gotta go do, and I'm not really sure what I need to do for that, so I'm gonna go do the dishes instead or something like that. You know what I mean? It's like, or I'm gonna go whatever. Actually, I don't think, like, doom scrolling is a good example even though that is
Bryan Cantrill:a good
Dave Pacheco:canonical example of procrastination. Right. But I don't think that's actually that analogous to what we do. But it's like, I'm gonna go do something else that is obviously a useful thing to do instead of the thing that seems hard, I don't really know what the next step is. And I think it's hard.
Dave Pacheco:And part of the reason I've tried to I like that term is that I think there's been more discussion in the last couple of years in the ether about not judging yourself for procrastination because it happens and it's a thing that happens to all of us. And it's not to judge yourself about it. It's to be aware of and try to like, you know, once you're aware of it, you can start making different decisions. And so, yeah, I think there I kind of tried to distill this into like, what are the ways that this happens that we lose focus? And what I came up with was was none of this was obvious to me until I started putting this presentation together.
Dave Pacheco:And then I was like, wow, we actually did kind of a lot of stuff for this. But I think it's like we get stuck on technical issues, we get stuck on not knowing what to do next, or we run across other important problems, or as Robert pointed out to me, or they're thrust upon us. Those other important problems are thrust upon us. But those Yeah. Those first two, especially so we've been talking about the third one.
Dave Pacheco:And, like, sometimes those important problems are truly important. Some customer's on fire, we got to go fix it. Sometimes it's an important problem that's not really that urgent. And it is very it is hard to be in a situation where you're like, I've run across a problem that's important, doesn't seem super urgent, or it's not clear what the urgency is.
Adam Leventhal:Dave, you gave a great example of that, with John's blessing, that might be illustrative.
Dave Pacheco:Was that the external DNS one? Exactly. Yeah. So this is one where there's some configuration that goes into the rack related to the customer's network their network, basically. Like, so the external DNS servers that we we are going to host external DNS servers on sorry.
Dave Pacheco:The external IPs that we're gonna host external DNS servers on, and that is not changeable at runtime. And that sucks. And the and this is a good example of an important well, I actually don't know if I would call it an important non blocker, but it has this risk. What's tricky about the risk here is that it's not definite. We may never hit this or we may hit it next week,
Adam Leventhal:which is the problem
Dave Pacheco:of, like, we're not cleaning something up and we're gonna run out of space.
Adam Leventhal:And the the scenario is, like, a customer network folks, like, change some DNS settings, and now we need to adjust things in the rack Right.
Dave Pacheco:Accordingly. Comes to us and is like, my network team has ripped this IP space away from me, and I need to re IP these things. And we would be sad if that happened. It's not unfixable. We'd presumably prioritize fixing that and cut the new release and get that upgraded.
Dave Pacheco:It's not impossible. It's not like we'd be totally hosed, but it would be a bad situation to be in with the customer. And that's a good example where it's like, how urgent is that? And if you have a lot of issues like that, how do you compare taking a problem like that which is 100% understood? We know exactly how to go solve that problem.
Dave Pacheco:And it is in the neighborhood of all the stuff we're working on. And we can go spend two weeks and solve that problem versus I'm going to take those two weeks and take a baby step on a project that's a year away. It's very hard to make the decision. Or at least, I'll just say, the temptation to fix that problem and know that you have fixed the real problem is very strong. But as I was saying in the talk, like, I think if you make that decision too much and it might only be 10% or 20% of the time, you just never get to the project, the long term project.
Dave Pacheco:And this was this gets at, like, the summary is my fear about the upgrade project was that it would feel perpetually a year away. And because it feels perpetually a year away because it feels a year away, we make decisions day to day that ensure that it will be perpetually a year away, and then we don't land it. And, anyway, so that's the, like, we run across other important problems case of losing focus. The other two are also
Bryan Cantrill:I think there's a big difference between a year away and the date. I do think it's really important.
Dave Pacheco:That's true. Yeah. Yeah. Yeah.
Bryan Cantrill:Because And I think that, like, you don't wanna think about and I've always said that, like, you never wanna think of something as eighteen months away because eighteen months away is is long enough that like it seems like anything is possible. Yes. And yet short enough that it will satisfy you who need this so badly. Like don't worry, we'll have this in eighteen months. And it's like, well, we know what the whole wait a minute.
Bryan Cantrill:What does that actually mean? I think it's very hard to actually schedule out out that far. And I think there's a real danger to kind of thinking of it that way and not in terms of the actual hard date. And I think like even the date driver can be tough. I mean, we had a date driver where we wanted to ship the first rack.
Bryan Cantrill:And you know, we were talking about it over and over and over again. Not as like n months away, but like this is the date. And I definitely remember, like, and we will literally every all hands communicating like this is when we wanna ship the first rack. And for whatever whatever reason, finally in, you know, the one thousandth time Steve said it, I remember one of the interesting like, wait, when did that happen? Like, we we can't ship then.
Bryan Cantrill:It's like, we've been saying this date literally over and over and over. It's like, if we're gonna do that, we've got like all of this work we need to go do. Like, we've got it's like, okay. Well, okay.
Adam Leventhal:Should we go do it? What do
Bryan Cantrill:you think? Let's go do it. Like, well, okay. But this is and I they're like, well, we've just been informed that we need to okay. We've not just been informed, but like we okay.
Bryan Cantrill:We've just internalized what it means because that now doesn't like, in your mind, that was a year away. And now you've realized that, like, no, it's actually, sorry, May 2023. And right now, it is November 2022. And actually, we've got a ton of stuff we need to go do. And now it's all coming into sharp focus.
Bryan Cantrill:And it's like, that's really important. Like that date driver is really, really, really important. And just, Dave, to your point, like, this is where it feels like the date driver I mean, on, like, the configurable external DNS IPs, like, the date driver is very helpful for a problem like this.
Dave Pacheco:Yes. Yeah. Yeah. And it's still tricky even then to I don't know how to describe this, but there are definitely projects where you're like, okay, the date driver is six months away and then nothing happens for three months. And then the date driver gets moved three months back because we know it's still six months away.
Dave Pacheco:Do you know what I mean? And so like Yeah. Yeah. There has to be this idea that like and that date's not moving or something along those lines. I don't know.
Dave Pacheco:I don't mean to say that it's like completely immovable, but in the case of update, it's like, it was, I guess at Joyent, we talked about this as a date driving release or a date driving feature. No. Yeah. There's date driving features versus date driven. And we were gonna be date driven.
Dave Pacheco:We were saying, okay, this is the date. We're not gonna move it because of scope. We are going to cut scope, whatever.
Bryan Cantrill:We're gonna cut scope.
Dave Pacheco:You know, see the whole section of this where we talked about how it felt like we couldn't cut scope and that's hard too, but we were going to make the decisions we need to make to be able to ship at this date or close to this date.
Bryan Cantrill:So and we've talked about all the easy ways that one can lose focus, or even when that focus is just shedding focus on the, like, something that may be, as you say, a problem thrust upon us or the kind of the organizational procrastination. I did find it, like, interesting, thought provoking, and chilling that you said that even if we spend 20% of our time not on making the baby steps towards the mountain, we'll not get there. It's like, wow, that's interesting. Like, I mean, it really kinda puts it in sharp focus about how easy it is to lose focus. Can you talk about some of the things you did to maintain focus?
Bryan Cantrill:Because I thought like this in in in many ways, you're like, okay. Like it's easier to talk about the ways to lose focus and organize people can like really relate to that, but you you came to some things along the way to help maintain that focus. Do you want to describe some of those things a little bit?
Dave Pacheco:Yes. Yeah. So the first one that I talked about in the talk is this daily water cooler meeting, which is so this is funny because I remember talking to both both of you about this, and both of you made the same joke about how, like, the update team return to office emails going out tomorrow that, like because I was trying to describe this problem of, like, it's tough when you're remote and you're not you're not having these organic conversations with your colleagues about the things you're working on. And I just I'd noticed by being in the office that I would have these conversations like, oh, what are you working on? Oh, you know, now an hour long conversation happens about dependency that people didn't realize or whatever, or they're stuck on something or whatever.
Dave Pacheco:So it's like, let's how can we replicate that in the remote world? So this is basically a the idea here is to have a low pressure way for people to surface issues that are keeping them stuck for any of those reasons. And so it's okay, I'll show you what it is. This is a thirty minute optional virtual meeting that we have every day, and we have it in two different time zones. Not the same day.
Dave Pacheco:Monday, Wednesday, Friday is one time zone. Tuesday, Thursday is another. And we it has no agenda. And it's also like, it's really important to me that this feel optional because, obviously, it feel it would be somewhat counterintuitive to say that we improved focus by adding meetings, and that's not what we want this to feel like.
Bryan Cantrill:Well, in fact, in your talk, you said that you were not gonna call it a meeting. And I kinda have, like, the LL Cool J, don't call it a meeting, to the tune of don't call it a comeback. But then you called it a meeting, so I don't know. Now I don't know what it's like.
Dave Pacheco:Yeah. I know. I was careful in the talk. I was not so careful just now. But yeah.
Dave Pacheco:So, you know, in order to make it truly feel optional, first of all, I don't always go, which I feel like might be kind of important for making it feel optional. It's not where we make big decisions. We don't you know, we have a weekly sync as a team, and we have other ways to do that with RFDs and stuff like that. It's fine if we have extended silence. Sometimes we all show up, we're just like, hey, what's up?
Dave Pacheco:Okay. And then it's like thirty minutes of silence. Sometimes it's twenty nine minutes of silence, and then someone's like, hey, I just hit a horrible problem on my dev system, and then we go debug that for two or three hours. And you don't have to stay, but sometimes people want to stay or they're like, is Yeah. We should do this.
Dave Pacheco:This is the time that we should take the interrupt to go do this or whatever. So it's it's very not prescriptive is what I'm trying to get at. And I think that's important.
Bryan Cantrill:And that's been recorded. Right? I mean, that's one thing I definitely asked you is like, please record it so that if you do have something where it goes into a debugging session, someone can go back and rewatch it. And, you know, you get some of those in office vibes, but then you get this other added positive kicker that someone can actually go appreciate that after the fact.
Dave Pacheco:Totally. Yes. And so this helps when people are getting stuck on technical issues because sometimes, like, you know, sometimes you're fighting with Diesel for, like, an hour or two. And you're like, I don't wanna go, like, bug everyone about this. I need to be able to go figure this out.
Dave Pacheco:And that's fine. You can go figure it out. By the way, it's time for the water cooler, and you can complain about this, and maybe somebody can help you out. Or and sometimes we'll live debug it, and then, you know, people learn stuff. And so that's all great.
Dave Pacheco:There are other non-Diesel examples, but the
Bryan Cantrill:Are there?
Adam Leventhal:Just none can Yes.
Bryan Cantrill:I mean, I'm sure there could be. I mean, in the abstract, but I, you know, I guess if you really force me to, I could. Diesel knows what it did. So I'm not you know, it's, and as Sean was pointing out in the chat, like this is not a lightly remote team. This is like people in different time zones. I mean, the people are not I don't know that anyone is I guess you and Rain are technically in the same city, but he or even not or even
Adam Leventhal:just in the same city. Not the same city, but yes.
Bryan Cantrill:Yeah. So awoken by the same earthquake this morning.
Adam Leventhal:There we go. There we go.
Bryan Cantrill:There we go. Right? I mean, yeah. That one Dave, you were pretty close to that one. That one that one must have that one must have a bit scary.
Dave Pacheco:That one pretty big.
Bryan Cantrill:Yeah. Yeah. I just like to run lines. Right? Are you there?
Bryan Cantrill:You you
Rain Paharia:Yeah. No. No. No. I I just I just wanna say that, you know, as someone who doesn't show up to the update water cooler all the time, but shows up some of the time, I love, I really, really love what Dave has done to create the update water cooler because
Bryan Cantrill:for
Rain Paharia:me, you know, you look at all these places that talk about like RTO and stuff, know, you you made the joke earlier, but I think, you know, you have like, there are real things that you get from in person work that you have to find other ways to do. And I think the Update Watercooler is just such a great example of how you would achieve that kind of in person thing while still being a fully remote team.
Bryan Cantrill:Yeah. Yeah. And and and, like, finding ways, like, how do we kinda capture some of the goodness here Right. Without yeah. That that that's that's great.
Bryan Cantrill:And Dave, I gather it's been, like, it's been successful for the team. It's been helpful.
Dave Pacheco:Yeah. I think it's been great. It's been yes. I think it's been it's it's how I also
Bryan Cantrill:love that that, like, you I think one of the kind of the early parameters you said that I love is that extended silence is
Dave Pacheco:fine. Yeah. Yes. That's very important, I feel.
Bryan Cantrill:I feel like that's something I've had written down and handed to me repeated times over my life. It's like, what is this? Alright. But I think this is really important.
Dave Pacheco:About this from Alan. I I actually don't I think I checked with him and this wasn't true. But someone had told me that, like, Alan and James would set up a meet and just, like, be on it all day to be able to
Bryan Cantrill:Yes. Yeah. They do that. That's a great idea.
Dave Pacheco:Let's let's just do that.
Bryan Cantrill:Yeah. I mean, I'll, like, walk past Alan's desk. Like, oh, James is also here. Hello, James. From from Canada.
Bryan Cantrill:So yeah. No. They they did that, I think, lot, and I think it was a very helpful it was very helpful for them to do that. And it's like it's I mean, you're kinda also getting some of the advantages of of pair programming without all of the dogma and a bunch of other things.
Dave Pacheco:Yeah. That's true. Another thing that I did or the next thing I did, something demos. I wanna talk about demos. Demos can be such a good tool for focusing.
Dave Pacheco:Right? Because planning for them forces you to prioritize things. The demo itself is a point of de risk, right, because you've shown something working, maybe not with all the edge cases and stuff like that. And then the other thing is that it inspires all this follow on work where people see a demo and they're like, oh, I could go do this, or what about this, or could you add status display of this or something like that. And it's great for communicating, like, with the rest of the team, how this thing is shaping up and then with the rest of the company, like, all this great stuff.
Dave Pacheco:And we have demo day, which is great. It's great. This sounds like a but coming. What's coming is that it's easy for demos to be, this is something cool, but I think the kind of demo I'm talking about is a specific thing, which is, like, this is a demo of a useful milestone on the way to a customer deliverable. And that I think that has been useful.
Dave Pacheco:That's been useful for me in my whole career, I think, a lot of us as well, where you're like, if we're trying to be able to update, we should be able to show having updated a zone pretty soon. Like, what pieces are we gonna need to have? I'm just, like, walking through what we said, like, about a year ago, last OxCon. Just like, we need to update a component. What would we need to do to demo that?
Dave Pacheco:Let's do that. And then, like, lay out a whole bunch of demos. And so another way to look at this could be, like, project planning via demos, which is kind of what I ended up doing was coming up with a sequence of demos. It's like, this is the work that I expect we're gonna do, and here are the proof points we're gonna have along the way. And I would say that had very mixed success.
Dave Pacheco:Like, it was useful to have done, but the end list of demos that we did and, like, the schedule around it was extremely different than what I had expected, mainly because there were so many oh, this goes back to your question earlier that we never got back to. There are so many circular dependencies between projects that we couldn't really get to a demo even of the very simplest thing we could upgrade without having done, like, a whole bunch of stuff. And then once we had, we could actually do, like, five demos pretty quickly that I thought were gonna be, like, a month apart or something like that. But anyway It's great. Having this demo, I mean, it was really helpful.
Bryan Cantrill:And I would say that, you know, of our demo day, I mean, I would say that I mean, there are like, you definitely I mean, certainly, you know, we were talking earlier about the Wireshark support for MGS. And John demoed that at demo day. The thing that John had volunteered to, like, you throw him under the bus for even though it's incredibly useful. And I would say some fraction of our demos are like that. But I would actually think the majority of our demos, certainly many demos, are exactly in the spirit of what you're describing.
Bryan Cantrill:I mean, I'm just thinking about all of Eliza's terrific work on fault management. And I feel like there was I mean, we were getting, like, it was almost like a serial of demos where we were, you know, as she was adding new pieces to this. And I think for a lot of the systems work, you also kinda need that demo checkpoint because it requires so much to get the whole thing working, you do kinda need to show it to your peers who can really appreciate that, like, okay, this is a tough demo to love because it's the systems demo. So being able to show it to your peers, I think, has been really important. And then, of course, it's been great to watch those demos really gather momentum and become bigger and bigger and move honestly faster and faster.
Bryan Cantrill:I feel that that happens rarely in software, where software comes together more quickly than we think it it should. So often, it's it's the opposite.
Dave Pacheco:Yeah. Then Yeah. The last thing I would mention is I don't know what to call this. I call this making a path in the talk, which is basically, like and it sounds obvious, and maybe it is obvious.
Dave Pacheco:But, like, I spent, like, a bunch of time every week trying to figure out trying to lay out enough tasks. I say tasks. I mean, like, the next steps for all the things that people are currently working on. Basically, I wanna make sure everyone has the next thing that they can pick up. It doesn't have to be the next thing that they do pick up, but I this is to help the problem of getting stuck not knowing what the next thing is.
Dave Pacheco:It's like, I'm going to actually spend a bunch of time doing that basically for each work stream slash each person and, like, kinda laying that out so that people never have to wonder, like, well, what would be the next useful thing for me to do here? And this is like this took some getting used to for me and some like it was tricky because I'm not people's manager, and I don't know it feels weird to me to tell people what to do, but I'm not telling people what to do. I'm laying out this path. That's why I phrased it that way. But it was definitely it took me some time to get to realizing that I should do this and to be able to do this and be able to communicate directly about it and all this stuff.
Dave Pacheco:So yeah, I don't know. But I think it's actually been one of the most useful things
Bryan Cantrill:Oh, yeah. Focused. Yeah. I mean, again, you're providing that clarity. Yeah.
Bryan Cantrill:You're providing and, like and I I I think that there is a false dichotomy. You can provide clarity while still granting autonomy. And the and okay. So someone in the chat is accusing us of reinventing scrum. Okay, sir.
Bryan Cantrill:So here's the difference between I mean, the problem with agile, capital a agile, is it actually is not lowercase a agile. And there are actually elements of scrum that are productive and helpful. But and like, I mean, this demo aspect is actually helpful. I mean, demos are not obviously, we didn't invent demos. This is like but the problem I have personally with a lot of aspects of agile is like the fixed two week cadence, I think is a problem.
Bryan Cantrill:And I think that like I mean, Dave because it's just as you're describing, some of those things were larger than a than two weeks. Some of them were smaller, actually. And I think the the you get and it's really and, you know, yes. Talking about car recalling scrum is diabolically bad. I don't know that any other kind unfortunately.
Bryan Cantrill:I mean, I just like I feel like when agile is done correctly, we call it something else because we don't wanna be conflated with the thing that's done incorrectly so frequently. And like we don't have, you know, pigs and chickens and Dave's not a scrum master and all of a sudden things like we're actually just trying to build the system. And then I think that's kind of the unfortunate thing about capital a agile is that there's a lot of good stuff in there that was lost in the dogma and the churchiness. We got our Agile + 20 episode, Adam? What's the time we gotta ring
Adam Leventhal:that one? Yes. I was just thinking that one.
Bryan Cantrill:Yeah. Exactly. Who had that on their bingo card?
Dave Pacheco:So But That that sort of gets at some of the stuff we were talking about earlier, though, which is, you know, we've chosen this organizational model or structural model that, you know, doesn't have any management, which, you know, some of these some of these I'm sorry. Explicit traditional management. I don't know how to phrase it.
Bryan Cantrill:Yes. Thank you. Please insert more words, please. Exactly why I'm munching this Play Doh. Pass me the Play Doh.
Bryan Cantrill:Could you use different words, please?
Dave Pacheco:But a bunch of the things that I'm describing here are like, you know, some of these are things that traditional management might do, and there's sort of different ways to look at that. I think,
Bryan Cantrill:you know That is such a romanticization of traditional management, I gotta tell you. I I mean, I mean, I I I love that. I I love that idea that you have of traditional management that that I mean, yes. This is what great management would do for sure.
Dave Pacheco:I I mean, I would say at the very least, some of these things are in are part of the aim of traditional management. Whether they do it successfully or not, whether it's a net win or not is sort of a different question. But the questions of, like, making priority calls and, like, deciding what people are gonna do is definitely, like, part of the function of that. Right?
Bryan Cantrill:It's not management. It's leadership. That's what I gotta say.
Dave Pacheco:Well, that's fair. And I and I think, you know, one view on it is, like, we're finding a way to do to get to to sort of achieve these functions without the baggage that that traditionally comes with a lot of it. Right? I mean Yes. Is that a fair?
Bryan Cantrill:New words of good leadership.
Adam Leventhal:But I I so sure, Bryan, but, like, why is, you know, Dave Dave picked up that mantle. No one really asked him to do it. And Dave didn't even know a 100% true. 100% true. Like, I know that at some point, like, you did ask him, at some point Steve did ask him.
Adam Leventhal:But when when Dave was initially doing it, nobody had asked him to do it. And Sure. I think part of part of what traditional management does is asks Dave to do that and gives him the charter to feel like he's not intruding on other people's autonomy by doing that. And so I I agree that that there that maybe what Dave is describing is not merely traditional management, but great management or or even aspirational management. But, you know, Dave knowing that that's an important valued part of his role was not is not something that is, like, emergent necessarily or, like, clear to everyone in the organization.
Bryan Cantrill:Yeah. I mean yeah. I I think that actually I mean, my view on this, as I've said before, is everyone leads and is led. And we often ask folks to lead things, and and we often ask folks to follow things. And, you know, and and there are and there's a lot of so I mean, I think that the the leadership here is critical and it's important.
Bryan Cantrill:And I think that the and this is what, Dave, you're describing is great leadership. I think it is not traditional management that provides this, unfortunately, traditionally. I mean, again, great leadership does, but this is the difference between management and leadership, I'm afraid. But I think you were providing that clarity. Like I said, you're providing clarity and still giving people because I think the problem is that when you and when we kind of provide not clarity, but effectively micromanagement.
Bryan Cantrill:And that's what you're trying to avoid because you're trying to grant people autonomy as we wanna do. And we we actually do want because there's a real risk when you take away folks' autonomy, you often then you you don't end up with the best results. You you you stifle creativity.
Adam Leventhal:I just felt an earthquake. No. It sounds like David
Bryan Cantrill:I didn't feel it. Yeah.
Rain Paharia:Yeah. I just felt it.
Bryan Cantrill:That's funny. You know, I I felt it. I felt the two six this morning, but I was at home. I mean, we're kind of in the flats here in Emeryville. That's funny.
Bryan Cantrill:Rain, you're in Oakland. Right?
Rain Paharia:And Yeah.
Bryan Cantrill:Yeah. I just felt it. You felt that in Albany. My wife just d m'd me asking me if I just felt it. What the hell?
Bryan Cantrill:How did I
Adam Leventhal:Well, you're
Bryan Cantrill:on the Ground Floor.
Adam Leventhal:You're probably I mean, you're probably not moving as much. I'm on the Second Floor.
Bryan Cantrill:You know, you got me all agitated feeling defensive while I'm eating the Play Doh. I think that's what it is. I think that's like, you know, I I I don't think I got
Adam Leventhal:Just Play Doh overdose.
Bryan Cantrill:Play Doh everywhere. The, the one overnight, should be I'm sure it's the same spot. So for those folks, we are we are we had an earthquake overnight in the Bay Area that was, actually quite small. I'd say it was a four three, I believe. But, you know, actually, Adam, it reminded me of that it's the same spot where we had those earthquakes when we were at Fishworks.
Bryan Cantrill:And that was in, like, 2007, 2008.
Adam Leventhal:Yeah. And No. I know. I was living there. I was living in Rockridge with Right.
Adam Leventhal:The fault line right under my house. And it felt like a, you know, magnitude 12 and it's like, no. It's just right under your house.
Bryan Cantrill:It's just right under your house. I know. And we had a, we had a four o that was at our closest point on the Hayward and I thought a plane had flown into the house. Kids screamed. But, anyway, we had a four three last night that was definitely a jolter, woke everyone up.
Bryan Cantrill:But but my daughter Adam, that must have woken you up, I assume.
Adam Leventhal:Yeah. Yeah. Yeah. It would have if the dog had not woken me up already at three in the morning. But, yes.
Adam Leventhal:Yeah. It woke up everyone else.
Bryan Cantrill:It sounded like, Dave, you woke up under the door. It sounds like you woke up. That that that was a because you That was
Dave Pacheco:last night.
Bryan Cantrill:Yeah. Yeah. Yeah. You were close to that part of this.
Dave Pacheco:That that I felt.
Bryan Cantrill:Yeah. That makes sense. I mean, I think it'd be like this little the Quake West Ham was on. And we're all here in these pain. Just think of us, rest of the world, when the big one... we are overdue for somewhere between a 6.9 and a 7.5 on the Hayward, which
Adam Leventhal:If the gods are list you already said the gods are listening to this podcast, and here you are talking about the big one.
Bryan Cantrill:Kinda tuned out. I think the gods listen to, like, the first twenty minutes, and then they kinda move on to something else. That's what I'm thinking.
Adam Leventhal:Just like everybody else. Yeah. It was a 3.0 at Berkeley. Breaking news.
Bryan Cantrill:3.0 in Berkeley. There you go. Well, I did not feel it here on the Ground Floor in Emeryville while flustered over management versus leadership. So I mean, that's an important way to keep myself from feeling earthquakes.
Bryan Cantrill:But, Dave, I think that, again, what you did for folks is offer them clarity. And we're all trying to achieve the same thing, which is a successful organizational outcome. We're trying to ship a product that customers are gonna love. And I do think that having that fluidity is really important to allow us to collectively do the right thing at any given moment. But having that clarity is also really essential because we can get just bogged down in, like, not knowing what's next.
Bryan Cantrill:Let me ask this. How was the clarity received by folks?
Dave Pacheco:Well, it was received well, I would say. I continue to ask people for feedback, and the only feedback I've gotten about that has been pretty positive.
Bryan Cantrill:I think, you know, we can be too hesitant sometimes to provide that clarity, but I think we all want that clarity. There's enough ambiguity as it is that I think that clarity is really helpful.
Dave Pacheco:Yeah. For sure.
Bryan Cantrill:And then, I mean, you gave a bunch of very concrete examples in the talk of things that you pointed out. I did love, by the way, your slide where you have the important non blockers and you just had the list that went off the end of the slide with an arrow pointing down. Like, there are many important non blockers.
Dave Pacheco:Well, I decided I really wanted the the list to go off the slide, but I was afraid it looked like a a typesetting mistake. And I was like, no. There needs to be an arrow there to communicate that this was intentional.
Bryan Cantrill:Yeah.
Dave Pacheco:There there are a number. But, you know, they're manageable. We'll get there.
Bryan Cantrill:And then what has been the reaction to the talk? I mean, I know the reaction of your teammates. I think that, you know, Rain is here in part because I know she really loved the talk. I think the talk was very well received, at least internally, so it seemed.
Dave Pacheco:Well, I've only gotten the positive feedback. So everything seems pretty good as far as I can hear. You know, I will say the back half of this talk, which is most of what we've been talking about here, was hard for me to figure out exactly what I wanted to talk about and how I was gonna do it constructively. And also I really didn't want people to feel like I was throwing the team under the bus because I'm like, here, it's hard to focus on a long term thing.
Dave Pacheco:I didn't want people to hear that as like, Boy, I guess we really didn't do a great job focusing on this thing. But, you know, I socialized it with a lot of people, you know, team members and people not on the team. And I think actually people were glad that I was talking about these things and these problems because it's something everyone struggles with. And so, yeah, it was good. And it gave me a lot more confidence going into it being like, okay.
Dave Pacheco:This is not gonna be like dropping a bomb. This is something that people are eager to hear about.
Bryan Cantrill:Yeah. Totally. And it's and as yes. Adam, you're pointing out the comments online were all Yeah. Very positive.
Bryan Cantrill:I mean, I think there's not a lot of great content out there about how to really think about a big software project. Software in the large. I call it systems software in the large. You've got the problems of systems software and the problems of software in the large. Systems software has to be absolutely correct at some level.
Bryan Cantrill:And Dave, when you're talking about cutting scope, and cutting scope by reducing the kind of conditions that update can deal with, I mean, you are kind of very apologetic in saying, like, look, I'm not talking about cutting rigor. You actually really aren't talking about cutting rigor. I mean, this is a system where you and the team have been very, very careful about making sure that this thing doesn't go wildly off the rails and doesn't actually begin to destroy the system, for example, as we were talking about at the top. So, I mean, you've got the systems software constraints and you also have the software in the large constraints.
Bryan Cantrill:And I mean, it's, yeah, it's distributed systems software. It's really brutal. And you're gonna actually ship that across an air gap. I get it, super, super hard.
Bryan Cantrill:And I don't think there's a lot of great content out there on or I don't think there's a lot of great guidance on how to do it. I mean, I thought again, thought your talk was an exemplar about how to think about it. And I don't I would welcome other examples of it because honestly, like, we don't I don't feel you have the monopoly of thinking on this. But I think part of what made your talk have such clarity is that you're talking real I mean, you're you are right in the trenches of implementing this thing. This is not an abstract talk about how someone should formulate their project.
Bryan Cantrill:It is really the lessons learned from this particular body of work, which is what makes it so, I think, real and and extraordinary.
Dave Pacheco:Thank you. Yeah. I I thought the examples were really important. I didn't even get to talk about nearly as many as I had prepared for, but the examples are so important for grounding the discussions on things like priority and what's an important non blocker and what's not because the the texture of it is so important. Like, what kind of risk is this and what's the impact and, you know, the difference between definitely gonna happen, but we don't know when versus this may never happen, all that stuff.
Bryan Cantrill:Totally. Well, hopefully, this can be a be helpful for folks that are engaged in their own multiyear, multi person software effort, because I think there's a lot that people can take away from this.
Dave Pacheco:Yeah. And like you said, I would love to hear more if other if folks have other examples of content like this, I I would love to go read that or watch that or whatever
Bryan Cantrill:it is. Yeah. It'd be great. There's one thing I did wanna, because, Rain, this is something you had highlighted, that you're like, hey, we really need to talk about the work, Dave, that you did on Dropshot early on around versioning. Do you wanna just describe that a little bit?
Bryan Cantrill:I know it's a bit orthogonal to the talk, but it felt like it was a very important foundation to this. Yeah. Oh, yes. Sorry. Go ahead.
Rain Paharia:Oh, me? Okay. So the context here is that we have a bunch of services. Right? So our system is pretty distributed.
Rain Paharia:So for example, we have a service that runs on every sled, we call it sled agent, so that that service gathers stuff. Then there's also the central Nexus service. We have like a service for like DNS management. We have service for NTP, services for cockroach management. And a challenge is that we are updating these things piecemeal, right?
Rain Paharia:So we're updating one service at a time. And what that means is that in general, software has to be able to handle divergence in versions between services. And, you know, we try very hard in our update process not to, not for there to be more than two versions of software running at any given time, but I mean, two is as far down as you can bring it, right? Like, you know, so Dave, so that's kind of the background of this, right? So, and this turns out to be a very, very hard problem.
Rain Paharia:And the solution that Dave came up with, which I think is really, really brilliant: the first thing we did is that we built a graph of dependencies between all of our different services. Then we tried to make that graph as much of an acyclic graph as possible. This turns out to be mostly doable, with some backlinks, which put an asterisk there. Then we built out this great support in our HTTP server called Dropshot. So whenever these services talk to each other, we use Dropshot for both our internal and external APIs.
Rain Paharia:It's on GitHub at oxidecomputer/dropshot. Love Dropshot in general, but I think Dave did this really cool thing where we made it so that the same version of a service can understand two different versions of a client. So given that, you know, if you can turn it into a DAG, and if you update things in the order of first you update servers, then you update all the different clients, then you can get reasonably close to achieving, you know, the goal of being able to handle this kind of version divergence.
Rain Paharia:Now, as I mentioned earlier, we don't fully have a DAG, we have some backlinks. And in general, I think the thing we've done for backlinks is to say that, and there are like a single-digit number of APIs that are backlinks, right? So it's a very small number of methods. But in general, what we've done is that we have essentially declared that those APIs are fixed in amber, or I guess the more diplomatic way to put it is that if you want to change these APIs, then you have to do all this extra work that we haven't done.
Rain Paharia:But overall, I think it's, you know, I describe the system to people who have worked on it, in general, it's been interesting that I think people are generally familiar with the idea of like, you know, you either make a single version of the server talk to multiple versions of clients or you make a single version of a client be able to talk to multiple versions of the server. So you end up doing, you know, one of those two things. But I think the way that this is executed in Dropshot is just remarkably elegant. And it means that, you know, another big focus was that we have a number of folks who work on the control plane who don't work on the update project itself, right? And we would like to minimize burdens on those people.
Rain Paharia:So there's a lot of like ergonomic work that has gone into making it so that, you know, if for example, there's like a merge conflict, then the way our text is structured is that if there's a semantic merge conflict, then you will also get a textual merge conflict. So, you know, there's all these little details that we had to get right in order to make sure that, you know, everyone has to be aware of this, right? Like there is no escaping this problem completely, but we can make it so that it is as smooth for people as it can be. Both for people like, you know, me six months from now, who have forgotten about most of this, and for people who, you know, don't directly work on the update stuff. Dave, do you have other things to add?
Dave Pacheco:Yeah. That was, I mean, that was a great summary. The two things I want to add, one is that, the way we structured it was oriented around trying to preserve a lot of the nice things that we have in Rust today around strong type safety and stuff, which I think is so important for, honestly, just for development velocity because you can make all these changes to the system and have high confidence that they're not breaking things. And so but the problem is what people often do when you need to evolve an API is, like, if you want to add some required parameter, you say, well, I'll just make it optional for now. And then the problem is that a year, two, three years down the line, you have 18 optional things that are actually all provided.
Dave Pacheco:And the code actually assumes that it's all provided, which is just not reflected in the type system. So the idea here was instead of making the new required field optional, you actually define a new version of the API in which it's required. And you define this, like, compatibility layer that just provides the default that you wanna have for the old version. But you can tell at build time which things are using which of these things. And you can say, well, okay.
Dave Pacheco:That that old version is from two releases ago, which we don't support anymore. We can just rip that out, and it's not optional ever again. And so that's very helpful. That's that's why it went the way it does. And then
Bryan Cantrill:And that's a bit of that that's a bit of mono repo upside as well. Not to Yeah. Right? Mean, right, maybe because we yeah. That's that's good.
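The pattern Dave describes (a new API version in which the field is required, plus a compatibility layer that supplies the default for old clients) can be sketched in plain Rust. This is not Dropshot's API; the type names, the `replicas` field, and its default of 1 are hypothetical, chosen only to show the shape of the idea.

```rust
/// Version 1 of a hypothetical request body: it has no `replicas` field.
#[derive(Debug, Clone, PartialEq)]
pub struct CreateWidgetV1 {
    pub name: String,
}

/// Version 2 makes `replicas` required. Note that it is not an Option.
#[derive(Debug, Clone, PartialEq)]
pub struct CreateWidgetV2 {
    pub name: String,
    pub replicas: u32,
}

/// Compatibility layer: the single place where the default for old
/// clients lives. Dropping V1 support later means deleting this impl
/// and the V1 type, and the compiler points out anything still using them.
impl From<CreateWidgetV1> for CreateWidgetV2 {
    fn from(old: CreateWidgetV1) -> Self {
        CreateWidgetV2 { name: old.name, replicas: 1 }
    }
}

/// A request as received from a client of either supported version.
pub enum VersionedCreateWidget {
    V1(CreateWidgetV1),
    V2(CreateWidgetV2),
}

/// Business logic only ever sees the latest type, so it never has to
/// ask whether `replicas` was provided.
pub fn handle(req: VersionedCreateWidget) -> String {
    let latest: CreateWidgetV2 = match req {
        VersionedCreateWidget::V1(old) => old.into(),
        VersionedCreateWidget::V2(new) => new,
    };
    format!("{} replicas={}", latest.name, latest.replicas)
}
```

The design point is the one made in the conversation: because the old version is a distinct type rather than an optional field, removing support for it is a deletion the type checker can verify, instead of 18 optional fields that the code quietly assumes are always present.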
Dave Pacheco:Yeah. And the other piece of it is that, like, it was a lot of work, and sometimes I've stopped and asked, like, man, was this a good thing to have spent a lot of time on? But I look at it in terms of, you know, we said from the beginning, the fear of upgrade has always been that at runtime, the thing falls over. You know, you're in one of those 500 intermediate states in which you just haven't tested it, because how do you test all the functionality of the whole system in every one of those intermediate states? And the answer is you can't, and you don't.
Dave Pacheco:Instead, we created abstractions that make it impossible for things to break in those ways. And obviously, like, we can't solve all problems in this way, and I'm not saying it can never fail in this way. But for things that are known predictable things, like we made a breaking change to the API and we deployed them in the wrong order, Let's create abstractions that make it so that we detect that at CI time so we cannot land such a change on main. And that is why we spent so much time on this part of the problem to prevent this, like, very common way that an upgrade would go badly at runtime in a way that was hard for us to detect in testing.
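To make the "detect that at CI time" idea concrete, here is one way such a check could look, sketched in Rust under stated assumptions. It assumes servers are updated before clients, so a new server must still serve every version that already-deployed clients speak. The `ApiCompat` struct and `breaking_apis` function are invented for illustration and are not how Oxide's checks are actually implemented.

```rust
use std::collections::BTreeSet;

/// Hypothetical description of one API at the time a change is proposed.
pub struct ApiCompat {
    pub name: &'static str,
    /// Versions the server built from the proposed change can serve.
    pub server_supports: BTreeSet<u32>,
    /// Versions spoken by clients that may still be running mid-update.
    pub deployed_client_versions: BTreeSet<u32>,
}

/// Returns the names of APIs where some deployed client speaks a version
/// that the proposed server no longer serves. A CI job could fail the
/// build whenever this list is non-empty.
pub fn breaking_apis(apis: &[ApiCompat]) -> Vec<&'static str> {
    apis.iter()
        .filter(|api| !api.deployed_client_versions.is_subset(&api.server_supports))
        .map(|api| api.name)
        .collect()
}
```

The check is deliberately static: it reasons about declared versions rather than exercising each of the intermediate states at runtime, which is the trade Dave describes.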
Adam Leventhal:Yeah. Well, and I And horrible as its results. Right? Like, you my
Bryan Cantrill:end yeah. Oh my god. And hard
Adam Leventhal:to validate, as you're saying, just just through sort of normal pedestrian testing because, like, not every path is gonna be exercised.
Dave Pacheco:Exactly. Right.
Bryan Cantrill:Because the system is in this hybrid state that it's, like, never supposed to actually be in. Or, I mean, it's supposed to be, it's in one of these kind of fractal states.
Adam Leventhal:That's right. It's a designed sort of inconsistent state or not all one version state, so it's hard to replicate exactly the combination, that snowflake that might be experienced in some customer environment.
Bryan Cantrill:Well, and, Dave, this is a good example of where the rigor up front allowed you to kinda confidently cut scope later, I feel. I mean, it would be tempting to be like, no, no, no, we gotta ship this thing in a year or what have you, and the scope to cut is this facility in Dropshot. But you were right. You're like, no. No. That's the foundation.
Bryan Cantrill:Like, we have to get that right. And the I mean, that's again, this comes down to a judgment call. But I think it's pretty clear that you made the right judgment call there, and that was the right thing to go do at the right time.
Dave Pacheco:So
Bryan Cantrill:It's good stuff. And, well, again, terrific work. We're not done yet. We also have some important non blockers that are important. So I'll be working on is-enabled probes in the kernel while we, on the... Good.
Dave Pacheco:And what are two blockers to knock down this week?
Bryan Cantrill:That's right. I Adam calls that apparently air quotes leadership. I'm now being mocked in the comments. I this is alright. Is this to brace me for another more powerful earthquake that I'm also not gonna feel because I'm gonna be
Adam Leventhal:agitated? Like on Play Doh?
Bryan Cantrill:Drunk on Play Doh. Look, I'm fine. Just pass me the Play Doh. Just give me another. I just need one more.
Bryan Cantrill:One one for the road. But I again, terrific job, Dave, on all this. And and Rain and John and Sean and all the folks have been working on this. And it's been, she said, this has been a team effort. We had a bunch of folks, Alex, a bunch of others that have been doing really great work.
Bryan Cantrill:And, we're actually, Dave, one just final question. Because I think it's so easy to despair on big projects. Maybe it's easy for me to despair. It's what it's what I call leadership, Adam. Despairing in the corner eating Play Doh.
Bryan Cantrill:Was there what was a moment where you're just like, you felt like you're turning the corner? I mean, there was this kind of moment where demos started happening faster and faster and faster. Yeah. Maybe
Dave Pacheco:what I would say. The first time we started doing, we came up with this, I was calling it update bring-up, sort of analogous to our hardware bring-up, where in the update case, we're basically going to say, let's write the demo script, the script that a customer would use to update the system, and let's start being able to just run it and see what breaks. And there was some functionality that wasn't implemented when we started doing that, but we knew that, and we were still able to work through it. That's definitely where it felt like things started snowballing in the good way. Yeah.
Dave Pacheco:Right. You're like, okay, good things are happening now. More and more things are being finished. This is looking better every week. There was never a point where I felt like this is like not going well or not converging.
Dave Pacheco:You know, I've said for a year that I felt that the schedule was tight but doable. And I don't know what it says that I basically feel like I've been saying that for the entire last year, including right now, a few days before we're expecting to cut the release candidate. But, like, it hasn't been like, oh, man, this is really bad, we're not going to make it, or like, oh, this is totally in hand. It's been like, all right, this is going fine.
Dave Pacheco:But that's the point where I was definitely feeling like a lot was derisked.
Bryan Cantrill:Yeah. Well, because I feel you need and I feel like we have these for the company as well. You need good omens. You need like, you know, you you need to like see a, you know, a seabird when you're at sea. You're like, alright.
Bryan Cantrill:Like, wait, we're within a couple hundred miles of land. Like, you gotta have something where you're just like, we're headed in the right direction. Like, I think some of the past decisions have been vindicated, which I feel like happened, I think, somewhat early on in terms of dynamic reconfiguration. We're like, okay. The way we constructed blueprints was the right way to construct it.
Bryan Cantrill:We, did we another substance reference we made in the chat about actually, brilliant substance reference where dolphins the dolphins too greet them, saying you're all gonna die with this this subtext to which I actually do love. So that's really it's very well played. But if that was only for your bingo, I I really admire that. But the I I I think that we at least from kind of the outside looking in, it looks like a couple of those early decisions were felt like they were were validated and vindicated. Like, okay, that was the right thing to go do.
Bryan Cantrill:Even though it felt like I mean, come on. I remember some of those early update calls was like, man, I am so lost in the abstractions here. How is anyone I mean, it's just like there's a lot to keep track of, and it was hard to make progress. So all the more admirable because they've done terrific work.
Adam Leventhal:So Yeah. Great stuff.
Dave Pacheco:And a team effort. Been great effort. Yeah.
Bryan Cantrill:Also Andrew, I think I neglected to mention Andrew when
Dave Pacheco:I was rattling off.
Bryan Cantrill:And Karen. Sorry, Andrew. And Karen. Thank you. Yes.
Bryan Cantrill:There you go.
Adam Leventhal:And we'll edit in any other names we've
Dave Pacheco:got.
Bryan Cantrill:Okay. Any other names will be edited later. Sorry. Great work all around. Terrific work team.
Bryan Cantrill:So, and Dave, again, great, great talk. If folks haven't listened to it, it's a must listen. I gotta say, I really appreciate that we've got really thoughtful investors. Seth from Eclipse listened to that talk, Dave, and loved it. So Great.
Bryan Cantrill:Cool. Yeah. I would say he had a very, very thoughtful LinkedIn post. I'm not sure you saw his post, but a very thoughtful post kind of explaining why this is hard. Because it can be hard to explain to people why this problem is hard.
Bryan Cantrill:And I think your talk is just very vivid in that regard, and in why, you know, you thought about it so carefully. So
Dave Pacheco:Thank you.
Bryan Cantrill:Great stuff. I don't know. I did see someone online be like, oh, this is great. Like, Oxide and Friends is back. So I I I need my fix.
Bryan Cantrill:I think we're not gonna have our fix for the next couple of weeks because we are out for the next I think we may be out for the next three weeks actually. Because I think Alright. So we may be on a a little bit of of a hiatus. And then and then we'll
Adam Leventhal:come back to you with
Bryan Cantrill:a baseball episode to really just just punish you for your ongoing listenership. Yeah. We we got some actually, we got some exciting episodes that are that are teed up for for the fall. So
Adam Leventhal:Yeah. Yeah. Yeah. A couple a couple of guests I'm really excited about.
Bryan Cantrill:Yes. So stay tuned. And Dave, thank you again for for joining us and for enduring the aftershocks that apparently everyone else is feeling and not me. I'm all left out. But alright.
Bryan Cantrill:Thanks everyone. Talk to you next time whenever next time is.
