The Saga of Sagas
Bryan, Eliza, Greg, Dave, and Andrew.
Bryan Cantrill:Like the Mouseketeers are like this. Right? It's like a Mickey
Bryan Cantrill:Mouse Club.
Bryan Cantrill:You cannot make
Adam Leventhal:a mouse.
Bryan Cantrill:Yeah. Making a boomer reference there.
Adam Leventhal:Now, there's the — well, I had to suffer through some Mickey Mouse Club with the younger one.
Bryan Cantrill:Did you have a rebooted Mickey Mouse Club?
Adam Leventhal:Yes.
Bryan Cantrill:I mean, of course, what what am I doing? What am I doing to myself? What am I doing to myself?
Bryan Cantrill:It's like, really? They turned that superhero comic from my youth into a movie. That was surprising.
Adam Leventhal:Yeah. Turns out Mickey Mouse, still a valuable IP. Like, who knew?
Bryan Cantrill:Of course,
Bryan Cantrill:I shouldn't be surprised. This is great. Oh man. I'm, I am, I'm stoked.
Adam Leventhal:You know, don't you take this away from me. I'm more stoked. Okay?
Bryan Cantrill:You know what? You know what? You can Leventhal me right now. I'm gonna let you go ahead.
Adam Leventhal:You know how long I've had this on the list? Because we have a document of proposed topics. Do you wanna guess how long it's been on the list?
Bryan Cantrill:Our audience is like, yeah, show us or get the fuck out. No way. Oh, I'll show it. No way you guys have a chance.
Adam Leventhal:I'll show it with the version history.
Bryan Cantrill:Yeah. No. But it's true. It's true. We have tried some modicum of advanced planning at some point in time.
Bryan Cantrill:Yeah. And, no. This has been on the list. This has been on the list for a long time.
Adam Leventhal:March 2023. It's been on the list a long time, and I'm very excited that we're here.
Bryan Cantrill:I am very excited we're here too. And, you know, we call it a Leventhaling when I have an idea and you tell me, no, no, this is the idea that I, Adam, have had, that you've been shouting down forever. But in this case, I don't think it's a Leventhal. I think you'd say you're actually right.
Adam Leventhal:I actually think it is a Leventhal. We just have different definitions of what it means.
Bryan Cantrill:No, it's like, that is a Leventhal because I am right. That's exactly right. That's what makes it a Leventhal. In any case, this episode has been long overdue. And actually, it's so overdue that I almost wondered if we'd already done it.
Bryan Cantrill:What do you call that? I guess that's just early onset dementia. And I was
Adam Leventhal:That's right.
Bryan Cantrill:I don't know if that's got a special name. I also did love that this is a bit of a special episode, in that you got ahead of what you anticipated to be an objection, and we have a guest appearance from Lrrr from Futurama in our Discord here. Do you wanna Yes. Well give Lrrr's special disclaimer before we
Adam Leventhal:Sure. So, special disclaimer. We will talk about Omicron. Omicron is the name of a repo that we use that's, like, our monorepo, but not really a monorepo. But it's, like, our biggest repo, shall we say, for the control plane.
Adam Leventhal:We named it Omicron.
Bryan Cantrill:Portly repo.
Adam Leventhal:It's a portly repo. Right. Right. It is a mono among many. And it was named as a Futurama reference, for Lrrr, who
Bryan Cantrill:It is a Futurama reference. It remains a Futurama reference, I said. Right.
Adam Leventhal:That's right. It is and was a Futurama reference, referring to the Omicronians, the folks from Omicron Persei 8. And a few months after that was named, something else was named Omicron. But we were there first, I hasten to point out.
Bryan Cantrill:And it took more than a pandemic for us to change our name. That's right. We did not, actually, even though there was a moment where people were like, you codenamed your project tuberculosis? That just feels in poor taste.
Adam Leventhal:I mean, they should change their name. They're the ones who suck. Exactly.
Bryan Cantrill:Yes. Excellent Office Space reference. Very good. So, that is our special disclaimer: when you hear us talk about Omicron,
Bryan Cantrill:We're not talking about the variant of SARS-CoV-2. Although that's now becoming dated, right? That's now become one variant among many, so now
Adam Leventhal:it's not Mhmm.
Bryan Cantrill:And maybe that's back to being a Futurama reference. Pretty great. So, on the saga of sagas: did you go back and listen to our journal club with Katie McCaffrey?
Adam Leventhal:No. I didn't. I should have. I didn't.
Bryan Cantrill:I was not.
Adam Leventhal:I know, it's amazing. I've read some more of these, but not that one.
Bryan Cantrill:Okay. So just a little bit of context on that, because this was especially important. Oh, also, the people in the chat are wondering if they're mispronouncing Omicron. I'm probably mispronouncing Omicron. Is it Omicron or Omicron?
Bryan Cantrill:I don't know. That feels like it could go either way. Idempotent, though, is definitely idempotent. Right?
Adam Leventhal:Yes.
Bryan Cantrill:Okay. That's right. So Andrew's saying it's a conflagration of his pronunciations. You know, Andrew, it takes daily affirmation for me to say conflagration, not that other thing that I
Adam Leventhal:What's the word
Bryan Cantrill:you're saying? Affirmation? I don't think that's right. Well, well played, I think. Oh, god.
Bryan Cantrill:Oh, that's well played. Right? That's a joke. Right? Right, someone?
Bryan Cantrill:But so, and this is especially important, I think, in the earliest going of the company, we had a very clean sheet of paper. And on the one hand, it's great to have a clean sheet of paper. It's fun. You've got a lot of design latitude. On the other hand, it can be really intimidating and challenging, and you get that perennial challenge in engineering of: at what point do you stop searching for an existing solution and roll your own, and so on.
Bryan Cantrill:So there were a lot of challenges. One of the things we wanted to be able to do was institute a way for us to evaluate outside work. And, Adam, you and I both experimented with this in a couple different ways. We've done journal clubs in the past, and I think one of the things that was kind of the undoing of journal clubs, and we certainly had one at Joyent, was we tried to do them with some regularity, and it just kinda fell down. And so we did this kinda shrink-to-fit model where, if somebody finds outside work that's interesting and they wanna talk about it, we have an RFD about it that I should make public.
Bryan Cantrill:I'm not sure that I have made that public. Is that RFD 35? I will make that public if I haven't, on how we do journal club. And we just have a very lightweight process to get everyone together to discuss it. And
Adam Leventhal:one of
Bryan Cantrill:the early ones around this was on Katie McCaffrey's work on sagas. And so a couple of things about that. Adam, do you know when that journal club was? The date?
Adam Leventhal:I'm gonna guess it's 2020. And I'd go, like, June 2020, July 2020.
Bryan Cantrill:That's a good guess in that, like, you're guessing this is an extremely long time ago. It was actually in very early 2021. Okay. Ominously, January 5, 2021. So a date that Nothing else
Adam Leventhal:going on. Right.
Bryan Cantrill:A date that will live in infamy minus one, and I definitely am never gonna forget that day. I'm definitely gonna
Adam Leventhal:prove
Bryan Cantrill:that date, you know.
Adam Leventhal:Infamy adjacent.
Bryan Cantrill:Yeah. Infamy adjacent. You know? The Broncos played Monday Night Football on September 10, 2001. Have not forgotten it.
Bryan Cantrill:So I feel like this is in the same boat. So, January 5th, 2021. Super early. And one of the things that I learned from relistening to that, and we record all this stuff, which we've talked about in the past, I think it's so important, and something folks should definitely do. I mean, recording your conversations is so incredibly valuable.
Bryan Cantrill:There's this big kind of bathtub curve. Things are valuable shortly after the recording: someone is out or something, and they can catch up on it. And then the bathtub of it is that there's this kinda long tail where, all of a sudden, something spikes up and becomes really interesting. And it became really interesting to me to go back and relisten to that recently.
Bryan Cantrill:And Katie can't join us today, but she gave us her well wishes, so she'll catch up on the recording. But, listening to Katie, because we had Katie in to discuss her presentation on distributed sagas, where she had kind of found this previous work on sagas. And, Adam, one of the things I learned from that discussion is that you were the one that put us onto this. Do you remember any of the origin of that? Yeah.
Adam Leventhal:So, as I recall, we knew that we wanted Omicron to coordinate activity from a bunch of different services. I think I thought of that as sort of workflows. I don't know. There are probably other terms for that kind of coordination. You're talking about lots of different services, and if something went wrong, you wanted to know how to unwind it and get yourself back into some reasonable state.
Adam Leventhal:So, this notion of transactionality across this distributed system. I had first bumped into Katie's work around distributed sagas when I was doing something much more explicit, not looking at microservices. At my previous company, we were talking to a bunch of different APIs and looking at what it would look like to create this kind of transactionality across disparate APIs: an API from Google, an open API for your calendar software, or whatever. And if you wanted to do this kind of composite operation and part of it failed, how would you undo that? So that's how I stumbled onto it. I never got anywhere with it at the previous company, but it seemed like a good fit for what Dave was thinking about.
Bryan Cantrill:For sure. And Dave is a good segue to you. I mean, you were one of our earliest engineers, and you were kind of thinking about this stuff from the outset. And obviously, you and I had worked together at Joyent. So this is not your first control plane.
Bryan Cantrill:Can you take us through kind of, your first explorations of workflows and sagas?
Dave Pacheco:Yeah. So, I remember this period. It's later than I expected it was. This was, like, the last quarter of 2020, but that's when we started thinking about this. And I remember thinking we're gonna have a whole lot of things that need to do these complex operations that touch a bunch of services.
Dave Pacheco:And the example we always used is VM provisioning. Yeah. With VM provisioning, you've got to do a whole bunch of stuff. You could think of simple things like: pick a server, allocate resources on the server, tell it to actually go start this thing. But then we also have to program the switch with, like, routes, so that things from the outside can reach that thing. So it's a simple example where you've got some database state and you've got two other services that need to be coordinated. And, again, you need to make sure that if you fail to program the switch, or something late in the process fails, then you also deallocate the resources.
Dave Pacheco:Right? And we would also talk about like SSD firmware updates might look like this. VM migration might look like this. Replacing a faulty server might look like this. And so we figured it would be a pretty reusable thing.
Dave Pacheco:And, you know, traditionally, probably a long time ago, you would just have straight-line code with a bunch of, like, "if error, goto out" or whatever, you know. Undo whatever I've done so far. But then what if you crash at various points?
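(For illustration, here's a rough sketch of the straight-line style Dave is describing. This is not Omicron code; the types and helpers here, Db, Switch, ProvisionError and so on, are all made up.)

```rust
// A sketch of the traditional "if error, goto out" approach, with made-up
// types standing in for real services.
struct Db;
struct Switch;
#[derive(Debug)]
struct ProvisionError;

fn pick_server(_db: &Db) -> Result<u32, ProvisionError> { Ok(7) }
fn allocate(_db: &Db, _server: u32) -> Result<u64, ProvisionError> { Ok(42) }
fn deallocate(_db: &Db, _alloc: u64) {}
fn program_switch(_sw: &Switch, _server: u32) -> Result<(), ProvisionError> {
    Err(ProvisionError)
}

fn provision_vm(db: &Db, sw: &Switch) -> Result<(), ProvisionError> {
    let server = pick_server(db)?;
    let alloc = allocate(db, server)?;
    if let Err(e) = program_switch(sw, server) {
        // Manual unwind: undo whatever we've done so far.
        deallocate(db, alloc);
        return Err(e);
    }
    // But if the process crashes between allocate() and program_switch(),
    // nothing durable records that `alloc` exists, and we leak it: the
    // half-created "zombie instance" problem described next.
    Ok(())
}
```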
Bryan Cantrill:And it should be said that we lived this. Right? Whenever I think of this, there's a reason we think about the VM provisioning example, because that one is very vivid in my memory: this just long tail of problems, of instances that don't get cleaned up properly, or that are in this zombie state where they're half-created because something failed along the way. And it was not good, I feel. I think we were doing our best, but it was complicated.
Dave Pacheco:It's a lot. And it can get pretty complicated when you talk about how an instance create request could say, I also wanna create 3 or 4 disks and attach them, or I want to attach some other disk that already existed. You need to make sure that you delete the disks that you created, but not the ones that you just wanted to attach. It's just a lot of different cases. It's pretty clear, I think, that you don't wanna write ad hoc code to do all that stuff. I think you wanna find a way to decompose it into these subproblems, where you just have a simple thing, an action is what we ended up calling it, and then something that can undo that, and compose these in pieces and stuff like that.
Dave Pacheco:And we had done something like this at Joyent, but I remember it always being kind of a pain point, in part because of the language it was written in, and it was pretty limited. You had these named things that you could put in order and it could undo them, but there certainly wasn't a lot of strong typing. There wasn't a lot of branching and looping and stuff like that. It was very static. Also, I think there was a big challenge around the dependencies between the components. The workflow would live in one place, but it would call APIs and services that were in some other place, and so you had these flag days every time you changed anything. And I think one of the things I noticed, Adam, I'm not sure what your take on this was, but I feel like when we looked at a lot of the stuff that was already out there for this, a lot of it seemed aimed at solving an organizational problem as much as a technical problem, which is that you've got, like, a 100 teams working on a 100 services, and the workflow thing winds up being a hub that coordinates everything.
Dave Pacheco:It's like the place that you put all the things that all the pieces that talk to each other or something like that. And we and we were really hoping we wouldn't have that problem at least not for a while. I think it's fair to say we haven't had that problem yet. We haven't we don't have a 100 teams yet.
Bryan Cantrill:We don't. And I think that's a very good point about Conway's law expressing itself in these workflow engines. And, Dave, in the chat, I just dropped a link to RFD 107, which was the RFD that you and Adam wrote exploring different kinds of workflow engines. So, I thought that was good. Did you go back and reread this recently, by the way?
Bryan Cantrill:It's kind of interesting to go back and reread it. It's it's amazingly prescient, I feel.
Dave Pacheco:I haven't looked at it recently.
Bryan Cantrill:It holds up well. It's probably gonna get better.
Adam Leventhal:Today, I think, as I often am struck by some of these early RFDs where we investigate lots of potential technologies, it's one of these ones where I'm like, why did we think any of these things was gonna work
Bryan Cantrill:for us? Like, clearly, we had to build our own thing. I mean, and and,
Adam Leventhal:obviously, it wasn't clear at the time. But that that's how I feel about a lot of these early ones. Such optimism about the the, you know, what's printed on the label and how it's gonna work. And
Bryan Cantrill:Okay. So, Adam, now look. It's been a long time, so you can come clean. Do all of these workflow engines actually exist?
Adam Leventhal:Did you are
Bryan Cantrill:these real names?
Adam Leventhal:Oh, you got me. Did did
Bryan Cantrill:you in particular, are Oozie and Zeebe both things? I mean, did you just come No.
Adam Leventhal:Those are real. It's n8n. That was the one that I made up. I see. I snuck that one in.
Adam Leventhal:Just
Bryan Cantrill:to see if anyone's hey. It's just like one of these brown M&M tests for RFDs. And you just like, hey.
Adam Leventhal:No. When I said
Bryan Cantrill:Oh, yeah. Right? You may not remember. You said you read every word of this RFD, and you thought it was great. Well, as it turns out, guess what you thought was great?
Bryan Cantrill:n8n. I made that up.
Adam Leventhal:Yeah. It turns out that was during the pandemic. I just didn't have any childcare, so all week, I just told you I was working on n8n, and that was just my code name for taking care of my infant.
Bryan Cantrill:No. You're sure that's not Nathan? I mean, is that, like I and also, I gotta say that now that I'm looking at section 37, n8n, there's, I mean, not a lot of detail there. It probably doesn't Google very well. I'm just saying it's very convenient.
Bryan Cantrill:I'm not totally convinced these things exist.
Eliza Weisman:That's all I got. I'm pretty sure n8n is
Dave Pacheco:short for Nubernetes.
Bryan Cantrill:I do not know if this is like the affirmation pronunciation thing. I was just like, is this a bit?
Adam Leventhal:This is the long con coming home to roost for sure.
Bryan Cantrill:So you were investigating a bunch of these workflow frameworks. Only some of them made up. Some of them real, presumably. And, Adam, it's kinda interesting what you're saying: in retrospect, how could we I feel kinda the same way about Hubris, the operating system. Like, how could we ever have thought that we would have been able to use something off the shelf?
Bryan Cantrill:Of course, we need to do our own thing. But it was good. I mean, it's always good to do that investigation, because you don't wanna walk past something that actually does everything that you need. You really do wanna make sure you're doing all of your homework on this.
Adam Leventhal:Absolutely. Especially for something that feels like it could be a thing. Right? Like, it could be that other people have had this problem. I mean, even as we describe instance provisioning, it is not unique.
Adam Leventhal:Right? And we have, you know, the existence proof of Katie's work. Right? Clearly, other people have had similar problems. So have they distilled them into solutions that are sufficiently close to what we need?
Bryan Cantrill:Yeah. So you were familiar with Katie's work. And I think, you know, I'd run across it, but we got her in, we talked about it. That discussion was great. That was really interesting.
Bryan Cantrill:One of the questions that we had was, you know, is there a crate, a library, a something that we're missing out there that does this? And she was like, no, this is very hard to make general purpose. And, Dave, as we were investigating it, it felt like this is the right level of abstraction. Do you wanna explain what sagas are a little bit? Because that'd probably be helpful.
Dave Pacheco:Sure. I'll do my best, and people can jump in where I've forgotten stuff or whatever. But basically, the way I think of a distributed saga is that you decompose a complex operation into a bunch of these actions. There's also a terminology thing; I'm gonna use our terminology for a second.
Dave Pacheco:A bunch of these actions that are individually idempotent and have associated undo actions, and then you build a DAG of these things, a directed acyclic graph. So you have each of these actions be nodes in a graph, and then you have lines between them, which are the dependencies between them, and basically say this thing depends on that other thing. And I'm afraid I have some background noise. Is that a problem?
Bryan Cantrill:You're good. No. I think that, in the spirit of Oxide and Friends, we would ask that the clarinet lesson also kick in there. That's possible. I mean, I know the kids are young, but maybe that's
Dave Pacheco:Okay. I think, Just
Adam Leventhal:start them early.
Dave Pacheco:I think it's a little quieter here now. So what we ended up building was Steno, which is a framework for this in Rust, where you basically define your actions and you define the undo actions, and there are traits for these things. And then you can describe the shape of the graph. And Steno's job is to execute it according to the rules of distributed sagas. A distributed saga is basically a bunch of rules about how to execute this thing such that it always converges to either all the actions having completed successfully, or all of the undo actions having been run for everything that had been run.
Dave Pacheco:And the other way it gets phrased in the talk is that you can sort of think of this like a non-atomic database transaction, where you have a bunch of things that you're gonna do: you're gonna go all the way to the end, or you're gonna undo the whole thing. But the things you're gonna do may not be database operations; they may be requests to other services or something like that. How badly did I butcher that explanation?
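(A minimal sketch of the shape Dave is describing, with hypothetical types; this is not Steno's actual API. It assumes the serde_json crate. Each action is idempotent, pairs with an undo action, and the actions are arranged in a DAG.)

```rust
use std::collections::BTreeMap;

// An action either succeeds, producing a value, or fails, which triggers
// the unwind of everything that has run so far.
trait SagaAction {
    // Must be idempotent: safe to execute again after a crash.
    fn run(&self) -> Result<serde_json::Value, String>;
    // Must tolerate a partially-executed or repeated run().
    fn undo(&self);
}

// The DAG: each node records the nodes it depends on. The executor runs a
// node only once its dependencies have completed; on any failure, it runs
// undo() for every node that ran, in reverse dependency order, so the saga
// converges to all-done or all-undone.
struct Dag {
    nodes: BTreeMap<&'static str, (Box<dyn SagaAction>, Vec<&'static str>)>,
}
```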
Bryan Cantrill:That's great. And I think that what we learned from talking to Katie is that when she'd used this pattern for Halo, they actually hadn't needed to do a bunch of the undo actions. They were able to fail forward for the use case that they had.
Dave Pacheco:But That makes sense. Yeah.
Bryan Cantrill:Which I think is kinda consistent with our experience. And I just think this is a helpful way to think about the problem, in terms of taking it apart and then allowing us to reason about these pieces at a slightly smaller granularity: the action and the undo action.
Dave Pacheco:Totally. And if you squint at it, if you go back to the handwritten code I was describing earlier, where you just do a bunch of things, and if they fail, you do, like, "goto out". And then you ask, well, what would happen if I crashed at various points? This is basically serializing state for you at each of the points where you might have failed, and keeping track of what you have to undo. So I really think of it in terms of decomposing a problem.
Dave Pacheco:Like writing a saga is not like whole cloth work, it's like doing whatever you would have done, but taking the pieces and putting them into these small actions so that the framework can keep track of what you have done and what you have to undo.
Bryan Cantrill:That's right. I mean, in that regard, it is more of a pattern, then. And it makes sense why there are not a lot of libraries or crates that implement this, though maybe there are more now. And then Steno, I mean, do you think Steno could be used outside of Omicron?
Dave Pacheco:Definitely. And I think a few people have. There we've definitely gotten pull requests from people outside who have at least tried to use it. I don't know how successful it was.
Bryan Cantrill:So we we so we took a swing at making something that was general purpose ish. Yeah.
Dave Pacheco:So Steno doesn't know anything about Omicron at all. It's just got these traits around actions and undo actions, and it plugs in certain pieces. The way the implementation of this thing works is it keeps a log of all the things that it's doing, basically before it does them. I mean, it's very simple. It's not magic here. But basically, before starting any action, it durably logs that it's about to start the action.
Dave Pacheco:And then when it finishes, it durably logs that it finished. And then it does the same for all the undo actions. And Steno, basically, that's another trait that you implement for that. So our component, Nexus, incorporates Steno and fills in these traits with our specific actions and with a storage back end that writes these things into CockroachDB, so that we know it's strongly consistent, highly available, and all that stuff.
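(Again a hypothetical sketch of that write-ahead log, not Steno's real types: an event is durably recorded before and after each action, and the storage backend is a trait, so Nexus can supply CockroachDB while tests supply memory or a file. Assumes the serde, serde_json, and uuid crates.)

```rust
use serde::{Deserialize, Serialize};

// One record per state transition; replaying these reconstructs the saga.
#[derive(Serialize, Deserialize)]
enum SagaEvent {
    Started { node: String },
    Succeeded { node: String, output: serde_json::Value },
    Failed { node: String },
    UndoStarted { node: String },
    UndoFinished { node: String },
}

// The backend only needs append and replay; durability and consistency are
// its problem (CockroachDB behind Nexus, an in-memory Vec in tests).
trait SagaLogStore {
    fn append(&mut self, saga_id: uuid::Uuid, event: SagaEvent) -> Result<(), String>;
    fn load(&self, saga_id: uuid::Uuid) -> Result<Vec<SagaEvent>, String>;
}
```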
Bryan Cantrill:So I don't think I realized that the storage back end is actually split off from Steno, that we fill that in on the Nexus side.
Dave Pacheco:That's right. Yeah. And that was, if nothing else, helpful for testing, because Steno has these little command line tools you can run that spit out the log as just a file of JSON records. And you can literally truncate the file and then replay the saga from that point to simulate a crash. And the test suite does that kind of thing too.
Dave Pacheco:It was a helpful way to factor the whole thing. The log can be stored however you want.
Bryan Cantrill:Right. And so it implements this in-memory store inside of Steno to, as you just said, be able to go test this stuff.
Dave Pacheco:That's right. Yeah.
Bryan Cantrill:Yeah. That's great. Okay. So we start Question
Adam Leventhal:from chat.
Bryan Cantrill:Yep.
Adam Leventhal:How do sagas, would you say, differ from a workflow? I know workflow is a pretty abstract term.
Dave Pacheco:Yeah. I'm not sure what is meant by workflow.
Adam Leventhal:Yeah. I think that's right. I think maybe sagas are a type of workflow, or our distillation of what we think of as a workflow. Is that fair?
Dave Pacheco:I think that's right. And one thing I'll say is, when I hear workflow, I think of something that is decoupled from the specific workflows that it runs, for lack of a better word. And this is the organizational problem I mentioned earlier: there's one service which just knows how to run workflows, and that's what it does. And the other services that use it might upload some description of what actions they want to take, and that makes up the workflow.
Dave Pacheco:And ours is not really like that. It maybe could be, but, I mean, right now, the way we have incorporated it into the system, that would be pretty hard. But that's by design, because it introduces these API dependencies we didn't wanna have to deal with. Did that make any sense? It's not as general purpose in that sense.
Dave Pacheco:Like, you can't run a Steno server to which you upload sagas and run them, I think.
Bryan Cantrill:Yeah. I also think that sagas are a pattern, and I just dropped a link to this paper, from the eighties, 1987, I think, that's coming from databases rather than from distributed systems. You're taking this kind of database pattern and applying it to a distributed system. And for me, workflow is just a little too abstract as a term.
Bryan Cantrill:It just means too many different things. Didn't I implement something that we called workflows at Fishworks? It meant something totally different. Yeah. It just feels a little too abstract as a term.
Bryan Cantrill:And sagas, I think it's a great name. Admittedly, my wife is a literature PhD in Scandinavian studies, so of course I would say this. But I like that it's not an overloaded term, which I think is actually valuable here.
Dave Pacheco:Yeah. If I can jump in, someone asked a great question in chat that I think clarifies the abstraction a little bit. They said, what happens if you do get the log for start but don't get the log for end? Is a half-completed operation able to be rolled back? So the question is: this framework has logged the thing saying I started an action, and then crashes and comes back up.
Dave Pacheco:It doesn't know what happened. Right? The framework will execute the action again, and that's why it's critical that the actions be idempotent. And even if it knows for some other reason that the saga has failed and will unwind it, it still has to execute it again so that it can run the undo action in some known state. Because you can't run the undo action for, like, a half completed thing.
Dave Pacheco:So that's how that gets handled.
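(The recovery rule Dave just walked through, as a sketch against the hypothetical SagaEvent log from the earlier sketch; the returned strings just label the three cases.)

```rust
fn recovery_decision(log: &[SagaEvent], node: &str) -> &'static str {
    let started = log
        .iter()
        .any(|e| matches!(e, SagaEvent::Started { node: n } if n == node));
    let finished = log.iter().any(|e| {
        matches!(e,
            SagaEvent::Succeeded { node: n, .. } | SagaEvent::Failed { node: n }
            if n == node)
    });
    match (started, finished) {
        // Logged "started" but no outcome: we don't know what happened, so
        // run the action again. This is safe only because actions are
        // idempotent, and it's required even if the saga is unwinding, so
        // that the undo action starts from a known state.
        (true, false) => "re-execute the action",
        // Both records present: trust the durably-logged outcome.
        (true, true) => "use the logged result",
        // Never started: schedule it normally once dependencies are done.
        _ => "run when dependencies complete",
    }
}
```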
Bryan Cantrill:And does one's undo action then need to deal with the fact that this thing has potentially been run twice? Or is it that the actions all carry the expectation of idempotency, so it doesn't actually matter? That the fact that the action ran at least once is all you should need to know?
Dave Pacheco:I mean, the hope is that that is all you should need to know, but it is true that the action may have run more than once, or partially run more than once. And the undo action does need to work in that case too. And I don't know how much we wanna jump too far ahead, but we did run into problems like this. And the classic Yeah. Yeah.
Dave Pacheco:Where you run into this is if your action is to insert a record into a database with a primary key: that will work the first time you run it, but the second time you run it, it'll fail, because the record already exists. And we've definitely had that bug. And Sean built a really cool test. It's not a framework or runner, but it's like a helper, I guess.
Dave Pacheco:That takes any saga at all and runs it up to every point, and then will replay a particular action. So we basically test running every action more than once. And then there's a bunch of others: we test inducing a failure and unwinding it and then running undo actions multiple times. And that thing is pretty generic. I think you can basically hand it any saga, which is pretty cool.
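(To make the primary-key bug Dave mentioned concrete, here's a sketch with illustrative SQL, not Omicron's actual queries or schema:)

```rust
// The naive action fails when the framework replays it after a crash:
// the second execution hits a duplicate primary key error, even though
// the first execution did exactly what we wanted.
const NAIVE: &str = "INSERT INTO allocation (id, server) VALUES ($1, $2)";

// An idempotent version tolerates its own earlier (possibly partial) run.
// The id is generated once, in an earlier saga node, and reused on every
// execution, so replaying the action converges on the same record.
const IDEMPOTENT: &str = "INSERT INTO allocation (id, server) VALUES ($1, $2) \
                          ON CONFLICT (id) DO NOTHING";
```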
Bryan Cantrill:Yeah. That is cool. That's really cool.
Adam Leventhal:No. It's really cool. I mean, it highlights some of the benefit of putting some structure around it, as you've described, Dave. Whereas if we just had a bunch of conditional code, even with manual logging or manual replay or serialization or whatever. The fact that, if you express your composite action as a saga, you get all this more or less for free, is really neat.
Eliza Weisman:Just going back very briefly to the discussion of what's the difference between a saga and a workflow, I wanna highlight Tef's comment from the chat, because, yeah, I just really want this on the podcast. Tef has written: when I hear sagas, I think transaction semantics enforced at the application layer, which, yeah. And when I hear workflow, I hear a DSL that doesn't have a for loop.
Bryan Cantrill:That's very good. Yeah.
Eliza Weisman:I wanted that on the record.
Bryan Cantrill:For good reason. That's a very concise way of phrasing it. Yeah. I think so too. That really matches my understanding.
Bryan Cantrill:Dave, is that right? Adam? It's a pretty good description, I think.
Dave Pacheco:Yeah. That also matches my my feeling. I would not have thought to say it so well.
Bryan Cantrill:So, Dave, as we're getting into this, now we're kind of through 2021 into 2022, and we're implementing more and more of this control plane using sagas. Where are we finding the wins, and where are we finding some of the challenges?
Dave Pacheco:Yeah. So, actually, before we even get there, I wanna mention there are 2 parts of doing this in practice that are really not addressed, at least in the talk. I don't know if there are other places where they're addressed. One is SEC failover. The SEC is the Saga Execution Coordinator, and it's the thing that, in our case, is running Steno, and it's executing all these actions.
Dave Pacheco:Right? And at some point, for availability, you may wanna say: what do I do if that thing goes away completely? And we spent a lot of time wrestling with this, and ended up basically punting on it, because it seemed very hard to extend the model of sagas in the way that you would need for that to be implementable generically. In other words, you wanna be able to say, I don't know, that 2 sagas can't commit the same action twice, like 2 executions of it.
Dave Pacheco:But the problem is you've already done the work. So you could have a different model where you have not an undo action, but, like, an "undo a duplicate" action. But it just ended up getting super confusing, and so we ended up punting on that. And that's basically been okay, and we can talk about how that actually works and isn't a problem for us, but that's getting a little off the rails.
Dave Pacheco:The other big area is sharing state. So the canonical example of distributed sagas, at least from the talk, is booking a trip. And it's like, you book a trip by booking a hotel and a car and a flight. But those don't share any data between them. Right?
Dave Pacheco:In the case that we were describing of instance provisioning, you know, step 3 is we picked a server, and step 5 is we allocate some resources from it. We need to know what server we picked in step 3. But how do we get that information? We can't just store it in memory, because we may not have run step 3 in this instance of this process.
Dave Pacheco:We may have recovered the saga after a crash, and that information is not in memory anymore. So we basically invented a way to do that, and that was something we had to figure out and bolt on top of it. It wasn't something covered, certainly not in the talk. Yeah. Interesting. I don't know how much we wanna
Dave Pacheco:dive into those or not.
Bryan Cantrill:Yeah. I think that it is worth diving into, because this is where it gets kind of thorny, but also interesting. Right? What were some of the trade-offs there, in terms of that implementation?
Dave Pacheco:I think the big one is, the sort of obvious way you could do this is you give each saga, like, a bucket of key-value pairs that it can just write to. And, you know, each action can basically store whatever it wants, remove key-value pairs, or whatever. Does that make sense?
Bryan Cantrill:Yeah.
Dave Pacheco:But it felt like it introduces a lot of weird questions. Like, can you see the results of actions that haven't finished yet in subsequent actions that don't depend on them, or even do depend on them? Well, I guess you can't be running if you do depend on them, but if you don't depend on them. And would the behavior then be nondeterministic, based on what order things happen to run in? And it also meant that the undo stuff would have to be very careful about what possible bucket of state it could get. Right?
Dave Pacheco:It's maybe hard to visualize without understanding the alternative we did. But you are back at this question of: how do I manage this arbitrarily complicated key-value-pair state? Do all the code paths have to be right? And how do I know that I've checked all of them? What we ended up doing is saying the only way to share state is that each action produces a value.
Dave Pacheco:That value is immutable. And any action can look up the value produced by any dependent action. Or sorry, any action on which it depended. Does that make sense?
Bryan Cantrill:Interesting. Yeah, absolutely. And so this was a topic of discussion with Katie in that 2021 conversation: how do we pass data between the different elements of a saga, the different actions? How do I consume a value that a previous action creates?
Bryan Cantrill:And there was ambiguity there. I mean, in her case, she basically hadn't done that, but it'd be interesting to think about. So that's ultimately where we came to: you get kind of one value out of an action, and then an action that is dependent upon that action can consume that value. Is that right?
Dave Pacheco:That's right. And to be clear, when we say one value, it's any JSON-serializable value, so it could be as complicated as you want. It's just that there's one, and it's associated with your action. And to make it concrete, you could imagine, again, thinking of: step 3 is we pick the server.
Dave Pacheco:Step 5 is we allocate resources on that server, which may mean that we inserted some database records with some UUIDs. And the undo action for that needs to take those UUIDs and remove the corresponding allocation from the database. And you can imagine doing this with the bucket of key-value pairs, and you can also imagine debugging nasty crashes in that, where we expected some key to be there, but it wasn't, because of some uncommon and hard-to-induce sequence of execution where we just didn't happen to set that key or something like that. Whereas this way, it's much simpler to think about: that value is always there, because it's produced by the action on completion.
Dave Pacheco:And that's implemented by Steno, so it's atomic with respect to completion of the action. You can't have an action completed and not have its output, is what I mean. And it always has a particular form.
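(A sketch of that rule with a hypothetical context type, not Steno's API: one immutable, JSON-serializable output per action, visible only to nodes that depend on it.)

```rust
use std::collections::BTreeMap;

// Outputs of completed ancestor nodes only: an action cannot see values from
// nodes it doesn't (transitively) depend on, and anything it can see is
// guaranteed present, because the output is recorded atomically with the
// action's completion.
struct ActionContext {
    ancestors: BTreeMap<String, serde_json::Value>,
}

impl ActionContext {
    fn lookup<T: serde::de::DeserializeOwned>(&self, node: &str) -> Result<T, String> {
        let value = self
            .ancestors
            .get(node)
            .ok_or_else(|| format!("{node:?} is not an ancestor of this action"))?;
        serde_json::from_value(value.clone()).map_err(|e| e.to_string())
    }
}

// e.g., the undo action for "allocate resources" recovering the server
// chosen in an earlier node (hypothetical node name):
//     let server_id: uuid::Uuid = ctx.lookup("pick_server")?;
```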
Andrew Stone:Hey, do you wanna go into what the actions are and how they're implemented? And how storing this data in JSON makes it tricky for us to upgrade and migrate the coordinators?
Dave Pacheco:Yeah. Do do we wanna jump into that?
Bryan Cantrill:Was it too early for that?
Dave Pacheco:I I don't think so. So what level are you thinking of in terms of the implemented bit, like the Rust, database work?
Andrew Stone:Talk about how we write our actions in Rust, and the code itself is not stored in the database. Right? It's just installed code that's running. And so how we coordinate that with the actions and the log that is stored in the database is kinda tricky.
Dave Pacheco:Yeah. That's a really important point, because when you think of workflow things, as someone said, it's a DSL without a for loop. Right? You're often writing in some other language. Well, whatever. In our case, it's Rust.
Dave Pacheco:And you could in principle be doing whatever you want in Rust, although there are sort of limited facilities that you have in the context of an action. But it's arbitrary Rust code, and it's the return value of that function that becomes the result produced by the action. And the downside of this, maybe downside, I guess it depends on how you look at it, is that the serialized state of the saga is very tightly coupled to the shape of that DAG and the implementations of those functions. In particular, imagine yourself being halfway through provisioning an instance, which is, again, this many-step process, when we upgrade the execution coordinator, so we have a new version of the saga running.
Dave Pacheco:You definitely can't just load that log and match up the values by the names of the nodes, and rerun it where you were, and pray for the best. Right? Because there may be new actions that Or rather, the way we initially built this Yeah. Sorry. I should have said this earlier.
Dave Pacheco:The way we initially built this, the DAG was totally static at build time, and it wasn't encoded anywhere. Not in the database, not anywhere, except by the Rust code that implemented that thing. So if you loaded the log from some previous version, we might try to line up actions and nodes and outputs and stuff like that, but it would not at all necessarily correspond. You may have nodes in the graph that don't have outputs that should, and vice versa, and the whole shape of it might be different. Is that problem And
Bryan Cantrill:And so, just to be clear, this is a problem because we have updated the system while the saga is running.
Dave Pacheco:That's exactly right. That is the specific problem or the the specific case we're talking about here.
Andrew Stone:Yeah. You could imagine having the code stored as, you know, serialized WASM or something, along with that
Dave Pacheco:one quickly.
Andrew Stone:We did discuss it. And then the saga log would be stored with the code to run it, and so you could have different versions of different sagas running. But we did not have time to implement that. And that has a lot of problems also, in terms of, like, maybe you wanna update the WASM executor.
Andrew Stone:Like, how do we even integrate that into the system when we're trying to build, like, 8 million things?
Bryan Cantrill:Well, it also means, like, now what is an update? Because, by the way, there's some software out there that has stored itself into the database, and it is the old software now that is gonna be running. I mean, there's still a lot of complexity there.
Andrew Stone:Yeah.
Bryan Cantrill:And, yeah, we actually brought this up with Katie, this WASM idea. Adam is implying that we may have had some mild altered states.
Adam Leventhal:Look. Everyone has, like, packed a bowl with WASM and gotten weird.
Bryan Cantrill:This is like man, it was the fall of 2020. The state was on fire. We were inhaling all sorts of things. I mean, it was bad news. It was a weird time.
Bryan Cantrill:It was a weird time. It was a real weird time. So, Dave, first of all, to Andrew's point: how do you square this update problem? This is really, really thorny.
Dave Pacheco:I think what we concluded was that these are intrinsically tightly coupled. I mean, you could imagine a world where, let's say, we'd gone and done the bucket of key-value pairs. Right? You could imagine building that in such a way that the code can look at the serialized key-value pairs of a previous version of itself and figure out what it should do. Like, reconstruct itself in the new version.
Dave Pacheco:Like, that's conceivable. And in that sense, the choice that we made made that problem harder. But I think we just concluded that's gonna be really hard to get right and to test, and we cast it off quickly.
Bryan Cantrill:That's right. Oh my god, it's gonna be hard to get right. And it's also, in some ways, the worst kind of code, because this is code that will only be executed under these limited conditions that will only exist for a finite period of time. You know, in 5 years, those conditions won't exist, but the code will still exist.
Dave Pacheco:Right. Yeah. And so that path seemed like a mess. And so we basically punted. I mean, you could say we punted, or we solved the problem at a different layer, but we basically said, we're not gonna do that.
Dave Pacheco:We're always going to say that a given version of the software is responsible for finishing or aborting all the sagas that were created at that version. So if we're gonna do a rolling upgrade of our fleet, we're gonna quiesce the old versions, let them finish their sagas, wait for them to finish, and then remove them. And new sagas are gonna be created on the new version. And that means we can't have a situation in which the log that we have stored in the database corresponds to a different version than the one that is currently running. It does have one unfortunate downside, though, which is that you can't really fix a bug in a saga in a software update.
Bryan Cantrill:Interesting. And and
Dave Pacheco:That is, it would require like, if the saga got stuck, for example.
Bryan Cantrill:Right. If the bug in the software is that the saga does not terminate, that's gonna be a little bit ugly from Yeah. From an update perspective. But there's a degree to which there is a certain minimum amount of correctness that you require in the previous version of the software, which is always unfortunate, and you wanna keep that surface area as tight as possible.
Bryan Cantrill:But in this case, the only way to win is not to play, to make a WarGames reference. You gotta say, like, sorry: previous sagas have to have completed their execution, and we've gotta quiesce the system with that in mind.
Andrew Stone:I think you have an even trickier problem if the bug is in unwinding the saga. If you wanna unwind it to clean it up, and you can't progress forward and you can't progress backwards, what does that leave you with? Do you have to go in and manually clean things up? Or maybe you just write a new saga that goes in and cleans up all the old stuff. Like, we've done that for other things.
Andrew Stone:We've written actions that run on startup with the sled agent to, like, migrate over old file formats and clean things up. So I guess that's one way forward.
Bryan Cantrill:But I think this does illustrate some of the challenging constraints that we have that are a little bit unusual. One is that we are shipping this distributed system as a product. So when you have an Oxide rack, this distributed system is running inside of that Oxide rack. And we can't have, you know, Dave, Mark Cavage's old term, meat in the loop. We can't have meat, which is to say a human being, that is actually cleaning the stuff up.
Bryan Cantrill:We can't have an operational runbook that cleans the stuff up. So we do need to be really carefully considering problems like update. That's not something that we can just kinda dismiss, or dismiss to our future selves, without carefully considering it. And, I mean, it's hard.
Bryan Cantrill:As Andrew is saying in the chat, it's hard to take a distributed system and actually ship that as a product. But that's part of the constraint that we've got, and I think it's a little bit different than one might have if one were operating this as a service. There would be corners we could cut here, because we'd know that, well, if that happens, we go in and we clean it up. It's like, well, no.
Bryan Cantrill:We can't do that.
Adam Leventhal:Yeah. Worth noting that there are other pieces of software, not in this domain, that we evaluated. I remember one critique, Bryan, you may recall this: someone said that a piece of software was operated, not shipped. And that was like
Bryan Cantrill:Are we are we deliberately not naming that software? Why are we not naming that software?
Adam Leventhal:I don't know. I I feel like we didn't need to throw out I
Bryan Cantrill:feel that if we
Adam Leventhal:particular storage software.
Bryan Cantrill:If there was a particular storage software that was operated and not shipped, well, yeah. That's Yeah.
Adam Leventhal:How many guesses would it take?
Bryan Cantrill:Yeah. Sorry. Sorry to sorry to keep Spoiler
Adam Leventhal:alert. Right.
Bryan Cantrill:Spoiler alert: Ceph is very hard to ship. I mean, we've got other challenges with Ceph, to be clear. But, also, in Ceph's defense, it highlights the challenge of having a distributed system that does not ship with a human being equipped with bash scripts. That's right. Which, the reality is, most distributed systems have a human being with bash scripts somewhere. You dig deep enough, and you'll find it. So, alright.
Bryan Cantrill:So, Dave, we've kind of made some of these simplifying assumptions. It does feel like that was a bit of a breakthrough, in terms of: okay, we're gonna have this kind of single value, even if it's a complicated value, that we're gonna emit from every action. I mean, it feels like you've simplified the abstraction in a way that makes it easier to deliver a more robust system. Is my comment too much?
Dave Pacheco:That felt pretty huge to me. Yeah. Yeah.
Bryan Cantrill:And I think it's like one of those things where, you know, it always feels to me like the big breakthroughs are when people realize, like, hey, what if we actually don't solve a bunch of this problem? Adam, I always think of what we called the simplifying assumption, when we were wrestling with a really nasty problem with respect to, I think that was probe
Adam Leventhal:interface stability. Right. Yeah. Yeah. How do we how do we articulate the stability of every probe?
Adam Leventhal:And we're like, well, there are millions of probes. How do we talk about it in millions of different ways? It was, like, basically my first DTrace meeting, and I said, well, what if we just do it per provider? And I didn't even understand the tangent that you and Mike then went on. But, apparently, I had said something smart, so I didn't say anything else the rest of the meeting.
Bryan Cantrill:It was very, very helpful, because it can be really helpful when someone offers up: what if we actually simplify this system in a way that we've kind of implicitly thought is unacceptable, but what if that is actually acceptable? And then that is an acceptable simplification, and it is the gain that we get. And I think with that particular example, it's like, oh, god. Okay.
Bryan Cantrill:That was obviously a huge gain, and, Dave, it feels like in this case as well. Like, okay, that's such a big win in terms of the way we think about this thing that it's worth whatever cost we may give up. And I guess the update example is a more concrete example where, yes, there are gonna be some problems that we're gonna create by doing that. But, boy, the simplification just feels like it's more than worth it.
Dave Pacheco:Yeah. That definitely felt that way for me.
Bryan Cantrill:Okay. So as we're kind of moving through, what are some of the other things that we're finding? You know, where are we kinda finding the edges of this thing, and where are we finding it useful, and where are we finding it challenging?
Dave Pacheco:Yeah. So I kind of did a lot of the early prototype work, and at this point, the control plane really was just a prototype. I don't know that we even had sled agent provisioning at that point. So the instance provision saga was kind of theoretical, or simulated, or something like that.
Dave Pacheco:And I think a lot of other people have done a lot of the work on those sagas, and I'll be interested to hear their experiences, so they can point out all the ways in which this actually turned out to be terrible. The next thing that I got involved in was about a year and a half later, in the summer of 2022, when it became important for us to make sagas more dynamic and to support what we call subsagas. Like I said, when we first shipped this, the saga DAG was effectively fixed by virtue of the structure of the Rust program. There was no way to change that. And we have this problem where, when you go use the console, and people have used any of these, GCP, AWS, or whatever: when you go provision an instance, you can choose, okay, I can have n disks.
Dave Pacheco:And for each of them, I could create a new one, or I can attach an existing disk. And if I create a new one, I can say how big it is and what image it is and all this stuff. Right? And that's a big, complicated operation on the back end, right? To provision the instance, you may create a whole bunch of disks.
Dave Pacheco:And we already have a disk create saga, and we wanna use it in the instance create saga, and we wanna have all the properties of sagas, namely: if something fails in the middle of this thing, we undo the whole thing. But it's really not obvious how to do that, because creating a saga in the middle of a saga is not itself an idempotent operation. Right? If you run that action again, you'll create another saga, which you don't want. You want exactly one saga having completed the action.
Dave Pacheco:And so what we ended up doing, and this is a bunch of work that Andrew and I did, I guess it was like 2 years ago now, was to basically allow these DAGs to be dynamic. So we would store the DAG in the database as well, and be able to embed subsagas in an existing saga. So in the process of constructing the DAG for the instance create saga, you look at your arguments and you say, well, I'm supposed to create a disk, so I just append the whole subsaga, which is our disk create subsaga. And that was a bunch of work, and there were some interesting tricky bits, but it basically just worked, and then we started using it. And it was great. That part seemed fine, I think.
Dave Pacheco:I don't know, Andrew. What's your take? It
Bryan Cantrill:Well, you got an input Yeah. Sorry, Andrew. Go ahead.
Andrew Stone:Yeah. I was gonna say, I also think it worked great. That problem kind of appeared, and I hadn't touched Steno, and so I picked up this problem so I could work on Steno, essentially. But the cool thing that I think you made apparent to me, Dave, was that we talked about passing down data through the nodes of the DAG.
Andrew Stone:And when you have this inner DAG, the data in the outer nodes is not visible to the inner nodes. And once that inner DAG completes, its data is not available; only its last output node is visible to the outer saga, to the outer DAG. And so you get this kind of privacy, this encapsulation aspect, for free. You don't write code that names, say, DAG 1 dot key or whatever.
Andrew Stone:It just kinda works. And it's this kinda magical dynamic DAG building thing. But, yeah, I don't think we've had any real problems with it.
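(A sketch of the dynamic DAG construction Dave and Andrew are describing, with a hypothetical builder API, not Steno's: the DAG is built at runtime from the request, each disk appends a whole disk-create subsaga, and only the subsaga's single output is visible to the outer saga.)

```rust
struct DagBuilder {
    name: String,
    nodes: Vec<String>,
}

impl DagBuilder {
    fn new(name: &str) -> Self {
        DagBuilder { name: name.to_string(), nodes: Vec::new() }
    }
    fn append(&mut self, node: &str) {
        self.nodes.push(node.to_string());
    }
    // Embed another saga wholesale. Its internal nodes are namespaced, so the
    // outer saga can address only the subsaga's final output, and the inner
    // nodes can't see outer outputs: the encapsulation Andrew describes.
    fn append_subsaga(&mut self, name: &str, sub: DagBuilder) {
        for inner in &sub.nodes {
            self.nodes.push(format!("{name}/{inner}"));
        }
        self.nodes.push(name.to_string()); // the one externally visible node
    }
}

fn disk_create_dag() -> DagBuilder {
    let mut dag = DagBuilder::new("disk-create");
    dag.append("create_disk_record");
    dag.append("allocate_regions");
    dag
}

// Built at runtime from the request, so "n disks" means n appended subsagas
// rather than a fixed set of static nodes.
fn instance_create_dag(ndisks: usize) -> DagBuilder {
    let mut dag = DagBuilder::new("instance-create");
    dag.append("pick_server");
    dag.append("allocate_resources");
    for i in 0..ndisks {
        dag.append_subsaga(&format!("create_disk_{i}"), disk_create_dag());
    }
    dag.append("program_switch");
    dag
}
```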
Bryan Cantrill:And, Andrew, that feels like a huge feature to me: that a subsaga can be a kind of true abstraction, that you can't have something kinda reaching into that subsaga and pulling out some intermediate data.
Dave Pacheco:Yeah. It's one of those things that seems like, how else could it possibly work? It's a little bit like saying the local variables of a function should not shadow the local variables of the functions that called it. And you're like, well, I mean, obviously, right? But that was definitely something we had to realize about this, and we were, like, halfway through the implementation when we were like, wait a minute.
Dave Pacheco:If you look up the output of a previous node, can you get a node from the saga that you're embedded inside of? That would be pretty weird. Right? You shouldn't really be able to do that, because then, if you compose sagas, the behavior of a saga could totally change because you composed it inside some other saga, because it happens to pick up this other thing.
Andrew Stone:I think if you look at that PR, I had implemented it differently, which was, like, a much hackier way where essentially you could refer to those outer and inner nodes. Like, there's some namespace that's, like, explicitly declared. And then you, like, looked at me and you're like, yeah. This isn't good. And I was like, well, how should we be doing it?
Andrew Stone:And then he fixed it up, and I was like, oh, yeah. This is much better. And so, yeah, it was a good collaboration.
Dave Pacheco:I appreciate too that you haven't mentioned that that code, the Steno executor especially, but almost all of Steno, is some of the first Rust code I wrote. I mean, Dropshot was before this. This was like a year into Rust, but, like, there's a lot of not great stuff in there that makes it kind of hard to work on. And Andrew has been kind not to mention it, but I will: it's true. It's a problem.
Bryan Cantrill:Well, Andrew has been kind not to mention it to you. I mean, he just lights me up almost on a daily basis. True.
Andrew Stone:I send Bryan text messages. I'll have, like, 2 beers on a Saturday. Okay.
Bryan Cantrill:That's right.
Andrew Stone:Before I hit my 3rd beer, I just wanna mention that this code that Dave wrote, I'm really, really not enthused.
Bryan Cantrill:I've had to threaten to block him, honestly, just to cool him off. But so subsagas are kind of an important breakthrough as we're making this thing really real. And Dave, there's also a really important point that you mentioned that I feel we've used in a couple of other places, where you really wanna have, like, one body of code that does this thing, and you wanna then use that in lots of other ways. And, yes, this is, like, you know, a variant of not repeating yourself and a bunch of other patterns, but it's been really important for us that we don't have a bunch of different paths to do these things, and that we're able to leverage that same code.
Bryan Cantrill:So as I was pointing out, it's like, tell me more about this library thing you seem to be describing. It's like, look, I know it's not a deep thought, but it's an important one, because I feel it is easy to have accidental repetition of complicated things. Not simple things, but it's easy for that to creep into a system. Certainly, we had that in previous control planes.
Dave Pacheco:Yeah. And I think before we did this work I mean, you're talking about, for example, not having the instance create saga just create all the nodes of a disk create saga itself, like, in the middle of it or something like that.
Bryan Cantrill:Right.
Dave Pacheco:If you wanna actually run the disk create saga, you want that to be a composable thing with parameters and output and all that stuff. And yeah, I mean, we did think about it, and it just seemed like it was gonna be way too hard to keep in sync and to keep maintained if you're basically repeating these long, complicated sequences of actions in a bunch of different places. But I mean, we did have some nasty workarounds for a while. The fact that it wasn't dynamic meant that if you had, like, 8 disks the DAG for creating an instance would have 8 nodes that would go create a disk or whatever.
Dave Pacheco:And each of them would check, like, am I disk 0 through 7, and how many disks were actually there? And if my index is greater than whatever the number actually is, then I just do nothing. It's, like, very, very hacky. Right?
Dave Pacheco:And the dynamic DAGs cleaned a lot of that stuff up.
Bryan Cantrill:Yeah. You gotta be thinking, like, alright. There's gotta be a better way. And it's always nice when you get that better abstraction and are then able to go clean up a bunch of that stuff.
Dave Pacheco:Yeah.
Andrew Stone:Yeah. Exactly. I think that actually was the problem that got me involved. I was reviewing somebody's code, and I was really confused at what was going on there. And then somebody kind of explained that this was our workaround. I was like, oh, well, I'm in between jobs.
Andrew Stone:Let's see what can happen here.
Bryan Cantrill:In between tasks inside of oxide, just to be clear.
Andrew Stone:Yes. Yes.
Bryan Cantrill:You know, my my heart skipped a beat.
Andrew Stone:I think each task is kind of its own unique job. I just divide the salary up to account for it.
Bryan Cantrill:There you go. Is this a good segue to Eliza and the work that she was picking up, in terms of trying to use this stuff for new purposes?
Dave Pacheco:Yeah. I wanna hear Eliza's and James's and Greg's experiences here, where I know in many cases the abstraction needed some work. We'll say that.
Eliza Weisman:I actually think that the abstraction didn't need that much work. I think I just had to evolve my understanding of the abstraction. Actually, I think that the abstraction was great because it allowed us to do some, like, really ghastly things that I'm very proud to talk about.
Bryan Cantrill:Oh, god. That's, like, the tagline for Oxide and Friends right there.
Eliza Weisman:Oh, you're gonna ask me
Bryan Cantrill:things that were ghastly things that we're proud to talk about. Yeah. Go for it.
Eliza Weisman:So, honestly, this ended up being kind of one of the first really big projects that I've done in my career at Oxide. And it was really cool because our friend Greg, who is also here on the podcast, at least I hope he still is. I'm still here. Essentially, he left behind a very large document that was incredibly detailed and incredibly well written, telling me exactly what I was supposed to go do, and then he immediately disappeared for paternity leave, which was fun. And then, you know, he comes back and we have to sort of, like, see how well I've interpreted these sort of ancient writings that he's left behind.
Eliza Weisman:And, like, assess whether I actually understood the commandments that were laid down for me. But the sort of project that we were working on to give a little bit of background and, Greg, you know, feel free to cut me off at any point in this whole discussion.
Bryan Cantrill:But
Eliza Weisman:so we have, as has previously been mentioned, we have this component in the Oxide control plane that's called Nexus, which is sort of the central component where, like, all of the actual control plane lives. And Nexus is the thing that's actually executing all of these sagas and also serving the, like, internal and external APIs, and interacting with CockroachDB and doing all of these things. And right now, we are sort of dealing with the problem of how you manage the state of the VM instances that are actually running. And when Greg and I were starting to talk about this stuff, that state was kind of smeared across a bunch of different components in what ends up being a pretty unfortunate way. In particular, there's another component of the Oxide control plane, which has also been previously mentioned on this episode, called the Sled Agent, and that's exactly what it sounds like.
Eliza Weisman:It's the Sled Agent. It's the control plane's agent that runs on each individual compute sled in the rack and, you know, does things like making sure that the VM instances that you said were supposed to be there are actually there, and so on. And the state of a running instance was, at this time in the talk by Katie, she references a really wonderful paper that uses a phrase that's going to remain stuck in my head for the entire rest of my life, which is feral concurrency. Do go rewatch that talk.
Bryan Cantrill:Yeah. It's so good.
Eliza Weisman:Yeah. And so at the time, there was a great deal of feral concurrency control. In particular, I don't know how much detail it's really worth going into here. Greg, how would you describe the problem?
Greg Colombo:So, yeah, I think that the way that we had set it up, and by we, I mostly mean I, I'd been working on this problem for a number of months beforehand, was we had kinda gotten to a world where, if you had a compute instance and it was running on a particular server, then the server on which it was running and the Sled Agent for that server were sort of the masters of updating the state of that instance for the rest of the control plane. So you would have your instance and it's starting or it's running or it's rebooting or it's stopping or it's migrating, and you're getting all this stuff from the virtual machine monitor that's telling you, hey. I'm doing this. Hey.
Greg Colombo:I'm doing that. And you need to, you know, propagate that up from Sled Agent to Nexus, which can then, you know, get the appropriate things into the database so that these transitions are visible to the users of these VMs. And, you know, having the clear ownership is okay, but it starts getting complicated, I think, in a couple of ways. One that we had a lot of trouble with from the concurrency control perspective was if you have one of these VMs and you're trying to live migrate it from one compute sled to another, and you have this rule that says, well, sleds are the masters of instance state while you have a VM running.
Greg Colombo:Well, now you got 2 of them. You have 2 of these things running around, and they can't see each other directly, in the sense that, like, your migration destination cannot see the updates that are being pushed by your migration source. And so, you know, properly concluding a migration and updating all of your state, so that you were, like, pointing at the right sled where your VM is running now, is doable but very, very, very finicky, and hard to reason about, and takes a lot of, like, here are multiple TLA+ models showing that I think we're gonna do the right thing. Like, it upholds all the invariants I think I want it to hold. So that's one challenge that we had there.
Greg Colombo:I think the other one that really motivated the work, Eliza, that you're gonna talk about here, was what do we do when an instance stops? And there's a lot of cleanup work that has to be done also when it finishes a migration, but when it stops is another big one.
Eliza Weisman:Or when it finishes the migration and stops.
Greg Colombo:And both of those things. Yeah, that case is one that came up, like, in Eliza's big PR working on some of this stuff. But the case that you generally have to worry about is one where you have your VM and it's running in one spot, and then it's not running in that spot anymore, either because you moved it or because it just stopped.
Greg Colombo:And now you have a bunch of bookkeeping you have to do. You have to say, okay. I don't need to reserve resources on the sled that I came from anymore. I have to tell all the networking components where this thing is at now, so that the right traffic can be routed to the right physical sled, or so we can just, like, remove the entry for the stopped instance from the routing tables and that kind of thing. And there's a lot of complexity in there, because we had, prior to this, been doing all of this work in the upcall from Sled Agent to Nexus that says, hey, here is my new state.
Greg Colombo:And so we've had bugs of the form: oh, yeah, we started doing this, and Sled Agent decided that it didn't think Nexus was going fast enough, so it timed out, but it really needs to send this update because this is the only state that we've got, so, you know, it's gonna send it again. And now I've got 2 Nexuses that are trying to do these things and competing with each other. We had bugs there. We patched over them, but it was a very, very difficult system to reason about. And so, like Eliza said, you know, back in, I think, March of this year, I kind of sat down and said, okay.
Greg Colombo:Here's what I think we should do instead. We should have more of a model where so let's just say, alright. Look. The VMMs say that they're doing this, that, and the other. The migrations are in this particular state.
Greg Colombo:Here in Nexus, you can see all of these things, because you can do, like, a big database query that will get you all of the information in an atomic read, from the database's perspective. You go figure it out. You go reconcile all of these things. And then we had to figure out how to execute that reconciliation work safely. And so I wrote all this stuff down.
Greg Colombo:And like Eliza said, I basically finished it, and then my daughter was born and I disappeared for a while. And that's where Eliza can go ahead and pick the story back up, I think.
Eliza Weisman:Yeah. So Greg had a very, very cool idea, which he had left for me to actually go do. And so I think it's also worth just noting that at the time that we're, like, embarking on this project, most of the sagas in the Oxide control plane are initiated as a response to some kind of user initiated action. Like, you say, I want to have a VM. I want my VM to be started.
Eliza Weisman:I want to delete my VM that I used to want to have, and now I don't want to have anymore. Right? And so those are all triggered by, like, an API endpoint that the user is hitting, either through the Oxide CLI or through the web UI or through some code that they've written against the public API. But this is a little bit different, because this is something that we want to do sort of mechanically in response to something that has actually happened in the real world, where we've received some information saying that it's happened. And in particular, what Greg had kind of determined is that sagas let us do something really cool that I think is going to, like, tie into a theme that I had spent this entire project thinking about, which is that, you know, there's a guy on this call who really hates distributed locking.
Eliza Weisman:Andrew, do you wanna say that you are the guy who really hates distributed locking?
Bryan Cantrill:Andrew, you have something you'd like to say?
Andrew Stone:Yes. I, I I yes. I do hate distributed locking.
Dave Pacheco:Is this some kind of a trick? Because I feel like I'm always the person that's like, no distributed locking. No. No.
Eliza Weisman:No. So there's there's actually 2 guys on this call who really hate distributed locking.
Bryan Cantrill:They've not been
Andrew Stone:talking to me
Eliza Weisman:and there's Greg. But we decided that we should have distributed locking. And in particular, Greg recognized something that I think is really cool, that I don't know if anyone else has really meaningfully realized you can do thanks to sagas, but maybe they have. I thought it was quite neat, which is that, you know, because you have this idea that a failed saga unwinds, you have essentially something that feels a lot like RAII in Rust. Right?
Eliza Weisman:Where when you acquire a Rust mutex, you get back this guard, this, like, in some sense, real thing that you're holding on your stack. Right? And then if you panic and your stack unwinds, you drop that lock. Or if the function call that you're in returns normally, you drop the lock. So there's this kind of, like, guarantee that by leaving the context in which you have a lock, the lock is released.
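Eliza's analogy, in code. This is just the standard library's mutex, shown here because the saga lock discussion that follows leans on the same drop-on-unwind guarantee:

```rust
// The RAII pattern being described, with a plain std::sync::Mutex: the
// lock is released when the guard goes out of scope, whether the scope
// exits normally or by panic/unwind.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0u64));

    let c = Arc::clone(&counter);
    let handle = thread::spawn(move || {
        let mut guard = c.lock().unwrap(); // acquire: returns a guard
        *guard += 1;
        // `guard` is dropped here, releasing the lock. Even if this
        // closure had panicked partway through, unwinding would drop it.
    });
    handle.join().unwrap();

    // If a thread *did* panic while holding the lock, the mutex is marked
    // "poisoned" and lock() returns an Err you can choose to recover from.
    assert_eq!(*counter.lock().unwrap(), 1);
}
```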
Eliza Weisman:And with distributed locking, there are fundamentally a lot of the same concerns that you get with non-distributed locking, like, you know, what happens if somebody gets the lock and then they just, like, die? Or, you know, what if they just wait a really long time to release the lock, and now the lock is just held and I can't do anything that I wanna do on the computer? And that's kinda sad.
Bryan Cantrill:But
Andrew Stone:Even worse than that. Right, Eliza? Like, you have the situations specifically with distributed locking where you don't have coordinated clocks. And so, like, let's say your lock is based on a lease timeout, so somebody can determine that it died. Well, did it actually die?
Andrew Stone:Like, maybe
Eliza Weisman:Right.
Andrew Stone:The time is different on the remote node that thinks it can claim the lock now, and the other node is still running. Or, like, it just goes to sleep for a little bit and comes back, and its clock didn't change at all. And so now you have no lock at all. You just have 2 nodes, like, acting on the same resource.
Eliza Weisman:Or you're actually the one who's dead. Right? Because you might be the one who's on the side of the network partition that's cut off from all of your friends.
Andrew Stone:Yeah. It's a nightmare.
Dave Pacheco:Right. And part of the challenge is, even if you're checking for this, you can always, like, check the time and then be off CPU for 10 minutes and then do the thing that you shouldn't do, because you just can't do it atomically. Right?
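A sketch of the non-atomic check Dave is describing, with invented types: the lease check and the protected work are separate steps, and nothing prevents a long pause in between them:

```rust
// The hazard being pointed at: a lease check and the protected work are
// two separate steps, so being descheduled in between defeats the check.
// All types here are invented for illustration.
use std::time::{Duration, Instant};

struct Lease {
    expires_at: Instant,
}

impl Lease {
    fn still_valid(&self) -> bool {
        Instant::now() < self.expires_at
    }
}

fn do_protected_work(lease: &Lease) {
    if lease.still_valid() {
        // ... nothing stops the OS from parking this thread (or the VM
        // from pausing) right here for ten minutes ...
        // By the time this line runs, the lease may have expired and some
        // other node may legitimately hold the lock. The check and the
        // mutation are not atomic.
        println!("mutating shared state");
    }
}

fn main() {
    let lease = Lease { expires_at: Instant::now() + Duration::from_secs(30) };
    do_protected_work(&lease);
}
```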
Eliza Weisman:But so the kind of neat thing about this unwinding of sagas is that you can do something kind of like RAII with your distributed lock. If the saga completes, it releases the lock. But if the saga fails, it unwinds, and the compensating action now, we said the word undo a lot earlier on the show, but Greg has told me that he's decided that undo is actually problematic and we shouldn't call
Greg Colombo:it that anymore. I think this is just my personal hobby horse. The Steno documentation refers to the reverse actions as undo actions. I have taken to calling them compensating actions, just as a way of noting that in some cases, when you unwind a saga, the post state from your unwind is not necessarily the same as the pre state before you started running the saga at all. And the sort of trivial example of that is something like a generation number that you're bumping for optimistic concurrency control.
Greg Colombo:But we have other cases where that comes up. You could imagine, for example, an instance provisioning saga that, you know, moves you from your creating state, and then, if you've set the option, tries to start your instance right away. If the starting bit fails I mean, it's kind of a poor example, because do you unwind the entire create? Depending on how you structured it, you may. But you may end up with something where, like, you know, my post state is actually stopped or something like that.
Greg Colombo:I haven't, like, restored the world to the pristine state that it had before. What I've done is I've restored it to some state that is consistent with the action that the user asked for not having been taken. But I think this is just, like you know, I run off and invent different terminology for things, like, on the regular, and sometimes it gets adopted, sometimes it doesn't. But I think that's all that's meant there.
Bryan Cantrill:Greg, you're just saying that because of time's arrow. Nothing can actually be undone. Undo should just be purged from our lexicon. It's not possible.
Greg Colombo:Well, I mean, I think the term makes sense. Dave is the one, I think, who added the term, and it makes sense in the context that it's much more suggestive of what's actually going on here. Like, we are trying to get back to something approaching the prior world state. It's just that that may look a little bit different depending on the specific context of what it was that you were trying to do. I'm sorry, Dave, did you have more you wanted to add on top of that? I feel like that was your term.
Dave Pacheco:Totally, yeah. So I just meant to mention this earlier. I kind of alluded to it, but I didn't actually come back to it: we chose pretty different terminology for a bunch of the distributed sagas things. So if you were familiar with sagas before, you may be very confused by a bunch of the terminology. Like, what we call actions, sagas calls requests. And I basically changed a bunch of these because I thought they were confusing in our context.
Dave Pacheco:In particular, these requests were not actually necessarily HTTP requests or RPCs. In some cases, for us, they were, like, generate a UUID, and that is the whole action. So I found that confusing. Undo just felt more evocative to me than compensating. But the argument that you're making, that it's more compensating than it is undoing, is totally fair.
Eliza Weisman:I certainly think that, like, far more people are going to immediately understand what is meant by undo than compensating. Greg's observation specifically is that, you know, in a world with generation numbers, time's arrow is very real. And, like, even undoing the state changes that you did might mean changing a generation number to a number that it's actually never been before. So it's not really undoing. But the point is, this is, like, actually an argument over semantics. The sort of neat thing is that, you know, you can actually have some guarantee that locks are always released reliably in a world where everything is a saga that can unwind and run these compensating actions.
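A miniature of the compensating-action point, with hypothetical types rather than Omicron's real schema: the reverse action restores the state, but the generation number keeps moving forward:

```rust
// Greg's "compensating" vs "undo" point, in miniature: the reverse action
// bumps the generation again rather than restoring the old number, so the
// post-unwind state is consistent but not identical to the pre-saga
// state. These are invented types, not Omicron's actual schema.
#[derive(Debug)]
struct InstanceRecord {
    state: &'static str,
    generation: u64, // optimistic-concurrency generation number
}

fn start_action(rec: &mut InstanceRecord) {
    rec.state = "starting";
    rec.generation += 1; // every visible change bumps the generation
}

fn start_compensating_action(rec: &mut InstanceRecord) {
    // "Undo" the state change, but time's arrow: the generation keeps
    // moving forward. It is now 2, a value the record has never had.
    rec.state = "stopped";
    rec.generation += 1;
}

fn main() {
    let mut rec = InstanceRecord { state: "stopped", generation: 0 };
    start_action(&mut rec);
    start_compensating_action(&mut rec); // the saga unwound
    assert_eq!(rec.state, "stopped");
    assert_eq!(rec.generation, 2); // consistent, but not the pre-state
}
```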
Eliza Weisman:And that allows you to have, I hesitate to use the word fearless, but it allows you to have marginally less fearful distributed locking, if you are willing to operate under the constraint that a distributed lock that you have created is, like, never acquired in code that is not part of a saga action. And so we have gone and done that for our instance update saga. Rather than having, as Greg was describing previously, the sled agent sort of go, hey, here's some state that I've decided this instance is in, and turn it into JSON and send it to Nexus, which then just sort of blindly goes and puts it in the database, and then hopes that the sled agent has, like, incremented the generation number correctly instead, we have said, okay, the sled agents are no longer responsible for owning the instance and its state, because that ownership might be kind of smeared across multiple sled agents. And instead, we're going to have all of the changes go through one place, both to the way that we represent the instance and to the resources that it sort of logically owns. For example, we have virtual provisioning counters of, like, how many virtual CPUs you have allocated, and we want those to actually correspond to the number of properly running VM instances that are physically incarnated, rather than just sort of going up and never going back down, because then you can't spawn any more instances ever again.
Eliza Weisman:And, like, it's very important that those resources be cleaned up. And then separately, we're also tracking the resources on the individual compute sleds, the sort of physical rather than virtual resources, and it's very important to clean those up too. And, you know, you don't want to be adjusting one of those sets of provisioning counters without adjusting the other one. And all of this was actually not happening in a saga. It was just happening in, like, an API endpoint handler.
Eliza Weisman:They would just go and, like, try and do all of those things. But, like
Bryan Cantrill:It was feral, if you will.
Eliza Weisman:Yes. It was very feral, except with this notion of, well, the sled agent sometimes owns the instance and sometimes it doesn't. And sometimes 2 sled agents both kind of own the instance, but not really. And sometimes 3 sled agents own the instance. And sometimes nobody owns the instance, and it's, like, kind of loose in the cat room.
Eliza Weisman:And you can't really go do anything about it, because nobody knows who it belongs to, so it can't be stopped. And this would manifest in a bunch of bugs that would sort of be like, well, you know, I have an instance, but it's not running, and it's also not stopped, and nothing is changing its state, and I can't delete it. And that would often be fixed by having our colleague, Alan, go in and manually edit the database records, and that's not very good. So we came up with this scheme where all of these instance state updates would be published, and we sort of removed the notion of having the sled agent own the instance, and instead, we only allow it to own the VMM. And we talk a lot in the Oxide control plane about separating the concepts of a VMM and an instance, where an instance is sort of, like, the thing that the user is actually thinking about, which is, I would like to have my virtual machine and I would like it to be somewhere and I want it to exist.
Eliza Weisman:And a VMM is sort of, like, well, we have this actual Unix process that represents the virtual machine manager that's actually currently incarnating a particular instance. And we have, like, separate database tables, separate API objects for these. And in particular, we determined that, okay, the sled agent can just publish the state of the VMM. And then Nexus sees when the state of a VMM has transitioned in some way that requires the state of the instance these are 2 database tables to also be transitioned. Right?
Eliza Weisman:Like, perhaps this VMM has shut down, and, oh, now the instance actually is no longer running, so we have to go back and change that. But before we can go back and change that, we have to release all of these virtual provisioning counters, delete some NAT table entries because that instance is no longer actually occupying an IP, and so on. And all of these things have to happen. So we've written a saga that does all of that, and we have introduced locking for each individual instance. Of course, the saga itself is idempotent, and we can run the same saga multiple times, and that's totally fine. But what we can't do is run 2 instance update sagas that are trying to handle different updates. You know, for instance perhaps you oh, god.
Eliza Weisman:I can't say for instance when I'm talking about this, can I? That's really bad. Perhaps you have migrated your instance from one sled to another, and you've been doing the work necessary to update your understanding of the world to reflect the fact that that instance has moved. And then the VMM that it was running on has died.
Greg Colombo:Yeah.
Eliza Weisman:Right? Like, while you were still processing, the migration has finished. Well, there's been another state change, and you can't handle that state change until the previous one is completed, because you're gonna mess up the previous state change. And so because of that, we've introduced this lock. And because we've now added this lock, we've created a way that the sagas sometimes can't run, which required us to then go and do a bunch of work to make sure that they would always eventually run.
Eliza Weisman:And we did that primarily using Nexus is basically composed of 2 things, which are sagas and background tasks, and the background tasks are the other sort of part of it. Somebody mentioned Kubernetes controllers, and they're kinda like that. They're what you might call, in that context, reconcilers, or however you pronounce it. Somebody's gonna get mad at me. So, Andrew, we actually sort of do that now, in the branch that I merged, not to get too inside baseball.
Bryan Cantrill:We should read Andrew's question for the listeners.
Eliza Weisman:Yeah. So Andrew asked in chat for a way to make sure the sagas don't even have to run, rather than having the first node do the lock checks. And we actually did a bunch of work on that. In particular, as part of adding the lock, we did the other sort of thing of moving as much state as possible out of the lock. For instance, we used to god, I can't keep saying that. I really have to stop saying for instance.
Eliza Weisman:We used to have a notion of an instance state, and we would just always give API clients the instance's state. And then we have separate notions of the state of the VMM that incarnates that instance. And we kind of realized that there are some state transitions like, if you are going from stopped to starting to running when you go from starting to running, the instance record that we have in the database doesn't actually care whether you're starting or running. It just cares: is there a VMM incarnating this instance, or is it nonexistent?
Eliza Weisman:Similarly, if the VMM reboots, you know, and now it's rebooting and then it's running again, the instance record doesn't care about that. It just cares about, do I have a VMM, and do I have a foreign key that identifies that VMM? And so we just changed the code that, like, synthesizes these externally visible API instance states to make that decision based on, well, if the instance has a VMM, we just look at the VMM state. Which means that all of those state transitions now don't have to mess with the thing that exists within the lock, so you do less locking. We also made the compensating action that releases the lock trigger one of these background tasks, which are sort of reconcilers. The Nexus background tasks are, like, also completely stateless, and what they do is they read all of the state that they're going to operate on from the database every time they activate. And they're just, like, sort of periodically activated to do some work.
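A sketch of that state synthesis, with invented enums rather than Omicron's actual types: the externally visible state is computed from the instance record plus the state of its VMM, if any, so VMM-only transitions never touch the locked instance record:

```rust
// The state-synthesis idea being described: the externally visible
// instance state is derived from the instance record plus the state of
// its VMM, if it has one. Enums and names here are invented.
#[derive(Clone, Copy, Debug)]
enum VmmState {
    Starting,
    Running,
    Rebooting,
    Destroyed,
}

#[derive(Clone, Copy, Debug)]
enum ExternalInstanceState {
    Stopped,
    Starting,
    Running,
    Rebooting,
}

struct Instance {
    // Stands in for the foreign key to the active VMM record, if any.
    active_vmm: Option<VmmState>,
}

fn external_state(instance: &Instance) -> ExternalInstanceState {
    match instance.active_vmm {
        // No VMM incarnating this instance: it's just stopped. Transitions
        // like starting->running, or a reboot, never touch the instance
        // row itself, so they don't need the updater lock at all.
        None | Some(VmmState::Destroyed) => ExternalInstanceState::Stopped,
        Some(VmmState::Starting) => ExternalInstanceState::Starting,
        Some(VmmState::Running) => ExternalInstanceState::Running,
        Some(VmmState::Rebooting) => ExternalInstanceState::Rebooting,
    }
}

fn main() {
    let inst = Instance { active_vmm: Some(VmmState::Rebooting) };
    println!("{:?}", external_state(&inst)); // Rebooting, no lock needed
}
```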
Eliza Weisman:And, you know, they store no state across activations except in the database. And so we have a background task that looks for instances that may be in a state where we have to go and do something to update them. And then it just does that. And so we can explicitly trigger that whenever we try to get a lock and can't, because we know we need to do some work and the lock is already held. We can trigger that background task and say, okay.
Eliza Weisman:Well, whenever whoever is done using the lock finishes the update they're doing, there's another one that also needs to be run, so make sure that gets started. And in the event that that trigger also cannot acquire the lock, we also just run this periodically. So at the end of the day, we have kind of guaranteed that eventually the state update will always be processed. We have a bunch of mechanisms to try and make it happen as quickly as possible, because you don't wanna say, hey, I wanna delete this instance, and you click delete, and you're sitting there looking at the UI, and it says it hasn't been deleted, but it actually has been deleted.
Eliza Weisman:We want it to be, like, as responsive as possible to the actual state change. But failing that, we want it to have happened. And so we have sort of multiple layers of defense of making sure that, eventually, it happens, and the first one is actually moving as much work outside of the lock as possible.
Bryan Cantrill:But still within the saga?
Eliza Weisman:Well, no. The saga is doing all of the actual mutations of the state of the instance as stored in the database. And it's contending with the other sagas that might be user initiated, like you want to start a stopped instance, you want to migrate an instance, and so on. Those are also sagas.
Eliza Weisman:And the actual sort of state machine that we have, representing the various transitions that the instance can go through, is what is sort of providing concurrency control there. And this is very much what the feral concurrency control paper would call feral concurrency control. But I think the reason that you might want to have feral concurrency control is that sometimes the application has semantics that sort of inherently lock out certain operations. Right? Like, you can't migrate an instance that hasn't been started.
Bryan Cantrill:Right?
Eliza Weisman:And so that provides concurrency control between the migrate and start sagas. Right? If it's already started, the start saga won't run. And if it is not started, you can't do anything like migrate it. So that state machine actually provides most of the concurrency control.
Eliza Weisman:The only place that we need to introduce locking is between multiple state updates that were generated by something actually happening in the rack. I feel like I'm not actually answering the question you asked, which was about moving work out of the lock.
Bryan Cantrill:Moving yeah. Moving out of the lock.
Eliza Weisman:So no. It's actually that we have just sort of allowed more things to happen without contending the lock. In particular, and credit for this belongs solely to Greg: previously, start and migrate sagas that failed would unwind by, like, messing with the instance state. And so, like, when you want to migrate an instance, you find a new sled for it to go to, you find a VMM process on that sled for it to go to, and then you set a foreign key on the instance record that says the target VMM of, like, where that instance is migrating to. And you do, like, all of the actual work.
Eliza Weisman:You draw the rest of the owl. You actually say, hey, move some of this. But, like, in the database, you were creating this pointer. And I am not really, by life experience, a distributed systems person, or really actually ever a database person until now. So I, like, think about these things always in the context of, like, I'm writing code in one address space.
Eliza Weisman:So I'm gonna keep calling the foreign keys pointers, and nobody can stop me.
Bryan Cantrill:Adam will only try to edit you out. That's the only thing he tried to do.
Eliza Weisman:So you have this, you know, target VMM pointer. And prior to us making this change, what happens when a failed migration attempt unwinds is the saga that's unwinding then goes and, like, unsets the foreign key that it had previously set. Right? And this is something that can race with the instance update saga, because it might be handling a succeeded migration. Or the migrate saga might have unwound because it actually found that the instance was already migrating and it couldn't start a new one.
Eliza Weisman:And so we would have that then go and mess with the instance record again. And that would increment the instance state generation, invalidating all of the updates the update saga has done, and then it would have to unwind and schedule a new version of itself, which would then take the lock again. And you would have all of this sort of unnecessary lock contention that would make it take a lot longer for the state to settle.
Eliza Weisman:And so what Greg realized, as Andrew called it in chat and this is, I think, why we were having this sort of, like, bikeshed argument about whether they're undo actions or not, is that you can actually have the start saga or the migrate saga unwind by changing the VMM records that they've created to represent the, like, actual Unix processes of the virtual machine managers that the instance is migrating to or starting on. If you unwind a failed migration saga, you can go find the VMM record that you created and put it in a state that means that the saga that created it has unwound. And then a subsequent migration, which previously would only be allowed to start if the foreign key was null, can instead look at the foreign key and say, oh, there is a target VMM there. I'm gonna go check and make sure that target VMM is not actually just sort of a zombie that was left behind by a previous unwinding migration. And it goes and looks at it and sees, oh, it actually was just left behind by a previous migration unwinding, so I can just clobber it.
Eliza Weisman:And so the sort of downside of this is that I had to write, like, a really gruesome SQL query to handle the fact that it might be null, or it might also be in a state where it isn't just, like, okay to clobber. And it was far more gruesome to do that using Diesel.
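A hedged guess at the shape of that query, with invented table and column names; this is not the actual Omicron SQL, just the conditional-clobber idea: set the migration target only if the slot is empty, or if it's occupied by a VMM whose saga unwound:

```rust
// A sketch of the conditional clobber described above, embedded in Rust
// as a raw SQL string. Table names, column names, and the state value
// are all invented for illustration.
const SET_MIGRATION_TARGET: &str = r#"
UPDATE instance
   SET target_vmm_id = $1,                        -- the new target VMM
       state_generation = state_generation + 1    -- time's arrow, again
 WHERE id = $2
   AND (
         target_vmm_id IS NULL                    -- slot is empty, or...
         OR target_vmm_id IN (
              SELECT id FROM vmm
               WHERE state = 'saga-unwound'       -- ...a zombie: clobber it
         )
       )
"#;

fn main() {
    // Zero rows updated would mean a live migration really is in
    // progress, so a new migrate saga should fail cleanly instead.
    println!("{SET_MIGRATION_TARGET}");
}
```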
Dave Pacheco:But
Eliza Weisman:that works, and that allows you to not do a change that, like, invalidates the optimistic concurrency control of the generation number on the instance record, but it also allows you to not have to, you know, contend the lock. And it allows the instance update that's running to sort of keep running. And this is where all of this started to feel very familiar to me, because I have no background in distributed systems, really, but I do have a background in concurrent data structures and algorithms. I've done some, like, lock-free data structures, and this feels a lot like that. And so I thought that was, like, really pleasing.
Eliza Weisman:And this is where I went, oh, you know, obviously, what we also have to go do is add some kind of GC, right? Because we are now abandoning these VMMs in a state where, you know, they're not actually being used for anything, but they still have these database records and they might still be occupying some resources. So you need something that goes through and finds everything that's now unlinked and just gets rid of it. And that, of course, naturally made me think of QSBR, quiescent state based reclamation, even though it isn't really that. That's where you sort of explicitly annotate the times in your program where you're not doing anything, and that's when you can just sort of delete resources that were abandoned previously. And it's not really like that, because we don't actually do it in quiescent states. We just do it occasionally.
Eliza Weisman:So that diatribe is not really relevant. I don't know why I'm saying this. But yeah. And
Bryan Cantrill:So given that and because this is always a challenge, certainly, when you have a highly concurrent data structure how do you test all that? Because you're really testing for some pretty gnarly conditions there.
Eliza Weisman:So this is where, honestly, I have kept all of my shame. I think Dave discussed previously that we have written these test helpers for sagas that do things like, you try and run a saga through to completion, and then at every node in that saga, you run that node's action twice, so that it sort of tests the idempotency property. Yeah. I would very much like to get James talking about his locking scheme shortly, because his locking scheme is actually kind of different from mine, and I think that's really neat.
Eliza Weisman:So we have these test helpers, and those are quite nice. We also have a similar one for unwinding. But they actually don't exercise the potential race conditions that you might have between, like, multiple events happening that trigger different sagas at the same time. And that's actually not something that we've tested a whole lot, because the complexity of the saga interactions wasn't that much previously, and now it's, like, suddenly ballooned into kind of a whole lot of complexity and, like, a lot of rules that govern this sort of complex dance of interactions. And, honestly, we did a lot of manual testing, and that's something I don't feel good about.
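The gist of those idempotency helpers, as a toy harness rather than Omicron's actual test code: run every action, then run it again and check that the world is unchanged, since crash recovery can replay actions:

```rust
// A toy version of the idempotency check described above: execute each
// action, then execute it a second time and assert the world looks the
// same, because a crash-and-replay will rerun actions.
type World = std::collections::BTreeMap<String, String>;

struct Node {
    name: &'static str,
    action: fn(&mut World),
}

fn check_idempotent(nodes: &[Node]) {
    let mut world = World::new();
    for node in nodes {
        (node.action)(&mut world);
        let after_once = world.clone();
        (node.action)(&mut world); // replay, as after a crash
        assert_eq!(world, after_once, "action {} is not idempotent", node.name);
    }
}

fn main() {
    let nodes = [Node {
        name: "record_disk",
        // Idempotent: inserting the same key/value twice is a no-op.
        action: |w| {
            w.insert("disk".into(), "d-9abc".into());
        },
    }];
    check_idempotent(&nodes);
}
```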
Bryan Cantrill:You know, it beats the alternative.
Eliza Weisman:It pretty much does. And what we did have on our side is that Greg likes TLA+ so much that he did an entire half of a previous podcast episode about how much he loves TLA+. And so Greg had written a TLA+ model that sort of verified that all of these rules that we'd written out conceptually should work correctly.
Bryan Cantrill:And
Eliza Weisman:He was even able to come up with, sort of, I think these are where the races would be. And you could write a, like, non-concurrent, normal Rust unit test that just sort of simulates the conditions where we think there might be a race, and then we can just go find that, oh, Eliza's code is actually bad and there's a bug in it, and it would fail in the race. And the reason that it fails is not because we didn't imagine that there would be a race there, but because the TLA+ model had predicted that there would be a race there, and I just, like, typo'd something. So that kind of answers the question, but we ended up just doing a lot of manual testing, using our stress testing framework and kind of hammering some of these things.
Bryan Cantrill:Well, and the question being asked in chat is, like, how do AWS and Azure and so on do it? And, like, this is just a challenge when you're implementing a control plane, especially for a public cloud. And anyone who was at AWS for a long time will tell you that there was a lot of, like, gnarliness that had to be cleaned up manually. And really, speaking for Joyent:
Bryan Cantrill:We would have a lot of these kinds of pathologies, and they would get cleaned up manually. And it was stuff that you couldn't ship as a product. And, you know, I thought it was great that you mentioned the accounting issues, which are really important. It's easy to kinda dismiss the need to account for the resources that a VM is using.
Bryan Cantrill:It's actually extremely important. And if you have a migration that kinda fails halfway in, and then it's, like, stopped, or then that instance is deleted, it's really important that the accounting ultimately catch up to all that. Because it is really awful when the accounting is off. And I think, you know, the public clouds have all sorts of problems on this, where they are relying on some MapReduce job to ultimately run and hopefully figure out that the state of the world does not match the state that they think they've got. So these are really, really thorny problems.
Eliza Weisman:Yeah. And there are also problems where, like, the sled resource leak or the virtual resource leak is something that just gets worse every time it happens. Right? And then eventually, you just can't start VMs ever again, because your control plane believes that all of your CPUs are fully occupied, but they're not. And that's something that, if it happens once, it's easy to not even notice that it happened. But when a system runs unattended by its implementers for a long time, it can end up in a state where the customer just can't start VMs anymore.
Eliza Weisman:And, you know, maybe when AWS has that problem, they have, like, a nice little dashboard, and they're paying a bunch of people to look at that dashboard and go, oh, we're actually leaking virtual resources, and just sort of go in and clean it up manually. But you can't really do that when you've sold the product to someone and now it's theirs, and they maybe don't want your support people looking at their dashboard 24/7.
Bryan Cantrill:Well, and very importantly, like, the Oxide rack is designed to be air-gapped. We do not operate them remotely. So it's very important that these things be automatic. Which, on the one hand, is hard. Yes.
Bryan Cantrill:Andrew's kind of mourning the fact that we can't operate them remotely. It makes for a much thornier problem, but it also makes for a much more robust system. Because we had all of the problems that you're talking about and more, Eliza, in previous incarnations, certainly. And then even in this control plane, prior to your work, we had a bunch of feral concurrency that we needed to address. And James, do you wanna talk at all about the concrete example that you dropped in? Because that's very vivid.
James MacMahon:So I dropped Omicron issue 5042 in the chat, where there's 20 steps between these sagas, interleaving and running concurrently, that lead to this accounting bug. Just as an illustration of how hairy these bugs can be. And it's very difficult to, say, make a unit test for that, because the permutations here are just insane. You really can't account for them. It's what makes something like the stress tool, that was referenced a little earlier, pretty interesting.
Bryan Cantrill:Yeah. And that's Omicron stress. Right, James? Right. Yeah.
Bryan Cantrill:So the yep.
James MacMahon:Sorry, I was just gonna say I said earlier in the chat, you know, we had decided to launch sagas based on API endpoints and user actions, which perhaps was a mistake. But here we are. The Omicron stress tool will take, for example, a disk create or a disk delete operation, spawn an arbitrary number of these actors, which it calls antagonists, and just hammer our endpoints. So you end up getting a case where, you know, actor 1 called disk create, actor 2 called snapshot create, somebody tried to migrate an instance, and then there was a disk delete operation. You end up stressing out these different permutations and interleaving sagas in a way that is kind of frustrating to deal with, but also really interesting, because you really feel like you have to go in and do this giant post-crash investigation and, you know, call in the National Saga Safety Board and get everybody together and really figure it out.
James MacMahon:It's quite subtle. Back to what Dave was saying: you know, Steno's great. The framework's great. But you can really subtly get saga code wrong.
James MacMahon:There's no, like, trait for something being idempotent, or for, you know, the compensating action being safe to rerun. Like, a lot of it is up to us, sort of getting this stuff right.
Bryan Cantrill:Steno gives us a great structure for it, or sagas give us a great structure for it, but ultimately, like, these problems are really thorny. And, you know, we've tried to maximize our opportunity for success here. And, Eliza, did you need to extend Steno at all for the stuff that you were doing? Or was that
Eliza Weisman:no, actually.
Bryan Cantrill:Able to merely abuse it?
Eliza Weisman:So we didn't actually talk about the, like, most ghastly thing that I ended up having to do, but I think it's actually, like, in some ways not that bad. But Andrew and I talked about potentially making a change to Steno, because we had a situation where this instance update saga that we've written might handle a variety of fundamentally different state changes. Right? It might be doing a state change where the instance migrated. It might be doing a state change where the instance has stopped and been destroyed and is no longer incarnated.
Eliza Weisman:And the work you have to do to handle those state changes is really different. Like, in one of them, you actually don't have to delete those virtual provisioning counters, and in the other one you do. Similarly, the instance might have migrated out of a VMM that has already shut down, and so you also go and clean up the physical resources that that VMM embodied. Or the migration might not have finished, and so you don't wanna do that yet, but you do want to update the instance record so that it points at the right place, and so on. And so the saga DAG that you're building depends on the update you're processing.
Eliza Weisman:But, and this is where it gets kind of catch-22: because we have this distributed locking scheme, the saga needs to be operating on an understanding of the instance that was read within the lock. Right? This is sort of a classic problem in concurrent data structures that's also a problem in concurrent multi-process data structures. So that's actually a little bit hairy, because, in order to ensure that the lock is always released when a saga fails, you can only acquire the lock when you're in a saga. But you can't build the DAG of the saga without having the lock.
Bryan Cantrill:The lock. Yeah.
Eliza Weisman:And this is actually I think this was Andrew's idea, where, like, I had come to him and I'd been, like, wait. What the hell am I supposed to do about this? Greg didn't actually tell me what I was supposed to do with this, and this is my first time writing a saga. I'm 12, and what is this? And so I'd come to Andrew, and he said, well, we could maybe go, like, mess with Steno pretty drastically to change it to allow you to build a DAG within the saga, but that seems, like, quite complex, and I don't really know how.
Bryan Cantrill:But instead he had the idea, you know, what if you just had
Eliza Weisman:a saga that starts another saga? Which is what we did. And we added sort of an inherit lock operation, where you just sort of do a compare and swap: if I'm holding the lock now, I give my child the ID that I put in the lock, and then it tries to do a database compare and swap, like, if it's still the parent ID, then I wanna put my ID in. Otherwise, I just wanna go die.
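A sketch of that inherit-lock handoff. In the real system this is a conditional update against the database; here the lock is a plain in-process cell so the compare-and-swap logic is visible, and all names are invented:

```rust
// The "inherit lock" compare-and-swap described above, in miniature.
use std::sync::Mutex;

struct UpdaterLock {
    holder: Mutex<Option<u64>>, // saga ID of the current lock holder
}

impl UpdaterLock {
    /// The child saga tries to take over the lock from its parent: it
    /// succeeds only if the parent still holds the lock (the compare),
    /// swapping in its own ID (the swap).
    fn inherit(&self, parent_id: u64, child_id: u64) -> bool {
        let mut holder = self.holder.lock().unwrap();
        if *holder == Some(parent_id) {
            *holder = Some(child_id); // swap
            true
        } else {
            false // lock was lost or taken over: the child should just die
        }
    }
}

fn main() {
    let lock = UpdaterLock { holder: Mutex::new(Some(1)) };
    assert!(lock.inherit(1, 2));  // parent saga 1 hands off to child 2
    assert!(!lock.inherit(1, 3)); // a stale parent can no longer hand off
}
```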
Bryan Cantrill:So this is a subsaga that you're basically handing off to?
Eliza Weisman:It's actually not a subsaga. It's a second saga. The difference is that again, so, like, Ixie has commented on me calling it a database compare and swap. Is it? That's how I think about it.
Eliza Weisman:My background is in working on things where you talk a lot about compare and swap and not as much about, like, database transactions.
Bryan Cantrill:That's how I think about it too, actually, I gotta tell you. I love the translation layer, so thank you.
Eliza Weisman:Yeah. So what you're actually doing like, this is not a subsaga, because subsaga is sort of a term of art that's Steno specific. Like, a lot of this is kind of Steno specific and not distributed sagas specific. And I just sort of wanna make sure that that is, like, very clear.
Bryan Cantrill:Right. Yeah. Yeah. That's important. Yeah.
Eliza Weisman:But a subsaga in Steno, as Dave described, is specifically sort of like calling a tree of functions where they can't see your state, so you can call them any number of times. But those are also built into the saga DAG when it's being constructed, before it begins executing. And we actually do use a subsaga, because we have one for handling a VMM being destroyed, shutting down, because you might want to handle both a migration target and a migration source shutting down at the same time. So you might need to do it once or twice. Right.
Eliza Weisman:And so you might put 1 or 2 of those in your DAG. But instead, what we do is we just have a saga, and it has 2 actions in it. It's kind of a trampoline, is how you might think of it. What it does is it locks the instance record and reads the database state, and that's, like, one action. And then the second action is to take the database state, construct the DAG of the actual saga that will do the real work, and then start that saga.
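The trampoline shape, sketched with invented names: a tiny two-action saga where the first action takes the lock and reads a snapshot, and the second builds the real saga's DAG from that snapshot, which is exactly what couldn't be done up front, before the lock was held:

```rust
// The trampoline described above, as an illustrative sketch. Everything
// here is invented; it only shows the two-action structure.
struct InstanceSnapshot {
    migrated_in: bool,
    vmm_destroyed: bool,
}

// Action 1: acquire the updater lock, then read a consistent snapshot
// of the instance from the database while holding it.
fn lock_and_read() -> InstanceSnapshot {
    // acquire_updater_lock(instance_id, saga_id) would happen here.
    InstanceSnapshot { migrated_in: true, vmm_destroyed: false }
}

// Action 2: the DAG of the real saga depends on the snapshot that was
// read *inside* the lock, which is exactly why it can't be built before
// the trampoline runs.
fn build_and_start_real_saga(snap: &InstanceSnapshot) {
    let mut dag = Vec::new();
    dag.push("inherit_lock"); // the CAS handoff from the trampoline saga
    if snap.migrated_in {
        dag.push("commit_migration");
    }
    if snap.vmm_destroyed {
        dag.push("release_virtual_provisioning");
        dag.push("delete_nat_entries");
    }
    dag.push("release_lock");
    println!("starting update saga: {dag:?}");
}

fn main() {
    let snap = lock_and_read();
    build_and_start_real_saga(&snap);
}
```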
Eliza Weisman:And then this is actually where I had to write some of the hairiest code in here, because you have to make a decision of whether or not the parent saga needs to unwind and drop the lock, depending on whether the child saga failed before it had inherited the lock or not. Which is kind of a complex interaction, where we had some bugs that were actually nicely caught by the idempotency tests once I had fixed them.
Andrew Stone:I think the trampoline was your idea, actually, Eliza. But, like, we discussed changing Steno in, like, how we record the state. So, like, right now, there's a trait in Steno, the SecStore, and that, like, writes the saga log to the database.
Andrew Stone:So when an action is completed, the log gets updated. And it's kinda done automatically; because it's a trait, it's built into the sagas themselves. And so there's no way to, like, start to run a saga, proceed through that first state where you look at the lock, and then decide not to log anything. And so, like, part of the problem was you can envision 3 different saga coordinators running, and they're all trying to run the same saga, and they're all trying to take the lock. But if you have these trampoline sagas, they're recording that state in the database.
Andrew Stone:So now you do have this leftover state. And we still haven't actually done any work to I think we've mitigated some of the worst problems with this. But, like, we still are gonna have a bunch of sagas that unwound solely because they couldn't get a lock, right, and recorded state in the database.
Eliza Weisman:Yeah. We try really hard to not start sagas that are ultimately fated to die, but we are going to still start them sometimes, because sometimes multiple state changes happen.
Bryan Cantrill:Well, and Dave is mentioning in the chat Dave, do you wanna elaborate on what you just said in terms of, like, the things that are not a fit for sagas?
Dave Pacheco:Yeah. So I think I ran into this with some of the DNS stuff, like, a year and change ago. Sort of the simple example is that we have this thing in the Oxide system called the silo, and it doesn't really matter what it is, but each one gets its own DNS name. And so users create a silo. We have to add a new DNS name to our external DNS servers.
Dave Pacheco:People remove a silo. We have to remove the DNS name for it. And it seems on the surface, like a great fit for a saga. Right? Because you have a bunch of steps that you're doing and, you need to potentially unwind them.
Dave Pacheco:But then there's all these weird cases that make it kind of nasty. Like, what if you've got, like, 5 DNS servers and one of them is down, maybe for a long time? Does that saga just never complete? And what happens if someone else goes and creates a silo in the meantime that adds a new DNS name, and the first saga is trying to propagate an old version of the DNS to the DNS servers or something like that? And what about a DNS server that gets added?
Dave Pacheco:It needs to get all of the state from all of the silos that have been created. So, for folks familiar with what Kubernetes calls the reconciler pattern, that's a much better fit for this. But I think that was at least not obvious to me at first. And it took some wading into it to be like, this is actually, like, a different kind of distributed problem, even though it felt on the surface very similar.
Dave Pacheco:And so I think we've been running into more complicated stuff in the last year or 2 where that's been the better fit. Which is not to say sagas aren't still useful. I think they are still useful for the stuff we've been talking about. But, you know, we've also mentioned how we've been using RPWs. We call them RPWs, for reliable persistent workflows, but it's basically the reconciler pattern.
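(A minimal sketch of the reconciler shape Dave is contrasting with sagas; all names here are hypothetical illustrations, not the actual Omicron RPW code. The loop repeatedly reads the full desired state, stamped with a generation number, and pushes it to every server: a server that was down just catches up on a later pass, a newly added server gets everything, and a stale pass can't clobber a newer config because the generation only moves forward:)

```rust
// Minimal reconciler-pattern (RPW-style) sketch; hypothetical types.
use std::time::Duration;

#[derive(Clone)]
struct DnsConfig {
    generation: u64,              // monotonically increasing version
    names: Vec<(String, String)>, // (name, address) for every silo
}

struct Database;
impl Database {
    // Stand-in for reading the full desired DNS state from the database.
    async fn read_current_dns_config(&self) -> DnsConfig {
        DnsConfig { generation: 1, names: vec![] }
    }
}

struct DnsServer { addr: String }
impl DnsServer {
    // Stand-in for the DNS server API. Assume the server rejects configs
    // whose generation is <= the one it already holds, so a stale
    // reconciler pass can never regress a newer config.
    async fn put_config(&self, _cfg: &DnsConfig) -> Result<(), String> {
        Ok(())
    }
}

async fn reconcile(db: &Database, servers: &[DnsServer]) {
    loop {
        // Read the *entire* desired state, not a delta: a DNS server
        // added since the last pass needs everything anyway.
        let desired = db.read_current_dns_config().await;

        for server in servers {
            if let Err(e) = server.put_config(&desired).await {
                // A down server isn't a failure of the whole operation;
                // there's nothing to unwind. We just retry next pass.
                eprintln!("{}: {e}; will retry", server.addr);
            }
        }
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```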
Bryan Cantrill:Yeah. And, Eliza, I mean, we're using RPWs to implement this as well. I mean, as part of implementing this entire system, we're finding, and as you're saying in the chat, that there are things where sagas are gonna be the right fit, where it makes sense to add some of this complexity. And it's this knife edge that you're walking of, like, okay, how do I build a much more complicated thing without necessarily, like, forcing that complexity upon all users of sagas, versus having RPWs as well, and where do we need those?
Eliza Weisman:Yeah. For me, as somebody coming into Nexus in the last 8 months, and not really seeing this history that Dave is getting at, of, like, oh, eventually realizing that we also need to have the concept of RPWs, it kinda seems to me that Nexus is basically something that exists in the interaction of sagas and RPWs.
Bryan Cantrill:Right.
Eliza Weisman:It's just, like, those 2 parts. And then there's, like, also an HTTP API on it, but that's kind of the easy part.
Bryan Cantrill:Right. Right. Well, this is great. And, Dave, I don't know if there's additional stuff you wanna elaborate on there, but I think RPWs have been really important. James dropped RFD 373 into the chat.
Bryan Cantrill:We made a bunch of RFDs public, by the way, that were formerly not. So if you find a link to an RFD that you wanna see public, we'll make it public; we're trying to get all of that stuff out there. But it also allows you to kinda see all of our thinking there around RPWs and why we do end up using them. And then, obviously, there are a bunch of links in the chat to specifics. I mean, James dropped in the link to the trampoline saga, which is obviously very helpful for seeing the specific concrete embodiment. Because, especially if you're new to distributed systems, or you're new to this particular problem, this can, I'm sure, feel very abstract, and you can feel like you're just kind of drowning in abstraction.
Bryan Cantrill:And I know, Eliza, this is definitely one of the challenges that you had ramping up, when Greg had kind of, you know, left his ruins and then disappeared. It's like, okay, so I've got a lot of alien technology that I need to
Eliza Weisman:Oh, for sure.
Bryan Cantrill:Ramp up on here.
Eliza Weisman:Although Greg's ruins, I have to say, were... well, I wrote a massive comment attempting to explain some of this stuff. And there's also a massive comment in a TLA+ spec that Greg wrote that, like, probably mostly still holds, which just went past in the chat. So folks who are, like, actually interested...
Adam Leventhal:Well, and folks, it'll all be in the show notes. Every time we've said "in the chat," just check the show notes.
Bryan Cantrill:That's right. And this block comment, Eliza, that I think you're about to drop is really... I mean, I
Eliza Weisman:My self-published novel.
Bryan Cantrill:Well, I have to say, I mean, you know, I know we've talked about our hiring process in the past, but part of having a writing-intensive hiring process is we've got a very writing-intensive team. And I love reading your block comments. These are just terrific narratives. And, Eliza, I think you're the same way. I mean, I love writing these comments, because it helps me with my own understanding of the problem. Definitely.
Bryan Cantrill:And this was, yeah, your novella here. Actually, this may just be a full-on novel, honestly. It's
Eliza Weisman:fairly long.
Bryan Cantrill:It's definitely Robert Mustacchi-esque, another colleague of ours who writes terrific, terrific block comments.
Eliza Weisman:That's very high praise, I have to say.
Bryan Cantrill:Well, it's very deserved, because it is really terrific to read that narrative. And, you know, if you had been wondering: what makes this problem complicated? Why is it so hard? Like, we're just trying to provision a VM. Is it really that complicated?
Bryan Cantrill:I think you hopefully have your answer: yeah, it's pretty complicated. A lot of things make it complicated. It is not just dealing with the sunny-day case, which is a lot easier. It's dealing with all of the clouds and the storms, and these things failing halfway through.
Bryan Cantrill:You've got migration in there. You've got the fact that you need to update the software. You've got the fact that this is being shipped as a product and not being operated by us. There's a whole bunch of stuff that makes it really thorny. And really, Eliza, you know, another thing that I love about this work is that it was very much informed by our own experience operating it, and with our earliest customers, and really beginning to see, like, okay, what are the patterns of issues that we're having, and how can we address that in a more fundamental way, as opposed to just, you know, chasing the bugs? Which, obviously, you know, you wanna fix the bugs for sure, but you get to the point where it's like, no.
Bryan Cantrill:No. That that there's something structurally we need to do here that's that's deeper, and then that's what you've done here.
Eliza Weisman:Yeah. I kinda think, when I picked this work up, I had found, like, kind of the first layer of "what if we papered over some of these bugs with sort of an incremental layer of fixes," which I think I had sort of gone and done, and then sort of immediately recognized: okay, this isn't going to actually solve these problems fundamentally. And so I think I, like, spent a good chunk of time just sort of deleting a lot of the, like, hacky fixes, having recognized very quickly,
Eliza Weisman:Well, if we had continued to go down that path, we'd all be very sad
Bryan Cantrill:and Absolutely. Absolutely. And I think that, you know, again, that issue that that James drop in 5042 in the Omnicom Repo is a I mean, really, really thorny, complicated issue, great analysis from from Greg. And then, Eliza, it must have been very satisfying to watch Greg close that out as a result of your work, which is for sure. Really, I mean, I think you are, rightfully I mean, this is a this was a ton of work on your part.
Bryan Cantrill:It was a lot of abstraction to ramp up on, a lot of history to ramp up on. And, Dave, how do you feel? Do you feel like your children have kind of come home with facial tattoos here? Or how do you feel about the,
Dave Pacheco:I'm glad you're not speaking literally right now. I mean, this is the stuff that comes out of really doing it, you know. It was a nice, simple idea in, whatever it was, January of 2021.
Bryan Cantrill:And January 5th of 2021.
Dave Pacheco:And it's really complicated when you start to do all this stuff. And I think we're still feeling our way through how we can improve the abstraction. It's all about trying to make it hard to get it wrong. Right?
Bryan Cantrill:That's right.
Dave Pacheco:And we've been talking about a lot of different ways we have tried to do that, and we're getting
Bryan Cantrill:better. We we are we're getting better. We're getting there. Yeah. And I think that it is, you know, Adam, has this have have we lived up to your your expectations for our our the the the saga of sagas?
Bryan Cantrill:That we
Adam Leventhal:Absolutely. No. This is delightful. Long time coming. I'm so delighted.
Adam Leventhal:This is such a fun conversation. And in fact, you were probably right to hold me off for a year and a half, because we've learned so much about sagas: about, you know, what we got right, what we got wrong, what was complicated, some of the trickiest cases. So this has been truly an epic of sagas.
Bryan Cantrill:It is an epic of sagas. So, everyone, thank you. Dave, obviously, thank you for, you know, your early work on this. And Andrew, Greg, obviously, and then James, and then especially Eliza, for all the work that you've done, and for motivating this particular discussion with this big body of work that you just got integrated. Congratulations. Very exciting.
Bryan Cantrill:And future users of the Oxide rack will not know the pain they never had because of this work. So, terrific, terrific stuff. Alright, Adam, we'll have to keep working on that list, the dream list. We'll keep working on it.
Bryan Cantrill:But, thank you very much, everybody, and thanks for joining us. We'll see you next time.