Oxide and Friends | Transcript: Debugger-Driven Development

Debugger-Driven Development

June 16, 2025 / 01:34:49/S5 E20

Adam Leventhal: 00:00

Was just talking to you with my microphone muted for a little bit, but it wasn't that long.

Bryan Cantrill: 00:04

Oh, one of those. One of those where I'm being like, I'm being ignored. It's like actually although, you know, I had one of those today where I actually I actually hand on heart thought I like, I had a proposal or something in a meeting and everyone's everyone's driving on I'm like oh I was muted. I went back and like I actually wasn't muted. Actually everyone heard me and really thought my idea was my idea was a stinker.

Adam Leventhal: 00:25

We just wished we hadn't. Yes.

Bryan Cantrill: 00:27

Exactly. He's

Adam Leventhal: 00:29

Oh, sorry. I was mute. Oh, nope.

Bryan Cantrill: 00:32

I was not muted. No. Never mind. Oh, god. It's happening.

Bryan Cantrill: 00:36

I'm not muted right now. Delightful. Alright. We gotta I are you yeah. You gotta we gotta bring it up.

Bryan Cantrill: 00:43

Let's bring it.

Adam Leventhal: 00:43

We don't have to. They can just help themselves up. I know that, Bryan, you feel like you need to be invited or invite yourself up.

Bryan Cantrill: 00:48

No. I do invite myself. Don't I invite myself? I I I but I have to, like, raise my hand to get up on stage, and I feel that they Sure. Yeah.

Bryan Cantrill: 00:55

Know your your hand's shaming me again. Yes. Like we've hey, are Tomax's name on here? Because I don't see them or do I? I do see them.

Bryan Cantrill: 01:02

There they are.

Adam Leventhal: 01:02

Yeah. Yeah. Yeah. Craig and G R. Yeah.

Adam Leventhal: 01:04

They're doing their thing.

Bryan Cantrill: 01:05

They're doing the thing. Alright. Wait. You know, I'm sorry. I was listening to our episode from last week, which is great, with Steve Klabnik.

Bryan Cantrill: 01:12

And you had told me that you had done some audio editing that you were particularly proud of. And I don't feel that I am exaggerating when I say that this is some really strong work.

Adam Leventhal: 01:32

I told you that? Because I think my words were like, I'm not overselling it. This is the best thing I've ever done in my life.

Bryan Cantrill: 01:41

And I I just I emphatically agree. I when it actually when I hit it, I was I just burst out laughing. I really was laughing out loud. It was very, very fun.

Adam Leventhal: 01:51

Good. Well, the other thing I really enjoyed was the visual Easter egg in the YouTube video, which did cause me to, like, miss a a significant fuck up in the audio.

Bryan Cantrill: 02:02

But but in terms

Adam Leventhal: 02:04

of I have no regrets.

Bryan Cantrill: 02:06

If you if you undersold the the JJ horn, I feel that that one I mean, we're not gonna use the word oversold, but I Okay. I love that. Yeah. So that and then I also loved that we had the the chime. We had the the chime that we use whenever we reference a previous Oxide and Friends episode.

Bryan Cantrill: 02:25

I love the fact that our chime, it is not actually like a happy sound that like the the Mac SE would use when it boots for example.

Adam Leventhal: 02:33

No. It is it is the boot sound. It's the happy boot sound. I just took you on long error sound.

Bryan Cantrill: 02:38

Isn't it like there's no boot device found?

Adam Leventhal: 02:40

No. I took you on a long walk that that that was confusing when I introduced that sound. But no. It is the it is the happy, like, MacBook has survived post. Yes.

Bryan Cantrill: 02:51

I thought that was like we have not survived post. It's got a bit of an angry undercurrent to it. I I thought it it's got a little bit of a what I liked. I kinda like the fact that's like a it it it it that fight or flight reaction that it is. Anyway, we you were ringing the chime actually quite a few times in the the previous episode previous episode.

Bryan Cantrill: 03:09

But then when I actually prompted you to ring the chime, you did not ring the chime. And people were wondering, is this like passive aggressive? And my view is, and I think it's actually the truth, just like, dude, you wanna try audio editing this thing? This is a lot of work. This is like, I'm actually very busy with this visual gag at the end of this thing that is way more important than anything else right now, which is that we are slowly panning the image to me on my phone, manipulating it while I guess it was at the meta kind of point of that?

Bryan Cantrill: 03:38

Yeah. Yeah. Okay.

Adam Leventhal: 03:40

But but no, it was aggressive, actually.

Bryan Cantrill: 03:44

It was actually wasn't it me?

Adam Leventhal: 03:45

No. No.

Bryan Cantrill: 03:46

Oh, go go go. Yeah. No. It wasn't. Was

Bryan Cantrill: 04:03

it I mean, I kinda

Adam Leventhal: 03:50

wouldn't But I would say that you and nobody else on Mastodon enjoyed it. Everyone on Mastodon was very clear that they don't like the chimes. And to the folks on Mastodon, I say, I'm sorry you don't enjoy them, but

Bryan Cantrill: 04:03

I was just enjoying them, but the chimes are staggering. I love the chimes. And I also kind of love the fact that it's like nirvana. You can't seek it. You can't ask for the chime.

Bryan Cantrill: 04:13

The chime occurs because you've made a natural reference to a previous episode. You can't call for it, which I also like. So I think the audio editing was really absolute top notch. The visual editing, pretty good.

Bryan Cantrill: 04:29

Pretty good. And I can also see reasons why that would be important to you and that you'd be proud of it. So that was pretty good, honestly. It was better than pretty good.

Bryan Cantrill: 04:40

I I did enjoy it. Well, I feel like Slow pan into it and out of it. I thought

Adam Leventhal: 04:43

that Well, like an an audio I mean, a visual gag on a YouTube video that you have surely backgrounded and also plays out over the span of twenty five minutes is, you know, arguably a little bit subtle.

Bryan Cantrill: 04:57

You definitely you know, I'm sorry that I was not wearing a helmet cam when I was actually watching that because you would have been I had much more satisfaction because I was in the grocery store like shopping as I'm listening to it. And so I'm like, you know, I I'm like

Adam Leventhal: 05:11

Adam told me there's a visual gag. I better get my

Bryan Cantrill: 05:13

The visual gag, get the peanut butter, it looks like one thing, and then I'm over to getting like the butter and it looks like something else. I'm like, wait, what what did that happen? So it was like, you know, it would it would have given you much more satisfaction to like me retracing my steps and, you know, somewhere between the peanut butter and the butter actually.

Adam Leventhal: 05:27

You're welcome.

Bryan Cantrill: 05:28

Good stuff. It was good stuff and I I really it was a fun episode. That was a a fun conversation received. No fun that they actually had a another follow-up conversation on Matthew's podcast. So it's kinda weird.

Adam Leventhal: 05:39

Yeah. And like some other lesser podcast as well.

Bryan Cantrill: 05:41

It's it's Well, clearly. Yeah.

Adam Leventhal: 05:44

If and I mean, lesser lesser than Matthew's. I mean Right. Yeah. Apparently, it's really put him on the the podcast circuit.

Bryan Cantrill: 05:52

There there you go. Well, I

Adam Leventhal: 05:53

was So to our guest tonight.

Bryan Cantrill: 05:55

Yes. And speaking of the podcast circuit, I am so excited about this topic. I was convinced we'd already hit on exactly this. So we are talking about debugger-driven development. And this is prompted in part because we talked about the demo Fridays that we do.

Bryan Cantrill: 06:15

And we had a demo Friday last Friday, which was great. And all three of these cats here, Dave, Eliza, John, had demos that had at their epicenter a debugger, to demo kind of other things. So I wanna kinda get into that because it did remind me, like, man, this is such a great topic. And, Adam, I don't think we've talked about OMDB at all. Dave, have we even mentioned OMDB?

Bryan Cantrill: 06:43

I think we probably mentioned OMDB in the sagas episode. I think we must have hit on it at least a little bit, but I'm not sure you That's a question.

Dave Pacheco: 06:50

I don't remember.

Bryan Cantrill: 06:53

But we and you know that that the Dave, think the last time we had you on was like nine months ago, which is kind of amazing.

Dave Pacheco: 06:59

Was that the sagas one?

Bryan Cantrill: 07:02

The saga yeah. I think the sagas one was like Yeah. In August, I think.

Dave Pacheco: 07:05

Which is Yeah.

Bryan Cantrill: 07:07

Kinda crazy. Again, sorry to Jim. But, well, obviously, Eliza was on that one as well. But I wanna actually rewind, if we can. So what we did talk about, Adam, and you were referring me to it, was our what is a bug?

Bryan Cantrill: 07:22

What is a debugger? Right. Back from the that that was like

Adam Leventhal: 07:25

2021. Right? That was like an early Twitter space. Early.

Bryan Cantrill: 07:29

Yeah. Did you listen to it? No. It is early in all of the ways. Let me just say that.

Adam Leventhal: 07:36

Little rough.

Bryan Cantrill: 07:36

At some point, I accidentally mute you, and I can't I'm like, Adam what happened to you? Like where are you? Adam. It's like is Adam there? What's going on with Adam?

Bryan Cantrill: 07:44

Adam is gone. And then like I have muted Adam. So anyway it was there was a there a whole bunch of just Also

Adam Leventhal: 07:53

shows the evolution of editing because apparently we just left that in too.

Bryan Cantrill: 07:56

And we we just left all that in. Yeah. That that garbage is now living forever. But we which it was an interesting discussion, but it was really not hitting exactly the topic that we wanna hit today, which is really using a debugger not merely to debug a system, but as part of developing that system.

Dave Pacheco: 08:17

And, you know, I was

Bryan Cantrill: 08:18

kinda thinking back. I mean, Dave, as I was kind of approaching you about the possibility of taking on the topic, you asked me, and I'm not sure if it was like a rhetorical question or like a test of proof of life, but you did ask me, like, do you mind if we start with a historical perspective? I'm like, are you making fun of

Bryan Cantrill: 08:37

me right now?

Bryan Cantrill: 08:38

I mean, like, we're obviously gonna start with this. Right? Right? I mean, it's not.

Dave Pacheco: 08:42

I did. I was wondering if that was a silly question.

Bryan Cantrill: 08:45

It's definitely like I mean, it's I mean, for sure. I mean, again, I'm not sure if it's a trick question or not. But so actually, do you wanna kick us off a little bit and maybe hit on that historical perspective? I'll obviously I I think, you know, we've obviously got our own historical perspectives as well. But

Dave Pacheco: 09:00

Yeah. So, I mean, I don't know how far to go and how much to dive into some of this stuff. For me, there's a clear line here starting with MDB and going through a whole bunch of different tools at Joyent. But I think definitely starting with MDB makes sense.

Dave Pacheco: 09:18

And so for me anyway, by the time I started my career, MDB already existed. It was new to me, right? But it already existed. And MDB is the Solaris and illumos debugger.

Dave Pacheco: 09:31

Right? What what release was it introduced in?

Bryan Cantrill: 09:34

Solaris 7.

Dave Pacheco: 09:35

Solaris 7. Yeah. I feel very silly talking about MDB history because I wasn't there and you were, and I'm like, what? I'm gonna get all of this wrong. So for folks that don't know, MDB is the debugger introduced in Solaris 7. A lot of people, when they hear the word debugger, I think, think of a very specific thing, which is something that can attach to, like, a process or a kernel and can control execution of that thing and print out very low-level state, like global variables, stack information, and stuff like that.

Dave Pacheco: 10:09

Right? And that is definitely in the realm of what MDB does. But what makes it different from most of the other debuggers like that that I've used, certainly at the time, is that it provides much richer support for adding your own commands for interpreting the in-memory structures. So, like, if you're debugging the kernel, you can do things like ::walk proc, which walks over all the processes. It's an iterator that just enumerates all of them.

Dave Pacheco: 10:36

And then you can print the proc_t, but you can also do something like ::ps, which is a dcmd that prints out basic information about each process. And you can use this to do pretty sophisticated things. So you can walk the processes and then pipe that to ::walk thread. I don't actually remember whether you have to do that if you're then piping it to ::stacks.

Dave Pacheco: 10:55

And ::stacks is, I think, just a great example, because ::stacks (or is it thread stacks? I don't remember which) takes a whole bunch of threads on input and groups them. It essentially creates a histogram of the unique stack traces, and also provides filters and stuff. So for example, if I'm on a system where I think the I/O is hung, I might open up MDB and do ::stacks -m zfs, and that shows me all the stacks from ZFS, and it's like a frequency count of all of them. And very often, you're looking for either the thing where there's a zillion stacks or the thing where there's only one stack and it's different from all the others.

Dave Pacheco: 11:29

It's often like one or the other. But the point is really that it's providing all these much higher-level tools for summarizing the state and also diving into individual pieces of state. Right? For me, this is actually the bigger takeaway from MDB. I'd say 90% of my use of MDB was either on a core file or a stopped process or a kernel that was live, but where the fact that it was changing was not relevant.

Dave Pacheco: 11:57

I wasn't trying to control execution or anything. I was just trying to get state out of this thing.
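The grouping Dave describes for ::stacks can be sketched in a few lines. This is only an illustration of the idea (collapse identical stacks into a frequency count, optionally filtered by module), not MDB's implementation; the thread data, the `stacks` function, and its `module` parameter are all made up for the example.

```python
from collections import Counter

# Hypothetical thread stacks: each is a tuple of "module`function" frames,
# innermost frame first. A real debugger would read these out of a live
# kernel or a crash dump; here they are hard-coded sample data.
THREADS = [
    ("zfs`zio_wait", "zfs`dmu_tx_wait", "genunix`write"),
    ("zfs`zio_wait", "zfs`dmu_tx_wait", "genunix`write"),
    ("zfs`zio_wait", "zfs`dmu_tx_wait", "genunix`write"),
    ("zfs`txg_sync_thread", "unix`thread_start"),
    ("genunix`cv_wait", "genunix`taskq_thread", "unix`thread_start"),
]


def stacks(threads, module=None):
    """Group identical stacks; return (count, stack) pairs, most common first.

    If `module` is given, keep only stacks with at least one frame in it.
    """
    if module is not None:
        prefix = module + "`"
        threads = [s for s in threads if any(f.startswith(prefix) for f in s)]
    return [(n, s) for s, n in Counter(threads).most_common()]


if __name__ == "__main__":
    for count, stack in stacks(THREADS, module="zfs"):
        print(count, " <- ".join(stack))
```

With the sample data, the filtered output has one stack seen three times and one seen once, which is the "zillion versus the odd one out" shape Dave mentions looking for.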

Adam Leventhal: 12:02

And part of your point, Dave, I think is that what you can do is application specific. Right? There were these modules and commands that were customized for the kernel, even for subcomponents of the kernel. Not just, like, read and write bytes, which it could also do, but there was this much richer interface that could be extremely application specific.

Bryan Cantrill: 12:24

Totally. And in particular, the modules themselves were written in the same language as the system, namely C. So there's a first-class module API, and you can write this dynamic module that can then be loaded in. And we've got a bunch of first-class notions in there that were useful for writing debugging commands. So you mentioned, Dave, the idea of walkers, which is our way of, one of the things we noticed is, like, we're constantly iterating over things.

Bryan Cantrill: 12:54

And you wanna have the intelligence over how to iterate over something, and separate that out from the intelligence that knows how to do something with that thing. And, you know, a lot of that came from, I mean, we were using MDB to actually debug Solaris 7. So the history here is that Mike and I were the gatekeepers for Solaris 7, and it fell to us to debug all these different problems. And I actually had developed something called dump information, DI. Do you remember this?

Bryan Cantrill: 13:29

Never knew about DI. Yeah. And we really wanted to do something much more first-class. And so Mike, when he started at Sun, one of the things he did was actually improve the Solaris 2.6 boot time, which resulted in the invention of the pgrep command, which has since been ported to many other systems. But pgrep has its origins there in that work.

Bryan Cantrill: 13:52

And then immediately after that, we started work on MDB, which stands for the modular debugger, putatively. We all know it actually stands for Mike's debugger. And MDB, by having this idea that you could write modules in C with a really first-class API, you could do really, really rich debugging support, originally for the kernel, but then also extended up into userland. And that came from kind of seeing what folks had done with crash, which is an SVR4 utility that was much more powerful, we felt, than adb, but then also had a bunch of limitations. So that was the rationale for it. And then some of the early things that we developed turned out to be like, wow, okay.

Bryan Cantrill: 14:36

This is really clearly the right model. So we, you know, had the ability to walk every kmem cache. And then for every kmem cache, you could walk all the memory that had been allocated there. And then you could paste those things together to do ::whatis, which allows you to take an address and ask, what is this thing? And it would tell you everything it knows about that thing.

Bryan Cantrill: 14:56

And putting that together with some of the debugging support we had in the allocator, we were able to do some really, really powerful things.
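To make the walker idea concrete, here is a minimal sketch of how two iterators can be pasted together into a "what is this address?" command. It assumes a toy model in which each cache is just a list of (base address, size) buffers; the cache names, addresses, and function names are hypothetical and do not reflect the real MDB module API, which is written in C.

```python
# Hypothetical model: each cache maps to a list of (base_address, size)
# buffers that have been allocated from it.
CACHES = {
    "kmem_alloc_64": [(0x1000, 64), (0x1040, 64)],
    "proc_cache": [(0x2000, 512)],
}


def walk_caches(caches):
    """Walker: knows how to iterate over the caches, and nothing else."""
    yield from caches


def walk_bufs(caches, cache_name):
    """Walker: knows how to iterate over one cache's allocated buffers."""
    yield from caches[cache_name]


def whatis(caches, addr):
    """Command built from the two walkers: which buffer contains addr?"""
    for name in walk_caches(caches):
        for base, size in walk_bufs(caches, name):
            if base <= addr < base + size:
                offset = addr - base
                return f"{addr:#x} is {base:#x}+{offset:#x}, allocated from {name}"
    return f"{addr:#x} is unknown"


if __name__ == "__main__":
    print(whatis(CACHES, 0x1048))
    print(whatis(CACHES, 0x3000))
```

The point of the split is the one Bryan makes: the code that knows how to iterate is separate from the code that knows what to do with each item, so new commands can reuse existing walkers.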

Adam Leventhal: 15:03

Dave, this might sound quaint, but I'll tell you what blew my mind when I joined the Solaris kernel group: I think that we had just gotten type information into the kernel and therefore into the debugger. So

Bryan Cantrill: 15:15

Yes. That's

Adam Leventhal: 15:16

right. Extremely exciting. But the ::list command that let you say, like, okay, here's an address, and then here's how you get to the next item in a linked list, to then iterate over them. Like, that was just mind-boggling.

Adam Leventhal: 15:32

I know this sounds incredibly facile now, and we've built so much on top of that. But without that, you're sitting there kind of hand-iterating through everything. It was amazing in the moment.
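The linked-list iteration Adam describes boils down to "start at an address, follow a named member until you hit NULL". A minimal sketch of that idea, assuming memory is modeled as a dictionary from address to struct fields (the addresses, field names, and the cycle guard are assumptions for the example, not a description of how ::list itself works):

```python
# Hypothetical memory image: address -> struct fields.
# Address 0 plays the role of NULL.
MEMORY = {
    0x100: {"name": "a", "next": 0x200},
    0x200: {"name": "b", "next": 0x300},
    0x300: {"name": "c", "next": 0},
}


def walk_list(memory, addr, next_member):
    """Yield each node address, following next_member until NULL or a cycle."""
    seen = set()
    while addr != 0 and addr not in seen:
        seen.add(addr)
        yield addr
        addr = memory[addr][next_member]


if __name__ == "__main__":
    for node in walk_list(MEMORY, 0x100, "next"):
        print(hex(node), MEMORY[node]["name"])
```

Type information is what makes the `next_member` argument possible: without it, the debugger has no way to turn a member name into an offset, and you are back to iterating by hand.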

Bryan Cantrill: 15:45

Absolutely. And then, as I think Jason's saying in the chat, by having pipelines, it's a first-class idea. You got all that power of Unix where you could actually string your ::list dcmd together with a bunch of others. You could have these very complicated invocations that allowed you to run really sophisticated queries of the state of the system. And really, I think, extraordinarily valuable stuff. So by the time, Dave, you joined us in 2008, MDB was definitely well established, I think.

Bryan Cantrill: 16:16

We'd

Dave Pacheco: 16:17

For sure. Yeah. We had a bunch of dcmds and stuff for akd already at that point. That's the application we were working on at the time.

Bryan Cantrill: 16:27

That's right. Yeah. Yeah. Right. Right.

Bryan Cantrill: 16:29

Right. Yeah. I'd forgotten. Because akd was the appliance kit daemon that we developed at Fishworks, and we had done a bunch of MDB support for it. I'd forgotten that.

Bryan Cantrill: 16:41

Alright. So, Dave, as I was kinda reflecting on my own history of debugger-driven development, if you'll forgive me a slight tangent on this: when I was developing the cyclic subsystem, which is how we do high-resolution timers in the operating system, and which replaced the traditional interval timer that we use for the system clock, MDB existed before that. Right? I think I introduced that facility in Solaris 8. We had already introduced MDB in Solaris 7.

Bryan Cantrill: 17:14

And so the dcmds that I used for cyclics, I developed very much in concert with cyclics. For many of those dcmds, I'm convinced I am the only one in human history that has run them, but they were extremely valuable for validating my internal state on something that was still in development. And that's part of what I think our theme is today, because I think what you all have done is really taking that to much grander levels. But that idea that, as you develop the system, you're developing the debugger support, and the debugger support is then accelerating you in developing the underlying system, that was certainly true for me in cyclics.

Dave Pacheco: 18:02

Yeah. Definitely. Yeah. That's really interesting. I know exactly that feeling you mean.

Dave Pacheco: 18:08

I have these commands as well, and some other stuff too, where I'm like, I think I'm the only person that's ever run this, but it was so critical.

Bryan Cantrill: 18:16

Yeah. So do you wanna fast forward to kind of Joyent days and some of your use of MDB there and kind of your expanded thinking of this?

Dave Pacheco: 18:25

Yeah. So, in 2010, I left Sun and went to Joyent to work with Bryan on cloud analytics. I think it's fair to say the very short version of that is kind of distributed DTrace. That was the vision, which was that you could make some API request to a public API for this cloud and have that start running a bunch of D scripts on your application that was distributed on a bunch of nodes in the cloud. And, you know, obviously, we had to debug that thing. So it was a microservices thing. It had these instrumenters running on each of the nodes, and it had an aggregator service that was aggregating all the data and a config service that was taking all the requests and figuring out what to do.

Dave Pacheco: 19:04

And you could, at least in principle, run into all these problems where, like, this one node isn't instrumenting something it should be. So you'd wanna go look at the state of that node, or maybe you wanna look at the state of the aggregator or whatever. And so we had this tool for doing that. It was a very ad hoc tool for basically querying each of these things. And it's pretty different from MDB in that it involves the cooperation of the thing that you're trying to debug, and that is a pretty big difference

Bryan Cantrill: 19:33

that Yeah. Okay. Elaborate on that a little bit because I think yeah. That's actually Elaborate on that.

Dave Pacheco: 19:37

Yeah. So MDB on the kernel, actually, MDB in general: on the kernel, and when you're attaching it to a process or using it on a core file, the thing itself that you're debugging, whether it's the kernel or the process or a dead process in the case of a core file, is not really helping at all. All the logic for figuring out what to show you, how to interpret these things, how to walk these data structures exists in the debugger, and it's reading stuff out of memory or out of the image in the core file and processing that in the debugger. Which is very different than, for example, what this thing was doing, which is making an, actually, it was an AMQP message, but an AMQP or HTTP request

Bryan Cantrill: 20:15

Oh, the eyes starts twitching. Yes. Yep. Sorry. Go on.

Dave Pacheco: 20:18

Not all the best choices, I think, but but a choice we made. But you're making a request to that thing, and it is answering that request with some information. And that very much limits the kinds of things you can do potentially. Like, you can't like, for example, in the AMQP case, if the problem was that it's not draining its message queues fast enough or it's not connected to the AMQP broker or something like that, well, you're not really gonna be able to debug that with this. But there are a lot of other types of things that you can debug.

Dave Pacheco: 20:46

But for failures that are not fatal and not pathological in that way, it's still very useful. And it's kind of important because, again, we're doing this on a cloud where you might be instrumenting, like, dozens of systems at a time, which felt like a lot. And you can't, like, attach a debugger to all these systems and inspect the state on each of these things. Right? You need some way of automating that part of it and the collecting of information. I guess we could have tried to automate that using some in-memory thing, but at the time, we definitely didn't have any way to read state out of the Node process.

Dave Pacheco: 21:18

We ended up building that later, which I wasn't actually gonna talk about, but, anyway, that would have been a harder path, I think. Yeah. But at some point, what I was realizing was that this was kind of a general problem. Namely, we have some services with a bunch of objects. In this case, they were instrumentations.

Dave Pacheco: 21:42

And the services had different views on some of the same objects. Like, a user might create an instrumentation, and there might be 10 instrumenters that each have their own view of the instrumentation, which is basically, what D script am I supposed to be running? The aggregator has a view, which is, like, what data do I have? How am I supposed to be aggregating it? How long am I supposed to keep the data, and stuff like that?

Dave Pacheco: 22:03

And we wanted to create a sort of combined view, but none of that problem was really specific to the application. They're just objects. Right? And so I tried to generalize that into a component that I basically feel like never got enough love, I mean, from me, which I called Kang because I had the image of Kang and Kodos in

Bryan Cantrill: 22:20

the spaceship Yeah.

Dave Pacheco: 22:21

A little bit being, like, foolish earthlings, totally unprepared for the effects of, you know, whatever this thing is. But Kang was this really simple thing. You know, these were Node programs, and it was a little HTTP endpoint. It would be the exact same endpoint in every process. And the endpoints were: list the types of objects that you know about, and it would just be an array that's like the string instrumentation.

Dave Pacheco: 22:45

And then list all the instances you know about, which would be, like, a bunch of IDs, and then fetch information about those instances. And then the CLI for this thing, you would point it, with a bunch of IP-port pairs, at a bunch of these things, and it would go collect what it called a snapshot and assemble, like, it wasn't atomic, but a coherent summary of all of the objects from all of the parts of the system that you pointed it at. And we never ended up doing this for cloud analytics, I don't think. But we did use this in Marlin, which is the compute component of Manta, which we did at Joyent. Right. It's an S3-like object store with compute built in.
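The three queries Dave lists (list types, list instance IDs, fetch an instance) and the snapshot built from them can be sketched as follows. This is an in-process stand-in with no HTTP involved, and the class, method names, and sample objects are all hypothetical; it is meant only to show the shape of the merge, not Kang's actual interface.

```python
class Service:
    """Stand-in for one process exposing the three Kang-style queries."""

    def __init__(self, name, objects):
        self.name = name
        self._objects = objects  # {type: {id: view_dict}}

    def list_types(self):
        return sorted(self._objects)

    def list_ids(self, obj_type):
        return sorted(self._objects.get(obj_type, {}))

    def get(self, obj_type, obj_id):
        return self._objects[obj_type][obj_id]


def snapshot(services):
    """Merge every service's view into {(type, id): {service_name: view}}.

    Services are queried one after another, so the result is a coherent
    summary rather than an atomic one.
    """
    merged = {}
    for svc in services:
        for obj_type in svc.list_types():
            for obj_id in svc.list_ids(obj_type):
                views = merged.setdefault((obj_type, obj_id), {})
                views[svc.name] = svc.get(obj_type, obj_id)
    return merged


if __name__ == "__main__":
    instrumenter = Service(
        "instrumenter-1", {"instrumentation": {"42": {"script": "syscall"}}}
    )
    aggregator = Service(
        "aggregator", {"instrumentation": {"42": {"retention_s": 600}}}
    )
    for key, views in snapshot([instrumenter, aggregator]).items():
        print(key, views)
```

Keying the snapshot by (type, id) is what gives the combined view: each service's opinion about the same object ends up side by side, regardless of what the application's objects actually are.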

Dave Pacheco: 23:22

So the compute part of this would be that you could submit a thing that's like, I wanna run grep on my 10,000 log files. And those are distributed across, like, a couple of dozen or a couple of hundred servers, and we're gonna basically go run grep on all those servers and collect the output and make it available to you. And so this is totally a case where you've got job supervisors and you've got the things running the jobs, and they all know about jobs and they all know about tasks. And they know about other things too. Like, the job runners had this idea of zones, which were the containers we were running these things in, and we needed to know about the state of all those things.

Dave Pacheco: 23:55

And so all of this, we basically it was incredibly dumb and simple, but basically it would return just a summary of all these objects. But we were able to put together this dashboard that was incredibly useful. I have a screenshot here. I don't know if it's useful to can I, like, dump that in the chat?

Bryan Cantrill: 24:11

Yeah. Yeah. Definitely.

Dave Pacheco: 24:12

Is that, like, useful?

Bryan Cantrill: 24:14

Yeah. We've now granted the rights to drop images in the chat. We've had some permissions problems. Discord has not been without its Twitter Spaces-like challenges. But, yes, you can.

Dave Pacheco: 24:26

Did that show up? Yeah. So what you're looking at there, I redacted just some host names and stuff like that. This UI is very specific to the application, but the underlying transport is not. This thing basically was just a Kang client that knew how to go ask all these things, what objects do you know about, and then, like I said, it would assemble the snapshot.

Dave Pacheco: 24:52

And from this, we got this dashboard. I spent many hours looking at this. It's bringing back a lot of memories, most of them not that great, but

Adam Leventhal: 25:00

the the

Dave Pacheco: 25:01

black boxes here are containers essentially that are running user jobs, and the white ones are containers that are available to run some users' jobs. And so we would use this to go whether it was an outage or if we just wanna see what was going on in the system or how subscribed it was, we could go look at this and get a really quick summary of what was going on for this distributed system. And and the reason I bring this up is the vision for this was definitely MDB for a distributed system. That's like

Bryan Cantrill: 25:27

But 100% what I was trying

Dave Pacheco: 25:29

to go for.

Bryan Cantrill: 25:29

I have not thought about this image in a long time, and you've just brought back a flood of memories. Because, I mean, it's been damn near a decade, not quite a decade, but it's been a long time since we looked at this thing. Yeah. And, right, this was invaluable. This was so essential at the time.

Dave Pacheco: 25:51

Yeah. Yeah. Yeah. It was huge. I'm just looking at chat.

Dave Pacheco: 25:54

There's there's a bunch of good questions about this I'll have to come back to. The the disabled one is interesting. So what conditions result in disabled? Unspecified brokenness would result in some of these containers becoming disabled. And it was very helpful to get this at a glance view of, like, how many of these did we have and where were they?

Dave Pacheco: 26:09

And I think if you click through to some of the other tabs there, you could get, like, the error message or something like that. Those are basically problems that we didn't expect to have happen, and we wanted to know about them and dig into them. Yeah. So I really do view that as kind of a middle step here.

Bryan Cantrill: 26:27

A middle step here. Yeah. Yeah. Yeah. Right.

Bryan Cantrill: 26:28

No. I'd completely forgotten about old Kang. And I'm sorry. I'm sorry, Kang. But you're right. It really is a middle ground, and it was incredibly valuable at the time.

Bryan Cantrill: 26:43

So yeah. Interesting.

Dave Pacheco: 26:47

So I guess fast forwarding to Oxide, I was going back to the history of this and I was shocked at how recent

Bryan Cantrill: 26:54

It's very recent. It actually is. Yeah. It's very recent. For something that it's yeah.

Bryan Cantrill: 26:58

It's an idea whose time had clearly come. Yeah. Oh, okay. When is it? I almost wanna write it down on a piece of paper and flip it over.

Bryan Cantrill: 27:04

But I am gonna say, I think that that it's gonna be like I'm gonna say like February 2023. Am I still way far off?

Dave Pacheco: 27:13

Not way far off, but it is later than that. It is after we shipped our racks.

Bryan Cantrill: 27:20

You know, I should have thought right. Right. So it it was like August, September?

Dave Pacheco: 27:25

Yeah. September. It was September 2023. And I'm surprised too because for me, the motivation for OMDB specifically was a project I worked on and I had to have done that before we shipped the racks. So I'm surprised to discover that this didn't land until after that.

Dave Pacheco: 27:43

Right. I'm not sure what happened with that. But I remember

Adam Leventhal: 27:46

I mean, Dave, you were you were very reticent about writing it, I think, is is what happened. Is you're like, I think I should write this thing. Should I write this thing? I I think the same thing that happens for a lot of us, for a lot of tools, which is I think we need this tool for this thing I'm working on. But maybe I should just work on the thing I'm working on.

Adam Leventhal: 28:02

And then and then after you've done that, you're like, I really wish I had the tool.

Bryan Cantrill: 28:06

Yeah. I I think also if you're calling up Adam and and asking if you should write it, I mean, I guess it's like one step shorter than calling me up, but it's definitely a you're definitely not dialing for no when you're up with it. Yeah.

Dave Pacheco: 28:18

I did not remember that. Wow. So I mean, part of it I think is that there just wasn't a thing here. So, like, I mean, I don't know if we're gonna talk about this, but, like, it quickly became a place that you could just put a whole bunch of things that didn't I don't know how to describe it. Like, they didn't have to be super coherent.

Dave Pacheco: 28:39

I mean, I guess we should describe what it is this

Bryan Cantrill: 28:40

Yeah. Yeah. Describe what OMDB is. Yeah. Exactly.

Bryan Cantrill: 28:42

Read the tweet.

Dave Pacheco: 28:42

So OMDB is a program. It is not a debugger in the sense of a thing that controls execution. It's more of a debugger in the Kang sense. It's a thing that has a bunch of commands that go make requests to a cooperating component to fetch state about that component, and in some cases modify it. It can control some things as well.

Dave Pacheco: 29:04

And then summarize that for you. So it's intended mainly for people, although you can do some automation stuff with it. I mean, it has some facilities for that. And its top level commands all correspond one to one with components of our system. So, you know, there's a DB command, which you point at our CockroachDB database and is a Cockroach client.

Dave Pacheco: 29:26

It doesn't go through any other component. It talks straight to the database. And it has a bunch of subcommands for summarizing state out of the database. There's another command for Nexus, which is our main control plane component, and it makes requests to the internal Nexus HTTP API to, again, fetch state about that thing. And similarly for other components: our storage component Crucible, and sled agent, and a bunch of stuff.
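[Editor's sketch] The shape described here, one top-level command per system component with state-fetching subcommands underneath, can be illustrated roughly as follows. This is a toy, not OMDB's actual code or command set: the program name `toy-omdb`, the subcommand names, and the `handlers` table are all hypothetical.

```python
import argparse


def build_parser():
    """Build a CLI whose top-level commands map one to one onto components."""
    parser = argparse.ArgumentParser(prog="toy-omdb")
    components = parser.add_subparsers(dest="component", required=True)

    # Hypothetical "db" command: would talk straight to the database.
    db = components.add_parser("db", help="summarize state from the database")
    db.add_argument("what", choices=["instances", "sleds"])

    # Hypothetical "nexus" command: would call an internal HTTP API.
    nexus = components.add_parser("nexus", help="fetch state from the control plane")
    nexus.add_argument("what", choices=["background-tasks"])

    return parser


def run(argv, handlers):
    """Parse argv and dispatch to the handler for (component, subcommand)."""
    args = build_parser().parse_args(argv)
    return handlers[(args.component, args.what)]()
```

Adding a new component in this layout means adding one subparser and one handler, which is roughly the "skeleton" property described in the conversation.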

Dave Pacheco: 29:49

That's basically what it is. And on some level, like, that's all it is, and, like, that's not that interesting, I guess. But I think what's interesting is the process around it and the sort of, like, flow that it enables. That's been really good. So what I was getting at was, like, if I wanted to have a way to get information about Nexus before OMDB, there just, like, wasn't really a place to put that thing.

Dave Pacheco: 30:10

You could create an API, which is how it started. Like, you can create an internal API for it and you could, like, hit it with curl or you could make a custom CLI. This is basically how it started, but there was just enough vision to be like, this probably won't be the only time we're gonna do this. But once it was there and there was a skeleton for it, I think people were really off to the races in terms of, like, lots of stuff got added with the ability to talk to a lot more components, the ability to fetch a lot more state and summarize that state, dive into that state. And that has been just incredibly valuable on so many levels.

Bryan Cantrill: 30:45

And so, Dave, it was my memory that you demoed this, and everyone is like, where has this thing been our whole lives? And, I mean, it was just one of those things that, like, again, was an idea whose time had clearly come. And as you say, it's like, on the one hand, it's like, yes, you can do a custom CLI, and maybe you can view this as just like an extensible thing. But the way you did it invited other people to participate. I think this was the original strength of MDB as well.

Bryan Cantrill: 31:12

That module interface invited people to participate and to do like, hey, you don't have to worry about the whole thing. Just write your logic. You don't have to worry about the CLI wrapper and all the other kind of gunk. You can just write the thing that actually prints out state about your thing, and now you've got a way of getting it. And I wanna say that it was like the next week that you had some OMDB based functionality to demo.

Bryan Cantrill: 31:36

Am I remembering that correctly? It was very, shortly thereafter that other people were demoing OMDB based functionality.

Alan Hanson: 31:42

I I know it was too, but, I mean, what what Dave did was he he gave us all, like, a taste of crack. And it was like, look at this little it was so it was so addictive. Sorry. To create these little programs that could just give you this insight into what was going on.

John Gallagher: 32:03

But you can see this if you look at the PR history. You just search for OMDB and look at the closed PRs, and you look like Dave landed this on September 14. And within a week, like, Ben added stuff for Oximeter, and Sean added stuff for sled agent, and Alan added stuff for Crucible. Like, all this stuff landed. It was like, nobody knew this thing exist like, this thing hadn't existed.

John Gallagher: 32:23

And then it showed up. And literally within, you know, seventy two hours, it was like, oh, yeah. I need to put my thing here too.

Dave Pacheco: 32:30

Yeah. I remember one of the one of the things we did, like, probably right after the demo, we were like, well, we should have some, like I don't know if I don't know if we were, like, we should have some ground rules or what, but it would we needed, like, a little bit of, like, can I just, like, dump stuff in here? I do think there were people that came and were just, like, can I just, like, add something to this or whatever? And I added a block comment that was like the ground rules for OMDB, which you're probably linked to. The one Yeah.

Bryan Cantrill: 32:54

I'm not sure. I'm not sure. I think I've I've contributed to OMDB. I don't think I've read the ground rules yet. I was I was is there no running on the deck?

Bryan Cantrill: 33:00

Am I not supposed to do I have to shower before getting into the pool because No.

Dave Pacheco: 33:03

No. No. It's it's very much the opposite. I mean, for the most part. So the rule Okay.

Dave Pacheco: 33:07

Is like there's there's not a

John Gallagher: 33:08

lot of ground rules.

Bryan Cantrill: 33:09

It's basically I've been running on the deck all the entire time. That's great. Okay. Good.

Dave Pacheco: 33:12

It's it's it's explicitly because I because I think this did come up because people were like a little bit reluctant to be like, can I just like dump some random stuff in here that would be useful to me? And I was like, please do that. So the thing is, like, if you feel if there's something here that is useful, you should add it. I think there's there's a thing about safety in there, and we can talk about safety a little bit later. Like, it's a little bit you don't wanna have something that's, like, if someone runs this command with dash dash help and you, like, aren't parsing that argument and it just goes and, like, you know, shoots Nexus and or something like that.

Dave Pacheco: 33:42

You don't wanna have people accidentally doing bad things to systems. But but certainly read only stuff that's not gonna be disruptive, like, yeah, definitely put it in there. And then we built some some facilities later for making it so that it would be harder to accidentally run destructive things. The ground rule that I thought was pretty important and is not a surprise to anybody here is, debuggers should never lie. And so here.

Dave Pacheco: 34:06

I'm gonna I gotta find a link to this block comment.

Bryan Cantrill: 34:11

Debuggers yeah. Debuggers should never you know, it's funny because we actually talked about this in that episode from years ago. And I'd phrased it as: debuggers should only lie if they have to choose between lying and killing the patient. But I was just looking at that myself. I'm like, what kind of a choice is like, under what condition do you have to lie or kill the patient?

Bryan Cantrill: 34:30

Is that a choice that a debugger off? What kind

Adam Leventhal: 34:32

of trolley problem is this? Right?

Dave Pacheco: 34:34

Like, the poll

Bryan Cantrill: 34:35

debugger trolley problem. It's like, no. I actually question the premise. Like, surely there's an option where we could decelerate the trolley without actually killing anybody. So yes.

Bryan Cantrill: 34:44

Don't kill the patient. Also, don't lie. You can do both. Hold yourself to a higher standard.

Dave Pacheco: 34:49

And it it sounds obvious when you just say it. Right? But the the example in the comment is like, we think of things like the list of instances on a sled as like a well defined thing. And in a working system, that is a well defined thing. But there are many different things you might mean by that.

Dave Pacheco: 35:03

There's the list of propolis processes. There's the list of things that sled agent is tracking. There's the list of things that nexus is tracking that are running on a sled. And in a broken system, any of these might be different. And so the advice here, such as it is, was basically like be precise about what you're saying.
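[Editor's sketch] One way to act on "be precise about what you're saying" is to label every answer with the source it came from and to report disagreements rather than merging them. The function below is an illustration under that assumption; the source labels and instance IDs are invented for the example and are not OMDB output.

```python
def compare_instance_views(views):
    """Report instances that the labeled sources disagree about.

    `views` maps a source label (for example "propolis processes") to the
    set of instance IDs that source reports. Nothing is merged: each line
    of the report says exactly which sources saw the instance.
    """
    all_ids = set().union(*views.values())
    report = []
    for inst in sorted(all_ids):
        seen_by = sorted(name for name, ids in views.items() if inst in ids)
        missing = sorted(name for name in views if name not in seen_by)
        if missing:
            report.append(f"{inst}: reported by {seen_by}, absent from {missing}")
    return report
```

In a healthy system the report is empty; in a broken one, each line names the sources so the reader is never told "this is the list of instances" without knowing whose list it is.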

Dave Pacheco: 35:18

You know, don't say this is all the I mean, that one's just vague potentially, but, like, you don't wanna confuse someone into thinking that they're looking at something that they're not. And so if anything, it's probably overly pedantic because a lot of the stuff I write winds up being kind of overly pedantic about this, but in the name of really trying not to lie. And then there was one that got added later that was just basically, like, try hard to keep going even when things aren't quite what you expect. So as an example of that one, the db subcommand talks to the database, and it uses our regular library for doing that, and the thing it does is check whether the database schema matches what it expects it to be based on what version of the software it was built for. And if it doesn't, our production stuff just, like, bails out because, like, something horrible has happened.

Dave Pacheco: 36:03

But the debugger tries. It's, like, it prints out a message. It's, hey, this isn't quite what I think, but I'm just gonna run anyway. I just wanna let you know that this might might be wrong, but it might also be exactly what you need right now. I don't wanna Yeah.

Dave Pacheco: 36:14

Not give you information because I'm not sure it's right.
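[Editor's sketch] The contrast described here (production code refuses to run on a schema mismatch, while the debugger warns and keeps going) can be sketched as a single check with a strictness switch. The names `check_schema` and `SchemaMismatch` and the version strings are hypothetical; this is not the library Dave is describing.

```python
import sys


class SchemaMismatch(Exception):
    """Raised in strict mode when the schema is not the expected version."""


def check_schema(found, expected, *, strict):
    """Return True if versions match.

    In strict mode (the production behavior described above) a mismatch
    raises. In non-strict mode (the debugger behavior) a mismatch prints a
    warning to stderr and returns False so the caller can carry on.
    """
    if found == expected:
        return True
    message = f"schema is {found}, but this tool was built for {expected}"
    if strict:
        raise SchemaMismatch(message)
    print(f"warning: {message}; output may be wrong, continuing anyway",
          file=sys.stderr)
    return False
```

The key design point is that the debugger path still tells the user about the mismatch, which keeps it consistent with the "never lie" rule.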

Bryan Cantrill: 36:17

So and this is where you do get into the the kind of the MDB zeitgeist of not having total system cooperation when you're debugging it. So with this is where we may have a system that itself is pathological. So things may be unexpected in the system and we still want to try to drag ourselves along and do the best that we can. Right. I thought that was there

Alan Hanson: 36:39

from the beginning, Dave.

Dave Pacheco: 36:41

That oh, that rule? That rule?

Alan Hanson: 36:43

Yeah. It was I remember you you I remember you explained to me, like because we're things are messed up and we're trying to figure out what's happening. Do the best you can at not blowing up.

Bryan Cantrill: 36:55

Alan, are you just rude to emphasize the fact that you have read the ground rules and the rest of us haven't? I feel like this is this feels unnecessarily sharp. I mean, I just feel like this is pointed. I don't know. Maybe I'm I'm overly personalizing it, but

Dave Pacheco: 37:05

I don't know.

Alan Hanson: 37:06

I found you.

Bryan Cantrill: 37:10

The so but the it's other thing I think is interesting, Dave, is like this so this OMDB has been hugely helpful for us. And this is as, as Andrew on Blue Sky pointed out, that the I'd asking that or actually, excuse me, asserting that Oxide is secretly a cult dedicated to producing as many new debuggers as possible. That we are which is not true because we're actually I I don't know how that can square with us being a podcasting company. Or maybe that's what it is. We're secretly a cult dedicated to producing as many new debuggers as possible for purposes of content generation, which is I I I I buy that.

Adam Leventhal: 37:44

Sounds right.

Bryan Cantrill: 37:46

That's right. Because, I mean, this is not the only debugger we've created at Oxide, and Humility has also got some similar zeitgeist, but I don't think you've used Humility that much, Dave. Is that

Dave Pacheco: 37:57

I mean,

Bryan Cantrill: 37:57

I think OMDB is actually coming much more out of the kind of common engineering approach we have across the company rather than direct inspiration from, I think, Humility. I think it's much more inspired by MDB than Humility.

Dave Pacheco: 38:12

I think that's right. Yeah. But I mean, there's obviously common threads there. I mean, common ancestry there. Right?

Bryan Cantrill: 38:18

Yeah. For sure. They're they're cousins.

Dave Pacheco: 38:20

Yeah. Right.

Bryan Cantrill: 38:22

Yeah. And not by marriage. They're blood cousins, Humility and OMDB. But Humility has also been really important for us as we have developed Hubris. It's similar I'm just reading your ground rules for the first time, along with that.

Bryan Cantrill: 38:37

Although I do love the fact that the the the ground rules, that there aren't a lot of ground rules makes me feel a lot better about not having read them. But it is similar kind of like because you're trying to give people the the permission and the flexibility to develop debugging infrastructure quickly. That's I'd be I feel that's kind of the the the meta point of that. Is that is that a fair read?

Dave Pacheco: 38:59

That's exactly right. Yeah. I mean, the rules, such as they are, were actually supposed that's like an invitation. Like, here's the bounds. It's pretty wide.

Dave Pacheco: 39:05

Please Yeah. Stuff.

Bryan Cantrill: 39:07

Add stuff.

Dave Pacheco: 39:08

Great. And and people did.

Bryan Cantrill: 39:10

Yeah. And and John, as you're pointing out, like, people all kind of lunged in. Do you wanna describe some of your some of the early things that you built on top of OMDB and how it was important for your own development process?

John Gallagher: 39:20

Oh, no. Ask me to go back two years. Man.

Bryan Cantrill: 39:24

Or or you you could just go back to Friday, honestly. If if it's easier to go back to Friday and talk about your most recent stuff, that's fine too.

John Gallagher: 39:30

Well, think I think maybe the easiest thing to talk about is just reconfigurator in general. Right? So I think

Bryan Cantrill: 39:35

we have Absolutely.

John Gallagher: 39:36

Reconfigurator has been a topic on the podcast to an extent, think. Is that right? I Have we talked about that?

Dave Pacheco: 39:44

I don't think we have very

Bryan Cantrill: 39:45

much. The only way we're gonna know is when we re-listen to this after it's edited and listen for a chime. If we don't hear a chime, then we haven't talked

Adam Leventhal: 39:53

about it before. Wow. You've got a lot of faith in the editor.

John Gallagher: 39:56

Well, I mean, I should probably cede the floor back to Dave. Dave, do you wanna give some context for what Reconfigurator is?

Dave Pacheco: 40:02

Oh, boy. Yeah. So Reconfigurator, this is the shortest version I can do. Reconfigurator is the component of the control plane that's responsible for changes to the topology and configuration of the control plane itself. So if we need to go add components, remove components, upgrade components, and stuff like that, they all go through Reconfigurator, and it uses this plan-execute pattern where it creates these things called blueprints, which are pretty complete descriptions of the system as it should be, and then hands that off to an executor that is basically a reconciler pattern thing that will go and make reality match that thing.

Dave Pacheco: 40:36

And so when you wanna go do a system upgrade, you go through reconfigurator, you go through, like, a sequence of, like, a whole bunch of blueprints and then execute each one. And each one of those is like one step in the change. Does that make sense? Is that a good summary?
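[Editor's sketch] The plan-execute pattern described above, where a blueprint describes the system as it should be and an executor reconciles reality toward it, can be sketched in miniature. Here a "blueprint" is just a dict from component name to version; that representation and the function names are assumptions for illustration, not Reconfigurator's real data model.

```python
def diff_blueprint(current, blueprint):
    """List the steps needed to move `current` to match `blueprint`."""
    steps = []
    for name in sorted(blueprint.keys() - current.keys()):
        steps.append(("add", name, blueprint[name]))
    for name in sorted(current.keys() - blueprint.keys()):
        steps.append(("remove", name, current[name]))
    for name in sorted(current.keys() & blueprint.keys()):
        if current[name] != blueprint[name]:
            steps.append(("upgrade", name, blueprint[name]))
    return steps


def execute(current, blueprint):
    """Reconcile: apply every step in the diff and return the new state."""
    state = dict(current)
    for action, name, version in diff_blueprint(current, blueprint):
        if action == "remove":
            del state[name]
        else:
            state[name] = version
    return state
```

Because the executor always works from the difference between reality and the blueprint, running it a second time against the same blueprint produces no steps, which is the property that makes a sequence of blueprints usable as the steps of an upgrade.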

Bryan Cantrill: 40:50

That is an extremely good summary, I feel. And this problem is important. This is not an abstract problem in a rack scale computer because this is like, you know, if you have a component fail in a sled and that sled needs to be removed, what you actually wanna do is reconfigure the system around the absence of that sled. And that is not trivial to do. And when you wanna add a sled, it's like, how complicated could it be to add a sled?

Bryan Cantrill: 41:22

Just like plug it in. It's like, well, yeah. But now you actually need to reconfigure the system around the presence of this new sled. So this is a very concrete problem that was very important to us and to the earliest users of the Oxide rack.

John Gallagher: 41:37

Yeah. I think to put a really fine point on the thing you just said, what if the sled you just lost was the one that was trying to reconfigure the system in some other way? Right?

Bryan Cantrill: 41:45

And you

John Gallagher: 41:46

have to be able to handle that kind of problem. So Reconfigurator, right. We've mentioned this in passing. It's sort of the underlying system by which we can add and remove hardware, add and remove software, upgrade software components, like all of these things. And for a lot of this stuff, like, we have a mental picture in mind of what we want the operator experience to be.

John Gallagher: 42:07

Like, upgrading the rack should be you upload the new software, hit go, and then it tells you when it's done, basically, is sort of the user story of that. Right? But under the hood, it's this very complicated distributed system. Things can fail along the way. Reconfigurator is sort of the inside baseball of how that's gonna be applied to the rack.

John Gallagher: 42:25

And it has been we've been working on it for, I don't know, maybe eighteen months is probably the right ballpark.

Bryan Cantrill: 42:30

Yeah.

John Gallagher: 42:31

And OMDB has allowed us to ship very incomplete pieces of reconfigurator that are still useful for, like, support operations and support operators and also development in a way that, like, we haven't like, the pieces aren't all there to give the operator experience we want yet, but there are enough pieces there to be dangerous and to be useful on a system that needs some kind of minor repair or needs, you know, development in that in in in a way that you need to be able to poke at a live system with both read only operations as Dave mentioned and destructive operations. Like, I need you to go and make actual changes to this in a way that, you know, might be error prone at best to do by hand or even impossible, but we can so I'm sort of rambling here. I'll I'll try

Bryan Cantrill: 43:20

to No. No. No. Actually, no. No.

Bryan Cantrill: 43:22

I think this I was almost gonna make you make you repeat this because it is what you've made is such an important point because it it it allow I mean, this is to is to me the essence of this is that the presence of OMDB allowed us to incrementally develop and deploy aspects of the system that don't really have user visible ramifications. It's like how do you deliver something that like you've got this end goal that is a you know, you you've gonna you're gonna put a new foundation in. But there's gonna be a bunch of things that you wanna do along the way that are hard to, like, appreciate and see and hard to, like, demonstrate. And I feel that OMDB was just absolutely essential for all of that. And in turn, like, those demonstrations, I think, were really important for if if if only internally, but I think externally as well.

Bryan Cantrill: 44:15

But getting the idea of, like, oh, now I get it. Like, I I Dave, I mean, I feel that with some of your early blueprint OMDB demos and being able to, you know, show the deltas between blueprints and kind of, like, what needed to be done to get from one state to the next, it that's very catalyzing for someone who's trying to just understand all of the things that are required for this very complicated change to the system.

Dave Pacheco: 44:42

Yeah. Totally. And it is kind of interesting that we added those commands very early, and I think we've added very few OMDB commands for Reconfigurator since then because those commands wind up encompassing most of what we needed. So, like, we had the blueprint show, blueprint plan or regenerate, which is, like, run our planner to generate a new blueprint and, like, obviously list and delete and setting the target, enabling and disabling. And, like, that's kinda what we needed.

Dave Pacheco: 45:08

And we kind of built the system around that abstraction to begin with. What changed was what show output and what diff output. You know? Yeah. Like, each new demo had more stuff there and more stuff that was actually carried out by the execute phase.

Bryan Cantrill: 45:21

And great ASCII art. Is now the time to mention the ASCII art? Because I think the ASCII art is really very important. I mean, this is the MDB tradition as well, I feel. It's like really killer ASCII art to give you a picture of the system that you have.

Bryan Cantrill: 45:40

Like, I mean, the the colon colon stream does this in MDB. Right? Where you get an actual visual for what the what the stream looks like, which is very helpful to actually kinda map your understanding of the system with its actual implementation.

Dave Pacheco: 45:53

Oh my god. I don't know what colon colon stream is. What? Really? This is consistent with my thesis that most of us know like 5% of MDB and it only barely overlaps.

Dave Pacheco: 46:07

Well, every time I'm looking over someone's shoulder, learn some new part of MDB. And and here here I've learned another.

Bryan Cantrill: 46:15

Well, and I'm I've been just hoping that I haven't, like like, an early version of GPT just hallucinated the whole thing. Like, this is a this is where you'd be like, I I went is no what are you talking about? Oh, I'm so sorry.

Adam Leventhal: 46:25

You're absolutely right.

Bryan Cantrill: 46:26

I'm so sorry. Yeah. How can I I so I'm hoping I presuming I haven't hallucinated it? Yes. The also, as long as we're just going back for a quick MDB, Adam, you you did want us to mention colon colon flip one.

Bryan Cantrill: 46:43

It feels like we I feel like we need to give a tip of the hat to it.

Adam Leventhal: 46:47

Flip one. I was looking, you know, I was hired at Sun. I started looking through random dcmds, and I found this command flip one, which you give a number, and it prints out that number with any one of its bits flipped. I'm like, what the fuck is this?

Bryan Cantrill: 47:04

It it it prints out all iterations with one bit flipped.

John Gallagher: 47:07

That's right.

Bryan Cantrill: 47:08

And I'm like, what the

Adam Leventhal: 47:09

fuck is this? How what is this possibly for? And I feel like it was like, oh sweet summer child, like sit on the floor and gather round while I tell you a tale.

Bryan Cantrill: 47:19

Tell you a tale about a chip called Viking that had an iCache that was not properly grounded out. And on Viking, they ultimately quote unquote fixed that. This is like the Viking rev level two, which is what is still in the help message. This is sun4m, and this is a bug that Bonwick discovered. Because the iCache was not properly grounded out, I believe the zeros would flip to ones.

Bryan Cantrill: 47:53

If you had enough zeros in there Yeah. They would they would start. You'd like, that's a lot of zeros in your iCache. Have a one. I'm sick of zeros.

Bryan Cantrill: 48:02

It's all zero this, zero that, zero the other thing. And I feel like feel like you need a one. But like try a one. I think it's kind of like, you know, it's like the I actually, Adam, as I was listening to that episode from several years ago, your now elementary schooler was a three year old toddler. Mhmm.

Bryan Cantrill: 48:19

I'll be interested to see if his future lawyers force us to take that episode down. That'll be I don't I don't know. I'll be just be curious to see how that works out.

Adam Leventhal: 48:26

Like, real really ring the bell on that one.

Bryan Cantrill: 48:29

Yeah. That's right. The chime turns into a klaxon. You get that chime when we reference a future episode. A klaxon when we warn lawyers in the future that they may wanna really carefully

Adam Leventhal: 48:41

Exhibit it for a crime.

Bryan Cantrill: 48:42

Exhibit it for a crime. But Bonwick discovered this bug when it's like I mean, and he's got a great tale. And Dave, have you ever heard this? I'm sure yet. This is maybe one that's truly an Australia special.

Dave Pacheco: 48:59

I don't think so.

Bryan Cantrill: 49:01

Oh, this is an amazing story where, because it's in the iCache, you would go to branch to a target and you'd get a different instruction because you'd flip a bit.

Adam Leventhal: 49:16

Right. How about outer space? Like, instead of the intended destination, why not just jump to outer space?

Bryan Cantrill: 49:21

Why not just jump to outer space? And or or just just jump over here to this other, valid program text for a while and see where that goes. I don't know. That's let's see what happens there. And the and so he would discover that that'd be actually be good to get.

Bryan Cantrill: 49:34

I'm sure that Jeff still vividly remembers this. But as it was retold to me because this had been debugged before I'd arrived at Sun. But this is when when you have an issue like this, it truly drives you out of your mind because meaningless changes change the the the the presentation of the bug because they change the way program text is organized and changes the actual like address at which particular text is located. Right? So you could do what feels like nothing to the program and the bug would go away when they finally had this thing.

Bryan Cantrill: 50:03

And Barron was the one who finally cracked this thing open and realized that, if I branch to this address, and again, it's either zeros flipping to ones or ones flipping to zeros, he was able to show that I could stuff the iCache with this thing, and it just starts losing bits all over the place.

Adam Leventhal: 50:18

But you're right. When it's in the iCache, you're like, I look at the debugger state of a live system. I see the value. Yeah. And how could I have gotten here?

Adam Leventhal: 50:27

Yeah. Impossible. Red light is. Yes. Divergence between data and iCache.

Bryan Cantrill: 50:33

It's brutal. And the fix for that was to turn off three quarters of the iCache, which is not big air quotes on that one. It's really

Adam Leventhal: 50:43

kind of Yeah. I mean, the thing that was astounding to me I mean, so many great lessons in this. There was a further lesson in terms of what it taught the field organization. Because for generations then at Sun, salespeople would say, you shouldn't buy the extra cache because it really doesn't help.

Bryan Cantrill: 51:03

Because what they had learned powered off. Cache is bad.

Adam Leventhal: 51:07

Because what they had learned from this processor was like, I don't know I don't know the details, but I will tell you, like

Bryan Cantrill: 51:13

Cache is not good.

Adam Leventhal: 51:14

You don't really get what you pay for on this thing. But it was just so interesting to me about how this oral history made its way through Sun, and out the other side came the sales force aphorism, you know, don't really pay for the cache because it doesn't help that much.

Bryan Cantrill: 51:30

And an MDB dcmd that we still have. That is actually occasionally useful. Because the reason you want flip one is because you wanna be like, okay, if I'm at this crazy instruction that doesn't make sense, if I flip one of these bits, is that the actual instruction I wanna be at?

Adam Leventhal: 51:43

That's right. So the way you run it is you say, crazy crazy value to flip one and then pipe that through, interpret all of these as symbols. If I Yeah.

Dave Pacheco: 51:52

Call over

Bryan Cantrill: 51:53

the flip any

Adam Leventhal: 51:53

of these bits, is any of them a symbol, please? Like, please let any one of them make sense. Yes. It's Exactly. It's gotta be a very dark place when when when you're like, maybe the answer is flip one.
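[Editor's sketch] The workflow described here is: take a value that makes no sense, generate every value that differs from it by exactly one bit, and see whether any of those is a known symbol. A minimal sketch of that idea follows. It is not the MDB implementation; the function names, the default 32-bit width, and the toy symbol table are assumptions.

```python
def flip_one(value, width=32):
    """Return every value that differs from `value` in exactly one bit."""
    return [value ^ (1 << bit) for bit in range(width)]


def plausible_targets(value, symbols, width=32):
    """Of the single-bit-flip neighbors of `value`, keep the known symbols.

    `symbols` maps an address to a symbol name (a stand-in for the
    debugger's symbol lookup).
    """
    return {
        addr: symbols[addr]
        for addr in flip_one(value, width)
        if addr in symbols
    }
```

For example, with a symbol at `0x1000`, the nonsensical address `0x1004` has exactly one single-bit neighbor that resolves, which is the "please let any one of them make sense" check from the conversation.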

Bryan Cantrill: 52:05

Cool. And flip one, there are no atheists in a foxhole, I think, and this is definitely where you've gotten. You are praying to whatever god you've got when you're colon colon flip one. So, Dave, sorry for the you know, I'm No. That's great.

Bryan Cantrill: 52:25

You're like, stop generating. Let's get back to you. Kidding me? No.

Dave Pacheco: 52:30

I'm trying to get colon colon stream to work. I'm trying to figure out how I can see

Bryan Cantrill: 52:33

It did not work. It does not exist, of all.

Dave Pacheco: 52:35

No. No. It does. It totally exists, and I've seen the ASCII art. I just haven't found valid pointer to give it and so I'm only getting the like, you know, a bunch of read errors instead.

Bryan Cantrill: 52:45

Yeah. I can get you a walker that will actually generate the output of colon colon stream. That is my pledge to you. Now that I know that it exists, thank God I didn't lose that whole thing. Never know.

Adam Leventhal: 52:56

Or it's not some branch on some laptop that you put in the dumpster years ago.

Bryan Cantrill: 52:59

It's not it's not in Fastrack x. Oh, Fastrack x. What a what a what a what a great repository that thing was. Went down with the ship. Where are we?

Bryan Cantrill: 53:13

We need to get back to we did we leave breadcrumbs to get us back to the trail? Here we are. I I now I realized now it's gotten dark, and I've got no idea where the trail was. You know, all these all these trees look the same now. Meanwhile, back in OMDB, Dave.

Bryan Cantrill: 53:27

Yeah. So sorry. ASCII art, I think, is where we were. We were talking about the great ASCII art in OMDB. Although, is now the time to talk about the equals j format character in MDB?

Dave Pacheco: 53:41

No. I was thinking about equals j.

Bryan Cantrill: 53:43

Okay. I you know, I can read your mind. Can you talk can you talk a little bit about equals j?

Dave Pacheco: 53:46

Well, I'm I'm just a bystander for that that whole thing. I mean, I can tell you what it is and why. It's great. I mean, equals j so equals is the MDB syntax where you give it a value on the left side and you do equals and then a format character, kind of like a printf format character. And j, I believe is ostensibly the jazzed up format.
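To make the bit-counting use case concrete, here is a small Python sketch of what a "jazzed up" binary display could look like: the value in binary, followed by one line per set bit with its position and mask. This is only an approximation of the idea; it does not reproduce MDB's actual `=j` output, and the function name is invented.

```python
def jazzed_up(value):
    """Return the binary form of value plus one line per set bit."""
    lines = [bin(value)[2:]]
    # Walk from the most significant bit down so the lines read left to right.
    for bit in range(value.bit_length() - 1, -1, -1):
        if value & (1 << bit):
            lines.append(f"bit {bit:2d}  mask {1 << bit:#06x}")
    return lines


for line in jazzed_up(0x94):
    print(line)
```

The point is the same as in the conversation: nobody should have to count bit positions by hand.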

Bryan Cantrill: 54:08

Yes. Thank you, Brian.

Dave Pacheco: 54:11

And I think also Jordan.

Bryan Cantrill: 54:13

It's also for Jordan Hendrix, who was sitting next to me. She's like, you know what I would really like? You know what would actually be really useful? Like, I'm sitting here counting bits. I can't remember why she was counting bits. But I was like, okay.

Bryan Cantrill: 54:23

I can, like... we've got one format character left, and good news, Jordan, it's yours. It's j. So yeah, I think jazzed up is what it supposedly says on the tin. But

Dave Pacheco: 54:33

It's pretty great. I mean, when you need it.

Bryan Cantrill: 54:35

It's it's useful when you need it. Yeah. That's it. When you need it, you need it. This is like a lot of these things.

Bryan Cantrill: 54:39

It's when you need it, you actually need it. And I'm so glad that we were thinking alike, that we needed a quick equals j diversion. So now, I promise, back to OMDB, and it's terrific. ASCII art aside, more generally, I think the more general idea is we're printing out this internal state of the system that you wouldn't spend any energy on, like, plumbing all the way out to the user, but it's extremely valuable for the implementer to actually understand.

Bryan Cantrill: 55:14

And then I think, John, the reason that your point was so important is because these OMDB way stations allowed us to see the system as it was being developed, kind of get excited about it and, I mean, really understand what we were doing, and be able to demo these way stations. And that in turn, I think, indisputably accelerated our development. I think that to me is the big surprise, because one might reasonably think that codeveloping a debugger with a system would make the development of that system slower. And I think we would counter that it actually made it faster and more robust to develop the system in the presence of OMDB. John, is that fair?

Bryan Cantrill: 56:00

I mean, that's certainly the inference that I drew from what you said anyway.

John Gallagher: 56:03

Yeah. I think that's absolutely right. I think I would claim there are two reasons for that. One is that adding stuff to OMDB is very straightforward. So there was a chat going by that's maybe worth summarizing briefly.

John Gallagher: 56:15

Somebody asked, like, where does this thing run? What does the interface look like? So if you log in to a rack as a developer or as a support operator, OMDB is a CLI program that's available as soon as you log in. And by default, it's got a bunch of subcommands for accessing different services within the rack. So we've mentioned omdb db, which will access the database, and you can run queries.

John Gallagher: 56:35

omdb nexus talks to Nexus, which is our sort of control plane brain. And by default, if you run one of these subcommands, OMDB will itself connect to our own internal DNS servers just like any other service within the rack would. So if you say omdb db, it'll look up where the database nodes are in internal DNS and then pick one of those and connect to it. But because you have to use this thing on broken systems sometimes, you can always override that and say, skip the internal DNS check. Here's exactly where I need to go.
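The default-lookup-with-override behavior John describes can be sketched in a few lines of Python. This is an illustration under assumptions, not OMDB's code: the DNS table is a plain dictionary standing in for internal DNS, and the service names, addresses, ports, and the `resolve` function are all hypothetical.

```python
import random

# Hypothetical stand-in for internal DNS: service name -> known addresses.
INTERNAL_DNS = {
    "database": ["[fd00::1]:5000", "[fd00::2]:5000"],
    "nexus": ["[fd00::10]:6000"],
}


def resolve(service, override=None, dns=INTERNAL_DNS, pick=random.choice):
    """Return the address to connect to for a service.

    An explicit override always wins, so the tool still works when
    DNS itself is the broken component.
    """
    if override is not None:
        return override
    addrs = dns.get(service)
    if not addrs:
        raise LookupError(f"no DNS records for {service!r}; pass an override")
    return pick(addrs)


print(resolve("nexus"))
print(resolve("database", override="[fd00::99]:5000"))
```

Because every subcommand goes through one lookup path like this, a new subcommand only has to worry about formatting its output.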

John Gallagher: 57:07

Because by default it can just look up this stuff in internal DNS, adding new stuff to OMDB is basically, like, the extent of the work is I need to print this thing in a way that is sort of reasonable at a terminal. That's it. I don't have to do any of the machinery to look up services. We've already got clients; it reuses the same clients that other services already use. Like, it's very, very fast.

John Gallagher: 57:31

In fact, in most of our, like, reconfigurator pull request, most of our reconfigurator work, the OMDB changes to support whatever new reconfigurator functionality is landing. They just come in the same PR because it's like, here's an extra 30 lines of code in OMDB that goes with this 500 line PR that makes you able to see the results of this thing in practice. So that's that's one reason that I would say that, like, it's absolutely been a speed boost is that it's so easy to add to. The other is that for reconfigurator in particular, having a system that reconfigures itself, it, like, allows a distributed system to reconfigure itself on the fly, that can go wrong in 10,000 different ways. Right?

Bryan Cantrill: 58:10

And Right.

John Gallagher: 58:11

Having OMDB has let us ship things with confidence that it's not going to, like, break in weird ways that we can't understand. And part of that is that, like, we we've talked about the delta between blueprints. Like, this is this is where some of the nice ASCII art comes from. We talked about this on the DAFT episode not that long ago. A thing that we can do with blueprints is I'm gonna I'm gonna generate a new blueprint.

John Gallagher: 58:36

All that does is produce the new blueprint. It does not make any changes to the system, and then a human can look at the diff before saying, yes, it's okay to move on to that blueprint. And that has been the way we've worked since blueprints were introduced, you know, however many months ago. We're just now getting to the point where we're gonna automate this, and if we had tried to do that, you know, twelve months ago when we started with the blueprints, it would have been a total disaster. Right?
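The generate, diff, approve workflow described here can be sketched as follows. This is a toy model, assuming a blueprint is just a dictionary of component counts; the `diff` function, the `System` class, and the field names are invented for illustration and are not the reconfigurator's real types.

```python
def diff(current, proposed):
    """Summarize what would change between two blueprints (plain dicts)."""
    added = {k: proposed[k] for k in proposed.keys() - current.keys()}
    removed = {k: current[k] for k in current.keys() - proposed.keys()}
    changed = {
        k: (current[k], proposed[k])
        for k in current.keys() & proposed.keys()
        if current[k] != proposed[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


class System:
    """Holds the current target; generating a blueprint never touches it."""

    def __init__(self, target):
        self.target = target

    def set_target(self, proposed, approved):
        # The human stays in the loop: nothing happens without approval.
        if not approved:
            raise PermissionError("blueprint not approved by a human")
        self.target = proposed


system = System({"nexus": 3, "dns": 2})
proposed = {"nexus": 3, "dns": 3, "ntp": 1}
print(diff(system.target, proposed))
system.set_target(proposed, approved=True)
```

Keeping generation and activation as separate steps is what gives a person the chance to read the diff first.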

John Gallagher: 59:04

Because, yeah, the first several iterations of blueprints were, like, bugs all over the place. Right? It's a new, complicated thing. But because we built these sort of, as you said, way stations where a human can look at the output and say, okay.

John Gallagher: 59:17

Yes. This is the plan it's gonna make. Wait. I don't understand what it's gonna do here. I can go ask someone who worked on this part of the system and make sure that this makes sense before I tell the system it's okay to proceed.

John Gallagher: 59:26

The history we've built up with that over the last however many months is what's giving us the, I don't know if courage is the right word, to move forward with automating the thing in a way that we're confident that, you know, we've built enough in here that we can move forward with this.

Bryan Cantrill: 59:40

Yeah. That's a really good point. And Dave, I remember we often said, at Joyent and at Oxide too, that you gotta get the human in the loop before you get them out of it. And those way stations allowed these kind of junctures where you could get the human in the loop for a system that's not totally done yet, but an element of it is done. A lower level is done.

Bryan Cantrill: 01:00:04

The human can get in the loop and audit what they're seeing. And now, John, to your point, now we've got the confidence where, okay, we can now start to pull the human out of the loop and get to where we actually wanna go, which is this fully automated system.

John Gallagher: 01:00:17

Yeah. So we actually ran the very first fully automated test, what, four hours ago maybe? And one of the things it did was weird. We're like, what happened here? But we already had the tools built in to see. So what happened was we expected the automated planner to take one step, and instead it took two steps.

John Gallagher: 01:00:36

We're like, well, that's weird. Why did it do that? But we already had the tools right there to say show me what you did between steps one and two, and then show me what you did between steps two and three. And, like, immediately, within three minutes, we had an issue filed. We knew what it had done wrong because we already had the tools right there from developing them over the last twelve months.

Bryan Cantrill: 01:00:53

Yeah. That is remarkable. That is really, really neat. And I also think it's been great then for developing totally new functionality that, again, wouldn't necessarily have user-visible consequences. I mean, one of those just for me personally was we had all of this sensor data that was coming out of MGS, the Management Gateway Service, but we had not plumbed it up into the control plane.

Bryan Cantrill: 01:01:22

And I just wanted to know, like, is the information coming out of this correct? And I wanted to take the work... actually, this is where, you know, we had kissing cousins, with Humility coming to OMDB, Dave, where the Humility dashboard output, which allows us to see the kind of environmentals on a single sled, I brought that into an MGS dashboard command via OMDB, and that code was very easy to go do. To your point, John, the fact that OMDB handled all these things like the service discovery and all this other stuff meant I could then just run that and get a dashboard. And then, Eliza, that allowed you... I mean, you were then stitching that up into the control plane. And, I mean, from my perspective, it was great to have OMDB to be like, okay.

Bryan Cantrill: 01:02:09

Now Eliza's got something that she can go rig or build on. And then you use this more recently on the FMA stuff. I wonder if you want to talk about, like, what you were demoing on Friday

Adam Leventhal: 01:02:16

Oh, yeah.

Eliza Weisman: 01:02:17

Well, so I'm not actually sure if this is the most interesting OMDB use case, but I've been working for the last several months on fault management, and in particular, on a component of the system that sort of goes around and collects fault reports from all the service processor firmware. And, you know, I wrote a bunch of code for doing that last week, and I wanted to be able to show it off and also to test it. And it's actually pretty hard to demo this because it's an automated process that just comes around and collects the error reports periodically. And we've modified the simulated service processor that we use for control plane testing to pretend to have some error reports. And then this thing comes around and sweeps them up.

Eliza Weisman: 01:03:11

And in a world where the actual plumbing of that into the alert system is something that's still being worked on by me, you can't actually show off anything interesting about that

Bryan Cantrill: 01:03:23

Right.

Eliza Weisman: 01:03:23

Because it it sort of does this within the thirty seconds of the simulated control plane coming up. And then it's like, okay. Now these records are in the database. That's great. There's there's nothing I can really interact with here.

Eliza Weisman: 01:03:38

So I just sort of went and added a bunch of OMDB commands for printing all of this stuff. And now suddenly, you have a way that you can test that this works, and that's really nice. And especially because the, like, actual, you know, integration tests that I still have to go and write for this are not a very nice demo. You know, you show up at demo Friday, and you're like, I would like to show you this system that I've implemented. And the way that I will demonstrate that it works is that I can run cargo nextest, and it says, yeah, all of your tests passed.

Eliza Weisman: 01:04:14

And it's like, that's great.

Bryan Cantrill: 01:04:16

Which we would still honor. Just for the record. I mean, you know, these are systems demos. So, you know, the fact that the program runs, the system boots, is itself a modern miracle. But I agree that OMDB allows us to give a little more pizzazz to our systems demos.

Eliza Weisman: 01:04:31

Yeah. And it was also very useful because as I was going and sketching out some of these OMDB commands that I wanted to have, I began to recognize that thinking about what commands I might want to have if something goes wrong gives you a lot of insight into, okay, what are the contours of the different forms of data that we're collecting and storing. Like, for instance, you know, I wanted to do a command that lets you show all of the error reports by sled serial. And that was, you know, so that you could say, like, if this sled has some hardware faults, it's thrown a bunch of errors. Well, okay.

Eliza Weisman: 01:05:18

So if you actually go and do this command, you realize it should have a table that also shows you the cubby it was in, because the same sled might have been pulled out of one cubby and then later put into a different cubby, a different slot in the rack, and you would maybe wanna know that. And conversely, if you want to say, you know, I wanna see all of the errors from sled nine, well, who is sled nine? That could have been any number of serial numbers at any number of points in time if sled nine was pulled out and replaced. I picked sled nine because it is an ill-fated

Bryan Cantrill: 01:05:51

SLED nine is a SLED number. Yeah. Problematic number. I I don't know. We should see what yeah.

Bryan Cantrill: 01:05:54

What is sled nine like? Do a colon-colon flip one on sled nine, would you? Because that's who you seem to be.

Eliza Weisman: 01:06:00

You know, and

Dave Pacheco: 01:06:01

then yeah.

Eliza Weisman: 01:06:02

So it it just gives you this sense of like, oh, here are some things that I actually will want to think about when, you know, the higher level tools are built on top of this. It's like it it gives you a sense of the shape of the data that you might not have thought about quite so intimately if you were not thinking, oh, you know, what are the things that I would definitely want to have if this isn't working?
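The serial-versus-cubby point can be shown with a tiny Python sketch. The record layout, serial numbers, and query functions below are hypothetical; the only thing taken from the discussion is that a query by serial should also show the cubby, and a query by cubby may return several different serials over time.

```python
from collections import namedtuple

# Hypothetical error-report record; field names are for illustration only.
Report = namedtuple("Report", "time serial cubby message")

REPORTS = [
    Report(1, "SN-AAA", 9, "fan fault"),
    Report(2, "SN-AAA", 14, "fan fault"),   # same sled, moved to another cubby
    Report(3, "SN-BBB", 9, "power fault"),  # replacement sled in cubby 9
]


def by_serial(serial):
    """All reports for one physical sled, including where it sat each time."""
    return [(r.time, r.cubby, r.message) for r in REPORTS if r.serial == serial]


def by_cubby(cubby):
    """All reports for one slot, including which sled occupied it each time."""
    return [(r.time, r.serial, r.message) for r in REPORTS if r.cubby == cubby]


print(by_serial("SN-AAA"))
print(by_cubby(9))
```

Writing the two queries side by side is what surfaces the need to store both identifiers with every report.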

Bryan Cantrill: 01:06:23

Yeah. That's really interesting. We're in truly debugger-driven development, where, like, I'm thinking about what I want. I mean, I love what you said about how thinking about what I want when the system fails has me thinking differently about the data that I wanna store in the system itself. And, Dave, I gotta imagine that's been a theme for you as well.

Bryan Cantrill: 01:06:43

I I mean, the in the the reconfigurator work and the blueprint work.

Dave Pacheco: 01:06:48

Oh, yeah. Absolutely. I'm really glad we got here because this is something I've been thinking about a lot. So the thing I used OMDB for that I remember was the DNS propagation thing, which, I don't know if going into that whole

Bryan Cantrill: 01:07:05

definitely. Yeah. Yeah. Yeah.

Dave Pacheco: 01:07:06

So, basically, the rack runs an internal DNS and an external DNS system. They're two deployments of the same software. So for external DNS, customers delegate a DNS domain to us, and then we serve names under that for the console and web API. And we use that because if we wanna add new IPs that back it or remove some or change them, then we can do that, and we don't have to coordinate with, like, their DNS operations team or something like that. Right?

Dave Pacheco: 01:07:31

For internal DNS, we use that to discover all the components within the system. So we run these two different DNS things. And for both of them, the control plane is the authoritative record of what should be in DNS, like what names should be there with what records and everything. Right? And so the thing that I was working on was the system for having the control plane at large be able to make changes to that DNS configuration and then have that show up at the DNS servers.

Dave Pacheco: 01:08:00

And there are so many ways this can go wrong. Right? I mean, you know, every outage ever, it's always DNS. Right? It really is.

Dave Pacheco: 01:08:07

I was assuming that not only were we going to have outages where people were gonna come to us and be like, the DNS thing is wrong and, you know, what's wrong with it? But also outages where DNS wasn't the problem, and we wanted to be able to quickly show that it wasn't. Right? So another thing I remember talking about a lot at Joyent was software should be able to exonerate itself, which is, you know, it's great when software can tell you there are zero errors, but I think it's much more useful if it's able to tell you a bunch more affirmative things about what it's done.

Dave Pacheco: 01:08:42

So in the case of the DNS propagation thing, I I factored that into three different things that you can observe with OMDB. One of them reads the latest config from the database. One of them reads the latest set of DNS servers, and one of them propagates the latest config to the latest set of DNS servers. And that sounds obvious, right, when you say it, but it was a pretty it was a useful thing. And then you can look at the result of any of those things.

Dave Pacheco: 01:09:06

And so with, like, three or four OMDB commands, you can pretty quickly take any DNS problem and say exactly where it is. It's either in the DNS server or it's in the config for the DNS or it's in the list of servers that we have in the database or it's in the database. Like, there's a bunch of places it can be, but you can narrow it down very quickly, because I was just thinking ahead to, well, what could it be, and what are the questions we're gonna be asked for a production thing?
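Here is a minimal Python sketch of how three separately observable steps let a tool say where a DNS problem is, or affirmatively exonerate DNS. The status dictionary, its keys, and the `locate_fault` function are assumptions made for the example, not the real OMDB output or Nexus data model.

```python
def locate_fault(db_config_gen, db_servers, status):
    """Compare what the three tasks last reported against the database.

    `status` is a hypothetical summary with three keys:
      config_gen   - generation the config reader last loaded
      servers      - DNS servers the server-list reader last saw
      propagation  - per-server result of the last propagation attempt
    """
    if status["config_gen"] != db_config_gen:
        return "config reader is behind the database"
    if set(status["servers"]) != set(db_servers):
        return "server list is out of date"
    for server in status["servers"]:
        result = status["propagation"].get(server, "never attempted")
        if result != "ok":
            return f"propagation to {server} failed: {result}"
    return "DNS propagation is exonerated"


status = {
    "config_gen": 7,
    "servers": ["dns-a", "dns-b"],
    "propagation": {"dns-a": "ok", "dns-b": "connection refused"},
}
print(locate_fault(7, ["dns-a", "dns-b"], status))
```

The last return value matters as much as the failures: it is the affirmative statement that lets the software clear itself.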

Bryan Cantrill: 01:09:33

Yeah. And so we're using OMDB to potentially modify the state of the system here. Right?

Dave Pacheco: 01:09:38

That is also true, but that that this is actually just debugging it. This is actually just observing the state of all of those things.

Bryan Cantrill: 01:09:45

Okay. But one of the things that is interesting to me is we have, I guess, outside of this, used OMDB as a way to test well-defined changes to the system, which is different, I think. I mean, yes, John is describing in the chat the minus-minus-destructive option. I mean, MDB does have a destructive option, where you can actually modify the state of the system, and it sounds like you've got a kind of similar approach in OMDB. You wanna allow OMDB to make well-defined changes to the system, potentially.

Bryan Cantrill: 01:10:22

Its its primary purpose is to really observe the system, but you've got this way of actually making important changes. Is that fair?

Dave Pacheco: 01:10:30

Yeah. Absolutely. I mean, it is it's it's no longer like a a sort of niche use case. It's all of the Right. Yeah.

Dave Pacheco: 01:10:36

reconfigurator work that we do involves some of these control operations. And it inherited the -w / --destructive from MDB and DTrace. Although it's probably a little bit alarmist for OMDB because it's not usually actually destructive. It's just that anything that might possibly change the behavior of the system is put behind the destructive flag. And so, yeah.

Dave Pacheco: 01:10:56

If you tell the thing to generate a new blueprint, even though that blueprint is not going to do anything because you haven't made it the new target, that's still behind the destructive flag because you're changing the underlying system state. And, but it's huge. It's it's that's really where you get to some of the interesting OMDB driven development, right, where you're like, I'm gonna go do something that we don't have a public API for because it's not mature enough yet, but I can go do it in this context. And that's actually I we haven't actually quite done that with the automation of the planner stuff, but that will be kind of the next step. The planner will be something that we have manual for the foreseeable future and is manual.

Dave Pacheco: 01:11:32

But in development environments, we wanna be able to just turn on and off the automated thing, which will just be, like, setting a flag.

Bryan Cantrill: 01:11:39

Right. And and

Dave Pacheco: 01:11:40

runtime configurations, like, not anything fancy, but, some of the other ones are a little more interesting.
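The gating Dave describes, where anything that might change system state sits behind a destructive flag, can be sketched with Python's `argparse`. The tool name and command names below are hypothetical; only the `-w` / `--destructive` spelling comes from the conversation.

```python
import argparse

# Hypothetical command names; anything listed here may change system state.
MUTATING = {"blueprint-generate", "blueprint-set-target"}


def run(argv):
    """Parse arguments and refuse mutating commands without the flag."""
    parser = argparse.ArgumentParser(prog="debugtool")
    parser.add_argument(
        "-w", "--destructive", action="store_true",
        help="allow commands that may change system state",
    )
    parser.add_argument("command")
    args = parser.parse_args(argv)
    if args.command in MUTATING and not args.destructive:
        return f"refusing {args.command}: pass -w/--destructive to proceed"
    return f"running {args.command}"


print(run(["blueprint-show"]))
print(run(["blueprint-generate"]))
print(run(["-w", "blueprint-generate"]))
```

Note that generating a blueprint is gated even though it does not change what the system is doing, which matches the "a little bit alarmist" policy described above.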

Bryan Cantrill: 01:11:46

Well, by the time you get to that, like, you've already done that. The the like, you've used OMDB so many times to kind of do this well defined state transition that it feels like it's the we've derisked that future development of the system quite a bit.

Dave Pacheco: 01:12:00

Yeah. For sure.

John Gallagher: 01:12:02

I think that was almost the exact argument I made to Dave that we needed to go ahead and automate the planner. We're talking about, you know, upgrading the entire system. It's like, oh, that's gonna require, I don't know, back of the envelope, 300 or 400 blueprint steps. Like, I can do that by hand, but the very first thing I'm gonna do is, like, a loop in bash and just write my own little automation, because I'm not gonna sit there and monitor it for 300 steps. Right? So, like, if we're gonna do this anyway, let's automate the thing and just put the automation behind a flag that I can turn on and off if something seems to go off the rails.
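A minimal sketch of "automate the loop, but keep a switch" might look like the following in Python. The `run_planner` function and its callbacks are invented for illustration; the real planner and its flag are not shown in this conversation.

```python
def run_planner(plan_next, execute, automation_enabled, max_steps=400):
    """Step through blueprints until the plan converges or automation is off.

    plan_next          - returns the next blueprint, or None when converged
    execute            - applies one blueprint
    automation_enabled - checked before every step, so an operator can
                         stop the loop as soon as something looks wrong
    """
    steps = 0
    while steps < max_steps and automation_enabled():
        blueprint = plan_next()
        if blueprint is None:  # nothing left to change
            break
        execute(blueprint)
        steps += 1
    return steps


# Toy run: five pending steps, with the flag flipped off after three.
pending = iter(range(5))
done = []
steps = run_planner(
    plan_next=lambda: next(pending, None),
    execute=done.append,
    automation_enabled=lambda: len(done) < 3,
)
print(steps, done)
```

Checking the flag on every iteration, rather than once at startup, is what makes it usable as an off switch mid-upgrade.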

Bryan Cantrill: 01:12:33

Yeah. Totally. And, I mean, John, that's obviously borne fruit. It feels like that has, I mean, again, allowed us to really accelerate and parallelize too.

Bryan Cantrill: 01:12:47

There's also an element of parallelization, because it's made it easier for people that wanna help this effort to kinda understand the frontiers of the system. Or certainly from my perspective. Has that been true, do you think?

Dave Pacheco: 01:13:00

Yeah. Absolutely. And I think, you know, we talk about demoing and demoing is obviously important for a lot of reasons, but I think an important it serves an important function of communication. Right? It's like here's Yeah.

Dave Pacheco: 01:13:12

a thing I built and here's how you use it and here's how it works, in a way that's like, you know, it's the picture that's worth a thousand words. Right? And I can give you a demo of the thing, and you can understand it quite a lot better than if I just give you a lot of text about it or even a bunch of diagrams. And this may be where I start going off the rails a little bit, but I've been thinking a lot about this. It's felt for a long time for me like there's a close relationship between tools that are good for debugging and tools that are good for demos.

Dave Pacheco: 01:13:41

And... yes. Yeah. I think of this... I don't know if we have some of the hardware folks on to tell me I'm wrong about this. But it feels a little bit like test points in hardware, where you've got these spots on the board where you've pulled out a place where you can attach a probe. And that's really useful, I think, during bring-up of some types of systems.

Dave Pacheco: 01:14:00

Right? And it's also useful later when some system's not working, if you still have them depending on the system. Right? By you know, if you look at the DNS propagation stuff I was talking about, what I and thanks, Eliza, for posting it in the channel. So what's there is, like, there's you can look at the last DNS configuration that a particular Nexus instance read.

Dave Pacheco: 01:14:21

You can look at the list of servers it knew about, and you can look at the result of propagating it to each of the servers. And, like, so if you imagine demoing this feature, like, I've just done DNS propagation and I wanna demo it. Like, I could do a demo where I do a thing that should cause a new DNS name to show up, and then I run dig at the DNS server and it shows up. And, like, that's an okay demo. Like, I've shown that this works.

Dave Pacheco: 01:14:44

But then there's kind of a lot more stuff you could do that I think makes it a much better demo. You can, for example, undo the thing and show that it has gone away, and you can you can show these underlying pieces working the way you expect. So it's like I'm not I've made some change, and then I'm looking at the internal state of each of these things, and you can see the underlying works that are making the thing that I'm trying to show you happen, and now you understand it better. Right? And then I think an even better demo, I hope I did this.

Dave Pacheco: 01:15:12

I don't know if this is what I actually did when I demoed it, would be like, let's turn off one of these DNS servers and show what happens in the output of this thing that's looking at the propagation, and you'd see a failure there with an error message. And then you turn it back on, you see the propagation work again. And that goes back to this other thing that I think we got from Matt Rani many years ago, which is, like, you don't really understand how a thing works until you understand how it fails. And I just feel like these tools, like, a rich set of these primitive things for observing all these very basic steps that these systems carry out, are useful for communicating this stuff to people, for showing how it works, for showing how it fails, and for debugging, and for support. It just winds up being this

Bryan Cantrill: 01:15:56

Very high leverage. Yeah. Totally. And, sorry, Dan McDonald in the chat dropped in colon-colon stream.

Bryan Cantrill: 01:16:02

I wasn't hallucinating. It was great. Dan, colon-colon stream on demand, apparently. So we've got some ASCII art of colon-colon stream. I just dropped in some ASCII art from when I was developing Cyclics, showing this kind of graphical display of the heap.

Bryan Cantrill: 01:16:19

And I think just to your point, Dave, it allows you to demo these implementation details because they can be vividly seen in a way that is much less abstract. It makes it more concrete when you can actually see what the system is doing. It's actually telling you, and I love the output that Eliza pasted in of the background tasks, showing the DNS config internal, because you can actually just get a much better understanding of the system and what it's actually trying to do in a way that, as you say, you can't from a demo of, like, look, it works. It just doesn't leave you with necessarily the same level of understanding. So, yeah, there is something about that.

Bryan Cantrill: 01:17:02

About the demo ability of system software when you are doing when you're engaged in debugger driven development.

Eliza Weisman: 01:17:09

I was just very taken by the discovery, and I think this says something pretty big about Oxide's culture around this stuff, is that every Nexus background task, there's this trait that's like, what is a background task? It's a thing that you activate and it returns some future. And that future returns a JSON object that is then what's used to populate that status that we mentioned, the OMDB background tasks status command. And it just, like, all when you activate one of these things and they're activated periodically and so on, they just always return a thing that says, here's what I did. And that's pretty simple and kind of obvious, but it says a lot.
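The pattern Eliza describes, where every activation of a background task returns a structured report of what it did, can be sketched in Python with `asyncio`. The real thing is a Rust trait in Nexus; the class names, task name, and status fields here are hypothetical stand-ins.

```python
import asyncio
import json


class DnsConfigReader:
    """Hypothetical background task: each activation reports what it did."""

    name = "dns_config_reader"

    def __init__(self):
        self.generation = 0

    async def activate(self):
        # Pretend to read the latest config, then describe the result.
        self.generation += 1
        return {"generation": self.generation, "error": None}


class Driver:
    """Activates tasks and remembers each task's most recent status."""

    def __init__(self, tasks):
        self.tasks = tasks
        self.last_status = {}

    async def activate_all(self):
        for task in self.tasks:
            self.last_status[task.name] = await task.activate()

    def status_json(self):
        # This is what a debugger-style status command would print.
        return json.dumps(self.last_status, sort_keys=True)


driver = Driver([DnsConfigReader()])
asyncio.run(driver.activate_all())
print(driver.status_json())
```

Because the status is part of the task's return value, a task cannot be written without also deciding what it will say about itself.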

Bryan Cantrill: 01:17:53

Yeah. Totally. And, Eliza, you also mentioned in the chat Keeter's ASCII art on the management switch. Do you happen to have... I think we'll get some output of that. That is

Eliza Weisman: 01:18:06

I was gonna try and track it down.

Bryan Cantrill: 01:18:08

Yeah. I mean, it really it it it and again, it's like it's it's got this I mean, because Dave, it's almost this like pedagogical effect, you know, of it it's it's like almost deeper than a demo. It's actually instructing people about how the system works.

Dave Pacheco: 01:18:23

Yeah. I mean, there's definitely an element of that. And, like, on some level or again, maybe going off the rails here, but, like, what's the difference between debugging a problem and demoing the system? It's whether you know which of these things is gonna be wrong or whether one of them is wrong. Right?

Dave Pacheco: 01:18:37

But you're following the same set of steps to show the state at all of these points. And yeah. Yeah. I mean, when you're debugging, you don't know which of these is gonna be totally wrong. And when you're demoing, you do, but your audience might not.

Dave Pacheco: 01:18:50

But, I don't know, there's a real connection there, and it translates to, like, teaching people about the system. I guess the reason what you said reminded me of that is when you're debugging something, especially if it's not something you've been interacting with for the last couple of months, but it's like a new customer issue, you kinda want a lesson again on how it works. You kinda want that diagram that's like, oh, right. That's how these things are connected. And here's where we expect, you know, this piece of data to flow through the pipeline.

Dave Pacheco: 01:19:17

Here's where it's not. And that's there's a lot of value in investing in having the tool clearly print that stuff out. Like, ASCII art is not just for fun. It's really useful.

Bryan Cantrill: 01:19:28

It really is. It really is. And I think that it's useful to just, like, instantly know, and I think that, yeah, Eliza, it's the Monorail. And then I remember we used Matt's ASCII art during Cosmo bring-up to know, like, okay. Like, this thing is actually broadly working the way we expect it to be working.

Bryan Cantrill: 01:19:52

And then the other thing, Dave, I also think is, you know, so much of system software is just a grinder, and it's fun to be able to actually develop the software that makes it immediately clear what the system is doing. And I've always found this from the earliest days of MDB that, like, it's fun to develop the dmods. And it's especially so when, you know, we're through so many difficult implementation challenges. And the great thing about this stuff is, like, it's really not that hard to implement. You know what I mean?

Bryan Cantrill: 01:20:29

It's like, it's just kinda fun. And I feel sometimes there are days that I need that, put it that way. There are days that I need, like actually, I I I kinda wanna just, like, ski some cruisers here with the with the with the the the debug infrastructure, and it often feels that feels that way.

Dave Pacheco: 01:20:48

Totally. I mean, I'm just contrasting, you know, the demo I was describing with something where, like, when it works, you get, like, a log entry that's like, okay. I did this thing. Like, that's great. It doesn't have the same appeal that you're describing of, like, okay.

Dave Pacheco: 01:21:01

I can actually see this tangible thing working.

Bryan Cantrill: 01:21:05

Yeah. Totally. So then, because I feel like you've been using OMDB to demonstrate almost every waystation on the way to fully automated update. I feel like that's not really an exaggeration. It feels like many of these demos are OMDB demos.

Dave Pacheco: 01:21:24

Yeah. Totally. I mean, especially on the execution side. On the planner side, we have a different little CLI that is a cousin. It's another cousin.

Bryan Cantrill: 01:21:33

It's another cousin. We got a big family here. Listen, you know, look, our critics online call it a cult. We call it a family. Okay?

Bryan Cantrill: 01:21:40

We got a family of debuggers, not a cult.

Dave Pacheco: 01:21:42

There's a whole little branch of stuff over there. There's a branch of the family over there where I've made a bunch of little small little CLI tools with REPLs in them that let you kind of mess around with internal state.

Bryan Cantrill: 01:21:53

And yeah, the family lives on a compound. We don't call it that. You know, it's like, yeah. That's what the ATF will have you believe, but fine. Well, it's been really amazing and it's been a lot of fun to watch.

Bryan Cantrill: 01:22:08

And I thought it was, again, just very visceral on Friday when it was just like back to back to back OMDB demos. I mean, Dave, that must warm your heart when you've got so many different folks demoing their work using OMDB as kind of a foundation.

Dave Pacheco: 01:22:24

You know, I hadn't actually thought about it that way, but yeah. It is nice to see. It's just nice to see that it's been useful and, yeah, a useful platform for people to just add stuff to.

Bryan Cantrill: 01:22:36

And then it also has been, I mean, fortunately, we've been talking a lot today about debugger-driven development. But then, and John, you were kinda mentioning this in terms of, like, the most recent work as of hours ago, where you've got all that infrastructure. Like, good news, you've got all that infrastructure when you actually have a bug that you need to debug. You actually have already built it, and we've done plenty of that too. Right?

Bryan Cantrill: 01:22:58

I mean, you're obviously describing just doing it recently, and I know Matthew in the chat was describing when we had a Crucible issue that he was using OMDB to debug. So we've actually used it, I mean, to debug an errant system as well.

John Gallagher: 01:23:12

Yeah. Absolutely. And I'm sort of curious, Dave, if you have an idea of what the split is between commands that we wrote ahead of time anticipating that we would need them and commands that we went back and wrote after we had had some bug that we had to work out through much grosser means. Like, anytime we've had to connect to, like, the Cockroach SQL shell and run raw SQL queries by hand, that immediately becomes a, okay, I need a new OMDB subcommand to do this thing for me because I'm never writing raw SQL for this bug again in the future. Right?
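The pattern John describes, turning a hand-typed query into a named subcommand, might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `dbg` program name, the `instances` subcommand, the `instance` table and its columns, and the use of an in-memory SQLite database as a stand-in for the real database. OMDB itself is a separate tool and its actual commands are not shown here.

```python
import argparse
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical schema and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instance (name TEXT, state TEXT, sled TEXT)")
    conn.executemany(
        "INSERT INTO instance VALUES (?, ?, ?)",
        [
            ("web-0", "running", "sled-14"),
            ("web-1", "stopped", "sled-16"),
            ("db-0", "running", "sled-14"),
        ],
    )
    return conn


def list_instances(conn, state=None):
    """The query someone once typed by hand, now captured in one place."""
    sql = "SELECT name, state, sled FROM instance"
    params = ()
    if state is not None:
        sql += " WHERE state = ?"
        params = (state,)
    sql += " ORDER BY name"
    return conn.execute(sql, params).fetchall()


def format_table(headers, rows):
    """Left-align each column to the width of its widest cell."""
    widths = [max(len(str(v)) for v in col) for col in zip(headers, *rows)]
    return "\n".join(
        "  ".join(str(v).ljust(w) for v, w in zip(row, widths)).rstrip()
        for row in [headers, *rows]
    )


def main(argv, conn):
    """Parse a subcommand and return its formatted output."""
    parser = argparse.ArgumentParser(prog="dbg")
    sub = parser.add_subparsers(dest="command", required=True)
    instances = sub.add_parser("instances", help="list instances")
    instances.add_argument("--state", help="only show this state")
    args = parser.parse_args(argv)
    rows = list_instances(conn, args.state)
    return format_table(("NAME", "STATE", "SLED"), rows)
```

Calling `print(main(["instances", "--state", "running"], make_demo_db()))` prints a small aligned table. The design choice worth noting is that the query lives in one function with a name, so the next person hitting the same bug runs a command instead of reconstructing the SQL.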

Dave Pacheco: 01:23:46

Yeah. Totally. I definitely don't know what the breakdown is, but there's a bunch of both. I mean, there's something gratifying in that. I feel like when you're debugging a problem like that and you're like, okay.

Dave Pacheco: 01:24:00

I have found a new thing that would be useful. I can go build that and land it and, like, I have just made the world better. The next time we have this problem, it will be easier for all of us. I find that kind of thing really gratifying.

Bryan Cantrill: 01:24:13

Really gratifying. Okay. So this is actually another important thing, Dave: OMDB ships. It is on the rack. It's not customer accessible, but it is on the rack.

Bryan Cantrill: 01:24:24

And so, I mean, I think that's another really important element here. It's not actually something that you've got to go independently download. I mean, don't you feel that that's been also essential for its adoption?

Dave Pacheco: 01:24:36

Yeah. Absolutely. I think it's really important that it's just there because you can just log in to any of our dev systems, certainly production systems, and be able to use it. It's also important, as I think John was getting at, that it doesn't assume the whole system is there and it doesn't really assume very much. So, like, pretty frequently, especially on demo day, I'm copying a version of it that I have built over to some, like, dogfood, some system that's already set up because I don't need a whole system set up to do my demo.

Dave Pacheco: 01:25:03

Like, I don't need to set up a system with my bits. I just need, like, some OMDB thing or something like that. And I can just point it at that, and it is happy to be in some strange place. It doesn't need anything else. You can just drop it in somewhere and be like, here's the IP and port that's a Nexus or an MGS or whatever, and it will happily print that out.
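To make the "here's the IP and port" idea concrete, here is a small Python sketch of a tool that needs nothing but a target address to get started. The `parse_target` helper and the example addresses are hypothetical and are not taken from OMDB; the sketch only shows how little a drop-in tool has to assume about where it is running.

```python
import ipaddress


def parse_target(text):
    """Parse 'IP:PORT' or '[IPv6]:PORT' into (address, port).

    Raises ValueError if the text is not a usable target.
    """
    host, sep, port = text.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected IP:PORT, got {text!r}")
    # IPv6 literals are conventionally wrapped in brackets.
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]
    address = ipaddress.ip_address(host)
    port_number = int(port)
    if not 0 < port_number < 65536:
        raise ValueError(f"port out of range: {port_number}")
    return address, port_number


if __name__ == "__main__":
    # Hypothetical targets, purely for illustration.
    print(parse_target("[fd00::1]:12220"))
    print(parse_target("127.0.0.1:8080"))
```

Because the only input is an address, the same binary can be pointed at a production service, a dogfood system, or a simulated environment without any other setup.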

Dave Pacheco: 01:25:21

And so you can use it in development also against our simulated environment. So something we didn't really talk about, but there's a command in Omicron which stands up a simulated version of our whole control, well, not the whole control plane, but most of it. So it's a real CockroachDB. It's real instances of almost everything except for networking and sled agent, which is, of course, a lot. But in terms of the control plane itself, it's a lot of the real stuff on a real database, and it is the exact same environment that our integration tests run in, which also means that you can run OMDB against it.

Dave Pacheco: 01:25:55

And you can use that for building and debugging integration tests also. You can, like, start that, start your simulated environment, and then point OMDB at it just to, like, poke around and see what, you know, what gets populated in this table in a normal environment or something like that. And then you can be working on some tests and be doing those operations in OMDB and and viewing the state in OMDB. I find it like a really useful tool for that too.

Bryan Cantrill: 01:26:19

Really useful. Because, I mean, again, it's just making the system comprehensible, which I just think is, again, so important. Eliza has dropped some Matt Keeter ASCII art in the chat showing you the net status, which again is showing you this ASCII art of what the system looks like. And indeed, being able to do that in simulation is also really valuable because it allows especially someone kinda ramping up on the code base to understand how the thing actually works just by using the debugger.

Dave Pacheco: 01:26:55

Yeah. It's huge.

Bryan Cantrill: 01:26:59

Well, it's been terrific. Eliza, I think we were successful in stalling long enough for you to be able to find the net status command because, I mean, that was the ultimate goal of the podcast: to get to the ASCII art at the end of the rainbow here.

Eliza Weisman: 01:27:13

Well, we didn't even get to my other favorite ASCII art, which is yours in Humility tasks.

Bryan Cantrill: 01:27:22

Oh, yes. Yeah. Do you have a, actually, I don't know if I've got a Humility system up right now. You said that in a moment.

Bryan Cantrill: 01:27:28

Yeah. I'm the on

Eliza Weisman: 01:27:29

a Brian.

Bryan Cantrill: 01:27:30

Yeah. Oh, thank you. Well then, yeah, please, let us continue to stall. I do love my ASCII art elbows. You know, I think that we all, and, like, look, Dan McDonald asked us about, like, how have we gotten this far into ASCII art and, like, no one's really mentioning Robert's block comments.

Bryan Cantrill: 01:27:46

I mean, obviously, Robert's block comments are what inspire all of us, but we're just really trying to make debug output that would be worthy of a Robert Mustacchi block comment. And, oh, Dave, a coffee table book on ASCII art, wait. Okay. Look. Welcome again to another chapter of the small demographic book club where we will have the, god.

Bryan Cantrill: 01:28:07

I want that coffee table book so badly.

Dave Pacheco: 01:28:10

I just feel like we've got so many good examples from this, and I would love to see more from other stuff, other software.

Bryan Cantrill: 01:28:16

From other systems across time and space. Yeah. How amazing would that be? God, that would be great. And, Adam, what was the book that Tom Lyon had? Shift Happens? No.

Adam Leventhal: 01:28:32

Yeah. Shift Happens. The keyboard book.

Bryan Cantrill: 01:28:33

Yeah. Keyboard book. I think we can one-up the keyboard book with an even smaller demographic of ASCII art from debuggers. I really, yeah. I think we have finally reached our, that would be just great.

Bryan Cantrill: 01:28:49

That'd be great. That would be one of these books that, like, will be just destroyed in my house. I actually need to have that. That's too valuable, actually, to be wasted on my children.

Bryan Cantrill: 01:29:02

Okay. So and then Eliza's dropped it in. And this is where I do love my ASCII art elbows. I don't know how much you're in Humility, but this is showing you task state output Mhmm. And showing you, and definitely Confluence CPU info minus minus v vibes here as well.

Bryan Cantrill: 01:29:20

We're just kind of like hanging all different kinds of state. But again, this has been super valuable. Humility has also been really valuable for all the same reasons OMDB's been valuable. And John, I mean, John and Eliza, you are users of both. And I think that this is a very common theme across all of our development: this ability to actually visualize the system as we're developing it, and that resulting in accelerated development.

Eliza Weisman: 01:29:46

It's fun that Humility and OMDB have very visual characters in terms of, like, how the output is formatted. And I think those are the hands of very specific individuals, and I think that's also a little bit nice. Yeah. I've learned that, you know, I assume the way that I have gone out of my way to try and format all of the OMDB commands I've added is the way Dave did it.

Bryan Cantrill: 01:30:12

Oh, of course. Yeah. No. We wait. What would Dave do?

Bryan Cantrill: 01:30:15

We always ask ourselves this when we're extending OMDB. Yeah. It's really good stuff, and it's been really important, Dave, to the way we've developed all sorts of software here. So thank you for OMDB. I mean, again, it was a spark on dry tinder and high wind for sure.

Bryan Cantrill: 01:30:38

And it's been really readily adopted by everyone. So it's been a lot of fun. It's amazing that it was so recent, relatively speaking. How did we, we were like three years without OMDB.

Adam Leventhal: 01:30:49

What are

Bryan Cantrill: 01:30:49

we doing? Yeah. It's amazing. Yeah.

Dave Pacheco: 01:30:53

I don't know.

Bryan Cantrill: 01:30:55

Yeah. I just don't know. But it has been great to have it and really, really viable.

Dave Pacheco: 01:31:04

Yeah. I mean, I'm glad it's been useful, and obviously it goes without saying, like, you know, what I did was, like, drop a skeleton down or whatever and, like, all the things we're talking about are a lot of stuff that everyone else in the company's added. It's such a general thing. We're just talking about so many different tools that it has in it. So it's been a real team effort.

Bryan Cantrill: 01:31:22

But I think also, like, people may be at similar junctures on other projects they're working on elsewhere. Right? Where it's just like wondering, is it worth developing this debugging infrastructure? And, yeah, I think, you know, our experience would be like, I mean, obviously, you listen to this podcast, like, you're clearly dialing for yes on this one.

Bryan Cantrill: 01:31:38

But like, hell yes, it's worth it.

Dave Pacheco: 01:31:42

Yeah. I mean, I don't even know what debugging infrastructure you're talking about, but yes. You know? Exactly.

Adam Leventhal: 01:31:48

Are you asking the question

Bryan Cantrill: 01:31:48

right now? Have I actually written that in a message to you? I'm sure I have.

Dave Pacheco: 01:31:53

Speaking of, it's the Homer Simpson, he asks Darryl Strawberry if he's a better outfielder than he is.

Bryan Cantrill: 01:31:58

I was like,

Dave Pacheco: 01:31:58

I don't know who you are,

Bryan Cantrill: 01:31:59

but yes. Oh, that was a Simpsons reference. Oh, Dave. They're such a scour. Well, yeah, this has been

Dave Pacheco: 01:32:10

We've never regretted building debugging infrastructure. Right? We've never been, like, we spent too much time on that one.

Bryan Cantrill: 01:32:16

It's actually true. Right. I mean, yeah. It is really true. There is not any debugging infrastructure that I've ever built that I've regretted.

Bryan Cantrill: 01:32:28

I just feel it's like one of these things that's also kind of hard to build wantonly. I mean, I don't know. Maybe as long as

Adam Leventhal: 01:32:35

You're building from a point of pain to begin with.

Bryan Cantrill: 01:32:37

That's exactly it. You're building from the point of the actual system. Like, I think that when you run into trouble is when you separate out the construction of the debugger from the implementation of the system. And there are, like, separate groups and separate teams and maybe even, you know, separate orgs or separate companies. It's like that's when you start getting, okay, this is this.

Bryan Cantrill: 01:32:59

Now we're building a debugger that's not actually useful, or you're adding a bunch of things that we actually don't need. And conversely, you're not building things we do need. But as long as you keep those things connected, it's pretty hard to have debugging infrastructure that you regret. Well, this has been awesome. I mean, it's just so great.

Bryan Cantrill: 01:33:18

What a bull's eye for us. But I mean, this is the Oxide and Friends we've been meaning to have. It's all been something to this, and I am kind of waiting for the internet to tell us, like, you guys already did an episode on this. I'll be like, you know what?

Bryan Cantrill: 01:33:36

That makes way more sense. I'm so embarrassed. It's like the Dennis Ritchie thing, which apparently we have told before. And we'll tell again. And we'll tell again.

Bryan Cantrill: 01:33:45

I mean, that's right. That's a warning. So it's a promise, not a threat. But a lot of fun. I know, Morris Chang, put your hand down.

Bryan Cantrill: 01:33:57

I know you've done a lot of OMDB work, but it's like we're just running out of time.

Adam Leventhal: 01:34:01

Catch you next time.

Bryan Cantrill: 01:34:02

Maybe catch you next time. Keep up the good work over there though. We do like the OMDB work you're doing. Alright. Well, thank you very much, Dave, Eliza, John.

Bryan Cantrill: 01:34:11

Great to have you with us. It's been a lot of fun. Adam, do try to convince your child not to litigate against us for past podcast episodes, if you don't mind. So just Spencer. If you could maybe get him to sign a release. It's like, why are you putting this thing, I feel I should have an outside counsel look at this. It's like, listen, do you want the toy or not?

Bryan Cantrill: 01:34:29

Like, if you want to have the meal, sign the document. Like, you don't need an outside lawyer to look at that. Like, well, I'm just gonna feed it into ChatGPT. It's like, goddamn it. Alright. No Happy Meal. Yeah.

Bryan Cantrill: 01:34:39

But, you know, if you could sign away his rights, we'd appreciate it, just for the future lawyers. Alright, well, thank you very much, everyone, and we'll see you next time.

Creators and Guests

Host

Adam Leventhal

Host

Bryan Cantrill
