The Network Behind the Network

Speaker 1:

Alright. So, you know, I wanna send out the social media message that we are starting now. Is that a skeet, a tweet, or a toot? Pick one, because I'm not doing it on all three.

Speaker 2:

I feel like you're one

Speaker 3:

of these

Speaker 1:

One of these turkeys needs to win so I can stop. Like, this is killing me.

Speaker 2:

I mean, where do you think our people are?

Speaker 1:

I don't even know. I really don't. I I mean, I don't even know. I I I, I'm I am, I don't even know. You know what?

Speaker 1:

I'm gonna do nothing. How about that?

Speaker 4:

Yeah. Yeah. You know what? You know what?

Speaker 1:

We've got who's got. We've got who's got. I've already got

Speaker 2:

chance they can get it in the podcast if they want.

Speaker 4:

If they

Speaker 1:

the podcast. Yeah. Yeah. Folks listening to the podcast thinking, like, I needed a reminder. Well, why don't you tell us which goddamn social network you wanted a reminder on, and we'll remind you on that one next time.

Speaker 1:

Okay? And if it's in Osborn, like, go take a hike. Forget it. We're not doing it. We're it's gotta, like or sourdable or all this other no.

Speaker 1:

It's like, LinkedIn? No. Thank you.

Speaker 2:

Thanks. Our best.

Speaker 1:

Our best. Alright. Well, so, super excited — we got Matt here. We got John. I think Arjen's actually gonna join us in a little bit.

Speaker 1:

And I wanna give a — or at least for a little bit. Arjen is still celebrating the new addition to his family. So, the new addition to his family may have some other opinions about him joining us for too long. But, Matt, before we kinda kick off with you, I wanted to get into some kind of prehistory of the problem that we're trying to solve here. Because this is a thorny one.

Speaker 1:

And, Adam, I can't remember if you remember some of these super early discussions at Oxide, but the challenge that we had in front of us is we know we're gonna make a rack-scale machine, and we know that the components of the rack, the sleds of the rack, are gonna have a host CPU in them. And we know that we're gonna need to have some other computer that controls the computer. And this is traditionally called the baseboard management controller, or BMC. We decided pretty quickly that we are not gonna have a traditional BMC, that we're gonna have what we call the service processor, a really stripped down microcontroller. Although, actually, from a microcontroller perspective, it's pretty beefy.

Speaker 1:

But it is not the full CPU. And I would refer people to our previous episodes of Oxide and Friends on Hubris and Humility. That is capital H and capital H. By the way, Adam, do you ever try searching Hacker News for hubris and humility?

Speaker 2:

I have. It turns out those are terms that are used very frequently, unrelated to our technology.

Speaker 1:

Really frequently. If you were selecting a technology name to be searchable on Hacker News, like, Hubris is the wrong name to pick. I'm trying to think, like, what is a worse name to pick for searchability on Hacker News?

Speaker 2:

Sort of like Go is, like, easier to find.

Speaker 1:

Go is easier to find. It's kind of a it's like a weird cross section of hacker news to search for hubris because you basically get people accusing one another of it.

Speaker 2:

Yeah. You're immediately into Trollville. Yeah.

Speaker 1:

And I've tried searching for hubris and humility — like, give me just the sentences that contain hubris and humility — and that doesn't improve things very much. It's like, I have never seen such hubris. Have some humility, dude. It's like, well, first of all, you're talking about our operating system and our debugger, so could you please not speak about it

Speaker 3:

today? But

Speaker 1:

alright. So we we we talked about that before, and we'll be talking about that. I'm sure we'll be hitting on that

Speaker 4:

quite a bit today. And just before

Speaker 2:

you move off of BMCs, baseboard management controllers, it's just worth noting that these things have become such piles, right? This is the thing that you connect to via HTTP, and it gives you a browser view of the screen buffer and lets you, you know, try to flash the BIOS or try to, you know, install off of a CD. And sometimes it works and often it doesn't. And it's this very mysterious collection of software. Like, if you've ever dealt with a server of any kind, you're familiar with the pain associated with it.

Speaker 2:

And when it doesn't work, sometimes people say, we'll try this other random firmware version and see if that works. And sometimes it does, and sometimes that doesn't. But it's very frustrating and a and a really terrible part of the server experience.

Speaker 1:

It is a big, big, big mess. And sadly, some of our colleagues have had to relive that, because we have needed to have commodity machines to do development on while we were developing our own hardware. And I feel for poor, poor Josh Clulow. I think for him it was like he was awake during surgery.

Speaker 1:

So, like, there is no anesthesia. And, I mean, there are so many times I'm like, Josh, we should really start a computer company. And he's like, I know. I know. I'm at that computer company right now.

Speaker 1:

Like, I'm trying to — so, so painful. So, as we were thinking about how we connect these service processors, this is the challenge: you've gotta be able to actually connect them over the network, which is really problematic. And, if folks haven't seen it, our colleague Rick Altherr gave a great talk at OSFC in 2019 on an exploit that he found in BMCs, which he called — I guess USBAnywhere is the name of the exploit — exploiting this kind of surface area and the network connectivity of BMCs to be able to potentially remotely own a BMC, which is very bad news.

Speaker 1:

Right? Because you once you control the BMC, you control heaven and earth.

Speaker 2:

So — and these things shouldn't be on the network — on the Internet, rather. But Rick discovered that, like, millions of them are?

Speaker 1:

Yes. He discovered — I believe he discovered, like, immediately something like 77,000 that are on the Internet that had this vulnerability. I mean, it was bad news. So we wanna avoid all of that, but you still have this problem of, like, you need a network to be able to connect to these things when they are otherwise powered off to tell them to power on, or there's kind of this insurmountable problem.

Speaker 1:

And the thing that we had really embraced, or were looking to, was something called NCSI. And this is the part, Adam, I was wondering if you remembered some of those early discussions around NCSI. And so

Speaker 2:

Dude, and this is, like, one of the the fancy features of a NIC. Right?

Speaker 1:

Yeah. Well, this is what allows that kind of physical cabling to be shared: in addition to this kind of high speed interconnect, you are on this much lower speed interconnect where you can actually use this thing to talk to a BMC. You can divert traffic, and this is the Network Controller Sideband Interface. And, you know, it seems really attractive, because it means that you can use the same high speed cabling for this management traffic, for the service processor traffic, which feels like the right answer. Because, at the time, we believed we were gonna have what I would call kind of a traditional Tioga Pass OCP kind of design, where we would have power in the back and network cabling at the front, and, god, we don't wanna double our cabling.

Speaker 1:

That sounds awful. We'll use the same cabling. And I said something that I kind of regret: I kind of distilled all this into a catchphrase, NCSI or bust. I don't know if you remember me saying that; I just ended up remembering it very vividly, NCSI or bust.

Speaker 1:

And this one definitely came back to haunt me as we were contemplating this. Because as it turns out, the 'or bust' was looking increasingly likely as we got deep into NCSI. And there were a bunch of challenges. And I think actually Arjen is here, if you can spot him. Arjen, maybe you can raise your hand, and we'll get you up on stage here.

Speaker 1:

And, you know, we were assuming that this management network would be over NCSI, but there were a bunch of problems. Well, there were the problems we knew about. And one of the problems we knew about was that this requires, like, a lot of firmware to behave correctly. And that always makes us super nervous, because firmware often doesn't behave correctly. So that was the kind of thing we knew about.

Speaker 1:

And, Arjen, do you remember — I mean, I actually went back to try to find this discussion. I remember we had a discussion that was really long, and it was the summer of 2020, and it was hotter than hell. And people were getting — I mean, I would say heated, but people were getting frustrated, in part because the problem felt very frustrating. And even folks that were advocating for the use of NCSI knew that, like, with all these problems, we know we're gonna have problems with firmware reliability and so on. But, you know, it just feels like NCSI or bust.

Speaker 1:

I'm like, oh, gosh. I shouldn't have said that. I should never

Speaker 4:

I was definitely in that camp, though, because I was very pro NCSI — because I'd never used it. Everyone who's never used it is very pro. The big one, though, that we really struggled with, which none of the NIC vendors — or none of our NIC partners — were doing a very good job explaining, is how do you reset the NIC properly? That's right. And basically, the problem is that the NIC is actually now not really a NIC.

Speaker 4:

It's more like a little switch, because there's 2 ports and there's a little merging of traffic going on. There's a little rule engine that executes stuff and that merges these flows. And now how many guarantees do we have that traffic from that management interface never ends up in the queues for the OS? Or the other way around: how much can we make sure that the OS can never spam its own management interface?

Speaker 4:

And then the bigger part was how do you reset the NIC without resetting the management interface? Because ideally, the management interface is a separate sort of portion in the silicon that has its own reset sequence, that is reset ideally even using a separate pin from, you know, your BMC — or your service processor in our case — versus all the other logic, like your heavy lifting logic for the PHY and everything else that the OS has control over. The problem is you need the OS to set up basically all the silicon surface area to get the PHY to even work and the link to work. So there's this chicken-and-egg problem, and it's really messy.

Speaker 3:

Who can

Speaker 4:

reset what and where and how? And we simply did not get the guarantees. Not even the guarantees — we were not even getting the warm fuzzies that this stuff would ever work well.

Speaker 1:

We we didn't even get the cold fuzzies. I mean, yes. We were asking questions of vendors. It was just like blank blank blank. You're like,

Speaker 4:

Well, the problem is that no one has really implemented this interface. Yeah. So the spec is there — there's a protocol; it's a protocol specification. Other than that it runs over RMII, I believe.

Speaker 4:

But nothing about how the management portion of the ASIC is supposed to work. Therefore, every implementation does its own thing, and they may do it right, but more likely than not, they just thought they did. And so we were looking through some of the documentation, and we were asking — we had a long list of questions for each of these NIC manufacturers. And I remember that we were looking through pieces of documentation and driver code and whatnot and trying to figure out what happens when you do certain things with the firmware.

Speaker 4:

Can the OS poke the firmware in a certain way to lock the management controller out? And even from reading documentation, we could already infer that some of these implementations were simply not gonna do the right thing.

Speaker 1:

Well, you could already feel like we have left the tarmac. We are off road here — in fact, we might not even have been on a dirt road. We may actually

Speaker 4:

Like, the only way that we could have really convinced ourselves that this was gonna work is if we would have seen the RTL that actually built up the device, or at least the reset tree for the device, and no one was obviously willing to share that with us. Definitely not at that stage. So

Speaker 1:

Well, and even our best partners, our most forthcoming folks, were like, boy, we've never exactly heard these questions before. Like, that is not a good sign, because we feel like we're asking the kinds of questions that anyone should be asking. And then — you mentioned the reset concerns, which were really deadly, but the power domain concerns were really deadly too.

Speaker 4:

Oh, yeah. These are super messy too. Because each of these chips has several power rails that you need. Some of them are, like, I think, 5 or 6 rails. These are fairly complex devices, because a lot of them do a lot more than just being a NIC.

Speaker 4:

They have a DRAM controller in them. They have — exactly — they have 1 or more CPU cores in them. Some of them have a lot of CPU cores in them to be —

Speaker 1:

a SmartNIC, isn't it? Please. It's a SmartNIC.

Speaker 4:

Yeah. And so there's a lot of power circuitry in order to power each of these more complex pieces. The PHY itself for a 100 or 200 gig link usually has 3 power rails — like, 2 for the analog portion and then one for the digital portion. And then there's PLLs and, like, you know, the whole shebang. Like, it's fairly complex stuff.

Speaker 4:

And then they load it up with all sorts of features in order to enable these storage applications and, I don't know — like, you know, the kitchen sink gets put in this thing. And so even powering one on is actually less than straightforward, and it gets really messy. Because our server design is very — we're very deliberate about power domains, and we have a sort of monotonic ramp where pieces are enabled depending on the power domain that the machine is in, so that we can also power stuff off properly if needed — in order to, for example, conserve power if you don't need the machine, or whatever, for whatever reason.

Speaker 4:

And that got really messy, with part of the chip wanting to be in an earlier power domain than all the stuff that is running when the host CPU is running. But you can't run a chip halfway. That's just not a mode that has ever been supported. And so — Right.

Speaker 1:

You

Speaker 4:

would have to power the NIC on before the host CPU is powered on. And then in what state does the host CPU find the NIC? Unclear. And then how much initialization of the NIC needs to be done by the host OS versus your service processor or your BMC? So it got real messy real quick.

Speaker 4:

And so as much as I really wanted that solution to work, Robert, in particular, convinced me that we were gonna be in for a world of hurt. And I think

Speaker 1:

it was one of these — yeah. And it was one of these things where you could just tell that every step is a step in the wrong direction. This has happened only a handful of times in my own career, where you could just realize that, like, we are down the wrong path. And, you know, a friend of mine was at a company and left abruptly after a very short stint.

Speaker 1:

And I was asking him, like, wait wait. What happened? And he said, you know, there's there's there's an old proverb. If you find yourself down the wrong path, go back. I'm like, that's that's a pretty good proverb.

Speaker 1:

And so I think we realized that we needed to go back. And meanwhile, something else had happened that was really important, and that was that we had made a big decision about how we were gonna cable the system. Instead of having cabling out the front, we were really aggressively exploring a cabled backplane and blindmating into it. We talked about that with Doug a couple of episodes ago, and we began to realize, like, wait a minute. Now this major argument for NCSI was reducing the cabling burden for the operator, because we didn't wanna double the cabling.

Speaker 1:

But if it's a cabled backplane you're blindmating into, it's like operators don't care if, you know, you're using a couple of those differential pairs for the management network. So I think all of these things kinda added up. And, yeah, Arjen, there was this moment where — and I think it was, again, because this had been, you know, I wouldn't quite say contentious, but there definitely had been, like, folks are like, NCSI is a mistake, and, no, we need to do NCSI. And then there's me having to reconcile my past self saying NCSI or bust. Because I really wanted it to work too, but we were just coming to the conclusion, like, this is the wrong path.

Speaker 1:

We need to stop. We need to go back. And we need to do this problem that was actually a a problem that we were resisting because it is also thorny, which is the alternative. And what's the alternative? The alternative is you need a separate management network.

Speaker 1:

So what that means is

Speaker 4:

We called that 'the bag on the side' for a while. That's how — well, this was really a thing where we had this clear — split is maybe the wrong word, but there were definitely 2 camps. There was one group of people, like, NCSI is definitely what we want, and then the other was, we should not do this ever, because it is such a — and then the interesting thing is that once we laid it out on the table, everyone agreed that it was not the right decision to push for that, and that we should find a different way to implement this. And I remember this very vividly, because I was smack in the middle of writing up the design for what the networking switch was gonna look like. Because at that point, we were already working on our own design, because we'd already come to the conclusion that we wanted to do an externally PCIe-connected device, not put the host CPU in there.

Speaker 4:

So we were already on our own path, but then the question became, like, what is that separate network going to look like? Is that gonna be a separate chassis? Is that gonna be Cat 5 sort of cabling, or Cat 6 cabling? That looks gross. Like, how do we deal with that?

Speaker 4:

And so

Speaker 1:

Yeah. Well, then you get the decision that we're going all in on the cable backplane kinda coming in there. And we're doing our own switch, but this means now we are gonna have a second switch, effectively. We're not gonna do one switch; we're gonna do 2. And, Arjen, do you wanna talk — actually, before you talk about it: I had asked you earlier, and I know, again, you've got the new addition to the family at home, so thank you very much for taking some time away from your leave to talk about this.

Speaker 1:

But I had asked you prior to this for the origin story of what we have come to call the management network, which is monorail. And I thought I knew what the origin was, but now I am worried that I am doing the thing that Adam accuses me of: sculpting history. I'm, like, retconning my own history. So now — and, Adam, you have to understand that it is painful for me to make this

Speaker 2:

sound right. Painful, and I appreciate it.

Speaker 1:

Yeah. Yeah. I mean, I think a little chest pain is normal when you're saying things like this. But so I thought I knew, but then I went through my chat history, and now I am a lot less certain.

Speaker 1:

So, Arjen, before you describe what it is, could you describe the origin of the name monorail? Or maybe you need to describe what it is to get to the origin of the name.

Speaker 4:

The name is somewhat technical in nature. But so, actually, I think Keith was the one who coined the term. And — Wait a minute.

Speaker 1:

You're punting this to Keith? We can't get — No. No. No.

Speaker 4:

So Keith came up with the name. I know why it is called monorail, but Keith —

Speaker 1:

Okay. Okay. Good. Good.

Speaker 4:

I don't wanna claim that I coined it — that I gave it the name monorail — because I didn't. Okay. But I did like the name a lot, and I carried it forward. And I wrote it down a lot, so that everyone else started calling it that. So that, I guess, I did do.

Speaker 4:

But so why the name monorail? Well, this whole network came to be out of quite a few constraints. Because, yes, what we wanted to do now was a secondary ASIC in the same chassis, because we didn't want to use a separate chassis. And we didn't wanna cheap out and just take an off-the-shelf switch and then try to squeeze that into the rack with the additional cabling. So that was part number 1. Then the problem became how many ports we need, because we have 32 sleds in the rack, plus then the 2 switches themselves that need to be connected.

Speaker 4:

We want a technician port on it — 1 or more technician ports on each switch — that is accessible on the front of the rack, so that you can connect a laptop or something into the rack and do management functions. And then we had the power shelf controllers that we wanted to connect. And then we wanted an interconnect between the management network and the main switch, the Tofino ASIC that we have. So we very quickly run into this thing where it's like, oh, how many ports do we need?

Speaker 4:

Well, it turns out we needed something like 37 ports, if I remember correctly — and Matt can correct me if I'm wrong here — either 36 or 37. And so finding an ASIC that was even suitable, that had that many ports, was kinda challenging. Because you would think that, oh, well, there's, like, 48-port switches on the market. And yes, there are 48-port switches on the market. But the way these are constructed is usually they have one central switch ASIC, and then you need sort of supplementary or complementary ICs — chips that make up PHYs — that give you all these ports.

Speaker 4:

And the way that they connect these things together is using a sort of industry pseudo-standard called QSGMII, with SGMII standing for serial gigabit media-independent interface. And the quad means you can combine 4 of those in a single transceiver — so a single send-receive pair on your printed circuit board — and you can do 4 links through that. And so you have these sort of breakout chips. So these chipsets in 48-port gigabit switches usually consist of one switch that has, for example, 32 ports accessible, and then you need to sacrifice some of those, and then you can use these quads to get to 48.

Speaker 4:

So it's a bit of a puzzle to get there. Because all these ASICs say they support something like 50 ports, but those are logical ports. They're not actually physically bonded out on the IC and therefore accessible to you to build a network with. So the puzzle there was how do we get to the number of physical ports that we actually need? How do we do that with the smallest number of integrated circuits that we can buy, because all these things cost money?
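
To make the port arithmetic concrete, here is a rough sketch of the per-switch budget in Rust. The per-category breakdown is an assumption for illustration; the conversation only pins the total at roughly 36 or 37 ports per management switch.

    // Rough management-switch port budget, as discussed above. The split
    // below is illustrative; only the total ("either 36 or 37") comes from
    // the conversation.
    const SLED_PORTS: u32 = 32; // one link per sled's service processor
    const PSC_PORTS: u32 = 2; // power shelf controllers
    const TECHNICIAN_PORTS: u32 = 1; // front-of-rack technician port(s)
    const PEER_SWITCH_PORT: u32 = 1; // link to the other management switch
    const MAIN_SWITCH_PORT: u32 = 1; // interconnect to the Tofino side

    fn main() {
        let total =
            SLED_PORTS + PSC_PORTS + TECHNICIAN_PORTS + PEER_SWITCH_PORT + MAIN_SWITCH_PORT;
        println!("management ports needed per switch: ~{total}");
    }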

Speaker 4:

We were in this — like, remember, we're talking now summer 2020. We're starting to get really deep into the, you know, parts are not gonna be available. All the suppliers have pulled their inventory from the distributors. So you have to go to each individual silicon manufacturer and ask them for inventory. The automotive industry is transitioning to Ethernet.

Speaker 4:

So there's lots of Ethernet that is going into automotive applications. And fortunately, not so much the switch ASICs, so that's okay. But, you know, we're small fish. So getting enough inventory is problematic, because if there's only so many of them to go around and there's someone — like a bigger organization — that places an order, then you might be screwed. And then the final piece was, there's only so many ways in which you can actually connect systems together.

Speaker 4:

And, traditionally, you do that using twisted pair Ethernet, or CAT cabling. And the problem there is that a 100 megabit link uses 2 pairs, 4 conductors. A gigabit link uses 8 conductors. And initially we don't really need more than 100 — like, 100 meg is fine — but it would be nice to have a path to a gigabit in the future, because you never know how much data you're gonna be pushing through these links.

Speaker 4:

You might want to change parts of the design at some point, and then, you know, you might be doing, like, larger software updates through this. Like, you might be distributing a several-hundred-megabyte OS package or something through this thing for, you know, stages 1 and 2 of your OS loader. And so you don't know how that evolves. And so you have this discrepancy between 2 pairs — 4 conductors — and 8 conductors. And then twisted pair Ethernet relies on magnetics, on inductors, to decouple the 2 systems.

Speaker 4:

That's how you can have 2 hosts connected using, you know, 100 to 200 meters of Ethernet cable in 2 different buildings on 2 different power circuits, without having currents flow between these two buildings in ways that you don't want. And the problem is that these magnetics — if you look at a NIC, you can see those are pretty visible. They're these larger, chunky, usually black, square cube looking things that sit right behind the Ethernet jack. And if we wanted to put 30-something of those on a circuit board, that's a significant chunk of real estate. And we simply did not have that real estate.

Speaker 4:

So then, first of all, we don't wanna deal with this 4 conductors versus 8 conductors for 100 megabits or 1 gig. And then how do we do this in a way that you don't need these magnetics? Now, there is some trickery that people have done where, if you're staying roughly on the same power circuit, then maybe you don't need these inductors, and then you can AC-couple twisted pair Ethernet, but there's no specification for any of that. Some chips sort of support it. Some say they do.

Speaker 4:

It may not be validated. So you're definitely in, like, sort of undetermined land. And you're building something that you're gonna be on the hook for, and it might not be compatible with anything else — like any ICs that you might buy in the future. So that's not an ideal situation. So ultimately, the solution we landed on was that the switch ASIC that we picked from Microchip exposes most of its SerDes, most of its ports, over SGMII, the serial gigabit media-independent interface.

Speaker 4:

I forget exactly where the acronym came from — it's a loosely defined thing from Cisco that is effectively used between GBICs, like optical transceivers, and MACs: a media-independent interface, but in serial form, so it goes over 2 conductors, or 2 differential pairs. So you have 1 send, 1 receive. And that's where the monorail name came from, because you basically have one transmit-receive pair — one mono rail — to build up a link.

Speaker 4:

And then the nice property that we got from that is that SGMII, in its sort of standard form, is an LVDS signal that you can AC-couple, meaning that you don't need to have 2 circuit boards connected with 2 grounds. You're not relying on the ground for your signal levels. So you're using differential signaling between the two systems that is very similar to how an existing 10 gig or 100 gig link already works. And so the same cabling that we would use in the backplane for the 100 gig link, the same Twinax, would be perfectly suitable to also run that LVDS signaling over. Contrary to twisted pair — CAT cables are slightly different.

Speaker 4:

They work differently. They don't have a shield — well, CAT 7 now does, but CAT 7 is basically just Twinax, so it could have worked. But this looks much closer to the 100 gig link that you already have between the main switch ASIC and the host, except now it's just a gig link. And then we can gear that down so that in practice we do an actual 100 megabits of traffic over it.

Speaker 4:

But the link is 100 meg or 1 gig capable. So in the future, if we have faster parts, we can use 1 gig if we wanted to, and we don't have to change any cabling — and the switch itself would still be compatible. So this was a long puzzle to make work, which took, I don't know, 2, 3 months to complete, I think, before we were all in agreement on how this thing was gonna look. So — Yeah.

Speaker 4:

That's how that's how we ended up there. Long story.

Speaker 1:

And so, as you can imagine, the much shorter story that I had in my head was what I think anyone — certainly anyone from Gen X, and indeed most millennials — would think when you say monorail: it's Marge versus the Monorail. I mean, you immediately get the Simpsons reference.

Speaker 4:

Well, I I mean, that might have been in there, but that that I don't know. I don't know. I mean, has he

Speaker 1:

known that Keith originally coined this? I kinda feel we'd have kept it anyway. I mean, clearly, this is sure to be a very good Simpsons reference from someone steeped in Simpsons references.

Speaker 4:

Has Keith even watched the Simpsons? I don't know.

Speaker 1:

I feel, Adam, we can answer that question on Keith's behalf. Yeah. Lots. Lots. I'm almost ready to ask Keith this question, because he'd be like, no.

Speaker 1:

You dummies, of course it's a Simpsons reference. What else would it mean? But what I do know — and, Arjen, this is why I was confused — is what I found when I went back to chat. What I do have is a private chat between us, and this is in November of 2020, and it's the first reference to monorail I could find anywhere in Oxide's history: it's you saying to me, I'm dying here with the monorail song. I guess I'll go by Lyle going forward — Lyle Lanley being the character from the Simpsons episode.

Speaker 1:

So really I am now I think

Speaker 4:

once once the name came to

Speaker 1:

be once the name

Speaker 5:

once you said it was Yes,

Speaker 1:

absolutely. Yes. Yes. Yeah. Absolutely.

Speaker 1:

And so when I wasn't sure if I was gonna be able to get a hold of you today, I called Robert — and Robert's also not able to join us now. But I'm like, hey, Robert, what is your recollection of the origin and the name? And he's like, if you are looking for my permission to retcon it into a Simpsons reference, you have my permission. I'm like, alright.

Speaker 1:

That's exactly what I'm looking for. And that was great. I I was like, wow. Quick quick phone call. That was

Speaker 2:

a good show. Good show.

Speaker 4:

The very first time it came up was actually in a meeting. It was a spoken thing. I think we referred to it immediately as monorail for multiple reasons, one of which was that it's gonna be an SGMII link.

Speaker 4:

And so but then, like, definitely, the whole circumstance of how the thing came to be was very Simpsons esque. So yes. That

Speaker 1:

that's great. Well, it is a great name. And I remember even you at the time, because this is a problem that you were really wrestling with. Like, how do we do this? It feels like a very over-constrained problem. And the method that you found to do this, in terms of using the, what, 100FX —

Speaker 1:

Right? I don't know if you wanna describe a little bit, like, the path that a packet takes as it goes over the monorail, but it is kinda wild. And this was one of those where, Arjen, I just remember you saying, I don't know why this wouldn't work. And you were asking other people, and that was the answer you were getting from a lot of people: like, I don't know why this wouldn't work, but it is weird, and, like, no one has ever seen it.

Speaker 4:

Yeah. So the problem here was that — because the constraints I laid out, those were only part of it. The thing that we were struggling with was that the service processor we picked has one RMII interface. So it has an Ethernet MAC that is exposed using an RMII interface, and you can connect to 1 Ethernet thing, whether that's an Ethernet PHY —

Speaker 4:

So you can have a physical copper connection to an Ethernet network, like an Ethernet switch, using a Cat 6 or Cat 5 Ethernet cable. Or you can probably connect RMII interfaces MAC to MAC — you can connect directly to a switch which exposes its MAC over RMII. That will probably work, but there's no guarantees that it will. I found some discussions, for example, on a TI message board where people were asking that question, like, hey.

Speaker 4:

Can I take the RMII from this microcontroller and connect it directly to the switch? And there was some back and forth about whether or not that was supported, and it had not been validated in silicon, so they didn't know if it was supported. You were on your own to try. So we were definitely off in no-man's land — like, we're definitely going off the beaten path. Because what we ultimately wanted was for that single RMII connection to go to 2 independent switches, because we wanted the redundant management plane.

Speaker 4:

And so — you need to post a diagram to make this visible, because it's kinda too difficult to explain; a visual helps a lot. But basically, we're going from the service processor, which has 1 RMII connection, to a small three-port switch that sits right next to it. One of those links is that RMII link to the service processor, with the 2 remaining links available to go to the 2 switches in the rack, so that you have 2 independent management planes. And then we're using some VLAN trickery to keep these two paths separate in the host — or in the service processor — so that you have 2 truly independent paths that can both be active, and you don't have any switching loops, etcetera. But the problem was we could not find a small switch that would have 1 RMII interface and 2 SGMII interfaces, because that was ultimately what we wanted: we wanted to connect those SGMII interfaces through the backplane to the switch ASIC that sat in a different chassis, in the switch chassis.

Speaker 4:

And so ultimately what we ended up with was a PHY part from Microchip, which exposes 2 SGMII interfaces on one side and 2 100FX — which is the original 100 megabit fiber standard, which is not really a standard; it's also a little bit loosely defined — on the other side. And that thing actually, in this case, acts as a media translator. So we're going from the service processor, using an RMII interface, to a little switch, which then has a 100FX link to a PHY, which then goes SGMII through the backplane cabling and then into the rack switch itself, where the link is connected to a MAC on the other end. And, yes, this was very much an 'I drew it up and asked' — I remember emailing back and forth with some FAEs from Microchip, this is their division in Denmark.

Speaker 4:

And they definitely said, like, yes, this should work in theory, but no one has ever built this — why would you do this? Because effectively, if you look at how this would look in a traditional sense, the PHY that normally sits in the switch chassis now sits in another chassis across a cable. Like, we've basically taken part of the switch and put it in every server, which is kinda weird. But it works, and it now exposes a clean SGMII interface to the outside world from both the switch and the host.
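
Since a diagram genuinely helps here, the packet path as described above can be written out as a small Rust sketch, kept in the same language as the other examples. The hop names are invented for illustration and are not from the actual Oxide code.

    /// One management-network path, per rack switch, as described above.
    #[derive(Debug)]
    enum Hop {
        ServiceProcessorMac, // the SP's single MAC, exposed over RMII
        SledThreePortSwitch, // small on-sled switch: RMII in, two 100FX out
        SledPhy,             // PHY as media translator: 100FX to SGMII
        BackplaneTwinax,     // SGMII over the same cabled backplane as the 100G links
        RackSwitchMac,       // MAC on the management switch ASIC in the switch chassis
    }

    fn management_path() -> [Hop; 5] {
        // There are two such paths (one per rack switch); VLAN separation in
        // the service processor keeps them independent and loop-free.
        [
            Hop::ServiceProcessorMac,
            Hop::SledThreePortSwitch,
            Hop::SledPhy,
            Hop::BackplaneTwinax,
            Hop::RackSwitchMac,
        ]
    }

    fn main() {
        for hop in management_path() {
            println!("{hop:?}");
        }
    }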

Speaker 4:

And so in the future, if there ever is a microcontroller, for example, that has an SGMII interface, or maybe even 2 SGMII interfaces, that would be absolutely great — we could connect them natively. Or if we find a small switch that has 2 SGMII interfaces and an RMII link, we can replace the MAC and the PHY plus that little switch with one part. Or — and this is a path we can even potentially explore — we can even drop the SGMII, do some jiggling, and do 100FX too.

Speaker 4:

Because it turns out the signaling is electrically similar enough, and then we can use some configuration magic, using our third management network, to then bring the link up. But that's maybe a story for another time.

Speaker 1:

Right. Spoiler alert: there's another network behind the network behind the network. Yes. We're winners.

Speaker 3:

Alright. I think you're skipping the funniest part, which is that it's impossible to find a standard for 100BASE-FX. Like, no one believes this.

Speaker 4:

There is no standard.

Speaker 3:

Like, I think there's no actual standard for this.

Speaker 4:

I think the original — basically, like, 3Com or someone made something, either 3Com or Cisco. Someone made something to connect GBICs to a MAC. Basically, they needed something to connect these fiber transceivers to a switch ASIC on a PCB. And someone invented something, and they sort of loosely drew it up. And then they gave the specification to all the GBIC vendors.

Speaker 4:

And everyone has just been sort of implementing this, but there is no specification for this. Like, no one can tell you exactly what the electrical specification of this is — like, what the voltage levels are. I don't know how the whole world works, but it does somehow. And it's a hope and a prayer. And when we

Speaker 3:

were trying to bring it up, like, is it AC coupled? Is it DC coupled? Do you have to pull it up? Do you have to terminate it? I mean, you you ended up figuring out experimentally, if I remember, if I just swapped

Speaker 4:

it on the other side. It should work either DC or AC. And, it turned out that in our case, one side really wants to pull up the other like, really wants it to be pulled up. It it actively does that. And then the the the source didn't like that.

Speaker 4:

So we had to AC-couple and then bias it and, I don't know, a bunch of trickery. But, yes, that was an experimental setup. We basically built a prototype using some of our proto boards, and then we added components and removed components until the link worked.

Speaker 1:

And so, Matt, this is a great point to get you into the story. Because by the time you had come to Oxide — in, like, what, September of 2021 — we had figured out that we wanted to do this crazy thing. You came after we'd already come to the conclusion of doing this thing that, like, should work, but no one has done it — why would you do it that way? And this, to a certain degree, drops in your lap: like, hey, so now we need to make all this software work.

Speaker 1:

And in particular — yeah, exactly: Matt, please make this work. And in particular, I just remember, you know, Arjen, you and I, shortly before Matt started, going through the VSC7448 manual, and, like, I don't know what I was expecting. I mean, it's obviously complicated, but holy god is it complicated.

Speaker 1:

And in particular, there's a MIPS core in there. And you're like, no. No. Don't worry. Like, we're gonna we're not gonna

Speaker 4:

use the MIPS part. One core

Speaker 1:

in there.

Speaker 4:

Oh, are you talking about the switch? Well, that was switch.

Speaker 3:

Yeah. The server.

Speaker 1:

Yeah. The server. No.

Speaker 4:

No. No. No. Well, with the switch, we knew that there was a MIPS core in there, because — No.

Speaker 1:

I know.

Speaker 4:

We don't — the dev kit runs Linux. It's like a full-on Linux machine.

Speaker 1:

Right. And so I don't think I realized again, though, it was kinda like this daunting moment of like, holy god. Like, yeah. Okay. 2nd switch.

Speaker 1:

No. You actually need a second switch operating system now.

Speaker 4:

No. No. No. The manual is is a couple 1,000 pages.

Speaker 1:

It's a couple thousand pages long. Yeah, it is a long manual. So, Matt, you started, and this was kinda — I mean, you were looking like, hey, where can I chip in? And it's like, yeah, no one has picked up the software side of this.

Speaker 1:

So I think I if I recall correctly, Matt, one of the first things you did is bought the dev kit for the 7448. Is that am I remembering that correctly?

Speaker 3:

Yes. So I showed up, and Arjen was like, here, just expense this — I think it was, like, a $2,400

Speaker 4:

Yeah. $2,000 dev

Speaker 3:

kit. Yeah. It's it's not cheap and it is built like a tank. Like, it's got half inch acrylic mounting plates on either side of the PCB. So I could throw this down the stairs, and it would be fine.

Speaker 1:

Well — yes, it would be fine. The stairs would not be fine. And the thing is gigantic. It's, like, physically huge.

Speaker 4:

So do we still have pictures, in the Oxide and Friends album, of the dev boards? I think I put pictures in the album of one of our Gimletlets mounted into — like, I hacked it into my dev unit. Like, it's screwed in. Like, the Gimletlet sits under the acrylic. Well, that's — yeah, that's your picture.

Speaker 4:

But there's a picture where I put my Gimletlet inside the thing and then wired it in. Because one of the modes in which we can run the switch ASIC is you can basically tell the CPU to always stay in reset and never bother, and we connect SPI to our service processor instead. And so the service processor then runs our own driver that we've written to bring this whole thing up. So we bypass the whole MIPS CPU. That thing never —

Speaker 1:

that thing never runs. Adam, maybe we can post the pictures for people, because — Yeah.

Speaker 4:

Let me

Speaker 2:

go that's right.

Speaker 1:

Yeah. So, Matt, you yeah. You would just probably get the little kit. Yeah.

Speaker 3:

Yeah. So I get this dev kit, and it is bigger than I expect. And so I, like, unpack it and just try to fit it on the desk in my tiny apartment. And like Arjen said, the plan here all along was: this thing has a core, and we do not want that core running, for the typical Oxide reasons.

Speaker 3:

Like, we wanna be able to own the firmware that runs on it. We wanna have a good amount of trust in the system, and we also don't want yet another processor — we keep finding them in various places. And so the plan all along had been we are gonna keep that core in reset, which you can do with pin strapping, and then configure it over SPI. Because over SPI, you can read and write registers.

Speaker 3:

And, like, it's a chip. If you set registers to the correct values, it will behave in the correct way. And that was kind of the plan all along. And I don't know if anyone quite realized how many registers we would have to set to the correct values before this chip would actually come up.
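
The "hold the internal CPU in reset and drive the chip over SPI" plan boils down to a register read/write primitive. Below is a minimal sketch of that shape in Rust; the SPI framing, types, and names are assumptions for illustration, not the actual Hubris driver.

    // Minimal sketch of "configure the switch by reading and writing
    // registers over SPI", against a hypothetical SPI transport. The framing
    // here is made up; the real driver is considerably more involved.
    trait SpiBus {
        fn transfer(&mut self, tx: &[u8], rx: &mut [u8]);
    }

    struct SwitchSpi<B: SpiBus> {
        bus: B,
    }

    impl<B: SpiBus> SwitchSpi<B> {
        /// Read a 32-bit register at a (hypothetical) flat address.
        fn read(&mut self, addr: u32) -> u32 {
            let tx = addr.to_be_bytes();
            let mut rx = [0u8; 4];
            self.bus.transfer(&tx, &mut rx);
            u32::from_be_bytes(rx)
        }

        /// Write a 32-bit register; bring-up is "just" a long series of these.
        fn write(&mut self, addr: u32, value: u32) {
            let mut tx = [0u8; 8];
            tx[..4].copy_from_slice(&addr.to_be_bytes());
            tx[4..].copy_from_slice(&value.to_be_bytes());
            let mut rx = [0u8; 8];
            self.bus.transfer(&tx, &mut rx);
        }
    }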

Speaker 1:

No. We did not. Narrator's voice: they definitely did not.

Speaker 4:

I had some sense. I just kicked it down the road. I was like, I can't think about it, because then I start to hyperventilate, because I really don't have enough time to build it. I knew that it was gonna be lengthy, because at some point — because the driver is open source — I had spelunked a little bit through the driver to figure out, like, okay.

Speaker 4:

How how does this look? And, well, as you might expect, it is a large blob of c, and it is, you know, pretty dense with lots of stuff that is not really documented as you would expect.

Speaker 1:

I think Matt's got some opinions on all this.

Speaker 3:

So it's interesting. The chip has kind of 3 layers of documentation. There is the datasheet, which is 520 pages and is a PDF. Attached to that PDF — because that's a thing that semiconductor vendors like to do — there is another PDF, which is the list of registers in the chip, and that one is 845 pages. And it turns out that neither of those datasheets is actually sufficient to bring up the chip.

Speaker 3:

So I I put a link in the chat to Mesa, which is the Microchip Ethernet switch API. And to their credit, it is open source. It's on GitHub. You can go look at it. And the canonical way to bring up the chip is to use Mesa, to the point where if you have problems with the chip and you ask Microchip for help, they will say, why aren't you using Mesa?

Speaker 3:

We can't help you. Have you seen how big the data sheet for this chip is? Good luck. So we've had, we've had issues with that in the past. And so, yeah, a lot of the bring up was just tracing through their SDK and figuring out what exactly it's doing.

Speaker 3:

Because the chip is, you know, 54 ports, 80 gig. It can be configured in any number of ways.

Speaker 4:

Well, weren't you then using their driver as sort of a soft emulator for this thing, that would basically log all the register reads and writes that would happen? Yeah.

Speaker 3:

So at one point, in a state of desperation, I took their SDK, I compiled it on my own personal computer, and I replaced all of the register reads and writes with calls that would print what it was trying to do. And then by running switch startup, I could get this very robust log of every register operation that it would try to do, and then compare that against my own configuration code and see what was different, basically. Which is also fun, because there was no actual switch attached. So in certain cases their code would try to read registers and expect certain values, so I had to add a bunch of special cases of, like, oh, when it reads this status register, return 12, because that's the right value to keep the startup going in the code.
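
Matt did this by patching the register accessors inside the vendor's C SDK itself; purely to keep one language across these examples, here is the same idea sketched in Rust — a register-access seam with a tracing backend that logs every operation and returns canned values for the status registers the startup sequence polls. All addresses are made up.

    use std::collections::HashMap;

    // The seam: vendor init code only touches the chip through reads and
    // writes, so swapping out this layer captures everything it does.
    trait Registers {
        fn read(&mut self, addr: u32) -> u32;
        fn write(&mut self, addr: u32, value: u32);
    }

    // Fake backend: no chip attached. Log each operation; return canned
    // values for registers the init sequence expects to poll.
    struct TraceRegs {
        canned: HashMap<u32, u32>,
    }

    impl Registers for TraceRegs {
        fn read(&mut self, addr: u32) -> u32 {
            let value = self.canned.get(&addr).copied().unwrap_or(0);
            println!("RD {addr:#010x} -> {value:#010x}");
            value
        }
        fn write(&mut self, addr: u32, value: u32) {
            println!("WR {addr:#010x} <- {value:#010x}");
        }
    }

    fn main() {
        // Hypothetical: pretend some status register must read back 12 for
        // the vendor init sequence to keep going, as described above.
        let mut regs = TraceRegs {
            canned: HashMap::from([(0x0001_2345u32, 12u32)]),
        };
        // Running the vendor init against `regs` would produce a log to diff
        // against what our own configuration code emits.
        regs.write(0x0001_0000, 0xdead_beef);
        let _status = regs.read(0x0001_2345);
    }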

Speaker 1:

Yeah. Absolutely brutal. And as I recall, you discovered a lot of issues that way — like, issues that were preventing the thing from working — by effectively tracing through for things that, as it turns out, were very load-bearing. And —

Speaker 3:

There were several cases where I got to the point of, like, I know there's one thing that I have configured wrong, and, you know, someone that has been using this chip for longer than I have could probably tell me in 15 minutes, but I have not been using the chip for that long, so it's gonna be another 2 weeks of struggle. I think one of the big ones — jumping a little bit ahead — was when we were trying to bring up QSGMII. So like Arjen said, we picked this chip because it has a lot of SerDes, and we can connect it directly to almost every service processor in the system, but not every one. I think it has, like, 35 SerDes, and we need 40 or something like that. Yeah.

Speaker 4:

No, it has 32, I think — 32 — and I think we need 36. So we needed a couple — yeah. So we use a 1-QSGMII-to-4-SGMII sort of media converter chip from Microsemi.

Speaker 3:

And that was the root of many problems. It turns out there are many different modes you can operate the chip in. One of the modes is ganging up groups of 4 SerDes to do a single XAUI link, I believe, which is a 10 gig link where the 4 are cooperating to send a single data stream. And if you don't want that cooperation, you have to turn off this feature, because it's on by default. And that took us, like, 2 months to figure out.

Speaker 3:

Eventually, Arjen finally tracked it down.

Speaker 4:

Well, yeah. There's a setting where — because if you want a link to behave the way it needs to, you need to have all these links be driven by the same clock. And basically there's this one bit that you need to set — or unset, rather, in our case — that basically distributes the clock to that quad of SerDes. And unless you do that, they will all be together on the same PLL. So one of the links will work, but all the other ones will break.

Speaker 4:

And we had some really interesting behavior where everything was fine until we brought that link up, and then suddenly one of the other links started misbehaving, or went down. I forgot exactly what — Yeah.

Speaker 3:

No. It was like the links were grouped in sets of 4 for this XAUI thing.

Speaker 1:

That's right.

Speaker 3:

And so I discovered, like, oh, for some reason, ports, you know, 53, 54, and 55 only work if port 52 is up. Like, what the heck does this mean?

Speaker 1:

Well, and this is one of those where you're just like, I know that there's gonna be, like, one underlying cause that is gonna explain all these symptoms, but I've got no idea what on earth it could be. It just makes no sense that all of a sudden this other port, you know, 3 ports away, starts behaving differently depending on how I configured this port.

Speaker 3:

Yeah. This was around the same time when Eric actually physically soldered a probe to the QSGMII link, and we captured a bunch of data on the oscilloscope and then reverse-analyzed it to figure out the packets that were going through there.

Speaker 1:

Yeah. And you have a great blog entry on that, Matt.

Speaker 3:

Yeah.

Speaker 4:

That is

Speaker 1:

pretty much

Speaker 3:

what I'm talking about. Chat here.

Speaker 1:

Yeah. And

Speaker 6:

can you describe tuning that link remotely? I think that's a good story.

Speaker 3:

Yes. So this was another setup where I didn't actually have a full Sidecar in my place, because they're huge. They're even bigger than the dev kit that I showed you a picture of. And so Eric had one in his basement, and the way we tuned it was: he set up a Google Meet with a webcam pointed at an oscilloscope, and I connected remotely to a computer that was plugged into that network switch, and then just twiddled SerDes parameters until I got some good-looking links.

Speaker 1:

I did not know that. Oh my gosh.

Speaker 3:

Oh, yeah. Yeah. It was like a Google Meet. He went out to get lunch and, yeah, I spent a while tuning this and eventually got something that looked nice. Let me see if I can get a

Speaker 1:

The tricks,

Speaker 6:

of course, that Matt had to figure out was that in order to trigger the scope, it needed to see, you know, the right kind of transition. And so he had to keep knocking the thing out into, like, la la land and then bringing it back so that the scope would retrigger, and he could see what he was doing.

Speaker 4:

By the way, to the question in the chat asking if we know that these scopes have web interfaces: yes, we do know that they have web interfaces, but this is just much faster, because everyone already runs Meet all the time anyway. So, you know, quickly point a webcam at it in Meet, and you're good to go.

Speaker 3:

The other funny thing about this, which I kind of just learned on this project, is that this speed is kind of the worst. So, like, when you're doing very slow links, you don't have to care about tuning them. When you're doing this speed of, like, 1 gig links, you have to kind of manually tune them by hand, which is annoying. And once you're up to, like, 10 gig links, you don't have to manually tune them anymore, because they're so fast that they have to tune themselves.

Speaker 1:

Right. Right. At that point it's actually not possible to do it by hand. So you are actually in the valley of despair of manual tuning. Exactly.

Speaker 1:

Yeah. Well, to be

Speaker 4:

fair, Matt, when you interviewed — because you'd been doing motion platforms for 3D printing — I asked you why you wanted to do some of this. And I remember you saying, well, I wanna work on some solid-state electronics for a while. You know, no moving things. I'm like, oh, do we have the project for you. No moving things.

Speaker 3:

It's true. There are no fluids. There are no, no liquids. That's great. I mean, there there are fans, but that's a that's a subject for a different

Speaker 1:

episode. Yeah. That is awesome. I see. And then we when was it along here that you discovered the firmware payload that needs to be loaded onto this

Speaker 3:

thing? So that was actually, on when we're working on the PHY side of things.

Speaker 1:

That's the PHY side. Okay. Right.

Speaker 3:

Yes. Yeah, that was real. We've kinda known about this for a while, where all of the PHY datasheets — so the PHYs are the chips which are kind of doing the end of the network.

Speaker 3:

So they are the things that give you RJ45 jacks on your technician port. And then we also use them as port expanders, kind of inside the system, to break QSGMII into 4 SGMII links. And all of the datasheets have, like, a mysterious line saying, load the firmware update — or, like, load the configuration — somewhere in their setup, which is left unexplained. It turns out that is also in the Mesa SDK, and most of the time it involves patching an 8051 core that lives somewhere inside the PHY. So if you search the Mesa repo for the word 8051, you find all kinds of stuff.

Speaker 4:

This code is also in the Linux kernel, by the way, because I originally looked at both their open source code and the Linux kernel, which has a different implementation, because I couldn't work out how to exactly do it. Because the way that they reset this thing is pretty gnarly. Like, they set some bits to sort of trigger a reset, and then they fault it or something, and then it spins in some place, and then they load stuff into RAM, and then they give it a kick, and then it resets itself and jumps back into the code that they just patched. But it is what it is. Yeah.

Speaker 4:

No. It it it does not sound sanitary. You you also kind of You gotta regulate with this stuff.

Speaker 3:

You can trace the layers of IP where, like, there's the outer layer where you just set registers and it does stuff. There's whatever IP they're wrapping inside of that, where you have to, like, go through a different register and you have indirect writes, where you write a data payload and an address, and then the chip goes off and does that write on your behalf. And there's actually one that's nested too deep, where you have to use indirect writes to configure an indirect write. So, like, 2 layers down, some IP core inside the chip eventually gets your message.
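As an aside, here is a minimal Rust sketch of the indirect-write pattern being described. The register offsets, bit names, and trait below are placeholders for illustration, not the actual VSC/Mesa register map:

```rust
/// A minimal sketch of the indirect-write pattern. The register offsets and
/// bit layouts here are placeholders, not the real VSC7448/Mesa register map.
trait RegBus {
    fn write(&mut self, addr: u32, value: u32);
    fn read(&mut self, addr: u32) -> u32;
}

// Hypothetical offsets for one layer of indirection.
const IND_ADDR: u32 = 0x00; // where the inner IP block should write
const IND_DATA: u32 = 0x04; // the payload to write there
const IND_CTRL: u32 = 0x08; // kick + busy bit

const CTRL_GO: u32 = 1 << 0;
const CTRL_BUSY: u32 = 1 << 1;

/// Perform one indirect write: stage the target address and data, kick the
/// engine, then poll until the chip reports it has done the write for us.
fn indirect_write<B: RegBus>(bus: &mut B, target: u32, value: u32) {
    bus.write(IND_ADDR, target);
    bus.write(IND_DATA, value);
    bus.write(IND_CTRL, CTRL_GO);
    while bus.read(IND_CTRL) & CTRL_BUSY != 0 {
        // Real firmware would bound this loop and surface a timeout error.
    }
}

/// The "two layers down" case: use the outer indirect engine to poke the
/// inner engine's ADDR/DATA/CTRL registers, one staged write at a time.
fn doubly_indirect_write<B: RegBus>(
    bus: &mut B,
    inner_addr_reg: u32,
    inner_data_reg: u32,
    inner_ctrl_reg: u32,
    target: u32,
    value: u32,
) {
    indirect_write(bus, inner_addr_reg, target);
    indirect_write(bus, inner_data_reg, value);
    indirect_write(bus, inner_ctrl_reg, CTRL_GO);
}
```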

Speaker 1:

Wow. And, I mean, I guess on the one hand, it is not a surprise. I mean, I'm sure there are many 8051s in the Oxide rack that we don't know about that are sitting on the other side of these various controllers. But this is one that we were having to deal with more directly than others, and that we were actually handling in our bundles and so on. And then those ended up being load bearing.

Speaker 1:

Right? I mean, it ended up being really important to get this thing in kind of the right sequence.

Speaker 4:

Oh, yeah. Because the payload actually has a 100BASE-FX fix. Like, there's an errata that they document where they explain that sometimes the link doesn't come up in 100BASE-FX on one side unless you apply this patch. So, yes, this is load bearing for us. We have to have this applied, because it probably wouldn't work if we didn't.

Speaker 1:

So I mean, literally, like, correct 8051 code is the difference between this thing running and not not running for us.

Speaker 4:

Not correct. It like, it's it's hot patched in RAM. Correct? Like like like, scribble over some bytes to make it work. That that's how it goes.

Speaker 4:

But just to sort of circle back to the beginning of our, like, our NCSI conversation. So imagine a NIC for a 100 gig link that has storage acceleration pieces. It has other, like, higher end CPU cores, DRAM controllers, like, all this other jazz. Like, the stuff that we're talking about, that we feel is gross and complicated, that's just a little 1 gig PHY with, you know, a data sheet that is, what, 2, 300 pages long, a register spec of maybe, I don't know, 300 pages, a little bit less probably. I forgot.

Speaker 4:

Not that complicated in comparison. But even there, all of that already took us a lot of time and effort to get working reliably and to reverse engineer our way through. And that was with open code, mostly.

Speaker 3:

Yeah. To their credit, like, Microchip publishes all this stuff, and you can just go read through it, without signing any NDAs or anything.

Speaker 1:

We would have been dead in the water without that. And I think that's not an exaggeration. Right, Matt?

Speaker 3:

Yes. Yeah. Well, it was funny, though, as you read through it, you see notes about, you know, this fixes Bugzilla number something something something.

Speaker 1:

Yeah.

Speaker 3:

There are, like, hints of Microchip internal documents where they say, like, you know, configure auto negotiation per UG1035. And I'd love to get my hands on some of those documents. So, Microchip, if you're listening, send me an email.

Speaker 4:

But it was part of the criteria for why we selected these parts. Because, to come back on this inventory problem a little bit, these parts were not necessarily automotive grade, and so that kept us out of the crunch that happened in 2020. Or they were just not as common for automotive applications. And then the other part was that I had looked at this, and I'd seen that these drivers are open source, that, for example, the PHY is supported in the Linux kernel. And so we had some confidence that, okay, we will probably be able to make our way through this and get it to work, given that there's enough sort of open code floating around for these outside the Microchip walls.

Speaker 3:

And although, even with the parts being not automotive grade, we ended up building with 2 different PHYs. So we have a Oh, yeah.

Speaker 4:

Because it was still probably

Speaker 3:

Yeah. Like, a VSC8552 and a VSC8562, which turn out to be totally different. Like, totally different bring up sequences, different architectures internally. So we have to support both of those simultaneously.

Speaker 3:

What they're

Speaker 4:

sort of pin compatible? They were I forgot what the FAE called it. It's like a marketing term they like. So basically, the major difference between the two parts is that the 62 was the next generation that supported MACsec. So you could have, at the MAC level, used encryption with certificates.

Speaker 4:

None of that do we use. We do pay for it, but we don't use it. But because there was just no inventory of the parts that we wanted, because, yeah, someone bought it in the

Speaker 1:

that we wanted, we had I believe it was our record setter at a 93 week lead time. Yes. And I like and at what point do you just Yeah. Yeah. Yeah.

Speaker 1:

So It's like Oh,

Speaker 2:

I'm sorry. 92 week. I'm sorry. Good news.

Speaker 1:

Yeah. Good news, everybody. I mean, just like, okay. So at what point do you just say, like, no, we don't make this. It's like, I agree.

Speaker 1:

Just, like, write down, like, fuck you on a piece of paper And it's like, at what point

Speaker 5:

do you think?

Speaker 4:

Have fears at some point.

Speaker 1:

Yeah. It's like it's not that's not a lead time. Like, that's not that's not right. That's a that's a I don't know. At some point, this kinda cuts over and you can't

Speaker 4:

The problem was we had actually purchased quite a bit. We had purchased some, as in, like, a 100 of them or so. We had enough for sort of a first couple of builds. So we procured enough that we could build sort of until DVT, and then we figured that there would be enough lead time to sort of procure the rest. But then it turns out that we're just not gonna get these in time.

Speaker 4:

So we switched to the higher end part, which was electrically mostly compatible. We had to do some footprint jiggery to make that work, and, like, Matt had to write new code for it. And now we still have a bunch of these 8552 parts sitting in inventory that we kinda have no use for, because we're probably never going to use them. I don't know if the lead times have actually come down. Maybe they have at this point, but do we care? I don't know.

Speaker 4:

I don't know. So I don't know.

Speaker 1:

And so, Matt, I'd love for you to get into some of your methodology in terms of how you developed software here. Because, I mean, this is obviously, like, a complicated thing to wrap one's brain around, this whole thing. And one of the things I loved about your approach is, as you went to each one of these building blocks, you really put in a lot of tooling to understand, like, how the part was working. And we've been able to, like, keep using that tooling. You wanna speak a little bit to your methodology there?

Speaker 3:

Yeah. So, like like we've said, we started from their SDK, which also includes full register definitions. Even some fields which are not defined in the register list are defined in their SDK. But, of course, it's not really machine readable. It is a bunch of C preprocessor macros to, like, select bits and so on, but it does include Doxygen comments.

Speaker 3:

And so one of the very first things I did was write a terrible half of a Doxygen parser that was just sufficient to read through these structured headers and spit out metadata, basically to figure out, you know, from these structured headers, build me a Rust struct that has all of these registers and the fields within them, and getter and setter functions, and documentation for all of that. So that was kind of the foundation for all of the tooling that we ended up building on top of that. So, after making the world's worst Doxygen parser, I had a bunch of Rust code and could start talking to the chip. So I started with, like, the basic SPI connection where I could read and write registers, and then started plumbing the metadata through. So instead of just reading and writing raw addresses, you could specify names of registers.
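For illustration, this is roughly the shape of code such a generator might emit. The register name, address, and field layout below are invented, not Microchip's actual definitions:

```rust
/// Illustrative output of a header-scraping code generator: one register
/// with typed getters/setters and the scraped Doxygen text carried along as
/// doc comments. The name, address, and field layout are invented.
#[derive(Copy, Clone, Debug, Default)]
pub struct PortConfig(pub u32);

impl PortConfig {
    /// Hypothetical address of PORT_CONFIG for port 0.
    pub const ADDR: u32 = 0x7100_0000;

    /// Port enable (from the scraped comment: "Set to enable the port MAC").
    pub fn port_ena(&self) -> bool {
        self.0 & 1 != 0
    }
    pub fn set_port_ena(&mut self, ena: bool) {
        self.0 = (self.0 & !1) | u32::from(ena);
    }

    /// Link speed select, a 2-bit field at bits [2:1].
    pub fn speed_sel(&self) -> u32 {
        (self.0 >> 1) & 0b11
    }
    pub fn set_speed_sel(&mut self, speed: u32) {
        self.0 = (self.0 & !(0b11 << 1)) | ((speed & 0b11) << 1);
    }
}

fn main() {
    // With metadata plumbed through, tooling can take "port 0 config" by
    // name, tweak a field, and pretty-print the result.
    let mut cfg = PortConfig(0);
    cfg.set_port_ena(true);
    cfg.set_speed_sel(0b10);
    println!("PORT_CONFIG @ {:#010x} = {:#010x}", PortConfig::ADDR, cfg.0);
    println!("  port_ena = {}, speed_sel = {}", cfg.port_ena(), cfg.speed_sel());
}
```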

Speaker 3:

So, like, write port 0's configuration register with this value, read it back, and have it pretty printed, and actually show you, like, what's going on inside of those registers. And then from there, it was just a lot of going through the SDK and figuring out what it was doing at each step to bring things up. So I picked one of the protocols, I think probably SGMII was the easiest, and started porting over the configuration code for that, bought an off the shelf network switch so I could have a test bed where I plugged from the dev kit to the off the shelf switch. And if I could get bytes flowing in both directions, that was a good sign.

Speaker 3:

And then continue to build a bunch of tooling around this so that you could kind of check the status of the system at all times. So, unfortunately, Adam, I have, I have some screenshots here. See if I can post some pictures of the tooling.

Speaker 2:

It it It'll be in the show notes. Yeah.

Speaker 1:

And, Matt, I'm sure you had you know, we've talked about this before, that we do these kinda weekly demos. And at some point in time, once you got some of this working, where you were at least configuring the switch on your own, I remember that you did a demo could you describe that a little bit? Because I think it was an early act of terrific Oxide showmanship.

Speaker 3:

So once I had enough to kind of boot the switch and configure the RJ45 port on the front and the SGMII link to a different off the shelf switch, I set up a demo where I kind of showed this. I showed that you could look at the port status and look at counters and see packets flowing through it. And then, midway through the demo, I revealed that I was actually running my home Internet through the switch. So it was coming from the wall to my modem, to the switch via RJ45, out over a cable to the other switch, and then to my Wi-Fi router, and that was how it was connected.

Speaker 3:

And to prove that, I turned the port off and immediately dropped off the video call. And then

Speaker 2:

Which is a great demo. Like, a great, great demo. Great stage dive too, where you say, and now when I turn it off, and he's gone. Okay.

Speaker 2:

I guess that's working as intended.

Speaker 3:

Yeah. And so, yeah, let me share some screenshots. So here's a good example of tooling. This is actually for the SP side of things. But this is showing running a status command, which shows you the status of both internal links in the system.

Speaker 3:

So these are the 2 different links that Arjen was talking about: RMII, and then 100BASE-FX, and then SGMII out. And then it also shows the MAC tables, which I found invaluable when trying to figure out what was going on, because the MAC tables are essentially ground truth for what packets the switch has seen. Whenever a packet goes into a port on the switch, it looks at the MAC address, and then it uses that if you try to send packets to that MAC address in the future. And so this is a really good example of, you know, if a packet has gone through and we see the MAC address, we know that it's arrived. It doesn't matter what either side has actually, like, admitted; this really tells you whether the switch has seen it.
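A toy sketch of MAC learning helps show why the table is ground truth; this is just the concept, not the switch's actual implementation:

```rust
use std::collections::HashMap;

type Mac = [u8; 6];

/// Toy model of MAC learning: an entry can only exist if a frame with that
/// source MAC actually arrived on some port, which is why the table is
/// "ground truth" for what the switch has seen.
#[derive(Default)]
struct MacTable {
    entries: HashMap<Mac, u8>, // source MAC -> ingress port
}

impl MacTable {
    /// Called for every ingress frame: learn (or refresh) the source MAC.
    fn learn(&mut self, src: Mac, port: u8) {
        self.entries.insert(src, port);
    }

    /// Forwarding decision: a known destination goes out one port; an
    /// unknown destination would be flooded to every port.
    fn lookup(&self, dst: &Mac) -> Option<u8> {
        self.entries.get(dst).copied()
    }
}

fn main() {
    let mut table = MacTable::default();
    let sp_mac: Mac = [0x02, 0x00, 0x00, 0x00, 0x00, 0x01];

    // A frame from this MAC arrives on port 3...
    table.learn(sp_mac, 3);

    // ...so seeing this entry later proves the switch really received it.
    match table.lookup(&sp_mac) {
        Some(port) => println!("{:02x?} learned on port {}", sp_mac, port),
        None => println!("unknown MAC, would flood"),
    }
}
```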

Speaker 3:

So after a lot of grinding, we finally had all of the different ports and protocols and PHYs working, and I started doing some higher level tooling. So this is a picture of the monorail status subcommand, which you can run on the hardware, and it will tell you every single port in the system, whether it's up or down, how fast it is, whether if there's a PHY attached, whether the PHY is up or down. And so this is kind of our our one stop shop for looking at system status, and this has been extremely helpful.

Speaker 1:

Extremely helpful. So I love this for a bunch of reasons. I feel like this has also got great pedagogical value in terms of, like, if you wanna understand like, if you understand the output of monorail status, you actually understand a lot about the network. And, it the I love this thing, Matt. And, Adam, I don't know if you you have had to run this thing in anger or not.

Speaker 2:

No. But this is spectacular.

Speaker 1:

Oh, it is. Because, you know, Adam, we obviously saw this a lot with DTrace, where people will be, like, there's a big difference between watching someone else use DTrace and actually using it yourself to get your own butt out of the fire. And I feel this way about monorail status. Matt, I'd obviously seen this before. It looks great.

Speaker 1:

I I love it. But then just recently, you know, we were trying to debug some issues, and I needed to run it myself. And I was like, oh my god. I love this thing. It's it is, it is, really extraordinary, because it just, it just tells you a lot about the system, and there's a lot that you know just by looking at this thing.

Speaker 3:

Yeah. And so this has been done for a while. Like, it's been in pretty good shape, and a lot of the remaining work has been kind of ironing out weird issues. So the link between the monorail switch and the big switch is something that Ariane could probably say more about, but it has been pernicious. It occasionally goes down and, you know, starts refusing to auto negotiate.

Speaker 3:

So it's been a lot of figuring out stuff like that, and either figuring out fixes or, more recently, adding watchdogs to detect when various sides of the link are stuck and, like, manually kick them to cause them to reset.
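A hedged sketch of that watchdog idea in Rust, with placeholder states and a placeholder link trait rather than the real Hubris task's API:

```rust
/// States we might observe while a link tries to come up. Placeholder names,
/// not the real driver's state machine.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
enum LinkState {
    Down,
    AutoNeg,
    Training,
    Up,
}

/// Minimal view of a link the watchdog needs; real code would be talking to
/// PHY and switch registers instead.
trait Link {
    fn state(&mut self) -> LinkState;
    fn restart(&mut self);
}

/// One watchdog poll. Call this periodically (say, once a second); if the
/// link has sat below "up" for `limit` consecutive polls, kick it and start
/// counting again.
fn watchdog_poll<L: Link>(link: &mut L, stuck_polls: &mut u32, limit: u32) {
    match link.state() {
        LinkState::Up => *stuck_polls = 0,
        LinkState::Down | LinkState::AutoNeg | LinkState::Training => {
            *stuck_polls += 1;
            if *stuck_polls >= limit {
                link.restart();
                *stuck_polls = 0;
            }
        }
    }
}
```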

Speaker 1:

Yeah. And I think I mean, this is one where we definitely know we are big time on our own, because, you know, we've got the 7448 talking to Tofino. Like, pretty sure we can say with confidence that we're the only ones on the planet doing that. I mean, I don't think anyone else has done this before. And the fact that it is occasionally misbehaving, it's kinda hard to know why.

Speaker 1:

Right? Which that could be a bunch of different things. That could be on the Tofino side, for sure. It could be on the 7448 side. It could be the way we've configured it. It's hard to exactly know what's going on there.

Speaker 3:

Yeah. And it doesn't help that the the auto negotiation is pretty opaque. Like, even especially on the Tofino side, you just kind of see it going into auto negotiation and then going into link training. And sometimes it makes it out of those two stages, and sometimes it doesn't. On the 7448, you have a little bit more visibility and that you can see it through the auto negotiation state machine, but this is not super well documented.

Speaker 3:

So, yeah, it's it's tricky. But with sufficient amounts of watchdogging, we can just detect when it gets stuck and reset the link in that case. And that seems to cause it to come back up.

Speaker 1:

Which is a relief. Because this is one of those things that can obviously be really problematic. If this link goes down and we can't reset it or restart it, we're in hot water pretty quickly. So we need to

Speaker 3:

do have 2 switches. Right?

Speaker 1:

What's what

Speaker 3:

are the odds of both when they go down?

Speaker 1:

Exactly. Exactly. And so, I mean, Matt, the presence of the management network now that we've gone all the way through all of this, I mean, the NCSI alternative we knew it was the wrong direction way back then, but, oh my god, is that clear now. I mean, can you imagine if we were relying on the high speed network and its functionality at all? Because we've been able to really understand pretty broken systems via the management network.

Speaker 1:

This thing has actually been pretty robust.

Speaker 3:

Yeah. I mean, one of the nice things about it is also that we're not booting Linux to configure the switch. So the way that they expect the SDK to work is running Linux on the MIPS processor that's co-located within the switch. So it's doing direct memory reads and writes to memory mapped registers to configure the switch from within, which is very cool. But the fact that we're not doing that means that we are booting in, I think the last time we timed it, like, 4 seconds flat.

Speaker 3:

So the management network is up before you've noticed that the fans are, like, making noise when you power up the rack.

Speaker 1:

Right. Yeah. And this is a really big deal because, I mean, we wanna be we want that management network to be up really, really quickly. We do not want, and to get just get that basic liveness of the system. And it's been, again, it's been robust.

Speaker 1:

And in fact, it's actually really important that it's robust, because, you know, I think we got a new sense of appreciation. We have been debugging gimlets, our compute sled we've been debugging them on the bench for, you know, a long time. And, boy, when those sleds go into the rack, there are no dongles. You know, you are no longer connected over these debug ports that we've been using to debug this thing.

Speaker 1:

It's like, yeah, we don't have this anymore. We only have the management network to now debug the service processor, and then the root of trust, which is even further away. So we really need that management network to be working all the time to be able to debug them.

Speaker 3:

Yeah. So a lot of John's work has been, building, you know, well structured tooling for kind of accessing everything about the SPs on the management network. And a lot of my work has been building unstructured hacky tooling for doing the same thing.

Speaker 1:

Yeah. Do you wanna I don't know which one you wanna describe first. Do you wanna describe the hacky tooling or the well structured tooling?

Speaker 3:

Hacky tooling is the hacky tooling is funnier. So this is actually building on some stuff that Brian did, where he added the ability to take task dumps. When a task has crashed, you get a record of its memory, and then you can extricate that through the management network and examine it on the bench later, which is very helpful. And it turns out that reading memory of tasks that have crashed is, like, only a small step away from reading memory from tasks that are live, which is only a small step away from kind of reading arbitrary memory from the chip. And so a lot of the work I had been doing was adapting our code that was meant to be used with the debugger to work over this management network.

Speaker 3:

Because once you can read arbitrary memory and you have the ELF data and the DWARF data from the binaries, it turns out that you can actually get a lot of useful information out of a running system.
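A minimal sketch of that idea, assuming you already have a symbol map recovered from the ELF and DWARF and some transport that can read SP memory; the trait and names are placeholders, not Humility's real interfaces:

```rust
use std::collections::HashMap;

/// Whatever transport can read arbitrary SP memory: a debug probe on the
/// bench, or (as described above) request/response packets over the
/// management network. Placeholder trait, not the real tooling's API.
trait MemReader {
    fn read(&mut self, addr: u32, buf: &mut [u8]) -> Result<(), String>;
}

/// (address, size) for static variables by name, as recovered from the ELF
/// and DWARF for the firmware image that's actually running.
type SymbolMap = HashMap<String, (u32, usize)>;

/// Dump one named variable from a live system as a hex string.
fn dump_symbol<R: MemReader>(
    reader: &mut R,
    symbols: &SymbolMap,
    name: &str,
) -> Result<String, String> {
    let &(addr, size) = symbols
        .get(name)
        .ok_or_else(|| format!("no symbol named {name}"))?;
    let mut buf = vec![0u8; size];
    reader.read(addr, &mut buf)?;
    Ok(buf.iter().map(|b| format!("{b:02x}")).collect())
}
```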

Speaker 1:

And this has been extraordinary, honestly. And, also, like, this network is fast. I mean, we're so used to, when you're debugging on the bench, going over the SWD interface, which is going over USB and USB is fast, but it actually runs pretty slowly for implementation reasons. And boy, Matt, it is nice. It runs really, really quickly.

Speaker 1:

And I really appreciate having a relatively high speed network. You know, this is not a low speed network. And we've started to use that in a really load bearing capacity, where we can kind of quickly extract the debug information that we need out of a task that's misbehaving, and be able to understand what's going on. So, I don't know, John, if you wanna talk about some of the better structured tooling, while Matt and I are kind of hacking away here to be able to understand these systems.

Speaker 1:

I don't know if you wanna elaborate on any of the tooling that that you built on top of management network or why it's so important. Sure.

Speaker 5:

I mean, I don't I don't know how far you want me to go back. When I started at Oxide last February, I showed up and I had in my interviews, you know, there were open positions for a control plane engineer and for a embedded engineer. And in my interviews, I was like, I'd really like to hop in between those 2. And I got a lot of, oh, that'd be perfect. We really need somebody to work on sort of interface between those two pieces.

Speaker 5:

So I showed up at Oxide, and, like, day 1, Adam sort of pulled me aside and said, hey, we have these, like, placeholder stubs for this thing we're calling the management gateway service, which is how the control plane which is the, you know, host level software that a customer would interact with on their rack talks to the service processors over the management network. How about you take this and run with it? I said, okay. Sure. You know, it's day 1.

Speaker 5:

Why not? How how hard could it be?

Speaker 1:

How hard could it be?

Speaker 5:

Yeah. Here we are a year and a half later and that is still the thing I'm working on most days. It I mean, I think most of what it can do would not be a surprise to anybody who's been listening to the the rest of this episode. Right? Like, it it exposes all the kinds of functionality that you would want from your, you know, service processor management network side, out like like the the fundamental thing is like being able to update systems.

Speaker 5:

Right? We can send updates to the service processor. And speaking of speed, like, I think the first time I demoed updating a service processor over its network link, certainly the first time I ran it myself, it's like you see you know, it's got to erase its flash which takes, you know, a few seconds and then you start to stream the update in. And I was used to flashing it over a dongle where you watch the progress bar march across the screen over several seconds. I ran it over the network and it it was just done with, like like, I it started and then it was finished.

Speaker 5:

And I thought I had screwed something up, but I hadn't. Like, it it just it's so much faster than going over a dongle. Right? You can

Speaker 1:

write the thing immediately.

Speaker 5:

So the a lot of the, I think, more interesting stuff that that's become useful, especially in the last couple of months, is the way that the service processor is connected to the host OS and the way that we expose that out over the management network. I don't can I dive into that? I don't know if that's Yeah.

Speaker 1:

Absolutely. Yeah. I dive into it. Yeah.

Speaker 5:

So the service processor is connected to the host CPU. There are 2 UARTs that we're connected to over the host CPU. One of them is the the normal serial console UART. So one of the pieces of functionality that we've exposed internally and was absolutely critical as we initially put gimlets into the rack and lost all of our dongles is that the service processor and this gateway service can proxy out the serial console over UDP. So you can you can connect to essentially like you've plugged into the serial port of 1 of these gimlets, which doesn't have a physical serial port, but through the management network.

Speaker 5:

The other UART

Speaker 1:

And hold on. Well, this is huge, though. And, Adam, have you run the humility console? Have you used John's work on this?

Speaker 2:

Yeah. To see the the console, unbelievable.

Speaker 1:

Unbelievable. And it is because, John, you're able to make this thing not lossy. Right? I mean, it's not fast, but it's not lossy, which

Speaker 5:

is actually really, really important. Well, speaking of hacky tooling, you just mentioned my favorite hacky tool of all, which I did not specifically mention yet. So, one of the tricks with the way these UARTs are connected is that the serial console UART is actually physically jumpered to the service processor. So with our original set of gimlets that we had on benches, that jumper was not connected, and you could actually just plug in, like, an FTDI dongle and get direct serial console access to the machines, just like you would any other serial console UART. But once we started physically jumpering that UART to the service processors which, I think, maybe even on the latest rev of the gimlet, is that physically wired in?

Speaker 5:

Like, it's not even a jumper anymore? Somebody correct me if I'm wrong on that.

Speaker 1:

I can't remember on that one. Yeah. I can't remember if they're installing the jumpers by default or yeah.

Speaker 6:

Production revs are all jumpered by default, and, actually, the headers aren't even populated.

Speaker 5:

Right. So if you have a gimlet that's already jumpered, you can no longer plug in a serial console dongle, even if you have the gimlet on a bench out of the rack. So Brian mentioned in passing humility console proxy, which is this thing that piggybacks on the management network console proxy, which is actually, like, a quasi production level thing. It's not fast, but it's perfectly usable. With humility, you don't have a management network connection.

Speaker 5:

All you have is a debug dongle. So because Humility already knows how to send and receive commands to Hubris tasks, it can sort of hijack the control plane agent task, which is normally responsible for all the management network structured messages, and tell it, hey, instead of evacuating your serial console out via the management network, just store it in a buffer, and I'll come back and read it later and tell you to reset that buffer based on however much I've read. So, with a, well, I don't know, 400 millisecond delay between each packet read and write, you can pretend like you have a management network by going through a debug dongle through this humility console proxy, which is
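A rough sketch of that polling scheme, with a placeholder trait standing in for the real Hubris/Humility plumbing and the few-hundred-millisecond cadence mentioned above:

```rust
use std::io::{self, Write};
use std::thread::sleep;
use std::time::Duration;

/// What the debugger-side proxy needs from the control-plane-agent task:
/// read out whatever console bytes have accumulated since the given offset.
/// This trait is a placeholder, not the real interface.
trait ConsoleBuffer {
    /// Returns (bytes, new_offset).
    fn read_since(&mut self, offset: u64) -> (Vec<u8>, u64);
}

/// Poll the SP's console buffer over the debug dongle and mirror it to
/// stdout: roughly the "pretend you have a management network" loop
/// described above.
fn proxy_console<B: ConsoleBuffer>(buffer: &mut B) -> io::Result<()> {
    let mut offset = 0;
    loop {
        let (bytes, next) = buffer.read_since(offset);
        if !bytes.is_empty() {
            io::stdout().write_all(&bytes)?;
            io::stdout().flush()?;
        }
        offset = next;
        sleep(Duration::from_millis(400));
    }
}
```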

Speaker 1:

It's great. Delightful. Oh, man. It is so huge. Because when you need this you've pulled the sled out of the rack, and this thing should be booting off of its M.2s and so on, and you've gotta have a way, without having it plugged into a management network, to know whether the thing's even booted. So, like, when you need this, you really, really, really need it.

Speaker 1:

And, John, I I have needed it on several occasions and have been very, very grateful for it. It's really extraordinary.

Speaker 5:

I think you've probably used it more than anybody. So I'm glad certainly I was I was sort of flying blind. Like, a lot of the serial console stuff I implemented based on our gimletlets which, you know, I got a few months after I started here and I I implemented all of the serial console handling based on that. And we started using it on the gimlets without, like, I had literally never tested it on a gimlet before somebody started using it in anger. So I was very happy to see that that transfer over.

Speaker 1:

Oh, it works like a champ. And, you know, I have to tell you, when I needed it, like, I needed it so bad. You know, when you are, like, five problems deep into a problem, and you're, like, I need this thing to really, really work, or I'm just gonna burst into tears. And, like, that just worked like a champ. It was great.

Speaker 5:

So for what it's worth, the reason you had that experience is because Robert had that experience, but the tool did not exist. And he he messaged me and said, hey. Is there any way for me to get the serial console if I don't have a management network? And I said, I mean, theoretically, sure, but I need to do some work. And he said, okay.

Speaker 5:

And had to go back and figure out some other way of working around this problem. So, you know, a couple of days later, this this humility console proxy was born and it was no longer useful to him, but you've borne the fruits of his pain. So that's

Speaker 1:

Totally. And I mean, I think, John, we've seen this time and time again, right, where when we stop and build that tooling even though, you know, Robert found a way to debug his problem or whatever and moved on, we knew we were gonna see something like that again in the future. And it wasn't I mean, it was only a couple days of work, if I recall correctly. It wasn't that bad.

Speaker 5:

Yeah. That's right. It I mean, it's it's very much a hack. It was easy to to throw in there and and, you know, it's not not a big deal. And like you said, very useful.

Speaker 5:

It was obvious when he asked for it. Like, oh, this is gonna come up over and over again. I really need to fix this.

Speaker 2:

Well, I

Speaker 6:

think the thing I love the most about it is that you don't have to like, because that's the way everything is normally connected, there's you don't have to perturb state in order to turn this thing on. You can just decide, like, uh-oh, something really bad happened, and now I need a serial port. And boom.

Speaker 3:

You get it.

Speaker 1:

Oh, boy. Yeah.

Speaker 5:

Yeah. I guess maybe it's worth mentioning that the service processor in general is just pulling whatever the host sends across this UART and discarding it if it doesn't have anybody connected to send it to. And it turns out that itself is actually load bearing. Like, at least in an initial version of the host OS, if the serial console jumper was connected but the service processor was not pulling any data, it would not boot. It would pause at some point waiting for that FIFO to clear.

Speaker 5:

So it turns out that the serial the service processor pulling data off that line is actually pretty important.

Speaker 1:

Yes. And let's just say that I've had to debug that one several times over. We're like, why the hell won't this thing boot? Because I'll connect a serial console to it. Like, oh, it's booting now.

Speaker 1:

Alright.

Speaker 3:

What's going on? Yeah.

Speaker 1:

And it took me a while to realize, you know: no, no, Dumbo. When you connect the serial console, that is actually what's causing it to boot.

Speaker 5:

Yes. Yes. That released the boot gate. That's right.

Speaker 1:

So sorry, John. I I I didn't mean to stop you and make you elaborate on that one, but that was that was No.

Speaker 5:

No. It's cool. I'm glad you mentioned it. I would have forgotten to talk about that. So the other UART I said earlier the service processor has 2 UARTs.

Speaker 5:

The other one we use for we we call it IPCC, which I think is inter processor control channel. Does that sound right?

Speaker 1:

Plausible enough.

Speaker 5:

Okay. So that one is so this has been really interesting, because we control both sides of this link. We are writing all the firmware on the service processor, and we are writing, you know, the holistic boot OS on the host side. We have full control over what we use that link for. And we have put a bunch of really interesting stuff in there. Like, the service processor can tell the host early boot time parameters: which slot it should read the first portion of its RAM disk out of, or which bits should be set for startup options.

Speaker 5:

Like, should it boot into KMDB or boot into, I forget what those other bits are. Brian, you've used this more than I have. See, probably.

Speaker 4:

Yeah. You can set all these the debug bits for the OS so that you can run with, like, way more logging enabled or less logging if you don't need it, which is really helpful.

Speaker 1:

Well, and then, kind of a callback to an earlier episode, you can also boot over the, not over the management network, you can boot over the, boot over the

Speaker 4:

NIC in the front. Yeah.

Speaker 1:

Yeah. Over the NIC in the front, which was useful when we didn't have the management network up necessarily. So that's another thing you might wanna indicate over those bits.

Speaker 5:

Yeah. That's right. So the the coolest thing I think I I'll go and jump straight to the thing I demoed on Friday last week. I think the coolest thing we built so far with the management network is the ability to completely restore a gimlet, do a full OS reinstall. So the way this works is we have a startup option similar to all these debug bits that says, let me back up just a second.

Speaker 5:

So when the host OS boots normally, it reads the first 32 megs of its RAM disk out of RAM, and then that RAM has a hash in it of the matching like, the rest of the RAM disk that's stored on 1 of the M.2s. So then the OS will go and look on the partitions of the M.2, try to find the matching hash, and then load the rest of the RAM disk from there.

Speaker 1:

And see our episode on holistic boot for more details on that. We we have talked about that in the past.

Speaker 5:

Right. So we now have a startup option that the service processor can set, and therefore an operator, via our management gateway service via the control plane, that tells the host OS, instead of looking in your M.2s, I want you to ask the service processor for all the data you need to boot the second phase of your RAM disk. So the service processor obviously does not have the kind of memory it would need to store a couple-hundred-megabyte OS image. But, again, it can act as a proxy out to, you know, something else that does. So the way this bit works is the host OS starts, it reads the hash from its RAM, and then it says, oh, I'm supposed to ask the SP.

Speaker 5:

So it'll send a message on this IPCC line and say, hey, go fetch me, you know, the data starting at offset 0 for host OS block x y z, you know, with some hash. And the service processor will then relay that request out to the gateway service, who presumably is sitting around answering these requests with matching host images, sends a block of data back, gives it to the host, and the host says, okay, thanks. Now give me the block starting at 1024, and so on. And we can end up live streaming a host OS over the management network to a machine that has no hard drives at all, or has dead hard drives that we've just replaced with brand new hard drives.
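A hedged sketch of that fetch loop, with an invented trait, block size, and message shape standing in for the real IPCC protocol:

```rust
use sha2::{Digest, Sha256};

/// Placeholder for the IPCC channel to the service processor: ask for a
/// chunk of the phase-2 image identified by `hash`, starting at `offset`.
/// Returns how many bytes were written into `buf` (0 means end of image).
trait Ipcc {
    fn fetch_phase2(&mut self, hash: &[u8; 32], offset: u64, buf: &mut [u8]) -> usize;
}

/// Fetch the whole phase-2 image block by block, then verify it against the
/// hash that came with the first chunk of the RAM disk. The block size and
/// overall shape are illustrative, not the real protocol.
fn fetch_phase2_image<C: Ipcc>(
    ipcc: &mut C,
    expected_hash: &[u8; 32],
) -> Result<Vec<u8>, String> {
    const BLOCK: usize = 1024;
    let mut image = Vec::new();
    let mut buf = [0u8; BLOCK];
    let mut offset = 0u64;

    loop {
        // "Give me the block starting at `offset`"; the SP relays this to
        // the management gateway service, which answers from a real image.
        let n = ipcc.fetch_phase2(expected_hash, offset, &mut buf);
        if n == 0 {
            break;
        }
        image.extend_from_slice(&buf[..n]);
        offset += n as u64;
    }

    // Verify the assembled image before handing it to the rest of boot.
    let digest = Sha256::digest(&image);
    if digest.as_slice() == expected_hash.as_slice() {
        Ok(image)
    } else {
        Err("phase-2 image hash mismatch".to_string())
    }
}
```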

Speaker 5:

So we can do a complete OS boot off of just the management network. And it's not fast. Right? It runs at at like the line rate

Speaker 2:

Yeah. What?

Speaker 5:

What did you say, Adam?

Speaker 2:

It's running at UART speed.

Speaker 5:

Yeah. That's right. The line rate in theory is 3 megabit, which translates to something around 300 kilobytes a second. And in practice, we're getting, like, half of that, which has not been a priority to improve. Like, to stream in a 200 megabyte recovery image today takes 20 minutes.

Speaker 5:

Maybe we could trim that down to 10 with some work on the line. Maybe we could trim that down to 5 by making the image smaller, but this ought to be a relatively rare operation. It's not the end of the world if we have to wait 15 or 20 minutes for it to happen. And what we can stream in is some relatively small image that knows how to bootstrap itself, bring itself up on the main network, and then go fetch real software payloads that are, you know, presumably hundreds of megabytes or gigabytes, over the main network, which does not have this UART line rate speed limit in it. And then we can do a sort of full in place update of a gimlet that, like I said, has completely lost its mind. We wanna completely refresh it.
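The back-of-the-envelope arithmetic behind those numbers, assuming the usual 10 bits on the wire per byte of payload:

```latex
\frac{3\ \mathrm{Mbit/s}}{10\ \mathrm{bits/byte}} \approx 300\ \mathrm{kB/s}\ \text{(theoretical)},
\qquad
\frac{200\ \mathrm{MB}}{\sim 150\ \mathrm{kB/s}\ \text{(observed)}} \approx 1330\ \mathrm{s} \approx 22\ \mathrm{min}.
```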

Speaker 5:

We can do that, all via the management network.

Speaker 1:

Well, and that's what's so cool about this. Because we're also gonna be using this and one of the problems we had, just to rewind a little bit, is how do we actually initially program the M.2s? And this is one of those I can't remember how much you were involved in these conversations it's like, how hard can it be? Come on. How hard can it be to program the M.2s? And it's like, no, it's actually a giant pain in the ass. And we were doing all sorts of different things and had all sorts of crazy schemes to try, just, like can you buy just, like, an M.2 programmer?

Speaker 1:

And the answer is, like, not really. I mean, yes, go Google them, and you'll find various things that we tried, and they're not very good, and there are all sorts of other problems. And it's like, boy, if we could somehow use this reset recovery path, this, like, dire path where the sled has been totally wiped and we wanna recover from nothing.

Speaker 1:

If we could somehow use that path to actually build ourselves up into a full image, then we could actually use that for kind of our install in the factory. And, yeah, the people in the chat are like, you can't just image the M.2s before installing them? And it's like, yeah, trust me, we've gone down that path. It's a pain in the ass. But we kinda came to the conclusion that if we could use the recovery path, it would solve a bunch of problems.

Speaker 1:

And, John, well, you've been able to do that now.

Speaker 4:

Not just that, it actually exercises the recovery path at a very regular cadence. So you know that that path always works, because our manufacturing process relies on it, which means that we test it. It's a fully functional piece of the control plane.

Speaker 2:

That's right. So the last resort actually will work when it's time for the last resort.

Speaker 4:

Exactly. You've tested that it will work because you've installed the machine in the 1st place using that method. So I'm a big fan of that those types of of of methods to use, like, train like you fight, fight like you train sort of thing. Like, always use the stuff that you actually intend to use in the field, so that you know it works by the time you need it.

Speaker 5:

Yeah. We do have a little bit of a chicken and egg, which is, like, to get to the management network, we need a gimlet to, like, program the switch and management network to bring everything out.

Speaker 4:

You need one of them to work. It's at a there yes. Yes.

Speaker 5:

That's right. But only 1. Like, if something catastrophic happened and every gimlet, like, gets knocked out, you have to pull one out, fix it, and then you can plug it in to the switch and use it to reprogram the rest.

Speaker 1:

Which means we can actually reprogram the entire rack remarkably quickly, because a lot of this can happen in parallel. I mean, this takes a long time, because you've got a superhighway connected to a cocktail straw when you actually wanna get data over this it ends up being, like, 160 k per second or whatever. But you can actually do all of that in parallel. So it ultimately ends up being, like, not that bad. And again, this is this kind of very dire scenario where you actually need to reimage the whole thing, which is pretty cool.

Speaker 5:

Yeah. That's right. The the limit this this 160 k limit is per gimlet, and you can just run all of them at the same time. That's fine.

Speaker 1:

Yeah. And you kinda think about it like, well, boy, you know, you are gonna push all these images via the technician port. It's like, yeah, the technician port is like the superhighway compared to this. Like, the technician port is, like, super fast. It

Speaker 1:

It

Speaker 5:

can be great. Yeah. I I was doing the demo on Friday, and I was like, oh, yeah. We record the times of all these steps. Let me go let's see.

Speaker 5:

So it took, you know, 22 minutes to wait for the the recovery image to be streamed in. Let's see how long it took the it took that recovery image to go download the real software payload, which was something like 4 times as large over the real network. Wait. Is this right? It says 2.1 seconds.

Speaker 5:

Is that right? Oh, yeah. That's that's probably right. Yeah. So it's nothing in comparison.

Speaker 1:

Right. It's it is really, really quick. And and, John, that has been I mean, obviously, that there was a lot of pieces. I mean, I feel like

Speaker 4:

this is

Speaker 1:

true for the technician port too. When, you know, Ariane and Matt were getting just getting packets out the technician port it is also hugely complicated, and you need every single one of those pieces to work, or the whole thing doesn't work. I mean, you actually need all of these links to be working there was a lot that was involved in getting all of that to work.

Speaker 5:

Yeah. Absolutely. It That

Speaker 4:

path works through, like, on another Rube Goldberg machine, which is which I say a little facetious in the sense that we've really maximized the use of various components to combine and get stuff to work in a cohesive manner, but some of this stuff definitely jumps through many hoops before it works. But once once we have it working, it works reliable, it turns out. So it's fine.

Speaker 5:

Yeah. I think every every project I've done at Oxide has been putting together the work of other people. And in this case, it's putting together the work of, I don't know, maybe 3 quarters of the company have touched, like, critical paths of this update process.

Speaker 2:

I mean, notice how many of our episodes we've been referencing because really it sits the intersection of all of the hardware

Speaker 4:

and all of the software.

Speaker 5:

So I was literally about to reference another one, from a couple of weeks ago, maybe. I think Andrew was on here talking about wicket, the TUI.

Speaker 1:

Oh, yeah. Yeah. Yeah.

Speaker 5:

Yeah. So this recovery process, if you're a customer, you would execute it via this terminal UI that Andrew was on here talking about a few weeks ago. And literally Friday morning, as I'm trying to get to the demo, right, all of our demo driven development, we have a bug in the UI where it's, like, not refreshing the progress reports correctly. And as I just said, it takes, like, 30 minutes to run through one of these updates. A 30 minute lap to debug a UI problem is atrocious, but Andrew had gone and built this debugger where you can just record a session of wicket and then play it back at, you know, a 100x speed.

Speaker 5:

So I ran the recovery process once, recorded all of the events. Like the the update completed successfully, it was just a UI bug I'm trying to track down. And I can replay the session which ended up being, I don't know, a couple of 100 megabytes of serialized events at a 100 x speed and walk through, you know, fix this UI problem, I can take I took, I don't know, 25 laps and instead of 30 minutes each, it was 30 seconds each. It was it was amazing.

Speaker 1:

It was amazing. And I think one of the things I loved about that, John, is that, just like you were saying that I was the first person to use the console proxy and really use it to solve a problem, and how gratifying that was, I feel like for Andrew, you were the first person. I mean, Andrew used it, and it's, like, it seems useful, so I built it. But the fact that you had actually used it and it was a huge difference maker for you, I think, was really uplifting. And it's like, alright.

Speaker 1:

Great. Yeah. This is this is really useful stuff.

Speaker 5:

Yeah. I think he said because I had to get up really early Friday morning to take a kid to, anyway, he got up later and he saw this, like, debug screed of me with all these screenshots and everything. He's like, wait, have you been using the wicket debugger? It's 10 AM and you just made my day that somebody's actually using this thing for real problems. So, yeah, it was fantastic.

Speaker 1:

Well, it really does actually. There is something I I don't know. Adam, I mean, obviously, you and I have both had the privilege of feeling this feeling a lot. And it is like that feeling of an engineer up here telling you, hey. I used your like, the thing that you built just saved me a ton of time.

Speaker 1:

Or or I don't know how I would have done this without this tool. And, Matt, obviously, we saw that with you and all the stuff that the tooling that you have built, John, the tooling that you have built. I mean, it it it it's really gratifying to hear that stuff, actually. I mean, it's there's something unique about it when a fellow technologist is using your tooling, and it's it it it has bailed them out.

Speaker 2:

Yeah. Especially because often building the tooling, you can think, should I really be doing this? Am I building this just for myself? And if I am I am I fighting the last war? So get getting that feedback of, oh, no, no, no.

Speaker 2:

Like there's we're fighting that same war over and over again. It it's terrific.

Speaker 1:

It's really good. Well, we've just gone to this management network over and over and over again. And, Matt, I'm not sure if there are any other particular war stories you want to elaborate on. But, I mean, this has been so essential for us in so many ways. And, Ariane, even though it was a huge amount of work, and the thing that no one else had done and why would you do it that way, it is the thing that has been really, really essential for us getting the system completely operational.

Speaker 4:

It was very gratifying, actually, that we puzzled this thing together out of the standards, which, in the case of 100BASE-FX, were a little bit loose or even non existent. And then it did work. It did work after we put the software work in. It worked, and that was kinda neat. So, yeah, that was pretty cool.

Speaker 3:

Yeah. Yeah. It's it's been very, very gratifying to see this work. And also to mostly not have to think about it at this point. You know, we we plug the system in, comes up, things go up, and we can we can use it for things.

Speaker 3:

So it's kind of faded into the background of just being a tool rather than being a project, which is always nice.

Speaker 1:

It is nice. And, Arion, it goes back to something, you know, you had said earlier, and, we kind of mentioned Ignition briefly in passing, but this is kind of like this this primordial presence and power control network. And, you know, our and one of the things that you had said about that that I think is true of a bunch of things at Oxide is, like, you know, we're gonna work on this. We're gonna get it right, and then we're gonna be able to use it for a long period time. Like, we're we're we're not gonna have to think about it anymore.

Speaker 1:

And, you know, it's it's always harder than we thought it was gonna be. You know, we always thought it was, like, how hard can it be? John, I love you. Like, hey. How hard can this be?

Speaker 1:

Sure. You know, look at this. How hard can this be? Matt, I feel like when you were looking at monorail monorail, I think, actually is, like, this actually feels pretty hard. That one does not feel easy when it's been handed to you. But, you know, it takes us longer than we might like, but, boy.

Speaker 1:

Once we're able to get it robust, it it becomes really part of the foundation.

Speaker 5:

Yeah. I forgot to even mention Ignition, but Ignition control is also exposed on the management network, and that has been critical in debugging the control plane agent task itself, which is the task on the service processor that we talk to over the management network. I think, Matt, you pointed out that it's certainly become the largest and most complicated task in our Hubris builds, and it has bumped into a couple of very odd and obscure kernel bugs over the last couple of months, which nothing else had been complicated enough to bump into. But with that 2

Speaker 1:

kernel bugs at once I feel like, you know, kernel bugs in Hubris are extremely unusual, because the kernel's pretty small.

Speaker 5:

Yeah. That's right. It it certainly hit 2 within, like, a week and a half span. I don't it may have been the exact same thing that triggered them both. I don't I don't remember.

Speaker 5:

Yeah. But ignition control was super critical in those cases because when when a service processor service processor gets knocked off the management network because of a kernel bug, for example, you well, I mean, you can't talk to it anymore. Right? This is the thing that we are you supposed to use to recover when can't talk to a machine. But because we have ignition control and that goes through the sidecar service processor, as long as it's still up, we can use that to hard power cycle the machine which reboots the service processor and and brings everything back online.

Speaker 1:

It's absolutely huge. Absolutely huge. And, yeah, for those in the chat, we've designed Hubris to obviously be very robust, and broadly, it has been. Right?

Speaker 1:

As someone's pointing out in the chat, a 100% of the kernel bugs we've found, maybe 2, have been hit by the control plane agent. Again, we just don't see them that frequently, because it's a pretty small surface area. But it's just, honestly, extraordinary work. And Ariane's gonna get going. Ariane, we're gonna let you return to your family.

Speaker 1:

I know it's late on the East Coast, where John and Matt have joined us from. But extraordinary work, everyone. This has been so much fun to see. Adam, I know you've been asked, like, we gotta do an episode on the management network, because there's so much

Speaker 2:

For the past 6 weeks, Brian, we've been talking about what, what we're going to do on any given Monday. And I've been like, how about the management network now? And how about the management network management network? So I'm, I'm so glad, to, to get this episode under our belt and, Ariane and John and and Matt for for joining and talking about this. I I I love this area of the system, and I think it's been, awesome summary of it.

Speaker 1:

And, essentially, everything we've talked about today is open source. So you can get to virtually all of it. Matt, all of your work for this is in either Hubris or Humility, or in an open source crate that is drawn on by one of those 2. Right?

Speaker 3:

Yep. Yeah. If you go to Hubris and look at the monorail task, which is in, like, task slash monorail, that's kind of the starting point for all the stuff we talked about today.

Speaker 1:

So, and then a bunch of stuff over on the Humility side, where you can see how we've actually used this stuff. So it's all out there. Check it all out. And, you know, terrific comments in the chat as well. Adam, I think one of the things you and I have loved about Oxide and Friends is, boy, the demographic that we attract are kinda nerds. So, a lot of great comments in the chat as well.

Speaker 1:

So, thank you, everyone. Thank you, especially Matt and Arianne and John. Really, really appreciate it.

Speaker 3:

Yeah. Thanks, Brian. Thanks, Adam.

Speaker 1:

You bet. And I'm just gonna tease next week a little bit. We're gonna have Samtec on next week. It's gonna be a lot of fun. So look for an invite there soon, and thanks again, everybody.
