Holistic Boot

Bryan and Adam talk about how Oxide boots its systems: no BIOS, no BMC. It is a modern server unlike any other server on the market, all of which carry 40+ years of the PC legacy.
Speaker 1:

Yeah. Sorry. This is a bit of a, the the this this is a

Speaker 2:

Little note Huddl.

Speaker 3:

Little note Huddl. Steve is like, oh, I must have missed the announcement. I'm like, well, not really. It happened, like, points ago. So, no, he did not miss the announcement.

Speaker 3:

I mean, not really. So did and, Adam, how much is did you

Speaker 2:

I am willfully ignorant. So my wife has been sick all weekend, so I've been off of Hacker News even more than usual. So the fact that your talk at OSFC was apparently the number one story. Congratulations. All weekend.

Speaker 2:

Totally escaped me.

Speaker 3:

Well, so yeah. I mean, it was kind of, so I think Dave's here, and he can try. So, Dave, the talk was released actually when I was at another conference. I was at Monktoberfest, in Maine. Right.

Speaker 3:

I do love those guys. I love

Speaker 2:

I really wanna get to that conference. I think it's a it's a big regret of mine that I've never prioritized that.

Speaker 3:

It is so great. And I obviously love Stephen and James and RedMonk, the whole RedMonk crew, and Rachel Stephens, and all of them are just great. So, anyway, that was a lot of fun. But I'd not had a chance; I saw the video had gone up, but hadn't had a chance to really do anything with it. And then Dave had this terrific tweet thread.

Speaker 3:

Well, he had a terrific tweet, actually, which was, really I mean, it's always nice when a fellow technologist really appreciates something that's been done. And I thought, Dave, thank you very much for your tweet. It was great, explaining why, why this is important. And then someone asked him a follow-up question like, I don't really understand why this is important. And then Dave really took it apart in I mean, it just, like, nailed it in this kind of whatever it was, 14 tweets, just explaining why this is important.

Speaker 3:

And Adam, I'm not sure. Like, how much of the holistic boot stuff have you been? I mean, the miracle of boot. I'm not sure. I mean, obviously, you've had plenty to do.

Speaker 2:

No. I've been at the opposite end of the universe. So, sort of catching some of the Hawking radiation that gets emitted from that black hole. But certainly understanding that, like, to say we've gone an uncharted path, I think, is putting it mildly. But I've definitely not waded into the details.

Speaker 3:

Yeah. And I think also, I feel that you suffered some of that pain. We suffered some of that pain together at Sun for sure. But, Josh, I feel like most of our BIOS pain actually happened when we were trying to run a cloud ourselves at Joyent. At least for me, I mean, it's like, I'd had BIOS pain in my life, but, god, it was so much worse from

Speaker 1:

Are you saying bios or bias?

Speaker 4:

Are we

Speaker 3:

gonna do this right now? Are we gonna do are we gonna do one of these cultural difference moments right now? Ice I mean, don't we okay. I I probably pronounce those things the same. So so when you hear me say unconscious bias, have you been assuming that I've been referring to a basic input and output system?

Speaker 1:

You know, a lot a lot more stuff makes sense. Right? Right? It's like, boy,

Speaker 3:

why are we having another conversation about, like, bias? I was like, yeah. We get that right now. So the BIOS. BIOS.

Speaker 2:

Like a biopic.

Speaker 3:

Like a biopic.

Speaker 1:

But Space derailed. Alright.

Speaker 3:

I the the yes. Space derailed. Exactly. But I feel like we suffered a lot of that pain when we were trying it's when we were trying to run a bunch of machines and really operate them that I feel we had a lot of

Speaker 5:

this pain.

Speaker 3:

A lot of that pain.

Speaker 1:

I think it's probably a tired adage, but, like, people are like, well, I had this computer and it worked fine. It's like, right. Well, get 1,000 of them. And then, like, 48 of them won't work for me. And then each, like, it won't be good.

Speaker 3:

It won't be good. And, Josh, you had a, because Adam here apparently has other things to do other than read Hacker News comments all weekend. But I know I'm among a kindred spirit. You had a great comment, like, deep in the Hacker News story.

Speaker 3:

Dave's tweet got picked up. So, Adam, just to educate you, it got picked up by Hacker News, and somewhat shockingly it was the number one story on Hacker News for, like, hours yesterday, a long time yesterday, which is always a mixed blessing, I feel. On the one hand, it's great to get a lot of attention on something. On the other hand, it's not so great to get a lot of attention on things sometimes. And, Josh, I think you accurately predicted that, like, well, the informed people have all weighed in, so now it's time for the uninformed to weigh in, and that's exactly what happened.

Speaker 1:

And just people with different needs. Right? Like, again, if you have one computer and its firmware is working well for you, then this is not your problem. Like

Speaker 3:

Yeah. And you and someone said more or less that. Like, I've got one computer. This is not my problem. And I moreover, I don't see how this could be a problem for anybody.

Speaker 3:

I had a particular example. I mean, I was very time limited in my talk, so I had one example, and someone would be like, is there any other example? So one in particular, just to flesh it out a little bit. Much to my, and, Adam, I'm not sure how much insight you had into this when we were going through this, but we saw all of these uncorrectable errors inside of Joyent, uncorrectable memory errors.

Speaker 3:

And these machines would just die on uncorrectable errors. And because of the experience that you and I had had together at Sun, it's like, okay. So let's what is our rate of correctable errors? It's like, oh, there are no correctable errors ever.

Speaker 2:

Yeah. Not yeah. I I remember this is just

Speaker 3:

Right. Anyway, but there's a whole, like, apparatus, the CMCI, for the operating system to have information about correctable errors. It's like, yeah, those are always 0. It's like, oh, why are they 0? Because it engages in what's called the firmware-first model, in which it is the BIOS that actually gets that information and doesn't pass it on to the guest operating system.
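
The reasoning here (uncorrectable errors showing up while the correctable count stays at zero forever) can be turned into a simple check. The following is only a sketch and was not described in the conversation: it assumes a Linux host that exposes EDAC counters under /sys/devices/system/edac/mc (the machines being discussed ran SmartOS, which reports memory errors differently), and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: Linux EDAC sysfs layout, one directory per memory controller
# (mc0, mc1, ...), each holding ce_count and ue_count text files.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_error_counts(root: Path = EDAC_ROOT) -> dict:
    """Sum correctable (ce) and uncorrectable (ue) counts across controllers."""
    totals = {"ce": 0, "ue": 0}
    for mc in sorted(root.glob("mc[0-9]*")):
        for key in ("ce", "ue"):
            counter = mc / f"{key}_count"
            if counter.is_file():
                totals[key] += int(counter.read_text().strip())
    return totals


def looks_firmware_filtered(totals: dict) -> bool:
    """Hypothetical heuristic: uncorrectable errors with no correctable
    errors ever recorded suggests something below the operating system is
    absorbing the correctable ones before the OS can count them."""
    return totals["ue"] > 0 and totals["ce"] == 0
```

A heuristic like this cannot prove firmware-first handling is in play; it only flags the pattern described above so that someone goes and asks why the correctable count never moves.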

Speaker 3:

So I cited that example, and, this is the host operating system, not a guest. I had cited that example in the talk and someone's like, well, do you have any other examples? And then, Josh, you rattled off a bunch of them. Do you wanna talk?

Speaker 1:

I mean, I'm just reading from my diary at this point. Right? It's like

Speaker 3:

so if you wanted to actually go into some of those because I do think it's, like, people don't realize how much is down there and how much can go wrong.

Speaker 1:

So I would say it's not just the BIOS. It's also all of the management infrastructure in these machines, I think. Like, the problem is not just the BIOS. It's also the BMC and all the other bits that get shoved in there to provide the value add, I guess, that makes it, like, not just an Intel reference design motherboard or whatever that you've bought. And those are the bits that usually don't work very well.

Speaker 1:

Yeah. But, like, only some of the time. And that that probably the biggest problem is that lots of these things work a bit. They just don't work always, and they often fall over the most when something else is falling over and you really needed them.

Speaker 2:

And and that's gotta tell you

Speaker 1:

what it was.

Speaker 2:

I mean, that's gotta resonate so strongly with people who have not had some of these more pathological problems. Because how many times have you logged into the BMC through its like goofball HTML5 web interface, you know, password admin, login admin, and, like, try to reboot it or try to connect to the console or try to do anything and have it fail in ways that are It also doesn't work. Success. Yes.

Speaker 1:

Yeah. The most irritating example that I have from the last, like, 6 to 12, so there's this Redfish thing that's supposed to supplant IPMI. And so, like, it sounds good in principle. It's like, well, it's REST. It's not very good REST, but it's REST.

Speaker 1:

And, like, in theory, you can now tell the computer to boot either from the network or from the disk through, like, a REST thing, which sounds good in principle. That's what I actually want to do in the lab a lot of the time, and it works, I estimate, about 60% of the time; it changes the boot settings in a way that works. But it's really unclear, because it sometimes then returns, like, that its intent is to boot from the network or the disk or whatever I told it, but it doesn't do it. Like, it reboots the computer, and the BIOS says, I'm getting my instructions, and then it boots from the wrong device. And you're like, well, what

Speaker 6:

but then

Speaker 2:

What happens?

Speaker 1:

But, like and then if I reboot the computer again, it does it again. And then, like but but then sometimes it's enough to, like, set the boot order a lot of times. And, like, one of those times that I set it will take, and sometimes it feels like every time that I set it, it it actually undoes any progress we made previously in setting it. So it's like not even idempotent. It's just it's madness and there's no way.
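
For readers who have not used it, the request being described looks roughly like the sketch below: a PATCH of the boot override on the system resource, a read-back, and a retry when the setting does not take. The property names come from the Redfish ComputerSystem schema; the system path, the retry policy, and the FlakyBmc stand-in are assumptions made up for illustration, and the transport is injected so the sketch runs without a real BMC.

```python
from typing import Callable

# Property names follow the Redfish ComputerSystem schema. The system path
# and the FlakyBmc class below are hypothetical, for illustration only.
SYSTEM_PATH = "/redfish/v1/Systems/1"


def boot_override(target: str) -> dict:
    """PATCH body requesting a one-time boot from `target` ("Pxe", "Hdd")."""
    return {"Boot": {"BootSourceOverrideEnabled": "Once",
                     "BootSourceOverrideTarget": target}}


def set_boot_target(patch: Callable[[str, dict], None],
                    get: Callable[[str], dict],
                    target: str, attempts: int = 5) -> int:
    """Apply the override, read it back, and retry until the BMC reports it.

    Returns the attempt number that stuck. A matching read-back only proves
    what the BMC *reports*; the machine may still boot from the wrong device.
    """
    for attempt in range(1, attempts + 1):
        patch(SYSTEM_PATH, boot_override(target))
        reported = get(SYSTEM_PATH).get("Boot", {})
        if reported.get("BootSourceOverrideTarget") == target:
            return attempt
    raise RuntimeError(f"override to {target!r} did not stick after {attempts} tries")


class FlakyBmc:
    """Stand-in BMC that silently drops the first `drops` PATCH requests."""

    def __init__(self, drops: int):
        self.drops = drops
        self.state = {"Boot": {"BootSourceOverrideTarget": "None"}}

    def patch(self, path: str, body: dict) -> None:
        if self.drops > 0:
            self.drops -= 1  # request "succeeds" but the setting is ignored
            return
        self.state["Boot"].update(body["Boot"])

    def get(self, path: str) -> dict:
        return self.state


bmc = FlakyBmc(drops=2)
print(set_boot_target(bmc.patch, bmc.get, "Pxe"))  # third attempt sticks
```

An idempotent interface would make the retry loop unnecessary: the same PATCH sent once or five times would leave the system in the same state. The loop exists only because that property cannot be assumed here.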

Speaker 1:

So in the end, the computer always boots from the network because I figured out how to, like, get into the settings menu and do that. And then we control whether it boots from the disk or the network by switching out iPXE files on another computer. And, like, I'm sure that this is a thing that operations groups have internalized as, well, that's how computers work. It's like, well, but they don't have to work like that.
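
The workaround described here (always network boot, then decide what really happens by swapping the script that gets served) might look something like the following. This is a guess at the shape of it, not the actual lab setup: the URLs are placeholders, the per-MAC file naming is a hypothetical convention, and the local-disk script relies on iPXE's `sanboot` handing off to BIOS drive 0x80, which may not behave the same way on every platform.

```python
from pathlib import Path

# Placeholder URLs: stand-ins for whatever the lab actually serves.
NETBOOT = """#!ipxe
kernel http://boot.example.com/kernel
initrd http://boot.example.com/initrd
boot
"""

# Asks iPXE to hand off to the first local disk (BIOS drive 0x80).
LOCALBOOT = """#!ipxe
sanboot --no-describe --drive 0x80
"""


def write_boot_script(serve_dir: Path, mac: str, from_network: bool) -> Path:
    """Write the per-machine script that the always-netbooting host fetches.

    The <mac>.ipxe naming is a hypothetical convention for this sketch.
    """
    name = mac.lower().replace(":", "-") + ".ipxe"
    path = serve_dir / name
    path.write_text(NETBOOT if from_network else LOCALBOOT)
    return path
```

The appeal of this arrangement is that the decision lives in a file on a machine the operator fully controls, rather than in a BMC setting that may or may not take.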

Speaker 2:

Josh, that is such a good point because I think that for so many and and for, you know, myself included, at least for a long time, this just felt like the natural order of things. Like, yes, it is frustrating, but it can be no other way for reasons that are inscrutable. And I think that that that's even true of, like, people who operate tons of servers outside of the context of, like, a hyperscaler. There's, like, sometimes things break and you have to do it again, and it's hard to predict how many times I'll need to reconfigure the the boot order settings until it will work.

Speaker 3:

And it's even worse than that because this is so sedimented that, just as you're describing it, operators think, like, well, this is the way it has to be done. People on other instruction set architectures are like, well, if we wanna, like, get people to use this thing, we need to do what x86 does. And you're like, no. No. No.

Speaker 3:

No. No. Please don't. Please, you do not need UEFI. You don't need Redfish.

Speaker 3:

You don't need all this madness. And watching ARM and RISC-V kind of repeat some of these same mistakes from x86 because they feel they have to, because those are the expectations. It's like, no. No. No.

Speaker 3:

No. No. Please.

Speaker 1:

I think I think there's this expectation that by having an ossified standard that everybody purports to comply to, that we won't have to keep redoing all this other stuff. And like

Speaker 3:

Yeah.

Speaker 1:

I think I think people are imagining the world that they would like based on the nouns that they've used in describing that stack. And and then sort of but, like, the the problem is that everything they've described is exceptionally complicated. And anytime that you say, well, I can't have control of that piece, then you can't fix it when it's inevitably broken. And also, like, these these standards are incredibly wide. Like, they like, they have a huge amount of surface area, and a lot of it is often poorly specified.

Speaker 1:

And so like it may be that the 3 BMCs that you buy are all standards compliant even though they all work differently. Right.

Speaker 3:

Well, and, yeah. I mean, Redfish in particular has, and one of my most embarrassing moments, and Robert Mustacchi, if you happen to catch this recording, I know I apologized to you before over this. I want to apologize again. One of my most embarrassing moments was when we were first hearing about Redfish. For whatever reason, god only knows what had happened to me that day.

Speaker 3:

It just sounded like this is what's gonna solve our problems. I was like, this Redfish thing sounds great. I just remember him being like, I don't think you understand what this thing is and what it's not. I'm like, but Dell says they're gonna do it. This is gonna allow us to manage the entire fleet of machines.

Speaker 3:

And then Oh,

Speaker 6:

so But the the problem is

Speaker 1:

the people that built the first thing are now building the second thing.

Speaker 3:

In So

Speaker 1:

it's gonna be broadly similar in quality.

Speaker 3:

And it's Adam, have you looked at Redfish?

Speaker 6:

Not at all.

Speaker 3:

Okay. I feel that you in fact, we should do a Twitter space on, like, bad web APIs because I feel like you've seen so many APIs. I would love to know where Redfish ranks.

Speaker 2:

Awesome. Okay. I'm I'm looking this now.

Speaker 3:

It is there is a bunch about it that is very, very weird and makes it, and I I think Tom is here. It's oh, good. I can't say okay. Good. Thank you, Tom.

Speaker 3:

Tom is requesting to speak just as I said that. So I'm like, wait a minute. Did I see Tom here? Because, Tom, I remember you and I talking about Redfish, and you went on a delightful rant about Redfish at OSFC 2019. I'm hoping that Tom's still got that rant in him.

Speaker 3:

Alright. Hopefully hopefully, Tom is is is warming up. But I Wait. Wait. I'm here right now.

Speaker 7:

Oh, good. You're here. Excellent. Fumbling fumbling through the UI. Yeah.

Speaker 7:

No. I I my rant about Redfish was from experience as well, but but, basically Redfish is just a modern syntax. And syntax never solves the problem. The way XML didn't solve the problem and SNMP didn't solve the problem. Then the problem is people don't stick to standards.

Speaker 3:

That is a really good point, that syntax doesn't solve the problem. Syntax is maybe part of the problem, but it is far from the complete problem. And also, anytime you've got, like, a field that's vendor specific, this kind of opaque field, the vendors will just use that for everything. Right. I think there's gonna be some, right?

Speaker 6:

Because, Tom, that's probably one of

Speaker 3:

the problems with Redfish. So everything is in, like, these these other these kinda OEM fields.

Speaker 2:

In which case, like, why bother having a standard at all?

Speaker 8:

Yeah. Every every

Speaker 2:

Just YOLO it.

Speaker 7:

Everything is a superset of a subset of the standard, right, which is no standard at all.

Speaker 9:

Standards in general are good, but this reminds me of Mothy Roscoe's thing in his OSDI '21 keynote where he says these interfaces have congealed over time. That's such a great word. And I can't think of a domain where that's more apt than the system firmware.

Speaker 3:

Yeah. It really has. Sorry. I'm eating my burrito.

Speaker 7:

Yeah. And it's it's Godwin's law too. There's been so many different organizations doing too much and each in their little piece.

Speaker 3:

I think you mean Conway's law, but

Speaker 7:

we I'm sorry. Conway's law.

Speaker 3:

It's like, wow. This is about to take a really wild turn. As it turns out, Redfish is from the alt right. Who knew this? But yeah.

Speaker 3:

No. You the total Conway's Law where you you can see, like, oh, I think I can see from the API structure. I think I can see the org chart from here. So and I embarrassingly, I thought I and I guess I've I myself fell into this trap of, like, syntax will solve all problems. Of course, no.

Speaker 3:

Of course, it won't solve all these problems. It does remind me of when I did a lot of work to go parse the ACPI tables for error injection as part of this problem. And I'm like, okay. We need to be able to inject errors in this thing. We have got no ability to parse these tables today.

Speaker 3:

Write all the software to parse the tables and then go on the Supermicro machine. I'm like, they're all empty. They're all empty. It's like, oh, man. It's like, oh, to be filled in by OEM.

Speaker 3:

You're like, oh, man. But then, Josh, did you see that on Hacker News? Somebody just, like, we were talking about the expectations and the sedimentation, and there was a line of, like, removing layers of abstraction seems like it could lead to incompatibility.

Speaker 3:

I'd much rather have proprietary blobs everywhere than incompatibility. It's like, well, how about proprietary blobs and incompatibility? Like, why choose? You can actually have both of these things. So when we, I mean, when we set out, we I mean, for and and, Josh, I don't know if you wanna elaborate on any of your other your your other, diary entries.

Speaker 3:

You had a I mean, it it was a good list, I feel. But it was, I I felt that

Speaker 1:

it was a bad list.

Speaker 3:

Fair enough. The opposite of a good list. Yeah. It's a lot of pain. But so when we set out, I mean, you said yourself that, like, a chunk of these is the BMC, so we wanna get rid of that with an SP.

Speaker 7:

Yeah.

Speaker 3:

We wanted to get out from underneath this Redfish problem, and part of the problem, Adam, is that the BMC is just, like, hanging out a web service on, not the Internet, hopefully, but often, I mean, like

Speaker 2:

But in some cases yeah. It's not but

Speaker 3:

in some cases, yes. And it's like, you really don't want your BMC having a web services stack. That's not a good idea. Like, you really want that thing to be much more finely controlled. That is a recipe for vulnerabilities.

Speaker 3:

That's a lot of surface area.

Speaker 2:

Yeah. Like, when you start scratching your head and thinking, like, what is this thing I'm typing admin admin into? Right? There's actually probably a bunch of software running that.

Speaker 3:

There's a bunch of software. And the problem, actually, there's an interesting kind of threshold, and it didn't really click for me until I saw, at OSFC, a really good talk on the structure of HP's BMC, and, like, what the actual ASIC looks like. And one of the problems that they actually have is that once you exceed the SRAM in a microcontroller, you are actually forced to get DRAM for your BMC. So this is not even the computer. Right?

Speaker 3:

This is the computer to manage the thing. And once you have a big enough footprint, you need DRAM, and so one of their problems is they can't find DDR3. Like, it's just, like, not for sale anymore. So they are having to put in DDR4. It's like, oh my god.

Speaker 3:

And then you it it

Speaker 2:

So how big of a computer do I need to manage this computer?

Speaker 3:

Right. It's like, I have to buy, like, an F1 car. Like, I just wanna go grocery shopping. It's like, yeah. Sorry.

Speaker 7:

They Well, pretty soon pretty soon, you'll need a small processor to help you boot your BMC. Oh,

Speaker 1:

I do.

Speaker 2:

A beta BMC.

Speaker 3:

Oh my god. That's coming. Exactly. Because this BMC is not gonna boot itself. But seriously, Tom, because, like, that thing now, especially with DDR4, one of the things we really appreciate, I appreciate much more viscerally than I have previously

Speaker 9:

in my life

Speaker 3:

is how long it takes to train

Speaker 4:

those DIMMs. Right. And so

Speaker 3:

so now, like, your BMC actually can't boot that fast because it needs to train its DRAM, which, by the way, it doesn't even need. It doesn't even want. And a big part of that problem is, like, well, your footprint got too big in this thing. Like, you wanted a web services stack, so that needed, you know, Linux. And now you have, you know, tens of megabytes, hundreds of megabytes of footprint, and now it's DDR, and it's like multiple organ failure.

Speaker 3:

You know? So we knew we were gonna get rid of the BMC. We have an SP. We only use the SRAM. We don't have any DRAM on that thing.

Speaker 3:

We don't have DDR on that thing. So that simplifies a lot. But then the other big piece that we knew we wanted to do, which is what I was talking about in this talk, is the actual elimination of that proprietary BIOS that executes before the bootloader. And that code is often unseen. And I think a big challenge that we have in the industry, and I'm not sure how clearly I called attention to it.

Speaker 3:

But there is an unfortunate codependency between the microprocessor vendors and the BIOS writers, the IBVs. And they end up kind of developing the stuff together in a very undocumented or unseen fashion. And it means that, like, no one else can actually do the platform enablement because it's not documented. And I don't think this is pernicious necessarily, but that is a big part of the problem here: the parts themselves are not documented.

Speaker 2:

It it seems like not not just not documented and, you know, not just potentially pernicious, but also fragile, like fragile in a way where once they get it working, it's sort of good enough. Very hard to understand the failure modes and and probably, you know, hard for them to document. I mean, obviously, we've seen hard for them to document because they don't necessarily understand how it works.

Speaker 3:

Well, at the very least, in terms of that lowest level of enablement, the sequencing is not completely elucidated, and it is clear from their own implementations; what we found a lot is units that'll be initialized multiple times, which clearly you shouldn't need to do. It's kinda in the name. Surely you only have to initialize it once. So, yeah, I think there is some of that, of, like, they don't know exactly what's required and what's not. And then once it's working, you don't want to actually dork with any of it, which is the other big challenge. Yeah.

Speaker 3:

Dave, go ahead. Yo. You're you're here.

Speaker 6:

Hey. Yeah. I just wanted to add to that. The other big deal is lack of diagnostic information. So, to your point about stories of weird firmware problems, one that I had on a single machine, not even at scale, was the management engine was just broken on an Intel board, and it just brought up a single core locked at 800 megahertz.

Speaker 6:

And the rest of it and the rest of the firmware just didn't do anything about it. You know, it it it actually, maddeningly, it constructed you know,

Speaker 4:

I I don't

Speaker 6:

know if it's ACPI or anything else, but it constructed all of the state it handed to the operating system. Like, you are the owner of one 800 megahertz

Speaker 2:

Xeon core. So have a nice day. All the way up to say, here, you know, here. This is what

Speaker 1:

I have for you?

Speaker 6:

Yeah. So this is, like, a a machine I a machine I bought and assembled and booted. It's like, wow. This is this is booting really, really slowly. No.

Speaker 6:

It's it's it's like a 4 you know, 3 3 gigahertz part

Speaker 2:

with It's a Pentium.

Speaker 6:

Yeah. Yeah. And eventually found you know, I went and dug through the terrible BIOS, APIs, not APIs, UI, and eventually found, you know, hidden in a sub sub sub menu, oh, the management engine is in recovery mode. Well, that sure sounds bad. And, eventually, once I had that, found you know, went online and found that the board vendor had a forum post somewhere and said, oh, yeah.

Speaker 6:

That happens sometimes. Put the management engine back into manufacturing mode, reflash all the firmware, and it should be fine. And it's

Speaker 2:

like That's confidence inspiring.

Speaker 6:

Yeah. It's it's it's this this is this is your response to, you know, your firmware getting it getting it kinda wrong here. Like, couldn't you have told me that something was amiss?

Speaker 9:

Turn it off and turn it back on. I mean, that's what the entire industry does. Yeah. Does

Speaker 3:

Yeah. I mean, if you find a dead mouse in your soup, that happens from time to time, and, remove the mouse and continue eating the soup. It's like Yeah. I don't think Exactly. I've got a lot of questions about how the soup is made now.

Speaker 6:

Yeah. And just, you know, count my lucky stars. I had one server to recover and not 10,000 in the data center. That would have sucked.

Speaker 3:

Yeah. Wow.

Speaker 6:

And, you know, another one where I I can't share too much of the detail, but, due to an assembly error, the CPU came up with, like, half of its half of one of its buses missing. And so you just had a terrible performance degradation because the firmware somehow managed to muddle through, like, oh, like, 50% of my hardware is not functioning. I'll just initialize the rest of it and function at half speed. Okay. And don't tell anyone about it.
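
Both of these stories come down to the same missing check: nothing compared what the machine was supposed to have against what actually came up. A minimal sketch of that comparison follows. It is not anything the speakers describe having built; the CpuInventory type, the tolerance, and the four-core figure in the usage note are all hypothetical, and how the observed values get collected on a given platform is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuInventory:
    cores: int
    max_mhz: int


def inventory_alarms(expected: CpuInventory, observed: CpuInventory,
                     mhz_tolerance: float = 0.10) -> list:
    """Return human-readable alarms for hardware that came up degraded.

    `mhz_tolerance` is an assumed allowance for normal clock variation.
    """
    alarms = []
    if observed.cores < expected.cores:
        alarms.append(f"cores: expected {expected.cores}, saw {observed.cores}")
    if observed.max_mhz < expected.max_mhz * (1 - mhz_tolerance):
        alarms.append(
            f"clock: expected ~{expected.max_mhz} MHz, saw {observed.max_mhz} MHz")
    return alarms
```

For the first story, something like `inventory_alarms(CpuInventory(4, 3000), CpuInventory(1, 800))` would raise both alarms (the core count of four is invented for the example). The point is less the code than where it runs: the layer that knows the expected inventory has to be able to tell someone.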

Speaker 3:

Yeah. So, alright, these are really painful experiences, but I think that part of the challenge you have in both these cases is this lowest layer of firmware knows that things are very, very amiss. And it has no real way. It's like, you know, I wasn't handed a flare gun. Like, it has no real way.

Speaker 3:

It's like, I don't know. It has no way of indicating what's, I mean, like, it knows there's a raging inferno down there, but it's got no way of indicating it. So it is up to, you know, higher-level software that has got no way of querying that kind of state.

Speaker 3:

And I I do feel like this is a a really persistent theme that we see where the the the element in the system that knows that something is amiss has no way of communicating that to the element of the system that can actually do something about it. Well, this is part

Speaker 7:

of the consumer electronics heritage where you're

Speaker 4:

at a

Speaker 7:

great lengths to make sure you don't get a support systems,

Speaker 3:

because we've divided the world into those that generate the systems and those that generate the software, and in part to, like, deprive those who generate the hardware of any kind of margin. It's like, well, yeah, I don't have enough, like, staff to invest. It's like, yeah. I don't know. It goes into recovery mode sometimes.

Speaker 3:

I don't know. I don't know what to tell you. If I, you know, if I had more than 3% margin, maybe I could figure it out. But that is another part of the challenge here. Steve, you had your hand up.

Speaker 10:

Yeah. So first of all, I'm excited we're finally talking about this stuff because I think this whole thing is, like, one of the coolest things that we're doing, because it's an area where, before I worked at Oxide, I didn't know any of this stuff existed really, to be honest. And, like, finding out all this garbage that's in there and the fact that we're getting around it is something I'm, like, extremely excited about, and it's very cool. But, also, it's kinda that, like, slightly outside perspective. There's this quote by Adele Goldberg of, like, for a really long time, which is, in Smalltalk, or, like, in an object-oriented code base, everything happens somewhere else.

Speaker 10:

And, like, that's the way that I feel about these layers of firmware: we've built all these, like, layers of people writing these different firmwares where you only care that your thing works well enough to put the blame on some other vendor somewhere else in the stack. And that's, like, why we end up in these kinds of places where it's like, oh, yeah. That's a bug that happens, but just, like, you know, it doesn't matter. Just, like, reset it. Like, it'll be fine or whatever, because you just don't have control necessarily over what some other crap somewhere else in some other layer of this is, like, doing.

Speaker 10:

And so, I don't know. For me, like, in that Hacker News thread, someone's like, what's the business case for doing stuff like this? And it's like, responsibility. Like, yeah, our job is to make sure the computer works.

Speaker 10:

How are we supposed to make sure the computer works if there's, like, all this other garbage that's literally lying to you that's, like, written by other people, and then you can't inspect or read or know what's going on. And so Or

Speaker 3:

That's why I think it's important. That's totally important, Steve. And then you also don't even know who to call. So if you somehow manage to, like, get to the right person,

Speaker 4:

one

Speaker 3:

of the things that they, and I'd be curious to know if you heard this. We definitely heard this a lot, which is, you know, no one else is telling us about this. Like, we've shipped millions of these, and if it were as broken as you say, we'd be hearing it from everybody. Like, you're the only customer we've heard this from. And you kind of believe that for the first 1,000 times you hear it, and then you begin to realize, like, wait a minute.

Speaker 3:

No. No. No. Everybody else is is seeing this. They just don't know who to call.

Speaker 3:

They don't or they don't know how to express it. They don't actually realize that this is

Speaker 10:

a problem. And so But you get learned helplessness too. Like, I was talking to a friend who works at a big company on their, like, platform team, and he's like, oh, yeah. Like, if I report this bug upstream, they're gonna go, sure. That's a bug in some firmware.

Speaker 10:

We'll file an upstream bug and, like, who knows if that everything gets looked at a little and taken care of. So why would I file more of these bugs? Because, like, they're clearly just getting ignored.

Speaker 3:

Totally. And I got so frustrated with one of these, years ago. So in particular, do you remember the parity errors we were seeing on our PERCs, Josh? Back in the day, we were seeing these, and, Dell, this is on boot. The PERC, which is an HBA, would report a parity error, and then it would just stop.

Speaker 3:

Like, I'm not gonna I'm just like, I'm done. I'm I'm here. I'm I'm stopping the system because of this parity error, which I don't know. Maybe I I I I'm not sure.

Speaker 2:

Wait. Like, so it did, like, a read from a block and that was it? Parity error? Good night?

Speaker 3:

It would say parity error and then it would stop booting and you would need to reset the system. And then if you reset the system, it would frequently work. And Dell in particular, so we were running our illumos derivative, SmartOS, and they're like, this is a SmartOS bug. And we're like,

Speaker 2:

SmartOS is not running.

Speaker 3:

That sounds very strangely, like, exactly what I told them. I'm like, okay. If this is a SmartOS bug, if SmartOS is able to travel back in time and space and somehow distort your software that runs early in boot, like, that's another problem. But it's like, no. It has to be because we're not seeing it anywhere else.

Speaker 3:

No other customer is seeing this.

Speaker 1:

What about the machine that we've never installed the operating system on? What about that one? Right.

Speaker 2:

Yeah. But do you intend to install SmartOS?

Speaker 3:

That's right.

Speaker 2:

There it is.

Speaker 3:

That is exactly it. And finally, I was so frustrated by this that I was at a conference, you know, back in the day when there was a little more physical infrastructure, and I was kinda calling them out on this. And I'm like, hey. Show of hands.

Speaker 3:

Has anyone else seen this on the Dell PERC card? And you could see, like, a room of maybe 300 people, and there were, like, 10 hands that went up, which is a lot. And everyone is, like, looking at one another being, like, oh my god. There is someone else. Like, every one of those had been told you are the only one seeing this.

Speaker 3:

And, of course, we weren't the only ones seeing this. So I think, Steve, just to your point about, like, the value of this, and I did find some of those activist comments ridiculous. Like, I don't understand the business case for this. It's like, it's a firmware conference. It's not, that's alright.

Speaker 3:

This is not the open source business case conference. Anyway, I thought you made a very good point about, like, no. The point of this is you've got one system, and so you've got one entity that's going to bear responsibility for the whole thing end to end. If anything is broken.

Speaker 2:

You know, Bryan, it's the curse of vertical integration. Because if we had 2 companies, then we could stand there like the Spider-Man meme and point at each other like Dell does with Microsoft, and on and on in the industry. But the fact that it's vertically integrated, unfortunately, we've got one throat to choke.

Speaker 3:

Well, okay. So I actually do wonder about this. Like, if we see a higher proportion of this because, like, the fastest way to, like, to to get something blamed is just to be doing something that other people aren't doing. Like, oh, are you running OCaml in your stack? It's it's OCaml.

Speaker 3:

Like, are you like, what? Like, are you running oh, is it this operating system? Oh, that this is what's causing the I just need to can I go in your environment long enough to find something that is just, like, not Windows, and then I can blame that? And then all the vendors suck. I just need to find another vendor in here.

Speaker 3:

Can you please give me your vendor list? And then I know who to blame, Which is very frustrating.

Speaker 2:

Yeah. But I don't know. I feel like that's the story of the industry. Like, remember there was this era when, like, database vendors wouldn't support you when you were on a virtual machine? Because it was different.

Speaker 2:

Right? Because they could.

Speaker 3:

Because they could.

Speaker 2:

Yeah.

Speaker 3:

Right. Right. Exactly. There's nothing about a virtual machine that make it right. It's just so and it's it's like you really need to have a culture of, like, you know, you gotta own the whole problem.

Speaker 3:

Like, I don't care if it's because you're on a virtual machine, maybe. But, like, we need to own the whole problem. Sorry, Josh. Go ahead.

Speaker 4:

I was

Speaker 1:

just gonna say, what about Richmond 16? I'd forgotten about it until Patrick mentioned it a few minutes ago.

Speaker 3:

Yeah. So you should, so Richmond 16. This is what Adam, how many bug numbers do you know? Do you know, like you know what I mean? I know 0

Speaker 2:

I know 0 bug numbers. I I feel absolutely. When when you are dropping in, like, these iconic bug numbers from history, I just nod along. I'm finally telling you.

Speaker 3:

I the and I I don't know if you've lived a charmed life or just a healthier one. I mean, if

Speaker 2:

you Maybe just an innumerate one. I don't know.

Speaker 3:

Yeah. Right. Just like, no. I I followed like, I worked through it in talk therapy, and now I can't remember the bug ID anymore. It's like, wow.

Speaker 3:

That's great. No. Richmond 16 is a, is a bug. Josh, do you wanna describe the symptoms of Richmond 16? Because this really was, like this is a wake up call from

Speaker 1:

the airplane. The worst part about it was, the RAM disk that we booted our operating system from would be corrupted, just a few megabytes of it in different places. Sometime after control was handed to the operating system, like, a minute after boot was well and truly over, then you would go and run, like, ls. And I guess depending on how your image is constructed, your ls would not work.

Speaker 3:

Because a part of the RAM disk had been clobbered.

Speaker 1:

Yeah. By

Speaker 3:

Like, a part of your RAM had been clobbered.

Speaker 1:

And we never really nailed it down again because the the BIOS was, like, closed. But, like, when we turned off the UEFI network stack, not that we were using it, but when we switched that option off, I think it got better. But certainly, we, like, just refused to ever use those physical pages ever again.
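As a rough illustration of what "refusing to ever use those physical pages again" can look like, here is a minimal Python sketch that carves retired physical ranges out of a list of usable memory ranges. The function name, the tuple representation, and the addresses are all hypothetical; the conversation does not describe how the actual workaround was implemented.

```python
from typing import List, Tuple

# Half-open physical address range [start, end). Hypothetical representation.
Range = Tuple[int, int]


def exclude_retired(usable: List[Range], retired: List[Range]) -> List[Range]:
    """Return the usable ranges with every retired range removed.

    A usable range that overlaps a retired range is split so that only the
    parts outside the retired range survive.
    """
    out = list(usable)
    for r_start, r_end in retired:
        remaining = []
        for u_start, u_end in out:
            # No overlap: keep the usable range unchanged.
            if r_end <= u_start or r_start >= u_end:
                remaining.append((u_start, u_end))
                continue
            # Keep whatever lies below the retired range.
            if u_start < r_start:
                remaining.append((u_start, r_start))
            # Keep whatever lies above the retired range.
            if r_end < u_end:
                remaining.append((r_end, u_end))
        out = remaining
    return out


# Illustrative numbers only: 256 MiB of usable memory with a 4 MiB
# region that was observed being overwritten after boot.
usable_memory = [(0x0000_0000, 0x1000_0000)]
retired_memory = [(0x0200_0000, 0x0240_0000)]
safe_memory = exclude_retired(usable_memory, retired_memory)
```

The point of the sketch is only that the allocator never hands out the suspect pages; it does nothing to stop whatever device is still writing into them.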

Speaker 2:

Okay. Sounds sounds scientific. I mean, not not that you can be more scientific than that. But

Speaker 3:

Well, no. But he I mean, he was

Speaker 1:

Well, so we we were checking for, like, SMM activations, and we had, like, Keith had spent a lot of time on

Speaker 4:

this with, like, the, the

Speaker 1:

we had nailed it down to specific physical memory ranges, like a couple of specific physical megabytes, I think, were subsequently being overwritten with just trash. And, like, there were no SMM activations that we could see, assuming the CPU is being honest about that. And there were no watchpoints that were firing. So it's not like it was our fault. So, like, it must have been DMA from outside the CPU.

Speaker 1:

And so, like, we suspect that the network card had been configured in such a way that it was eventually, like, pooping some packets into the space or something and, like, it was doing that for some amount of transfers after boot had been handed off and because, like, the firmware is gone at this point. It's not running anymore.

Speaker 3:

So Well, it's not running in air quotes.

Speaker 8:

In any in any way that we get it

Speaker 2:

to be running.

Speaker 11:

Running in any way

Speaker 1:

we can tell.

Speaker 3:

But this is, like, one of these moments. And, yeah, I totally forgot what Richmond 16 was as well until you mentioned it. But this is one of these moments that was very revealing in terms of, like, oh my god. There is software that is running. Like, there shouldn't be anything else running on the system. Like, we are the operating system.

Speaker 3:

We control the CPU. No one else has access to the operating system's memory. It's like, well, it's mostly true. Not always

Speaker 9:

true. So I know we

Speaker 2:

have hands up, but I really want to dovetail to the SMM there if we could. I mean, just because I feel like this was such a revelation for me coming to Oxide. And I think folks in the space might not have heard it or, reveled in its majesty.

Speaker 3:

Yeah. So SMM is the system management mode, and this is a mode that the CPU can enter whenever it wants. Happens. And

Speaker 2:

But it's not like a virus. This is not malware, ostensibly.

Speaker 3:

I think that's really gonna pivot on your definition of malware. I mean, it definitely could get blurry, but, no, it's not malware. And it's just like, oh, god. Okay. Wait.

Speaker 3:

Why would you do that? It's like, well, there are various reasons that you can enter SMM, and there are various reasons why it may choose to enter SMM, why SMM can be entered effectively. And the thing that is really alarming is you've got no visibility into what that software is. The thing that I did not realize until, I mean, years after we knew SMM was a thing, and we were certainly putting pressure on Intel, which they were ignoring, to give us some visibility into when we were in SMM, and we had a suspicion that there was a lot of software in SMM. But you actually don't know what's there. And I knew that it was originally done for, like, laptop suspend and resume, which is the way that that was working without any OS support: SMM was actually doing it.

Speaker 3:

It was like, okay. Like, I can kind of understand that.

Speaker 2:

Back to Dan's point about this system being congealed. Right? Because you've got you're breathing in this magic functionality that your hardware can just do, you know, independent of any proper OS support.

Speaker 3:

That's right. And they got a little too strung out on that. And in particular, I mean, the one that I heard was when Ron Minnich was describing, I don't know. I can't remember. I heard him say this in 2017 or 2018, that there were mouse drivers in SMM.

Speaker 3:

It was because, when they were initially implementing USB mice, they wanted them to work with a PS/2 mouse driver. So there was a mouse driver sitting in SMM. You're just like, well, that's no. No. No.

Speaker 3:

No. No. That is 100% not a good idea. And so it's troubling. I mean, it's the kind of thing that's like, you know, when people would talk about what's in SMM, a part of me would be, like, that sounds kinda conspiratorial.

Speaker 3:

And then you realize what is in SMM, you're, like, as it turns out that was not conspiratorial enough. It turns out that actually, it is very, very troubling, the amount of software that's in there. So we leave SMM empty. So there is nothing in SMM. Unfortunately, you cannot completely disable it.

Speaker 3:

You have to deal with it one way or the other. I know we Matt and Ian have got their their hands up. Matt, do you wanna chime in?

Speaker 5:

Yeah. Sure. So, on the, on the parity error anecdote from earlier, am I correct in guessing that, another thing that, made that problem, annoying to say the least is that the error was only displayed on the VGA console.

Speaker 1:

Yes. You guessed

Speaker 3:

correctly. We we have a lucky winner. Yes.

Speaker 5:

So you would have to go in there with the VNC-equivalent client that was built into your BMC web interface and then see the error that way, I guess.

Speaker 3:

That's right. And I believe and, Josh, maybe you remember ex someone else remembers exactly. I believe you could press f one to continue. So this is where it's like, yeah, I'm just gonna hang out until you press f one. It's like, okay.

Speaker 3:

There's no cool PC, but there's

Speaker 2:

someone sitting in front of it.

Speaker 1:

That's right.

Speaker 5:

Another thing. I don't know if this is SMM or just the BIOS hanging around and butting in where maybe it shouldn't. But I remember in, like, '94, when I was a teenager, on Packard Bell computers

Speaker 2:

Nice.

Speaker 5:

When on Packard Bell PCs, when, running DOS or Windows 3.1, while the while the OS, such as it was, was running, you could hit alt s to go into the Packard Bell BIOS setup.

Speaker 3:

Oh my god. Well okay. Yes. So that, that would be interesting to know whether that was SMM, which definitely could have been. SMM has surprisingly old origins.

Speaker 3:

It's from the 386.

Speaker 1:

What so

Speaker 3:

it could have been SMM, actually.

Speaker 5:

Or it could have just been I mean, the BIOS was handling keyboard interrupts

Speaker 3:

and everything. Yes.

Speaker 5:

And and this this was a school computer, function. But a school computer. That is weird. Could and function.

Speaker 3:

But a

Speaker 2:

smoking computer, that is where

Speaker 5:

it could and couldn't run.

Speaker 3:

God, if you were like my kids now. Whenever I have a new piece of electronics arrive at the house, I have to be the one to unbox it because otherwise, they will find the parental controls, and they will lock me out of my own device. Do you have this happen, Adam?

Speaker 2:

No. I think, you know, I've already taken control of, like, the Wi-Fi router, which is, like, you gotta lock that down, parents, because that's the point of control of the whole house. If you can turn off everyone's internet, like, you can make your teenager show up anywhere.

Speaker 3:

Right. Exactly. But similarly, man, I'm surprised at your self-restraint in not being, like, hey, I can actually set some passwords on this, and I can actually ransom this BIOS back to the school.

Speaker 5:

Well, the I mean, I I I I kind of did the opposite thing the the year before when I was at another school where where they had Mac computers, and I discovered that I could turn off the, quote, unquote, security software that was installed on those computers by holding down shift while, while while the OS was booting to disable system extensions. But I was a good kid. I reported it to my

Speaker 3:

teacher. There you go. You reported it, got a CVE for it. That's pretty funny. And Packard Bell.

Speaker 3:

So I remember being a kid, being like, wow. Packard Bell. That's a funny name for a Hewlett Packard machine. There must have been litigation between Hewlett Packard and Packard Bell. Right?

Speaker 3:

How did it

Speaker 4:

I don't know.

Speaker 3:

Well, it's one of those questions I've always wondered about. Ian, you had your hand up.

Speaker 8:

Yeah. One aspect of this work is kinda about Oxide's kind of balance sheet. Right? It's about having happier customers, about reducing your support contact rate, about having fewer, like, head scratching support cases. But I am kinda curious about how you take this thing that is kind of invisible and make it more visible or more transparent to customers so they understand this aspect of the selling point of why you would buy an Oxide rack as opposed to, like, a Dell machine.

Speaker 8:

Like, I wonder whether you've thought about, you know, what is the hard drive stats, but for support contact rate for my machines doing weird shit?

Speaker 3:

Yeah. No. That's a great question. And I think that I mean, just to be clear, I don't know that we view this as a selling point per se. I think just to Steve's earlier point, this is enabling us to deliver a high quality experience, and it's the high quality experience that is the selling point.

Speaker 3:

So I think that this is a this is a necessary implementation detail of that.

Speaker 2:

But the negative consequences are something that resonates with most of the customers I think we've spoken with. Correct me if I'm wrong, Bryan. When you're like, have you ever been in a situation where you've got 2 vendors both gaslighting you, both telling you you're the only ones who it's ever happened to, you never get a satisfactory answer, they tell you to reset it, you know, it happens to 10% of your fleet. Does that ring a bell? Like, we see lots of nods in those situations. You see

Speaker 3:

lots of nods.

Speaker 2:

And, of course, we can't say, hey, that will never happen to us because, you know, we're smarter than those folks. But we can say, like, you know, we have taken these steps so that there are not these areas of mystery and uncertainty and lack of visibility.

Speaker 3:

That's right. And I think that the thing that, so I knew we would be able to deliver that kind of responsibility. I have to say, and then, Josh, we need to get your take on this. I also thought we would be taking enormous technical risk, clearly, because no one's really done this with an x86 machine before. And I also just assumed that we are taking a slower path by doing this, that this is gonna be a harder, slower path.

Speaker 3:

And one of the things that is kind of amazing, to your question about the value, Ian: I would now say unequivocally, we actually took a faster path, which I never would have thought. I mean, in part because of, you know, the enormous skill and expertise and background and resolve and grit and so on of the people involved. Robert Mustacchi, Keith Wesolowski, and the folks that were really, like, driving this were very determined. But also, having seen the way this platform enablement is done even for hyperscalers, it is not fast to rely on a proprietary BIOS vendor to support a new platform.

Speaker 1:

And I think we had a longer ramp up to the point where we're gonna be able to sell something than if we had just put Supermicro systems in a rack. But we will attain a much higher top speed once we get up to speed, I think. Because, I mean, recall, like, the decade or, in my case, 6 years that we spent at Joyent. Like, how many engineer months did we, like, pay in terms of my salary for me to, like, just fiddle with BIOS settings? I mean, like

Speaker 3:

I know. Well, true. But, Josh, I also think that, like, once we made the decision to chuck the BMC and to have a service processor, once you're on your own board design, you are effectively already on your own. From that moment, it was actually faster for us to build a holistic system than it was for us to engage with an IBV to go build their variant of AGESA for us.

Speaker 1:

Yeah. I think that's true. We also just don't need as much stuff. Like, it's not like we had to go and reimplement all of OpenBMC and all of EDK II and all of these other things. Because we actually just wanna jump into the OS as quickly as possible, and we don't need to be compatible with anything else, at least right now.

Speaker 1:

And like

Speaker 3:

We don't need to be compatible with anything else at this layer. One of the things I do love about Oxide is that we are often simultaneously ripping out a legacy abstraction at the bottom of the stack and then reimplementing it virtually. So I know that it's been fun to watch folks go deep into old abstractions and then reimplement them in Propolis, our machine model.

Speaker 10:

And so, like yeah. Sorry. Go ahead. It's difficult to incorporate debugging time or, like, after when something is working into, like, time estimates. We see this with Rust in general where people are like, stuff takes a lot longer in Rust.

Speaker 10:

And it's like, well, it initially takes longer, but then you spend less time after the fact debugging weird issues. And, like, sure, if you're using, say, like, Ruby, you may get it, like, quote, unquote done faster, but then you spend all this time later figuring out where bugs happen. And so I'm not saying these trade offs always make sense that way, but I think it's kind of a similar thing where we chose this path that we maybe thought would be slower, but it turns out it's faster in the end just because, like, how we perceive what is fast and slow is sometimes not, like, actually accurate. Those months of time you spent adjusting BIOS settings, Josh, weren't, like, considered in the, like, how long did it take to get this solution implemented? Because that's just, like, the way that we think about these kind of processes.

Speaker 10:

Alright. I found that to be true with software development in general.

Speaker 3:

No. I think you are totally right, and you are especially right in this domain where everyone's responsibility is just to get their own little thin layer working and hand it off to the next thing. And then when the whole thing goes sideways, it's like, like, don't worry. You're gonna get away with it. No one's gonna come back and blame you because they're it's such an opaque layer

Speaker 2:

of pain. Just as faster than we ever seen

Speaker 4:

it before.

Speaker 10:

It's faster to get working is, like, for a certain definition of working, which often doesn't actually mean working. It means, like, working enough to get sent to whoever wants to accept it.

Speaker 3:

That's right. That's right. And it so, no, I think you're exactly right. And I think that we and and you're right that that we the kind of the way we think of, like, when is the system done? And for us, like, the the time for that has been I think it's been I mean, for what we've been able to do, we've been able to do it in a in a remarkably small time, remarkably short time.

Speaker 3:

And then I think another question we've been asked is, like, well, what's gonna happen for, like, a next generation of the CPU? And a next generation CPU definitely is gonna require work, but they're incentivized to keep a lot of these excellent abstractions, and provided we get good documentation, we're confident we can actually do it relatively quickly. So, Ian, is this helping to answer your question?

Speaker 8:

Yeah. I think the piece that I'm kinda missing here is how you kinda beat that drum in a way that doesn't require customers to talk to other customers? Like, is it a, oh, I'm going to post a blog post about not just here's this extremely weird issue that took us weeks to debug, but maybe I do a blog post which is like, here's a support case that probably would have taken 4 weeks but instead took 4 days because we had x, y, and z built into the kinda core brain stem of this rack. Like, you know, I'm just thinking through, like, how you sell this benefit in a scalable fashion instead of through a, you know, one-to-one sales conversation or selling purely to people who are currently, you know, at the thirds of their current vendors.

Speaker 3:

Yeah. And I think that, I mean, it does become an implementation detail at some level, and ideally, it becomes one of these things that just allows us to build a better thing. You know, what is the demonstration that it is a better thing, that it's a higher quality thing? You need to get quantitative about it. But you can also just get very qualitative about it. Like, I mean, I think Apple has done this for years where it's like, no.

Speaker 3:

Like, this thing just works. And, you know, I am curious about the implementation details on why it works, but from an end user's perspective, what I appreciate is that it just works. Todd, you had your hand up.

Speaker 4:

Yeah. So I guess I so I don't know nearly enough about this. So this may be a dumb question, but I'm I'm a little curious. How much does your design constrain you in terms of different types of hardware that you wanna put, on the machine? So if you wanted a node with different GPUs, different CPU, or maybe a machine with a different network or, like, a larger network, what, what would you have to change about this setup or anything?

Speaker 3:

Yeah. Great question. And the answer to it is, I mean, like the Magic 8 Ball, it depends. So, I mean, you definitely are constrained, but I would argue that one is anyway. It is a big lift to go to, for example, Intel x86 versus AMD.

Speaker 3:

That's a big, big lift. But that's a big lift anyway. And with Intel, we've got the Management Engine, which would be something we'd have to contend with. There are a lot of reasons why we chose AMD over Intel for the CPU. But so that would be a big lift.

Speaker 3:

For this particular aspect of having the holistic system design and having the operating system do the lowest level platform enablement, it really is the host CPU that is the thing that is affected the most. Changing the NIC, for example, or a GPGPU, a PCIe device generally won't be that affected. I should put all that with caveats, but what this really is about is a tighter integration with the host CPU. But I would argue that, like, you've got a tighter integration with the host CPU anyway.

Speaker 3:

So this is more just allowing one to deliver a better thing with it.

Speaker 2:

Right. Is this accurate to say we'd have to do kind of a similar scale of work if we didn't go with this model, if we didn't go with this holistic boot approach, But a lot of it would be out of our hands and in the hands of of other vendors where we don't have transparency or or visibility into the operation

Speaker 4:

of these things.

Speaker 3:

Absolutely. And I mean, again, because we have chucked the BMC, we also have our own board design. So, I mean, we're not just, like, chucking an Intel CPU in there. We would have to do a de novo board design around Intel x86, and it would be a lot of work. But that's part of the reason, you know, when we conceived of Oxide, one of the things that we wanted to go do, and I think we heard Kate talk about this a couple weeks ago on the supply chain, is, like, we very deliberately wanted to have deep partnerships with folks.

Speaker 3:

We don't want we're less interested in the ability to kinda plug random stuff in here. We really wanna make very conscious decisions. Todd, does

Speaker 4:

that does

Speaker 3:

that answer the question?

Speaker 4:

Yeah. I think so. I'm mostly curious, like, you know, if this thing could scale to, like, a 240 rack HPC machine or something like that, because that's where I work. So I wonder if that's on your radar because that could simplify our boot process a lot.

Speaker 4:

I don't know if you know how long it takes to boot these things, but it's it's it's a pain in the ass.

Speaker 3:

No. Yeah. I I I yeah. But I would love to hear stories. Boot time stories.

Speaker 3:

Yeah. So tell me a bit, to the degree you can: are these mainly GPGPU-based systems, or what do they look like?

Speaker 4:

These days they are. So, like, our last system, which, I guess, we put on the floor in 2018, which is Sierra, it's a 17,000 GPU system. It's, like, 4,000 nodes, POWER9 plus Volta. It takes the vendor forever to get them booting quickly. Like, to the point where we actually have boot speed requirements in the SOW.

Speaker 4:

And I mean, it's still, like, hours. Right? Like, if you have to restart the whole machine, some nodes aren't gonna come up and you're gonna spend a long time getting that thing working. Like, when we had the government shutdown, I don't know if you remember that, but we were required to turn off all the machines, which is just a giant waste because, like, they're not gonna come up immediately and start running things

Speaker 5:

Right. When the

Speaker 4:

government shutdown ends.

Speaker 3:

Right.

Speaker 4:

So, you know, having a faster machine for something like that or faster boot time for something like that would be good. Not that government shutdowns happen all the time, but just in general, like, for managing, you know, thousands of nodes, this sounds super good. I'm just, you know, curious. And I guess the other facet of the question is, when we buy these things, I mean, we buy them, like, 5 years out and there's a lot of collaboration with the vendor on, like, non-recurring engineering. So that's, like, innovative stuff that they do to make sure the machine is deliverable.

Speaker 1:

Yeah.

Speaker 4:

And, you know, so that's kinda cool because the vendor gets to

Speaker 7:

Is that right?

Speaker 4:

They can do crazy things in that. They can do cool research, to make the machine work in 5 years. But, you know, a lot of it is, you know, what's on the vendor's road maps for things like GPUs because that's where most of our power comes from these days. And if you can't, you know, pick the best GPU for the job, then it's gonna be hard to compete on one of these procurements. Right?

Speaker 4:

Like, if you're tied to specific hardware. So that's kinda where the question is coming from.

Speaker 3:

Totally. And that's what we tried to, I mean, we felt like, in picking AMD Milan, we were picking the best CPU. We don't have a GPGPU offering at all, just to be clear. So with this, we have got CPU, networking, and storage only. And GPGPU is definitely something we're looking at.

Speaker 3:

Accelerated compute is something we're looking at very, very closely. I will tell you just bluntly, part of the reason that we haven't figured out what we're doing for accelerated compute yet is because we don't feel we can deliver Oxide value with NVIDIA. And we don't feel we can do that in part because NVIDIA doesn't really have an interest, in our experience, in partnering to develop a truly integrated system. I don't know, Todd. Maybe that's wrong.

Speaker 4:

Well, funny you should mention that because, like, I don't know if you saw that Oak Ridge's machine, which is part of the same procurement as ours, is a giant AMD system and so is ours. So NVIDIA didn't win any of the exascale procurements, for reasons. And, you know, that's part of the competition thing. Like, a lot of people were like, oh, you can't get that other GPU stack to work. Well, we're interested in it because it's competition.

Speaker 4:

Yes. Right? And so we're trying to create a market for other GPUs because of because of reasons.

Speaker 3:

Amen. And I also feel that, like, part of the reason that's really important to do, and I mean, that's great that you're doing that, part of the reason that we really wanna see that, whether it's a reformed NVIDIA or a different player like AMD, is that we really believe in open source software. It doesn't feel like a deep thought, but it's really important that when you've got someone putting in all this time and effort for your platform enablement, you want that to be something that doesn't have to be repeated. And making it open source is the way we have learned.

Speaker 3:

That's the way that happens. I mean, if it weren't for open source software, I mean, it's just, and, Adam, I don't know if you have the same moments where you think, like, oh my god. Open source software is, yes, it's a big deal, but it's an even bigger deal than that. I mean, where you look at all the stuff that we build right now that we rely on, even at Oxide, where we're pulling in all these open source components, it's just unfathomable to

Speaker 2:

think it's It is staggering. It's staggering to think of, like, doing this from nothing, like, the way that we did in the bad old days, to do it with literally every line of code being one that we had written.

Speaker 3:

And I think you see, and, Steve, kinda your earlier point of, like, sometimes it takes years to see these effects. I feel like, I mean, this is true for a lot of programming languages, but Rust is like a culmination of many different open source bodies of work. And it's building on so many different layers of open source. Some that are directly in the project, like LLVM, and others that were able to really directly inspire it, or they're able to pull in other components. So, Todd, our big belief is that, like, that platform enablement really has to be open source, and we can't do that realistically with NVIDIA.

Speaker 3:

And that's a big challenge. It'd be great. I would love NVIDIA to realize that it is in their interest to allow people to more readily build products around them, but that's not where they are at all.

Speaker 4:

So, just on that point, I mean, we have a whole paper at SC20 about vendor stacks versus open source stacks for HPC machines. Because there's sort of a historical perception that the vendors add a lot of value with their proprietary stacks. And so we, you know, we pitted an Open MPI plus vanilla Linux configuration versus what the vendor of our machine was providing. And the difference isn't that big. And also, on the MPI side, it's a toss-up what kind of network performance you get.

Speaker 4:

So it's like, is it worth making all this stuff work with custom vendor compilers and custom vendor implementations of everything and proprietary things? So, I mean, basically, we're thinking no. But on the other hand, at the moment, kind of because of the procurement we did where we tried to, you know, build a new market for different GPUs, there's only one integrator left in the HPC world who builds systems at our scale. So we're trying to figure out who could be new integrators there. And part of that is through open source.

Speaker 4:

We think we can... yeah, we can take on a lot of the responsibility for the stack, work with the vendors closely, and, you know, work with them to use more of our stuff, like our resource manager, our stuff, like, vanilla Linux for the operating system, and introduce more competition there. It's open space right now.

Speaker 3:

Yeah. That's very interesting. Well, for our part, I mean, we're obviously open sourcing everything that we're doing at Oxide, and it would be great if someone else is like, hey, you know what? If I make a system look like that, I could actually run their service processor.

Speaker 3:

I could run your... I mean, I feel like we would like to inspire other folks, and this is part of the reason it's all open: because we've benefited from that, and we want the industry to be able to benefit from that, especially in domains where you should be able to benefit from all these things. So, hopefully, we can collectively put some pressure on folks, because I just totally agree with you that open source has gotta be the path for this stuff. And, Drew, I saw you getting in here a while ago. I'm not sure if you had a question or a thought, or maybe not. And then, I know, Ian, we got to your question... or did Ian drop off?

Speaker 3:

Ian dropped off. So in terms of where we're going, I mean, it was exciting to be able to demo the booting. And, Josh, thank you very much, by the way, for arranging the lab machine so I could actually demo it on stage. That was a lot of fun. It's always nice when you demo booting and people actually appreciate it.

Speaker 3:

You know? It takes a very rarefied air to appreciate a boot demo. But it was actually really exciting to be able to do that. And, Josh, is that a Star Trek reference? Like, "this station now under computer control," is that a

Speaker 1:

Yeah. It's from The Search for Spock, Star Trek III, I think, where they nick the spaceship. They make off with the spaceship, and, of course, they don't make off with the crew. So there's this throwaway line where the chief engineer is like, don't worry, I've just rigged everything up to be automatic.

Speaker 1:

And it's like, why don't you do that for every spaceship? And secondly, I'm like, why is there a crew of more than four people

Speaker 3:

for any ship then? And

Speaker 1:

But regardless, on every screen in the background, you know, to make the set look good, is this little, like, Commodore 64-style Federation of Planets picture with "this station now under computer control" where there would normally be something, you know, blinking for an actor to sit in front of. Anyway, it looks like it does in the message of the day there. But,

Speaker 3:

is this a common reference or is this just a moment that really spoke to you in I

Speaker 1:

I think I may be the only person that remembers this. I'm sure I have looked for a screenshot of it and haven't found one that I didn't make. So, like, it just

Speaker 2:

This turned from a thank you into a public humiliation very quickly. I just

Speaker 1:

I I don't feel embarrassed at all.

Speaker 3:

Thank you, Josh. Thank you.

Speaker 1:

People should people should live their best lives.

Speaker 3:

Look. I saw Search for Spock in the theater, and I don't remember any of this. I mean, I was

Speaker 1:

Well, fair enough. I I I mean,

Speaker 3:

I mean, Search for Spock, I don't know that it was... but I'm glad that it was, because when I saw it, I'm like, this has gotta be a reference to something, and I Googled it. It does Google somewhat well, so I don't think you're the only one.

Speaker 1:

Really? So you you were able to find a reference to the film. Fair enough. There must be some OCR that's gone into, like, alt text generation or something. It's just things.

Speaker 3:

Alright. Well, this is gonna be good because Adam now knows what the image for this is. We're gonna have to go get The Search for Spock. This is even on streaming.

Speaker 2:

There it is. Yeah. Make my job easier, making the video tomorrow.

Speaker 5:

Exactly. I gotta listen to an audio described version of that movie and see if and and and, pay attention to whether the audio description narrator picks up on that on screen message.

Speaker 3:

Yeah. Exactly. Go flip report back. Alright. Well, that's, but it was fun to demonstrate that, Josh.

Speaker 3:

Thanks again for, the and, actually, Josh, it's also worth maybe do you wanna describe a little bit about how we have phased boot? Because I think this is actually pretty neat.

Speaker 1:

So the operating system eventually boots from a RAM disk. It's a ZFS file system that we stick into DRAM and then mount as a file system. But in order to get to that RAM disk image, which is pretty big, it needs to live on some PCI devices. And initializing stuff, like, we are not gonna get to the PCI thing until much later. We need the drivers that are on there to get to the PCI stuff. There's some chicken-and-egg problem stuff.

Speaker 1:

So we have a very small collection of kernel modules and files that live in the NAND... well, sorry, the NOR flash, the little SPI ROM where the BIOS would traditionally live, I guess, on a computer. And we are basically just putting a very small subset of files that we need to get started in that ROM. And the computer pretty much jumps straight to that through some small bootloader stuff that Dan wrote. Like, it's pretty much straight to the kernel.

Speaker 1:

And then there's enough drivers and file system stuff in that thing, which is a little bit like an initrd, I guess, like an initial RAM disk image thing. It's able to load those drivers and then copy in the ramdisk image from one of the big storage devices once we can get to it. And then boot really proceeds pretty much normally, the way that it would on any other, like, Unix system, you know. And it starts up a bunch of other things.

Speaker 3:

But there is a real technical challenge with this approach, which is that once you have the machine up, which is great, that was a big lurch forward. But having the CPU booted is not enough; you need all of the programs that you wanna be able to run. And the SPI NOR, it's a 32 meg payload. Right, Josh?

Speaker 3:

Yeah.

Speaker 1:

And it's gonna have some, like, random AMD PSP crap in there as well. So it's not really like you can use the whole 32 meg. And also there are some constraints on the layout, I think. So I think we could get at most maybe 26 or 28 meg in there. Right now, I think we're using about 9 because we're compressing it all.

Speaker 1:

So it's, yeah.

Speaker 3:

This is a challenge, and this is also architecturally capped. So it's not like we can put more SPI NOR down. I think it is capped at 32 megs total.

Speaker 1:

Yeah. You'd have to come up with some other magical storage device that you could access really early, without a bunch of setup, basically.

Speaker 9:

It's not architecturally capped to 32 megs. That's a supply chain issue. That's just the size of the parts that we could source. We could put larger parts on it, but at the time when we designed the boards, 16 megs looked like... yeah. But there's two 16 meg chips.

Speaker 9:

And so that's why that limitation exists.
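
As a rough illustration of the flash arithmetic in this exchange, the budget can be written down as a few lines of Rust. This is only a sketch under stated assumptions: the `FlashBudget` type and its field names are made up for the example, and the 6 MiB reserved for the PSP and layout constraints is simply 32 minus the lower "26 or 28 meg" estimate from the conversation, not a measured number.

```rust
/// One mebibyte, in bytes.
const MIB: usize = 1024 * 1024;

/// Hypothetical description of the boot flash. All numbers come from the
/// conversation, not from a datasheet.
struct FlashBudget {
    /// Number of SPI NOR parts on the board ("two 16 meg chips").
    chips: usize,
    /// Capacity of each part, in bytes.
    chip_size: usize,
    /// Space assumed lost to PSP firmware and layout constraints.
    reserved: usize,
}

impl FlashBudget {
    /// Raw capacity across all parts.
    fn total(&self) -> usize {
        self.chips * self.chip_size
    }

    /// Capacity left for the bootloader plus compressed kernel payload.
    fn usable(&self) -> usize {
        self.total().saturating_sub(self.reserved)
    }

    /// Whether a payload of the given size fits in the usable space.
    fn fits(&self, payload: usize) -> bool {
        payload <= self.usable()
    }
}

/// The board as described: 2 x 16 MiB, with an assumed 6 MiB reserved.
fn board_budget() -> FlashBudget {
    FlashBudget { chips: 2, chip_size: 16 * MIB, reserved: 6 * MIB }
}
```

With these assumed numbers, the roughly 9 meg compressed payload mentioned above fits with plenty of headroom, while anything approaching the full 32 meg would not.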

Speaker 3:

Oh, okay. You know, I thought that was AMD architectural. That's not an AMD limit. So they can have an arbitrarily large SPI device. It's just that we

Speaker 4:

that was

Speaker 9:

I mean, kind of. The thing is that that appears in the physical address space of the machine, and you can map that into the virtual address space. So if you wanted to stick more storage on there, you potentially could. But, again, it just becomes a supply chain issue, and that's why we're limited to the size that we're limited to.

Speaker 3:

But then the challenge you have is not very much storage space to actually have enough operating system to be able to go do something sophisticated, like hit a disk over ZFS and actually pull in the rest of the system. So, Josh, kudos to you, and, Dan, to you too, for kinda collectively pulling this off, where we've been able to get enough in there to be able to pull in the rest. And I think, Josh, you don't even execute to userland. Right? I mean, this has been all in-kernel.

Speaker 1:

That's correct. At the moment, all of the modules you would normally expect, like drivers and things, and some bits of subsystems, and the file system driver, and then the core, like, unix kernel file, are all kind of jammed into the ROM. And then I wrote a small module whose job it is to... because, like, by the time we actually need to mount the root file system, it's actually extremely late in boot. Like, we haven't made a process yet, but we have interrupts. We have, like, multiple CPUs.

Speaker 1:

We have the full device tree stuff, like, ready to go. Modules can attach as long as they're available in the ROM. So we pretty much just attach everything that we can and go and look for the disk device, for instance, or the Ethernet NIC, if that's where we're trying to source the ramdisk image from at the time. And almost all of the code is just existing subsystem code that we're just calling into to do, like, the usual kind of Ethernet traffic stuff or just reads and writes. So it's only like 1,000 lines, maybe, or 1,500 lines. And we've got all this hashing infrastructure in there already, so we can do, like, SHA-2 hashes and so on of the contents of the thing.

Speaker 1:

So, like, there's a lot of subsystem that's already available in that 9 megs.

Speaker 3:

Anyway and honestly, I I I don't know

Speaker 6:

if you think the

Speaker 3:

same way. Like, the kernel is actually a pretty good development environment. I mean, it's like Well,

Speaker 1:

I mean, the kernel is, like, I don't know. You can sleep at pretty much any time. Like, I mean, there's a lot of stuff going on in there that I feel like has been crafted over several decades, and that is now a pretty good programming environment, honestly.

Speaker 3:

And you have something that's debuggable, and it's instrumentable and so on. So you can actually figure out what's going on.

Speaker 1:

Yeah. And then It takes

Speaker 9:

a lot of flexibility once you get into the real kernel. I mean, you know, with, like, the very first machine instruction the x86 cores invoke, that's a really constrained environment.

Speaker 7:

Yeah.

Speaker 9:

And then you have to sort of bend over backwards a little bit to get to the point where you can just load the kernel. One of the things that probably very few people outside of Oxide realize is that we enter the kernel in full 64-bit mode with virtual memory enabled, which is very unusual. Almost no system does that. Everybody sort of gets the machine up into... well, you either start in 16-bit real mode or you get the machine up into 32-bit protected mode and you enter the kernel then. For us, we've basically done the initialization of the boot core and gotten that thing fully turned on.

Speaker 9:

And in many respects, we load the kernel image itself as just a standard ELF binary. There's kinda two places where we violate some of the assumptions that you could normally make if you were just invoking, like, ls or something. And that's in the way that we treat the GDT, which I'm trying to fix now, and some assumptions that the virtual memory subsystem makes about the layout of the address space after the kernel starts. But that's it. Otherwise, it's like, you load this binary, you jump to its entry point, and then the kernel goes off and does its thing.

Speaker 3:

And so, Dan, this is the piece that you've written. This is the Pico Host Boot Loader. And this thing is doing just that absolute minimum that we need to do to get us to run a 64-bit executable, right?

Speaker 9:

That's correct. Yeah. I mean, that basically starts in 16-bit real mode. It does the whole little sort of amoeba-evolving-into-a-dog dance.

Speaker 3:

It is. It so is. I have phrased this as, like, you replay the history of computing starting in the mid seventies with the 4004 or the 8080. And yeah.

Speaker 9:

I mean, you're starting out wearing a disco shirt with a really big collar and, you know, the Bee Gees are playing in the background. It's a

Speaker 4:

constrained environment when

Speaker 9:

you first begin.

Speaker 6:

But, yeah, you you

Speaker 9:

have to turn on protected mode and get into 32-bit mode, and then you have to, you know, load the page tables into the MMU and initialize that. And you have to turn on all the caches and make sure that memory protection actually works. x86 is a very strange architecture. There's a bit that you have to set in one of the architectural registers for page permissions to actually have meaning when you're in kernel mode. Otherwise, you can write to read-only memory and do all sorts of strange things.
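
The progression described here can be modeled as a small function over the control registers. This is a toy model rather than bootloader code: the bit positions (CR0.PE, CR0.WP, CR0.PG, CR4.PAE, EFER.LME) are the architectural ones, but the `Regs` struct and function names are invented for the example, the model ignores the code-segment L bit that distinguishes 64-bit from compatibility mode, and the "bit you have to set" is assumed to be CR0.WP, which the speaker does not name.

```rust
// Architectural control-register bits on x86-64.
const CR0_PE: u64 = 1 << 0; // protection enable: leaves real mode
const CR0_WP: u64 = 1 << 16; // write protect: supervisor honors read-only pages
const CR0_PG: u64 = 1 << 31; // paging enable
const CR4_PAE: u64 = 1 << 5; // physical address extension, required for long mode
const EFER_LME: u64 = 1 << 8; // long mode enable

/// Coarse CPU operating modes, in the order a boot path passes through them.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Real16,
    Protected32,
    Long64,
}

/// Hypothetical snapshot of the registers that matter for this model.
#[derive(Default, Clone, Copy)]
struct Regs {
    cr0: u64,
    cr4: u64,
    efer: u64,
}

/// Derive the coarse mode from the register state.
fn cpu_mode(r: Regs) -> Mode {
    if r.cr0 & CR0_PE == 0 {
        Mode::Real16
    } else if r.cr0 & CR0_PG != 0 && r.cr4 & CR4_PAE != 0 && r.efer & EFER_LME != 0 {
        Mode::Long64
    } else {
        Mode::Protected32
    }
}

/// True when a kernel-mode write to a read-only page would fault.
/// Without CR0.WP, supervisor code can write through read-only mappings.
fn supervisor_write_protect(r: Regs) -> bool {
    r.cr0 & CR0_PG != 0 && r.cr0 & CR0_WP != 0
}
```

Starting from `Regs::default()` and setting the bits in the order described (PE, then PAE and LME, then PG, then WP) walks the model from `Real16` through `Protected32` to `Long64` with write protection in force.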

Speaker 9:

But yeah. And so phbl does all of that and then creates an environment in which the kernel can basically begin execution assuming that it has been mapped and that it's fully resident in the virtual address space, etcetera. And then one of the kind of interesting parts of the contract between the bootloader and the kernel is that the kernel takes over ownership of the page tables that the bootloader has set up. And so the kernel will, like, walk those and say, uh-huh. Okay.

Speaker 9:

Here are all my pages. And here's obviously the root and the internal nodes in the page table radix tree. Let me take ownership of those. Everything else, including, oh, by the way, the bootloader that we just jumped out of, is fair game for recycling and becoming, you know, just allocatable memory.
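
The handoff described here, where the kernel walks the tables it inherits and claims whatever they reference, can be sketched as a walk over a toy four-level radix tree. This is an illustration under assumptions, not the actual kernel or bootloader code: "physical memory" is a `HashMap` from table address to entries, large pages are ignored, and every name (`Tables`, `walk`, `reclaimable`, `table_with`) is invented for the example.

```rust
use std::collections::{BTreeSet, HashMap};

const PRESENT: u64 = 1 << 0; // entry is valid
const ADDR_MASK: u64 = 0x000f_ffff_ffff_f000; // physical address bits 12..51

/// Toy physical memory: address of each page-table page -> its 512 entries.
type Tables = HashMap<u64, Vec<u64>>;

/// Build a 512-entry table with the given (index, physical address) entries present.
fn table_with(entries: &[(usize, u64)]) -> Vec<u64> {
    let mut t = vec![0u64; 512];
    for &(i, pa) in entries {
        t[i] = pa | PRESENT;
    }
    t
}

/// Recursively record the table at `table_pa` and everything it references.
/// `level` counts down from 4 (root) to 1 (entries point at data pages).
fn visit(
    tables: &Tables,
    table_pa: u64,
    level: u8,
    nodes: &mut BTreeSet<u64>,
    pages: &mut BTreeSet<u64>,
) {
    nodes.insert(table_pa);
    let Some(entries) = tables.get(&table_pa) else { return };
    for &e in entries {
        if e & PRESENT == 0 {
            continue;
        }
        let pa = e & ADDR_MASK;
        if level == 1 {
            pages.insert(pa);
        } else {
            visit(tables, pa, level - 1, nodes, pages);
        }
    }
}

/// Walk from the root; return (table pages, mapped data pages) to take ownership of.
fn walk(tables: &Tables, root: u64) -> (BTreeSet<u64>, BTreeSet<u64>) {
    let mut nodes = BTreeSet::new();
    let mut pages = BTreeSet::new();
    visit(tables, root, 4, &mut nodes, &mut pages);
    (nodes, pages)
}

/// Frames that are neither tables nor mapped pages are free to recycle.
fn reclaimable(all: &[u64], nodes: &BTreeSet<u64>, pages: &BTreeSet<u64>) -> Vec<u64> {
    all.iter()
        .copied()
        .filter(|f| !nodes.contains(f) && !pages.contains(f))
        .collect()
}
```

In this model, a frame that held the bootloader's own text would show up in `reclaimable` as long as nothing in the inherited tables still maps it, which is the "fair game for recycling" case described above.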

Speaker 3:

Yeah. Which is I mean, it's interesting that in terms of like because we totally control what that contract is between the bootloader and the operating system. It allows us to do things like that. Where it's like, no. No.

Speaker 3:

I've already set the page tables for you. Here they are. You don't need to actually go do any reinitialization.

Speaker 1:

That's

Speaker 3:

right. Yeah. That's what I need.

Speaker 1:

Critically, we build these things together, effectively. So, like, the phbl part is the first 100 or 150 kilobytes of program text in the image. It's almost like a header. And then the 9 meg of compressed crap that is the kernel... they're all one image in the end. They get built together.

Speaker 1:

There's no long-term stable interface between those components. We could change them all, like, tomorrow, and then we would rebuild both components at the same time. And, like, if we come up with a better way to do it... That's exactly right. Yeah. We don't have to deal with the foreverness of, like, will the BIOS and multiboot and stuff work this way, like, forever now?

Speaker 1:

Like

Speaker 9:

Yeah. Yeah. No. phbl actually includes basically a cpio archive that includes the kernel image, as a blob that's compiled into the phbl executable. And that's just a Rust program, and it uses, you know, the include_bytes macro to bring in this cpio archive, and that just becomes part of the compiled ELF image that we generate.

Speaker 9:

And then we have another program that'll turn that into something which is ingestible by the PSP.
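
To make the idea concrete, here is a minimal sketch of pulling a named file out of a cpio archive held in memory. It is not phbl's code: the "newc" (magic `070701`) cpio format is an assumption, the function names are invented, and the archive is built at runtime by a helper so that the example runs without an external file. In a real build the byte slice would instead come from something like `include_bytes!("kernel.cpio")`, with that path being hypothetical.

```rust
/// Size of a cpio "newc" header: 6-byte magic plus 13 fields of 8 hex digits.
const HDR_LEN: usize = 110;

/// Round up to the 4-byte alignment that newc uses for names and data.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Parse the `index`-th 8-digit hex field of a newc header.
fn hex_field(hdr: &[u8], index: usize) -> Option<usize> {
    let start = 6 + index * 8;
    let s = std::str::from_utf8(hdr.get(start..start + 8)?).ok()?;
    usize::from_str_radix(s, 16).ok()
}

/// Find `wanted` in a newc cpio archive and return its contents.
/// In a bootloader the archive would be a `&'static [u8]` from `include_bytes!`.
fn find_file<'a>(archive: &'a [u8], wanted: &str) -> Option<&'a [u8]> {
    let mut off = 0;
    loop {
        let hdr = archive.get(off..off + HDR_LEN)?;
        if &hdr[..6] != b"070701" {
            return None;
        }
        let filesize = hex_field(hdr, 6)?; // field 6: c_filesize
        let namesize = hex_field(hdr, 11)?; // field 11: c_namesize, includes the NUL
        let name_start = off + HDR_LEN;
        let name = archive.get(name_start..name_start + namesize.checked_sub(1)?)?;
        let data_start = align4(name_start + namesize);
        let data = archive.get(data_start..data_start + filesize)?;
        if name == b"TRAILER!!!" {
            return None; // end-of-archive marker
        }
        if name == wanted.as_bytes() {
            return Some(data);
        }
        off = align4(data_start + filesize);
    }
}

/// Test helper: encode one newc entry (all metadata fields other than sizes are zero).
fn entry(name: &str, data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"070701");
    for i in 0..13 {
        let v: usize = match i {
            6 => data.len(),
            11 => name.len() + 1,
            _ => 0,
        };
        out.extend_from_slice(format!("{:08X}", v).as_bytes());
    }
    out.extend_from_slice(name.as_bytes());
    out.push(0);
    while out.len() % 4 != 0 {
        out.push(0);
    }
    out.extend_from_slice(data);
    while out.len() % 4 != 0 {
        out.push(0);
    }
    out
}
```

Because the archive is just bytes in the executable's own image, a lookup like this needs no storage driver and no file system, which is the simplification being described.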

Speaker 3:

Steve, did you know this, by the way? I'm not sure that Steve knew this, that we are using include_bytes to actually include the image in phbl.

Speaker 10:

I did not, but it makes total sense now that you've said it.

Speaker 3:

I love include_bytes and include_str so much. I cannot express how much I love them. I love them so, so much. I think include_str was almost my first moment with Rust where I was like, Rust, I think I love you. Because that was super early on.

Speaker 3:

I was like, I'm gonna check this Rust thing out. Dan, you must have that same feeling. I just feel like it's so freaking valuable.

Speaker 9:

Well, you know, it's funny. I did that because it was easy for testing. And then we started sitting down and thinking about it, because our initial... so AMD has this thing called the embedded, or embeddable, file system, the EFS. And this is something which the BIOS sort of understands. Right?

Speaker 9:

And, you know, the BIOS goes in there and it finds the sort of reset vector image, which is really what all of this code is. Right? And the PSP will sort of load that. The PSP, like, it's kinda responsible for doing some of the early initialization on the machine. It does the DRAM training. It finds the reset vector in the EFS.

Speaker 9:

So basically, in the flash part, it loads it into DRAM and then sets things up so the x86 cores will start executing in that code when they come out of reset. And, you know, I was like, okay. I've written the bootloader, but now I need to test actually, like, you know, booting the machine and loading the kernel. And so I was like, include_bytes is a great way to do that. And then, you know, we had this kind of intention that it's like, well, but we really need to, like, you know, read this image out of EFS properly.

Speaker 9:

And then we sat down and we started talking about it. We said, wait a minute.

Speaker 3:

Wait a minute.

Speaker 9:

Yeah. We treat flash as this, you know, not to overuse the word, but as this holistic thing. And so, you know, the sort of parts of the AMD PSP goo that we ship, the bootloader, and the kernel image are all treated as a single unit with respect to what we ship on the individual machines. So there was no concept that we were gonna, like, update the kernel independently of the bootloader. And once we had that sort of epiphany, it was like, well, then why are we screwing around with trying to, you know, read this thing out of this thing? Just embed the kernel image and have this one tool that understands how to split things up cleverly so that they all wind up in RAM contiguously and stuff.

Speaker 9:

And then let's just do it that way. And it it ended up being a really nice simplification.

Speaker 3:

I think I want, like, an "include_bytes is my bootloader" t-shirt or something, Dan. I feel like that.

Speaker 10:

As a small side note on include_bytes, this feature finally actually landed in C23. It's still not in the C++ standard yet, but we'll see if they end up adding it pretty quickly now that it's in the C standard. But, apparently, the first time it was discussed in the context of C standardization was September 1995. So it has been a long time coming, and it has taken the blood, sweat, and tears of many people to get that landed. But, yeah, apparently, that's finally gonna be in C23, so, trivia.

Speaker 1:

There are, like, 500 different coping mechanisms for this. I mean, we have one called elfwrap, which produces an

Speaker 4:

a

Speaker 1:

.o file

Speaker 4:

Yeah.

Speaker 1:

with a binary thing jammed in it with, like, two symbols, one at the start and one at the end of where it's gonna get mapped. And so we use that to include, like, firmware and drivers and stuff for wireless cards and things.

Speaker 3:

Well, I just feel like and, Adam, you and I have a long history of, like, dorking with binaries after they've been generated, and it's just so nice to have something built into the language that allows you to be like, no. No. Actually, I want all of this data from the file system, sort of put it right now as part of the compile step. It's so nice.

Speaker 2:

Yeah. And it's beautiful as a macro too because you can kind of surround it with, like, the link section that you want it to land in and

Speaker 3:

Totally.

Speaker 2:

Yeah. Yeah. It it's it's it's nicely integrated. It's very elegant.

Speaker 3:

It's great. So I wanna get, Adam, I know you gotta get to, to dinner with the toddler. Do you guys have some other hands up that we wanna get to to maybe quickly? So, Simeon, do you wanna go to to you and then and then to anno 770 there?

Speaker 11:

Yeah. Thank you. It's been very interesting. I just wanna understand phbl, the bootloader. Does it need to do things like speak to the SPI flash?

Speaker 11:

Does it have drivers and that kind of thing? Or was that handled by the PSP? And, you know, on either side of this endeavor, how much help is it getting from whatever AMD is doing? And I guess the other question is, is it useful outside of the context of what you guys are doing? Could it be a bootloader that you could use on another system?

Speaker 9:

I mean, it could be. Bluntly, you wouldn't want it to be. It is really special purpose. Like, we call it the Pico Host Boot Loader because it is purposely designed to be small and to do the bare minimum thing. And outside of the context for which it was written, almost nobody is gonna want that. So, I mean, you could use it to boot on another machine.

Speaker 9:

You could certainly look at the code, which I you know, it it occurs to me that repository right now is probably private,

Speaker 4:

but there's

Speaker 3:

It is. Yeah. As I'm looking at it, I'm like, I'm tempted to open it right now. It is private, but I'm just looking

Speaker 9:

at it as well. Go ahead. There's nothing in there that's, you know, proprietary or or anything, I don't think. It's all pretty pedestrian code in some sense. You know, so somebody could look at that and be like, Here's an example of how you boot a machine from the reset vector, up into 64 bit mode, which may or may not be useful for somebody.

Speaker 9:

But, you know, yeah. If you were trying to, like, run this on a commodity machine, I think people would be pretty sad pretty quick. "I wanna load this from a file from, you know, like, a disk or an SSD." It's like, yeah. Sorry.

Speaker 9:

I can't do that.

Speaker 3:

Alright. We are gonna see if I can deal with GitHub's verification code while in the Twitter spaces.

Speaker 9:

I I can do it if you want me to. I mean, I'm right here. Okay.

Speaker 3:

Yeah. I've got, that's

Speaker 1:

that's not that's not the correct procedure. So

Speaker 6:

That's true.

Speaker 4:

Yeah. We just we just had a security training about this.

Speaker 11:

Like, like, you know, just issues and and and PRs on Twitter space. This is gonna be great.

Speaker 9:

I mean, I'm not gonna get fired if I click this button. Right, Bryan? Right? Bryan?

Speaker 3:

I I was sorry. The of course not. No. But no. Why?

Speaker 3:

No. Of course. But could you just do it for me? I I'm coming into the office right now, but if you wouldn't mind, also, please send me some gift cards. Okay.

Speaker 3:

Bye. No. Actually, amusingly, GitHub has popped up its notification saying you have a verification code, and then I bring it up, and it says, no verification codes found. So yeah.

Speaker 3:

Go ahead, Dan. If I

Speaker 9:

Alright. It says it's done.

Speaker 3:

Tada. There it goes. Awesome. So yeah. So, phbl, at oxidecomputer.

Speaker 3:

The Internet can now pour in. In fact, I insist that everyone star this repo right now because that's how

Speaker 4:

that's

Speaker 2:

Smash that like button.

Speaker 3:

Smash that like button. Yeah. Dan's okay.

Speaker 9:

Subscribe. Send me those PRs, and and if you find bugs, please let me know.

Speaker 10:

You don't have license headers on these source files in accordance with our RFD?

Speaker 3:

Oh, yeah.

Speaker 1:

Is that true? Yes.

Speaker 10:

It is true. I I do I looked for trolling. I will send you a PR, Dan. Don't worry about it.

Speaker 3:

No. Exactly. There we go. We we have our first PR. Very good.

Speaker 1:

Guess we need to take the people out of the loop.

Speaker 3:

This repo now under computer control. So, Simeon, to your question about drivers, and you'll see it when you get there, but part of the beauty of the way that Dan has done this is that it is pulled in as part of the image. So it's already been pulled in for you, effectively.

Speaker 11:

Right. But that's the kernel that you're booting that needs to speak to, you know, disk, NIC, that kind of thing. By the time you're running this bootloader, something else has already spoken to SPI flash. Right?

Speaker 6:

The PSP

Speaker 9:

has... or the PSP is an embedded ARM... oh, it's not... I mean, it's an application profile core, but that's an ARM complex that's resident inside of the chip, and that thing runs. That's kind of, in some sense, the next frontier. You know, for us, I think I'm okay saying that we would love to have access to programming documentation to be able to replace the proprietary firmware that's on there. That thing does boot a blob, but we haven't been able to get any usable documentation on it outside of AMD. But if we could... I mean, I don't mean to speak for Bryan, but I think Bryan would agree.

Speaker 3:

Oh, Bryan would agree. Bryan would agree strongly.

Speaker 9:

Yes. Yeah. We would love to replace the proprietary firmware on there. We just don't have access to the documentation to do so.

Speaker 3:

So the first non-PSP instruction that executes is phbl. That is correct. So, Simeon, does that answer that question?

Speaker 11:

Yes. Thank you. By the way, your amd-host-image-builder tool is also, I think, private.

Speaker 3:

That one, we can't open source on this call. We will open source that very soon. That one is a little more complicated. And then anno 770.

Speaker 1:

At the risk of nerd sniping very busy people, I have two simple yes/no questions. With AMD getting their L3 slowly towards 1 gigabyte, have you considered running a server without RAM, just using L3, which means you will have guaranteed cache hits? Have you looked into that? That's the one question. And if you have looked into that, is that possible?

Speaker 3:

So we have not... and when you say run out of the L3, are you talking about running out of the L3 to boot, the way you kind of do historically for memory training? Or are you talking about running out of the L3 in a more kind of persistent fashion?

Speaker 1:

Just run on the L3. There is no DDR anywhere in the system.

Speaker 3:

Right. So, presumably... and we don't know how the PSP works because we don't see it. Certainly, historically, one has to execute out of cache in order to be able to train memory. Memory arrives to us trained. So the memory training happens in the PSP, so we actually don't need to do that.

Speaker 3:

So it's the good news is it's trained. The bad news is a minute has elapsed since the machine came on.

Speaker 9:

I mean, I don't think we would necessarily want to do that. I mean, like, look, a gig is a lot for l three cache. It's not a lot for sort of running, you know, large scale server applications with large memory footprints. You know, it's an intriguing idea. Like, what would happen if you just had a machine with no RAM in it?

Speaker 9:

You just execute out of cache. Sounds kinda cool, but I don't think that that would be reasonable for a server-class system.

Speaker 1:

It might actually depend on the workload. I can imagine that there are some people in finance who will pay a lot for the capability to not wait for DDR3.

Speaker 9:

Yeah. I mean, you know, our machines are DDR4 right now, but, I mean, yeah, that potentially makes some sense. There's a lot of variability in terms of performance if you have weird memory access patterns and things like that with DRAM. Still, though, you know, would they be able to get away with only 1 gigabyte worth of total physical memory in the machine? Yeah.

Speaker 3:

It feels like the ratio of cores to memory is pretty far... I mean, there may be workloads for which that makes sense, and you're right, that's a lot of L3, but I agree with you, Dan, that it feels like you're gonna need more than a gigabyte. Well, this has been exciting. And, Adam, I hope you're not in too much trouble. Hopefully, the toddler hasn't been getting into the parental controls on the Wi-Fi router. Presumably, that is what's been happening. But,

Speaker 2:

Yeah. Yeah. He he he got in deep while I wasn't looking.

Speaker 3:

Exactly. But thank you very much, and this has been a lot of fun. Dave, thanks again; that tweet was great. You did such a great job summarizing why we feel, anyway, that what we've done is so important, why we're excited about it. And great to have everyone here.

Speaker 3:

Josh and and Dan, very fun to have you both here. And, Steve, Tom, thank you for your your thoughts as always. Alright. Thanks, everyone. See you next time.

Speaker 7:

Bye. Bye.

Speaker 4:

Bye.
