Bringing up Cosmo

Adam Leventhal:

Bryan, Nathanael, you're both here.

Bryan Cantrill:

Where I raised my came on stage. I'm sorry.

Bryan Cantrill:

Just

Bryan Cantrill:

You don't think you know how to not do it?

Adam Leventhal:

So that reminds me. I thought of you this weekend when I saw that Sam Altman was saying that being polite to ChatGPT was costing him millions.

Bryan Cantrill:

I love that. That made me feel so great. Yeah. I guess, you know, as we've discussed in the past, I'm just always polite to ChatGPT. And I have always felt a little disgusted by it, because I'm anthropomorphizing it even when I shouldn't be.

Bryan Cantrill:

And then I'm also, you know, a little disgusted by the whole Altman zeitgeist. So the fact that that is costing him... I just want it to be billions of dollars. That is so delightful. Certainly I am being much more polite to ChatGPT today than yesterday. Yeah. I am just going way out of my way to really praise it.

Bryan Cantrill:

I wanna expand on my... I wanna be very concrete with my feedback.

Adam Leventhal:

Well, I did see something by Microsoft that said that the more polite you are, the more likely you're going to get politeness and respect in return. So, I mean, it does seem like in addition to costing Sam Altman money, there may be some additional reward to that politeness.

Bryan Cantrill:

I think it's great. I mean, there is karmic reward and there is monetary punishment for Sam Altman. What more can you want? I mean, you don't even have to... like, wait a minute.

Bryan Cantrill:

Are you being polite just because you hate Sam Altman? It's like, look, that's unknowable. You can't know that. I'm being polite.

Bryan Cantrill:

Like, why? My reasons are my own. Yes. My reasons are my own. Am I being polite out of malice?

Bryan Cantrill:

Maybe I am. Maybe I am.

Adam Leventhal:

Does that undermine the politeness? Arguably.

Bryan Cantrill:

Arguably a little. Arguably a little. I thought that really was great. Kind of warmed my heart to hear that. He really should not have said that.

Bryan Cantrill:

Not to me, anyway. Not to me. I was the wrong person to hear that. That's right. Well, this is very exciting, because you would wonder, after last week's absolute banger of a conversation, Adam...

Bryan Cantrill:

That was really amazing.

Adam Leventhal:

So wonderful.

Bryan Cantrill:

Ryan Mac. I mean

Adam Leventhal:

Especially in the recording. None of the audio shenanigans in the recording.

Bryan Cantrill:

That was heroic work. That was really terrific work. But you would think that the only thing that can beat this is Morris Chang. And while it would be true that Morris Chang would, there's one other kind of episode that I feel can rival it. And that is, of course, a bring-up episode.

Bryan Cantrill:

I love our bring-up episodes. Great. Did you go back to listen to any of our bring-up episodes?

Adam Leventhal:

Yeah, I did. I did one of our first ones, I think. The first Tales from the Bring-Up Lab.

Bryan Cantrill:

Tales from the Bring-Up Lab.

Adam Leventhal:

Yeah.

Bryan Cantrill:

And Nathanael, have you gone back and re-listened to any of that, or has your therapist explicitly instructed you not to?

Nathanael Huffman:

I have not. You know, I'd lived it and I'm okay with not reliving it.

Bryan Cantrill:

Yeah. There were some things I had forgotten. There's some details I'd forgotten there. And the thing to know about that: we did this episode when we first brought up Gimlet, which was our AMD Milan-based sled. Really, really painful, long bring-up. And when we recorded that episode, we still had lots in front of us, because we had not brought up the T6 when we recorded that first episode.

Bryan Cantrill:

So we were still... Adam, you recall the 499 ohm resistor that was the difference between life and death? Yes. That was in our future then, right? The whole hot plug network was in our future. And Nathanael, maybe this is a good way to kick us off, because the initial bring-up of our rev A sled, which ultimately we were able to do all the necessary validation on, just took a really long time.

Bryan Cantrill:

Nathanael, your line to me, which I'm sure you remember, was: just so you know, we got lucky and we're never doing it this way again. I mean, that board had a complicated history, and we knew we had a lot to learn from. Again, we also now had a team in place; we had a lot of foundation. So what were some of the big changes coming out of Gimlet as we began to look to an AMD Turin-based sled? We skipped Genoa, for reasons we talked about when we talked about Turin. But as we were beginning to look to a next-gen compute sled, what were some of the big things in your mind of like, okay, these are some of the big things we need to change?

Nathanael Huffman:

Well, I mean, I think a lot of it was process. So, you know, the history of Gimlet, of course: we had some external partners helping us with some of that, and then we took it over at a certain point. And so there was kind of a sordid history of many, many people working on that design, and it evolving both as the company figured out what we wanted and also as we figured out who was going to design it. So, I mean, I joined in May, and I think, you know, it depends on who you ask, but we were three months away from tape-out in May or something like that. I mean, it turned out to not be quite three months away from tape-out.

Nathanael Huffman:

But, you know, I think some of it was just that we didn't have a single set of owners for that. I mean, we had started with no electrical engineers at Oxide, and then we had gotten a few, and then we got a few more.

Bryan Cantrill:

We call that the dark period. We prefer not to speak of that.

Nathanael Huffman:

So, I mean, some of it, I think... our CAD was mostly outsourced. We were using contractors for that, and they did fine, but you don't own a library. All of this core functionality, like your footprints and everything, is just outsourced, and nobody cares about your stuff like you do. And so you just don't necessarily get the quality or the understanding or the things that you're looking for. And sometimes, I think, you don't even know what you care about until you have lived through it. So, I mean, it's not necessarily the fault of any of our contract partners at all.

Nathanael Huffman:

It's just that as we figured out what we wanted to do and where we needed to play, it was something that we needed to own. And, you know, when you look back at our Cadence footprints, we had the same parts on many boards, but in some cases with different footprints. Not crazy different, but just not the same. Right? And so we didn't have a consolidated part library. We were growing all of our corporate processes and our electrical engineering design processes and everything all at the same time.

Nathanael Huffman:

And so, going into a Turin-based design where we're going to have the same six people basically own the hardware design from the beginning, we really wanted to spend time fixing our library. We also did a CAD change, I guess as a precursor to this design. So we were moving to a different CAD package, and as part of moving to... Go ahead.

Bryan Cantrill:

I was just gonna say, that's a part of... so first of all, people in the chat are wondering, like, wait a minute, in this era before EEs, you had a bunch of software engineers that were actually doing layout? Like, no, we weren't doing that, please. We're not quite that bad. We had...

Nathanael Huffman:

Yeah. We had electrical engineering partners, right?

Bryan Cantrill:

But exactly. It's really tough as a startup where you're trying to do so much. And we thought, well, maybe we can engage engineering services for this. And again, no matter how good the partner you have is that does that, Nathanael, just to underscore the point you made, nobody cares about your design the way you do. And there are naturally going to be different decisions that are made.

Bryan Cantrill:

And one of the decisions that was made, that was a very important decision but was not made exactly by us, just because it was so early, was in EDA software. Right? So we were using OrCAD, and we were using OrCAD because the AMD reference design used OrCAD. That's the reason we were using it. Right.

Bryan Cantrill:

Like, that is not a great reason to be using a CAD package, especially for a company that wants to take a real clean sheet of paper and do things from first principles. So one of the things coming out of it was that we wanted to really make a holistic change there. We decided that we were no longer gonna be a Cadence customer, just bluntly. And I felt Altium was gonna be much more consistent with what we wanted to go do. So we were making a big shift away from Cadence and OrCAD, to Altium in particular.

Bryan Cantrill:

That was a huge change. And we were using that as an opportunity: okay, let's take advantage of this big change that we need to do anyway, and let's do a bunch of these things you're talking about, like having a real library, getting footprints and so on.

Nathanael Huffman:

Yeah. And I mean, at the beginning, for the first tape-out of Gimlet, we didn't even have Gimlet in our PLM system. We were still bringing our PLM system online. And so, you know, a lot of things had moved since that point. So we had a lot of process stuff.

Nathanael Huffman:

I would say on the technical side, we kind of shied away from a lot of programmable logic, mostly because we didn't have enough programmable logic experts to staff any of that work. I mean, programmable logic can be tricky, and it has various black box concerns and that kind of thing as well. So right around when I joined, we decided to add an FPGA to Gimlet for the sequencing, which I think was a great idea. We've used that a lot and that was good, but even that was pretty minimal in terms of its functions, and a lot of the design bones were already in place when we added that. And so it was going to be that, or it was going to be some kind of off-the-shelf power sequencer that was going to try to meet our needs.

Nathanael Huffman:

But we wanted the flexibility of an FPGA, especially because we didn't know how complicated it would be to bring up the AMD chip and to sequence its power.

Adam Leventhal:

Nathanael, a couple of things. PLM: I'm sure not everyone knows what that is. And sequencing: you alluded to the fact that it has something to do with power, but I think that'd be worth spending a couple of seconds on.

Bryan Cantrill:

No. First of all, people who don't know about PLM should enjoy their innocence. Why should we force-feed them the fruit from the tree of knowledge of good and evil?

Adam Leventhal:

Well, let's just tell them enough to know to stay away from it, you know, to just make smart...

Bryan Cantrill:

realize they're nude. I don't know. Okay. Well, just be very careful. I don't know.

Bryan Cantrill:

Just don't say you weren't warned. Go ahead. Eat the apple. Yes. Fuck.

Bryan Cantrill:

PLM.

Nathanael Huffman:

Okay. PLM is product lifecycle management. In general, it's a big database that holds your BOM, and it often holds your whole product BOM. It goes up and down your product tree, and you do your changes and that kind of stuff in there. I mean, when we taped out Gimlet A, we dumped a BOM out of Cadence into Excel or Google Sheets, and that was our BOM for the first round of Gimlet. And we were still working on getting our PLM system up to a standard to which we were willing to use it.

Nathanael Huffman:

And we hadn't even figured out how to link up OrCAD and our PLM system. So we did eventually do all of that, but that just kind of gives you an idea of where we were: we were flying the airplane and building the airplane at the same time on a lot of these things. So that was that. Then sequencing. So there are a bunch of signals that go between your power supplies and your various chips.

Nathanael Huffman:

And in general, I would say the more complicated the chip, the more complicated its sequencing requirements are. A lot of these chips have multiple rails. A lot of the rails are controlled in various ways for both power and performance reasons. And if you sequence them wrong, in some cases, you're at risk of blowing the chip up. But, you know, you think about an SP3 or an SP5 socket, these big processors, so a Milan or a Turin: they want to have power domains that they can boost so that they can use more power at certain points in time.

Nathanael Huffman:

They'll even raise their voltages under control. But obviously those rails that they want to control have to be different rails than some of the other core functions, like your clocks that have to run and that kind of thing. Those have a different set of rails. And so there's a whole sequence that's defined by a given chip that tells you the order in which you apply the rails. Then sometimes there are additional handshake signals.

Nathanael Huffman:

So we release a reset and then they have some signal that comes back to us and toggles high or low, however the case, to let us know they've moved past that state. And so we ended up on Gimlet putting all of that stuff into an FPGA. And I think that was a good decision, I think.
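
As a rough illustration of the sequencing idea described here (rails applied in a chip-defined order, then a reset release followed by a handshake signal coming back), the following is a minimal Python sketch. The rail names, their order, and the `Board` model are all hypothetical; a real sequence comes from the part's documentation, and on Gimlet this logic lives in an FPGA rather than in software.

```python
from dataclasses import dataclass, field

# Hypothetical rail names and ordering, for illustration only.
SEQUENCE = ["V3P3_STANDBY", "V1P8_CLOCKS", "VDD_SOC", "VDD_CORE"]


@dataclass
class Board:
    """Toy model of a board: which rails are enabled and whether reset is held."""
    enabled: list = field(default_factory=list)
    in_reset: bool = True

    def enable_rail(self, name: str) -> None:
        self.enabled.append(name)

    def power_good(self, name: str) -> bool:
        # In this toy model a rail reports good as soon as it is enabled.
        # Real hardware would poll a power-good pin with a timeout.
        return name in self.enabled

    def release_reset(self) -> None:
        self.in_reset = False

    def handshake_asserted(self) -> bool:
        # The modeled chip only signals back if reset was released
        # after every rail came up in the expected order.
        return not self.in_reset and self.enabled == SEQUENCE


def power_up(board: Board) -> str:
    """Enable rails strictly in order, then release reset and check the handshake."""
    for rail in SEQUENCE:
        board.enable_rail(rail)
        if not board.power_good(rail):
            return f"fault: {rail} never reported power-good"
    board.release_reset()
    if not board.handshake_asserted():
        return "fault: no handshake after reset release"
    return "running"
```

The point of the sketch is only the shape of the problem: each step gates the next, and a failure at any step is a distinct, observable state, which is the kind of visibility that comes up later in the conversation.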

Bryan Cantrill:

Definitely a good decision. Yeah, yeah, no question.

Nathanael Huffman:

It allowed us a lot of visibility and flexibility. And I mean, you know, we control the resets to the T6 and some other stuff in there.

Bryan Cantrill:

Well, we also did things like, I mean, the FPGA bitstream is actually in the Hubris image that we load on the SP. So we're able to attest the FPGA image, as it's attested with the SP image. And then there's a bunch of things that we were able to do. I think it was very good to get that experience with soft logic and the things that soft logic could do for us.

Nathanael Huffman:

Right. Yeah. And so as we looked at Turin and stuff, there were a number of features where we potentially wanted to open the scope up a little bit on our programmable logic side. A couple of really easy ones: on Gimlet, we hooked the SPD stuff from the SP3, so from the Milan socket, right up to our service processor.

Bryan Cantrill:

SPD is the serial presence detect on the DIMMs. Little

Nathanael Huffman:

EEPROMs on all the DIMMs. And so normally in a normal computer, like the computer sitting on your desk, the processor has a little I2C bus and it connects to all of the DIMMs, and it can read out information from the memory: serial numbers, some timing parameters, and that kind of thing. We wanted to interpose on that in our designs, and we still do. And so in Gimlet, what we did is we just wired that bus up to the SP, and the SP acts as an I2C target. And then the SP can go prefetch all of this data from all the EEPROMs and just parrot it back to the processor.
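
The prefetch-and-parrot arrangement described here can be sketched in a few lines of Python. This is a hypothetical model, not the Hubris implementation: the `SpdProxy` class and its method names are invented for illustration, and the only outside fact assumed is that SPD EEPROMs conventionally sit at 7-bit I2C addresses 0x50 through 0x57.

```python
class SpdProxy:
    """Toy model of a service processor interposing on SPD reads."""

    def __init__(self, read_eeprom):
        # read_eeprom(addr) returns the EEPROM contents as bytes,
        # or None to model an empty DIMM slot.
        self._read_eeprom = read_eeprom
        self._cache = {}

    def prefetch(self, addresses):
        """Read every populated EEPROM once and keep a copy."""
        for addr in addresses:
            data = self._read_eeprom(addr)
            if data is not None:
                self._cache[addr] = bytes(data)

    def host_read(self, addr, offset, length):
        """Answer a host read from the cached copy, as an I2C target would.

        Returns None to model a NACK when no DIMM is cached at that address.
        """
        data = self._cache.get(addr)
        if data is None:
            return None
        return data[offset:offset + length]
```

A usage example under the same assumptions: build the proxy with a function that reads the real EEPROMs, call `prefetch(range(0x50, 0x58))` once, and then serve every host request through `host_read`, so the host never touches the DIMMs directly.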

Nathanael Huffman:

Well, we wanna keep the same SP on Cosmo because, you know, we like our SP and it's fine, but it doesn't have an I3C peripheral, and the industry is moving to I3C. So that was a spot where we're like, well, even if the AMD Turins right now aren't booting with I3C, we want the capability of doing that. And so we either have to change our SP or we have to build some other kind of subsystem that could potentially talk I3C. So we put that into an FPGA. That's one place.

Nathanael Huffman:

Turin supports eSPI boot. So instead of just booting off of a dumb SPI NOR device, like your CMOS device, for the initial instructions for the PSP (or ASP) loads... the PSP is one of the hidden cores in the big CPU, and it kind of loads the initial brains. Instead of doing that, you can talk eSPI. And so you can talk to a slightly smarter device where you have some kind of multiplexed channels and you can do a number of things. So we wanted to be able to do that.

Bryan Cantrill:

Yeah.

Nathanael Huffman:

Yes. We actually really love that. And one of the reasons that we drove that way was because of the UARTs on Turin. We use two UARTs in our system: one for the console, so that's your UART0, and one for what we call IPCC, which is a little control protocol that runs between the SP and the big processor. The Turin devices all got rid of the hardware handshake signals on their second UART, because I think they ran out of pins, and these chips are monsters.

Nathanael Huffman:

And we run that at three megabaud. So in order to have a performant IPCC channel, we needed to do something else there where we would be able to do flow control. And so we made the decision to do that over eSPI as well. So our IPCC channel is being done over eSPI, and that goes to the FPGA, which then breaks it out to a UART and ships it over to the SP. There were kind of a lot of things like that.
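
To give a feel for why flow control matters at three megabaud, here is a small back-of-the-envelope calculation in Python. It assumes conventional 8N1 framing (10 bit times per byte) and a hypothetical 256-byte receive FIFO; neither number comes from the conversation.

```python
def bytes_per_second(baud: int, bits_per_frame: int = 10) -> float:
    """Payload rate of a UART, assuming one byte per frame (8N1 is 10 bit times)."""
    return baud / bits_per_frame


def seconds_to_fill(fifo_bytes: int, baud: int, bits_per_frame: int = 10) -> float:
    """How long a receiver can ignore its FIFO before data is lost."""
    return fifo_bytes / bytes_per_second(baud, bits_per_frame)


if __name__ == "__main__":
    rate = bytes_per_second(3_000_000)
    window = seconds_to_fill(256, 3_000_000)
    print(f"{rate:.0f} bytes/s, FIFO fills in {window * 1e6:.0f} microseconds")
```

Under those assumptions the link carries 300,000 bytes per second, so a receiver with no way to signal "stop" has well under a millisecond before a FIFO of that size overruns.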

Nathanael Huffman:

I think we also wanted... the whole hot plug subsystem in a server is an I2C interface, and you're interfacing to a specific number of... I mean, like, it's a PCA9506. So that's an IO expander, basically, that you can buy from NXP or TI. And they're very dumb devices.

Bryan Cantrill:

Kind of expensive.

Nathanael Huffman:

They're kind of expensive. Yeah. I mean, they're old and venerable, right? And the PSP code or something in one of the hidden cores talks to these things. So you're sort of limited in some senses as to what parts you can choose.

Nathanael Huffman:

Right? There are a number of supported parts, and you can build out a topology that you like, and you pass that into your host software bits, and then the host software bits will interact with these devices. And so one of the challenges we had on Gimlet was we didn't have a whole lot of visibility into that stuff.

Bryan Cantrill:

Oh, I thought you were gonna say the challenge was that we got it all wrong, that we got the actual pin numbering wrong on all those.

Nathanael Huffman:

I mean, yeah. There was a lot of...

Bryan Cantrill:

That was one of the horrors that I had forgotten anyway, Nathanael: that whole adventure with Rick having to dead-bug them. I mean, the peril of discrete logic is, if you get something wrong... and I'm not casting aspersions; you make mistakes, that happens. But when you make mistakes with discrete logic in hardware, it's like, yeah, now you're actually stuck reworking this thing, if you can.

Nathanael Huffman:

Yeah. In fact, this is probably something we didn't talk about in one of the bring-up episodes, but on Gimlet, as we went into our production rev, we discovered that unplugging disks would not necessarily generate a hot plug event. Right. And so we actually had to give up a... so we support AIC NICs, which is just like a one gigabit add-in card. When you have the Gimlet out of a chassis, you can stick an AIC NIC into one of our CEM slots instead of a sharkfin, and we can boot in certain development modes.

Nathanael Huffman:

And so we supported that. Now, the K.2 made a lot of that use kind of go away; we don't typically do that anymore. We just use K.2s, which are basically the same thing, but on the U.2 interface.

Nathanael Huffman:

But for one of the things that we fixed, because it was all discrete logic, it was all chip-down, we had to change an OR gate to an AND gate or something like that. And that basically breaks the AIC use case. So you can't use an AIC on a production Gimlet without a sharkfin and other things.

Bryan Cantrill:

If you can use an AIC, then hot plug doesn't work properly. It's like... Right.

Nathanael Huffman:

So we made the decision to make hot plug work right. But it gets complicated with all of these gates, and you've got a bunch of active-low logic, and you have AMD's names for things and then our intended names for things, and decoding all of that. I mean, I spent most of a day this last week trying to re-decode all of that for Cosmo, even, because it's complicated. But we have an FPGA, so eventually I can just write the thing that we want. So we wanted to move that into an FPGA.
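
The confusion around active-low signals and OR versus AND gates is easy to show in miniature. The Python sketch below is not the actual Gimlet or Cosmo logic: the two presence pins are hypothetical, and the example only illustrates how the same decision looks different depending on whether you reason about "asserted" or about raw pin levels.

```python
def asserted(level: int, active_low: bool = True) -> bool:
    """Translate a raw pin level (0 or 1) into 'is this signal asserted?'."""
    if level not in (0, 1):
        raise ValueError("pin level must be 0 or 1")
    return level == (0 if active_low else 1)


def present_any(pin_a_n: int, pin_b_n: int) -> bool:
    """Device counts as present if either hypothetical presence pin is asserted."""
    return asserted(pin_a_n) or asserted(pin_b_n)


def present_all(pin_a_n: int, pin_b_n: int) -> bool:
    """Stricter rule: both hypothetical presence pins must be asserted."""
    return asserted(pin_a_n) and asserted(pin_b_n)


def present_any_raw(pin_a_n: int, pin_b_n: int) -> bool:
    """Same function as present_any, written on raw levels: it is a NAND."""
    return not (pin_a_n == 1 and pin_b_n == 1)
```

A card that pulls only one of the two pins low is present under `present_any` and absent under `present_all`, so swapping one gate changes which kinds of devices are recognized. With discrete parts that swap is a board rework; in an FPGA it is an edit to the logic description.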

Nathanael Huffman:

That was the long way of saying it. So we actually have three FPGAs on Cosmo. We have the Ignition FPGA, which we have had before: a Lattice iCE40.

Nathanael Huffman:

We have our sequencer FPGA, which is a Xilinx Spartan-7. And then up on the front, for all of the hot plug IO, we're using another Lattice iCE40. And the reason that that is two FPGAs instead of one is mostly just a physical choice, because you have all of these signals at the very front of the board that are going to all the CEM slots, and, you know, it's a lot of pins.

Nathanael Huffman:

We've got like 10 pins per CEM slot and 10 slots, so you have a hundred pins basically right there for hot plug, and to try to fan all of that in and drive it back to an FPGA in the back of the board, it's just not worth it. And the iCE40s are pretty cheap. So we have that thing. I've got that mostly built out, and we're using like 40% of the logic in an 8K device. So it's not a very big design, but it's certainly nice, because then we just have a couple of lines going back and forth.

Bryan Cantrill:

So that was a big change. And a lot of other stuff, it should be said, architecturally on Gimlet, we were actually really happy with. We were happy with our SP. We've got the same SP.

Bryan Cantrill:

The H753 was what we had used on Gimlet; we're using that on Cosmo. With the root of trust, I would say... I mean, yes, we are the company that has found, I think, two different vulnerabilities on the LPC55. People are very reasonable to ask us, why are you still with this chip? It's like, well, it's the devil we know. So we definitely understand the LPC55 pretty well at this point.

Bryan Cantrill:

Secure silicon is very hard, as it turns out. We like our attestation. I mean, so there's a bunch of stuff that we actually liked.

Nathanael Huffman:

Oh, yeah. All of our management network stuff is the same.

Bryan Cantrill:

Yes. Yeah. Once we got that working, there was not much of an incentive to change all that. So the big change is this use of soft logic, which we talked a little bit about also in the Turin episode: the fact that we're very bullish on eSPI.

Bryan Cantrill:

A little bit nervous, because Turin was the first part, I believe, Nathanael, that used eSPI, right? Because they were gonna have eSPI in Genoa and then they ripped it out, which is always like... god. Yeah.

Nathanael Huffman:

I think Genoa technically supports eSPI, but not eSPI boot.

Bryan Cantrill:

Right. Which leaves you with, like, so many questions. Like, oh my God, it doesn't work.

Nathanael Huffman:

Yeah. So, I mean, we had decided that we were going to go down this path where we have a SPI flash connected to the FPGA and the eSPI bus, and we expected that eSPI would be the answer, but it's kinda like, well, if it isn't, I guess we can do something a little hacky and make it look like a SPI NOR if we have to.

Bryan Cantrill:

If we have to. So this is not gonna be company-ending if eSPI boot doesn't work, but boy, we'd like eSPI boot to work. So this kind of brings us to a third major difference. And whether you wanna call this a process difference or not, Nathanael... I mean, we were trying to pull together a bunch of things for the first time on Gimlet, and we wanted to have a kind of halfway point on Cosmo where we could validate some of these things without having to have a full board. So I think we maybe even hinted at Grapefruit a little bit on the Turin episode, but do you want to describe Grapefruit a little bit? Because I think that was really, really important for Cosmo bring-up.

Nathanael Huffman:

It sure is. So Grapefruit is a little board. It's in a... I don't even... it's like an OCP NIC form factor or something like this.

Bryan Cantrill:

Yeah, what is that? Yeah, that's a weird form factor.

Nathanael Huffman:

It's an OCP form factor, I think. And so it has an edge connector and you can stick it into a spot that conforms to this OCP thing. So what we did is we bought a Turin bring-up system from AMD. AMD has these development platforms. The platform we're using is the Ruby, and the Ruby had its own little board in there with its own BMC and root of trust and everything.

Nathanael Huffman:

So we pulled that out and we put a Grapefruit in, and the Grapefruit was basically our SP, our RoT, and mostly the same network configuration, but we had RJ45s on it, so we could get to the SP over the network. And then it had the Spartan-7 FPGA on there. And we wired a bunch of stuff together to hopefully be able to eSPI boot.

Nathanael Huffman:

And, you know, when we looked at the schematics for Ruby, it was like, we can clearly eSPI boot; there are muxes everywhere. We can flip all these muxes and do all the things. So we built that. I think Eric did a lot of the work on the hardware design there, and that thing has been awesome, because we've been using that for months in order to risk-retire some of the FPGA stuff, and it's been something I've been using every day.

Bryan Cantrill:

Yeah. It's been really good. And this is thanks to, as Luke dropped in the chat, the OCP DC-SCM specification, which... I haven't looked into the history of it, but this has got to have that RunBMC lineage. But just the idea that there should be a connector so you can have a different BMC, or in our case service processor.

Bryan Cantrill:

That brought out a lot of these connections and allowed us to build this thing. We would not have been able to build this thing had it not been for this specification. I think we're really grateful for it. And then, yes, so tell me about the experience with Grapefruit, because we had the objective of using Grapefruit to actually... I mean, there are certain things that you can't do with a true Turin, but there's a bunch of things that you can do. So what are the things we were able to do with that, and able to actually get working before we even finished taping out Cosmo?

Nathanael Huffman:

Yeah. So, the big things we were able to risk-retire on Grapefruit. The first part is the easy part: I have a little tiny QSPI driver, and then Hubris can read and write to the QSPI flash that's on Grapefruit. So that looks like our host flash on a Cosmo.

Nathanael Huffman:

So that was kind of the first thing that we brought up, because with eSPI boot, you're going to want to fetch data from flash. So we built a little SPI NOR shim, and Matt wrote all of the Hubris code for that. And we spent a week or two debugging all of that so that we could read and write files into a SPI NOR. So we did that. And then the next piece up was this eSPI thing.

Nathanael Huffman:

And so the eSPI design in general is not overly complicated, I would say, but, you know, there's a specification by Intel. And so I took the specification and started working on our implementation of it. And that was a decent amount of work, especially given that we're starting from a pretty low amount of FPGA code generally that we own corporately. And so we're having to build kind of all the infrastructure to build in a new tool. Since we were targeting Spartan-7s, we had to do some work on our build infrastructure; we're using Buck2 to do our FPGA builds for this stuff.

Nathanael Huffman:

And so I had some work to get all of that working, get Vivado kind of plumbed through there. I haven't been a big Vivado user before in my past career, so I have gotten the experience of, you know, catching up on Vivado. So Grapefruit really provided an awesome target for all of that. We could do a lot of that early dry-land work, get a bunch of stuff going there. And then it was like, now we need to stick this thing into a Ruby and get eSPI going so I can debug my... the simulator is only so good, and, you know, I need to get things working.

Nathanael Huffman:

So we eventually reworked the Ruby enough to make eSPI boot work, and we were able to do a huge amount of risk retirement on our IPCC implementation, our eSPI boot implementation, everything, all on a Ruby and Grapefruit.

Bryan Cantrill:

Yeah. That was terrific. And Matt, you mentioned that there were some things on the Hubris side. Because, I mean, we've got an SP on Grapefruit as well. So we did bring up Hubris, obviously, on Grapefruit.

Matt Keeter:

Yeah. And I think the interesting part about Hubris on Grapefruit, which I did not write, Cliff on our team wrote this, was that the FPGA, instead of being connected over SPI, is connected over the FMC bus, which means that it shows up in the SP's memory space just like other peripherals. And so Nathanael and I put a bunch of work into coordinating the outputs from the FPGA toolchain with the inputs of the Hubris toolchain so that we could create all of these virtual peripherals in the FPGA and then have them show up and talk to them through the same kind of accessors that we would use for peripherals that were actually built into the SP's hardware.

Bryan Cantrill:

Which is really cool. Yes. That's a big change then. I mean, that's not what we had done with the FPGA on Gimlet. That was much more

Nathanael Huffman:

We had, like, the precursor to this. So, I mean, over in our FPGA land, we've been using SystemRDL to define all of our registers. So we output some JSON, which Hubris can consume. But because of the way we did Gimlet, the FPGA on Gimlet was a SPI peripheral to the SP. And so while we did generate a whole memory map for Gimlet and hand that over to Hubris, it's a little strange, because you basically have one driver in Hubris kind of doing all of the things.

Nathanael Huffman:

Whereas when you're sitting on the FMC interface, you look much more like a first-class citizen in the SP's addressing. And so you're able to let Hubris write a driver for UARTs, or a driver for SPI, and a driver for eSPI, and they don't all have to be shoehorned into kind of the same thing. And so the nice thing is they look like more standard peripherals that you'd write drivers for. And then maybe the downside is that you probably want to better integrate things. And so Matt did a ton of work there to better consume that JSON.

Nathanael Huffman:

And we made a few changes on our side to generate a little more of the stuff so that he can set his MMUs up correctly, and all of that works really well. So

Matt Keeter:

Yeah. So we can even have different peripherals in the FPGA mapped to different tasks in the SP. So, like, the task that's doing the host-to-SP communication can only talk to the UART in the FPGA, and would get a memory fault if it tried to talk to a different FPGA peripheral, which is pretty nice.
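As a rough illustration of the per-task mapping Matt describes, here is a minimal Python sketch of the idea, not Hubris code. The peripheral names, addresses, sizes, and task names are all hypothetical, as is the `check_access` helper; on the real system the isolation is enforced by the SP's memory protection hardware rather than by a software lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous block of the SP's address space backed by an FPGA peripheral."""
    name: str
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


class MemoryFault(Exception):
    """Raised when a task touches an address it has not been granted."""


# Hypothetical peripheral map; addresses and sizes are made up for illustration.
FPGA_PERIPHERALS = {
    "uart": Region("uart", 0x6000_0000, 0x100),
    "espi": Region("espi", 0x6000_0100, 0x100),
    "spi": Region("spi", 0x6000_0200, 0x100),
}

# Hypothetical task-to-peripheral grants.
TASK_GRANTS = {
    "host_sp_comms": ["uart"],
    "espi_driver": ["espi"],
}


def check_access(task: str, addr: int) -> str:
    """Return the name of the peripheral the task may access at addr, else fault."""
    for name in TASK_GRANTS.get(task, []):
        if FPGA_PERIPHERALS[name].contains(addr):
            return name
    raise MemoryFault(f"task {task!r} may not access {addr:#x}")
```

In this sketch, `check_access("host_sp_comms", 0x6000_0010)` lands in the UART region, while the same task touching the eSPI region raises `MemoryFault`.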

Bryan Cantrill:

That is really cool. The FMC, where does that originate from? What is the origin of the FMC bus?

Nathanael Huffman:

I mean, I think it's just an ST name for their bus. The FMC peripheral supports a number of different external things. So you can connect it up to RAM, or you can connect it up to parallel flash, or you can connect it up to a bunch of different things. And so, yeah.

Bryan Cantrill:

Yeah.

Nathanael Huffman:

It's specifically a peripheral in the STM32.

Bryan Cantrill:

Yeah, that's very cool.

Nathanael Huffman:

Yeah, it is. There are better peripherals out there in the world, but this one is fine and it does exactly what we need. So there are a couple of like strange errata on it, but we're pretty good there.

Bryan Cantrill:

Did we hit those strange errata presumably on grapefruit?

Nathanael Huffman:

Yeah. Oh, yeah. I mean, yeah, we had workarounds for all of those things you know, all implemented. I kind of forgot about that because I did that so early in grapefruit. That was the first thing I did in grapefruit, because if you can't get to FPGA registers, you really can't do anything else.

Nathanael Huffman:

So that was the very first peripheral I wrote for grapefruit.

Bryan Cantrill:

And in doing Grapefruit, did we find errors in Cosmo? Because Cosmo had not yet taped out. Did we find things that we wanted to change in Cosmo, or perhaps potential errors in Cosmo?

Nathanael Huffman:

For sure. Probably the big one that stands out in my mind is that the way the FMC peripheral works is you're basically jacked into the brainstem of the bus. And so you can issue a bus transaction out the FMC, and I can stall your bus, as the FPGA, for as long as I want, which could be forever.

Bryan Cantrill:

I mean, which makes sense. Like in order to have that kind of integration, you know, like you do, it's like, it's understandable, but it's also like, oh, oh, oh, goodness. Oh, apologies.

Nathanael Huffman:

You almost have to, because if, for example, you wanted to fetch data from a peripheral that's in a different clock domain, it could take multiple clocks to get it safely crossed over. So you do need the ability to stall for a somewhat arbitrary amount of time. Some buses do implement a timeout; I worked on a PowerPC QorIQ P2020 a few years ago, and that one had a timeout on the bus. So after a certain amount of time, it would give up, and it was on the FPGA then to recover at that point. But in this case, there's a wait pin, and I think the wait pin is low, and while the wait pin is low, the SP waits.

Nathanael Huffman:

And so a couple of interesting things are, like, what if the FPGA isn't there? You haven't configured it yet, and that pin's not driven. So we missed a pull-up on Grapefruit for the wait pin. Cliff found that pretty quick along the way, and he implemented a weak pull-up on the Grapefruit. So we were able to get away with no rework there.

Nathanael Huffman:

But in Cosmo, we put a real resistor there. So it'll be strongly pulled up when the FPGA is out to lunch. That's probably the biggest one we found. I would say Aaron and I found an additional problem, but it was too late, because we had already taped out Cosmo and we didn't have enough time to get all the way to the end, which is that we needed pull-up resistors on our I2C level translators. So that's a rework in Cosmo.
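The wait-pin behavior described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the pin is treated as active low (as Nathanael tentatively recalls), an undriven pin with no pull-up is pessimistically treated as if it held the bus stalled, and the function names and the `limit` parameter are hypothetical.

```python
from typing import Optional


def resolve_pin(driver: Optional[bool], pull_up: bool) -> Optional[bool]:
    """Level seen on the net: the driven value if driven, else the pull-up,
    else None to represent a floating (indeterminate) pin."""
    if driver is not None:
        return driver
    return True if pull_up else None


def stall_cycles(fpga_configured: bool, pull_up: bool,
                 fpga_wait_cycles: int, limit: int = 1000) -> Optional[int]:
    """Return how many cycles the SP stalls on one bus access, or None if it
    is still stalled after `limit` cycles (modeling a hang with no timeout)."""
    for cycle in range(limit):
        if fpga_configured:
            # A configured FPGA holds wait low until its data is ready.
            driver = cycle >= fpga_wait_cycles
        else:
            # An unconfigured FPGA does not drive the pin at all.
            driver = None
        level = resolve_pin(driver, pull_up)
        # Assumption: only a definite high releases the bus; a floating pin
        # is treated as worst case, i.e. as if it were low.
        if level is True:
            return cycle
    return None
```

With a configured FPGA that needs three cycles, `stall_cycles(True, False, 3)` returns 3. With no FPGA configured, the pull-up case returns 0 and the no-pull-up case returns `None`, which is the hang the missing resistor risked.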

Bryan Cantrill:

Yeah. That's right. And the wait pin thing is an interesting one, because had you not found that, we would have found that in Cosmo. Cosmo would have been the first time we would be putting it together. And we really can't do very much without this FPGA functional.

Bryan Cantrill:

I mean, we actually need this thing to be up and functional. So we would have hit that pretty early in bring-up. We would not have been able to power on SP5 for, you know, a couple of days or whatever it took to debug it and work around it. And during that time, everything is blocked, because effectively you've slipped everything.

Bryan Cantrill:

And by kind of pulling this out into Grapefruit, we're able to basically hide that latency and allow Cosmo to move forward without it, which is great. Right.

Nathanael Huffman:

I mean, yeah, we had solid months of FPGA work done on Grapefruit before we show, and so, I mean, we'll get to more of that, but during the Grapefruit bring-up week, we got to eSPI wiggles, and we knew they were valid eSPI wiggles and they were hitting the peripheral the right way. So those are things that would absolutely not have happened without Grapefruit, because it would have taken me another month just to bring all that stuff up.

Bryan Cantrill:

Right. And it would have been, again, another month where everyone's kind of like, well, we can't get to the rest of the validation. You know, I'm here to do the DIMM validation, and we actually can't do that because we can't power the thing up. So, right, Grapefruit was very, very important.

Bryan Cantrill:

Of course, there were things we couldn't test on Grapefruit, too. And so what were some of the things that we are going to have to test for effectively the first time on Cosmo?

Nathanael Huffman:

We knew, I mean, I just alluded to this, but we knew that on the SPD proxy stuff, we did not complete our testing on Ruby. And so Aaron and I spent more hours than we'd probably like to admit trying to get that working. And it's a real kludge, because we had about three feet of rework wire run into that thing, because of where the DIMMs are in the Ruby and where the Grapefruit is, and the fact that they didn't intend you to do this, and that we eventually couldn't actually control all the muxes we thought we could control. We ended up needing to, you know, add wires and remove a mux and that kind of thing in Ruby.

Nathanael Huffman:

And so we did some rework on that. But then we're trying to run something at a megahertz and we've got three feet of rework wire in there. And so signal integrity is a problem, and it's just a complicated mess there. And so we were never really able to get to ground on whether that was working: was the FPGA code right, or was the test fixture bad enough that it's not a valid test?

Nathanael Huffman:

So anyway, Aaron had done a ton of work on that and then handed some of that over to me while he got to go work on different problems. So we knew that was gonna be a risk going in, I would say. Obviously sequencing we could not do. But when you look at the Turin EDS and the sequencing, it looks very similar to Milan. So we weren't too concerned about that.

Nathanael Huffman:

And then, you know, none of the hot plug stuff, and a bunch of the other things were all going to have to be brought up for the first time.

Bryan Cantrill:

Right. And we knew that if we're testing hot plug, that means we've got an SP5 up and working. So that's already working. Exactly. So it's fine that that stuff is further down.

Bryan Cantrill:

But, I mean, sequencing is obviously, we're going to hope that we've got this thing working. So okay. I think this is all very good backdrop for Cosmo. So we get our first boards back. We are in our manufacturing facility with Benchmark Electronics in Rochester, Minnesota.

Bryan Cantrill:

And this was on the Monday, and I actually thought it was going to be, like, maybe two days of beeping out. And it definitely wasn't. You guys were able to apply power pretty quickly. I mean, tell me about the process there.

Nathanael Huffman:

Yeah. So, I mean, I think a couple of things. We learned a lot of stuff in Gimlet, in terms of, like, we know that the T6 has a pretty low impedance on its power and ground planes. And so there were a lot of concerns that we could kind of rule out right away. We also had a number of people there. So I know Ian and RFK and Eric spent a lot of time beeping some of the boards out, but, you know, when you have fewer problems than you did on Gimlet A, there are fewer weirds. I mean, I don't know if you remember, but on Gimlet A, none of the I2C addresses were strapped correctly on the entire board.

Nathanael Huffman:

So, you know, I mean, we were building all of the Humility, like, validate and the Humility sensor stuff.

Bryan Cantrill:

We built that stuff out of our experience with Gimlet A. I mean, so yeah. It was just very crude all around.

Nathanael Huffman:

And I think when you've got the engineers there, so you have Ian and you have RFK and Eric, and they did all of the power supplies, they're already super familiar with the design. They're ready to get in there with probes and poke everything. Within a few hours you've kind of got, okay, we have no shorts. Everything looks reasonable.

Nathanael Huffman:

We're ready to turn power on, and then we put a dummy ignition in there that just turns on our IBC, our 54-volt to 12-volt converter, which is kind of the main power stem into the board. So we put, you know, a simple FPGA in there that just did that one pin. And then it's like, well, okay, we call it our A2 domain, but it's basically our standby power domain, with the SP up and the RoT up and everything. All of those rails can be checked out, and everything basically came up kind of as it was designed, and we're in pretty good shape there. So it was like, well, now it's time to start flashing.

Bryan Cantrill:

Yeah. Which is great. So you start flashing things. And, you know, Matt, you'd prepared a Cosmo Hubris image. I mean, obviously, you're not able to test that on an actual Cosmo until we have one.

Bryan Cantrill:

So, and it's really easy to fat-finger addresses there. I mean, we do try to double and triple check the stuff, but it's just really easy to make mistakes there. So you kind of know what we're going to find, but, Matt, we were able to get a Cosmo image on there, and, yeah.

Matt Keeter:

When I was booking the flight, I was joking that surely they will not be to the point of needing a Hubris image by, like, 2 PM on Monday afternoon when I arrive. And it was basically as soon as I arrived that they needed the Hubris image ready. So it was, yeah, good timing.

Bryan Cantrill:

That's great. Yeah.

Matt Keeter:

The first Hubris image, I mean, it kind of worked. Like, it flashed. We could run Humility to get all the task status. I think we might have had a couple of things crash-looping just due to, like, bad peripheral addresses and everything.

Nathanael Huffman:

Yeah. And you had your FPGAs configured, I think, pretty much out of the chute.

Matt Keeter:

Yep. And there was one issue talking to the FPGA, for mysterious reasons, which we will get to. Right.

Bryan Cantrill:

So yeah. And I had intended to be out there on Monday, but United Airlines had other plans for me. United thought maybe I would prefer to take the red-eye and show up on Tuesday morning. So that's what I did. But so by the time

Matt Keeter:

The very first image we flashed, we could actually not talk to the FPGA, if I remember correctly, because one of the sync pins between the FPGA and the SP on the FMC bus was not wired up correctly. Physically not wired up correctly.

Nathanael Huffman:

There are a number of special-purpose pins on the SP, and one of the special-purpose pins for the FMC had apparently been moved between Grapefruit and Cosmo. It's unclear exactly where that happened, but at any rate, it wasn't actually allowed to move, and it moved. We could have probably survived that if it had moved to another pin on the FPGA, because we would just rebuild the FPGA with a different pin. But in this case, the dedicated pin was now connected to some circuitry on the board that we didn't have access to.

Nathanael Huffman:

So we did actually have to do a rework there to basically not use that pin for its intended purpose on the board, and we could live with that. And we had to jumper that pin over to a pin that was accessible to the FPGA.

Bryan Cantrill:

So how gnarly was that rework?

Nathanael Huffman:

Oh, it was no, no big deal. It's a little jumper between two balls underneath or two vias underneath the SP. So it's a little, you know, it's just a little tiny wire.

Bryan Cantrill:

Okay. And is that, are you able to do that with it? Do you have to, to deball the SP to do that or is that

Nathanael Huffman:

No. So the SP is a BGA part, and at least for these two pins, all of the vias were exposed out the back. So we could just jumper via to via. And, oh, that's easy.

Nathanael Huffman:

Okay. We basically tell Matt, you know, you're no longer allowed to use this other pin, because we just jumpered to it.

Bryan Cantrill:

And Right.

Nathanael Huffman:

Right. You know, like it's fine.

Bryan Cantrill:

So, great. Okay. So that gets all done, I think on Monday because the, certainly it felt like by the time that,

Nathanael Huffman:

that found our second bug. So after we get that, now we're able to talk to the FPGA, but there are issues. When the FPGA responds, it looks like certain bits are stuck in its data path. So if it's supposed to read "dead beef", you see, like, "dead 01" or whatever, you know? And so we spent a good bit of time trying to figure that out.

Nathanael Huffman:

I mean, I put in the ILA, which is like a logic analyzer you can compile into the FPGA. Matt and I are peeking and poking different addresses and trying to figure out, like, what the heck, why am I seeing this? And it's like, well, the data is coming in wrong on these bits, always. Right?

Nathanael Huffman:

And so it turned out that a different Hubris peripheral was also attempting to use the same pins.
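The peek-and-poke debugging described here, writing known patterns and looking at which bits always come back wrong, can be sketched in Python. This is an illustrative helper, not the tooling the team used; `stuck_bits` and its arguments are hypothetical names, and it assumes a 32-bit data path by default.

```python
def stuck_bits(samples, width=32):
    """Given (expected, observed) word pairs, return a pair of masks:
    (stuck_low, stuck_high).

    A bit is reported stuck low if it read 0 in every sample even though
    at least one sample expected a 1 there, and stuck high in the mirror case.
    """
    mask = (1 << width) - 1
    always_zero = mask  # bits that were 0 in every observed word
    always_one = mask   # bits that were 1 in every observed word
    expected_one = 0    # bits expected to be 1 at least once
    expected_zero = 0   # bits expected to be 0 at least once
    for expected, observed in samples:
        always_zero &= ~observed & mask
        always_one &= observed & mask
        expected_one |= expected & mask
        expected_zero |= ~expected & mask
    return always_zero & expected_one, always_one & expected_zero
```

Feeding it a few patterns such as all zeros, all ones, and `0xDEADBEEF` gives each bit a chance to be expected both high and low, so a persistent mismatch on the same bits stands out as a mask rather than a one-off read error.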

Bryan Cantrill:

Yeah. Oh boy.

Nathanael Huffman:

Unfortunately, in the process of debugging that, because it was data pins, and the data pins on this bus are bidirectional, so the FPGA gets to drive them at certain points in time as well, we managed to blow up, probably, the FPGA pins on that one. And so once we figured that out, though, we were able to fix the Hubris peripheral and try it on a different board and prove that everything was working.

Nathanael Huffman:

And so then we just sent that board back for rework and had the SP and the FPGA replaced.

Matt Keeter:

And in fact, we got very lucky because we tested out a different board, like a second Cosmo board with the image that would have led to these pins blowing up, but we had not done the rework to fix the FMC bus controller. So we didn't get to the point where the two things would fight.

And so then Ian walks over to me and says, all right, I've done the FMC bus controller rework, and hands this board to me. And I realize if we power up this board, it will immediately blow up these pins. So I hand the board back to Ian and say, Ian, can you please undo that rework? And I can put a safe image on here.

Bryan Cantrill:

Right. It's like, I'm sorry, let me understand this correctly. You want me to literally break the thing I just fixed, because the fixed thing is going to break the board.

Ian Sobering:

Well, by that point, I'd gotten really good at soldering jumper wires on the back of these boards. I'd done, like, I don't know, 10 of them. So it really wasn't that big of a deal. It took three minutes instead of, like, thirty minutes, but it was just funny.

Bryan Cantrill:

Yeah. That is pretty funny. And good on you to realize that, Matt, that it was like, we cannot power this; this board has become the board of death. We must not power it on.

Nathanael Huffman:

Well, you know, that's kind of one of the perils of bring-up: we have a whole team of people there, and you're kind of single-threaded until you can start handing additional hardware out. But there's always a risk, when you haven't checked everything out, that you potentially have a bug like this that you don't know about. And so you kind of have to balance that. You want to get enough stuff working where you feel like you're safe.

Nathanael Huffman:

But, you know, in this case we kind of got lucky, because had we reworked things or discovered things in a slightly different order, or had Matt not noticed that that was true, it could have gone differently. So we had to go back through and make sure that all the boards, you know, we had seven or eight boards on hand, all had the correct SP image flashed before we got the rework done.

And in this case, we didn't even lose that board. By the end of the week, our CM had replaced the two parts that were damaged. And so that board is back in rotation. No problem.

Bryan Cantrill:

So, and Ellie's asking in the chat about these blown-up boards: are these potential souvenirs? Well, actually, these boards are so valuable that we need to really, really try hard to rework them. Yeah. And these were all reworkable. I mean, you figure,

Nathanael Huffman:

yeah, our FPGA is probably a hundred bucks, and the SP is probably 25 or 30 dollars, and it's probably, I don't know, maybe $200 to have a rework technician replace them, but that's totally worth it for us on these boards, especially because we only built 12 or 14 or whatever.

Bryan Cantrill:

That said, as a reward for Oxide and Friends listeners who have managed to endure this far into what can get pretty technical, I do actually have a bunch of, and those are rev Cs, I think, right? Gimlet rev Cs that are dead, that are literally marked e-waste right now in the office.

Bryan Cantrill:

And we've got, like, I think 10 of them, something like that. So, Adam, we're going to go call-in show. For the first N listeners: if you leave something in the chat here, we'll get you a very dead Gimlet rev C.

Aaron Hartwig:

Might be some Bs there too.

Bryan Cantrill:

Might be some Bs. Yeah. I definitely know the As, though. We've got all the As.

Bryan Cantrill:

The As have got some facial scarring; you know when you're looking at an A. But anyway, these things only become souvenirs when they're obsolete, I'm afraid. But we do have some rev Cs, anyway. And then, Nathanael, we reworked those, or I should say Benchmark reworked them. And we were, I mean, hugely grateful.

Bryan Cantrill:

My God. I mean, I just go back and think about how much rework we did, how much we taxed the Benchmark rework. I mean, obviously, it's a service, and, you know, whatever. But I guess what I really shouldn't be thinking about is our rework bill from Gimlet rev A, but they were great on Gimlet rev A. Yeah.

Nathanael Huffman:

Yeah. I mean, on Gimlet rev A, we didn't have that much stuff done at Benchmark, because we were still so broken after the two weeks. I think Eric and I did most of the rework on rev As. We might've had a few of the inductors replaced, and maybe a couple of chips here and there on a few different boards, but where we really used Benchmark was on Sidecar, wasn't it? Yeah, it was Sidecar B that had the footprint mistake on the power rails.

Nathanael Huffman:

Was that

Bryan Cantrill:

Gimlet B? It was Gimlet B.

Nathanael Huffman:

Yeah. So that's where, I mean,

Bryan Cantrill:

that was a brutal rework. Yeah.

Nathanael Huffman:

And that, that was nasty. And, and, and we're all sitting there waiting. Like once you discover that you can't do anything, you can't do anything until you get those boards back. So, you know, like, please help.

Bryan Cantrill:

Right. So we were very grateful for Benchmark's rework for rev B. We needed much less of it for Cosmo, but it was great. Because these boards, once we got the SP and FPGA back on there, they worked.

Bryan Cantrill:

So which is great. That was really nice to not have to lose the board. Boards are expensive. And then so we're getting ready to begin, sometime Tuesday, an SP5 power-up. Or actually, I guess in parallel now Eric is working with the SDLE and trying to understand what our margin looks like.

Bryan Cantrill:

Right. And he is having some issues getting this thing to emit the telemetry that we need, the SVI3 telemetry, if I recall correctly. Yes. Yeah. And, I mean, unfortunately, he's unable to join us tonight, but I think that Renesas was hugely helpful on this.

Bryan Cantrill:

I think that we needed a firmware update, I believe, was the challenge there. But I don't know, maybe, Dave, you know more details on that one. That problem got resolved.

Nathanael Huffman:

Yeah. That got resolved pretty quickly. And, I mean, it was nice in the sense that I think Eric didn't start SDLE work until extremely late in the second week of Gimlet A bring-up. Whereas in this case, by day two, he had a board kind of to himself. I had to give him an FPGA image that just turned all the supplies on, and then he could go do SDLE. There's a lot of control loop tuning that has to happen there, and a bunch of other things to make sure that we're meeting the margins that we designed for. And the runs take a long time, and, you know, it's kind of like a full factorial test in some cases.

Nathanael Huffman:

So, and you've got to swap load pods, and it's pretty intensive. So he got to kind of sit over in a corner and really work on that for most of

Bryan Cantrill:

And then, again, we weren't having things work originally, and part of the reason that we like this Renesas part is, like, the relationship with Renesas. The FAE at Renesas, Brian Faili, has been hugely responsive and was able to get us what we needed, and I think from a power perspective, we're looking good. And in particular

Nathanael Huffman:

Yeah.

Bryan Cantrill:

We had talked about group G, which, as Eric says, group G is for guzzler, is the highest-consuming part. And we wanted to be able to make sure that we could support a dense... And it was looking good. So great stuff.

Nathanael Huffman:

Yeah, I think sum total from him at the end, I think we had to add two capacitors. Amazing. Not a big deal. So he did a really awesome job getting those designs. This is all a TLVR design, which, I don't know the definition of that acronym, but it's a fancy control loop kind of power design.

Nathanael Huffman:

So

Bryan Cantrill:

Yeah. Yeah. No. Sorry, Ian, I don't know if you've got any additional color on that.

Ian Sobering:

Oh, no. It would have been good earlier, but I think Nathanael covered it. That's all.

Bryan Cantrill:

But so we're looking good there, and it's time to power up SP5. Let's give it a go. And of course,

Nathanael Huffman:

we have high hopes at this point because like this has gone really well.

Bryan Cantrill:

Why do we have high hopes? How many times do we have to have our hearts broken? I mean, I guess it's part of the human condition that you get your heart broken and yet your hopes still get high again. But we did have high hopes. We absolutely had high hopes.

Ian Sobering:

Yes. Just to summarize at this point where you were, because compared to Gimlet A, it's, like, Tuesday night. And in two days we've powered up, I mean, I don't know, we probably have five of the 10 boards running. All the firmware's programmed, all the FPGAs are flashing.

Bryan Cantrill:

We've got all of A2 more or less checked out. I mean, we

Ian Sobering:

We're tuning; he's already, you know, done the tuning for group E, and the only reason group G tuning is not done is because the runs take, like, forty-five minutes each and there's, like, twelve runs or something, but he's working on it. He's had a board since Monday, he's over there, you know, doing the thing. And we're looking at each other going, this is great. All we have to do is put the CPU in.
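To put rough numbers on why the tuning runs dominate the schedule, here is a small Python sketch. The run time and run count come from Ian's estimate above; the factor names and levels are placeholders invented for illustration, since the conversation does not say what the SDLE runs actually sweep.

```python
from itertools import product


def full_factorial(factors):
    """Enumerate every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[name] for name in names))]


def total_hours(num_runs, minutes_per_run):
    """Wall-clock hours for num_runs sequential runs."""
    return num_runs * minutes_per_run / 60


# Hypothetical factors chosen only so that the count matches "twelve runs".
example_factors = {
    "factor_a": ["low", "nominal", "high"],
    "factor_b": ["x", "y"],
    "factor_c": [1, 2],
}
runs = full_factorial(example_factors)
hours = total_hours(len(runs), 45)
```

Twelve sequential runs at forty-five minutes each is nine hours of bench time before counting load pod swaps, which is consistent with one rail group still being in progress on the second day.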

Bryan Cantrill:

Let's go.

Ian Sobering:

Like like, at at this point, like, everything I've worked on on the board is done. And so my job is just like, okay. I'll I'll I'll keep people fed with reworked boards. Like, I'll just do reworks as they come in.

Bryan Cantrill:

Or I do whatever rework Matt tells me I need to undo apparently.

Ian Sobering:

And so I think, yeah, or undo reworks. Yeah. And so, like, yeah, we get our hearts broken, but at the same time, Nathanael pointed out to me, this was months of work to get Gimlet rev A to that point.

Ian Sobering:

And hundreds of like 50 reworks that he did by hand, you know.

Bryan Cantrill:

You're making me feel that optimism all over again. So, all right, we're optimistic, and we power it on, and Nathanael, what happens?

Nathanael Huffman:

Well, so in the sequencing, there are two handshakes with the processor. So we release, like, a low-level reset, and then within so many clock cycles, basically, they should respond back with, you know, two pins should wiggle a different way. And that doesn't happen at all. And this is super early. I mean, it's the earliest signs of life that we have from the processor.
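The handshake Nathanael describes, release reset and then expect two pins to change state within a bounded number of clock cycles, has roughly the following shape. This Python sketch is illustrative only: the function name, the pin encoding, and the timeout value are assumptions, and the real sequencer lives in the FPGA rather than in software.

```python
def wait_for_handshake(samples, expected, timeout_cycles):
    """Scan per-cycle samples of the handshake pins after reset release.

    samples: iterable of tuples, one tuple of pin levels per clock cycle.
    expected: the tuple of pin levels that signals the processor responded.
    Returns the cycle index at which the response was seen, or None if the
    pins never matched within timeout_cycles.
    """
    for cycle, pins in enumerate(samples):
        if cycle >= timeout_cycles:
            break
        if pins == expected:
            return cycle
    return None
```

A processor that responds on the third sampled cycle yields 2; a processor whose clocks never start leaves the pins unchanged, so the function returns `None`, which is the "doesn't happen at all" case.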

Nathanael Huffman:

And so we did a bunch of looking around at things, and, I mean, we bring the sequencing traces out to a header so we can look at them, and they all look correct. And it's like, well, this is so early, maybe our clocks aren't starting. We have crystals on the board because we're following the recommendations of AMD, but it's really annoying to measure a crystal without special equipment, because even by putting a probe on it, you can change the circuit, and the crystal can stop or the crystal can start. And so anyway, the long and short of this is basically our clocks weren't actually starting. And so we needed 10-megaohm or 2-megaohm burden resistors across there, which we had footprints for, because the AMD guidance recommends that you leave space for them in case you need them.

Nathanael Huffman:

So we did have them, and we happened to have, I think, some 2-megaohm and some 10-megaohm resistors in one of the kits that we had brought with us. And so we install those, and it looks like clocks are up and running reliably.

Bryan Cantrill:

How did we measure that? Because, I mean, as you say, this is like the original probe effect: the placement of the probe distorts the circuit.

Nathanael Huffman:

I mean, you stick a scope probe on there and kind of hope, really. And then on these chips, in some cases, they have outputs. So in the AMD case, because they support two-processor designs, you can cascade some of your clocks. And so they have clock outputs that are buffered inside. So they have an RTC clock that comes out.

Nathanael Huffman:

So that's your 32.768 kilohertz clock output. And then you have, I think, a hundred megahertz output from your 48, which, you know, your 48 is another oscillator. That's kind of the core thing that we use. And so we could measure on the outputs in some cases; if you get up past the point where the output should be on, then you could measure the outputs. And so we weren't getting there originally, but, you know, with a little guessing, you could kind of put the probe near the circuit and decide that it wasn't running, and then you could touch it and sometimes it would run.

Nathanael Huffman:

So I think, like, RFK definitely had one point where he's like, yeah, the clock's running, but it was running when he touched it because he was, you know, he was able to get it to start. And so anyway, we did figure that out. So we reworked the boards and convinced ourselves that the clocks are starting. The RTC clock still starts really slow, but we've kind of decided that, I mean, that was true on Gimlet as well. And I think it's because we're weird and we're like the only people in the world who don't have an RTC clock battery.

Nathanael Huffman:

So that that oscillator is actually just slow to start and, doesn't actually meet the, like, startup times of, what you'd expect because it it powered on later in our design because we take power away every time we cycle it. So we got clocks.

Adam Leventhal:

Nathanael, sorry.

Nathanael Huffman:

Yes. I,

Adam Leventhal:

I, I just, I, I dumb guy question here. So when you turn on the power, the clock is slow. It needs to like warm up like an engine?

Nathanael Huffman:

So, I mean, crystals are funky. Right? They're, they're I mean, there's a little thing that's actually vibrating. And so it's it's this, like, funky little analog circuit. And if if you're too burdened or you don't have the right load capacitance, it will start it'll take a long time to actually reach its stable vibration.

Nathanael Huffman:

And the end result of that is like you don't see your clock start. Or like, if we're looking at the output, we don't see, you know, the output is typically gated, so you'll see nothing and then you'll see the clock show up

Adam Leventhal:

and it just takes time. Is it like

Nathanael Huffman:

hundreds of milliseconds to seconds.

Aaron Hartwig:

Okay.

Adam Leventhal:

So, and the way that most folks work around this is just a battery to keep it.

Bryan Cantrill:

They always have a battery.

Nathanael Huffman:

Most, I think most servers out there don't actually turn this rail off in general. So because it's never off, then, you know, the it doesn't matter that it took a really long time for it to start. Gotcha.

Bryan Cantrill:

But, Adam, I mean, it is amazing that this whole thing starts with a quartz crystal, and the whole thing is just very physically remarkable: you're basically whacking this thing and it generates electricity. And I dropped a link to a video that I thought was really interesting about why, because it is very peculiar to silicon dioxide in its crystal form that causes this. And it's really interesting. And you're also just like, wow, this is at the absolute base; this is the earliest you can be in boot, right, having an actual reliable clock. So when you can't measure it, which is the other thing.

Bryan Cantrill:

It's kinda nuts.

Nathanael Huffman:

Yeah. I mean, you can with special equipment, but we just didn't have that on hand.

Ian Sobering:

When we were troubleshooting this, we had one or both of the crystals, it's not clear at this point, where, you know, when you touched the oscilloscope probe to it, you could get them to start oscillating, and when you pulled it away, it would stop. And in the process of, you know, going through this the first time, you know, it's like, okay, well, at this point the board has the CPU installed and it's got the heat sink on it, so you can't, like, flip it over on top, you know, without it flexing a little bit. So we have it stood up on its edge and we're in there poking it with the oscilloscope probe, trying to see if there's any sort of signal on these things, and at some point somebody brushes their hand against the heat sink and goes, guys, it's hot. And we had apparently, just by touching the probe on and off of the crystal, glitched enough clock edges on the crystal to advance whatever internal state machine is in there to start turning on more parts of the chip, and it was getting warm. And so at that point, it's like, okay.

Ian Sobering:

Well, it like, it's a crystal thing. You know, we will get farther if it glitches itself, you know, at the frequency it's supposed to run instead of us poking it and, you know, then things proceeded from there.

Nathanael Huffman:

Yeah. Well, and in some cases, like on the RTC clock, there's probably a small amount of logic in their design that runs on that. And so after you get through that sequencing, I mean, something else might be broken, but the remainder of the sequencing may still happen if your 48 started.

Bryan Cantrill:

So we spent much of it in the next day getting the crystal rocking. Yeah. Yeah. And so we we get the crystal rocking, and we we think that we're getting further, but we're we're still

Nathanael Huffman:

not We definitely are further. I mean, at this point, like like Ian said, now that we can reliably make the processor hot, but I'm still I'm still claiming that we're not sequencing. And so like now we've sequenced past that first handshake and we're up at kind of the final handshake. So the way this process works is you have this early handshake and, they give you back these, you know, SLP signals. And then, we do, you know, we turn some more rails on and do some more things and then we tell them, Hey, your power is good.

Nathanael Huffman:

So we're telling them all of the rails are up and ready to respond. And then the AMD goes out and talks to the power supplies it can control and maybe has some internal monitoring. And so then it handshakes back with a power okay. So then we know it's agreed, like power's good. And then the very final thing is it releases its reset pin.

Nathanael Huffman:

And so we have visibility to the reset pin and it should release that. And my state machine is claiming that we are not getting through that state. So we're not seeing reset release, which, if you remember from previous history is pretty much, I mean, it's like right there where we were with Gimlet A. Although Gimlet A, we would see it release and then it would reset, you know, at a frequency.
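
The handshake sequence described above can be sketched as a small state machine. This is a minimal illustration in Python, not the actual sequencer (which is FPGA logic); the state names and input names such as `slp_received` and `reset_released` are assumptions chosen to mirror the description.

```python
from enum import Enum, auto

class Seq(Enum):
    EARLY_HANDSHAKE = auto()     # waiting for the SLP signals to come back
    RAILS_ON = auto()            # turning on the remaining rails
    WAIT_POWER_OK = auto()       # told the CPU power is good; waiting for its power OK
    WAIT_RESET_RELEASE = auto()  # waiting for the CPU to release its reset pin
    RUNNING = auto()

# Hypothetical input names: the signal that lets each state advance.
ADVANCE_ON = {
    Seq.EARLY_HANDSHAKE: "slp_received",
    Seq.RAILS_ON: "rails_good",
    Seq.WAIT_POWER_OK: "cpu_power_ok",
    Seq.WAIT_RESET_RELEASE: "reset_released",
}

_ORDER = list(Seq)
NEXT = {s: _ORDER[i + 1] for i, s in enumerate(_ORDER[:-1])}

def step(state, inputs):
    """Advance one state if the signal this state waits on is asserted."""
    signal = ADVANCE_ON.get(state)
    if signal is not None and inputs.get(signal, False):
        return NEXT[state]
    return state

def run(trace, state=Seq.EARLY_HANDSHAKE):
    """Feed a list of input snapshots through the sequencer."""
    for inputs in trace:
        state = step(state, inputs)
    return state

# The situation described: every handshake succeeds, but reset never releases.
stuck = run([
    {"slp_received": True},
    {"rails_good": True},
    {"cpu_power_ok": True},
    {"reset_released": False},
])
```

In this sketch `stuck` ends in `Seq.WAIT_RESET_RELEASE`, which corresponds to the state machine "claiming that we are not getting through that state."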

Bryan Cantrill:

Right. So we're beginning to get some, beginning to get ready for the kind of the long grind here in terms like, okay, this could be this could be a while as we what do we kind of explore from there?

Nathanael Huffman:

Well, so, I mean, there were a number of things. So we're, we're still a little concerned about clocks because it's like, you know, yeah, I mean, they look like they started, everything's fine. So we put some, we put some things in there. We also, also put So I changed some of the FPGAs. So we have some debug headers.

Nathanael Huffman:

And so I brought some of my internal signals out to the debug header so we could watch the sequencer run on it's like as it sequences. And, you know, like there again

Bryan Cantrill:

Is now the right time to talk about super dongle or what what we call mega dongle? What do we call that? Dongle source? I'm not sure. Do you have a name for that dongle?

Bryan Cantrill:

Oh, yeah. That's the Oxide

Nathanael Huffman:

programming adapter. There's an RFD on it and everything.

Bryan Cantrill:

Do we sort of more a colloquial name for it?

Nathanael Huffman:

The because this this is It's programming adapter.

Bryan Cantrill:

I It's a mega dongle is what it is. It is it's got it's a dongle with all the trimmings.

Nathanael Huffman:

One of the things that we have discovered over time, you know, and it's like we probably knew this at the beginning, but it was just one more thing to change that we weren't gonna change, especially in manufacturing, is installing all of the dongles on different headers. So we have like an SP dongle and a UART dongle and an ignition dongle. So we have all of these little programming adapters that we have to connect to discrete headers. So for our manufacturing partner, when they manufacture a new Gimlet, there are like five cables they have to connect in order to program the thing.

Nathanael Huffman:

And that's fairly annoying. And so, one of the things that we had also proved out on grapefruit was this, this programming adapter. So we basically just consolidated all of those into one little tiny circuit board And then we have a higher density connector on the, on the actual gimlet and on the grapefruit that, and I think on mini bar as well, that, that use this like consolidated thing. And so you just plug this thing in. But what that allows is that allows our contract manufacturer to leave all of like all of their stuff connected to this little board.

Nathanael Huffman:

And then this little board is just one connection to our main board. So we have that, which is one of the things. Yeah, it's super nice. We also do have some additional debug headers. So we have like our standard little Samtec debug headers sprinkled out on the board in a few places.

Nathanael Huffman:

And those, like, I've got 16 pins of debug header connected to the Spartan-7, so I can bring 16 different signals out on two different voltages, which is super helpful. We've used that a ton in bringing this stuff up. So we do all of that and it's kind of like, yeah, well, I don't know, reset's still zero. Right? And, I mean, we tried a lot of things and we went a bunch of different ways because, you know, I was making changes to the FPGA.

Nathanael Huffman:

We're trying to bring various things out and like you start to get this sinking feeling that like you're feeling FPGA build to build variability. That's like as an FPGA person, that's a really bad place to be because that means probably like an assumption you have is not valid anymore. And like that assumption, the tool should be telling you these things, but like if you've done something wrong or you have improper timing constraints or that kind of thing, like those assumptions don't hold. And so it was kind of feeling like that for a while. And, but I'm like, I'm looking at the design and I'm like unable to find anything that interesting there.

Nathanael Huffman:

None of this stuff is running super fast. I'm looking at, you know, my, my build logs from, you know, a build that like nominally would sequence and one that wouldn't, because we did at some point get one to sequence. Fact, we got one to Yeah. That's huge. It was huge.

Nathanael Huffman:

And so we're like, oh look, I saw eSPI wiggles. And so then Robert and I thought we would be cute and not tell anybody. And so I built a new FPGA that had the ILA in there so we could do more FPGA stuff, or we could look at some of that

Bryan Cantrill:

stuff because you have embedded logic analyzer.

Nathanael Huffman:

Yep. That's a Xilinx tool that you can compile in, in Vivado. And it basically just gives you a little, like, ModelSim-type, you know, wave view of things, and it samples into RAM in the FPGA and spits it out over JTAG.
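
The idea of sampling into RAM around an event of interest can be sketched in a few lines of Python. This is only a conceptual model of a triggered capture buffer, not how the ILA is implemented; the function name `capture` and the parameters `depth` and `pre` are made up for illustration.

```python
from collections import deque

def capture(samples, trigger, depth=8, pre=3):
    """Return up to `depth` samples around the first one where trigger(sample)
    is true, including up to `pre` samples from just before the trigger."""
    history = deque(maxlen=pre)  # small ring buffer of pre-trigger samples
    stream = iter(samples)
    for sample in stream:
        if trigger(sample):
            window = list(history) + [sample]
            for later in stream:
                if len(window) >= depth:
                    break
                window.append(later)
            return window
        history.append(sample)
    return None  # trigger condition never occurred

# Example: capture around the first sample equal to 10.
window = capture(range(20), lambda s: s == 10)
```

The captured window is what would then be read back (over JTAG, in the real tool) and displayed as a waveform.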

Bryan Cantrill:

Which is super cool. And I had we had talked about this. And this is like a great thing about soft logic is the fact that you've got the ability to do this. And I think, you know, this is one of the things we're gonna be able to leverage. I mean, there were a bunch of reasons why we were using we're not able to use kind of open FPGAs.

Bryan Cantrill:

We have to kind of use the Vivado toolchain, but it's like, hey, one of the nice things about this is we would have this particular tool. I had never seen this, but I was super curious about it.

Nathanael Huffman:

Yeah, and you had missed our... So I had already pulled this tool out to help debug our FMC problem, but you, due to your United delays and other fun, had not managed to see that. So you had only heard the story, so you wanted to see it. So I was like, hey, this is great. We're going to bring Bryan over here. We're going to show him the ILA and the little thing.

Nathanael Huffman:

Then rapidly through the course of our conversation, you're going to figure out that we're actually booting. Right? And so like, this was like my,

Bryan Cantrill:

you know, gotta say like, this is like cracking the seal on something pretty dangerous. You know, this is like, you are introducing like pranking in the bring up lab. I mean, I just think handle this carefully. I guess this is true. I guess this is what oh, is this pranking or is this like this is like kinda oxide demo bit?

Bryan Cantrill:

I mean, I agree with you.

Nathanael Huffman:

You're great learn. Learn Showmanship.

Bryan Cantrill:

Yes.

Nathanael Huffman:

Yeah. We learn from, you know, local horse not to prank you at bring up.

Bryan Cantrill:

So

Nathanael Huffman:

at any rate, I get a new FPGA loaded in there. We call you over. We're going to be super cool. And then we get

Bryan Cantrill:

another sequence. By the way, as it turns out, I am an absolute mark. If you're like, oh, hey, I want to show you the embedded logic analyzer of the FPGA, I will willingly, like, the fish chomps down on the bait. I've got no clue that there's...

Nathanael Huffman:

We owed it to you. Right? We were gonna show it to you. So it was like, it was gonna be kind of a cute way to show it. And then also, like, you know, get the cheers of, hey, we're booting.

Nathanael Huffman:

Except that when I I built that FPGA, it didn't boot.

Bryan Cantrill:

So you're showing this boot to me and you are obviously very disappointed. I'm like, okay, this seems like I don't know. Yeah, seems interesting. Right.

Nathanael Huffman:

Yeah. And I mean, yeah, it was doubly sad because nothing happened, but you didn't think anything could happen because we're still having problems sequencing. You're kind of like, well, what did you expect? Goofy people. And we're like, well, actually we did boot once or we started to boot and you know, something happened.

Nathanael Huffman:

So I'm like, you guys

Bryan Cantrill:

you mean, like, you had an SP5 over here booting and your first thought was like, let's go. I was, you know, I was impressed. I was impressed with the kind of emotional regulation, frankly.

Nathanael Huffman:

It had started booting and then it stopped and we didn't know why. And so we were going to need the ILA anyway, but it was just so at any rate, so it, it doesn't boot. And so now we're into this world where, like, I have an image I have an image that I have made that seems to, like, somewhat reliably sequence. Well, was so much

Bryan Cantrill:

you didn't boot. You explained to me, like, what what happened? I'm like, yeah, I swear. So then we put the old image back on.

Nathanael Huffman:

Yep.

Bryan Cantrill:

And that image, sure enough, does get the eSPI wiggles. Yes. Yeah. And you're like,

Nathanael Huffman:

oh, no. And it's like It

Bryan Cantrill:

was annoying.

Nathanael Huffman:

Like, yeah, I know. Because, I mean, these are problems that you should be avoiding by design in general when you're doing FPGAs and, you know, I've been doing FPGAs quite a while, and so it's never a good feeling because now you're into, like, is my toolchain lying to me? Did I miss, you know, some kind of clock crossing somewhere in the design? But, again, all of the sequencing is in a single domain and it's like 125 megahertz and it's not very complicated. And I've got all the pins coming out and I can see that I continue to, you know, wiggle the right way and everything.

Nathanael Huffman:

So, you know, then we're like, okay, so I'm, you know, going through build logs, all the stuff, and really can't figure it out. We get another build, put more things on the debug header. We get a build now where, when I have a Saleae connected to the debug header, we seem to sequence. And when I pull the debug header all the way, we don't sequence.

Nathanael Huffman:

Wasn't it the opposite? Or maybe we put it on.

Bryan Cantrill:

Yeah, we wanted to get the Saleae on there to understand, like, okay, we now need to look at these eSPI wiggles and understand exactly where we are so we can understand why we're not making further progress. Like, oh, get a Saleae on there, and it's like, well, now I don't boot at all. Now I don't do anything. Right. You're

Nathanael Huffman:

right. And and it's complicated because every time we every time we send something out to the the debug headers, we're getting a different FPGA build. And, but it's, it's like kind of weird because like I can go back and rebuild those builds and I can rebuild those builds with like other minor changes and not make any change. So it's not like, it's not total build to build variation, but like there's something strange going on here. And then we're like, well, maybe, you know, maybe there's a floating pin somewhere else.

Nathanael Huffman:

And so, you know, loading the bit stream or like, I mean, anyway, we went, you know, like, it's like the stages of grief, you know, you go to bargaining and you know, all the other things. So

Ian Sobering:

Okay. So so

Bryan Cantrill:

put this sadness in the oven here. Yes. Because we do actually have another opportunity to validate things without the SP5 working. Because, Ian, you'd been working with Matt and Aaron on getting minibar fully functional, and we talked about minibar a couple episodes ago, but getting minibar fully functional for Cosmo, which is gonna be very exciting.

Ian Sobering:

Yeah. So one of the things that was happening in parallel with trying to get the SP5 to boot was, you know, because it's Tuesday, and right about this time, Benchmark releases all the rest of the minibar boards to Doug and me. Doug has the big mechanical chassis there for the first time all assembled, which is really cool. He's got one of the minibar light units there. And so, you know, I do some quick checks.

Ian Sobering:

There was one more rework we had to do on them, just pulling off a bunch of ESD diodes that had been installed backwards. And then they were ready. And so Matt and Aaron and Laura jumped on them, got all the software loaded, and I don't know how long it took, to me it was like half an hour because I was, you know, working under the microscope, but it seemed like pretty much immediately they popped up and they're like, we're done, it works. And so we sort of look at each other and go, well, let's go get one of the flight test Gimlets, you know, from the production floor. Let's put a sled in it and see what happens.

Ian Sobering:

And so they go get get a sled and shove it in minibar and some LEDs come on and Aaron goes, oh, yeah. We can see, you know, we can see both the ignition links. And, you know, someone you know, Matt grabs his laptop or Aaron grabs his laptop and plugs into the ethernet port on the back, and we go, boop, I can talk to the SP. The management network's working. And so we sort of look at each other and go, well, we could get a Cosmo and put it in there.

Ian Sobering:

And so the Gimlet comes out, and one of the Cosmo boards that, you know, doesn't have a CPU in it, but it has everything else on it, goes in there and LEDs come on and we plug the laptop in and, yep, management network's working. And so we were able to really quickly check out, you know, something that we wouldn't have been able to get to until we had a rack dedicated to that purpose. So it was really cool.

Bryan Cantrill:

Yeah. And, well, while Nathanael is kind of weeping over on the SP5, the minibar is having a wild success over here. Nathanael, sorry to be, hopefully you've, but it was actually great to have, and I just dropped in a picture of kind of team minibar in there as we got it. That was great to have. Because, Matt, we were able to really validate the entire management network on Cosmo.

Bryan Cantrill:

And, Aaron, because I think you plugged in your laptop. It was pretty great.

Matt Keeter:

Yeah. Basically, it went from Aaron's laptop to the RJ45 jack on the back of minibar, to a PHY, to the management network switch, into the Gimlet over the backplane connector that normally goes into the rack backplane, and then through another PHY and then out to the SP. So, yeah, basically the full chain.

Bryan Cantrill:

For Cosmo as well, which is great. Yeah. That was and Aaron, what was it with the the work that was involved for that to get to in order to get that kinda all the way over the line?

Aaron Hartwig:

Honestly, not all that much. I feel like we had two snafus. One was I had a bug, an RTL bug with the button, that tripped me up for far too long. The power button that actually controlled power to and from the sled. And I feel like we had... I'll just go look at the commit history to figure it out.

Aaron Hartwig:

There was one other thing that we fixed. I think, Matt, the SPI interface maybe didn't work immediately out of the box. But for the most part, the goal with minibar was to just reuse as much software as possible. So it was just kind of like taking down blocks. I think, Matt, you maybe had the VSC parts configured in a new-to-us way.

Aaron Hartwig:

Yeah. But otherwise, lot of

Matt Keeter:

it's actually, the PHY was running in SGMII to, like, RJ45 normal Ethernet mode, but luckily, that is, like, the most common use that 99% of people are using for these PHYs, so it's a lot less weird to bring them up like this.

Bryan Cantrill:

And so that was great to have just all of that. I mean, it was just very validating. I mean, like, literally and emotionally. And, Ian, it must have felt great for you to see minibar in its intended purpose. And of course, Doug had this just absolutely beautiful enclosure for the thing. I mean, it's amazing.

Bryan Cantrill:

So that was... but Nathanael, meanwhile, now to take your sadness out of the oven. It felt to me like, because you get to the point where you are seeing true eSPI wiggles in the logic analyzer, it definitely felt like at that point, that was a very important "we know this thing can work." There is something that is preventing it from working reliably, but this thing actually can work. Right.

Nathanael Huffman:

Yeah. And it's weird because we would see eSPI wiggles start, and, like, I've looked at a lot of eSPI traces with all of the work on Ruby. One of the neat things we did with Ruby is we broke all of those traces out to a header so I could see the eSPI. So I've been looking at eSPI traces for the last two months or more. And so I know what they're supposed to look like.

Nathanael Huffman:

And, like, I see valid transactions. I see, you know, exactly the start of what's supposed to happen, and then things just stop. And so, you know, it doesn't look like my logic is sending anything wrong. It doesn't look like the SP5 is asking for anything wrong. It just stops.

Nathanael Huffman:

And so that's weird. And like we spent a lot of time building a lot of FPGAs and a lot, and like that's kind of where we ended bring up for in terms of Cosmo because it was like, well, this can work. We have something we have to go figure out. And, so Eric and I are gonna take sleds home with us and we're gonna hold the rest of them there and let them start, getting all of the rework that we've identified, you know, which, which isn't much. I mean, we have maybe we have nine or 10 little reworks that need to happen.

Nathanael Huffman:

And then we'll try to figure out this, identify any other further rework and then get these sleds distributed out to the team.

Bryan Cantrill:

Well, and then the fact that it can work, mean, how did you what was your anxiety level kind of coming out of that first week? Because my I mean, I mean, obviously, there is an issue here. I just feel like this may take a little while, but I felt like unlike with Gimlet where we definitely didn't know it can work. Yeah. In fact, it didn't work.

Bryan Cantrill:

And it's similar with the T6, when we had all the frustrations of the T6; as Aaron was saying earlier in the chat, like, "Won't You Come Out of Reset: The Oxide Story." Right. I mean, so it felt like there is something amiss clearly. Something is floating somewhere. Yep.

Bryan Cantrill:

But what was your little kind of level of anxiety coming out of it? I mean, it just feels like, you know, we just need a couple of like good nights of sleep and, you know, a little bit of a restart and it feels like it's not going to take too long to nail this.

Nathanael Huffman:

Yeah. I wasn't too worried about it. And I mean, like there are answers to all of these things and we will find them. And after, I mean, while it was a great five days out there, it's five pretty long days and you're kind of on the whole time. And so it's helpful to just like, set things down.

Nathanael Huffman:

We, you know, we had to take the afternoon to pack the car up basically. And, you know, like we bring most of our lab equipment out there, Eric and I do. So, you know, there's, there's a good bit of stuff that loads back into the van as we get ready to head home. And you know, it's like, we'll get these boxed up, take them home, you know, take the weekend to recover. And then on Monday we'll hit it hard again and you know, find some answers.

Bryan Cantrill:

Okay. So what happens on Monday?

Nathanael Huffman:

So on Monday, Monday at like it's, so it's interesting cause I'm like, man, I really want to get working on this, but also my entire lab is in boxes and like I've got to put enough stuff back together so that I can like solder if I need to, or, you know, I need to get a scope out or my DMM. So I took kind of the morning to unpack enough stuff to where I was like, okay, I feel like I can sort of work again. I still had some stuff. I still have a half of the lab packed up, but it's like, I'll source that stuff if I need it later. And so, I get to building and I go back and it's like, okay, there are a few different things.

Nathanael Huffman:

It could be environmental. So does it even, if I take this image that seemingly does boot or seemingly starts trying to boot, does that still happen at my house? It does. So it's like,

Aaron Hartwig:

that's good.

Bryan Cantrill:

That's good.

Nathanael Huffman:

And if I take a nominally broken image, then it is also broken. And so it's like, okay, that's good. And so we're kinda like, I just want to rebaseline everything, relook at everything. And so I spent, gosh, probably most of the afternoon messing around with that. And man, I, it's been such a wild week now.

Nathanael Huffman:

I'm trying to even remember. I think I did discover the issue by that evening or was it Tuesday?

Bryan Cantrill:

Oh, you definitely discovered it that evening. Yeah. Start that if you don't remember where you were when you discovered the issue because I definitely remember where I was when I saw you drop in the chat.

Nathanael Huffman:

Yeah. So, you know, it's one of those things where you're like, okay, I just need to go back and look at like absolutely everything again. So, I'm not gonna look at clocking anymore. We're pretty sure that that's fine. Like this is feeling like it's an FPGA problem.

Nathanael Huffman:

Like what in the world could I be doing wrong in the FPGA? And so, I go back looking through and discovered that the reset pin that I'm looking at is supposed to be an input to me. And I have it configured as an output. So that is like not great. And, but then I'm, but I'm looking at it.

Nathanael Huffman:

And, like, there's some VHDL inside baseball here, but it's an output, but it needs to be tri-stated because, well, so it's an input, so it doesn't really matter. I had it as an output that would be tri-stated, because in general, all of the things going to and from the SP5, we try to tri-state so we don't cross-drive our buffers and that kind of thing. So I had made this a tri-state output. And so after I found that, I'm like, okay, that's a problem. I definitely need to look at that as an input, but given that, I don't understand how we sequence, because how could we ever sequence?

Nathanael Huffman:

Because I don't understand. And one of the things I had noticed during the week was that there seemed to be sensitivity to what I put on my debug pins. And I was a little bit lazy and I made my debug pins inouts and made some assumptions about how the synthesizer and the tool should behave. And it's unclear to me if my assumptions were incorrect or if the tool has a bug. I still haven't gotten to ground on that yet, but the long and short of it is I have an undriven... So, like, deep inside my design, this pin is an input, but up at the very top level, I made it an output.

Nathanael Huffman:

So what you effectively have is a signal in your design that has no defined driver. So when the synthesizer runs, you will get a warning for this. But, like, if you've ever looked at FPGA tools, there are lots and lots and lots of warnings. Anyway, so this reset line has no driver inside the design. Up at the top, it's an output.

Nathanael Huffman:

Usually what I would expect the synthesizer to do at this point is, it will pick a value for you and it will throw a warning that says, you know, I picked a value for you. That value is typically zero in most of, the stuff. Like most synthesizers will just pick a zero for you and they'll drive a zero out for you. Yeah. And like that zero and like the way I had my design done, that zero inside the design should drive a zero out and effectively I would have held the processor in reset.

Bryan Cantrill:

Right. It would have never worked at all. It would have

Nathanael Huffman:

been, presumably we would have gotten here much faster, I think. And that's kind of a normal thing in a 2P system. So if you have two processors, this line typically gets shared between the two so that the system comes out of reset when they're both happy. And so it's totally fine for, you know, the SP5 to happily look at the signal and just not continue in its process until, you know, its friend is happy, but its friend was the FPGA. So the interesting thing is, because of what I did with the debug connector and because of how Vivado connected all of this stuff, it basically took the input buffer from my debug pin and drove that net inside my design that was not driven otherwise.

Nathanael Huffman:

So it basically wired an input and that's surprising to me. Spent It's

Bryan Cantrill:

like enemy action. Yes.

Nathanael Huffman:

So I fixed this first and immediately everything, like, sequences. I mean, boom, we're good. And so I'm like, okay, this is great. But now I need to understand how in the world did this ever work? And so, one of the things these tools do is they'll give you a little schematic that kind of shows you how your logic is.

Nathanael Huffman:

So you can see that in the bug that's linked there. In Quartz, there's issue number 333. And Vivado shows you what it did, and it did exactly what I just described. It connected the input buffer of my debug pin to this internal net that had no driver in my design. So we did have a floating pin.

Nathanael Huffman:

That floating pin, though, was my debug pin. And when we would touch it or do other things, we would occasionally allow my signal to tri-state. And when my signal tri-stated, the processor, which was out of reset then, would continue out of reset like it should. And I made a simple reproducer.

Nathanael Huffman:

There's a reproducer in that bug now. And we'll see what, you know, Vivado has to say about that. I did run that in Quartus, so the Intel tools, and I got the circuit that I would have expected there, which is driven zero. So I would expect the internal logic to have picked that, and we would just drive all the things zero.

Nathanael Huffman:

That makes sense to me. I can make Vivado do that if I give my internal signal a default value, and that's what I sort of would have expected here. So I have a request out just to figure out if that's spec behavior. I did kind of go into the language spec a little bit, but I'm decidedly not a language spec expert, and trying to get to ground truth on that is more time than I had this week to continue looking at that.
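
The behavior described here can be sketched in a few lines. This is a minimal model, not the actual Cosmo logic: it assumes the shared reset-release line behaves like an open-drain signal that is high only when nobody is pulling it low, and the names `Drive`, `shared_line_level`, `sp5`, and `fpga` are hypothetical, chosen only for illustration.

```python
from enum import Enum


class Drive(Enum):
    LOW = 0       # participant is actively pulling the line to ground
    RELEASED = 1  # participant is tri-stated (high impedance)


def shared_line_level(drivers):
    """Resolve a shared line: it reads high only if no participant pulls it low.

    Assumes an external pull-up brings the line high once everyone releases it.
    """
    return 0 if any(d is Drive.LOW for d in drivers.values()) else 1


# The processor is "happy" and has released the line, but the FPGA's
# undriven internal net ends up holding the line at zero.
stuck = {"sp5": Drive.RELEASED, "fpga": Drive.LOW}

# Touching the debug pin occasionally let the FPGA output tri-state,
# at which point the line could rise and the processor continued.
freed = {"sp5": Drive.RELEASED, "fpga": Drive.RELEASED}

print("FPGA driving low ->", shared_line_level(stuck))
print("FPGA tri-stated  ->", shared_line_level(freed))
```

The point of the model is that one participant holding the line low is enough to stall everyone else, which is why an accidental connection to a floating debug pin could make the board work on some attempts and not others.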

Bryan Cantrill:

But this is also one of those, because, I mean, the thing that is so interesting about soft logic is it has these properties of both hardware and software. I mean, truly. Right? And the issue you had has this kind of software pathology of, like, I changed something seemingly innocuous and it is reproducibly now correct or reproducibly incorrect. And, you know, Adam, in our group, many, many times you're like, I know there's gonna be a single bug that explains all of this behavior, but goddamn it if I've got any idea what that could possibly be.

Bryan Cantrill:

And this feels like certainly an FPGA variant of this. Because ultimately, Nathanael, this explained everything. It was just like, yes, I guess if the devil is in here actually hooking my debug header up to my reset pin, I guess it all does make sense. Remarkable.

Nathanael Huffman:

Yeah. And it's certainly an area that, as FPGA designers, makes us uncomfortable as well, because there is a lot of magic going on here under the hood. And so we have all these rules that we follow, and we put timing rules in and that kind of thing, to make sure that the variability that is inherent in this process doesn't end up being a functional change for you. And so when you think you have followed all the rules and you still see build-to-build variability, you know, you hear the Jaws theme song and you know that probably you've done something wrong, but it makes us pretty uncomfortable.

Bryan Cantrill:

Well, I think also one of the challenges I just got from you, when you were just talking, is that part of the challenge with this tooling is that you also get a lot of warnings that you need to ignore, an immense amount. Which makes it then really tough when it's like, oh, I'm supposed to know to heed this because there's one load-bearing warning in this sea of warnings that I need to ignore. I mean, boy, that's really tough.

Nathanael Huffman:

And interestingly, because of the way this worked, my runs that work and my runs that don't work often have the same set of warnings. But it's just like the warning apparently means that they made this circuit and this circuit was not what I intended. So

Bryan Cantrill:

Right. Right. Right. Right. Yeah.

Bryan Cantrill:

Absolutely brutal. Well, but ultimately, I mean, you went back to kind of first principles and you got this thing nailed pretty quickly. And now we're really at this point, and this is last Monday, so a week ago, where things really begin to cook, because I think you got Unix up in like a day or two. Yeah.

Nathanael Huffman:

Yeah. I mean, not even. I feel like it was almost hours, because once this works, then, I mean, everyone else had all their, you know, we had ramdisks and we had all the things. So by Tuesday morning, I think, or Tuesday sometime, we were booting.

Bryan Cantrill:

I do like the fact that, you know, my first reaction to this, aside from just the verbal exclamation when you nailed this issue, was I did say that I appreciated you considering our need to provide podcast content in

Aaron Hartwig:

the Right.

Bryan Cantrill:

Right. I mean, it's great content. I gotta say.

Nathanael Huffman:

That's right.

Bryan Cantrill:

And so then we hit the SPD issue. You kind of made a passing mention of that. That took you and Aaron maybe a day or two to get resolved, and it turns out we needed, what, stronger pull-ups?

Nathanael Huffman:

Yeah. So the SPD I3C bus runs at like 1.1 volts or something. And so we have to level translate out of the FPGA for that because the FPGA only goes down to 1.8. And so we use these level translators, which we had put on Grapefruit. But again, by the time we got to doing SPD stuff on Grapefruit, Cosmo had already taped out.

Nathanael Huffman:

So Aaron had done a ton of work there. I mean, you might want to talk through some of the, you know, terrible VNC shenanigans into my basement. But he had done a bunch of stuff there, and we just couldn't convince ourselves that everything was totally working. And so, I don't know, Aaron, you want to fill in any details there?

Aaron Hartwig:

Yeah. I mean, we'd been booting off of Grapefruit. And I'm, like, tunneling into your house, SSHing through machines to use a Saleae that's hooked up to, like, I don't know, a couple of buses. It's really painful.

Aaron Hartwig:

But it's better than

Bryan Cantrill:

driving. Yeah,

Nathanael Huffman:

I have what we call a Butler in the basement. So that's an illumos box that has the Humility toolchain on there. But we're not running on illumos. And so I have another Linux machine.

Bryan Cantrill:

I mean, we're barely running it on Linux.

Nathanael Huffman:

Yeah. Running Saleae. And so then, with a little bit of SSH port forwarding shenanigans, Aaron could reverse tunnel a VNC connection from my Linux box out over the VPN over to his house and get a very slow graphical display.

Aaron Hartwig:

Yeah. So that was rough.

Bryan Cantrill:

Yeah. That was enough to nail it.

Aaron Hartwig:

Well, that was how we were doing some of the initial development. And for the weeks that I was working on this in my spare time, I felt like I was just chasing ghosts because of the voltage translation problems and the fact that, I think we had touched on this in the beginning, we're not allowed to know how to change the muxes on the Ruby. So we have to hack in these SPD signals, because we're basically putting ourselves in between the processor and the DIMMs, which resulted in, due to all the mechanical form factor stuff, multiple feet of wire that we're running the I3C bus over. And so there were just so many levels of jankiness that it was tough to have high confidence in some of the weirdness we were seeing. And it got to the point where we're just like, well, we'll try it on Cosmo, and we'll just make it so we don't try to play any tricks.

Aaron Hartwig:

We just let the CPU talk to the DIMMs and we get out of the way. And that was working just fine on Ruby and Grapefruit. So we expected it to be the same on Cosmo. And it wasn't. And we spent some time, I mean, we were looking at ILA traces.

Aaron Hartwig:

We were looking at the Saleae again and probing and just seeing strange things where it would work. It smacked of Quartz 333.

Bryan Cantrill:

Yeah. Interesting.

Aaron Hartwig:

You could drop out our SPD proxy logic and watch the CPU talk just fine to pins that were effectively tri-stated. But as soon as we wanted to try to get in the mix a little bit, things would kind of maybe wiggle once and then get stuck. And yeah, I mean, the problem ended up being, Nathanael, I think you were probing power supplies, and we were making sure that everything was connected, because across both buses we saw that one of the SDA lines was just hanging low, basically, when we'd have our image installed.

Nathanael Huffman:

Yeah. On Cosmo, the end result was basically that once an I2C line went low, it never came high again. Or sometimes it did. And so the level translators that we're using, they say they have 10k pulls on them, but if you read their datasheet, they talk about those 10k pulls really being high keepers. I2C is complicated, or I3C is especially complicated, because you could active drive high, active drive low, or you could run in I2C mode. And the level translators kind of suggest that their 10k's are there for high keepers. So once you've driven it high, they'll keep the bus high.

Nathanael Huffman:

But maybe they don't function totally as actual, like, solid pulls.

Aaron Hartwig:

Yeah. I mean, the funny slash sad thing about all this is that before we ever went out to bring-up, we expected we were going to need to do this. And then we spent another day looking at this from end to end and then arrived at the thing we already were suspicious of that we would need to go do. Because of the jankiness of our Ruby Grapefruit hack, we were doing some things like, okay, enable the internal pull-ups on the FPGA. And actually, let's active drive this bus because we're driving three feet of wire. And part of our application is us trying to also use the bus at a higher speed.

Aaron Hartwig:

So getting some of those things out of the equation led us to actually look at the physicality of what was going on because it didn't seem to be working as it should have.

Nathanael Huffman:

Yeah. So, I mean, I did a bunch of builds. Aaron and I chatted some on Tuesday, and finally I was like, okay, I think I'm just gonna break out and put the 2k pulls down on all this stuff. And let's just see if that moves the needle, because this is sure looking like a floating bus that isn't really doing the right thing. And so the way we have this, we've got two buses from AMD coming to the FPGA.

Nathanael Huffman:

And so that's a level translator for each of those. And then we have two buses going from the FPGA out to the DIMMs. So that's two more level translators. So you end up having four level translators and you have ins and outs on each one. So that gets you, eight, eight pulls per level translator and we have four level translators.

Nathanael Huffman:

So anyway, there's, you know, a bunch. Or maybe four pulls per level translator. Four pulls per level translator and four level translators, so we have 16 little resistors. So I was able to tack 16 little 2k resistors down on the back side.

Bryan Cantrill:

Yeah. Okay.

Nathanael Huffman:

So I just picked, like, 2.21k. That's a pretty standard 1% resistor when you want a nice strong pull. And this stuff is all at, like, 1.1 volts, or it's 1.1 on one side and 1.8 on the other side. So that seemed like a reasonable choice, and we want this to run pretty fast. And on the FPGA, if we want it to go even faster, we can active drive it if we need to.

Nathanael Huffman:

And then sure enough, of course, I put this in and then, boom, it booted. So that was Tuesday afternoon, I think.
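
The arithmetic behind the rework can be sketched as follows. This is a rough illustration under stated assumptions, not a measurement from the board: the 50 pF bus capacitance is a made-up placeholder, the 10k "high keepers" are treated as ideal resistors (which, per the discussion above, they may not be), and the function names are hypothetical. The 30% to 70% rise time of an RC-pulled line is R * C * ln(0.7 / 0.3).

```python
import math


def rise_time_30_70(r_ohms, c_farads):
    """Time for a line pulled up through R into capacitance C to go from 30% to 70% of the rail."""
    return r_ohms * c_farads * math.log(0.7 / 0.3)


def sink_current(v_rail, r_ohms):
    """Current a device must sink to hold the line low against the pull-up."""
    return v_rail / r_ohms


# Counts from the discussion: four pulls per level translator, four translators.
PULLS_PER_TRANSLATOR = 4
LEVEL_TRANSLATORS = 4
total_resistors = PULLS_PER_TRANSLATOR * LEVEL_TRANSLATORS

# ASSUMPTION: 50 pF is a placeholder, not the real capacitance of the Cosmo bus.
BUS_CAPACITANCE_F = 50e-12

print(f"resistors to tack down: {total_resistors}")
for r in (10_000, 2_210):
    t_ns = rise_time_30_70(r, BUS_CAPACITANCE_F) * 1e9
    print(f"{r:>6} ohm pull: rise time about {t_ns:.0f} ns")

# Worst case for sink current is the 1.8 V side of the translator.
print(f"sink current at 1.8 V: {sink_current(1.8, 2_210) * 1e3:.2f} mA")
```

Whatever the real capacitance is, the rise time scales linearly with the resistance, so going from 10k to 2.21k shortens it by roughly a factor of 4.5 while keeping the sink current under a milliamp.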

Bryan Cantrill:

Which was great. I mean, now, in particular, the DIMMs are training. Because you need to have the SPD working so the thing can see the DIMMs. But now it actually has to train the DIMMs, and this is where we are around the dark side of the moon. Although, thanks to the FPGA, we actually get little morsels of what it's up to.

Bryan Cantrill:

On Gimlet, it's like it is in the dark side of the moon for a minute and twelve seconds. Not that anyone has ever sat there with a stopwatch and counted that over and over and over again. The training time is actually longer here, and we talked about some of the things that we're gonna do to actually kind of cache that information. But we actually got a little bit more information. So we can see what this thing is actually doing and what bank it's training and so on.

Bryan Cantrill:

So it is able to make it all the way up, which is great. We've got a computer. And actually I was surprised. I don't know, Dan and Luke, I'm not sure to what degree you all were around, but I was surprised how far we made it on that boot.

Bryan Cantrill:

It was great. We basically got all the way up to multi-user. We got CPUs up. We got a lot of things now working, which is great. I think we were now just shy of, Nathanael, just getting hot plug working, right?

Bryan Cantrill:

That was kind of the last major thing we needed to get working.

Nathanael Huffman:

Right. So none of the PCIe stuff came up in an expected way. And so then it's like, well, guess who's back in the hot seat? Oh, that's me, because we took all of the level translators and everything and pulled them out, and basically the FPGA is emulating all of that. And so it's like, okay, I mean, I've got good simulation coverage for all of that, but this is the first time I've even seen, you know, traffic coming out of the chip.

Nathanael Huffman:

And so it turns out that the only bugs that I actually had there were around how I had wired up things to the hot plug controller. And I think also, when they would read an input pin, I wouldn't necessarily show the outputs. So there was one minor little thing there. But all of the rest of it was just that I had improperly translated stuff from the sum slots and that kind of thing, you know, to the correct pins on these level translators. But all the I2C stuff, I pretty quickly got a trace there and it was like, hey, this all looks really normal.

Nathanael Huffman:

You know, we're ACKing and NACKing and doing all the things like we should. So

Bryan Cantrill:

That is awesome. And I mean, it's to be expected that we're gonna have some issues there, because these were other things that we couldn't actually test with Ruby and Grapefruit, right? We actually needed actual, real Cosmos.

Bryan Cantrill:

But then the fact that you were able to get all that stuff resolved and we've got actual devices. By the end of last week, you were actually seeing NVMe drives, which means we've got PCI training. And so at this point we've got certainly way more computer than we had when we recorded our Gimlet episode years ago. We've got a lot of computer at this point.

Bryan Cantrill:

And still things to do, plenty of work to go do. We've got to go do DIMM margining. We've got to go look at what that eye diagram looks like. We've got to go look at PCI link training. How's that looking?

Bryan Cantrill:

How's that looking for the links that we know are gonna be a little more marginal in our simulations? But damn, we're looking good. It feels good anyway.

Adam Leventhal:

Bryan, about how long did this level of bring-up take with Gimlet?

Bryan Cantrill:

I mean, like five to six months, honestly.

Nathanael Huffman:

I was out there in February, I think, and T6 still wasn't up. And, like, we have it in manufacturing mode. So hopefully tomorrow we'll get a T6 flashed and it'll be up in, like, three by 16 mode. And I think once we do that, we'll be ready to get the remainder of the boards, you know, through rework and shipped out to coworkers.

Bryan Cantrill:

Yeah. It was a long time because everything took longer. I mean, this was the 499 ohm resistor on the T6, and that was with the clock issue. I mean, we had a bunch of issues. And fortunately, you know, we learned a lot from that. One, we just got much better experience.

Bryan Cantrill:

I mean in terms of, like, we got experience with these specific parts, and now we know that that needs to be a 499 ohm resistor for the T6. We know, I think, the resistors to select the internal clock. I think we are very, very familiar with that one on the T6. But then also, again, Grapefruit was absolutely clutch. So it's been really good.

Bryan Cantrill:

And again, there are more things to go, but now it feels like, Nathanael, things are really parallelized and we've been able to just knock tons of things down. We had an issue where we would only work with the unsecured parts, and Lupin just today nailed that issue. And sorry, Lupin, if I'm speaking too soon. I know we got it, but that was looking promising anyway. So to answer your question, Adam, we are way, way, way further along, which is great.

Adam Leventhal:

It's awesome.

Bryan Cantrill:

So yeah, it's been fun, Nathanael. I mean, I think we obviously wanted to get further in that first week, but I would say, given where we are now, I don't think we would have thought we'd be here. I think this is exactly where we want to be.

Nathanael Huffman:

This feels more like how a complicated board bring-up should go. You know, the appropriate amount of pre-work, and the fact that we're building on stuff. I mean, lots of Hubris exists already. And we have all of these great tools now that we didn't have, you know, day one. So it's super nice that we can rapidly figure out what's wrong and close on, you know, the stuff.

Bryan Cantrill:

So it's good stuff. Again, we've got work to do to get this thing out as a product, but we are excited to get it all the way over the line and do all the next things we need to go do to get this out. But we're gonna live. Yeah. We have a computer.

Bryan Cantrill:

It's gonna work. Thank God. Which, you know, there were moments where it might not have felt that way not that long ago. So, well, Nathanael, terrific work. And I mean, obviously a bunch of folks. I know Eric couldn't join us, but the fact that power was so clean on this, as you said, like two caps and it's been.

Nathanael Huffman:

And there's just been an immense amount of teamwork. I mean, I was telling some of the electrical guys earlier this week, every time I have to go to the schematic to look, the schematic just feels so much better than Gimlet's did, where, you know, Gimlet had a lot of hands in there and a lot of iterations on it, and our new schematic is just so readable. And, you know, all of the EEs have done a really nice job, and I really spent very little work on this project in schematic. So that's a credit to the rest of the team. I mean, they've really made a really nice artifact that's fun to use. And, you know, I dread having to go back and look at Gimlet when I was trying to compare what's different between Cosmo and Gimlet, and Cosmo is just so nice to read.

Nathanael Huffman:

So, like, RFK and Eric especially, and Tom, did a great job there.

Bryan Cantrill:

Well, and I'm hoping everyone's gonna be able to appreciate their craftsmanship, because I wanna get that schematic out there. I want everyone to be able to look at the glory that it is, because it is great, and just the way it was done, the notes on it are great. It's the way a schematic should look. It's really terrific work. So, well, thank you.

Bryan Cantrill:

Thank you for coming here and regaling us. I thought it was definitely fitting that we had an issue that is in the netherworld of, like, it really feels like it would be nice to get more help from the FPGA tools. I think it's fitting for us that we had such a bizarre issue that was at the root of it all, and kudos to you and everyone for getting that resolved so quickly. And onward. Great stuff. Alright.

Bryan Cantrill:

Thanks everybody. Now I gotta get Morris Chang at this point. The pressure is very clearly mounting. It's like, okay, great. You pulled the Cosmo bring-up out of a hat.

Bryan Cantrill:

Like, what's next? You gotta get Morris Chang. So I think we've got we've got no choice.

Aaron Hartwig:

Looking forward to it.

Bryan Cantrill:

Exactly. Alright. Thanks everyone.
