Oxide and Friends | Transcript: More Tales from the Bringup Lab

More Tales from the Bringup Lab

April 18, 2022 / 02:07:10/S2 E12

Members of the Oxide hardware team talk about their recent bringup struggles and triumphs with the server sled (Gimlet) and rack switch (Sidecar)

Speaker 1: 00:01

Hello, Brian. Hey, Adam. How you doing? Doing well. There's Arty in.

Speaker 1: 00:08

And then, Nathaniel, are you there?

Speaker 2: 00:11

I am.

Speaker 1: 00:12

And do you know if Eric is gonna be able to join us? He said he was going to, but then I wasn't... I think he will. Yeah. Okay. I will

Speaker 3: 00:21

He just responded in chat

Speaker 1: 00:23

as though he's going to. Oh, that's good. Just now or, Yeah. Yeah. Like,

Speaker 3: 00:28

30 seconds ago.

Speaker 1: 00:31

Excellent. Alrighty. Yeah. I know. We've got and I don't know if Matt was gonna be able to join us.

Speaker 1: 00:40

Matt Keeter. I believe he was installing the app on it. Uh-oh. I think Robert might be doing the same. It's like, okay.

Speaker 1: 00:51

Look. I know it's very on brand for us to open all these spaces complaining about Twitter Spaces. But, Twitter Spaces, can we please make the desktop app allow you to speak in addition to listen? And I get that.

Speaker 1: 01:04

I know that's a totally separate code path, and this is, like, much more complicated than we could possibly imagine, but it is really annoying that people cannot join from the desktop.

Speaker 4: 01:13

Yeah. Until they come up with a fix for that, Robert is coming in by way of my proxy, so you'll get 2 for the price of 1 on this one.

Speaker 3: 01:22

Right. Wait. Are we getting 2 voices, or is this gonna be, like,

Speaker 1: 01:26

Brian, Robert?

Speaker 4: 01:28

Right. Exactly. It depends.

Speaker 1: 01:33

I I don't like, this is really weird to be seeing Steve's avatar have Robert's voice. I'm not sure if I'm ready for this. I I I do I can do You just gotta you'll have to deal with it. Sorry. Go ahead.

Speaker 1: 01:44

Go ahead. Oh

Speaker 5: 01:46

my god.

Speaker 1: 01:46

This is I think

Speaker 3: 01:47

Now I don't now I don't know who's talking. It could be anyone telling Brian just

Speaker 1: 01:52

It it's on brand for both of them, so it's very hard to know. Alright. Well and I'll be there is there is Matt. And then the, I think Eric is the just waiting on Eric, but we can alright. Well, we can get going here.

Speaker 1: 02:17

Arjen, do you want to... there he is, Eric. Do you wanna pick us up with... so, actually, let me set the stage a little bit here, because we were trying to pull off something so ridiculous that it's actually very hard to have a single Twitter Space about it. We were doing the bring-up of the Sidecar switch, which we're gonna be talking a lot about. And Arjen did a terrific Twitter Space on the design of that. While we were doing that, we were also completing the bring-up of Gimlet, which is our compute sled, and which we did a Twitter Space on, about successfully being able to bring the SP3 out of reset.

Speaker 1: 03:05

As it turns out, we definitely felt the wind was at our back, which was true, but we had some very serious adventures ahead of us when we last talked about Gimlet. So we're finishing up the bring-up of Gimlet and bringing up all the components there. And then we also got a cameo from another board that we brought up in the middle of all this, the power shelf controller, which fortunately ended up being much smaller. And then we're also trying to make the necessary revisions for the rev 2 of our Gimlet schematics. There's a whole lot of stuff going on more or less at the same time with not many more people than are here.

Speaker 1: 03:48

Basically, the people that are speakers were the ones who were doing this. So it was an incredibly small team. Really exciting time. Arjen, I thought we might kick it off with you and I before starting bring-up, which we were gonna do in Emeryville, not Minnesota. I just remember us walking, kind of musing about the differences between Sidecar bring-up and Gimlet bring-up, the things that we felt were gonna be more in hand, and then the things that we thought were gonna be way more challenging.

Speaker 1: 04:22

Do you wanna maybe set the stage with that?

Speaker 3: 04:27

And are while you do that, you might define will will recall, but, you know,

Speaker 5: 04:39

we wanna appeal to everyone. Sure. So let's start with Gimlet, which is our compute node. It's an AMD EPYC-based server, you know, with a NIC and NVMe drives in the front. So that was the board the team was already in bring-up with, but we were struggling to get the AMD CPU out of reset, and then, later, some follow-up tales with the NIC.

Speaker 5: 05:10

And then we have Sidecar, which is a board built around the Tofino 2 switching ASIC from Intel, which is a large Ethernet switch capable of up to 12 terabits per second of throughput. And that thing does not have a host CPU on the board. We're connecting that using an external PCI Express cable to one of our compute nodes in the rack. And so one of the things that we were musing about was that, well, there's a lot of circuitry on this Sidecar board that we have sort of explored already once in Gimlet, because they share quite a bit of functionality, especially when it comes to board controls. So there's what we call the service processor, which is an ARM CPU that manages things like power supplies turning on and off, fans turning on and off, all the sort of environmental pieces that you need in order to even start working with an ASIC or a large SoC.

Speaker 5: 06:17

And so a bunch of that stuff was replicated to Sidecar. So we felt like, okay, that should be, you know, pretty doable. We'll have that in hand because this is the second time we've seen it, and we've made some revisions that should help us get past the first hurdles that we encountered while we did the initial bring-up of Gimlet. That turned out to be mostly true.

Speaker 5: 06:39

We did not have significant issues that would get us stumped to basically power up all pieces and to sort of get the things out of reset. But then the Tofino 2 ASIC is a rather beefy device, requiring lots and lots of power and a very tight envelope in which it needs to operate. And so that was gonna be our main challenge, we felt. To get the PDN to power up would have been fine; we had confidence that that would work. But then to really keep it within the narrow operating range that it needs to operate in under load conditions for this ASIC, that was gonna be a challenge.

Speaker 1: 07:24

And, Arjen, you had a great Twitter thread earlier today that kind of expanded on this challenge. And Eric is here, who designed the PDN. Could you elaborate a little bit on why this is so challenging? Like, what a load step means, and why this load step was really gnarly?

Speaker 5: 07:43

Yes. So, basically, when an ASIC turns on or when an ASIC suddenly starts doing more work, more transistors start switching, or they suddenly, you know, turn on. And during that process, an ASIC immediately draws a lot more current if the voltage stays the same. This is different to, for example, an AMD CPU or an Intel CPU, where they change the core voltage. So they step up the voltage in order to not need as much current to deliver the same amount of power through the device.

Speaker 5: 08:19

This goes hand in hand with clocks that go up in frequency. That does not happen in a lot of these larger ASICs. That might be happening in a GPU, but definitely in this networking ASIC, that is not true. Everything stays rather fixed. But what that means is that once you apply power to this device and you then release the reset, its current draw suddenly jumps by 300 amps.

Speaker 5: 08:45

And that happens in the span of about a microsecond. And because of how a PDN is designed on a circuit board, such a PDN can never instantaneously respond to a load step like that. It needs to sense that more current is required before the controller will basically allow more current through the power MOSFETs that it's driving. And so there's a slight lag as the PDN tries to catch up when the ASIC suddenly, you know, stomped on the accelerator. And as a result, the voltage will drop a little bit because of, hand wavy, inductance as current flows through the board.

Speaker 5: 09:37

When that voltage drops, you need to stay above the absolute limit of the ASIC. Because if that voltage drops below the limit, then, best case, you're gonna get errors, like bit errors, because transistors will not be switching in time, or, worst case, you might have things that hang, like they might not switch completely. And so you might induce more current through these transistors, potentially causing higher currents that might damage the parts. So there's a variety of different reasons why you do not want to have that voltage drop below the limit. And then, when the load drops away again, there's this basically excess current that is underway, because the PDN hasn't realized yet that it needs to step down the amount of current that is being released into the board. And now you get an overshoot, because there's a similar effect on the way up.

Speaker 5: 10:46

And that overshoot can potentially cause an over voltage. If that spike is sufficiently high, that pushes you out of the upper limit, and that might now damage the part. So you have to operate within this narrow window. Never undershoot and never overshoot too far because otherwise, your your part is either not gonna function properly or it might damage itself.
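As a rough illustration of the load step described above (a back-of-the-envelope sketch, not something from the conversation): the code below computes the average slew rate of a 300 A step over about a microsecond, and the inductive droop V = L * di/dt that such a slew would produce. The 100 pH loop inductance is an assumed, purely illustrative value, not a measured figure for the Sidecar board.

```python
# Back-of-the-envelope sketch of the load step discussed above.
# Step size and rise time come from the conversation; the parasitic
# inductance is an ASSUMED illustrative value, not a measured board figure.

STEP_AMPS = 300.0                # load step quoted in the conversation
RISE_SECONDS = 1e-6              # "about a microsecond"
ASSUMED_INDUCTANCE_H = 100e-12   # hypothetical 100 pH of loop inductance


def slew_rate(step_amps: float, rise_seconds: float) -> float:
    """Average di/dt of the step, in amps per second."""
    return step_amps / rise_seconds


def inductive_droop(inductance_h: float, di_dt: float) -> float:
    """Voltage across an inductance for a given current slew: V = L * di/dt."""
    return inductance_h * di_dt


di_dt = slew_rate(STEP_AMPS, RISE_SECONDS)
droop_v = inductive_droop(ASSUMED_INDUCTANCE_H, di_dt)

print(f"di/dt ~ {di_dt:.2e} A/s")
print(f"droop ~ {droop_v * 1000:.1f} mV across {ASSUMED_INDUCTANCE_H * 1e12:.0f} pH")
```

With these assumed numbers the droop comes out at roughly 30 mV, which gives a feel for why even tiny parasitics matter at this slew rate.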

Speaker 1: 11:08

And when we first looked into this part, Arjen, I remember you just being wide-eyed about, you know, like, wow, this is gonna be really hard to hit. And we were, I mean, very excited, Eric, when you came aboard as someone with a great deal of domain expertise. And, Eric, I'd be curious, what are the actual numbers in terms of the requirement for droop and overshoot for this part?

Speaker 5: 11:33

Yeah. I do wanna say that I don't know how confidential the numbers are, so I've been a little bit vague on this.

Speaker 1: 11:40

Approximate. Yeah.

Speaker 2: 11:41

Be approximate. On these parts, you know, you're running on these process nodes around 900 millivolts, you know, 800 to 900 millivolts. And generally, you have, you know, a few percent of undershoot you can have. So they expect these things. You know, TDP is, you know, the total dissipated power or whatever that stands for.

Speaker 2: 12:05

The TDP is somewhere in the, you know, however many hundred watts, and that's specified to a specific voltage. And so you don't wanna go above that. Otherwise, you'll, you know, dissipate more power than you intend to, and you'll blow yourself up. And then there's an absolute max where you'll cause physical damage to the transistors. And then on the minimum side, you're talking a couple of percent.

Speaker 2: 12:27

So, like, 3 to 4 percent of undershoot from the power rating. So let's say your power rating is, you know, a volt. You can drop 30 millivolts and be okay. But if you drop more than 30 millivolts, you're hosed.

Speaker 1: 12:47

That just feels very tight. And this is all at, like, 300 amps?

Speaker 5: 12:51

To make it tighter, yes. We're talking about a step that is a couple hundred amps, but we've designed for 300 amps. And I think on the bottom end, we

Speaker 2: 13:01

have second.

Speaker 5: 13:02

How how many how many millivolts do we have

Speaker 1: 13:06

on the bottom end? Something like 5 or 6 millivolts above her?

Speaker 2: 13:07

Yeah. We have we have, like, 7 millivolts of margin, I think. Yeah. Above the spec, which apparently is, like, you know, world class.
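The arithmetic in this exchange can be sketched as follows. This is an illustration only: it uses the speaker's round "let's say a volt" example rather than the real rail voltage, and the "target impedance" rule of thumb (allowed droop divided by step current) is a common PDN design heuristic that is not mentioned in the conversation. The implied droop is simply the budget minus the quoted margin under those assumed numbers.

```python
# Illustrative undershoot budget, using the round numbers from the conversation.
# NOMINAL_V is the speaker's "let's say a volt" example, NOT the real rail voltage.

NOMINAL_V = 1.0            # illustrative nominal rail voltage
UNDERSHOOT_PCT = 3.0       # "3 to 4 percent"; the low end is used here
STEP_AMPS = 300.0          # design load step
REPORTED_MARGIN_MV = 7.0   # "like, 7 millivolts of margin"


def undershoot_budget_mv(nominal_v: float, undershoot_pct: float) -> float:
    """Allowed droop below nominal, in millivolts."""
    return nominal_v * 1000.0 * undershoot_pct / 100.0


def target_impedance_uohm(budget_mv: float, step_amps: float) -> float:
    """Rule-of-thumb PDN target impedance: allowed droop / load step, in micro-ohms."""
    return budget_mv / step_amps * 1000.0


budget_mv = undershoot_budget_mv(NOMINAL_V, UNDERSHOOT_PCT)
implied_droop_mv = budget_mv - REPORTED_MARGIN_MV   # droop implied by the margin
z_target_uohm = target_impedance_uohm(budget_mv, STEP_AMPS)

print(f"budget: {budget_mv:.0f} mV, implied worst droop: {implied_droop_mv:.0f} mV")
print(f"target impedance: about {z_target_uohm:.0f} micro-ohms")
```

Under these assumptions the whole network has to look like roughly a tenth of a milliohm to the ASIC during the step, which is why the window feels so tight.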

Speaker 1: 13:21

What was the end? So, well, part of what made this very exciting is that we're not using the same power controller that everyone else is using. We're using actually the same part that we're using in Gimlet, the RAA229618. And you know it's a good-slash-scary sign when your vendor is like, we're really excited to see if this works.

Speaker 1: 13:46

And they're like, you know, if this works, it would be great. You're like, you're saying

Speaker 3: 13:52

if a lot.

Speaker 1: 13:52

Right? You're saying if a lot. Right? It's like, no. This is important human experimentation.

Speaker 5: 13:56

That's that's literally what they said. Like Yeah. They they told us that we were the first to try this design. And while it all works in theory, we were gonna put the theory to the test.

Speaker 2: 14:06

They said everybody else just copies the reference design, which used, I think, Infineon.

Speaker 6: 14:11

Yes.

Speaker 1: 14:11

Yeah.

Speaker 2: 14:12

Yeah. So, like, these are Infineon controllers and, you know, basically, the way these the way these parts work is there's there's all kinds of highly proprietary ways of tweaking their control loops to make them happy. This isn't a traditional just PID controller, and you go back to your merry way. This is a, you know, nonlinear controller. There's all sorts of crazy things they have in there.

Speaker 2: 14:33

And apparently, with the Infineon parts, and maybe this is just, you know, their experience or whatever, but it's kind of a, they send the app engineer out and they tune the thing for you. And, you know, they spend, like, a week in the lab and they twiddle their little bits here and there and things get better and magically your stuff's done, but you really have no idea what the hell just happened. Like, what's going into the chip, how it's being programmed, what those are doing, whatever. Renesas, bless their hearts, they have a fantastic guide that shows you here are all the different control loop parameters you have access to. Here is what each of them does on an example design.

Speaker 2: 15:12

If you go up, this is what it looks like. If you go down, this is what it looks like. Here's kind of the order you wanna tune things in. And so with that very, very clear guide, I was able to get that thing tuned in within a couple days into, you know, 7 millivolts of margin, which was fantastic. And so it was really nice.

Speaker 1: 15:32

It was amazing. And this might be a good time too, Eric, to talk about the load slammer. We had mentioned the SDLE when we were talking about bringing up the SP3 on the Gimlet. But the load slammer was really important here. Maybe you wanna explain what that is and how we used it.

Speaker 2: 15:48

Yeah. So with a chip like this, you don't just, you know, plug the thing in and go YOLO and hit power on. Because even if you were able to do that, you still have to have software to, like, turn the thing on and have it do intelligent things and provide clocks and all that other crap. So there's a whole lot of, you know, things that build up to actually getting it to pull real power other than train you know, static transistor leakage. So ahead of time, what we did is we got a load slammer adapter.

Speaker 2: 16:16

LoadSlammer is a company, and they make electronic loads for industry. And they have this little module that you can plug in that'll do, like, 500 amps in, you know, some absurd time, like 170 nanoseconds or something, if the voltage is right, and caveat, caveat, asterisk. But, basically, you have this fake load that you put on, just like the SDLE in an AMD design or the Intel version of that. You plug this fake load in and you get it to, you know, basically simulate a load transient like the worst case the chip would ever do. And you want to go above and beyond that, because even if you could just plug the chip in and, you know, turn it on and say, hey.

Speaker 2: 17:00

Here you go. Let's go. You can't go above what the chip will pull, and you also can't test other VID set points, because based on process variations, you might wanna run each of these chips at slightly different voltages. And so the load slammer or something similar to it allows you to test the entire power delivery network before you ever power on actual silicon. And this is done heavily in ASIC designs, especially, like, first-article silicon designs.

Speaker 2: 17:28

You wanna make sure that your PDN is good and happy before you plug in your, you know, your rev A0 silicon into it because, you know, you don't wanna blow up your babies. And you wanna make sure that the PDN can't be blamed for anything. As the power engineer, you wanna make sure that you are blameless and nobody ever calls you up and goes, hey, I see something funny happening. And so these kinds

Speaker 1: 17:51

of devices allow you to test that.

Speaker 2: 17:53

But to do that, you also have to have an adapter because the part is a 4000 pin BGA, and you have to create an adapter. So there's an adapter that's designed for this. And it has some connectors on it. But the the thing that was kind of challenging is, well, I frankly, I didn't wanna spend another, you know, however many, you know, $10,000 or whatever on 2 of these load slammers in parallel because these are these are meant to be transient. They're not meant to be steady state load.

Speaker 2: 18:22

So they have very low average power capability. And so you want some average power, and I think there's a couple of pictures in Arjen's Twitter thread. They have this adapter that has a bunch of extra adapters on it. And so I made some adapters to go from that using literally sheets of copper and some phenolic. I got some from, like, McMaster-Carr and some 3M VHB, and cut all those up and bolted 4-gauge battery cables to them, to our electronic loads, to provide the DC load of, you know, 300-and-some amps, and then have the load slammer provide the 300 amp step load on top of that.

Speaker 1: 19:06

So, Eric, I'm gonna interject. When you do these 300 amp step loads, you're actually seeing the cables bounce. Right? Oh, I didn't realize.

Speaker 3: 19:14

That's amazing. So and and and these are these are car battery cables that we're

Speaker 1: 19:21

seeing in that picture? Yeah.

Speaker 2: 19:21

So they're actually from a car audio company. They're KnuKonceptz or something; they have the most hilarious spelling ever. But these are car audio enthusiasts, and they just happen to have the, like, most reasonably priced, ultra-high-flex, high-strand-count 4-gauge battery cables that I've ever been able to find. And they're good quality. But, yeah, even without the load slammer doing its crazy fast transients, and it's just the, you know, the static Chroma load we have.

Speaker 2: 19:53

I can make the cables jump. There's a way.

Speaker 1: 19:56

And, yeah, they I mean, they really hop they they, like, bump on the bench. Oh, okay. So what makes them hop? What what is that?

Speaker 2: 20:05

The transient magnetic field.

Speaker 6: 20:07

Yeah. It's a

Speaker 1: 20:07

magnetic field effect. Each other.

Speaker 2: 20:09

Wow. Either it attracts or opposes each other depending on, you know, what you're what you're doing. Wow.
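For a sense of scale on the hopping cables, here is a minimal sketch using the textbook formula for the force per unit length between two long parallel conductors, F/L = mu0 * I1 * I2 / (2 * pi * d). The 1 cm spacing is an assumed value for illustration, and real bench cables are neither infinitely long nor perfectly parallel, so treat the result as an order-of-magnitude estimate only.

```python
import math

# Vacuum permeability in H/m (the classical 4*pi*1e-7 value, accurate enough here).
MU_0 = 4 * math.pi * 1e-7


def force_per_metre(i1_amps: float, i2_amps: float, spacing_m: float) -> float:
    """Magnitude of the force per unit length between two long parallel wires (N/m)."""
    return MU_0 * i1_amps * i2_amps / (2 * math.pi * spacing_m)


def interaction(same_direction: bool) -> str:
    """Parallel currents attract; anti-parallel currents (supply and return) repel."""
    return "attract" if same_direction else "repel"


# 300 A in each cable (from the conversation); 1 cm spacing is ASSUMED.
f = force_per_metre(300.0, 300.0, 0.01)
print(f"~{f:.1f} N per metre of cable; supply and return cables {interaction(False)}")
```

A sudden 300 A step therefore applies a kick on the order of a newton or two per metre of cable under these assumptions, which is plausibly enough to make flexible cables twitch on a bench.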

Speaker 1: 20:16

And so the load slammer, it should be said, to be clear, has replaced the Tofino on these select boards. Yes. And this is not socketed. So we've got the 2 load slammer boards. They will always be load slammer boards.

Speaker 1: 20:32

And if we blow those up, we don't have a load slammer. And, similarly, we don't have a way of putting a load slammer on a Tofino board. So these are the same board design but with a different part in them.

Speaker 5: 20:44

Yeah. They're purpose-built for this.

Speaker 1: 20:46

And so we we get the boards. Eric, you start doing your tuning, which seems like it was just going, like, pretty well straight out of the chute. Things were going pretty smoothly, it seemed.

Speaker 2: 20:59

Yeah. So Renesas, like a lot of companies, they have simulation models. And the tool we use is called SIMPLIS. And it's basically a way of approximating a real SPICE simulation that's still pretty accurate and is super fast for things like switching power supplies. And so I had initial tuning parameters from that simulation, from running on my computer.

Speaker 2: 21:25

So that is what's allowed me to get basically a head start to say, okay. This is kinda where I wanna start, and it'll at least be somewhat stable. And then we'll go from there. And then it was just going through that tuning guide and sitting there and blasting at a bunch. We did run into one quirk where we kept seeing things that weren't real, which was interesting.

Speaker 1: 21:49

And you were able to do this completely in parallel. You've got basically one of these load slammer boards. You've gone off to do this. And then, Arjen, I think you didn't have an actual Tofino on that first board we were using. Right?

Speaker 1: 22:01

That was also still a load slammer board.

Speaker 2: 22:02

No. It was the other load slammer board. So he and Matt were working on the other one, getting the management network stuff run up.

Speaker 1: 22:09

Getting the yeah. So the alright. So we're talking about that and and then we, in terms of all the things we needed to do, all the other things on this. We obviously got the the the the main event, but we now got Eric is off tuning and and getting ready for what will be the main event. What were some of the other things we had to go do to bring this thing up?

Speaker 5: 22:28

Yeah. So Aaron and I worked on so there's an FPGA that controls most of the, act that is actively controlling most of the pieces on this board that actively controls the enables and resets on most of the parts. The intent there is that we can more or less re reboot all pieces of software without the ASIC losing, the ability to continue switching packets. We can't update the ASIC, so no switching table updates can happen for a short for a very brief period of time. But at least the ASIC can continue switching packets, which is kind of important.

Speaker 5: 23:03

So, yeah, Aaron and I were working on RTL for the FPGA to be able to properly sequence the power rails that are enabling Tofino, as well as some of the power rails for, there's a secondary switch ASIC on this board that provides our management network. And there's a clock generator and a bunch of auxiliary pieces that need to work. So all those had to be turned on, and once that was freed up, Matt was then able to take over and use that for some work on the secondary switch ASIC that he has been working on for the past months.

Speaker 1: 23:49

Well and before Matt arrived in in the, as we are bring I mean, we gotta get to how the a Well,

Speaker 5: 23:55

Matt arrived the second week, so we were already a week in at that point. That's

Speaker 1: 23:59

right. We were already a week in, and I I feel we have to talk about south 0 versus south 1 versus south 2 and the the how TOML actually resulted in us depopping a bunch of components.

Speaker 5: 24:11

Well, yeah. So we are trying to configure the, so there's a pretty fancy clock generator on this board that, given an appropriate system crystal and then a very high precision oscillator, will generate the required frequencies for all the components within, you know, jitter specifications. There's potentially all these qualifications that you might wanna hit for use, for example, if you're part of a 5G radio network, or other applications where you have to meet these particular jitter properties. And so we have this very nice part that will more or less let us program time however we need. We can arbitrarily put things in the same time domain or in a separate time domain, meaning that clocks are guaranteed to be fixed in phase with each other or not.

Speaker 5: 25:06

They can drift separately from each other. It's a very nice programmable part, but it needs to be programmed over I2C. And it was pretty much the first thing we needed after power was online. And so we are here desperately trying to connect to this part, and for some reason, we can't find it on the bus. And we keep twiddling different things, like, did we get the address wrong?

Speaker 5: 25:28

We're scanning the bus, we're looking at the schematic, trying to piece some things together. And ultimately, we realized that, I think we got the silk screen wrong, and so we're plugged into the wrong bus with the wrong header, and the part was out on a different bus. So it was never gonna answer to us. But at that point, we've gone through and desoldered most of the components on the bus, because we were wondering if maybe the bus was, because I think the bus was stuck in reset or the bus was being pulled on

Speaker 1: 25:59

Well, that one that one had the, the LPC on it or the, yeah, the, the lattice part on there, and it was, like, weak pull downs, which was surprising.

Speaker 5: 26:10

Oh, yeah. Yeah. The FPGA was attached to that, and the FPGA in its default state, in its unconfigured state, will pull down on the lines. And so the I2C bus was being held in, basically being stuck. But that was the wrong I2C bus.

Speaker 5: 26:27

It was because we were looking at the wrong thing. And so we started depopping most parts. We had some resistors in line with most parts, so we could decouple devices from the bus, except for the FPGA and one other device. And so here we are doing all this surgery, and then at some point, we step back and we realize, like, oh, we've been operating on the wrong bus altogether. And so, hey, lo and behold, we connect to the right bus and the clock generator wakes up and can be configured.
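Why one misbehaving device can stall a whole I2C bus, and why pulling devices off one by one is a natural debugging move, can be sketched with a deliberately simplified model. I2C lines are open-drain: a line reads high only when nothing on the bus is pulling it low. The model below treats any pull-down as winning outright (it ignores the resistor-divider effect of a weak pull-down against the pull-up), and the device names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """A device on a shared open-drain bus (names below are hypothetical)."""
    name: str
    holds_sda_low: bool = False
    holds_scl_low: bool = False


def bus_levels(devices):
    """Wired-AND: each line is high only if no attached device pulls it low."""
    sda_high = not any(d.holds_sda_low for d in devices)
    scl_high = not any(d.holds_scl_low for d in devices)
    return sda_high, scl_high


def bus_is_idle(devices):
    """An idle I2C bus has both SDA and SCL high."""
    return all(bus_levels(devices))


all_devices = [
    Device("temp_sensor"),
    Device("clock_generator"),
    # Simplified stand-in for an unconfigured FPGA pulling down on the lines.
    Device("unconfigured_fpga", holds_sda_low=True, holds_scl_low=True),
]
without_fpga = [d for d in all_devices if d.name != "unconfigured_fpga"]

print("bus idle with everything attached:", bus_is_idle(all_devices))
print("bus idle with the FPGA isolated:  ", bus_is_idle(without_fpga))
```

In this model, isolating the offending device is the only way to free the bus, which is the reasoning behind putting series resistors in line with each device so that it can be decoupled easily.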

Speaker 5: 26:54

So we spent a couple hours puzzling on that one.

Speaker 1: 26:57

But Yeah. It's I guess I didn't realize that the artwork was also wrong because I was fixated on the fact that the that Hubris was talking to the wrong bus.

Speaker 5: 27:06

Oh, it's possible. I think the artwork may have been wrong, but, yeah, the ultimate problem was we have this configuration that allows us to basically specify a device tree in our embedded operating system. And the way that that file is ingested, the order in which these devices exist changes. Was that it?

Speaker 1: 27:28

Oh, god. Yeah. Matt, you wanna talk about this one? Oh, god.

Speaker 6: 27:33

Yeah. This was a little bit before my time, but I remember seeing the aftermath. So, basically, the devices are configured where you say, like, here are all of the different devices on the I2C bus. Here are their addresses. Here's what they are, and so on.

Speaker 6: 27:45

And then elsewhere in the code you say, okay. I wanna talk to device number 1 in that list using an index. And the problem was that Right.

Speaker 5: 27:53

But we don't we don't we don't refer to by name.

Speaker 6: 27:56

Yeah.

Speaker 2: 27:56

Yeah.

Speaker 6: 27:57

And so the list of devices was being sorted. And so device number 1 in that list ended up being some random other device, because at some point in the process, it was stored in a sorted data structure and then spat back out in a different order. I think that's the quick version. Does that sound right?
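The failure mode described here can be reproduced in a few lines. This is a minimal sketch, not the actual Hubris build code: the device names and addresses are hypothetical, and the "sorted map" loader merely mimics a parser that keeps tables in a sorted structure such as a B-tree. It also shows that looking a device up by name, rather than by position, does not depend on the order.

```python
# Hypothetical device list, in the order it appears in the config file.
DEVICES_IN_FILE_ORDER = [
    ("tmp_sensor", 0x48),
    ("clock_generator", 0x6B),
    ("fan_controller", 0x20),
]


def load_preserving_order(entries):
    """A loader that keeps the devices in file order."""
    return list(entries)


def load_via_sorted_map(entries):
    """Mimic a parser that stores tables in a sorted map, so keys come back alphabetized."""
    return sorted(dict(entries).items())


def lookup_by_name(devices, name):
    """Order-independent lookup: find the address by device name."""
    return dict(devices)[name]


# Code elsewhere says "talk to device number 1", meaning the clock generator.
intended = load_preserving_order(DEVICES_IN_FILE_ORDER)[1]
actual = load_via_sorted_map(DEVICES_IN_FILE_ORDER)[1]

print("intended device #1:", intended[0])
print("actual device #1:  ", actual[0])
print("by name:", hex(lookup_by_name(load_via_sorted_map(DEVICES_IN_FILE_ORDER), "clock_generator")))
```

The index silently resolves to a different device once the list has been alphabetized, while the name-based lookup still finds the intended address.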

Speaker 1: 28:12

Yeah. I mean, the yeah, that's right. I mean, the the tables and Adam, do you how much do you deal with TOML? Do you do you deal with TOML in, I mean, it's I mean,

Speaker 5: 28:25

I don't know.

Speaker 3: 28:26

I mean, I use JSON which which is not a better way to live. It's just a different way to live.

Speaker 1: 28:31

It's a different way to live. All of these things have got, like, strengths and weaknesses. Definitely TOML among them. One of the things that is really frustrating about TOML is that the table order is not stable. So you have something that looks ordered in a file, but then it will go through a B-tree and it will actually be in alphabetical order.

Speaker 5: 28:58

But but this is is this for does this have to do with the the, like, quote unquote specification of this of this file format or is it simply the implementation of the parser that we're using?

Speaker 1: 29:08

It is. It's the specification, sadly. And there's an issue; I'll link to it. Oh, and this has caused us no end of grief.

Speaker 1: 29:16

Matt, there was a very funny thing that you had quoted, which, I don't know where you found it, but a table comparing TOML to other file formats. And one of them was comparing it to an INI file. And it was like one of these, like, you know, yes/no, strength/weakness kind of tables. And it's comparing TOML to something like an INI file, which is kind of what TOML modeled itself on. One of the columns in the table was strongly typed.

Speaker 1: 29:48

And if you were, yes, strongly typed, that was considered to be a negative. That's red. And if you are not strongly typed, that's positive. And you had a very funny line of, like, did an INI file actually write this table? But, yeah, that was a frustrating issue.

Speaker 1: 30:07

I feel that issue has bit us many, many times with TOML. I felt like it bit us on the SPI stuff as well, not that long afterwards.

Speaker 6: 30:19

I think it did. That sounds right. I just know that the fix was just to switch to a custom parser that was called ordered TOML, which just maintained ordered lists for all of these lists of dictionaries and so on.

Speaker 1: 30:32

Yes. That was when Cliff hit the last straw. Cliff is like, forget it. We're never doing this again, and I'm actually changing this over to my own thing called ordered TOML, just like TOML except tables are ordered. The end.

Speaker 1: 30:43

But, anyway. So, yeah, and, Nathaniel, what did you guys end up depopping on that? You were able to get it all back on, obviously.

Speaker 7: 30:50

Sorry. Nothing was too bad. I mean, we took some series resistors out, and we might have taken a small, you know, like, 4- or 8-pin chip off. I think we took one of the temperature sensors off that we maybe didn't really need, at least on a bring-up board. And, you know, but, yeah, it wasn't, like, massively destructive.

Speaker 1: 31:11

But we really love a lot

Speaker 8: 31:12

of things to take off.

Speaker 7: 31:15

I mean, I yeah. I mean, yeah. It wasn't but, yeah, we did we did get to a spot where it was like, well, it's the BGA and the processor, and there's not much we can do about, you know, either of those two things.

Speaker 5: 31:25

So anyone listening, including my future self: anytime you do a new board and you don't know what the I2C bus is actually going to end up with, make sure that you put series resistors between the bus and every device so that you can easily depop the device from the bus in case something is misbehaving.

Speaker 3: 31:41

Yes. I I did wanna hop back and just tell a

Speaker 7: 31:44

funny story about the 1st week of bring up, which was, like, all power all the time, including our hotel room because our hotel room lost power. Right. And so we, we you know, a number of us were staying at a local hotel, and the hotel had, like, a main bus fault. And so, you know, on day 2 or something, we had to, like, leave work and go grab our stuff and move to a different hotel. Yeah.

Speaker 7: 32:06

And to be clear, when you say have power again the rest of

Speaker 1: 32:10

the week. Right. When you say lost power, this is not like the power, like, was out for, like, an hour or 2 hours. This is like the hotel no longer has the told them to

Speaker 7: 32:27

kick the people out. So

Speaker 2: 32:29

Yeah. I I tweeted that one. It's just I tweet very little, but I tweeted that one. It's a restricted use sign that says, yeah. This is not safe to be abiting, you know, to be to be in.

Speaker 2: 32:40

So, yeah, that's not good. The thing that blew up is drastically bad to have pull up. It's called a bus duct. And it's, it's pretty pretty horrible to have one of those go. You're having a real bad day.

Speaker 7: 32:53

So even when even when we weren't at work, we were still dealing with power issues.

Speaker 1: 32:57

Oh, right. And I'm not sure if that was a good omen or a bad omen, but, that would right. I totally forgot about that that data. So you you all had to move to a different hotel in in all of this, which was very disruptive. So then the we are doing all power all the time as you say.

Speaker 1: 33:16

That first week went pretty smooth. I think, again, we were kind of where we wanted to be going into week 2. And now, a couple of things were happening. One, we were getting ready to actually bring this up on the actual Tofino. So, Eric, do you wanna describe kinda what you had done with the tuning, and, I mean, your confidence is growing that we could actually not blow this thing up?

Speaker 2: 33:45

Yeah. So I kept abusing it in different ways, trying to see if it would blow up, and it seemed to be behaving. And there was some noise that was happening in the load slammer because of some quirk that I saw and figured out. But, basically, we were seeing fake overshoot or undershoot that was not actually there. And so that took a couple of days to convince myself I knew what was going on.

Speaker 2: 34:17

And the nasty thing about this is, like, you can't just measure this with a normal scope probe because ground changes when you pump 500 amps through it. Because even your ground planes, you know, there's a 20 layer board, and we have probably 12 or 14 layers of ground. I mean, it's just stupid how many ground planes we have. But even that, like, will change by, you know, 10 millivolts at 500 amps. And 10 millivolts doesn't sound like much, but when your entire tolerance is, like, 30 to 40 millivolts, 10 millivolts is a big chunk of your margin.

Speaker 1: 34:53

It's just so, Eric, why is ground changing? What's causing because

Speaker 2: 34:57

you have you have current flowing through it.

Speaker 5: 35:00

So, yeah. These are the same electromagnetic properties. These fields change, and therefore

Speaker 2: 35:05

It's not in the field. It's just straight resistance. I mean, ground is a wire. Right?

Speaker 5: 35:08

Oh, that's true. Yeah. This is resistance.

Speaker 2: 35:10

Like, let's say, a milliohm of resistance, right, which is pretty low. You pass 500 amps through it, that's 500 millivolts of drop you'll get across it. So your ground planes have to be in the micro ohm range, and the PDN actually is in the range of, like, 200 micro ohms. But
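
Eric's arithmetic here is plain Ohm's law. The following is a minimal Python sketch for checking it, not anything from the bringup tooling: the function names are made up for illustration, and the 10 millivolt figure is the ground shift at 500 amps that he mentioned a moment earlier.

```python
def voltage_drop(current_amps: float, resistance_ohms: float) -> float:
    """Ohm's law: V = I * R."""
    return current_amps * resistance_ohms


def max_resistance(allowed_drop_volts: float, current_amps: float) -> float:
    """Largest resistance that keeps the drop within budget: R = V / I."""
    return allowed_drop_volts / current_amps


LOAD_CURRENT_AMPS = 500.0

# One milliohm of ground resistance at 500 A.
drop = voltage_drop(LOAD_CURRENT_AMPS, 1e-3)
print(f"1 milliohm at 500 A drops {drop * 1e3:.0f} mV")

# Resistance implied by a 10 mV ground shift at the same current.
r_ground = max_resistance(10e-3, LOAD_CURRENT_AMPS)
print(f"10 mV at 500 A implies about {r_ground * 1e6:.0f} micro ohms")
```

The second number is why the ground planes have to land in the micro ohm range: at this current, even tens of micro ohms eat a visible share of a 30 to 40 millivolt tolerance.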

Speaker 1: 35:29

200 micro ohms. Sorry. I'm just okay.

Speaker 2: 35:33

200 micro ohms up to, like, you know, some megahertz. Wow.

Speaker 5: 35:37

To be able to make those

Speaker 2: 35:38

kinds of tolerances.

Speaker 5: 35:39

Eric, how much resistive power are we burning in just the copper layers to supply 500 amps? It's something like 40 watts.

Speaker 2: 35:49

Yeah. Somewhere around 40 or 50 watts.

Speaker 5: 35:52

So that's 40 or 50 watts dissipated through the PCB just because of the resistance of the copper as you deliver power from the power FETs to the ASIC. Wow.
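
The 40 to 50 watt figure lines up with the resistance numbers quoted just before it. Here is a rough Python sketch of that check, assuming the whole copper delivery path can be treated as one lumped DC resistance (a simplification; the function names are illustrative only).

```python
def copper_loss_watts(current_amps: float, resistance_ohms: float) -> float:
    """Resistive heating in the delivery path: P = I^2 * R."""
    return current_amps ** 2 * resistance_ohms


def effective_resistance_ohms(power_watts: float, current_amps: float) -> float:
    """Lumped resistance implied by a given loss: R = P / I^2."""
    return power_watts / current_amps ** 2


LOAD_CURRENT_AMPS = 500.0

# Loss if the path were the 200 micro ohms mentioned for the PDN.
loss_at_200_uohm = copper_loss_watts(LOAD_CURRENT_AMPS, 200e-6)
print(f"200 micro ohms at 500 A: {loss_at_200_uohm:.0f} W")

# Lumped resistance implied by the low end of the quoted range.
r_at_40_w = effective_resistance_ohms(40.0, LOAD_CURRENT_AMPS)
print(f"40 W at 500 A implies about {r_at_40_w * 1e6:.0f} micro ohms")
```

Because the loss scales with the square of the current, a path of only 160 to 200 micro ohms is enough to account for the whole 40 to 50 watts.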

Speaker 1: 36:04

And the resistive properties of that material, we're in the domain of material science with that. Yeah.

Speaker 2: 36:10

Because that was the resistance of copper.

Speaker 1: 36:12

Of copper. Okay.

Speaker 2: 36:13

You're you're running into fundamental limits of okay. And that's that's why these chips all implement remote sense where you they remotely sense both the power and the ground. So they measure the voltage differential at the load with all these high power high power controllers and these high power chips. And so because you can't use a normal, you know, scope probe for this, you have to use a differential probe. But many, many differential probes are limited in their ability to measure with low noise.

Speaker 2: 36:44

And again, when you have, you know, tens of millivolts of margin, noise is important. And so is DC offset and all these other things. And there's, like, one differential scope probe that apparently can do this, and I didn't have it at the time. You know, because we're a start up, and I don't have, you know, a million dollars of test equipment around that we've built up over the last 2 decades. So, you know, it was challenging, but it's like, okay.

Speaker 2: 37:10

I measure the voltage drop, and I measure how much the voltage changes on the scope probes when I do different, you know, DC load levels to see how much ground bounce I have. And, you know, I figured out that, okay. Yes. I'm seeing this and it's fake and, you know, I don't have to worry about it. And so then I can trust the results ignoring these couple little quirks, which are also not physically possible.

Speaker 2: 37:32

So it's not like I was just trying to ignore something that was, you know, not pleasant. The the quirks we were seeing were not physically possible. So

Speaker 1: 37:40

And how were they not physically possible? It's like the

Speaker 2: 37:43

So usually, when you have a load step, like, meaning load is added, your voltage drops. But in this case, the voltage was rising on a load step, and it was dropping on a load release. And that's not what happens. Right. And and the no.
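
The rule of thumb Eric describes can be written down as a tiny sanity check. This is only an illustrative Python sketch with a hypothetical function name; it encodes nothing more than the sign relationship he states: adding load should pull the rail down, and releasing load should let it rise.

```python
def transient_is_plausible(load_delta_amps: float, voltage_delta_volts: float) -> bool:
    """Return True when the rail moves opposite to the load change.

    A load step (positive current change) should cause a dip, and a load
    release (negative current change) should cause a rise. Readings that
    move in the same direction as the load point at a measurement artifact.
    """
    return load_delta_amps * voltage_delta_volts <= 0


# Load step with a dip: what a real rail does.
print(transient_is_plausible(+100.0, -0.010))
# Load step with a rise: the kind of reading that turned out to be noise.
print(transient_is_plausible(+100.0, +0.010))
```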

Speaker 2: 38:01

The way we prove this to ourselves is we we also dead shorted the inputs in this thing, to the device that was actually measuring this.

Speaker 1: 38:09

Okay.

Speaker 2: 38:09

And even when it's measuring ground from both inputs, it saw the same spikes. And so there must have been some sort of either external coupling mechanism because of where we had to run cables on the on the load, you know, near the device, and you can see it's fairly crowded. So it's either something like that where there is some sort of magnetic field coupling in there or something, or there's just, you know, some other quirk that we just didn't understand. But basically, it was measuring noise and we proved to ourselves that, yes, it was noise, and we can ignore it safely. And then once that once that was settled, which took a while, for me to convince myself of that, it, you know, was basically, okay.

Speaker 2: 38:51

You have the main rail. You have a few aux rails, which are, you know, only, you know, 80 amps or 100 amps

Speaker 1: 38:58

or something. They're they're

Speaker 2: 39:00

they're little piddly

Speaker 1: 39:01

ones. These these little small rail. Right? Exactly.

Speaker 2: 39:03

These little small rails that, you know, they take, like, 39090 amp multiphase controllers. And so once I kinda worked through all of this, I'm like, yeah. Okay. I'm, you know, I'm reasonably happy with all these results, and there's a nice spreadsheet that Intel provided about the, like, the rail tolerances and things. And I filled that out, and I checked every VID, so every and not more weird or anything.

Speaker 2: 39:33

And I was like, okay. Well, Arianne, now what? What do

Speaker 8: 39:40

you want me to do?

Speaker 1: 39:42

And and, Eric, I think it's about, like, what did I remember one of the metaphors you had for this is, like, you can see this is like jumping out of a plane and landing on a on a on a tight rope, that is made of, like, dental floss. I mean, that is like how how and I just like as if I'm like, okay. I like, wow. This is this is gonna be, definitely holding our breath for that. And the and the the way these things, you know, you don't necessarily pick the timing of some of the stuff.

Speaker 1: 40:14

And, Steve, do you wanna talk about the fact that we had a potential investor who had a a previously scheduled time to come visit us while we were gonna be powering this thing on for the first time. Mitch,

Speaker 4: 40:33

Yeah. I mean, we were we were at our office. This was, as we were starting to get back to where, one might have folks in a well ventilated office. And, so had a prospective investor up here, having a meeting with them, and, it was when we had the hardware engineering team in town to do bring up on the switch to which, you know, we had not yet gotten to a happy place. And in the midst of the meeting, all of a sudden, all the white lab coats abruptly kind of got up and huddled around the board and with, like, furrowed brows and were, you know, kind of turning and talking to each other and muffled voices and which made it impossible for me to do anything other than crane my neck the entire time, which made the rest of the meeting very awkward.

Speaker 4: 41:25

After the fact, of course, it was revealed that that was a planned event from the engineering team to prank us as we're sitting over there talking to said investor. But

Speaker 5: 41:40

they because come on. This is the birth of it. Like, this is the birth of a little machine over here. It's kinda rude to invite the outsiders. I mean,

Speaker 4: 41:49

events. Yeah. Very well deserved. I mean, I have all the respect in the world for the prank. It didn't make the the, the moment any, less uncomfortable for all involved.

Speaker 4: 41:59

But yeah.

Speaker 1: 42:00

Well, and in particular, for Steve. I have to say that, like, I was just like, look. Like, whatever's happening over there is happening over there. We're gonna have this conversation, and then there'll be plenty of time to weep tonight. I don't need to weep right now.

Speaker 1: 42:11

Steve is, like, trying to read body language. I'm like, no. Don't. What are you doing? Just like anyway, you definitely, it was a very effective prank.

Speaker 1: 42:22

And, I mean, the good news is that the, I mean, Eric, you I'm sure you were holding your breath, but, I mean, stuck the landing when we when we powered it up.

Speaker 2: 42:33

Yeah. No smoke. It I'd I'd I don't know how many how many more hairs I lost off my head that day, but

Speaker 1: 42:42

that's why that's why I shaved my head. So I don't

Speaker 5: 42:44

For the listeners, Eric

Speaker 7: 42:45

is bald?

Speaker 1: 42:47

Yeah. Well, not totally bald.

Speaker 2: 42:49

I shaved my head, but I accept the fact that I'm going bald. I own it. So but, yeah, it was one of those, like, man. I don't say no regret hiring me.

Speaker 1: 43:03

I hope this works. There.

Speaker 5: 43:05

The power up was really uneventful because we we we put the heat sink on,

Speaker 1: 43:09

and Yeah. The heat sink was

Speaker 5: 43:11

It was, like, alright. We're we're plugging in this board. We have we have tested the power up sequence. We know that the rails are powering up in the right order. You know, all the

Speaker 1: 43:19

timing is correct. Hey. Can we just talk about the heat sink

Speaker 7: 43:23

for a second? Yeah. That thing's,

Speaker 1: 43:24

like, 8 pounds.

Speaker 2: 43:25

It's pretty awesome. It's an entire the whole bottom of that thing is a vapor chamber. So, like, if if people know, like, a a a normal heat sink, you'll see those copper pipes coming up, and those are those are, heat pipes, and they have, like, water usually in them, and the the water evaporates and it moves to somewhere else, and then it condenses and flows back to the heat source and everything. The entire bottom plate of that heat sink is one giant vapor chamber.

Speaker 7: 43:52

And Which is the only

Speaker 2: 43:53

way in hell you can move that much heat over that big of a space in any effective manner.

Speaker 1: 43:59

And it's gotta be what is it? I mean, I I it's it's 9, 10 inches. Sorry to use imperial here, but I mean, it's a warped. It's

Speaker 2: 44:07

Well, so if you a 100 millimeters wide.

Speaker 5: 44:10

Yeah. If you're looking at the pictures on the board, if you look at where the like, some of these pictures of the adapter that I showed in the thread that we're we were talking about earlier, there's a there's an outline there, like a large rectangle around this whole thing. That's the outline of the heat sink. This thing is almost as wide as the board is wide, And it is about an inch and a half tall. And,

Speaker 1: 44:31

And we had one of our the Adam, did you have any insight into the moment arm crisis? What are the, like

Speaker 3: 44:37

The moment arm crisis?

Speaker 1: 44:39

Yeah. Exactly. So the so one of the early concerns what we got a very, partner doing, MechE work for us, very partner, and they had one of their concerns was the heat sink was gonna be so large and so heavy that you've got the the the moment arm, which is to say the the, mechanically, the the force that would be necessary at the very edge of the heat sink to crack the PCB would be scarily slight. And we you get very nervous about the thing, actually.

Speaker 5: 45:12

Not crack the PCB. Crack the die of the chip.

Speaker 1: 45:16

The crack the die. Right.

Speaker 5: 45:17

Because in transport, the heat sink might vibrate, you know, as the thing is transported. And those vibrations will cause the far outsides of the heat sink to potentially rock back and forth ever so slightly, but this is an exposed die device. There's no heat spreader. There's no protection whatsoever. And so if you apply just enough force on the edge of those little dies, then you might crack them off.

Speaker 5: 45:47

And therefore, you know, who knows what happens then?

Speaker 1: 45:50

And fortunately, we this is, changing the heat sink from copper to aluminum. Right? Ariane, that was, like, the big win in terms of being able to pull it in a little bit. Have you

Speaker 5: 46:01

So the thing we at some point settled on: initially, we weren't looking for a vapor chamber design because it was slightly more complicated, slightly more expensive, and lead times were longer on that because it's just a more involved manufacturing process. But we did eventually settle on a vapor chamber. And then, what we could do, and this is a while ago, but, basically, the fins are aluminum. So we're using a copper base with the vapor chamber and aluminum fins.

Speaker 5: 46:36

And heat travels only so far into those fins. And basically, initially the design was much, much taller than it needed to be. Basically, simulation showed that the amount of heat that would reach the very tip of the fins was rather minimal. So we could easily shave off, I don't know, I would say 10 millimeters or 15 millimeters of the fins from the top, basically. And that alone caused a significant reduction in weight, something like 400 grams, I think.

Speaker 5: 47:04

So that was that that brought us down into the sort of 2 and a half kilogram range, which was where the mechanical engineers were like, okay. This is we're we're we're okay with 2 and a half kilograms. We can we can sufficiently support the heat sink on both sides to make sure that it won't ever rock far enough that it will cause enough force on the on the die to be a problem.

Speaker 1: 47:24

But it is a big heat sink.

Speaker 5: 47:26

It's a very big heat sink. It's a very chunky boy. And soon, the graphics card coming to a PC near you will have equally large heat sinks. The next generation of GPUs are dissipating, like, TDPs are in the same ballpark. And there's a real knee in the curve.

Speaker 5: 47:55

Etcetera, and and so everything becomes much more difficult. So, it'll be interesting to see what future GPUs high end GPUs will look like because those are gonna be very beefy pieces of of, of metal too.

Speaker 1: 48:08

Yeah. And, of course, we're all air cooled, so that's the other the the other kind of factor here that you can but I expect most GPUs oh, I actually shouldn't say that. I'm sure gamers are all water cooled now. So we get this thing. We you get it, flip the switch, and then, Eric, are you pretty happy?

Speaker 1: 48:26

What are you monitoring when you let current flow through that thing for the first time? What do you what were you monitoring to ascertain whether we had achieved success or not?

Speaker 2: 48:40

So at this point, it was mostly just making sure the voltage is right versus the the VID of the part we were using.

Speaker 1: 48:46

Got it. Yeah.

Speaker 2: 48:46

Okay. I wasn't really concerned about the load part of it anymore because, you know, I was watching the input power and the load monitor on the Renesas chip. There wasn't that much load coming from it because a lot of stuff was still in reset. But most of the concern at that point was, okay. Is the heat sink working?

Speaker 1: 49:09

Is it getting warm? Because if

Speaker 2: 49:11

it's not, then our chip is gonna self immolate and desolder. Like, nickel plated or whatever. So it's it's very shiny, and it doesn't work well with, with thermal cameras, but you put some Kapton on it and it gets it reasonable enough that you can see if it's if it's getting toasty. So we had some Kapton on it and it, otherwise, you take a thermal camera and point it at a really shiny surface like metal, you'll see the reflection of whatever it's pointed at.

Speaker 1: 49:42

Oh, I don't think I didn't realize that. So we okay. Yeah. I I think I missed the fact that you've done that for the thermal And

Speaker 7: 49:48

I I think Aaron put a picture of the heat sink with the Kapton on it in reply for the Twitter space.

Speaker 1: 49:54

Correct? Yeah.

Speaker 3: 49:54

And it's pinned in in this if it shows up for you, although it doesn't seem to

Speaker 1: 49:58

show up

Speaker 5: 49:59

for everybody.

Speaker 2: 50:00

Yeah. It shows up for me. But, yeah, so we put some Kapton on that, and that's a close enough approximation that it at least gives you some reasonable, you know, estimation of the temperature of the heat sink. And once I saw that that was, like, getting a little warm in the middle, and, you know, warm meaning, like, 2 or 3 degrees above ambient because the heat sink's a beast. And at this point, there's no, like, screaming fans blowing at this thing.

Speaker 2: 50:29

It's like some little puny desk fan that's propped up in the background.

Speaker 5: 50:33

That that puny desk fan is actually

Speaker 1: 50:36

It's fucking ruined. Quite a bit of Yeah. I'm trying to make it.

Speaker 9: 50:39

Yeah. They've been doing pretty

Speaker 2: 50:40

self levitating compared to the self levitating Oh, yeah. Stacked fans that'll run, you know, as fast as a jet engine. Yeah. No. No.

Speaker 1: 50:47

Yeah. But they're, like, 9 inch fans, the diameter of those

Speaker 4: 50:50

fans. Right?

Speaker 2: 50:51

It is. It it's a legit diameter.

Speaker 1: 50:55

But and then is this where you, is this what 2 to 3 degrees above ambient is sweet. You call puppy dog warm. I know you you refer to puppy dog warm. I haven't figured out what puppy dog warm is yet. I I just It wasn't

Speaker 2: 51:06

it wasn't even puppy dog warm at that point, but once we let the thing out of reset, then it started getting, like, puppy dog warm, which is, you know, like, a few more degrees above ambient. It's basically, like, puppy dog warm to me is, like, if you pet a dog that has, you know, not super thick hair, super long hair, you know, like a puppy. If you pet a puppy, you feel kinda general, like, cozy comforting warmth from the dog, that's puppy dog

Speaker 1: 51:32

warm. Alright. So I I feel like

Speaker 3: 51:35

Term of art here.

Speaker 1: 51:37

Yeah. When I think I never got the all I got from Eric's, like, that's puppy dog warm. That's not puppy dog warm. This is not yet puppy dog warm. So I'm like, I'm trying to figure out what exactly puppy dog warm is.

Speaker 2: 51:45

Do you

Speaker 5: 51:45

need them to put it

Speaker 3: 51:46

in cat terms

Speaker 1: 51:47

for you, Brian?

Speaker 2: 51:47

It's quite literal. It's happy kitty warm.

Speaker 1: 51:52

Thank you. Yeah. If you could just convert it to cat for me, that'd be a lot easier.

Speaker 2: 51:56

Oh, this is this is black cat in the window in the sun for warm.

Speaker 1: 52:00

This is you know? There you go. So we so the we we like what this part is doing, and now maybe it's actually a good time to kind of, like we we we've got that working. Maybe, Matt, you wanna talk about what you were doing, because this is not actually one switch. It's actually 2, and you were working on getting the other switch up during this time.

Speaker 1: 52:23

You wanna talk about some of the things that you've done with that and some of the things you've found?

Speaker 6: 52:28

Yeah. Definitely. So, like Brian said, the main Sidecar switch board actually has a second, lesser network switch on there, where by lesser network switch, I mean a 54 port, 80 gigabits of switching network switch. So, like, much heftier than anything you would use in a consumer device. And the purpose of this switch is to run what's called the management network, which is the network of all of the service processors, which are equivalent to baseboard management controllers.

Speaker 6: 52:58

So this is very low level. It comes out of reset, kind of before the main switch, and its job is to shuffle packets around. And it needs to be pretty dumb. Like, kind

Speaker 1: 53:06

of tune

Speaker 6: 53:11

where they go. But you can't

Speaker 3: 53:16

And that's sorry to interrupt.

Speaker 6: 53:18

Yeah.

Speaker 3: 53:19

Matt, sorry to interject, but just to tie it together for other folks who've listened, these service processors are where we're running Hubris just to to complete the picture, a little bit. Sorry, Matt. Go ahead.

Speaker 6: 53:30

Yeah. Definitely. And so it turns out that you can't buy a dumb 54 port 80 gigabit switch. Like, no one sells that because that's a really weird thing to want. And so instead, you get to buy a very fancy like, this is meant for your light home network or light server rack switches, with, you know, 800 page data sheets and 800,000 line SDKs to work with.

Speaker 6: 53:55

And so bringing that up was very exciting because the datasheet, despite being 800 pages long, doesn't actually tell you everything you need to do to bring it up. And so there was a lot of reverse engineering the SDK and kind of tracing through it to see what it's writing and looking at all of the secret bonus registers that exist in the SDK but not in the datasheet, to get this thing actually up and running.

Speaker 1: 54:17

And this is the VSC7448, that's the chip here. And I would say the vendor here is like many of ours, which is, like, not unfriendly, but also made it very clear that we are not on their well trodden path. They're just like

Speaker 3: 54:35

Yeah. What what is this SDK? Is this something like that they expect you to have, like be running Linux somewhere?

Speaker 1: 54:41

Yeah. So this is

Speaker 5: 54:42

interesting because it

Speaker 6: 54:43

is a, it is a switch, and that it's got a bunch of switching fabric, but it is also a MIPS processor in there. And so you can strap some RAM and flash to this switch and boot up Linux, running on the same chip as

Speaker 1: 54:56

the switch. Wow. Which is Yeah. That's awesome.

Speaker 6: 54:59

Their happy path. Like that's kind of what they expect you to do. That's their dev kit does. That's kind of the most common use for this chip.

Speaker 1: 55:06

That's right. And so this is a MIPS core. Adam, if you're wondering where we had MIPS in the product, it's this MIPS core that we actually can't use because it's not attached to any memory.

Speaker 5: 55:16

Yeah. We didn't attach the DRAM.

Speaker 3: 55:18

364 or anything on

Speaker 1: 55:20

there. No.

Speaker 5: 55:21

But but the but the result is that, we've now been told after Matt and I have asked several rounds of questions about, hey. How is this supposed to work? Or how like, we're not seeing x or y happening. They've now told us, like, listen. You need to use the SDK or not bother us with any questions arguably, the critical the critical sections.

Speaker 5: 55:46

And we were probably able to do it from this end on our own, but, yeah, we've been it's definitely been an adventure.

Speaker 1: 55:52

It was definitely an adventure. And, Matt, you had gotten us a a a dev, a dev board for this thing. And I'm used to these dev boards being, I don't know, you know, like, something you can hold in one hand. And your dev board, of course, it's a 54 port switch. I mean, it's a huge thing that you've got to go develop.

Speaker 1: 56:12

Oh, yeah.

Speaker 6: 56:12

It's like, like, 1 foot by 2 foot and is, shielded on both sides by, like, quarter inch acrylic plates. Is it Yeah.

Speaker 5: 56:20

It's more or less a a a a translucent rack mountable thing. It's 19 inches wide. Exactly. It's it actually could go into a rack, I think, if you were to put mounting ears on it.

Speaker 1: 56:31

And so, Matt, before we brought this stuff up, you had basically managed to talk to that thing directly via SPI from Hubris running on, I think, the Gimletlet. Right?

Speaker 6: 56:44

Yeah. Exactly. Because we don't wanna be using the internal MIPS processor. You can strap pins and cause it to boot up without that processor running, and then it acts as a SPI device. And so we had a separate microcontroller that was talking to it over SPI and just writing, you know, hundreds of registers on startup to bring the switch up with a mix of documented and undocumented settings.

Speaker 6: 57:06

Like, some of the SerDes in particular, the things which actually send stuff down the line, are just blobs of cryptic configuration that they don't really tell you what they do.

Speaker 1: 57:16

And was the hidden 8051, or the not so hidden 8051, not in this part, or does this one have 8051s? I mean, I'm sure there are 8051s everywhere. Why am I

Speaker 6: 57:28

The secret 8051s are in the PHYs, which are a separate chip. But it goes through this chip to talk to the PHYs. So, yeah, one of the things we discovered is that the SDK includes sets of code to configure very specific PHY chips. Like, if you have this PHY, you need to apply a patch, which is this array of binary numbers, to the 8051 core running inside the PHY because, of course, that's a thing.

Speaker 1: 57:54

And then so, Matt, somewhere along the line, and I should remember where it was in that week, we've got the 7448, then you've got these additional PHYs that you need to go talk to. And if I recall correctly, you were able to speak to one of the 8552s, but not the other. Am I remembering that right?

Speaker 6: 58:13

I'm not sure. That that doesn't sound that doesn't sound like what I remember.

Speaker 1: 58:18

The the the this is the thing that ended up being yet again the slew rate.

Speaker 6: 58:23

Oh, yes. So that was, yeah, we could talk to one of the PHYs that was configured one way, but one of the other PHYs, which we configured through a different interface, was refusing to talk. I'm sure Brian remembers this because I worked on this for an entire day. I told Brian I was gonna go home, and I'm sure that I would solve it first thing the next morning. And sure enough, you know, I go home and relaxing in the hotel room, I think to myself, well, we've already had 2 issues where the microcontroller slew rate on pins has caused things not to work, where you get to pick.

Speaker 6: 58:53

So when you're configuring a microcontroller pin, you can say, I would like this pin to be slow or medium or fast or very fast. And, of course, everyone picks very fast because, like, why would you pick anything but the fastest possible slew rate? I can hear all of the other, like, electrical and analog engineers wincing in the background here. And so sure enough, these pins were configured as very fast. And when I came in the next morning, the first thing I did was I changed them to slow, and the chip immediately came up.

Speaker 6: 59:20

And Brian came in, and I told him it was working now. And he looked at me like, what what did you do?

Speaker 1: 59:24

Well and in particular, because you've just been so, like, well, this doesn't work, but I'm gonna go home. I'm gonna sleep on it. I'm gonna come in. I'm gonna fix it. I'm like, wow.

Speaker 1: 59:30

Okay. Alright. Well, that sounds like that's a good plan, I guess. It sounds like you know what's going on. Then, that was that was great.

Speaker 1: 59:37

I was very impressed. I also have to say, I the you've been bit by this issue a couple times. I got bit by this issue on the slew rate. So the Adam, have you been did you get did you get any

Speaker 3: 59:49

No. I have not run into this

Speaker 1: 59:51

one at all. So So when you configure these GPIO pins, just as Matt says, you've got these what you are configuring is how quickly it rises or how quickly it falls. And as Matt says, it's like your options go from very fast to slow. So anyone you're writing software. Right?

Speaker 1: 01:00:08

Obviously, got it. Yes. I want very fast. Of course. Like, why would I want the slow one?

Speaker 1: 01:00:12

It's like, well, you want the slow one because if you raise the level very, very quickly, you are much more likely to get signal bounce. Or rather, you have to pay attention to how things are terminated. Because the signal now is this, like, abrupt sharp edge, and it's likely to bounce off when it hits an impedance mismatch. And someone can correct me on this, I'm butchering this explanation. But you'll get this reflection back.

Speaker 1: 01:00:45

And when you're talking over, like, SPI, you start literally hearing yourself. And as a software engineer, this makes you wanna cry. Because you see, like, rampant data corruption, but not total data corruption. It's like, I can see that this thing is trying to do the right thing some of the time, but then it does the wrong thing a lot of the time, and it can be really, really frustrating.

Speaker 5: 01:01:11

Yeah. What you effectively get is this ringing behavior where it takes a while for the signal to settle, because you get these reflections that are basically rattling back and forth between all your parts, especially if you have a bus that has multiple drops and then there's a header that is unterminated. So all that stuff starts reflecting back in all directions, effectively. So it'll take a little bit of time to settle. And if your part is sampling at the wrong time, or your controller is sampling at the wrong time, then it might be sampling at the trough of one of those oscillations as it bounces.

Speaker 5: 01:01:49

And, yeah, it will read a 0 or a 1 depending on where that is at that point.
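To make the ringing picture concrete, here is a minimal sketch in Python. It is not a model of the actual board: the damped-sinusoid shape, the 3.3 V rail, the 2.0 V input threshold, the 10 ns ring period, and the overshoot figures are all made-up illustration values. It only shows how the same logic-high level can read as 0 or 1 depending on when the sample lands.

```python
import math

# All constants below are illustrative assumptions, not measurements.
V_FINAL = 3.3         # settled logic-high level, volts
V_THRESHOLD = 2.0     # assumed input-high threshold, volts
RING_PERIOD_NS = 10.0
DECAY_NS = 15.0

def line_voltage(t_ns: float, overshoot: float) -> float:
    """Voltage t_ns after a rising edge: a damped ring around V_FINAL.

    `overshoot` is the initial ring amplitude as a fraction of V_FINAL;
    a faster (sharper) edge is modeled here as a larger overshoot.
    """
    envelope = overshoot * V_FINAL * math.exp(-t_ns / DECAY_NS)
    return V_FINAL + envelope * math.cos(2 * math.pi * t_ns / RING_PERIOD_NS)

def sample_bit(t_ns: float, overshoot: float) -> int:
    """What a receiver would read if it sampled at t_ns."""
    return 1 if line_voltage(t_ns, overshoot) >= V_THRESHOLD else 0

# Sampling at the first trough (half a ring period after the edge):
print("fast edge, early sample:", sample_bit(5.0, overshoot=0.6))
print("slow edge, early sample:", sample_bit(5.0, overshoot=0.1))
# Sampling after the ring has died down:
print("fast edge, late sample: ", sample_bit(60.0, overshoot=0.6))
```

With these assumed numbers, the sharp edge dips below the threshold at the first trough and reads as 0, while the gentler edge and the late sample both read as 1.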

Speaker 1: 01:01:54

Scroll. The other cool

Speaker 6: 01:01:55

part about this is that, if if you attach a logic analyzer, you'll be adding a little bit of capacitance to these lines, and that is often enough to make the problem go

Speaker 1: 01:02:03

away.

Speaker 5: 01:02:03

And that will, yeah, that might make it go away, because that will provide just enough termination, effectively, to make these rattles be a little bit less violent, and therefore they will settle quicker and things will get happier.

Speaker 7: 01:02:17

And, Adam, I mean, it's funny because obviously we

Speaker 1: 01:02:19

we In terms of the setting,

Speaker 3: 01:02:20

how do we how do we change the setting? Like, what it's just, Matt, you're just literally changing a line

Speaker 7: 01:02:25

of code?

Speaker 1: 01:02:25

Yeah. For a line of code.

Speaker 3: 01:02:26

From fast to slow? Yeah.

Speaker 8: 01:02:28

Yeah. That's awesome.

Speaker 9: 01:02:29

You can add a capacitor.

Speaker 7: 01:02:31

I mean, it it's basically setting a register in some you know, in the processor somewhere that tells it what to do.
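As a rough illustration of what "setting a register" means here, the sketch below edits a 2-bit-per-pin output speed field inside a 32-bit register value. The layout is modeled loosely on the output speed register found on some STM32 microcontrollers; the transcript does not say which part or register is involved, so the field encoding and the `set_pin_speed` helper are assumptions for illustration only.

```python
from enum import IntEnum

class Speed(IntEnum):
    # Assumed 2-bit encoding, slowest to fastest edge rate.
    LOW = 0b00
    MEDIUM = 0b01
    HIGH = 0b10
    VERY_HIGH = 0b11

def set_pin_speed(reg_value: int, pin: int, speed: Speed) -> int:
    """Return reg_value with the 2-bit speed field for `pin` replaced.

    Hypothetical helper: assumes 16 pins per port, 2 bits per pin,
    pin 0 in the least significant bits.
    """
    if not 0 <= pin <= 15:
        raise ValueError("pin must be in 0..15")
    shift = pin * 2
    cleared = reg_value & ~(0b11 << shift)   # clear the old setting
    return (cleared | (int(speed) << shift)) & 0xFFFF_FFFF

# The "one line of code" fix: pin 3 goes from very fast to slow.
before = set_pin_speed(0, 3, Speed.VERY_HIGH)
after = set_pin_speed(before, 3, Speed.LOW)
print(f"before: 0x{before:08x}, after: 0x{after:08x}")
```

On real hardware the result would be written back to a memory-mapped register; here it is just an integer, so the sketch can run anywhere.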

Speaker 1: 01:02:38

And we would love to have different nomenclature for this. It would be great to go from, like, safe to dangerous, which feels like people would not be using dangerous unless they knew what they were doing. But it goes from slow to fast. Everyone picks fast, and you actually wanna be, like, slow or medium. You want that rate

Speaker 5: 01:02:57

Slew rate as slow as you can get away with.

Speaker 1: 01:02:59

That's right. That is as slow as you can get away with is what you want. And you wanna default to be slow, not fast.

Speaker 6: 01:03:11

The fastest possible.

Speaker 1: 01:03:13

Right. Right. But now we're through that week, making really good progress. We had a PLL failing to lock. I remember that issue, and I know that you got it resolved, but I don't know what the resolution was on that

Speaker 6: 01:03:29

one? I think that was just one of the, like, bazillion configuration registers that I had to bring up exactly right in the right order. And what I ended up doing was compiling the SDK on my desktop, and replacing all of its register reads and writes with calls that would actually print what it was doing.

Speaker 1: 01:03:45

Oh, that's awesome,

Speaker 6: 01:03:46

Matt. It would pretend to configure the chip on my desktop, and then I would just read through every register write and compare that against what I was running to see if it matched. And sure enough, it was, you know, one of them that was different, and that was the problem.
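The trace-and-compare approach described here can be sketched in a few lines of Python. This is only an illustration of the idea: `RecordingBus` and `first_mismatch` are hypothetical names, the addresses and values are invented, and the real SDK and firmware are not shown.

```python
from typing import List, Optional, Tuple

Write = Tuple[int, int]  # (register address, value)

class RecordingBus:
    """Stand-in for hardware access: prints and records every write."""

    def __init__(self) -> None:
        self.writes: List[Write] = []

    def write(self, addr: int, value: int) -> None:
        print(f"write 0x{addr:04x} <- 0x{value:08x}")
        self.writes.append((addr, value))

def first_mismatch(
    reference: List[Write], candidate: List[Write]
) -> Optional[Tuple[int, Optional[Write], Optional[Write]]]:
    """Return (index, reference write, candidate write) at the first
    point where the two traces differ, or None if they are identical.
    A missing write at that index is reported as None."""
    for i in range(max(len(reference), len(candidate))):
        ref = reference[i] if i < len(reference) else None
        cand = candidate[i] if i < len(candidate) else None
        if ref != cand:
            return (i, ref, cand)
    return None

# Pretend "SDK" bring-up sequence versus our own (invented values).
sdk, ours = RecordingBus(), RecordingBus()
for addr, value in [(0x0010, 0x1), (0x0014, 0x00FF), (0x0018, 0x3)]:
    sdk.write(addr, value)
for addr, value in [(0x0010, 0x1), (0x0014, 0x00F0), (0x0018, 0x3)]:
    ours.write(addr, value)

print(first_mismatch(sdk.writes, ours.writes))
```

Comparing ordered lists rather than sets is deliberate, since the speaker notes the registers had to be written in the right order as well as with the right values.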

Speaker 1: 01:03:58

Oh, man. That's funny. I didn't realize you've done that. Yeah. Well, that was great.

Speaker 1: 01:04:01

And, again, a great find. And so we end up at the end of this 2 weeks, and, Ari, we are right on the cusp of being able to bring this thing up. I know you really wanted to get this thing all the way up to talking PCIe at the end of those 2 weeks, but we weren't quite there. We were super, super close, but we weren't quite there. But I thought we were ahead of where we wanted to be.

Speaker 1: 01:04:25

I don't know. Nathaniel, what what do you think?

Speaker 7: 01:04:26

Oh, I think we need to talk about the load bearing dongle right at the end.

Speaker 1: 01:04:30

Oh, yeah. Yeah. Yeah. Please go ahead.

Speaker 7: 01:04:32

So, I mean, over the course of this 2 weeks, for whatever reason, we basically had 2 FTDI dongles that we were using to program the FPGAs. And we only had 2, and one of them I had brought with me. And so, as we got these boards in different places, we have people running these dongles back and forth, because every time the board power cycles, we have to reload these images. And I nearly took the company down by taking my dongle home.

Speaker 1: 01:05:11

That dongle, the entire company is depending on that dongle. And I think Ari had already, like, I recall running out to get Ari before he left so we could get

Speaker 7: 01:05:20

the second dongle. Ari had left, and I think he came back to drop off an extra dongle.

Speaker 1: 01:05:27

We we the the the second load bearing dongle. That's right.

Speaker 5: 01:05:31

Steve, did you drive me back to the office? Was what what happened there? I know that we or or I drove back to the office maybe. Or we all drove back to the office. Yeah.

Speaker 5: 01:05:38

But, anyway, for the listeners, we had parts. We just hadn't had the time to solder more dongles together so that we could use them. This was all happening rather quickly, and those things quickly fall by the wayside. But, yeah, we now have sufficient dongles for the company.

Speaker 1: 01:05:56

And so now we are going into but I I mean, I thought we were right where we wanted to be in terms of we we had I mean, I think Well,

Speaker 5: 01:06:05

we managed to not flame one of these boards.

Speaker 1: 01:06:07

That's right.

Speaker 5: 01:06:07

Because we only had a couple, and so that would have been a real setback. So, hey, like, stuff's still working. No magic smoke has been released. So and and the PDN is running. So that seems good, which was good.

Speaker 1: 01:06:19

And we're getting I mean, we talked about the SPI wiggles, and Nathaniel's SPI wiggles on the SP3, and how those initial SPI wiggles were kind of that initial sign of life. And we've got something similar here, right, Ari, where the Tofino is gonna be loading its configuration from an attached EEPROM, and we're seeing that sign of life.

Speaker 5: 01:06:35

Correct. So Nathaniel posted some pictures on the thread, the space thread, mentioning U35. U35 is a little SPI EEPROM where there's a bunch of parameters that the SerDes IP and the Tofino ASIC need in order to configure its PCI Express link. Once PCI Express is up, you can then configure the rest of the ASIC. Everything goes over PCI Express. We messed up the footprint, so that was one.

Speaker 5: 01:07:00

Nathaniel had to do some surgery there to get a part in place that we happened to have. But then getting that to program was a little bit of a thing, where we ran again into the problem of slew rate being too fast. So here I sit at a table on the last Friday of bringup, frantically trying to get an external programmer to work, programming one of these little EEPROMs so that we can get these parameters in there, get it onto the board, and then fire up this thing. Because without these parameters, the PCI Express link is never gonna come up. And so we get stumped again by this slew rate being too fast, bit errors happening.

Speaker 5: 01:07:42

And I could program some pages reliably, and then some pages would not be reliable. And so, after a day of work, I was about to jump off the bridge, almost. And both Brian and Steve talked me off the ledge, and they're like, you go home, and Brian was gonna go and fix this over the weekend, which he ultimately did. But I think it even cost you still.

Speaker 1: 01:08:07

Oh, no. It it it definitely yeah. Absolutely. No. I got I got nailed by this thing as well.

Speaker 1: 01:08:12

So it definitely was more painful than I thought it was gonna be. But we got to the point where we could program it. And what we're doing is taking this kind of binary payload from Intel in some format. I don't think we've got any visibility into that format, right, Ari? It's a little goober that's gonna describe how it configures the PCIe links.

Speaker 1: 01:08:33

And we got it programmed that next week, and I definitely felt like, Ari, I think you and I were both joking that we're taking this little astronaut, programming this little ROM, and then sending it off into the rocket. This little tiny EEPROM has got all the information that's gonna be necessary for this fire-breathing monster that Eric has configured to actually work properly.

Speaker 5: 01:08:58

Well, actually, I should say what we first saw, after we had some headers so that we could look at the waveforms. As the device powered on and we let it go out of reset, the first thing we see is that, hey, it's starting to fetch. It's trying to read to figure out what kind of EEPROM is there, and then trying to read the configuration. There was no configuration in the EEPROM, but at least the part was showing signs of life. Like, the clocks were running, and it was doing stuff over SPI.

Speaker 5: 01:09:28

So that's where we left off, probably Thursday. And so we knew that the part at least had come out of reset and was doing something, and it was up to us to deliver these parameters through the SPI ROM to then get PCI Express to work. So, yeah, that took us about 2 weeks to get to: from boards in hand, not knowing what to expect, to powered up and ready to go and explore and get this part to work.

Speaker 1: 01:09:58

And I felt like I mean, obviously, you know, Nathaniel and Eric, I defer to, you know, Matt, I defer to your expertise, but I felt like that's about as smooth as that was gonna I mean, I felt like that was not gonna be a lot faster than that. That was about, as smooth as it's gonna get.

Speaker 8: 01:10:13

I think one way to gauge that, especially relative to Gimlet, was just our MCM process, which is almost like capturing our bugs in hardware and things that we need to fix for the next round. I was just looking: in Gimlet, we created 42 hardware bugs. And on Sidecar, we have 6. Yeah. Which just shows that we really cleaned up our process a lot.

Speaker 1: 01:10:44

Yeah. We did. I mean and and huge. Kudos to you, Ariane, and to the rest of the team. We we will definitely we got better, which is great.

Speaker 1: 01:10:50

I think that was very, very gratifying. And I think that went pretty smooth. And then, Ari, going into that next week, everyone kinda leaves Emeryville, and I feel like everyone comes out for the birth, you know, she's alive, everyone leaves. And now Ari is left with this 3 week old that we need to get to the next stage, which is getting it to talk over PCIe. So, do you wanna talk about the, I mean, the contraption?

Speaker 1: 01:11:30

The kind of, how would we get this thing to talk over PCIe, and your eBay score on that?

Speaker 5: 01:11:40

Yeah. So long story short, as I referred to earlier, this thing is connected over PCI Express through an external cable, but that is not a standard cable in any way. This is part of the custom cable backplane that we're developing for this rack. And so there's no off-the-shelf cables you can buy for this, and so we basically cobbled together what I love to refer to as the contraption, using an adapter board that would let us break out the individual conductors, the twinax conductors, effectively, into SMA cables, so that we could rig up a cable that would go from the connector underneath the board to a regular PCI Express slot on a consumer motherboard. Because we did not have the compute sled ready to be connected to this thing yet, because we could not run enough software on that just yet. And so we needed a more ready-to-go PCI Express host.

Speaker 5: 01:12:48

And so we put together a small motherboard with just an off-the-shelf AMD client part, a 5600X, I think. And we used a PCI Express test card from the PCI-SIG that you can use for link validation, which happens to have SMA connectors, or rather SMP connectors. And we then jumped that through the breakout board and then into the custom cable that goes underneath Sidecar. At this point, we have replaced that with a little adapter that will take our custom cable and pin it out to a regular CEM slot, so that we don't need that anymore. But as we were

Speaker 1: 01:13:30

starting Did

Speaker 3: 01:13:30

you literally get this on eBay?

Speaker 5: 01:13:32

No. So as we were working through this, basically, the link wouldn't come up. And so, okay, where do we start debugging this? Because now we're talking about a PCIe Gen 3 link, you know, 8 gigatransfers per second per lane and high speed clocks. And now you need serious measurement equipment to go and figure out what's wrong.

Speaker 5: 01:13:55

Fortunately, my bad late-night eBay test equipment buying habits led me to hoard some of this stuff. And we happened to have a PCI Express analyzer and a PCI Express exerciser, an older model from LeCroy, but we happened to be missing a crucial cable for this thing. And the cable is still current, and LeCroy won't sell me a used one of these. They tried to charge me $7 for a new one, which was a little bit out of the question. But it so happened that we managed to score one of the cards that goes with these devices that came with a cable for only 50, $1600.

Speaker 5: 01:14:39

So we bought the adapter card for the cable. And then with the cable in hand, which arrived like 1 or 2 days later, we figured out that at least the PCI Express lanes were connected correctly and the clock was connected correctly, because the exerciser could connect to this board.

Speaker 1: 01:15:03

And, Ari, did you drop in a photo of this thing in your Twitter thread? This thing is true Frankenstein. I mean, you've got, you know, a third of it bought on eBay. And when it was not working, Ari, you had a very clear idea of what you thought was going wrong. Yeah.

Speaker 5: 01:15:25

So I I I don't know. I have a picture, and I posted a picture at some point. But the problem is if I close the Twitter app, I don't know.

Speaker 1: 01:15:32

Yeah. No worries. Yeah. Yeah. We'll do we'll get back to Sarah.

Speaker 8: 01:15:34

Yeah. Yeah. Yeah.

Speaker 5: 01:15:34

I will post the picture later. But, yes. When I started digging a little bit, what I realized was that there's several different clocking modes for PCI Express, and we were assuming that both these systems could be independently clocked, which is a supported mode, but which is not the default mode. Most PCI Express configurations are what's called source clocked, which means that the root port, or the root system, will also provide the source clock by which all the cards, or PCI Express devices, are clocked. They will use that as their input to synchronize their SerDes directly.

Speaker 5: 01:16:12

So there's no there's no clock recovery going on. They basically just simply take that that clock input. There's a transfer function in order to determine how to how to phase how you need to change your phase relative to the clock. And then you start sampling bits. And then when implemented correctly, that will cause the bits to be sampled correctly and the link to work.

Speaker 5: 01:16:32

It's

Speaker 1: 01:16:32

This and this Could you just expand a little bit on what clock recovery is? Because it's

Speaker 5: 01:16:37

So clock recovery is the process of, so a lot of these high speed serial links like PCI Express or SATA or SAS or Ethernet now too, instead of sending a clock along with the data as a separate signal, because at these really high data rates you need to make sure that the clock and data stay in phase, and that causes some challenges, because you need to make sure that the clock never either leads or trails your data, but is right there within that exact window where the sink is going to use that clock to capture the data. A link that, for example, does do this, to some degree, is HDMI, where a clock is sent along, which is used for recovery of the sample point. Instead, PCI Express can rely, if you're not using this source clock mechanism, on clock recovery, which means that you're using the data stream to embed the clock. You're using certain character transitions in the data stream such that you can recover the clock.

Speaker 5: 01:17:54

You're using certain symbols that have certain bit patterns, effectively, that you then use to train a PLL so that you can recover the clock and then sample appropriately. But most client PCs that just have PCI Express slots in which you insert your graphics card or your networking card or whatever it is, any PCI Express card, assume more or less that these are source clocked systems. And so the AMD CPU or your Intel CPU, or the board itself, will provide a clock to these cards, and the chips on these cards will effectively align themselves directly. If you do not want to do that, there's 2 different modes that you can use for PCI Express where you can say, oh, both systems have an independent clock, and now the chips both need to recover these clocks. If you don't set that appropriately, the clock recovery functionality of these links will basically not be enabled.

Speaker 5: 01:18:53

It will be bypassed. And if that is not enabled, what you're then gonna have is that your data is gonna be out of sync. The clocks are gonna be out of sync between these two systems.
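A rough back-of-the-envelope sketch in Python of why two independent clocks cannot simply be left alone. It assumes a receiver that never re-aligns its sample point, a Gen 3 rate of 8 GT/s as mentioned earlier, and a 600 ppm frequency difference between the two ends (for example, two references each 300 ppm off in opposite directions). The ppm figure is an illustrative assumption, not a measurement from this setup.

```python
def unit_interval_ps(gigatransfers_per_s: float) -> float:
    """Duration of one bit time (unit interval) in picoseconds."""
    return 1000.0 / gigatransfers_per_s

def bits_until_drift(offset_ppm: float, drift_ui: float = 0.5) -> float:
    """Bits transferred before the sample point has slid by `drift_ui`
    unit intervals, for a receiver that does no clock recovery."""
    if offset_ppm <= 0:
        raise ValueError("offset_ppm must be positive")
    return drift_ui / (offset_ppm * 1e-6)

ui_ps = unit_interval_ps(8.0)       # PCIe Gen 3: 8 GT/s per lane
bits = bits_until_drift(600.0)      # assumed worst-case clock mismatch
print(f"unit interval: {ui_ps:.0f} ps")
print(f"half a bit of drift after about {bits:.0f} bits "
      f"(~{bits * ui_ps / 1000:.0f} ns)")
```

Under these assumptions the sample point slides by half a bit in roughly a hundred nanoseconds, which is why the link needs either a shared reference clock or working clock recovery.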

Speaker 1: 01:19:03

And I I mean, are you fair to say it's rather unusual to have a PCI device that's much larger than the host, which is the case precisely.

Speaker 5: 01:19:11

Yeah. Normally, for a directly plugged in device, this is not a problem. So the solution here was that we also had to send the clock through the external PCI Express cable to Sidecar, which we had accounted for. The cable included a diff pair to do that. But we needed to configure the clock generator that I was talking about earlier in the space to generate an appropriate PCI Express clock from the received clock that we got from the host.

Speaker 5: 01:19:45

And once that was in place, the link would come up. The exerciser that we attached to the system with that cable and the card that we bought from eBay was not as picky. That one doesn't really care; it just oversamples and then figures out where the clock needs to be, and it will still work, but your AMD client system will not. So once we figured out that this was simply a clock being out of phase, the fix was fairly simple. And, you know, responsible.

Speaker 1: 01:20:18

How did you figure that out? Was that just by kind of pedankment and knowing that No. I just

Speaker 5: 01:20:24

I had a hunch that that might be the issue and then started to look for some documentation. There's some good PCI Express reference material out there, including, basically, a data sheet for a clock chip from Renesas, which describes these different scenarios and why these different things work the way they work. And so, armed with that, I had a pretty good sense that this was a thing that we should explore. And once we did that and figured it out, it was pretty quick.

Speaker 1: 01:20:52

Yeah. I mean, once you did that, and that was in fact the issue, we were able to actually talk to it over PCIe, which was an exciting moment.

Speaker 5: 01:21:01

Yeah. Because once you get that working, it means that you can now get the 1,000,000 plus line SDK from Intel to go and configure the rest of the chip. Because with this chip, there's no way that you can do this without their help. There's so much logic and functionality going on in this thing that you are solely beholden to the software that is provided for something like this. This would take a long time to rebuild if you wanted to write that from scratch.

Speaker 5: 01:21:32

If you need to know how these ASICs work. Like, it could really work.

Speaker 1: 01:21:35

The route that Matt was able to take with the VSC7448 is just not practical with Tofino. It's

Speaker 5: 01:21:42

Over time, we still probably want to write a lot of this stuff ourselves, because this ASIC is designed for a variety of different rack switch configurations and rack switch deployments, and they are targeting SONiC, for example, as the operating system for this thing very clearly. And there's a lot of functionality in the SDK that we don't necessarily need. There's a lot of dynamic configuration ability that we are not going to be using, and so over time, we might start to cut out some of these pieces and maybe only rely on the lowest level C library that they provide to interact with the ASIC, and then build up functionality over time on top. Because it's a pretty large chunk of C++ that is difficult to understand. So

Speaker 1: 01:22:37

So while we're doing this so we're we kinda hit the end of January. We made a great progress there. Maybe to back up a little bit and now talk about Gimlet. Because Gimlet's now happening in parallel. And we have we've got kind of we've hit a couple of stumbling blocks with Gimlet in January that are proving pretty frustrating about the same time.

Speaker 1: 01:22:59

Nathaniel, do you wanna talk about what we, I'm not sure if Robert is there via his voice on Earth, Steve. But do you wanna talk about some of the things that we were hitting with the T6? I know RFK, too, actually.

Speaker 7: 01:23:15

Yeah. So the T6, you know, we have a PCIe connection on the board; it's soldered to the Gimlet, with the Milan processor. And so pretty quickly after we got power sequencing set up there, we were able to start communicating over PCIe with that thing, I think. Right? I think we were talking.

Speaker 7: 01:23:40

Well, I guess No.

Speaker 1: 01:23:42

No. No. I mean, we thought we were

Speaker 7: 01:23:45

at one point in time, but, like, as we went through it, provably, we weren't. And it's like, there's not that much stuff for that chip. I mean, it's got a few strapping options and, and a a big PCIe lane.

Speaker 1: 01:24:01

And that thing is not wanting to come out of reset. RFK, do we do so Well, in

Speaker 3: 01:24:06

in the context, this is the NIC in the server sled.

Speaker 1: 01:24:11

In the server Sled.

Speaker 3: 01:24:12

Just to be really clear, like,

Speaker 4: 01:24:14

very important, like Yeah.

Speaker 3: 01:24:15

We need this to work.

Speaker 7: 01:24:16

Yes. Our company does not exist if this doesn't work. Right? So this is the dual 100 gig NIC, and there's a lot of software that we have created to get access to the PCIe stuff. And so Robert has done a great job untwisting all of the deep details on the Milan's PCIe cores and everything.

Speaker 7: 01:24:42

But but we have a lot of software that, like, hasn't been totally checked out in here. And then in addition to this, we were struggling to get this chip to respond the way we expected it to over PCIe.

Speaker 1: 01:24:54

Well, and in particular, this thing is just not coming out of reset. Yeah. And alright, RFK, we pulled up the clock, and it was pretty clear that the clock could use some improvement.

Speaker 9: 01:25:06

Sure. It it could be better.

Speaker 1: 01:25:09

I mean And then it got better, and then it still did not come out of reset.

Speaker 7: 01:25:13

Oh, okay. It got better. Problems. Like, RFK and I were out there in February, and we basically off-boarded the chip, I mean, off-boarded the clock to our own clock generator. We desoldered and hacked on SMAs, because the clock was looking bad. It turns out, after going back through it, we realized that due to the part shortage, we had dual-stuffed oscillators to have options, and both of them were actually still stuffed and were driving the PLL chip.

Speaker 7: 01:25:53

And that

Speaker 1: 01:25:53

was making the PLL chip

Speaker 7: 01:25:55

a little bit unhappy.

Speaker 3: 01:25:56

Wait. Wait. So, Nathaniel, does this mean we had both parts on the board

Speaker 5: 01:26:00

so that

Speaker 3: 01:26:01

if we weren't able to source 1, we'd be able to just, like, but it's always intended to sort of be half populated. Right?

Speaker 2: 01:26:07

We're not

Speaker 7: 01:26:07

Correct. Yeah. We we only need one of the 2 possible oscillators we can use here. However And we need exactly 1. We need exactly 1 and no more.

Speaker 1: 01:26:16

That's right.

Speaker 7: 01:26:17

And as we went through BOM generation for the board, something got lost in translation, and we ended up actually both buying and soldering in both circuits for the clock chip. And it turns out that the PLL does some clock cleaning, and it's kind of good, but it gets a little bit mad when its one input has 2 clocks. And

Speaker 9: 01:26:51

you get some non in your signal, and it just won't go away. Right. And you can, like, try to change the termination around and stuff, and it just, like, doesn't stop. And it's very strange because it looks like termination issues, just like impedance mismatches. But turns out that that's incorrect.

Speaker 9: 01:27:09

And I'm shocked the thing works at all with 2 clocks.

Speaker 1: 01:27:13

Well, it

Speaker 7: 01:27:14

yeah. I yeah. And so we we probably spent a good, like, 2 days or more chasing that down.

Speaker 1: 01:27:21

Right.

Speaker 9: 01:27:21

Well, to the point of, yeah, like you had said, basically, we removed everything and injected an external clock. We bought a clock generator that didn't get there in time, so we used a dev board of the same chip that is on the board in order to get the new signal, piped in the best clock possible. We were sure it was gonna work. Fired it up and it failed, which was really sad.

Speaker 7: 01:27:47

It and it failed exactly the same way Yeah. That they did 2 years ago.

Speaker 9: 01:27:51

Information available from this large effort. Excellent.

Speaker 1: 01:27:57

And I feel like this was such a flashback to the SP3, where we just could not get the thing to come out of reset. And What's

Speaker 2: 01:28:05

the power of this time?

Speaker 1: 01:28:07

Exactly. That's right. Well, it's and then start talking about the Eye of Sauron. Well and we don't know what this thing is. And it's like and and Nathaniel, who would you come out to Emeryville for the sidecar bring up.

Speaker 1: 01:28:18

I mean, RFK poor RFK. RFK used to do all this, like, because I all these hypotheses take time. And so RFK said they're reworking this thing over and over and over again. Nathaniel's like, I'm gonna go out there to help him. And that was was that the 1st week of February, Nathaniel, that you did?

Speaker 1: 01:28:34

Yeah. Yeah. And we are going through absolutely everything in this part. Chelsio is being helpful. But on the other hand, we're hearing, you know, we've kind of never seen this before.

Speaker 1: 01:28:49

It's like, oh, shit. This again. But, you know, why is this thing not coming out of reset? And exploring, and Nathaniel, you were out for that week, and I think that you came out with confidence that you and RFK were gonna get that nailed in that week, and Robert would get that nailed.

Speaker 7: 01:29:07

Yeah. I I mean, I thought between 3 of us focused on that issue for, you know, 5 days or, you know, most of 5 days, we and we would like, they're just there are only so many things. And and, you know, I mean, on Friday, as I had to go catch the train, like, I was just it was kinda sad because, you know, it was like, oh, you know, we we didn't like, we're no different than when I came out on Monday. And I mean, we know a lot of things

Speaker 1: 01:29:34

that we do in this,

Speaker 7: 01:29:36

But and we've made a lot of improvements and found various problems, but none of them have really, like, moved the needle. And then, like, I had to go on vacation, so I was, like, you know, headed home and then, like, out of the office for the next week.

Speaker 1: 01:29:48

Oh, and, Lucado, I like that at the end of the week, you're like, well, in conclusion, this thing should be coming out of reset. That's right. I demand it to be so. That's right. And then I feel it was about this time that maybe that that next week because I think on the end again, I'm not sure if Steve and Robert are there, but, Ari, you back to sidecar, you have the big breakthrough of actually getting this thing talking over gen 3 in the middle of that week.

Speaker 1: 01:30:18

Like, the I think that the 11th is the date.

Speaker 7: 01:30:20

So my head. I I think that, I was on the train and I got chatted that Arian had made a big breakthrough, I think. So, like, that happened that Friday, I feel like.

Speaker 1: 01:30:31

That Friday. Okay. Yeah. But it was so that was

Speaker 5: 01:30:35

I waited until Nathaniel was gone, That's when I turned it

Speaker 1: 01:30:38

on. There you go.

Speaker 5: 01:30:39

That's the asshole I am.

Speaker 1: 01:30:43

But I just remember, we now have this very sophisticated part in terms of the Tofino that is now doing really well. And I think it was then, Ari, that I just remember really shouting at the T6. It felt really good, actually, to shout at the T6. To be like, you know, everybody is here, T6. We've got, like, the SP3 is here.

Speaker 1: 01:31:04

Tofino is here. We've got the management network here. We've now brought all these parts out of reset. This fucking NIC just will not come out of reset.

Speaker 10: 01:31:13

And then I step into the picture where everyone has been working on it really hard, and I've been watching from the side

Speaker 1: 01:31:27

conversation about this. I am feeling down. I am, like in part because, Nathaniel, very optimistic engineer, Nathaniel's like, in conclusion, I have no fucking idea what's going on. And so, Rick, you and I are talking, and you're and, Rick, I just remember you being like, we'll get there. We're gonna figure this out.

Speaker 1: 01:31:46

And I'm like, I'm really not sure that's the case. I am really like I I wanna believe you, but, Rick, you were very, you're like, look, we'll get there. And then I think, Rick, would that heat up?

Speaker 7: 01:31:58

And thankfully, Rick has a history of modifying off the shelf boards to help us out.

Speaker 1: 01:32:05

Yes. So, Rick, you yeah. You're right.

Speaker 7: 01:32:07

So, like, Rick has an Ethanol that's missing, like, 30% of its BOM, but

Speaker 5: 01:32:12

it still moves.

Speaker 10: 01:32:14

Right. And so that was the same thing here as I had a couple of T6 add-in cards that I had used for a variety of early experimentation. So I don't mind modifying them and seeing what happens. And with this, you know, taking the same approach. Okay.

Speaker 10: 01:32:32

So we have a design that we made that we've been hacking on and trying to get working. We can't figure out why it won't start. So there's a whole lot of things it could be, and it's, sure, we can spend a lot of time trying each of those. But we're not it it you know, you're kind of rolling the dice as to, can I guess which thing it actually is that's wrong? And so I took the opposite approach.

Speaker 10: 01:32:55

I took a known working board, and just started modifying it to be closer and closer and closer to our reference design or, to our our actual design until it broke. And, in this case, it took, I think, 3 tries, and it it was really, you know, some informed guesses. We we had looked at a lot of why we might see the behavior we did. And it turns out that there's some strap, resistors. So, essentially, you, you know, you have some resistors that pull a pin to either ground or VCC to set some initial configuration.

Speaker 10: 01:33:40

And a couple of those on the T6 choose which clock source to use. And this is really important because it's the clock source that's used for the very, very, very early startup of the chip. So the very first thing it should do is go through this hardware-based sequence of loading some configuration out of a SPI ROM and then actually starting up. And then you should see some output on pins. And we weren't seeing this happen.

Speaker 10: 01:34:07

And so that's why we were focusing on clocks so much, as we thought, oh, it's just gotta be, you know, the quality of the clock going in is not good enough. Well, I changed the resistor from the value that was on the add-in card, which was suspiciously small at 500 ohms. And I put in what we have listed in our BOM for our design, which is 10 k ohm, and it doesn't work.

Speaker 1: 01:34:34

It doesn't

Speaker 3: 01:34:35

work in the same way

Speaker 1: 01:34:36

that ours is not working.

Speaker 10: 01:34:37

Exactly the same way. And so I typed this into chat, and I think it it was pretty late. Because like you said, Brian, I had talked with you at, like, 3:30 PM Pacific time. And then I went and did this, and it was, you know, 4:30 or something. People were mostly done for the day.

Speaker 10: 01:34:52

And I say, so I changed this, and it broke. So there's a very quick rush to go take one of our gimlet boards to change out these resistors to 500 ohm to see what happens. And sure enough, it works.
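
The approach Rick describes here, starting from a known working board and moving it toward the failing design one change at a time until it breaks, can be sketched in code. This is only an illustration: the setting names, the values other than the 500 ohm and 10 k straps, and the `toy_bench_test` check are hypothetical stand-ins for a real bench test, not anything from Oxide's tooling.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, int]


def first_breaking_change(
    known_good: Config,
    changes: List[Tuple[str, int]],
    works: Callable[[Config], bool],
) -> Optional[Tuple[str, int, int]]:
    """Apply changes one at a time to a working configuration.

    Returns (setting, old_value, new_value) for the first change after
    which `works` reports failure, or None if every change still works.
    """
    current = dict(known_good)
    for name, new_value in changes:
        old_value = current[name]
        current[name] = new_value
        if not works(current):
            return name, old_value, new_value
    return None


def toy_bench_test(config: Config) -> bool:
    # Toy stand-in for powering up the card and watching for startup.
    # It only mirrors the outcome described in the conversation: the card
    # started with a 500 ohm strap and did not start with 10 k.
    return config["clock_strap_pulldown_ohms"] <= 500


# Hypothetical description of the working add-in card.
add_in_card = {"bulk_capacitance_uf": 100, "clock_strap_pulldown_ohms": 500}

# Hypothetical ordered list of changes that move it toward the failing design.
toward_our_design = [
    ("bulk_capacitance_uf", 47),
    ("clock_strap_pulldown_ohms", 10_000),
]

culprit = first_breaking_change(add_in_card, toward_our_design, toy_bench_test)
print(culprit)
```

The point of the sketch is the ordering: every step starts from something that is known to work, so the first failure points at a single change rather than at a long list of suspects.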

Speaker 1: 01:35:11

And the thing that's kind of shocking about this is, Rick, it is 10 k on the schematic, but, Eric, you had already changed that to 1 k's.

Speaker 9: 01:35:22

We have changed these probably 6, 7 times to 1 k's, which should work based on the IBM specification of the IO of the die. That should be fine.

Speaker 1: 01:35:36

That should be fine. And, Rick, you had said that if you had known that there were 1 k's on there, this is an experiment that you might not have done.

Speaker 10: 01:35:43

Right. Because, I mean, this was being that these are pull downs, you'd expect that they don't have a lot of current draw in them, and you can run fairly high values. And that's where it being 500 ohm in the official Chelsio reference design seemed odd. But, yeah, I probably wouldn't have tried it had I known they had already tried 1 k.

Speaker 1: 01:36:07

But changing that into a 499, actually worked. And, Steve, I don't know if you and Robert are there, but if you got some some retelling of when that thing came out of reset, I believe unfortunately, I was physically out of the office. I didn't take a call, but there was, there was some exuberation when that thing came out of reset.

Speaker 4: 01:36:28

Yeah. There was definitely some exuberation. There was also I I think I recall and by the way, Robert had to depart, unfortunately. So the the value coming from this speaker drops off drops way off at this point in the, narration. But, yeah.

Speaker 4: 01:36:42

No. I remember coming back into the office and a couple of of of loud expletives, that that erupted from the folks that have been working hard at work on this one. And just, like, kind of just that that unbridled joy when you've been just banging away at something for so long and and not seeing a clear path to resolution and then nailing it. And, it was it was a very exciting moment.

Speaker 1: 01:37:09

I believe Robert was screaming, we're gonna live. We're gonna live. We're gonna live over and over again. Is that correct? Yeah.

Speaker 1: 01:37:14

There's

Speaker 4: 01:37:14

definitely an ex Yes.

Speaker 1: 01:37:15

That's right.

Speaker 4: 01:37:15

In the middle of that, but yeah. Definitely.

Speaker 1: 01:37:17

That's right. Right? Yeah. And freight. Fucking 500 ohm resistor.

Speaker 7: 01:37:24

And, like, that's one of those things. I mean, I'll remember where I was. I was on vacation, and I saw that come through chat and was, like, bouncing off the walls and you know? But, one I mean, one of the takeaways is, like, you know, for a bunch of experienced people looking at this stuff, like, you just look at pull up resistors and say, oh, yeah. We have the pull ups.

Speaker 7: 01:37:45

But, like, we really should've gotten on there and probed voltage measurements because that would have told us pretty clearly that we weren't getting pulled where we thought we were getting pulled.
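
One way a pull-down can fail to pull a pin low is if something else is pulling the other way, so that the pin sits at a voltage divider between the two. The conversation does not say that this is what happened on the T6, so the sketch below is only an illustration of why a voltage measurement is informative. The supply voltage, the opposing resistance to VCC, and the logic-low threshold are all hypothetical values chosen for the example; only the three pull-down values come from the discussion.

```python
def strap_voltage(vcc: float, r_opposing: float, r_pulldown: float) -> float:
    """Voltage at a pin with r_opposing to VCC and r_pulldown to ground."""
    return vcc * r_pulldown / (r_opposing + r_pulldown)


VCC = 3.3           # volts, assumed supply for the example
R_OPPOSING = 2_000  # ohms, hypothetical resistance pulling the pin toward VCC
V_LOW_MAX = 0.8     # volts, hypothetical highest voltage still read as low


def reads_low(r_pulldown: float) -> bool:
    """True if the pin voltage is at or below the assumed low threshold."""
    return strap_voltage(VCC, R_OPPOSING, r_pulldown) <= V_LOW_MAX


# Pull-down values mentioned in the conversation: 10 k, 1 k, and 499 ohm.
for r in (10_000, 1_000, 499):
    volts = strap_voltage(VCC, R_OPPOSING, r)
    print(f"{r:>6} ohm pull-down -> {volts:.2f} V, reads low: {reads_low(r)}")
```

With these made-up numbers only the 499 ohm part gets the pin under the threshold, which is the kind of thing a probe on the strap pin would show directly.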

Speaker 9: 01:37:54

Yeah. Definitely. I mean, that would have been even little stuff like that, especially when you're working with vendors where they don't really build all their own IP. Right? So

Speaker 1: 01:38:04

Right.

Speaker 9: 01:38:05

They're getting the stuff from IBM or they're getting, you know, wherever. And they're implementing it, and it's working in their system. And you they they they're gonna go go over it about as much as they say, hey. This works, and it's in the specification for the requirements that we have for our product, and it's fine. So they're not gonna make sure that it's nice for you.

Speaker 9: 01:38:24

And probing all these things is pretty critical because you're gonna get weird stuff like this, and it's gonna go on your top ten list of most bullshit things that you've had to deal with.

Speaker 1: 01:38:32

Well and I also felt it was very vindicating, Rick, of your approach, which you had taken on Gimlet as well. It had yielded instead an Ethanol-X that was, like, missing many of its parts and still somehow booting. So it was still, like, valuable information. But this approach of starting from kind of the other end with something that was working and trying to get it closer to the thing that's not working really paid dividends here. So that was a very big breakthrough.

Speaker 1: 01:39:05

And it was kind of a relief to I mean, on the one hand, yes, we should have, I mean, obviously, done this voltage measurement, but then on the other hand, like, pretty surprising that that pull down has to be that strong, that a 1 k was actually not sufficient. Right.

Speaker 7: 01:39:17

Right. That certainly is, like, not kinda kinda what you expect when you see parts with pull downs. And so I think, like, that's sort of why, you know, it was easy to gloss over as we looked

Speaker 1: 01:39:35

another thing that we kinda hit along the way, and I believe I know Robert is off, but the and and I'm not sure if if if Rick is still here or not, but the the footprint issue we had, Nathaniel, where Rick had done the rework. I don't when did that

Speaker 10: 01:39:49

just gonna say.

Speaker 1: 01:39:50

Yeah. Yeah. All the things that

Speaker 10: 01:39:52

that you don't expect to be wrong is where pin 1 is on various packages.

Speaker 7: 01:39:58

Yeah. So we had I mean, this would have been this would have been probably this was before Christmas, I think, that we determined this, because I I think, our PCB guy did that kind of as a project over over Christmas break.

Speaker 1: 01:40:12

But Right. I just had to do it a lot earlier.

Speaker 7: 01:40:15

But we we had this so we have these, like, 16 pin DFN packages. So they're kinda like a little QFN. And, like, various thing you know, we have this whole complicated hot plug logic network with, you know, things go through, some AND gates and some, inverters and that kinda of thing. And, you know, like, in in late November, early December, you know, Robert and I are on the phone talking, and, you know, he's measuring stuff on his board. And it's like, you know, this the input to the chip, it looks like a shorted output.

Speaker 7: 01:40:49

And I'm looking at the board I've got here, and I'm like, yep. Yep. No. I can confirm that.

Speaker 1: 01:40:53

That that that doesn't look right. That's bad. Right.

Speaker 7: 01:40:56

And so then, you know, then it's like, okay. Well, something's gotta be wrong because, like, this you know, there should be an output and an input, and you can't get, you know, in the middle somewhere. And so, you know, looking at the footprint, looking at our CAD, I realized that, our CAD had been built with basically the pin numbers all rotated by a single pin around the chip. And so and the way a lot of these parts work is, you know, you'll have, like, an input and next to an output, next to a ground, next to a power. And so when you rotate them, you just get a a huge mess in your netlist.

Speaker 7: 01:41:28

And so, you know, we had realized that. And so, you know, after we realized that, you know, it's kinda like, well, what do you do? Because that's kind of, like, they're the parts are smallish. They're, you know, it's they're not unre reworkable, but, like, this is probably not something we wanna do 12 times on our fleet.

Speaker 1: 01:41:46

You say smallish To any normal human, they are extraordinarily small. So could you describe, like, the actual size? These are small.

Speaker 7: 01:41:54

Size there. The chip is, like, smaller than your pinky finger's fingernail. So and there are 16 pins in that space.

Speaker 1: 01:42:03

I was gonna go, like, it is getting into, like, grain of rice territory. Well, it's bigger.

Speaker 9: 01:42:07

It's a DFN.

Speaker 7: 01:42:09

It's a DFN. So it's bigger than that, but it's, I mean, it's it's small. And and the the problem with DFNs, especially, like, when you're just playing around in your home lab, is that there aren't there's not much for a lead there because they're kind of leadless. And so, it's it's tough to, you know, tough to do anything with them. And then given the issue that we had, it wasn't like we had to fly a pin or something.

Speaker 7: 01:42:33

It's like we need to rotate by 1 pin, which, like, your square little, you know, DFN basically can't do that. And so

Speaker 1: 01:42:44

So Rick volunteered to rework this.

Speaker 7: 01:42:46

Yes. I we can let him talk about what

Speaker 1: 01:42:50

he said. I think we're also the Twitter feature too.

Speaker 10: 01:42:52

I mean, some some context. Right? So this is this is a bunch of discrete logic gates that are used to do some complicated logic around the signals from a PCI edge card, you know, like a a normal plug in card slot. But in our case, it's a little different because we we know what's being plugged into there. And so we've we've changed some of the meaning of these signals, and we wanna detect certain situations.

Speaker 10: 01:43:21

So it was important that we actually validate that the logic worked correctly before we cut the next revision of the board. So we couldn't just, you know, say, oh, well, we'll fix the footprint. And then in the next revision, we'll actually test the logic. Like, we we needed to actually validate that what how we thought this worked actually worked correctly, and that that would work with the AMD processor, because it it has a lot of assumptions about the how the PCIe hot plug signaling works. So so I volunteered.

Speaker 10: 01:43:56

I I have enough of a rework station at home that I could do this. And then they told me how many of these footprints were on a board. So It was a big 12. Well, because the same footprint was used for a variety of these discrete logic gates. Yeah.

Speaker 10: 01:44:15

It ended up being, like, 12 or 13 footprints. And I'm like, okay. So how bad is that? Well, these are I forget what they were. Like, QFN

Speaker 1: 01:44:26

16.

Speaker 7: 01:44:27

16.

Speaker 10: 01:44:28

Sixteen pins. So 16 pins times, you know, 12 packages. You're you're talking a lot of pins, and these are pretty small pitch.

Speaker 1: 01:44:36

These are super small pitch. Adam, I if you could you next time you're in the office, you gotta see this thing physically. It this is, like, nuts to me. I mean, I know this is, like, by reworking standards, this is merely, like, workably small, not unworkably small, but this is really fine work.

Speaker 10: 01:44:53

And and that's why when I say have a reasonable rework setup like, I have a microscope. I have, you know, a fine pitch soldering iron and, you know, tweezers and all the all the setup. And I'm not set up to do really, really small things, but this was in the I can probably make it happen. Now for people who are not familiar with this, like, I'm using 32 gauge wire, and that's big for what I was doing. So I set to work.

Speaker 10: 01:45:22

I went through and removed all of these chips because they were all as Nathaniel pointed out, they were doing things like shorting outputs to inputs and and various other things. So I wanted to remove them all just to eliminate any sort of unintended behavior. And then I was like, no. There's no way I'm gonna be able to sit here and solder individual wires because I I have to actually what I'm what I'm doing is taking the chip and flipping it over on its back and hot gluing that to the PCB.

Speaker 1: 01:45:57

You you're you're dead bugging them. Right?

Speaker 10: 01:45:59

I this is literally what they call dead bugging because it looks like a dead bug. Right? You squish it against the back of the the PCB with its legs up in the air. And, so now I have to attach wires to each pin on the chip and run that over to the correct pad on the PCB to fix this mistake. And, yeah, the prospect of doing that 16 times on 12 chips was was not good.

Speaker 10: 01:46:26

So I started with 1, and that took about 20 to 30 minutes to do one chip. And that that was that was a lot.

Speaker 7: 01:46:38

Yeah, man.

Speaker 10: 01:46:38

So I I went back to the schematics and figured out, okay, if I just wanna make one of the PCIe slots work so I can test this, How many chips do I have to do? 4. Okay.

Speaker 10: 01:46:50

I can do 4 chips. Four chips is somewhat reasonable. 3 and a half hours later, I had been staring through the microscope the whole time, doing reworks. And part of the problem is that as you get further and further in, you have denser and denser wiring. You have wires overlapping.

Speaker 10: 01:47:11

And so as you're working on something, you'll you'll bump a previous one, and that's just enough, tension. Like, you didn't get a good soldering joint, so it'll break free or, you know, you'll melt one wire by accident when you're working on another one. So, yeah, 3 and a half hours of microscope work later, I had it all reworked for these 4 chips so that I could test that the hot plug circuitry worked the way we thought it did. And it turns out it is.

Speaker 1: 01:47:39

This is great.

Speaker 10: 01:47:41

For the most part, there there is one quirk that we had to go back and and rework, which is, you know, important. Had we not tested this, we would have built revision b

Speaker 1: 01:47:49

Yeah.

Speaker 10: 01:47:49

Incorrectly. So it it was important work to do, but this is the time like, for all that we talk about, the time spent on things like the pull up resistor, where you're trying to understand what the problem is. And it's actually the fix is really trivial once you get down to it. It's it's the hard part is figuring out what the problem is. In this case, the problem was really obvious.

Speaker 10: 01:48:11

Right. Just the sheer amount of effort required to do the rework that takes all the time.

Speaker 3: 01:48:18

Rick, this is heroic. I I I posted, what I think is your picture of of one of these chips. But one of my questions is, so how did the schematic get into ORCAD or whatever incorrectly? So this is this is one

Speaker 7: 01:48:33

of those, scenarios where, you know, you you have to verify everything because, actually, the schematic is correct. So if you look at the schematic and you look at the part datasheet, everything matches. And, like, I was wrong. This is actually a 14 pin part, not a 16, but it's 4 millimeters square. So, like, this is not a big part.

Speaker 7: 01:48:54

But yeah.

Speaker 1: 01:48:55

That is a big grain of rice, I'd like to say. That is in grain of rice territory. I feel I can find a 4 millimeter square grain of rice. Sorry. Go ahead.

Speaker 7: 01:49:04

So the symbol that we drew for the schematic was correct. But when the layout person built the footprint for it, where they started their pin 1 was actually on pin 2 or maybe on pin 14 depending on I don't remember which way it was rotated. And so what happens in that case then is what we think is electrically 1, and what the chip thinks is electrically 1, ends up actually being electrically pin 2. And so it's just like a human error mismatch between the of thumb. But in this case, you know, it like, it was just a human error that messed that up.

Speaker 7: 01:49:56

But so that's how you get the mismatch.
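
The effect of a footprint whose pad numbers are rotated by one position can be sketched as follows. The 14 pin count comes from the conversation, but the actual part is never named, so the pin functions below are borrowed from a typical quad AND gate pinout purely as a hypothetical example, and the direction of rotation is assumed.

```python
PIN_COUNT = 14

# Hypothetical pin functions, pin 1 first (a common quad AND gate layout).
CHIP_PINS = [
    "1A", "1B", "1Y", "2A", "2B", "2Y", "GND",
    "3Y", "3A", "3B", "4Y", "4A", "4B", "VCC",
]


def chip_pin_under_pad(pad_label: int, rotation: int = 1) -> int:
    """Physical chip pin that lands on the pad carrying this label.

    rotation=1 models a footprint whose "pin 1" pad was drawn where the
    chip's pin 2 actually sits; rotation=0 is a correct footprint.
    """
    return (pad_label - 1 + rotation) % PIN_COUNT + 1


def function_of(pin: int) -> str:
    """Function of a chip pin in the hypothetical pinout above."""
    return CHIP_PINS[pin - 1]


# The schematic symbol is correct, so the net on pad N is the one meant for
# chip pin N. The rotated footprint connects it to a neighbouring pin.
for pad in range(1, PIN_COUNT + 1):
    intended = function_of(pad)
    actual = function_of(chip_pin_under_pad(pad))
    print(f"net for pin {pad:>2} ({intended:>3}) is wired to {actual}")
```

In this made-up pinout the net meant for the 1B input ends up on the 1Y output, and the net meant for VCC ends up on an input, which is the kind of mess in the netlist described above.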

Speaker 1: 01:49:59

And it's kind of amazing that it doesn't happen more frequently. I mean, Nathaniel, you had made this point with Gimlet that there about how lucky we had been in so many different dimensions. But but and lucky that this didn't happen more frequently, honestly.

Speaker 7: 01:50:14

Yeah. Well, most of our schematic symbols and footprints have all been, like, created as part of, you know, this company and by, you know, various people. So we had some partners helping us with some of those. And then, you know, we now have some of that in house. And so a lot of this has been, you know, like, kind of created by a bunch of different people.

Speaker 7: 01:50:39

And, you know, a little tiny error where you just happen to, you know, miss where pin 1 is, you know, can have, you know, devastating effects. And, you know, as we go through and mature our library process, there will be more reviews and more people looking at this stuff so that there's kind of a double check, because, you know, like, these kinds of pins can be, you know, this kind of a problem can be very expensive. And but, you know, these are, like, popcorn logic parts. And so, you know, we did a lot of looking at the T6, and we did a lot of looking at the arm. And we did a lot of looking at the SP3, but, like, in this case, this was just missed.

Speaker 7: 01:51:14

So but we've we've had very few of these. So we we we can be thankful for that, and our library process is getting better as well.

Speaker 1: 01:51:21

It is. And yeah. Again, we got lucky, and then on this issue, I mean, again, heroic rework from Rick to at least allow us to validate the design. Because what we were trying to do in this kind of parallel to bringing up Sidecar is we're trying to get the design for Gimlet completely verified. And I'm trying are there other mishaps?

Speaker 1: 01:51:45

I mean, I feel like we we we hit most of the Gimlet mishaps here. Or am I you know, am I missing a glaring

Speaker 7: 01:51:51

one? No. I I think that was most of them. And, you know, like and then on sidecar, I mean, I posted a a picture in the Twitter thread about, you know, we had, like, the wrong package called out. So, like, we bought the wrong parts that didn't fit the footprint.

Speaker 7: 01:52:04

So that was the flash part that I reworked on there. But we haven't had any, you know, anything that was, like, that crazy of a miss from, you know, datasheet to, you know, to implementation. I think on Gimlet, we had, you know, some we had to use some copper tape and beef up some, some power rails in a couple of places. So that was, you know, a little bit, nasty in in a certain manner speaking for but, you know, Eric did nice work there.

Speaker 9: 01:52:34

Management network. That was a good one.

Speaker 7: 01:52:37

Yeah. The management network had some polarity swapping and some, you know, other issues. And, I mean, like, TX and RX swapping because, some of the parts have kind of a a sad symbol naming.

Speaker 5: 01:52:53

Yeah. But there, we we managed to do it wrong twice, both on the connector end and on the chip's end. So and then we and then we fixed the chip end, and then we were wondering why the link still wasn't working. And then we realized we had to reverse the connector end now too. So, yeah, that was a fun one too.

Speaker 5: 01:53:09

Chip.

Speaker 1: 01:53:10

Revero and default. Yeah. Exactly. And then, Arjen, do you wanna talk about kind of the latest update on Sidecar? So you got PCI up from a configuration perspective.

Speaker 1: 01:53:21

Kind of the last major piece on sidecar was actually like, okay, did the links work at all?

Speaker 5: 01:53:27

Yeah. So with some software running now, we've been working on a switch management piece that can program tables and program the P4 program into the ASIC and bring the links up. And we recently managed to get some links working, looped back on themselves. So one port of Tofino to another port of Tofino, those links that we in the several different configurations ranging from a short little loopback cable to cables that are representative of the backplane that we intend to ship with. And we managed to close the link at the intended or the possible 400 gig that each Tofino port can provide.

Speaker 5: 01:54:14

So and and, but Do

Speaker 1: 01:54:16

do you wanna mention some of the, the the the hurdles we had getting there? Because I think they're they're they're both kind of representative and interesting. At least 2 of the ones I could think

Speaker 5: 01:54:25

of. Yeah. The also, as as we're as as Niels and I are debugging this and we're we're you know, can't get the link up. We think we have all the code right, and, and so we enlist Intel's help. We set up a meeting so that they bring out some of their engineers of their of their team.

Speaker 5: 01:54:43

10 minutes prior to that meeting, I'm like, oh, let me make sure that the setup is correct, like, as we as we are prepping for that meeting. And I lift up the board, and I realized that we actually removed the loopback cable between the two ports because we were gonna do some VNA measurements on another board with that same cable. And I so I did we did the measurements, and then I never put the cable back. So I was like, okay, this is one reason why this board is not intended to be a wireless thing. This is intended to be wired.

Speaker 5: 01:55:10

So I stick the cable back, and luckily, it still didn't work.

Speaker 10: 01:55:14

So I was

Speaker 5: 01:55:15

like, okay, fine. Like, at least we didn't waste 2 weeks of time on just the stupid cable that we that we just forgot. And and the con for the context here, these cables are under the board, under this metal chassis that is visible on some of these pictures. It's very difficult to see that there's no cable under there. So Right.

Speaker 5: 01:55:30

Tying all the cables.

Speaker 1: 01:55:31

In your defense a little bit. Yeah. Exactly.

Speaker 5: 01:55:33

That was missed. So here we go into this call with some Intel engineers, and we walked through various pieces. We had configured the BSP for the platform for these two ports and, even in test mode, we couldn't get a single wiggle on one of these links. And so we're going back and forth, checking. The way their SDK works is there's this long spreadsheet that you take, and the SerDes engineers basically fill out properties of the SerDes in the spreadsheet. And then there's a Python script that takes the spreadsheet, basically the comma-separated values of the spreadsheet, and turns it into this JSON blob that then goes into their SDK. And then from that, they distill a port configuration that they then load into the device. So it's a multistep process.

Speaker 5: 01:56:28

We skipped going through these pieces because the JSON parser broke for us. So we basically built a couple of C structures that represented the port configuration for what we were doing. So as we are debugging, we found a couple of small issues like, when we put the port in test mode, you do need to give it analog tuning parameters, meaning these SerDes are fairly And without tuning parameters, the ports are never gonna work. So you need at least something. So as we're on the call, one of their engineers pulls out of one of their running systems existing tuning parameters that they found that would work for a, you know, 50 centimeter or 1 meter cable.

Speaker 5: 01:57:13

So that would be in the ballpark of what the cable that we were working with. So first of all, we needed to set those. We hadn't set them properly. But so then as we were working, the port still doesn't work and we we go through this process and they're like, well, you need you need to go and work out. You need to basically build your entire platform.

Speaker 5: 01:57:27

You need to load all the ports into your platform so that if we, for some reason, have 2 ports connected to the wrong thing, then we can, you know, we can you just you just try them all and at some point one of them should start working because they're connected to each other. So at some point you should see something. I was like, okay. Fine. We can go do that work because that's but that would that was representative of a couple hours that we needed to do, so we couldn't do that right on the call.

Speaker 5: 01:57:50

And as we were talking, one of their engineers says, oh, you know what? Let's check the port numbers because, I don't I remember there's a mismatch between the the the port numbers that you that we use on the schematic symbol and the port numbers that are used in the spreadsheet. And as it turns out, the schematic ports are labeled 0 through 31 and the ports in the spreadsheet are labeled 1 through 32. And so the ports that we had configured were off by 1, and so after we changed the off by 1 error in our code and we and and restarted the process, lo and behold, the ports worked. And they came the link came up and and and we were we were underway.

Speaker 5: 01:58:26

So it wasn't as if we hadn't plugged the cable in, but then the configuration was assuming the wrong port. So, for those of us in software wondering about the off-by-one errors, it's off-by-one errors all the way down to hardware.
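
The mismatch described here, schematic ports labeled 0 through 31 and spreadsheet ports labeled 1 through 32, is the kind of thing an explicit, range-checked conversion makes harder to get wrong. A minimal sketch follows; the function names are hypothetical and are not part of Intel's SDK.

```python
PORT_COUNT = 32


def schematic_to_spreadsheet(port: int) -> int:
    """Convert a 0-based schematic port label to the 1-based spreadsheet label."""
    if not 0 <= port < PORT_COUNT:
        raise ValueError(f"schematic port {port} is outside 0..{PORT_COUNT - 1}")
    return port + 1


def spreadsheet_to_schematic(port: int) -> int:
    """Convert a 1-based spreadsheet port label to the 0-based schematic label."""
    if not 1 <= port <= PORT_COUNT:
        raise ValueError(f"spreadsheet port {port} is outside 1..{PORT_COUNT}")
    return port - 1


# Example: two schematic ports joined by a loopback cable (numbers made up).
loopback_schematic = (4, 5)
loopback_spreadsheet = tuple(schematic_to_spreadsheet(p) for p in loopback_schematic)
print(loopback_spreadsheet)
```

Passing a schematic label straight through as a spreadsheet label is still in range for 31 of the 32 ports, so a range check alone will not catch it; naming the conversion at least makes the two numbering schemes visible in the code.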

Speaker 1: 01:58:40

That's right. Yeah. Off by 1 I mean, what a brutal. I mean, obviously, it feels very vindicating to get that nailed, but also you're just like, oh my god. It's off-by-one errors.

Speaker 1: 01:58:51

Yeah.

Speaker 5: 01:58:51

But the thing was that we should have had that call with them just a little sooner because, while we didn't waste huge amounts of time, because it wasn't a 2 week full time effort, we simply would have missed a couple of things. Like, first of all, the off-by-one error, we would not have known because they were using the exact same labels that they used on the pins of the device, like the device symbol and how they talk about it in the hardware guide. They use the exact same naming convention, except then in software, everything was off by 1. We would have never figured that out, I think, by ourselves. Well, maybe not never, but it would have been a long one.

Speaker 5: 01:59:30

Yeah. And then the the t the the

Speaker 1: 01:59:41

and And it should be said that when we have been, you know, making all the decisions about what parts to use, a part of our of our calculus is figuring out what a partner is gonna be like to work with when things don't work. And we've been really blessed. We've got some folks that are very, very invested in helping us figure this stuff out. I mean, in some ways, like, the thing I felt best about on the t six was certainty that, like, we will get this resolved. Like, okay.

Speaker 1: 02:00:08

This is good. You're gonna I gotta ultimately, we figured it out, but they were definitely as perplexed as we were on the T6 issue and always available to brainstorm.

Speaker 9: 02:00:20

They get some serious props. Their guy that we're talking about, Jeff Heat. He's, like, the FAE that will take a phone call from his customer while he's at dinner with his wife.

Speaker 1: 02:00:32

He he they were really and and which was great, actually.

Speaker 5: 02:00:35

We should encourage that behavior, but it's not asking us.

Speaker 9: 02:00:38

That behavior. Do not do that. Yeah. But Yeah. For anybody.

Speaker 9: 02:00:41

That is not like, you don't have to do that.

Speaker 7: 02:00:44

But it It's, like,

Speaker 9: 02:00:45

just dedicated. Could not we were having so much trouble, and he was gonna do anything. Yeah. It was incredible.

Speaker 1: 02:00:53

It was deeply appreciated. I feel that same way, but I think Intel's been great on the Tofino side too. And AMD too, when we had the SP3 challenges. I mean, we've really turned to these folks, and, you know, we tend to do our homework. So we tend to show up having the common stuff nailed.

Speaker 1: 02:01:11

But,

Speaker 5: 02:01:12

Like the cables plugged in?

Speaker 1: 02:01:14

The cables plugged exactly. That's right. The cables plugged in. Yeah. But, yeah, so, you know, in the end, I feel like our revision B for EVT has taped out.

Speaker 1: 02:01:31

Nathaniel, the boards arrive in a couple weeks. Right?

Speaker 7: 02:01:36

Yeah. They are somewhere in between; the actual PWBs have been shipped to our manufacturer. So

Speaker 1: 02:01:43

So that's really exciting. I'm gonna knock on wood, but we really do expect that bringup to be pretty smooth. I feel like we've got a lot of experience now with this thing.

Speaker 5: 02:01:53

But we fixed a lot of things.

Speaker 1: 02:01:55

We fixed a lot of things. And we we touched a lot of the board, but we fixed a lot of stuff.

Speaker 7: 02:02:00

And and we also start with, like, with code and a level of things that just exist that did not exist on day 1, the first.

Speaker 5: 02:02:09

You have so much more debug software for all pieces in the

Speaker 1: 02:02:12

system now? We do. And I feel like that's been something that's been fun, to get that debugging software better and better. And I feel like, I mean, it helped us in Sidecar. We were faster on Sidecar than we were on Gimlet because we had, I think, learned a lot from what we wanted to have in Gimlet.

Speaker 1: 02:02:29

And I feel like we've made it better yet again for Rev B here. So it's been fun. I mean, that's kind of borne out a lot of, certainly, our belief in hardware/software co-design.

Speaker 7: 02:02:42

Yeah. And we're using a lot of the same parts on a lot of these designs. So, like, one thing we didn't talk about: on Sidecar, we had a missing capacitor on a couple of power supplies. But we had caught that issue in a design review for another design, like, 2 weeks before bringup or a week before bringup. And so that was all fresh in our minds.

Speaker 7: 02:03:04

So as we're going through troubleshooting, you know, why a power supply didn't turn on, you know, it turns out that required capacitor is required. And so, like, that helps as we start building a bank of functional circuits, parts we're comfortable with, and parts we have good references for. That really helps speed things up as well.

Speaker 1: 02:03:22

Yeah. I think we, you know, we officially go from ripping up all existing designs to, we go from part of the solution to part of the problem in terms of, like, no. Don't change this stuff. It works. You can certainly see why people have that bias.

Speaker 1: 02:03:35

But, yes, building up all that shared knowledge is very helpful. Anything from anyone else, any kind of parting thoughts? I know we've gone on for, Adam, thank you for the time here. I knew this was gonna be a long one.

Speaker 1: 02:03:52

You know? Yeah. I know. I could fire

Speaker 3: 02:03:53

it up. I I'm I'm I I knew it's gonna be at least 2 hours.

Speaker 1: 02:03:57

Yeah. And, hopefully, this delivered. It's been a wild couple of months here, with definitely some terrifying moments. I do feel, boy, it was darkest right before the dawn, though, because I was in that conversation where Rick is, like, assuring me that we're gonna find it. I was really not feeling optimistic.

Speaker 1: 02:04:18

So, a good object lesson there too in terms of just retaining that resolve and that perseverance. And then, also, honestly, I feel on all of these things, I don't know how you all feel about this, but I think it is really important to have a team of people attacking these problems, where different ideas, different perspectives, people trying different things has been really essential for us.

Speaker 7: 02:04:43

Yeah. I 100% agree. I think, you know, it's so easy to get lost in the forest, and you need someone else to, you know, come back out and say, hey. Look. You know, like Rick did with the part, like, I'm gonna go play, you know, in this other space or, you know, we've had lots of those where it's like somebody has an idea and because, you know, the rest of us maybe get buried in a spot, it's that external perspective that really helps unlock things.

Speaker 1: 02:05:09

Yeah. It's been an advantage too of being a bit distributed, honestly. I mean, I think a lot of people have asked, how do you do a hardware company as a distributed company and a remote company? And, you know, honestly, like, that was a good example where I think it really helped that Rick was not in the same room as, like, RFK, Robert, and me, for example, where I think it was good to have, like, a little bit of isolation there and be willing to do different kinds of experiments.

Speaker 9: 02:05:33

He's not tainted by our facts.

Speaker 1: 02:05:35

Exactly. Exactly.

Speaker 8: 02:05:36

Well, and it it forces conversations into these, like, common channels where other people can kind of opt in and opt out as they want rather than needing to be in the room. You just have to be paying attention to a certain chat channel or something. You can always scroll back.

Speaker 1: 02:05:53

Yeah. And that has been really, and that's certainly, as I was going up to this space, just trying to remember everything, it was very helpful to be able to go through the chat channels. And the party hat emojis, we definitely use the party hat when things work. It's nice that you can go ahead and search the party hats for party parrot. That's what we're doing.

Speaker 1: 02:06:10

That's right. That's right. Yeah. We're not not not fully sacrifice, but get the party hats. Alright.

Speaker 1: 02:06:16

Well, I know it's late on the East Coast in central time, but, thank you everyone for joining us. This has been a lot of fun to get these tales down. Arian, thank you for the end. Arian, Eric, Nathaniel, Aaron, Rick, RFK. Thank you very much for Steve and Robert.

Speaker 1: 02:06:37

It's been a lot of fun to actually record these. And I really enjoyed listening to our past episodes to remind myself of the problems, but it's gonna be fun to carry this into the future as

Speaker 5: 02:06:48

well. 20 years from now, we will all reminisce about the one recording we did with the NICs and the switches. And, yes, it will be cool.

Speaker 1: 02:06:55

There we go. The 500 ohms, man. The 500 ohms that saved the company. I would would

Speaker 5: 02:07:00

Trauma.

Speaker 1: 02:07:01

It's hardly trauma.

Speaker 9: 02:07:02

You force relive the trauma. Good stuff.

Speaker 1: 02:07:06

Alright. Thanks, everyone. See you next time.