Raiding the Minibar
Hey, Adam.
Adam Leventhal:Bryan, we meet again.
Bryan Cantrill:We meet again
Adam Leventhal:for the final time.
Bryan Cantrill:Yeah. Despite your experience with the Two Stooges moving company, your name for my sons in their moving.
Adam Leventhal:No. No. No. No. That but first of all, I mean, you corrected me last week or a couple weeks ago when I when I was saying that we held hands when we rebooted the road.
Adam Leventhal:No. I said it was the Three Stooges moving company, and I said you were really missing something. And I
Bryan Cantrill:knew I'm the third Stooge in the Three Stooges moving company.
Adam Leventhal:Right. Right. Without you, I just had these two Stooges.
Bryan Cantrill:You had two Stooges. I don't know. The act is kinda missing something a little bit.
Adam Leventhal:No. No. I I was I was definitely a Stooge along with them as we wrestled around in my storage unit and got everything done. So that was wonderful.
Bryan Cantrill:Well, that was very well. Well, while the N Stooges moving company was helping you out, I was inhaling Careless People by Sarah Wynn-Williams.
Adam Leventhal:Oh, nice.
Adam Leventhal:That's
Bryan Cantrill:awesome. I Yeah. I actually It is the first time I can remember that Bridget I I made the mistake I verbally I was verbally exclaiming a lot while reading
Adam Leventhal:this book.
Bryan Cantrill:So this is the book that Meta has successfully prevented Sarah Wynn-Williams from promoting.
Adam Leventhal:But hey, no worries Sarah, like they've actually saved you a bunch of time by doing all the promotion anyone would ever need.
Bryan Cantrill:Courtesy of Barbra Streisand, we are doing all of the promotion. Yeah. And the book is amazing. We're gonna have to do an episode on this. I don't know how.
Bryan Cantrill:Obviously, Sarah Wynn-Williams can't be on here, but we gotta have, like, Charity on here because Charity was at Facebook at the time. Rain was I mean, we've got a bunch of folks that we know that overlap with that, but
Adam Leventhal:No. We're we are gonna have Sarah on. She's just not gonna be able to say anything. But we we should address all of our
Bryan Cantrill:I love it. I love it. As far as anyone is concerned, we've got Sarah Wynn-Williams here. She's right there.
Bryan Cantrill:She just can't say anything, obviously. Not gonna say anything.
Adam Leventhal:That's right.
Bryan Cantrill:It's kind of like a Penn and Teller thing. I kinda like that.
Adam Leventhal:I'm sure it'll work well on an audio format and we'll go for it.
Bryan Cantrill:No. I think we'll go for it. I think that's great. Anyway, it's a must read, and as I posted on Bluesky over the weekend, I have not verbally exclaimed this much since reading Bad Blood. I assume you yelped when reading Bad Blood.
Bryan Cantrill:Yeah.
Adam Leventhal:For sure.
Bryan Cantrill:I mean, do you not yelp when reading Bad Blood? That's crazy. I just feel that because I also feel like I have just read the craziest thing I've ever read in my entire life. And then you read the next chapter and you're like, I would like to amend my statement. This is now the craziest thing I've ever read in my entire life.
Bryan Cantrill:And I definitely feel that that is true for Careless People.
Adam Leventhal:Yeah. I'm slowly making my way through Character Limit, and I kind of feel
Bryan Cantrill:the same How do you slowly make your way through Character Limit? Honestly, you've got the discipline of a monk to make yourself go slowly through Character Limit. I am impressed.
Adam Leventhal:I feel like, you know, if you just search and replace in that book, you know, Twitter for DOGE, you've got like a new book called like Limited Government or something, and you just hit ship. And I think that some of the limiting factor is, like, I can only take doses of this guy in sort of moderate quantities, and then need to put it down and take a little stroll and then pick it back up.
Bryan Cantrill:Oh, that's interesting. So you're saying it's like it's actually not you would love to claim it's that your it is your own self discipline, but it in fact, it is your it is your immune system that is actually forcing you to actually put this thing down. That's right. That's right. Well, that's however you're pulling it off, I'm amazed.
Bryan Cantrill:Yeah. I feel with Careless People the way I felt with Bad Blood. I recommend people read Bad Blood, but make sure that you've got some open time in front of you, because you're not gonna feed the children or look after the dog or whatever other critical tasks you have in front of you, because you're not gonna put it down. And I feel the same way about Careless People. It's just riveting. In fact, I put Bridget on to it and then she started, and then I couldn't get the book back. So I'm like, alright.
Bryan Cantrill:Well, actually, this is actually you know what? This is actually this is what I needed. This is this is the intervention that I needed.
Adam Leventhal:There you go.
Bryan Cantrill:But we're not here to talk about Careless People. Not yet anyway.
Adam Leventhal:That's right. Sarah Wynn-Williams is not here silently on this episode.
Bryan Cantrill:On this episode, as far as you know. That's right. That's right. The the but we we will do that. We're going to have to do that.
Bryan Cantrill:We're going to have to do something about it. And I and again, people should read it. So we'll do it. We'll do this is like I and this is like an emergency Oxide and Friends book club. We need to have like the Oxide and Friends book club red phone.
Bryan Cantrill:Like, alright, everyone to everyone go read read your emergency book. So, anyway, that's all we're talking about. We well, we, I am I'm really stoked to talk about this. We are talking about minibar, which is one of the one of these little one of these exciting boards we've got that you that that folks that wouldn't necessarily know about because we're not shipping this stuff. And you know, one of the things that has been, I mean this is of course makes sense but when you are shipping hardware, a lot of the things that you develop are actually not what you ship because just like in software, right?
Bryan Cantrill:I mean, in software we make debuggers and things like that. We actually do ship that because it costs nothing to ship. But we have a bunch of tooling that is part of building software.
Adam Leventhal:And I mean, even tests. Right? Everyone knows like even relatively unsophisticated software has tests. Like has test suites, has test frameworks, whatever.
Bryan Cantrill:That's right.
Adam Leventhal:You know you know you don't ship that, but you know it's it's equally important to the development of the software.
Bryan Cantrill:That's right. And similarly, we have a bunch of test stuff for hardware, and minibar is a bit of a ne plus ultra on this. So we've got Ian and Doug and Nathanael with us, and we've got some other folks who have been working on this thing. So we want to get into what minibar is. Some of you may remember when Doug previously joined us on, I think it was Cabling the Backplane, right? But Doug we That's right.
Bryan Cantrill:Yeah. You and Doug you took us through a bunch of the mechanicals and the the mechanicals definitely need some assisting pictures, I would say. I know, Adam, this is your your your favorite thing. This is a great podcast. It was a slideshow.
Adam Leventhal:That's right. So speaking of of audio format, so we're gonna be talking about a bunch of pictures. And for listeners, either if you're on YouTube, you're probably looking at a picture of it in just a moment. Or if you're on the podcast, you can go to the notes. Or if your podcast app supports chapter artwork and
Bryan Cantrill:Is that is that like the is that the the feature that is Yeah. It's called chapter artwork. Yeah.
Adam Leventhal:Yeah. We can we can do some real navel gazing sometime and I can talk about my process, my onerous process of of making those. But chapter artwork, so if if you open up the app and look at the screen, it'll actually rotate through different images.
Bryan Cantrill:And meanwhile, in the chat, I have dropped two links. One to RFD 363, which is an RFD that's still in discussion. So this is still some work in progress. But the RFD describes what we've built with minibar, what we are building with minibar. And then I've dropped in a link to a Google Drive that has got a bunch of photos.
Bryan Cantrill:So you kind of want to follow along. And maybe with that, Ian, do you want to do you want to kind of introduce minibar and because I've been this this is an idea that goes back to really when we kind of shipped the the first racks in 2023.
Ian Sobering:Yeah. Absolutely. And and I could I could try to give a description of what minibar is, but I think it's it's fair to give some context first about how it came to exist because otherwise, it's gonna sound really weird. So let's if you if you'll journey back in time with me to the summer of twenty twenty three, we were getting ready to ship, two racks to our first two customers. And at the time, you know, we we had been building compute sleds, and we'd been building rack switches, but we were still very much at the stage of manufacturing where you're dialing in the manufacturing process.
Ian Sobering:And our attention was focused on, you know, building at that time, I'd say medium quantities, you know, because we're scaling up in preparation for starting to ship to customers. But the manufacturing infrastructure we had was exactly what was needed to build sleds, switches, and power shelf controllers and racks, and that was it. No, you know, minimum test, minimum other stuff. And so when we were building rack one and rack two, compute sleds would get assembled. They'd come off the assembly line.
Ian Sobering:An operator would carry them over to a programming station, program all the firmware, load the host OS image, and then we need to test the sled. And And and because what I just
Bryan Cantrill:for just a definition because for software folks, when we say pro when it what does it mean program a sled? Because this is not like a program that you write. Wait. When you say programming the sled, what do you mean?
Ian Sobering:Okay. So when I say programming the sled, and, Nathanael, feel free to jump in here and correct me if I screw this up, programming the sled means, oh gosh. Okay. So loading the firmware for the service processor, loading a bunch of stuff into the service processor's auxiliary flash, which is all of the images for the sequencer FPGA that's on the board that does power supply sequencing and control, actually loading the host OS image onto the internal M.2 solid state drive that it boots from, and then doing other programming, well, yeah, flashing the root of trust, getting all of the hardware to the point where the next time you power cycle it, the board will proceed through all the power stages and it will boot into the host OS and try to do rack stuff.
Ian Sobering:So there's there's a lot of different moving parts, and it's a, I don't know, four or five step process of, you know, flash this part and then power cycle and then flash the next part and then power cycle and then load the host OS and then reboot and then do some more stuff, And then you're done, sort of. You still need to test it, and and that's where our story begins. Is that is that
Bryan Cantrill:Yeah. That's a Right. That's a yeah. Very fair. The point is that there's actually a lot that actually needs to be loaded on these sleds.
Bryan Cantrill:And and and there's bunch of nooks and crannies that need to be loaded with software.
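For readers following along in text, the flash-then-power-cycle flow Ian describes can be sketched as a small plan. This is only an illustration, not Oxide's tooling: the `Step` type, the step order, and the placement of power cycles are assumptions, since the conversation names the targets but not the exact sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str               # what gets written, as named in the conversation
    power_cycle_after: bool   # assumption: exact power-cycle points are not given


# Hypothetical ordering; the discussion only says it is a four or five step
# process of flashing a part, power cycling, and flashing the next part.
STEPS = [
    Step("service processor firmware", True),
    Step("service processor auxiliary flash (sequencer FPGA images)", True),
    Step("root of trust", True),
    Step("host OS image on the internal M.2", True),
]


def plan(steps):
    """Expand the steps into a flat list of operator actions."""
    actions = []
    for step in steps:
        actions.append(f"flash {step.target}")
        if step.power_cycle_after:
            actions.append("power cycle")
    # Programming is not the end: the sled still has to be tested.
    actions.append("test the sled")
    return actions


if __name__ == "__main__":
    for number, action in enumerate(plan(STEPS), start=1):
        print(f"{number}. {action}")
```

The point of the sketch is the shape of the work: each write is followed by a restart, and testing is a separate step that comes after all of it.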
Adam Leventhal:I think Yeah. In past, we've referred to this as teaching the computer how to computer.
Ian Sobering:Teaches it how to computer, in steps.
Nathanael Huffman:It it goes down to things like even some of the smart voltage regulators need configuration Yep. Loaded permanently so that they can do the thing that they're supposed to do.
Ian Sobering:Yeah. And so for context, I think the programming process takes, what, Nathanael, ten, fifteen minutes as it loads everything through various interfaces and, you know, you plug in a bunch of programming dongles, and it's a pretty operator intensive process at this point.
Nathanael Huffman:Yeah. I mean, the big sources of time there really are loading the we have, you know, a 32 megabyte host image that gets laid down into a SPI NOR, and so that takes a little bit of time. And then, we also load our M.2s over the K.2 network, which I'm sure we'll talk about. So Yeah.
Nathanael Huffman:But those are the two big, like, major sources of of time there.
Ian Sobering:Yeah. So it takes about ten or fifteen minutes per sled, and you do that for 16 sleds or 32 sleds. But then you're not quite done. There's actually two more steps. So the biggest and most time consuming step is you need to test and make sure the sled is working properly.
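To put the ten to fifteen minutes per sled in perspective, here is the arithmetic for 16 and 32 sleds. It assumes sleds are programmed strictly one after another at a single station, which the conversation does not state; `programming_minutes` is a name made up for this sketch.

```python
def programming_minutes(sleds, minutes_per_sled):
    """Total station time if sleds are programmed one after another."""
    return sleds * minutes_per_sled


if __name__ == "__main__":
    for sleds in (16, 32):
        low = programming_minutes(sleds, 10)
        high = programming_minutes(sleds, 15)
        print(f"{sleds} sleds: {low} to {high} minutes")
```

Under that assumption, 16 sleds take 160 to 240 minutes and 32 sleds take 320 to 480 minutes, before any testing starts.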
Ian Sobering:Now that it can compute, we pick it up off the programming station, we carry it over to a rack and insert it into a rack, and there we wanna see a couple things. We wanna see that the rack switch can communicate with the sled over the management network. We wanna see all of the data plane Ethernet links come up and work at full speed. We want to see the ignition controllers in the rack switches be able to talk to the ignition targets in the sled. I think we've talked about ignition on the podcast before, but ignition is the Oxide rack's subsystem for doing physical presence detection, low level power control, board identification, and some really basic error reporting regardless of the state of the management network.
Ian Sobering:It is the oldest network. So it would be sometimes we hear me yeah.
Bryan Cantrill:Yeah.
Ian Sobering:But all of these things go then the reason we can't test these things at the programming station is, you know, Bryan has talked at great length about the benefits of having this cable backplane and the high level of integration the rack has. Problem is on the test side, I can't just walk up to a sled and plug, you know, an Ethernet cable into it and hook it up to my laptop. All of these interfaces are locked behind the blind-mate backplane connectors that when you slide the sled into the rack, you know, it grabs onto this floating backplane connector cartridge and it aligns itself and it plugs itself in and makes sure that everything is mated without you having to cable anything up. What that means
Bryan Cantrill:is Great
Bryan Cantrill:for operations.
Ian Sobering:Put it in a rack. Yeah.
Bryan Cantrill:Yeah. Right.
Ian Sobering:And what we ran into, you know, our manufacturing process for sleds was pretty good. We were still dialing it in and there was some attrition and we can talk about that later. But the problem with trying to test compute sleds in the rack you were trying to build is let's say you plug a sled in and some interface doesn't come up. It could be a problem with the sled hardware. It could be a software bug.
Ian Sobering:It could be a problem with the backplane cable between the sled and the rack switch. It could be a problem with the internal cabling inside the rack switch that goes from the backplane to the switch motherboard. Or it could be a manufacturing problem with the switch. And all of a sudden, in order to troubleshoot this stuff, you're doing things like, you know, oh, okay. Well, you know, cubby 31 didn't come up.
Ian Sobering:Alright. Well, let's swap the sleds in cubbies thirty and thirty one and see if the problem follows the sled. Well, okay. No. It didn't.
Ian Sobering:So something in cubby 31 is bad. Is it a cable or is it the switch? Alright. Well, let's swap the two rack switches in the rack and see if the problem stays with the switch or follows the switch. And, you know, someone is is remoted into one of the management the a zone on one of the management sleds looking at all the link status and, oh, nope.
Ian Sobering:Okay. What's the backplane cable? Well, now we have to pull out a backplane cable, and replacing backplane cables is a is a really
Bryan Cantrill:Brutal. Yeah. It's
Ian Sobering:labor intensive process. And and all this time, you're moving the sleds around. Yeah. And the more wear and the more times you move the sled, the more wear and tear you put on the sleds before you actually ship it to a customer, and that's bad. We don't want that.
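The swap-and-see troubleshooting Ian walks through amounts to a small decision procedure. The sketch below is an illustration only: `isolate_fault` and its return strings are invented here, and it leaves out the software-bug case, which no hardware swap will localize.

```python
def isolate_fault(follows_sled, follows_switch=None):
    """Narrow down a link that failed to come up in one cubby.

    follows_sled:   after swapping sleds between two cubbies, did the
                    problem move with the sled?
    follows_switch: after swapping the two rack switches, did the problem
                    move with the switch? None means not tried yet.
    """
    if follows_sled:
        return "sled"
    if follows_switch is None:
        return "undetermined: swap the rack switches next"
    if follows_switch:
        return "rack switch or its internal cabling"
    return "backplane cable"


if __name__ == "__main__":
    # The case from the story: the problem stayed in the cubby after the
    # sled swap, and stayed put again after the switch swap.
    print(isolate_fault(follows_sled=False, follows_switch=False))
```

Every branch costs another round of pulling and reinserting hardware, which is the wear and tear described above, and a sled tested before it ever reaches the rack removes the first branch entirely.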
Ian Sobering:To give you an example of of something that I was personally there for, we were on Rack 2. We were we were troubleshooting one particular cubby, and, you know, we'd swap sleds around. We'd swap switches around. We had convinced ourselves it was a backplane cable. And so myself and and Robert Keith, one of the other hardware engineers, were in there.
Ian Sobering:We tore the backplane down, pulled the cable bundle, and it's important to note you can't replace one cable to one cubby at a time. The cables are in bundles of four cubbies each. So the cables have a big connector on one end that goes on the switch and four pigtails on the other end that go to four cubbies. And so if you touch one cubby, you touch four cubbies, and you increase the risk of, you know, yeah, well, cubby 31 works, but cubby 30 now does not because I somehow damaged the cable going to cubby 30. Anyway, we replaced the backplane cable and accidentally rotated the backplane connectors 180 degrees in the little floating backplane cartridge
Bryan Cantrill:when we
Ian Sobering:put it back together. And if you look in the photos that Bryan put up, you can see these backplane connectors. They're big rectangles. They're about an inch cubed, an inch on each side. And they've got wafers in there that have, you know, a whole bunch of ground signal signal ground pairs that you send really, really high speed signals over.
Ian Sobering:But on one side of the connector, there's a big hole. And it's almost 10 millimeters in diameter. That big hole is for a gigantic metal spike that's on the backplane connector side of the cartridge. Because when you push a sled in there, as you push the handle on the sled up to lock the sled in place, the spike finds that hole and aligns itself to the backplane cartridge so that when you push it that last little half inch, everything mates up and makes contact in the right order. If you rotate the connector 180 degrees and shove a sled in there, you will stab that spike directly through the backplane connector.
Bryan Cantrill:Doug, if I recall correctly,
Ian Sobering:you described piece of firewood. And Yeah. That's what I I think you described doing this. Right? Yeah.
Ian Sobering:Oh, yeah. No. I I 100 because we we put it back together, and we've been in there like, twelve hours already at this point doing testing. And so we're both Yeah. Just shredded.
Ian Sobering:And, you know, if if you you know, we we put it back together, get the cables all hooked up, get the switch hooked back up, I go and grab a sled, and I shove it in there, and I crank the handle up into the locking position, and and it goes. And it does not feel very much different from from putting a a sled in the right way.
Adam Leventhal:So like stabbing
Ian Sobering:it to Just enough different. Was like that that felt a little different going in so I pull it out and and it was split like a piece of firewood. And at that point you have, you know, the the sled's dead. You have to send it back to reworking. That's the kind of risk you run testing the sleds in a rack.
Ian Sobering:And Yeah. And I I know this this has been sort of a long journey down down the bring up rabbit hole, but it we realized pretty early on, we needed the ability to to test the sleds at the programming station. And and to to know that every sled that came off was was working good when it went into inventory because we've just eliminated an entire error source from a rack testing. We know it's not the sleds. We've tested them all.
Ian Sobering:We know what tests we did. We, in theory, have the test results even, like, saved somewhere, you know, on the network where we can go look and and see a sleds test pedigree. And and we knew this at the time very, very vividly that that this is something we wanted going forward.
Bryan Cantrill:When you say this, you're talking about a so we we wanna be able to to have a kind of a receptacle that is but then it like, this is not just a mechanical problem. It's a mechanical problem, obviously, but it's also a big electrical problem because, like, what are you gonna connect to on the other side? And Yeah.
Ian Sobering:And and this is where this is where minibar's history gets kind of interesting because so simultaneously to to building and shipping the first two racks when we when we're we're realizing we need we need some more manufacturing test infrastructure. There there are are, you know, individual engineers and cabals of people throughout the company who are also bumping up into the whole, hey, everything I want is locked behind the backplane connectors
Nathanael Huffman:Yeah.
Ian Sobering:Problem. And and the hardware team would bump into people from time to time where, you know, you'd hear somebody in a meeting go, you know, man, I really wish, you know, I could plug my commodity Ethernet switch into the management network for for some reason. Or, man, I really wish I could test ignition without having to have a sidecar rack switch like in the lab that had because that's what has the ignition controller in it. You know? And it it's locked behind the backplane interface.
Ian Sobering:And, you know, people were suggesting things like, hey, could we build a little dongle board that is like the width of one of these backplane connectors that you could just plug on to each little backplane connector that has, you know, an ExaMAX connector on one side and something like an RJ45 jack on the other and just magically gives you Ethernet. Yeah.
Bryan Cantrill:That would be really cool. I support this magical Ethernet plan. What's Yeah.
Ian Sobering:And and so
Bryan Cantrill:All those in favor of magical Ethernet, please raise your hands. You need not vote. It's only a software to vote.
Ian Sobering:Yeah. Because all also at the time, you know, we were we were having to get really creative with dogfooding together groups of compute sleds to actually do lab development stuff on. Because I don't think we had a dog food rack at that point. If we did, it was really, really new.
Bryan Cantrill:Yeah. It was very
Ian Sobering:it's not a good idea. We might not have had a rack in the Colo Data Center space yet, which has been a great tool. And so well and it take it takes me back to the first time I ever went out to the Emeryville office. I and, of course, it's it's during COVID. So I walk in, and there's nobody there.
Ian Sobering:It's just a bunch of IKEA tables that are absolutely covered in computers that people are remoting into to do development work. And and that was about the state we were at that time too. We had a lot we had a lot of lab sleds
Bryan Cantrill:that we wanted to get
Ian Sobering:the backplane connector behind. And so Yeah. These these two things sort of came together at the same time and and we went, okay, we we need a dedicated piece of hardware to do this. And the big question now is what does that hardware look like? And that brings us sort of to the next part of the saga.
Ian Sobering:Nathanael, have I left anything out here? Or do you wanna
Nathanael Huffman:Nope.
Ian Sobering:Do you wanna jump in?
Adam Leventhal:I think alright.
Nathanael Huffman:Yeah. I think that's good. I mean, it it's mostly around, you know, getting tests early. Test early is good for a few reasons. It helps us scale down helps us scale down the, like, problem space.
Nathanael Huffman:I mean, there's always the chance that, like, while it was sitting in stock, it got damaged. But it really cuts down and and helps us focus on the problems that we have at hand.
Bryan Cantrill:Yeah. Yeah.
Ian Sobering:So, yeah, now now I guess the the only question was, what does the hardware look like? There were people who wanted dongles. Oxide was big on dongles. I think
Bryan Cantrill:I think everyone dongles.
Ian Sobering:Oxide was big on dongles.
Bryan Cantrill:Is that a is a nice thing? I cannot assume everyone. Isn't it doesn't everyone love dongles? Everybody
Adam Leventhal:loves us
Nathanael Huffman:on values. It's in our values.
Adam Leventhal:Right? We did a whole episode just on dongles.
Ian Sobering:Yeah. No. And this goes back all the way to the very first, you know, like service processor test boards, you know, the Gimletlets and the monorails. So people would make, you know, dongles for a single stick of DDR4, you know, so you could read out the VPD PROM over I2C or something. And they make an even dongle, you know still much and
Bryan Cantrill:I love it.
Ian Sobering:Yeah. And and so I that had a lot of traction. But the more we the more we and and everyone wanted dongles for their use case. Know, an ignition dongle or that just broke out the ignition links or other stuff. But the more we looked at it, it it it there were there were five specific things we needed the manufacturing station to do, and that sort of drove drove everything because all of our lab uses ended up falling into those same test cases.
Ian Sobering:And it was: the programming station needed a piece of hardware that could connect to the sled's ignition target and test that it was working correctly. It needed to be able to connect to the PCIe interface that the sled exposes on a backplane connector to make sure that the PCIe interface was working and it would come up at full speed. It needed to be able to break out the management network Ethernet and convert it from SGMII into some format that the programming station could ingest, like BASE-T. We needed to somehow test that both of the sled's 200G, or 100G, Ethernet links would come up at full speed, and we needed a PCIe interface that we could use to load the host OS image. The last one is kind of interesting.
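The five things the programming station needed can be written down as a checklist. This is a sketch under assumptions, not the station's real test software: the check names, `sled_passes`, and `failed_checks` are all hypothetical, and the Ethernet entry takes the 100G figure Ian settles on.

```python
# Hypothetical check names paraphrasing the five requirements above.
REQUIRED_CHECKS = (
    "ignition target responds",
    "backplane PCIe link up at full speed",
    "management network reachable (SGMII converted for the station)",
    "both 100G Ethernet links up at full speed",
    "PCIe path usable for loading the host OS image",
)


def failed_checks(results):
    """Return the required checks that are missing or False in results."""
    return [check for check in REQUIRED_CHECKS if not results.get(check, False)]


def sled_passes(results):
    """A sled passes only if every required check was run and succeeded."""
    return not failed_checks(results)


if __name__ == "__main__":
    results = {check: True for check in REQUIRED_CHECKS}
    results["both 100G Ethernet links up at full speed"] = False
    print(sled_passes(results), failed_checks(results))
```

Treating a missing result as a failure matches the goal stated earlier: a sled goes into inventory only once every interface behind the backplane connector has actually been exercised.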
Adam Leventhal:Yeah.
Ian Sobering:The way and and this because this goes back to the the multistep programming process. The way we program this the the host OS image, the way we load that on the sled is different than the way we do everything else. The the service processor, the root of trust, the FPGAs, those all have programming headers on the on the motherboard for dedicated programmers, and they they connect over USB back to the programming station. For the host OS image, we need to load that on something that only connects to the host CPU. So what we do is we pull one of the because there is this nice, you know, PCIe gen three by four interface that we have just sitting there doing nothing, but it goes to a backplane connector.
Ian Sobering:We so we can't get to it. What we have to do at the programming station is actually pull one of the u dot two SSDs out of the front, and Oxide has made a adapter called the k dot two. The k stands for Kluge that lets you plug, instead of a SSD, plug a PCB that then lets you plug a bi four Ethernet NIC into it.
Bryan Cantrill:It so and we talked about the k dot two in our proto boards episode. Yeah. Right. The the Nathaniel, regrettable but necessary. Is that what's on the k dot two?
Bryan Cantrill:Actually, I have one in front of me, but
Nathanael Huffman:It says unfortunate but necessary. Yeah. Unfortunately, but reasonable. Unfortunately, yeah.
Bryan Cantrill:And this is a really thorny problem because we did have this problem of, like, how do we program the M.2s? And it felt like, well, can't we just program the M.2s not actually in the sled, Gimlet being the compute sled? Don't we have a way of, like, can't we program them outside the Gimlet and load them? And man, Josh and I went down all sorts of rat holes there. And you would think that M.2 programmers would exist, but let's just say that that was a dead end.
Nathanael Huffman:I mean, they do exist. They just don't work.
Bryan Cantrill:They just don't exactly. The good news is they exist. The bad news is that they don't work. So the K.2 was so clutch for us and so really important for us, and still important for us. And with the first K.2, I gotta say, Nathanael, I mean, you kind of took a little bit of the K out of the K.2 when you made a 3D printed enclosure that, like, I don't know. Yeah.
Bryan Cantrill:The case is great for clues. It sounds like the case is for printing now.
Nathanael Huffman:Credit goes to Eric for that, really, because so the thing I made, I kind of just I mean, it was like an afternoon slap together kind of thing on K.2. And, you know, it was, like, what if we just stuck a PCIe interface on this thing? And we did that. But, like, I didn't have a production Gimlet on hand, so I kinda, like, guessed. And I didn't really look at the CAD.
Nathanael Huffman:And so, like, the original boards that came in ended up needing to have, like, a nibbler taken to their top edge to even get them to fit into the slot. And they they were
Adam Leventhal:The the way you're phrasing that
Ian Sobering:is like This is Adam's
Adam Leventhal:it's like it was their fault. Like, they needed to have this done to them. Can we just, like, you made it
Nathanael Huffman:a little too big it wrong.
Adam Leventhal:I mean chop off half of, like, part of the fucking It
Nathanael Huffman:was it was like yeah. You need, like, a quarter of an inch off the top of the PCB.
Adam Leventhal:Explaining it in such a a a hermetic way that makes it feel like a a more sophisticated than taking some scissors to a PCB.
Nathanael Huffman:Well, yeah. Yeah. We used a nibbler, and so and that's not like a nice cut. Right? And so it's just kind of Yeah.
Nathanael Huffman:It looks like somebody chewed the top of Yeah. Like the and it's like a four inch. I mean, you need like a four or six inch nibble there. So, anyway, yeah, it's it's ugly. But Eric did a nice job.
Nathanael Huffman:He when we decided that this was more than, like, you know, we're we're not gonna use four of these. We're gonna need, like, 40 of these. Then Eric was like,
Bryan Cantrill:while hundred of these? Yes.
Nathanael Huffman:Yeah. Exactly. Why don't I why don't I CAD up something real? And so he, you know, he made a nice enclosure. He's he's got some, shots there in, in Discord.
Nathanael Huffman:But, anyway and, like, you know, we got a nicer PCIe connector, so the NICs weren't at right angles. And, you know, anyway, he did a nice job. But that's annoying because we don't get to test the sled with a U.2. So Right. We take a U.2 out.
Nathanael Huffman:And, like, while we get good confidence that our, that's a by-one NIC because it's just a gigabit NIC. And so we know that one of the four lanes works fine for PCIe, but we don't really get to test that. And so with minibar, we get to stick the NIC in out the back and test that and use the sidecar PCIe lanes for that. And we can leave the U.2 in there so we can get a full front and full fidelity test on all 10 drives.
Bryan Cantrill:Yeah.
Ian Sobering:Because right now, if you go over to a programming station, and it does look nice, you know, it's all in a case and everything, but you'll see, especially if there's multiple sleds being programmed, there's kind of a pile of U.2 drives sitting there that have been removed from the sleds. And, you know, at some point, you gotta put them back in and then do more testing.
Ian Sobering:So
Bryan Cantrill:And then, you know, I just really also want to get a quick tangent on the programming of Ignition in particular, because we had this incident that we perhaps somewhat callously referred to as the mass casualty incident, where we blew out a bunch of Gimlet sleds right before compliance. I think we talked about that in the compliance episode, or maybe it's too traumatic. But you'd actually done a bunch of the investigation, working with someone who had helped us decap that Ignition to figure out what had actually happened. This is a Lattice iCE40 chip. And I think the moral of that story is that these things are somewhat sensitive on their programming interfaces, and this has to be done carefully.
Bryan Cantrill:You can absolutely blow out a sled in the process of programming it.
Ian Sobering:Yeah. And that I that interface for whatever reason was was particularly sensitive. I think it was a I think it was a grounding issue because the the mitigations we put in place after after the the quote unquote mass casualty event, we've we've never seen
Bryan Cantrill:that problem again. But I'm
Ian Sobering:I'm looking back through my email trying to find the, the my correspondence because we we worked with, an ex an ex contractor who was a yeah. John McMaster, who's a friend of
Bryan Cantrill:John McMaster. Yeah. Yeah.
Ian Sobering:Of Rick, former Oxide employee. And he did a wonderful job. He decapped the ICs. He took die photos. He, you know, did impedance measurements of all the pins.
Ian Sobering:And, I mean, it was really clear from looking at it that not only did something cause a bunch of ESD protection structures on the die to blow, but it happened during programming, I think. And it was probably weakened during programming enough that then, you know, you put them in the rack and they blow. But we haven't seen that again.
Bryan Cantrill:But it also highlights just the importance of getting this kind of step right in a bunch of different dimensions.
Ian Sobering:Well, and you want it to be as operator friendly as possible too, because one of the risks to these parts is if you have an operator who has to reach in there down to the motherboard and, you know, they have to find the right connector to plug it into, and on Gimlet, I'll say they're not in, like, convenient places, for a lot of reasons. And so they have to plug the thing into the right connector. They have to plug it in the right way. They have to not touch any of the rest of the printed circuit board. That's also an issue for the operators: they have ESD protection, but the less you poke the board, the better, especially on the programming interfaces.
Ian Sobering:So it it whatever we come up with needs to be operator friendly.
Bryan Cantrill:There you go. Yeah. Okay. So this kind of sets the stage for some of your first thinking on what became minibar. Minibar, so named because we had named our sled and sidecar, our various elements, after cocktails, and I felt like this was going to be a miniature variant of all the cocktails on
Ian Sobering:Yeah.
Bryan Cantrill:On the minibar.
Ian Sobering:And that's almost exactly what happens. Because we were sitting here looking at minibar going, okay. So at this point, we're talking about a piece of hardware that plugs into the backplane, basically has a backplane interface, has an Ignition controller, has a management network switch, breaks out Ethernet to what are effectively technician ports, and does some TBD thing with the, you know, 200GBASE-KR4 Ethernet links. Oh, and it connects to the sled over the PCIe interface. This is a rack switch.
Ian Sobering:We need a two channel rack switch. You know, instead of a 64 channel rack switch, or a 32 channel rack switch, we need a two channel rack switch for one sled.
Adam Leventhal:Forgive me, but did you already explain why there was a PCIe connectivity coming out the the back of the sled?
Ian Sobering:Oh, no. I didn't. And thank you for reminding me. So in the Oxide rack, the sleds in Cubby 14 and Cubby 16 actually plug into the rack switches over PCIe. They attach to the Tofino 2 switch ASICs in the rack switches, and they manage them over PCIe.
Ian Sobering:In all the rest of the cubbies, the PCIe interface is just unconnected. But whatever sled you install in Cubby 14 and Cubby 16, the Tofino 2 will attach to that interface, and that sled is responsible for managing it. So even though, you know, 30 of the 32 sleds are not going to use that interface, all of them need to get tested, because one of our big selling points is, hey, you can put sleds anywhere. You know, there is no special sauce where that's concerned. So, yeah, we were looking at a two channel rack switch, and that's where minibar came from: you know, we have all of these cocktail themed things, and minibar's gonna have a little bit of all of it in it.
Ian Sobering:And I think, Bryan, you were the one who came up with the minibar name. But that was go ahead. Which maybe a
Bryan Cantrill:dated concept. I'm not even sure minibars still exist. You know, I think DoorDash may have ruined the minibar. But okay. So we start down the path, and I think to a certain degree, I mean, did you feel like you were kind of pulling together a list of, like, what people want from Santa? Because it felt like everyone was like, god, oh, finally, someone is doing this.
Bryan Cantrill:Okay. I've got all sorts of things I want in this thing. I want you to load it up with everything.
Ian Sobering:Yeah. No. Well, because and also, like, let's talk about everything else that was going on at Oxide at the time. You know, the hardware team was, like, at Benchmark Electronics in Rochester, Minnesota actively building these racks. We were continuing to refine and make changes that would improve the design for manufacturing of the sleds and the rack switches and everything else.
Ian Sobering:We were starting to look ahead into the future at Cosmo, which is Oxide's socket SP5 compute sled. So we were doing the initial design work on Cosmo, and everyone was really busy. So Minibar's design at first was, like, chasing down individual people and going, hey. I noticed you were working on subsystem x. Please tell me what would help you develop subsystem x in the lab.
Ian Sobering:And there for a while, there wasn't really a cohesive vision for minibar until you finally get the list of stuff together and you're like, shoot. I just need two channels of the switch. And that's what it's gonna have to be. And there's other reasons for that too. I'll borrow Nathanael's phrase, and I really like it because it's very accurate.
Ian Sobering:We only have so many hardware engineering credits that we can spend at any given time. And a lot of them are allocated to sustaining engineering work or to getting racks shipped or to future product design. Whatever we do for minibar, it can't have a massive brand new hardware lift. And it can't have a massive brand new software lift, especially because the software side of the house was just doing heroic work at that point, getting features ready for the first production racks. And so we were incentivized to reuse as much existing architecture as possible.
Ian Sobering:And that naturally led us down the path of, you know, take a switch, cut out as much as you can, and see what's left. You know, instead of a Tofino 2 switch ASIC on the end of the PCIe interface, put a PCIe slot where you can plug an Ethernet NIC in, and the list goes on. So that actually helped us in the end, because the software side of minibar bring-up has been really straightforward so far. We already have drivers for that stuff.
Bryan Cantrill:Yeah. Because you're just talking about, like, running Hubris on the SP there. Right. So another kind of kink that I just wanna make sure that we don't forget to talk about is we're also doing a big tooling transition. We'd been using OrCAD on Gimlet, and we were moving away from OrCAD.
Bryan Cantrill:We were moving to Altium for a bunch of reasons, a bunch of unequivocal reasons. I think it was minibar or Medusa that was first in for Altium. I think minibar was certainly very, very early for
Ian Sobering:I think think Medusa actually Medusa actually beat it by a little bit. And I and if I can say a couple words about Medusa Yeah. Absolutely. Before we name drop it here. So with I I I talked about the sled side of building the racks and the experience with with swapping the sleds around and and troubleshooting, you know, if links don't come up, are they are they sleds?
Ian Sobering:Are they switches? Are they backplane cables? What is it? One of the the other pain points really, really, really painful points we identified in that in building those first two racks was, was actually in the switch. The rack switch is not just one big printed circuit board.
Ian Sobering:It's three. And if you count the fans, it's four. And if you count the temperature sensors, it's five. But all of the so there's there's back panel ports and those have, you know, their own cabling, but there's also front panel ports. The switch has has a bunch of QSFP ports on the front for optical transceivers.
Ian Sobering:One And that that QSFP
Bryan Cantrill:board. Yeah. That QSFP board, I mean, how many parts are there in that BOM? It is certainly, yeah, it's big. Whether it's our second or third most complicated board, it's on the podium.
Ian Sobering:Yeah. I it's it's it's a beast. And the the problem we were seeing was right at the end right at the end of of of doing all the rack testing, we would test all of the front panel ports on the Switch, and we would test all of the fiber runs in the rack. Because when when we deliver the racks to customers, they are wired up with with the the fiber that the customer has requested, whatever their, you know, and and the the optics the customer has requested, they're they're ready to go.
Bryan Cantrill:Because the optics are not all the same. And the just in case And
Ian Sobering:the optics are not all the same. You know, different different
Bryan Cantrill:customers You have have
Ian Sobering:fiber in their data centers. Yeah. They they've qualified different transceivers. And so we are you know, we one of the one of the last steps we do and and because it's it's just the way it falls in the manufacturing process. For you know, first, you build them sheet metal, then you wire in the backplane, then you put all the stuff in the rack, then you test all the stuff, then you put the fiber in.
Ian Sobering:Because if you need to pull stuff out of the rack, working around, you know, working around the fiber is really not fun. So one of the last steps is we we route all the fiber through the rack, and we use the fiber just to wire up all 32 ports on the front of the on the switch in loopback. And we run a loopback test on it to see all the you know, we have 32 optical modules, you know, on a big on a big storage tray and we pull them out and we populate them all at the same time and they all turn on and all the links come up. You know, because even even if the customer's only gonna use one or two ports, and I and I think that's been the use case so far, We need all of them to work, and we check the fiber at the same time to make sure it's all terminated right and, you know, none of the connectors are dirty. You know, we clean the connectors, things like that.
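As a rough illustration of the loopback check described above, here is a minimal Python sketch. It assumes the 32 front-panel ports are cabled in adjacent pairs and that link state arrives as a simple port-to-boolean map; both the pairing scheme and the function names are hypothetical, not Oxide's actual test tooling.

```python
def loopback_pairs(num_ports=32):
    """Pair adjacent front-panel ports (0-1, 2-3, ...); the pairing is an assumption."""
    return [(p, p + 1) for p in range(0, num_ports, 2)]


def failed_pairs(link_up, num_ports=32):
    """Return every loopback pair where either end failed to link.

    link_up maps port number -> bool; a missing port counts as down.
    """
    return [
        (a, b)
        for a, b in loopback_pairs(num_ports)
        if not (link_up.get(a, False) and link_up.get(b, False))
    ]


# Example: everything links except port 5 (say, a dirty connector).
status = {port: True for port in range(32)}
status[5] = False
print(failed_pairs(status))  # [(4, 5)]
```

A failing pair only says that something along that path is bad (module, fiber run, connector, or cage), which is why the fiber is cleaned and checked in the same step.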
Ian Sobering:What we discovered was that every once in a while, well, I'd say every once in a while, on three out of four rack switches that would come off the assembly line, there would be a problem on the QSFP board that has the cages for the QSFP modules and all the power supplies and the FPGAs to configure it and manage the transceivers: some of the press fit connectors on that board, the pins would get bent during the press fit operation. And this is a normal manufacturing thing. It's one of the things you figure out in the process of bringing anything into production. What sucked about it is it's an absolute nightmare to replace one of those boards after it's assembled in the switch. The switch has to come out of the rack, which is a two-man lift. You have to unplug all the backplane cables, because the switches are not hot swappable.
Ian Sobering:They go in the rack and they bolt in and, in theory, they don't come out without a technician there. And so you have to unplug all the backplane cables, try not to bend them, two-man lift the switch out, and put it on something called the rotisserie, because you need to take some screws out on the top, then you need to flip it over, then you need to take some screws out on the bottom. And to do that, I mean, it's like a spit that clamps onto this thing and has a crank on it where you crank it and it spins it upside down. So one person can just sit there and, you know, work on the top and then spin it upside down and work on the bottom and then spin it back. You have to do something like 50 screws.
Ian Sobering:You have to pull all the cables out and, I mean, we got to the point where you could replace one of these boards and it would take about forty-five minutes.
Bryan Cantrill:Yeah. As Eliza is saying in the chat, the rotisserie is actually the star of every Oxide office tour.
Ian Sobering:It is. It gets a lot. It's it's big and it's green, and it gets a lot of attention.
Bryan Cantrill:Doug, do wanna talk just a little bit about the rotisserie? Because that that that's such a and we'll we'll get some photos of this that we can drop in because it's that was another kind of custom jig that has, I think, proven to be very valuable.
Doug Wibben:Yeah. The sidecar rotisserie was basically something we came up with to allow us to access the top and the backside of the sidecar to fix the cables and, you know, to not need two people every time to flip it over at risk of dropping a 75 pound mass on yourself. We affix it to basically a kind of rotating system, kinda similar to minibar here: some aluminum extrusions on a welded stand and then, I guess, a wheel that would allow it to flip over. So, you know, we've gotten a lot of mileage out of it for demos, as we've been mentioning in the chat here, which was not necessarily the intent, but I'm glad it shows well. So yeah.
Doug Wibben:That was that was a little bit of an engineering effort on its own.
Bryan Cantrill:Yeah. Well, and I think and this is one of these, and I think we Doug, there's been a lot of these where we you kinda have this problem where you can I mean, you can find things like you can go you can try to go buy something that will do this, but it's going to be more expensive and less fit for purpose? And I
Ian Sobering:I mean, that's kind of the story of everything at Oxide though. Right?
Bryan Cantrill:Yes. That's right. That's right. And I I mean, I think we've had a lot of these, but like, man, they've been in so many this, thank God. We've got the the the the chops and the kind of the to just go do this ourselves.
Bryan Cantrill:I mean, is like and Doug, I mean, yeah. As you say, the rotisserie has paid for itself many times over, not just in terms of office demos, although certainly that. It it it took us a while to get the guts to actually flip that thing all the way over because, know, if it did it did not wanna you know, you drop that as you say. It's like it's not just the it's the weight. It's the as as everyone knows, I've got the if if there's a way to destroy something, I'll find a way to do it.
Bryan Cantrill:So
Ian Sobering:It's particularly unnerving because so because the the switch is a two man lift, there are four handles. There's two on each side for you to grab as you pull it out of the rack. And as when it's on the sidecar, even if it's secured in place, as you spin the wheel and rotate it past 90 degrees, the handles will flip out of their little recess stitches
Bryan Cantrill:flip out.
Ian Sobering:And go Yeah. Click of the click on the side of the chassis. And so you you hear metal start to, like, click and clack and ting. You know, every time I do it, there's this it's like, oh, it's just the handles. Okay.
Ian Sobering:Yeah. Right. Not it's not like shifting in the, you know, in in the in the I'm not gonna drop a, you know Yeah. $10,000 rack switch or whatever. To unwind I'm I'm sorry.
Ian Sobering:Go ahead, Doug.
Doug Wibben:Yeah. I think that's been in existence for probably two years, and knock on wood, I don't think anyone has dropped one out of there yet. So we'll hold down.
Bryan Cantrill:No. I'm knocking wood right now, Doug, but don't worry. It feels very Yeah. It feels very secure. So it it's been Yeah.
Ian Sobering:And it gets a lot of use. There's a rack switch that lives just permanently on a rotisserie in the Oxide office for demos because people can, you know, look at this. Isn't that cool? Oh, yeah. Well, you think that's cool?
Ian Sobering:Wait until you see the bottom, you know, and
Bryan Cantrill:the The bottom. Exactly. The spider the the the the spider's nest in the bottom.
Ian Sobering:But anyway, to unwind this bunny trail that I started, we made a smaller test board for the rack switches before they go into the rack, called Medusa. And it is a 32 port loopback tester that plugs into the front of the switch with 32 just passive DAC cables. And so as switches come off the manufacturing line, we deliberately test the front IO board in loopback before it goes into the racks so that we don't have to unwind the rack later. And Medusa was so named because it has 32 QSFP DAC cables coming off the front of it. It looks like a nest of snakes.
Ian Sobering:That was actually the first board to go through Altium in the new CAD chain. Minibar followed not very long after that. I can't remember exactly the timeline. But Medusa was another manufacturing pinch point that we found about the same time that we really needed to build our own tool to address.
Bryan Cantrill:Well, and the thing that I loved about actually sending those two through this kind of new tool chain we're getting used to with Altium is that, because you're using so many of the same parts that we're using in Cosmo and elsewhere, it's like we're getting the parts library, we're getting all the things that we need from a kind of cleanliness of EDA perspective. Nathanael was fond of telling me after we brought up the K1, it's like, you realize that this relied on luck. Luck cannot be the strategy. We've got to do something, and so just, like, getting on a much better footing with respect to libraries and so on.
Ian Sobering:Yeah. And we tried to time that on purpose. The tool transition just came at a bad time, actually, for Minibar, because the first version of Minibar had actually gotten to the point where it was schematically complete in OrCAD. And
Bryan Cantrill:Oh, really? I didn't think I knew that. Wow.
Ian Sobering:We were getting ready to have schematic reviews, or might have had a schematic review or something like that. And then we decided to change to Altium. And at that point, the focus was very much, you know, getting all of our parts libraries migrated over. Well, not even migrated. Recreating our parts libraries in Altium and having a clean baseline needs to happen first, before minibar, before anything else, because the next thing that's coming down the pipe is the Cosmo compute sled. And by the time we are ready from a project management perspective to start working on Cosmo, all of that parts library needs to be in place.
Ian Sobering:And and some of that is how Altium works. Altium is Altium is a really great tool, But in order to place a a component symbol in the schematic, you know, even if you're just in there messing around, you know, playing with ideas, the component needs to be registered in the PLM system. It needs to have a schematic symbol and a footprint and be reviewed and then released into the the component library. So there's a lot of upfront work. You know, it's not like KiCad where you can just sort of draw a one off symbol and say, oh, I'm you know, I'll wait till later to assign it a footprint or or check things.
Ian Sobering:Right now, I'm just drawing a schematic. In Altium, you need to do all that work upfront. So Robert Keith, one of the other hardware engineers, and myself and everyone else embarked on what ended up being about a ten month journey to learn Altium and build the parts library and clean everything out, clean out all the crap from the PLM system. Nathanael wrote, what, eight track, all the tooling around integrating Altium's component libraries into our PLM system and generating EBOMs and mechanical BOMs and scrubbing stuff. And, do you wanna talk a little bit about that?
Ian Sobering:Because that that deserves its sort of own own sidelight here.
Nathanael Huffman:I mean yeah. I mean, mostly, we just I mean, we wanted to script all the things. And so we started doing that with with Cadence, and so we had to transition that over to to Altium. And but, you know, we wanna automate, make it so that you can push one button and get a board package. That's really all that's going on there.
Ian Sobering:Yeah. And it's it's it's really slick. It was it was definitely worth it.
Bryan Cantrill:Yeah. And well, one of the advantages was that that that we were there's a the much better tie through to the mechanical side. Right? And I think it's because I think it was one of the early minibar demos where you're showing kind of showing it in SOLIDWORKS. And I mean, it just makes it much easier to kind of think about mechanically.
Ian Sobering:Oh, yeah. So Altium has two things that really really sped up our workflow. And Doug, feel free to jump in here on the SOLIDWORKS part. Altium has a nice little three d renderer built into the the electrical CAD side of the program to where I can, you know, I can just hit, you know, number number pad key three and, you know, like like, you know, number one is the two d board outline. Number two is the two d view.
Ian Sobering:Number three is the three d view. And if I when I build my parts library, if I put in three d models of all my packages, I can see how my board's gonna look. You know? And if I configure Altium right, I can set up design rules that check like package to package distance with the actual three d packages. Not going to go into more detail about that because that's a whole different Give me a about whether that's useful or not.
Ian Sobering:But one thing it does do is and I I put an image in the images directory of the Altium three d render of what the minibar board looks like, and then right next to it is an image of what the actual board looks like. And they're pretty close because we can get in this day and age, we can get pretty good packages from from most vendors. And if we if we can't, Doug Doug will make them. The second part is Altium has a really great SolidWorks integration plug in where Doug and I can push boards back and forth from SolidWorks to Altium to SolidWorks to Altium, and he can fix all the components I've put in the wrong places and do, like, actual tolerance mechanical checking against his enclosures. And that is phenomenal.
Ian Sobering:Because all of a sudden, you know, he can he can do stuff in parallel with me, and he can find all my mistakes.
Bryan Cantrill:That's very cool. So is is that a good segue to the kind of the mechanical side of this? Because there's obviously Oh. Doug, there's a there's a big mechanical piece here.
Doug Wibben:Sure. Yeah. I can start with the ECAD/MCAD co-design integration as part of Altium, just in brief. That's something I think I've been promised for twenty years that has never come to fruition. But now we have it, and it's great.
Doug Wibben:Ian can place things. I can move things. I can add keep outs. I can add heat sinks. I can add all kinds of stuff, and the integration actually works.
Doug Wibben:It's fantastic. So, yeah, there's a lot of back and forth between Ian and myself on the printed circuit board layout with regard to connectors and the location of everything else. Do we wanna jump into the mechanical design of the minibar?
Bryan Cantrill:Yeah. Absolutely. Yeah.
Doug Wibben:Yeah. Okay. So I kinda numbered my pictures in the evidence file here in the link backwards. So picture eight's probably the best one here to describe this. So, you know, when given the task of designing some mechanism to integrate this minibar tester into a compute sled, we first started out with a printed circuit board.
Doug Wibben:Samtec, with the ExaMAX connectors, they make plugs and receptacles kind of in the same format. We originally had it so that we would plug a board, nearly exactly what Ian shows in his images here, with the plug style of the ExaMAX connector into the receptacle style on the Gimlet. That probably could have worked, but there were a couple of issues. I guess the main issue was that these connectors are only rated for 250 mating cycles.
Doug Wibben:For lab use, that would last forever. For our production units, we kind of had to tailor our main design for the production units, because those are the ones, you know, as we continue to grow as a company and volumes go up, we're gonna need to be running a lot of sleds through these testers. So the idea was to put a cable, a very short cable, like in the image, in between our compute sleds and the minibar PCBA.
Doug Wibben:We've got the shortest ones that Samtec offers; they will make them five inches long, 127 millimeters, and that was enough, basically, to kinda bridge that gap there. Also, in kind of giving it this compliant, this loose connection between the test board and the product board, we're able to use the same blind mate mechanism that we used in the rack, that we had on Oxide and Friends about two years ago, I think. So I didn't have to invent anything new as far as the alignment goes. The sled already has a latch on it, a handle and a latch for articulation, and we in large part borrowed a lot of the same design elements from our cubbies. So, you know, the current rack has 16 cubbies.
Doug Wibben:They're two sleds wide. Our minibar enclosure, the production one, is essentially a thirty-second of a rack. It's just one cubby, you know, extended a slight amount to add a board, and the main interface is largely the same. So when our production operators in manufacturing are running the boards for the test, it's a familiar motion. They don't have to reach around and plug anything in.
Doug Wibben:Training should be minimal. And, you know, we should be able to kinda fire these in and out at a reasonably quick pace compared to the programming cycle.
Bryan Cantrill:Yeah. Right. So by separating out those connectors, when we hit the mating cycles, we'll be able to just replace the connectors on the cubby and leave the minibar alone. That's the idea there.
Doug Wibben:Yeah. So the cables, there are three cables, the two main ones and then the PCIe one that Ian mentioned. Presumably, if we're doing kind of periodic maintenance on a tester, we'd probably just replace them all at once just for
Bryan Cantrill:Right. Yeah. Yeah.
Ian Sobering:Sure we're
Doug Wibben:not always missing them. But if one happens individually, right, we could replace one individually. We can also, since only one end of it is getting mated, rotate them 180 degrees and get double the cycles out of them, because the tester end doesn't get cycled.
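The flip-the-cable idea Doug describes can be sketched as simple bookkeeping. This is only an illustration: the `TesterCable` class and its method names are hypothetical, the 250-cycle figure is the rating Doug quotes, and the handful of mates the tester-side connector sees at install time are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class TesterCable:
    rated_cycles: int = 250  # per-connector rating quoted in the discussion
    wear: list = field(default_factory=lambda: [0, 0])  # mate count per end
    sled_end: int = 0  # index of the end currently facing the sled

    def mate(self):
        """Record one sled insertion; only the sled-facing end wears."""
        self.wear[self.sled_end] += 1

    def action(self):
        """Return 'ok', 'flip', or 'replace' for the cable's current state."""
        if self.wear[self.sled_end] < self.rated_cycles:
            return "ok"
        if self.wear[1 - self.sled_end] < self.rated_cycles:
            return "flip"
        return "replace"

    def flip(self):
        """Swap ends so the fresh connector faces the sled."""
        self.sled_end = 1 - self.sled_end


def sleds_served(cable):
    """Count insertions until the cable needs replacing, flipping when due."""
    count = 0
    while cable.action() != "replace":
        if cable.action() == "flip":
            cable.flip()
        cable.mate()
        count += 1
    return count


print(sleds_served(TesterCable()))  # 500
```

Under these assumptions one cable covers 500 sled insertions instead of 250, and the minibar board's own connectors never enter the count.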
Bryan Cantrill:Yeah. And this is, again, a very important to get this stuff right. Adam, do you remember back in Fishworks days when we had the contract manufacturer with a broken testing probe that do you remember this?
Adam Leventhal:No. I don't.
Bryan Cantrill:That was because we were seeing all these failures on one of the iWashie NICs and it was always the same NIC. And it was because the tester had bent pins and it was literally breaking every NIC as it was coming through the line. Oh. Well, just like yeah. It's just like the absolute opposite of what you want to be doing.
Bryan Cantrill:It's the taking good stuff and training. So anyway, it's very important that like you you really like being mindful of these mating cycles and and making sure that the the actual manufacturing equipment is, certainly doing no harm, but it's all very important.
Adam Leventhal:That's right. That it's evaluating and and sussing out things that are broken, not introducing brand new Exactly.
Ian Sobering:And if I can jump in here and elaborate on what Doug said, that was one of the other reasons we went to the cubby with the blind mate backplane connector. You know, back toward the beginning of the podcast, we were talking about the dongle thing. Minibar's original concept was, okay, well, let's just make a printed circuit board that's as wide as a compute sled that has, you know, power connectors and backplane connectors in all the same spots, and you can just shove it into the back of the sled, you know, on the table. Because we were envisioning something like a K.2 adapter, like you plug a K.2 adapter into a sled. You know, I was originally envisioning something in a 3D printed case that you would plug into the back of the sled, and it would do the same thing. And, you know, Doug very quickly went through that and said, you know, this is going to be a nightmare of how do you get this aligned.
Ian Sobering:It's gonna be really prone to operator error. You know, you're trying to get all these connectors and stuff lined up. You know, Doug had already gone down that rabbit hole with the blind mate backplane cartridge. And, you know, we also did the math on how quickly would we burn through the mate/demate cycle rating of these ExaMAX connectors on the production floor, because it's something like 200 mate/demate cycles.
Ian Sobering:And I wanna be clear. That's not a hard, you know, that's not a brick wall. You know, at cycle 201, it's not gonna be an open circuit. And in reality, it's probably good much longer than that. That's just what they rate it to.
Ian Sobering:And we don't want to ever get in a place where it just sort of slowly decays and you start getting intermittent programming failures, and then the operator has to troubleshoot that. But, you know, each rack has you know, let's say you've got one minibar. Each rack has 32 sleds that you're gonna program plus a couple of hot spares. So and if you ship a hundred racks a year, you're gonna burn through more than one minibar a year just replacing it based on the connectors. And at that point, we're like, okay.
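The connector-wear arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope illustration: the 200-cycle rating, 32 sleds per rack, and 100 racks a year are the round numbers from the conversation, while the count of two hot spares, the single programming station, and one mating per sled are assumptions added here.

```python
# Rough connector-wear estimate. Every figure is a round number from the
# conversation or an assumption made for illustration.
RATED_CYCLES = 200     # approximate mate/demate rating quoted for the connectors
SLEDS_PER_RACK = 32    # sleds programmed per rack
SPARES_PER_RACK = 2    # "a couple" of hot spares (exact count is assumed)
RACKS_PER_YEAR = 100   # hypothetical annual shipping volume


def matings_per_year(racks=RACKS_PER_YEAR, sleds=SLEDS_PER_RACK,
                     spares=SPARES_PER_RACK, matings_per_sled=1):
    """Mate/demate cycles a single programming station would see in a year."""
    return racks * (sleds + spares) * matings_per_sled


def connector_lifetimes_used(total_matings, rated_cycles=RATED_CYCLES):
    """Number of rated connector lifetimes consumed by that many cycles."""
    return total_matings / rated_cycles


total = matings_per_year()
print(total, connector_lifetimes_used(total))
```

Under these assumptions a single station blows well past one connector lifetime per year, which is consistent with the point being made. Retests or multiple stations would change the numbers but not the conclusion.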
Ian Sobering:We've already solved this problem. You know, kinda like, you know, we've already built a rack switch. You know, we've already designed a blind mate package.
Bryan Cantrill:We've already designed a cubby.
Ian Sobering:We already have a cubby that the sled goes in. And all of Doug's, you know, locking mechanisms and retention stuff and protections to prevent you from crashing the sled into the backplane connector, you know, and it became very clear at that point. It's like, oh, we have this. Let's just make a small one.
Bryan Cantrill:Yeah. That's awesome. Okay. So we get this gorgeous mechanical design. This thing in the photos, it looks rugged.
Bryan Cantrill:I guess it feels very rugged. It just feels great. The so and then take us through kind of the kind of the development of the board and because you're doing layout as well on this. Right, Ian?
Ian Sobering:Yeah. So so simultaneously, we're we're we're trying to figure out how big the board needs to be. And and so because Doug needs to know how big to build the mechanical enclosure. And and the this cubby the rack cubbies are already pretty deep. You know?
Ian Sobering:I mean, a rack is almost a meter deep, and we're about to add another foot onto that. And so we're building this big long rectangle, and Doug and I are kind of looking at it going, I don't know if this is gonna fit on the shelf in the lab. Because the other thing in the back of our heads is people want these things for lab use. You know, and we have engineers that we're probably gonna ship these things to at their houses. Because that was the other thing.
Ian Sobering:You know, we have people who have full up sidecar rack switches at their house. That's not a cost effective solution because there are a lot of Oxide engineers who need to get at the stuff behind the backplane connectors. But sidecars are expensive and they're heavy to ship. You know, we wanna be able to send them something a lot smaller and a lot cheaper. And so, you know, Doug and I kicked it around for a while and eventually came up with something that's about, you know, well, I mean, you can see in the photos there.
Ian Sobering:It's it's the width of a compute sled and, you know, I don't know, about 10 inches deep that has basically the guts of a sidecar minus the you know, minus all the QSFP ports and minus the big switch ASIC. It has an ignition controller. It has a PCIe slot that that breaks out that nice backplane PCIe interface where you can plug a NIC into it. It has a full management network switch with all the same switch ASICs that we use in in Sidecar to do the management network side of the switching, not the not the data plane side. It has a service processor.
Ian Sobering:It has a root of trust, so we can do measured boot validation on, you know, minibar firmware if it's out on a production floor where we don't necessarily have physical control over the hardware or who's accessing it. It's got Ethernet ports on the back, and it's all nice and boxed up. I think Doug and I were both really happy with how the whole system ended up looking. It was really slick. But the size of the minibar itself was a problem. Because we designed the production... it's big.
Ian Sobering:It's big. And we we had one of these at at OxCon back in October to play around with. And they're I mean, they're they're probably gonna get bolted to the programming stations. They're big and heavy. And it needs to stand up to to a lot of a lot of abuse.
Ian Sobering:And Doug did a really good job of building that into the design, but remember back to the beginning, there were all these other engineers who, you know, hey, I want one at my house. And like, you know, in my bedroom here, I don't have room for a full minibar. So Doug came up with what we call Minibar Lite, for applications where you're not going to be pulling sleds in and out of it all day, but you just wanna set it up on your bench and have it be plugged in there for long periods of time and you're not going to be messing with it. Doug came up with a way to repackage all of the minibar electronics in a little 3D printed enclosure that mounts on the top of the sled with magnets and gets cabled up with all the same backplane cables. They're just not blind mated. You plug them in manually, you put little 3D printed retention clips on them.
Ian Sobering:So, Doug, do you wanna talk about Minibar Lite?
Doug Wibben:Sure. So, yeah, going back to, you know, the original design, I originally only had the one design and kind of socialized it. And like Ian said, the response was that is too big. So, yeah, it didn't take long to convince me that, yeah, we needed something smaller and cheaper. So, you know, the full minibar enclosure for production is just a bunch of stock extruded aluminum elements.
Doug Wibben:It's not terribly expensive, but there's some cost there. So to get something smaller and less expensive, we've come up with the Minibar Lite. So pictures one through nine, or one through six, sorry, cover that. So image two kinda shows basically the split up version of minibar as it connects to our compute sled. So Ian's board, the test board, is there.
Doug Wibben:We have, basically, an extender cable for the PCIe interface onto a little enclosure that we can slap on the top. The enclosure will hold a full height half length card. And if anybody wants to play with anything bigger than that, I can just make something form fit to that. The thing we had to consider was that these ExaMAX connector cables don't have great retention on their own. Neither does the power connector.
Doug Wibben:So I kinda came up with just a little 3D printed clip to hold those on so that if somebody walks by it and tugs on a cable, it doesn't come loose, and you're wondering what's going on. So they should be retained fairly well there. And then the whole thing, you know, again, originally just designed for the production version, was kinda convenient enough to just fold up (I guess image three is probably the best one there) to sit on top of the compute sled. So I think it's less than a hundred dollars for the mechanical bits for this, and those are, like, extremely low volume runs.
Doug Wibben:So that's that's fairly reasonable. And
Bryan Cantrill:I love the power button. The power button is so great.
Doug Wibben:Yeah. I wish I could've lit that up for the pictures, but, yeah, that'll light up. Yeah. And kind of on the right side there is where the power comes into the minibar, which eventually feeds the sled as well. I didn't have a plug to show in there, but that's where the power will come in.
Doug Wibben:And, yeah, it it's kinda sits neatly on top of the sled. Our next generation sled, it will work with that as well in the same manner. It's like Ian mentioned, it's got some magnets on it to just kinda roughly hold it in place, and there's there's a leg that kinda slips down into the fan tray so that this doesn't get, you know, kicked or knocked off onto onto the floor. So This Yeah. This is yeah.
Bryan Cantrill:It's so gorgeous. And so the minibar we're using for that kind of manufacturing use case. And the Minibar Lite is what we're going to be using certainly in home labs, basements, but also in our lab here in Emeryville. This is going to be just to conserve that depth. And we kinda use right.
Bryan Cantrill:Doug, so this is not gonna be used in manufacturing. This is basically a dev use case.
Doug Wibben:This will not be used in manufacturing. No. This this would not survive, you know, production operators kind of plugging in and unplugging it. Yeah. This this is the yeah.
Doug Wibben:Careful version. Yeah.
Ian Sobering:This is very much a... I mean, Minibar Lite is a lot closer to what, you know, minibar as a dongle was conceptualized to be. You know, this is just something small for the lab that people can use. It's not intended to get moved around a lot, and it's designed to be used by Oxide engineers who are familiar enough with the hardware that they can take it apart and put it back together again. It was about the time that Doug was showing off the initial renderings of the minibar production thing and people saw the PCIe slot on the back, because it looks like the back of your ATX tower. It's got a PCIe slot there where you can plug a card in.
Ian Sobering:People started going, like, so can I put a GPU in it?
Bryan Cantrill:Yeah. Right. Of course.
Adam Leventhal:Right.
Ian Sobering:And we're like, I mean, I guess, but, you know, it needs to be a 75 watt GPU or less, and it's gonna come up in x4 mode, but that's it. And the 75 watt thing: one of the images that you posted in chat, Bryan, that shows the Minibar Lite all cabled up on the bench, it has a PCIe add-in card for one of the Chelsio T6 Ethernet NICs in there. And minibar is designed to be able to tolerate one of those guys. They're 75 watt class cards, which I think is the highest PCIe add-in card power level where you don't have to have external power connectors.
Bryan Cantrill:Right.
Ian Sobering:Because at that, people were like, well, yeah, but what about accelerators and GPUs? I was like, which ones do you wanna use? They were like, well, we don't know yet. I'm like, well, come back when we have a good idea, and I'll build some external power into it. But what's really cool about this is, you know, we can stick a T6 or T7 NIC in there and do software development against that.
Ian Sobering:Or if you do want, you know, 200 gigabit Ethernet or a hundred gigabit Ethernet into your Gimlet on a bench, and you happen to have a hundred gig switch at home, you can make that happen. I'm not advocating for that, but that was the criteria. And so that's what Doug built the mechanical enclosure around. And I love the little PCIe riser cable coming off of that.
Ian Sobering:I think he has some photos in there of it. When it's folded up on top of the sled, you know, the minibar folds up on top of the sled, and then the PCIe enclosure folds up on top of minibar, and it's it's just beautiful. I love it. Great great work, Doug. Thank you.
Doug Wibben:Thank you. I'll add, you know, in this home configuration, the PCI card is not in the airflow path, you know, contrary to what is in the production version. So if anyone runs anything too hot in there, it'll need a little oil.
Bryan Cantrill:Yeah. Yeah.
Doug Wibben:That PCI enclosure was suggested by Adam, I guess, when we were socializing this at OxCon. I just kinda had that board sitting on there, and he's like, can we do something a little neater? And there it is.
Adam Leventhal:I feel like I probably didn't suggest it. I probably, like, touched something in a way that I wasn't supposed to, and you're like, gotcha. I need to include.
Bryan Cantrill:Okay. Yeah. Exactly. It's a kind of suggestion. It's a
Adam Leventhal:I suggested Adam's family might appreciate him not dying by touching that.
Bryan Cantrill:And and I He still managed to drink the paint. Okay. So we need to actually secure that door latch
Adam Leventhal:a little bit better.
Doug Wibben:In nearly every home use picture I see, there's a cat. So this is cat proofing it as well.
Bryan Cantrill:That is true.
Ian Sobering:Those nice covers on there.
Bryan Cantrill:Yes. There is a cat. There are a lot of cats on compute sleds at Oxide. So making it cat proof is definitely appreciated. And I know, yeah, I've got a killer of boards. She's a murderess for sure.
Bryan Cantrill:She's not actually eating mice or squirrels. She's going after boards. So alright. So we've got this great mechanical design, very exciting. We've either made it Adam proof or we've incorporated Adam's suggestion, depending on whose story you believe there.
Bryan Cantrill:And then so you're doing layout for the board, Ian, and now it's time, like, we've got this thing. We've got it taped out, and we're getting it fabbed. And now it's time to bring the thing up. Right? This is only a couple weeks ago now.
Ian Sobering:Yeah. So speaking of killing boards, happy to to segue right into that one. Yeah. So minibar bring up so far has well, it's gonna happen in two parts. Up till now, it's happened remotely.
Ian Sobering:I have a couple at my house. Well, actually, one now. I shipped one to Doug so we could take those photos. It's the one in the photos. But then here in a few weeks, we are all going to be... the hardware team and the software, the Hubris people and Bryan and some of the other software folks are gonna be out at Benchmark Electronics in Rochester.
Ian Sobering:And while we while we are out there doing bring up on the Cosmo compute sled, we will do some minibar software work and and actually see some of the first non prototype, like, production tester minibar mechanical assemblies and and, you know, do do some of that work just while we're all there together. But the initial the initial power on checks were done at my house in the spare bedroom over the garage where I've put together a little a little workshop, and it's it's been interesting. The
Bryan Cantrill:Yeah. What'd hit?
Ian Sobering:I received two boards from Benchmark, who assembled them. And I just wanna plug for Benchmark here, they've been great to work with. They do incredible work. Really awesome partner in them.
Ian Sobering:So I I I'm like a kid on Christmas. I've been waiting, no joke, you know, a year and a half for these boards. And I cannot wait. And so Monday morning I go in just all excited. The first board, I take it out of the you know, it's an ESD bag.
Ian Sobering:Take it out. I I apply power. Minibar, you know, like Doug Doug said, minibar passes bus power through from, you know, the big power supply on the bench to the the sled. And normally, this is 54.5 volts. We designed the minibar to work with anything from 30 to about 65 volts because we wanna be able to test sleds at different you know, over over the full voltage range that they can handle.
Ian Sobering:So I'm I'm not gonna go for broke. I'm gonna put, like, 30 volts into it at first. And what I'm expecting to see, I've I've I'm expecting to see a bunch of green LEDs light up on the board because the power supplies are supposed to turn on automatically. There's a big, you know, big switching regulator that takes 54 volts and turns it into 12 volts. And then, you know, because we need 12 volts for the PCIe slot.
Ian Sobering:And then there are a bunch of smaller switching regulators that generate all the other system power supply rails. And on each of the power good indicator signals, I put a little green LED next to the regulator so I can look at the board and go, oh, good. You know, power came up. You know, well, there's my checkout. Okay.
Ian Sobering:You know, Matt Keeter, time to get going on Hubris. And I don't see the green LEDs. Instead, what I see is once every ten seconds, all of the green LEDs flicker on and off very briefly. And I go, oh, crap.
Bryan Cantrill:The crops have failed.
Ian Sobering:Ten seconds is kind of a magic number in Oxide 54 volt power land. Minibar has a hot swap controller on it. It's the same hot swap controller that we use in the sleds and the rack switches and everything else that runs on 54 volts. Because most... The 1272?
Ian Sobering:It's the ADM1272. Yep. Totally fine part, not dissing on the part. The ADM1272 has a reset input. If you apply a low level to the reset pin, the hot swap controller triggers on the falling edge of the signal.
Ian Sobering:So as you go from high to low, it sees the falling edge, and it will turn the board off and attempt to restart ten seconds later. And what I am seeing is the board is boot looping and it's resetting itself. And I go, ah, crap. And so I go in and I look in the schematic. And when I designed minibar... there's actually two hot swap controllers on minibar.
Ian Sobering:There's one that protects minibar's system power supply rails, you know, all all the minibar guts. There's another one that the that passes power through to the sled because this thing is a cubby, and we're hot plugging and unplugging sleds from it. So it it electrically, it needs to be exactly the same as as the rack. These hot swap controllers let minibar power cycle the sled. So you can do things like simulate what happens if your data center power goes down and bus power disappears, in a lab environment.
Ian Sobering:And you can do this remotely. You can do this over Ethernet, remoting into minibar. So it's really cool. And at one point when I was designing it, I thought, hey, wouldn't it be great to give minibar the ability to power cycle itself? You know, what happens if, you know, somebody is remoted into minibar and they go, oh, I need to update some firmware or something.
Ian Sobering:Because we have we have processes for updating firmware in things over the management network. We can do that to minibar. The last thing we need to do is trigger a power cycle and reboot the board. And what I did was I put a pull up resistor from you know, so I I took the reset line, I tied it to the FPGA. Okay.
Ian Sobering:Cool. The FPGA can can reset minibar. There's a pull up resistor on that line to minibar's 3.3 volt power supply rail. What I thought was going to happen was I apply power to the board, you know, 54 volts comes up, 12 volts comes up, 3.3 volts comes up, everything is happy. What actually happens is I apply power to the board, 54 volts comes up, the gigantic 12 volt switching supply turns on, and we see a volt and a half of what's called ground bounce between the 3.3 volt rail and ground.
Ian Sobering:And in in electronics, we talk about ground. Ground is not some universal constant that is the same everywhere and a magical perfect reference. Ground is whatever you mean. To learn. Yeah.
Ian Sobering:And so I'm very disappointed by this. What happens when the 12 volt switching supply turns on is the voltage difference between all the other power supplies in the board, which are just off and not doing anything and just sort of floating at, you know, approximately ground but aren't tied to ground, and the quote unquote ground reference... ground drops away. And so if you probe the reset line, which is pulled up to the supply rail, you see a positive going volt and a half spike, which is actually caused by ground dropping away. And I know that this is a little confusing, but, you know, when you're probing the board, you're referencing ground. And this is just one of these things that you get used to over time.
Ian Sobering:So ground drops away. The hot swap controller sees the voltage on the reset line go up to... turns out one and a half volts is a logic one on that pin. And then the downstream power supplies turn on, everything stabilizes, the, you know, ground goes back to where it's supposed to be, the FPGA, you know, that pin is low, that drops down below the logic zero threshold, and the hot swap controller goes, oh shoot, that's a falling edge. I need to reset. Somebody triggered a reset. And it boot loops because it shuts all the power supplies off and ten seconds later it tries again.
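The boot loop described here can be illustrated with a toy model. This is a sketch under stated assumptions, not a model of the actual part: the logic thresholds, the three-sample waveforms, and the function names are all hypothetical. Only the 1.5 volt spike, the falling-edge trigger, and the ten second retry come from the conversation.

```python
# Toy model of a falling-edge-triggered reset pin during power-up.
# Thresholds and waveforms are hypothetical; only the 1.5 V spike and the
# 10 s retry interval come from the discussion.
V_LOGIC_HIGH = 1.2   # assumed: anything at or above this reads as logic 1
V_LOGIC_LOW = 0.8    # assumed: anything at or below this reads as logic 0
RETRY_SECONDS = 10   # restart delay described for the hot swap controller


def has_falling_edge(pin_volts):
    """Return True if the sampled pin voltage goes from logic 1 to logic 0."""
    level = 0  # pin starts unpowered
    for v in pin_volts:
        if v >= V_LOGIC_HIGH:
            new_level = 1
        elif v <= V_LOGIC_LOW:
            new_level = 0
        else:
            new_level = level  # inside the hysteresis band: hold last level
        if level == 1 and new_level == 0:
            return True
        level = new_level
    return False


def reset_times(pin_volts, max_attempts=5):
    """Times (in seconds) at which a power-up attempt ends in a reset."""
    times = []
    for attempt in range(max_attempts):
        if not has_falling_edge(pin_volts):
            break  # this attempt stays up, so the loop ends
        times.append(attempt * RETRY_SECONDS)
    return times


# Ground bounce: pin spikes to about 1.5 V, then settles low once rails are up.
bounce = [0.0, 1.5, 0.0]
# Trace cut: an internal pull-up holds the pin high with nothing pulling it low.
trace_cut = [0.0, 3.3, 3.3]

print(reset_times(bounce))     # a reset on every attempt
print(reset_times(trace_cut))  # no resets
```

The same waveform repeats on every attempt, so the model never escapes the loop on its own, which matches the flicker every ten seconds seen on the bench.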
Ian Sobering:And so ground just sits there bouncing and bouncing and bouncing and bouncing and the board doesn't turn on. And the solution is really simple. You know, I get a little X-Acto knife and I go in and I cut the printed circuit board trace that, you know, connects to the reset pin, right at the hot swap controller. And there's an internal pull up resistor in the hot swap controller that holds that line deasserted so it doesn't reset itself. And boom, we're done.
Ian Sobering:Except in the process of doing that, my probe slips and I touch 54 volts to the reset line and it blows up the board. Oh. And this is now half an hour into the bring up process and I am sweating bullets because I've just destroyed a minibar.
Adam Leventhal:But you have two.
Ian Sobering:Okay. So second board. I I I take five minutes, and I go walk around, and I calm down a little bit.
Ian Sobering:I take the second board out of the box.
Bryan Cantrill:Okay. First of all, I admire that. Yeah. Because the walk around the block is very important.
Adam Leventhal:Before we get to the second board, so how spectacular was the slip into 54? Was this like an an exciting moment or or not particular? Just like kind of a fizzle, not a bang?
Ian Sobering:It was exciting enough to be an oh shit moment, if you'll pardon my language. Because, I mean, you see a brief spark and all the LEDs
Bryan Cantrill:I was gonna ask. Yeah. You saw sparks.
Ian Sobering:And and then and then you get the sinking feeling in your gut where you're just like, oh, no. I did that.
Bryan Cantrill:I don't think this board is supposed to make sparks.
Adam Leventhal:I don't think that was as bright.
Ian Sobering:In sort of a... there was a happy ending to that story. So it blew out a bunch of ESD protection structures in the FPGA and in the hot swap controller. And so I pulled the FPGA off, I replaced the hot swap controller, and it cleared the short.
Bryan Cantrill:So the board that you blew, you were able to rework?
Ian Sobering:Yeah. Partially, yeah. Wow. I was in the middle of finishing reworking it when Doug asked if if anyone had a minibar, and I shipped it to him. But, yeah, I think we I think we could rework that board.
Ian Sobering:I just didn't have another FPGA to put on it. And without an FPGA, if you actually program the board and try to let it boot and run normally, the service processor will fault during the boot process because it can't find an FPGA to load with the FPGA image. And so I just said, you know what? Let's use this one. You know, I'll ship this one to Doug so that he can do his mechanical checkouts, and then if we need it... you know, there are 13 more of these at Benchmark.
Ian Sobering:If we need it, we can rework it, but if not, you know, it it it can be what it is.
Adam Leventhal:I see. Now, this is this is really sophisticated stuff too, because you send it to Doug, and then when he finds that there's an electrical problem, you can just say, what we would
Bryan Cantrill:find when I
Adam Leventhal:sent to
Bryan Cantrill:you. Yeah. Would need the left here. I don't know what's up front of
Ian Sobering:missing the empty spot where the FPGA is should should be. No.
Bryan Cantrill:No. Was there. I I that FPGA was there when it left here. I don't
Nathanael Huffman:know what
Ian Sobering:to do. I did shipping.
Bryan Cantrill:It was fun after shipping. This reminded me, we had the sidecar heat sink, which is a huge heat sink.
Ian Sobering:Oh.
Bryan Cantrill:Yeah. And they were arriving bowed, and our manufacturer tried to convince us. Like, I think that's happening during shipping. It's like, shipping is not bowing a heat sink. Like, that is not.
Bryan Cantrill:Sorry. I mean, was it just,
Adam Leventhal:like, was it just shipped in an envelope or something?
Bryan Cantrill:It was shipped in an envelope. Like, shit. We we we've seen all the failure modes for FedEx and UPS. Like, yeah. Like, this is, throwing a heat sink in the bushes does not do this.
Bryan Cantrill:I'm sorry. Yeah. So okay. So that's the adventure of board number one. But we now have board number two.
Ian Sobering:So this this is like this is thirty minutes into my Monday morning, and I'm not okay at this point.
Bryan Cantrill:And I was gonna ask you, is this you were able to debug this really quickly, it sounds like.
Ian Sobering:Well, no. So it took me about and I'll I'll talk about how I debugged it. Here so it wasn't really quickly. I mean, you're you're getting sort of the after action report and it makes it sound it it makes it sound like I was a lot more calm and collected than I actually
Bryan Cantrill:I was I was not Such a relief. And Yeah. Right. Such a relief to know. You know, if you lose your mind.
Ian Sobering:I I mean, I I so so my my partner Chelsea works in the bedroom next to me, you know, and and she she's an integrated circuit designer. And I like, I went over to her office, I was like, I already blew up a board. It's been thirty minutes and I blew up a board. She's like, it's okay. It happens.
Ian Sobering:And and, you know, it but it I was not having a
Doug Wibben:good day.
Ian Sobering:So I
Bryan Cantrill:When when I blew up a board, namely, or when as aided by a cat, I there wasn't there was no sort of calm retelling in the next room. It was more like the kids thought I just snot my finger off. So Yeah. It was Yeah.
Ian Sobering:It there were there's some words. But anyway, okay. So so I have another one here and I take the second board out of the box.
Bryan Cantrill:That board is looking at you nervously. Yeah. What happened to the first board?
Ian Sobering:Don't ask. Yeah. Well, and there's other checks we do on these boards before we power them up. So like one of the things we'll do is we'll take a multimeter, and we'll probe around the board to make sure that important things are not shorted and do that.
Ian Sobering:So there's I do all those checks. I put it under the microscope. I go in and I cut the little reset trace next to the hot swap controller so it should power up okay. I connect power, I turn the power supply on, all the green LEDs turn on. This is really good.
Bryan Cantrill:It's great.
Ian Sobering:And and the you know, so I and I I I get my multimeter, you know, probe a couple of the power supply rails, okay, you know, 12 volts, 3.3 volts, one volt, you know, everything looks good. And I I think I wanted to go get a different oscilloscope probe or something. And so I reach over, I hit the button on the power supply and I turn the board off. I get up Right.
Bryan Cantrill:I mean, as you would.
Ian Sobering:As you would. I go get whatever I was gonna go get out of my closet. And
Bryan Cantrill:Meanwhile, anyone listens to this, like, why is he emphasizing that he turned the board off? Oh, yes, dear listener. That's Horse shot.
Ian Sobering:Was music you hear. Where yeah. No. So I come back and I sit down. I reach over and I punch the power supply on.
Ian Sobering:None of the LEDs come on. So I get my and at this point, I'm really not having a good day. And
Bryan Cantrill:Yeah. Time to take a walk.
Ian Sobering:Everyone in the house knows about it. And so I get my multimeter, and I start probing the power supply rails. The hot swap controller is okay. It turned on, and it is, you know, outputting bus power. You know, again, I'm not using the full 54 volts.
Ian Sobering:I'm like it's like 35 or 40 volts. But you know, so stage one is okay. Stage two is the big switching supply that takes takes 54 volts and turns it into 12 volts. And you know, in in the compute sleds, we buy these switching supplies from, you know, the the IBC, we buy these from from a a third party vendor. Minibar doesn't need, you know, a kilowatt worth of power delivery for for its its little two channel network switch.
Ian Sobering:And so, you know, I designed a smaller version that basically does the same thing. And it has you you can see it in the photos. It's the gigantic power inductor over there next to the the oxide label and the the minibar logo. That's the switching supply. And so I put it in and I I yeah.
Ian Sobering:Well, power going into it is good. No 12 volts coming out. That's weird because that was working a minute ago. And so I put it under the under the microscope, and I look at the output and I immediately see the problem. There is a there is a current sense resistor in series with the 12 volt output that the switching supply uses to figure out how much current it is delivering to the load.
Ian Sobering:And the internal current control loop in the switching regulator uses that to figure out, you know, how hard to switch, how much current to put out. And this is sort of glossing over it, because switching regulators are black magic anymore and we really need Eric on here to talk about that. But under the microscope, that resistor is rotated 90 degrees. So instead of being a resistor, you know, like a one milliohm resistor, it is a dead short across the pads where it's supposed to be. It is a zero ohm resistor.
Bryan Cantrill:Okay. And this is like so how did it work ever?
Ian Sobering:Well, at at first, the the when when the 12 volt supply turns on, nothing else on the board is powered up. So the switching regulator expects to see almost no load current.
Bryan Cantrill:Right.
Ian Sobering:And so it ramps the 12 volt supply up like normal. But as things turn on, it's gonna slowly start drawing more and more current as the different power supply rails come up and the service processor configures and the FPGA turns on. And the the switching regulator has voltage and current sensing lines, you know, at the point of load to so it knows, hey, is my 12 volt supply actually 12 volts? And how much current am I putting into the load for, you know, control system magic purposes? What it saw was Infinite current.
Ian Sobering:Pumping more, you know, more and more currents into the well, here's what it saw. It saw me punch the button on the power supply, and its input power went away. So its output power starts going away, and the internal control loop sees, oh, no. My output is drooping. I need to pump more and more and more current into the output to prop it back up.
Ian Sobering:I need to boost my output back up to 12 volts. I and I'm guessing here because I don't know the details on how this part works. What I do know is is when I turned it off, it popped the the switching supply. And so what I'm guessing happened is it dumped more and more and more current in there expecting to see some current, but instead it saw no current because the resistor pads, the pads on either side of the resistor, you know, that would normally be, you know, there's a pad and then there's a resistor in the middle and then there's another pad. The pads actually made a dead short across the component footprint where the resistor should be.
Ian Sobering:So no matter how much current dumped into the load, it was never gonna see anything. It it was gonna see zero load current, zero load current. My output is falling. I need to output more stuff. And it overstressed the MOSFETs and the switching supply and consequently overstressed the controller IC and blew it up.
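Editor's sketch: Ian is careful to say this failure theory is a guess, so the following is only a toy numerical illustration of it, not a model of the actual controller. It assumes a controller that infers load current from the voltage across a nominal 1 milliohm sense resistor and keeps pushing harder until the sensed current reaches a limit. The function names, the 10 A limit, and the 1 A step are all hypothetical.

```python
NOMINAL_SENSE_MOHM = 1.0  # design value the controller assumes (1 milliohm)


def sensed_current_a(actual_current_a, fitted_sense_mohm):
    """Current the controller infers from the drop across the sense element.

    The drop is actual_current * fitted_resistance, and the controller
    divides by the nominal value it was designed around.
    """
    sense_drop_mv = actual_current_a * fitted_sense_mohm
    return sense_drop_mv / NOMINAL_SENSE_MOHM


def run_toy_loop(fitted_sense_mohm, steps=20, current_limit_a=10.0):
    """Raise the drive 1 A per step until the sensed current hits the limit.

    Returns the drive level at which the loop stopped (hypothetical units).
    """
    drive_a = 0.0
    for _ in range(steps):
        if sensed_current_a(drive_a, fitted_sense_mohm) >= current_limit_a:
            break  # current limit engages, controller backs off
        drive_a += 1.0  # output still looks low: push harder
    return drive_a


if __name__ == "__main__":
    print("correctly placed resistor:", run_toy_loop(1.0))
    print("rotated part (dead short):", run_toy_loop(0.0))
```

With the resistor placed correctly, the toy loop stops at the limit. With a dead short, the sensed current is zero no matter how hard it drives, so nothing ever tells it to stop, which is the shape of the failure Ian describes.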
Ian Sobering:And so the rework on that one was really easy, and that board is now working. So I, you know, I I pull the MOSFETs off because I'm I'm probing around. I go, oh no. Well, there's a there's a dead short to ground. It's under the 12 volt switching supply and it's localized right where the switching FETs are.
Ian Sobering:Probably what happened was, and you sort of work back from... And so I pulled the resistor off, pulled the MOSFETs off, pulled the little switching regulator off, put new ones on, double checked to make sure everything was okay, and the board powered up with no problems. And then I immediately sat there punching the power supply button off and on and off and on and off and on to see if it blew up again, and it didn't. So, you know, further testing showed that everything is fine. It behaves like we thought it would, but two for two right off the line, that one was just an assembly error. The pick-and-place machine put the resistor down turned 90 degrees, and and
Bryan Cantrill:you on the other one?
Ian Sobering:That's where it gets interesting. So we, and I actually have the statistics on that here. We submitted what we call an NCR, a non-conformance report, which is for product that is nonconforming, to Benchmark, basically to say, hey, you know, in one of two boards that Ian got, it didn't conform in this way. The resistor is rotated 90 degrees.
Ian Sobering:Would you please go inspect the other ones and let us know what happened? And in we built 15 minibars. 13 of them were fine, two of them had the resistor rotated 90 degrees. And I just happened to get one, and it happened to be my second board of the day.
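Editor's sketch: to put a number on "I just happened to get one", here is a small calculation of how likely it is that at least one of two boards has the rotated resistor when 2 of the 15 built are affected. It assumes the two boards were an effectively random pick from the lot, which the episode does not state, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def p_at_least_one_bad(total_boards, bad_boards, boards_drawn):
    """Chance that a random draw without replacement includes a bad board."""
    good_boards = total_boards - bad_boards
    p_all_good = Fraction(comb(good_boards, boards_drawn),
                          comb(total_boards, boards_drawn))
    return 1 - p_all_good


if __name__ == "__main__":
    p = p_at_least_one_bad(total_boards=15, bad_boards=2, boards_drawn=2)
    print(p, float(p))  # 9/35, roughly 0.26
```

Under that assumption the odds are about one in four, so drawing a bad board in the first two is unlucky but not remarkable.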
Adam Leventhal:Interesting. Is that a common I would have thought it was kind of all or nothing, not that kind of ratio.
Ian Sobering:We've seen sense resistors rotated 90 degrees before, and I have some suspicions about this particular part, the resistor. So all these boards go through optical inspection where they have an automated camera go around and look at, you know, components and check to make sure they're in the right orientation. We've seen sense resistors rotated before on server sleds and other things. I suspect one of the reasons why is, you know, most resistors, like, if you imagine your surface mount resistor, they're black. This one is green, and the solder mask on the board is green, and they're about the same color. Maybe there's something there where the automated inspection camera gets confused and does a pass.
Ian Sobering:I'm not sure. But these are the kinds of things you feed back to your board assemblers so that they can fine tune their process. And and we've done this on all of our products. It just, you know, for for me sitting there, you know, it's now been sixty minutes. I've blown up two boards.
Ian Sobering:And and the second one I'm gonna claim, I swear to God it was not my fault. But, you know
Doug Wibben:But I'm
Adam Leventhal:sure you have perfect confidence having known Oh my gosh. Knowing that the first one might have been your Yeah.
Ian Sobering:And so now But this but this is this is tales from the bring up lab. And so Totally.
Bryan Cantrill:Yeah. At
Ian Sobering:this at this point, I I messaged Well, actually, later in the day after I called him, I messaged Nathaniel and I was like, Nathaniel, I screwed up.
Bryan Cantrill:I got a story.
Adam Leventhal:Nathaniel's like, send it to Doug. Just send them both to Doug.
Ian Sobering:Well, Nathaniel and Eric were both really kind because they they both came back and were like, Dude, I've blown up so much stuff. And shared some really, really excruciating stories of boards that have blown up or you know, just, you know, Nathaniel shared a story of an absolutely insane rework that he did on a board, you know, as as pre he and Eric did actually on on a board when when they were both at GE and and you know, so that was nice. But then you get to start the process of of debugging these things. And you know, because because I I have two two dead boards and this was supposed to be my week. I was supposed to be doing minibar bring up.
Ian Sobering:The second board with the popped power supply was really easy. You know, there was an obvious something is wrong and I bet that has something to do with it, you know, because the board that didn't blow up didn't have that problem. And so
Bryan Cantrill:that was the regular thing around? How did you find that you
Ian Sobering:I I ordered it. I had DigiKey ship me one.
Bryan Cantrill:Okay. Yeah. Yeah. Yeah. I was gonna ask if you kind of cannibalized board number one for that, but okay, so you, yeah.
Ian Sobering:Well, I I hesitated to pull any parts off of board number one because I I actually debugged board number two first. Until I had until I had time to troubleshoot board number one, I didn't know how much how much of the board got zapped. And what I did Yeah.
Bryan Cantrill:Wanna do was
Ian Sobering:was pull a part that I thought was good off of board number one
Bryan Cantrill:Oh, god. And put
Ian Sobering:it on board number two and have it not come up because of damage that I would have caught if I would have just troubleshooted board number one. And so I I just stuck with board number two. I had, you know, I I needed some other parts, so Digi Key shipped me, you know, quantity 10 of these switching regulators and so far I've used one. I mean, I'll use this on other boards too, so it's not, you know, it's not a big capital investment. But no, just get new parts.
Ian Sobering:They're like a dollar each.
Bryan Cantrill:Yeah, yeah, right. The turnaround is pretty good. So but I think it's wild that the failure mode of this was that when you powered it off and it detected the droop, that's when it starts shoving infinite current into itself because That's
Ian Sobering:a theory. Yeah. Yeah. I didn't I didn't have it hooked up to the oscilloscope because what I was about to do was there are test points on the board where I can clip little connectors on and capture oscilloscope traces of the power supply rails as they come up and as they power off. And and that was like literally the next measurement I was gonna make, and it was probably what I was going to go get a probe to do.
Ian Sobering:And and I, you know, that was when it popped. So, yeah. It was it was it's just bad timing. But the the board number one, so I so I I couple days later, the parts came in, fixed board number two, that was all good. Board number one
Bryan Cantrill:And board number two worked. So just in terms of like, you were able to bring it back up again, I mean, must been
Ian Sobering:a Yeah.
Bryan Cantrill:Must been I mean, you feel you got this thing debugged, but it must still be a relief when it powers on Yeah. And it actually works.
Ian Sobering:Yeah. So so board board number board number two is actually on the bench behind me right now waiting for me to flash the firmware. Board number one, I knew it had gotten zapped. I I thought I knew what traces I had zapped. I wasn't sure exactly because my memory of the event is not great.
Ian Sobering:It it was a very stressful time in my life. And Right. So what what I but, you know, if you looked at the board, it was really obvious that, you know, things things had blown up and stuff was shorted. And so Yeah. The first step in troubleshooting that is, you know, take your multimeter and probe around the board and see what things are shorted that are not supposed to be shorted.
Ian Sobering:And pretty quickly, it was obvious that the 3.3 volt rail, which is what the reset signal was pulled up to and and was causing the whole ground bouncing had gotten zapped somewhere. Something Yeah. Interesting. One of the chips on the board or multiple of the chips on the board were blown and that rail was now shorted to ground. The question is which chips are which chips got zapped?
Ian Sobering:And to troubleshoot that, I took another power supply, which is like a low voltage, high current supply. I don't know, it does like six volts and 10 amps or something, up to 10 amps. And it's a current controlled, it's a nice power supply with current limit. I can program the current limit and clamp the output current at something reasonable and safe. So I hooked up the power supply directly to the 3.3 volt rail and dumped current into the rail to try to produce, you know, because as current flows through the short in the board, it creates a voltage drop.
Ian Sobering:And so as I increase the current that I dump into the 3.3 volt rail, the voltage that develops across the short gets bigger and bigger. And if I then take my multimeter and probe the board at different points, as my probes get closer and closer to the short, I will see that voltage drop change.
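Editor's sketch: the current-injection trick Ian describes can be pictured with a very simplified one-dimensional model of the rail. Current enters at node 0, flows through a chain of small copper resistances, and returns to ground through the short. Every value here (2 A, 5 milliohm segments, a 20 milliohm short) and both function names are hypothetical; a real board is a two-dimensional plane, and Ian only says the reading changes as the probes approach the short.

```python
def rail_voltages_mv(n_nodes, short_node, current_a=2,
                     segment_mohm=5, short_mohm=20):
    """Voltage to ground (in mV) at each probe point along the rail.

    Between the injection point and the short, each copper segment adds
    drop. Past the short no current flows, so the reading stays flat.
    """
    voltages = []
    for node in range(n_nodes):
        segments_before_short = max(short_node - node, 0)
        path_mohm = short_mohm + segments_before_short * segment_mohm
        voltages.append(current_a * path_mohm)  # amps * milliohms = millivolts
    return voltages


def locate_short(voltages_mv):
    """First probe point where the reading bottoms out."""
    return voltages_mv.index(min(voltages_mv))


if __name__ == "__main__":
    readings = rail_voltages_mv(n_nodes=6, short_node=3)
    print(readings)
    print("short is nearest probe point", locate_short(readings))
```

In this toy model the reading falls as the probe moves toward the short and then stops changing, which is why a higher injected current (within a safe limit) makes the gradient easier to see on a multimeter.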
Bryan Cantrill:Interesting.
Ian Sobering:And so I did that and, you know, it became pretty clear that the FPGA was popped. So I took the FPGA off, still shorted, probed around the board some more. Okay. Well, it also popped the hot swap controller. Makes sense because those things are tied together.
Ian Sobering:And so, you know, I I pulled those off, probed the board, and the short's gone. And so I went, okay. Well, at this point, I'm confident I can probably rework it. And and so it I I actually used that board to practice reworking board number two and then I shipped it to Doug so he could do mechanical checks.
Bryan Cantrill:That's cool. That's great. So, okay. So then we have the thing now brought up. We know we're going to give feedback on AOI or what have you to make sure we don't have the current sense resistor issue.
Bryan Cantrill:And then it's on to software, and this is all running Hubris. Folks can go look at the minibar image if they want. I think it's all out there in the open. I don't think Matt's got that on a private branch, so you can go check that out.
Ian Sobering:Yeah. Matt is Matt's working on the hubris image, and Nathaniel is working on the FPGA stuff. As of today, I think he I think Aaron. Mission check out. Or I'm sorry.
Ian Sobering:Not Nathaniel. Aaron. My apologies. Nathaniel's like, please, I don't need any more work.
Bryan Cantrill:Yeah. Exactly. Right. Awesome. There is then In
Ian Sobering:the interest of full disclosure, there is one bug, like, one actual bug that I have found. And it was a goof that I made: one of the RMII Ethernet signals between the service processor and the management network switch in minibar is connected to the wrong pin on the service processor.
Bryan Cantrill:Right. Yeah, I heard about that. Oh, yeah.
Ian Sobering:It's not a showstopper. We've gotten enough feedback just in, you know, because, and again, it's been almost a year between when the board went out for manufacturing and just now doing bring up. And in that intervening time, we've been doing, like, all of the Cosmo server sled design. So there was a lot that we learned in the process of doing Cosmo specific debug capabilities that we either wanna roll into minibar or specific use cases that were like, you know, a year ago, we thought we were gonna do this, but you know, now that I actually have one of these boards in my hand and I'm doing bring up and I'm using it, like, I really hate what I did here. I wanna change something.
Ian Sobering:So there's a lot of that that's gonna get rolled into what will be the final version of minibar that actually goes out onto the production floor. But but those are the kind of lessons you learn. And I actually think I can rework this one, but it involves Who
Bryan Cantrill:are really?
Ian Sobering:Well I mean did work on Friday, and the the test is to flash the hubris image and do humility monorail status and see if the network is up.
Bryan Cantrill:And did you have to dead bug it to rework it? Or how did you
Ian Sobering:rework, and I mentioned this just I think because the rework is kind of sketchy and interesting, but also because in the interest of full disclosure, I did make a goof here. The rework is pull the service processor off and clean all the solder off the pads on the board. Then you get a fresh service processor and you pull two of the solder balls off the bottom of the BGA.
Bryan Cantrill:And you
Ian Sobering:take some 40 gauge magnet wire, so you pull some solder balls off to make a path out, and then you take some 40 gauge magnet wire and you tack it either to the pad on the board or the pad on the BGA package that you need to get the signal out of. And then you flip the thing upside down and you reflow solder it back onto the board. All the while trying to keep the wire from moving.
Bryan Cantrill:Holy then once
Ian Sobering:it's attached, you take the wire and you run it over to just like one of the series source termination resistors that's already there on the board, there's a nice pad that you need to, it's just this last inch of trace you need to rework, and two of them have signals attached to it, and I can just jumper to them. And that took thirty seconds to fix. It's just there's this one trace that isn't brought out of the package despite what Nathaniel and Eric warned me about because they have lot of experience with things like this. I I did not do it. I am now learning my lesson.
Ian Sobering:And so on Friday, actually, I did the rework. I had planned on, I just need to set up a programming station so I can flash the Hubris image.
Bryan Cantrill:You pulled that off, or do you think you might have pulled
Ian Sobering:that I I well, I think I did. Wow. Nothing is shorted. The service processor looks like it reflowed well. Like it's, you know, it's not like propped up on one side where the wire's coming out or anything like that.
Ian Sobering:The real test will be to, you know, flash the Hubris image and see what Humility tells me about the status of the network link. And like I said, this is not a showstopper because if we really need to talk to the service processor, we plug a dongle into it like we normally do in the lab. Yeah. Right. But when we deploy these things, especially the minibar lights, we need that link up.
Ian Sobering:So Yeah. You know, we'll fix it. You know, it it it happens. So yeah. Yeah.
Ian Sobering:That's that's the bring up story.
Bryan Cantrill:That is heroic rework, I gotta say. Because that is I
Ian Sobering:we'll see if it works.
Doug Wibben:I don't know.
Ian Sobering:No no promises.
Bryan Cantrill:Well, you're gonna have to add photos of that to the album for sure. So we're gonna need to get we're gonna whether it works or doesn't, that's gonna be exciting. That's that's that's exciting either way.
Ian Sobering:We'll drop a photo in the chat here of the magnet wire coming out from under the Yeah. Package.
Bryan Cantrill:And then Doug, the other thing is what I wanted to ask you about, because I know that for Cosmo, we and you you got a photo there of the the kind of the the the Cosmo programming prototype. Did you want to just describe what that is a little bit?
Doug Wibben:Yeah. Ian can probably help with that. He mentioned earlier that with our current compute sled, Gimlet, the programming interfaces, there are, I believe, four of them, the headers, and they're kind of spread out. So there's not a lot of hope for automating any of that. I guess these are in reference to pictures twelve and thirteen, Bryan, as you mentioned.
Doug Wibben:Yep. So for our new next generation compute sled, there's been, I guess, a little more consideration for concentrating those in one area for programming. And we've got a single Samtec header kind of in the back corner of the PCBA. You can see these are very approximated. I think that's courtesy of my colleague, Brooks, a laser cut acrylic mock PCBA that I glued a connector to.
Doug Wibben:So we're trying out basically a single port, single interface for the programming function on our next generation sled, which brings the potential of automating this process, which I think is low priority at this point, much more into the realm of possibility. So, you know, currently, as Ian mentioned, we've gotta plug in the sled, attach four cables, and program the sled. Here, it's just one.
Bryan Cantrill:Yeah. Oh, and this is all very important for, you know, especially as we get Cosmo out there and as we begin to really ramp. And, you know, a bunch of our customers have asked us, like, okay, this is great, but if I want to buy 10 racks, if I want to buy 20 racks, how are you going to be able to do that? And minibar is a really important piece of that, and the kind of unified programming header here that we're going to have on Cosmo.
Bryan Cantrill:Like all this stuff really adds up for our own speed and reliability of manufacturing to because you just when you try to accelerate a or you're trying to scale a process that's that's our that's manual or what have you, it's everything's gonna break. So, Doug, to your point, like, more we can auto we actually have secured now something that that we've got some hope of being able to automate.
Doug Wibben:Yes. And kind of, you know, a side design intent on this production version minibar is, you know, it's made with these stock extrusions, aluminum elements. We should be able to kinda scale these and gang these up as we need to scale for volume production. So I don't know that we'd ever get, you know, the equivalent of a rack full of these, but conceivably, you know, these can all stack together and we can test multiple sleds at once.
Bryan Cantrill:Yeah. Which is awesome. That is awesome. Well, this is this is great. It's super exciting.
Bryan Cantrill:I I mean, congratulations. I know, you know, it's been a long journey and we've obviously we've competing priorities and everything else. So it's been but it's really great to to have something in hand that that that works, that rework that you showed the photo you dropped is amazing of the wire coming out. And it's gonna be certainly very exciting to get this thing on the manufacturing floor. And I think realizing the the image the vision of of minibar.
Bryan Cantrill:It's very exciting stuff.
Ian Sobering:Yeah. We're we're really excited. There are a lot of people who who who are looking forward to using it. And, you know, it's it's been a group effort. I mean, everyone on the hardware team made this happen.
Ian Sobering:And and, I mean, Doug has just been we've we have two really great mechanical engineers here at Oxide, and and Doug has just absolutely been killing it on this. So a big thanks to everyone who helped make this make this possible.
Bryan Cantrill:Yeah. Really terrific work all around. Great teamwork. And do go check out the RFD, check out the photos. Ian, you have to keep us up to date on good luck.
Bryan Cantrill:I think we're all I I kind of
Ian Sobering:feel like Hopefully find out tomorrow.
Bryan Cantrill:Exactly. Leslie Nielsen in Airplane! But we're all counting on you. So good stuff. And again, terrific, terrific work all around.
Bryan Cantrill:And, Adam, good suggestion on the on on you know? Look, look, your your suggestion's being incorporated.
Adam Leventhal:Yeah. How about that? Whether I remember it or not.
Bryan Cantrill:And how great is this? I just feel like we're living the dream here.
Adam Leventhal:This is just so It's astounding. Yeah. It's awesome.
Bryan Cantrill:I I just think it's awesome. Alright. Well, thank you very much everyone. And, I think we're gonna have a yet another episode being planned somewhat in advance next week. So stay tuned, but we are gonna have some folks on the support and engineering team talking about a really interesting support case that we had that we wanna get into the details of.
Bryan Cantrill:So I'm gonna be looking forward to, I think, Lavaughn and Trey and I think we'll have Will on and I see a couple other folks. So stay tuned next week for that. Alright. Thanks, everyone.
Creators and Guests
