The Sidecar Switch

Speaker 1:

Twitter. Why do we make this so hard?

Speaker 2:

I know. You can't, like, talk and type at the same time.

Speaker 1:

You really can't talk and type at the same time. And here we go.

Speaker 2:

This reminds me of the presentations Bryan and I used to give, where there would always be live demos, and one of us would talk and the other would type. And, inevitably, when the person typing tried to talk, they would start typing the thing that they were saying, which was always kind of entertaining.

Speaker 1:

Yes. Well, there are some terrifying variants of that. I believe I have shared with you my terrific fear: I've observed that when I'm sending an email or engaged in work correspondence and my family enters the room, I am likely to sign things "love, Bryan." And I have tried to preemptively let people know that I will never profess my love to you in a work email.

Speaker 1:

Like, that's not the way it's gonna go down. So

Speaker 2:

Oh, well, you know, I'm just gonna delete that whole save file.

Speaker 1:

Would you mind? Could you please do that? Actually, before we get to Arjen, I do have a funny story for you, Adam, because I feel that you and I both have plenty of Abe Simpson-isms that we incorporate. Well, so we are aging.

Speaker 1:

And these things are so deep in my own vocabulary that I don't really keep track of them anymore. Like, could you say conclusively what references you do or do not make, and how casually you do or do not make them?

Speaker 2:

I do know that in mixed company, I'll say things like "consarn it."

Speaker 1:

It's not an

Speaker 2:

expression that has any currency for anyone who is not well acquainted with Abe Simpson. So these folks are, like... I'm exposing too much. I'll say this when I'm playing pickleball, which I play with, like, legitimate septuagenarians. And they're like, listen to this old timer.

Speaker 1:

This guy is showing us how to do it. Finally, someone else born in nineteen-dickety-two. So, we had a dilemma over the weekend. And at one point, my 14-year-old looks at me and says, did you just say there's an onion in the ointment? And I'm like, did I say there's an onion in the ointment?

Speaker 1:

He's like, is that an expression? I'm like, dude, that's actually a Simpsons reference. And I am no longer cognizant of when I'm doing this. And so now, like, I am becoming Abe Simpson. We're an entire generation that is becoming Abe Simpson.

Speaker 1:

Anyway, I'm worried.

Speaker 2:

No. I'm with you. And I surround myself with people who would only accept those references. So, yes, it's troubling.

Speaker 1:

But the children, what do you... I mean, honestly, my 14-year-old, god love him, Cantrill number one fan, has pored over the Simpsons like a Talmudic scholar. You cannot expect a 14-year-old to know early-season Simpsons better than he does, and he missed it as a reference. It's chilling. It's chilling. Chilling, chilling in the beer.

Speaker 1:

Exactly. On that note: so, Arjen, you had this tweet. Of course, there's a lot of work underneath it that we're gonna talk about, but the tweet itself is an absolutely gorgeous image of the board that you've been designing here at Oxide.

Speaker 1:

So I thought, actually, maybe we would work backwards a bit, and maybe you could describe what people are seeing with that image? Like, what the image actually means?

Speaker 3:

Oh, sure. So I got a bit inspired by some of these die shots. You'll see people take photos of old chips, and I try to do the best I can with what I'm doing. But what you're seeing is the signal, the copper layers, in the circuit board. So this is a printed circuit board made of fiberglass with thin copper layers in the middle.

Speaker 3:

And so what you're seeing is about 12 of these copper layers, where all the signal traces and the power pours are, all stacked on top of each other and then rendered as a single image.

Speaker 1:

And what are the colors denoting? Because the colors are obviously synthetic. But what do the different colors denote?

Speaker 3:

Well, roughly, the different colors are for the different layers, so that you can recognize what's what. As you're designing a PCB like this, it's pretty easy to lose track of what's where. So all the layers have names. Things get selected and added or put on specific layers where you need them, and then you use these colors to sort of keep track of that. So when you look at the image, for example, you'll see a bunch of, I don't know, magenta-bluish stuff. That's, for example, a particular layer.

Speaker 3:

So that's all the same layer. So when you turn that layer on or off, to make it visible or not visible, because you don't have these layers on all the time while you're working since it obstructs things from view, you know which layers are on and off based on these colors. Now, at some point when this board was designed, we added various other colors to basically show groups of signals, so that you can quickly see which signals belong to the same sort of link or same class or same group, and those might have different colors from the particular layer that they're on.

Speaker 2:

And, Arjen, what software are you using as you're designing and rendering this image?

Speaker 3:

So this was all done in Allegro PCB Designer from our friends in music, Cadence.

Speaker 1:

From the only software company that makes the software. It all goes back to Cadence.

Speaker 3:

Yeah. So what you're seeing here is just the board view rendered in the tool as you're working on it. So this is a CAD program that renders all that output to sort of an OpenGL canvas. So it's like a video game engine, if you will. It's just in 2D.

Speaker 3:

And this just so happened to be a screenshot of that render as we were working on it.

Speaker 1:

So it's a gorgeous image, and I think, you know, a bunch of us had the same reaction internally. Just aesthetically, it is beautiful. And it reminds me of the things that I always loved as a kid, like these really complicated London Underground maps. It kind of invites you in to ask questions about it, like, what am I looking at? I'm amazed and I don't even know why. Like, I wanna know, I don't know.

Speaker 1:

I don't know if you had the same reaction. We clearly have a lot of context on this, but I feel that it is

Speaker 2:

Yeah. Yeah. I mean, well, you more than I. But, yeah, same reaction, and actually I sensed that same thing. I, you know, was disassembling a piece of hardware, an old coffee maker, and pulled out a PCB and showed it to my 4-year-old this weekend.

Speaker 2:

And there's something sort of intrinsically fascinating about the routing of these things, and all the more so on a thing of this complexity.

Speaker 1:

It is amazing. And I've said in the past that I feel like the definitive history of the PCB has yet to be written, or at least I haven't found it, because to me the whole thing is mesmerizing. It is so complicated, it is so astounding, and it's so important. I mean, so much of what we have done rests on the PCB and our ability to get this dense integration.

Speaker 1:

So, Arjen, maybe you could take us back to the beginning. This is kind of the end, or a way station anyway, in the Sidecar journey. And do you wanna take us back? Because you are the first person that I didn't know prior to Oxide who joined Oxide. So you were there in the absolute earliest days, and I think you and I both really fondly remember that day that you came up to Oxide, like, moments after we'd started. But maybe you could take us from there.

Speaker 3:

Yeah. So, I think it was a Friday. Must have been a Friday. You invited me to come check out what is now the Oxide office, which was then an empty space, not even a chair. And we just stood around for, I don't know, 1 or 2 hours to talk, which was my, I guess, sort of informal interview, with a more formal engagement the week after when there was actually a couch to sit on, which was fun.

Speaker 3:

But, yeah, we went from there. I mean, the vision for Oxide was pretty clear. You, Jess, and Steve were able to communicate that well, and, you know, Robert Mustacchi was already there, and Josh and Dave. And so it was pretty obvious from the beginning, to me at least, that there was gonna be some really interesting stuff that was gonna happen, interesting things that we were gonna build.

Speaker 3:

And in those early days, we didn't even focus on a whole system. We were, or at least I was, pretty focused on the root of trust, and how do we even land something where we can start reasoning about code and how to measure the integrity of the system, just even a small piece. And I remember thinking, and actually saying out loud, well, the switch, you know, we're gonna do an integrated switch, that's fine, we'll be able to leverage something more or less off the shelf and customize it.

Speaker 3:

And there are few things in my recent history that I've regretted saying more than that.

Speaker 1:

You know, these self-delusions are important, though. These self-delusions are what let us go forward at key junctures.

Speaker 2:

And just because it might not be obvious: the image that we're looking at is Arjen's design for this switch, which was no big deal, you know, almost 2 years ago.

Speaker 3:

Well, wait. Because this is a completely custom thing. This is not... like, sure, the silicon we're using here is off the shelf.

Speaker 3:

You can go and purchase this. You know, the big switch ASIC is the thing that Intel makes, and then there's a smaller thing that comes from Microsemi, or Microchip. It's all off-the-shelf components. There's no real magic there, nothing really custom here. But the integration of these components, and how it connects all the systems in the rack and how it pulls the whole thing together, or at least we hope it will pull things together, right,

Speaker 3:

as a single management domain that allows you to, you know, exercise the control that we need over each component in the rack: that is not done. And actually, in conversations with individuals from various companies, if I describe what we're building, then inevitably what shows up is: oh, the entire rack is actually more like one of those large routers or switches built by companies like Cisco, etcetera, that have these blades in them that need to be managed separately, and that have a main data plane and a control plane, and usually even a third line where you can turn power on and off, etcetera. And this very much represents that same sort of idea, except that it's in a large chassis on its own, rather than in a blade like what Cisco does.

Speaker 1:

And, I mean, I'm even trying to remember, because I definitely feel that, like, we were not sure whether we wanted to integrate a switch or not. I mean, clearly, it's taking on a bunch more work. And even then, we were not certain. Like, pretty sure we wanted to do our own switch, but not really sure what that meant at all, I think. And certainly I had not thought about it completely.

Speaker 3:

No. Initially, the idea would very much have been to more or less still use something that would resemble a switch you can purchase off the shelf: a, you know, 1U or 2U box with a CPU in there and a switch ASIC. Very much like these white-label switch chassis that you can purchase from various vendors, that the hyperscalers have made so popular. And it wasn't really until we needed to add a management network that it became obvious that, hey, we wanna do more integrated cabling.

Speaker 3:

And then, we had spent all this time working on this root of trust and this attested boot flow to get into host software, where we have some certainty, some assurance, that it is booting the thing that you wanted to boot. And these existing switch chassis all have CPUs in them that were not the ones that we were gonna use for a compute node. And so we would have had to replicate all that work for yet another CPU, and that seemed silly. So we came up with this idea, or in this case it was actually Keith, I think, who started this idea of: let's make this an external PCI Express device. And basically this became a PCI Express peripheral, which it technically is in all these other cases too, except that we're making it explicit using a cable, so we connect to one of the compute nodes.

Speaker 3:

And so when we started going down that route, where it was like, okay, we're gonna have this thing externally connected using an external PCI Express connector to one of our existing compute nodes, now we're off in la-la land where we're building something ourselves, because no one is doing this.

Speaker 1:

No one's doing it.

Speaker 3:

It turned out that Google had done this in the past, and Google had done it successfully. And that gave us the confidence that we could go and do it as well, that it wasn't totally crazy. But it was definitely not anything that anyone making off-the-shelf systems had seen, and that put it squarely on us to go and build a custom chassis for this.

Speaker 1:

Yeah. And that's the way I remember it too. But I remember in particular thinking, like, we're putting all this work into the root of trust, and then what, we're gonna just throw down some, like, piece-of-crap, you know, old, off-roadmap x86 part? And it was, by the way, always Intel, and we knew that we wanted to have AMD-based designs. And it just felt like it was gonna be going backwards to be using these reference designs for the switch.

Speaker 1:

And so, yes, as you say: we thought we were gonna do, like, these little tweaks. You know, Arjen, you regret deluding yourself; I regret overusing the verb tweak. Like, we're gonna tweak existing designs. It's like, yeah. No.

Speaker 1:

No, pal. You're not tweaking existing designs.

Speaker 3:

I remember some of the conversations we had with some vendors or manufacturers of existing systems, trying to see if we could leverage some of their designs, or license something and then modify it. And as soon as we started talking about what it is that we wanted, well, we want to remove the BMC, you just see people look across the table and be like, you're crazy. That just, like, nullifies half the design, and it turns out that's true. And so we very quickly found out that once you start to remove some of these critical components that these boards are built around, then you're kinda on your own doing a semi-custom thing anyway. So you might as well just bite the bullet and go do it,

Speaker 2:

Just

Speaker 3:

Like, actually the way you want it. Right.

Speaker 1:

Yeah. And then you should describe, because I feel that another key moment in this story is Intel Tofino and really appreciating that part. Because I think we just didn't know, I didn't know very much about switching silicon. Of course, there's basically Broadcom as the dominant company

Speaker 3:

in switching silicon. During my time at Facebook, I interacted with the individuals working on Broadcom chipsets. And the thing that I vividly remember from that was that it was rather painful and that people were not necessarily happy with what they were building.

Speaker 1:

Did not spark joy.

Speaker 3:

So it did not spark joy. No. And their SDK is this giant thing that is a little difficult to get through. Now, it turns out that the Tofino one is also pretty sizable, don't get me wrong.

Speaker 3:

But, yeah, we had conversations with Intel early on. We had an entire day, actually, at Intel where several of the business units came together and pitched several things to us in an attempt to, you know, persuade us to use these components. And the Barefoot team, which had just been acquired by Intel a couple of months prior to that, came in rather late, I think somewhere towards the end of the afternoon. But they did a really compelling presentation about the Tofino ASIC, and in particular Tofino 2, which brings a bunch of refinements and a bunch of fixes from the first generation. And I remember us all walking away thinking:

Speaker 3:

This is super interesting. We should look into this a little bit more. We should see if this is a viable part for us to use. And so, quickly, for everyone who does not necessarily know anything about switch ASICs, and I mean that in a good way: what makes the Tofino interesting is that, normally, when you buy a chip from Broadcom, most of the designs from Broadcom and others are what they call fixed-function ASICs.

Speaker 3:

So they predesign these chips, and they have functional units that allow you to process networking packets according to their design. But they predesign that. So they decide how large certain tables are. They decide exactly what hardware is available, what kind of operations you can do on these network streams. Like, oh, you wanna do some kind of tunneling or encapsulation?

Speaker 3:

Well, then you better hope that whatever it is that you wanna do is supported by this ASIC, because otherwise the ASIC can't do it for you. And what makes the Tofino interesting is that it is a programmable, somewhat flexible device. And so you can parse packets in a structured way using a specific programming language they designed for this. And then the ASIC can be configured, you can tailor it to your application. And so if you want to emphasize a certain thing that you need out of your switch ASIC, for example, you wanna maintain a large number of tunnels and encapsulate and decapsulate packets, but you have no use for, let's say, VPN functionality or whatever...

Speaker 3:

Even though those are still tunnels, you might wanna do different types of tunnels. You can repurpose that silicon, or you can dedicate the silicon that is available to you to whatever it is that you need in your application. Whereas in a fixed-function ASIC, that silicon would be turned off. It would not be used, and that would be, you know, kind of a waste of the budget of what you're paying for.
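To make the fixed-function versus programmable distinction a bit more concrete, here is a minimal Rust sketch of a single match-action stage. It is not P4, and it is not anything from the actual Oxide dataplane; it is just an illustration that "programming" such a device amounts to choosing which actions exist and how the table capacity gets spent on them. All names and types here are hypothetical.

```rust
use std::collections::HashMap;

// Minimal parsed-header representation. In P4 the parser itself is part of
// the program you write; here it is hard-coded for brevity.
#[derive(Clone, Copy, Hash, PartialEq, Eq)]
struct FlowKey {
    dst_ip: [u8; 4],
}

// The set of actions a stage can apply. On a fixed-function ASIC this set
// (and the table sizes behind it) is decided by the vendor; on a
// programmable part, you decide which actions exist and how the on-chip
// table memory is divided among them.
#[derive(Clone)]
enum Action {
    Forward { port: u16 },
    Encapsulate { vni: u32, port: u16 },
    Drop,
}

// One toy match-action stage: an exact-match table from a flow key to an action.
struct MatchActionStage {
    table: HashMap<FlowKey, Action>,
}

impl MatchActionStage {
    fn lookup(&self, key: &FlowKey) -> Action {
        self.table.get(key).cloned().unwrap_or(Action::Drop)
    }
}

fn main() {
    // "Reprogramming" here is just repopulating the table and choosing which
    // actions we care about, the software analogue of dedicating silicon to
    // tunneling instead of, say, unused VPN features.
    let mut table = HashMap::new();
    table.insert(FlowKey { dst_ip: [10, 0, 0, 1] }, Action::Forward { port: 7 });
    table.insert(FlowKey { dst_ip: [10, 0, 0, 2] }, Action::Encapsulate { vni: 42, port: 3 });
    let stage = MatchActionStage { table };

    match stage.lookup(&FlowKey { dst_ip: [10, 0, 0, 2] }) {
        Action::Forward { port } => println!("forward out port {port}"),
        Action::Encapsulate { vni, port } => println!("encap vni {vni}, out port {port}"),
        Action::Drop => println!("drop"),
    }
}
```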

Speaker 1:

And the language is standardized, right? P4 is not... it exists beyond just Intel Tofino.

Speaker 3:

Correct. There's a consortium around that, and it was all initiated by the Barefoot folks, but it is an open language, an open thing that anyone can collaborate in. And there are actually already several implementations that can use this language to describe these switching or data-flow applications. And you can even use it in, for example, the Linux kernel. There's a BPF implementation that will allow you to implement a data path using the P4 language, so you can describe how a packet should be parsed, how it should be processed, and then how it needs to come out on the other end.

Speaker 3:

And you can use that on a Linux machine. And so you could go and build your own ASIC if you wanted to, and you could be in line with this language. Now, I should mention that Broadcom has a similar-type ASIC that is also more programmable than their other offerings, but it is in their own language, in their own environment. And when we were talking with them briefly, they were, for some reason, not very keen

Speaker 1:

to sell us that. Well, and, Arjen, I do feel you're the first person at Oxide to coin "willingness to get weird." And with Broadcom, it's not willing to get weird. And we're looking for partners that are looking to get weird a little bit. Like, willing to get weird, anyway.

Speaker 3:

Yeah. They definitely did not want us to go and build our own hardware. They basically told us to go and work with one of their board integration partners, because they did not have resources, or did not wanna spend resources, on us. And when we started asking some more involved technical questions, the first thing that basically showed up was: well, we can start answering these questions for you, but can you please open a line of credit so that we can bill some engineering hours to you?

Speaker 1:

And, I mean, it's just, like, very on brand for them. And it's, like, okay: am I being... is this the shakedown right now?

Speaker 3:

I mean, I don't wanna hate on them too much at the same time, but it was very clear that the team at Barefoot was much more eager to, you know, show us what this thing was capable of, to help us get this design done and across the finish line. And I feel that the device in itself, the ASIC itself, is just more in line with what we wanna do at Oxide. And so it was just overall a better choice for us.

Speaker 1:

Well, there is a shared zeitgeist too, around kind of the software-controlled data center. It seems like we had a lot in common there in terms of what our vision was. So we were not looking at them as just, like, a different Broadcom. We were looking at being able to leverage the novel bits of the part.

Speaker 3:

Yes. We haven't really been able to do anything with that yet. But I, for one, am very much looking forward to having some breathing room and actually spending some time with their SDE and the compiler, etcetera, to come up with some interesting applications that we can use these switches for. Because we can use them as load balancers, we can use them as tunnel endpoints... I mean, I've seen a demo application in P4 where the switch is used as a high-performance DNS server, where if you send it a DNS packet, it can just decapsulate the DNS request, and it has a lookup table, and it can send out results. And so you can use it as a key-value store. There are some really interesting ideas that you can build with this thing, and it can be done at line rate, which means that you can serve, you know, a 100 gigabits per second of DNS requests, or DNS responses, if you wanted to.

Speaker 3:

Now, it'll be interesting to see where that is gonna go in practice. But if there happens to be a particular use case that our future customers want, or maybe a particular vertical that we end up selling into that has particular needs, then we can try and address that using this hardware, which I think is a neat idea: that we can, after the fact, change what this can do and how it accelerates the network.

Speaker 1:

Alright. It's really cool. Sorry, I have a question, I

Speaker 2:

guess. No, question about the hardware. So, also, I see a bunch of folks who've joined as speakers. If you could just, like, raise your hand or something to indicate if you wanna ask a question. But real quick: all the hardware folks at Oxide, when we talk about Tofino 2, say, you know, it's a beast, or something similar.

Speaker 1:

It is. What makes it...

Speaker 2:

Like, I've heard some descriptions of it, but what makes this chip a beast?

Speaker 3:

Well, I mean, this is not necessarily unique to Tofino. I think if you go look at some of the other networking ASICs out there, you'll see similar specifications, although most of them are behind NDA, so you can't really see them. But what makes this thing a beast is that it is a large device. It's a 6-centimeter-square package with 5 dies on it. So there's a packet processing die, that's Barefoot's unique stuff.

Speaker 3:

And then there's a bunch of SerDes pieces around that, that get the packets in and out of this device, similar to what AMD does with multiple chiplets. This thing has 5 of these chiplets on there. So it's a large device physically, and then it has these crazy power requirements to power this thing. In the thread, I alluded to that. The core rail to power the packet processing pipeline can draw as much as 500 amps at, you know, 850 millivolts.

Speaker 3:

And it has this ridiculous 250 amp load step in a microsecond. And it can tolerate very little droop. So you need to design a serious power supply for this thing to operate.

Speaker 2:

Am I correct in assuming that these little blue bars to the north and south of the Tofino are power supplies?

Speaker 3:

Every single

Speaker 1:

side them for the SerDes.

Speaker 3:

But, yeah, this is a design that can deliver about 600 amps with a 300 amp load step.
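For a rough sense of scale, taking the numbers above at face value:

\[
P_{\text{core}} \approx V \times I = 0.85\ \text{V} \times 500\ \text{A} \approx 425\ \text{W}
\]

on the core rail alone, and a 250 A step in roughly a microsecond corresponds to a current slew on the order of

\[
\frac{dI}{dt} \approx \frac{250\ \text{A}}{1\ \mu\text{s}} = 2.5 \times 10^{8}\ \text{A/s},
\]

which is why such a serious power supply design is needed to stay within the droop tolerance.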

Speaker 2:

Nice. Yes. Simeon, did you have something?

Speaker 4:

Yeah. I wanna ask a question about P4. So is the plan to build a P4-based pipeline for the Oxide switch? And, you know, is that sort of the way that you will provide these novel features, to provide a software upgrade where you're effectively replacing the pipeline?

Speaker 3:

Yeah. That would be the way that we would do that. So initially, the first product we will release will have a switch implementation written in this P4 language, with a control plane attached to it. And the customers won't necessarily see anything of this yet. It will just operate as a switch.

Speaker 3:

Network traffic will flow, yeah, as they would expect. And over time, as we start to understand the use cases that these are gonna be used for, and as we start to understand our capabilities here, we might add features to this: you know, hardware-accelerated firewalls, hardware-accelerated load balancers, various tunneling and encapsulation features that we can implement. And who knows what we can unlock as we're in the process of working with this thing. I can imagine that if you are a media company and, I don't know,

Speaker 3:

you're looking to stream video or something and you wanna use load balancers, then having a hardware-accelerated load balancer would be quite an appealing feature. And so we might put some time into building an application and then deploy custom P4; we might have different P4 programs for different customer needs or different sorts of verticals. I don't know if we would do per-customer specific things; that seems expensive and time consuming, because you would have to test all these permutations, etcetera. This becomes a bit of an interesting delivery problem, but at least the opportunity is there. So we'll have to figure out what this is gonna look like.

Speaker 4:

Do you have the ability to slice that switch up to say, okay, we're gonna do ordinary switching on these ports and load balancing on others? Or is it an all-or-nothing proposition? Like, this is the Oxide rack for firewalling?

Speaker 3:

No. You can mix and match these applications as long as you have the resources in the switch available. So, a little bit more detail about how this thing works: there are 4 packet processing pipelines in this chip, meaning that you can process 4 things in parallel. But these things are pipelines, so each of these pipelines consists of 20 stages.

Speaker 3:

And so at any point in time, there could be 20 packets in each pipeline, so you're working on 80 packets in parallel. And each of these stages, they're called match action units, can do a different operation on a packet. And so you can decide how you want to allocate these stages to your program. But it does come with the proposition that it only runs one program.

Speaker 3:

So you write one P4 program, you compile it, and it synthesizes it into somewhat like a bitstream, what you would see in an FPGA, and then it loads that into the device. So you can't really slice these and make sort of virtual things out of them. That is unfortunately not possible. Who knows, at some point it might occur.

Speaker 3:

So we would deploy one program per switch. So whatever we do there, it needs to be decided upfront what that program looks like. And if there are multiple use cases served by that program, that's okay. But we need to know what that looks like.

Speaker 1:

I think it's worth mentioning too that, you know, our belief is that this hardware that we're building is gonna last for a really long time. And so giving us and our customers software flexibility to do interesting things on top of it is really interesting. And so we just feel like there's a lot of potential here to go in a bunch of different directions. And I think the other thing we're finding is the degree to which this is a real pain point. And not surprisingly, it's a real pain point for those folks deploying on-prem infrastructure, and, of course, I don't know why that's surprising.

Speaker 1:

It's been a pain point for me historically too, because you can't see inside the switch, and so much of your performance issues and your reliability issues emanate from that.

Speaker 3:

Well, yeah. So, touching on the first bit, this thing definitely has a lot of horsepower for a rack switch. It might even be a little bit overpowered, but, hey, we can get access to this thing, and we think it's worth it. And so why do I say it's overpowered?

Speaker 3:

Well, we have 32 servers in the rack, and so we need 32 connections to these servers. But then we also have 32 ports out the front, which is very unusual. Usually, a rack switch has something like 4 uplinks or maybe 8 uplinks, but we happen to have 32 because the ASIC came with 32 additional ports, so we might as well expose them. This does lead to some interesting things where customers will be able to build small clusters with a pretty dense fabric. And so we can reach really interesting oversubscription ratios, all the way up to 1 to 1, which is pretty unique; there are not a lot of folks who do that.

Speaker 3:

I don't think that anyone would do that, because it's very costly in terms of optical transceivers and fiber to make that happen. But if you really need all that network bandwidth, it is available to you. Because one of the things that our customers said is, like, well, just make the network go away from a performance point of view. Just make it be so fast that it is basically a resource that we will have enough of.
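As a worked example of the oversubscription point, assuming every port runs at the same speed B:

\[
\text{oversubscription} = \frac{\text{server-facing bandwidth}}{\text{uplink bandwidth}} = \frac{32 \times B}{32 \times B} = 1{:}1,
\]

whereas a more typical rack switch with, say, 8 uplinks in front of the same 32 servers would sit at \(32/8 = 4{:}1\).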

Speaker 3:

Okay. Well, you know, then we're gonna give you a lot, but the assumption is that it will last for a bit. So there's that. And then the other thing that is really interesting, driven by this programmable bit, is that there are some standards or conventions now starting to appear that allow you to tag extra data onto packets to do all sorts of interesting telemetry. And so we can do tracing through the network, where we can tag on how long it took for certain hops to traverse, because we can insert data into these packets relatively cheaply.

Speaker 3:

And so we will be able to, or the hope is, the expectation rather, is that we will be able to build interesting telemetry and a more in-depth, you know, ability to troubleshoot the network. First of all, you can see which path the packet has taken, which is already a first interesting thing. And then from there, you can distill all sorts of additional information that might be of value when you try and build high-performance network applications.
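In spirit, the per-hop tagging being described looks something like the following Rust sketch: each switch appends a small record as it forwards the packet, and the receiver reads back both the path and where the time went. This is purely illustrative; it is not Oxide's telemetry format or an actual in-band telemetry wire encoding, and all the field names are assumptions.

```rust
// Hypothetical per-hop telemetry record, in the spirit of in-band network
// telemetry: each switch appends one of these as it forwards the packet.
#[derive(Debug, Clone, Copy)]
struct HopRecord {
    switch_id: u32,   // which switch handled the packet
    ingress_port: u16,
    egress_port: u16,
    latency_ns: u32,  // time spent in this hop
}

// The telemetry a packet accumulates along its path. Reading it at the far
// end tells you both the path taken and where the time went.
#[derive(Debug, Default)]
struct PathTelemetry {
    hops: Vec<HopRecord>,
}

impl PathTelemetry {
    // What each hop would do, conceptually, before forwarding.
    fn record_hop(&mut self, hop: HopRecord) {
        self.hops.push(hop);
    }

    fn total_latency_ns(&self) -> u64 {
        self.hops.iter().map(|h| u64::from(h.latency_ns)).sum()
    }
}

fn main() {
    let mut t = PathTelemetry::default();
    t.record_hop(HopRecord { switch_id: 1, ingress_port: 5, egress_port: 30, latency_ns: 800 });
    t.record_hop(HopRecord { switch_id: 2, ingress_port: 12, egress_port: 2, latency_ns: 650 });
    println!("path: {:?}, total latency: {} ns", t.hops, t.total_latency_ns());
}
```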

Speaker 1:

So, Arjen, you mentioned a bit in passing the ports that we've got towards the cable backplane and into the customer's network. Do you wanna talk about... I mean, there were a couple of big design decisions early on that I know were involved. I don't know if you wanna hit on any of those.

Speaker 3:

Yeah. I mean, the biggest design decision, the one that has driven a ton of what this thing physically looks like, is the fact that we wanted to have this blind-mated, cabled backplane, so that we can, you know, insert these compute nodes as sleds into this rack. And you don't have to mess with any cabling. The cabling is fixed in the rack. The rack will come with all cubbies wired up, and you simply insert a server into the slot, and it mates with this high-capacity network, as well as the management network and the power telemetry network, all without you having to plug in anything.

Speaker 3:

But in order to do that, we needed to make some decisions. We basically needed the ASIC to sit very close to the rear of the chassis, because if you wanna hit these speeds, you're gonna run into losses in the cable and losses in connectors. And there's only so much electrical budget you have, so much loss budget you have, and only so far you can get these signals to still, you know, be within the loss limits. And so the ASIC needed to move to the back of the chassis, which is also nonstandard.

Speaker 3:

A lot of these ASICs are more towards the front, because that's where usually the ports are. And so we have these connectors. One of the pictures I showed in the thread has these connectors broken out on the bottom of the board, because the board is floating in the chassis so that we have access to these ports on the bottom. And so we can have 16 of them connected to the backplane, and we can build this cable backplane that will let you blind-mate into it. And then 16 more are brought to the front, and there's a secondary PCB that sits there with QSFP cages on it, where these cables attach to those cages, and so you can insert your regular optical modules.

Speaker 1:

And these flyover cables, I think, are kind of amazing, actually. I mean, in terms of dealing with this loss, dealing with all this PCB, and discovering that, actually, we've got these... I mean, because that's kind of the use case they're designed for as well. It seemed to be a great

Speaker 3:

fit. Yeah. Because it turns out that the PCB material that you make these boards out of is actually pretty lossy, even the really expensive material. Running signals through these tightly extruded copper cables is much more efficient than running them through a printed circuit board, no matter how well you design that printed circuit board. And so the challenge with this design has been to get the signals from the ASIC, from the BGA balls of the ASIC, as quickly as we can into these connectors.

Speaker 3:

Because once we enter into these connectors, we try and make the transitions through these connectors as low-loss as we can make them. Once you're into twinax, the loss numbers are not nearly as bad, so you can have a longer cable run, you can go further with these signals, than you can through a PCB. And so a lot of this design has been trying to get these traces to be as short as we can, and to get as little loss as we can have, so that we can reach the servers at the top of the rack and the bottom.
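A hedged sketch of the budgeting exercise being described, symbolic only, since the real numbers depend on the material, the data rate, and the specific connectors chosen:

\[
L_{\text{trace}} + L_{\text{connector}} + L_{\text{cable}} + L_{\text{connector}} + L_{\text{trace}} \;\le\; L_{\text{budget}}
\]

Because the per-unit-length loss of twinax is much lower than that of PCB trace, the design pushes the trace terms (BGA ball to connector) toward zero, leaving as much of the budget as possible for the cable run up and down the rack.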

Speaker 1:

And then, do you wanna explain, I think you explained it in passing, but just to emphasize: why it's called Sidecar, what the origin of that is?

Speaker 3:

Well, the Sidecar name was because it is a sidecar to a compute node. It's like the compute node is a motorcycle, and the sidecar is this externally connected thing that hangs off on the side. That's why we started calling it Sidecar.

Speaker 2:

Well, but it's also a beautiful double entendre, because the server is called Gimlet. Yes. And Sidecar is both that thing off of the motorcycle, but also another beverage.

Speaker 1:

Yes. It's

Speaker 3:

Also the drink. But the Gimlet, that was actually chosen because we already had this Sidecar name, and so we needed another beverage to go with that.

Speaker 1:

So that

Speaker 3:

that was what

Speaker 1:

That that

Speaker 5:

that's yeah.

Speaker 3:

We will have drink-inspired systems, I guess, going forward.

Speaker 1:

That's what I

Speaker 3:

It'll be interesting to see what we're gonna call generation 2 of the compute node.

Speaker 2:

It'll it'll be a very boozy launch party.

Speaker 1:

Very boozy launch party. I know. Right? Yeah. And I feel that all credit is due to Kate on that one, and we felt that she should be naming Gimlet because she was very actively involved in leading the charge on that.

Speaker 1:

And I think she was the one who made the observation that, well, that's actually a drink. So maybe

Speaker 3:

Funny story about the sidecar, though. I've never had a sidecar. And so when we were recently at our manufacturing partner to do bring-up of our compute boards, we had a dinner, and the place we were at actually served really proper cocktails, like really good cocktails. And they had a sidecar on the menu as one of their specials. So of course, I had to try that.

Speaker 3:

That did not disappoint. So I am looking forward to the launch party.

Speaker 1:

Exactly. More sidecars in the future. I'm not sure I've ever had a sidecar, actually. So then, talk about the management network too, because this was kind of an issue. This is an issue of complexity any way you slice it, where we've got these service processors.

Speaker 1:

The problem is that we've got the host CPUs, and they are AMD CPUs. They're going to talk over this cable backplane. But you also need to connect the service processors to one another somehow. And, Arjen, maybe you can talk about that dilemma, because a lot of the challenge here, or some of the challenge anyway, is dealing with that network.

Speaker 3:

Sure. So, you know, you want an out-of-band management interface so that when the host CPU, or the operating system running on that host CPU, is not operating or configured the way you need it to, and that main link will not come up, you have some way to reach these systems outside of that. And so there have been some attempts at this: all these NICs that you can purchase have these extra interfaces, NC-SI interfaces. And we looked at that initially.

Speaker 3:

That basically lets you have a side channel. Your NIC is actually not just one network interface. It's actually a little switch, and there's a 1-gigabit port on the side that you can connect another system to. And that port can be up independent of the main port, the main MAC, that you connect to your network. And so you can build a little side channel using the same cabling.

Speaker 3:

So effectively, your NIC is not a NIC; it's a little switch. It's a little 2-port switch.

Speaker 1:

Which sounds great. What's the catch?

Speaker 3:

Well, the catch is that with these things, it's not entirely clear who controls what and when. And so one of the things that we struggled with was: what happens if the OS wants to reset the NIC? What happens if you need to power cycle that NIC for whatever reason, because it got jammed up? Because all these things run elaborate amounts of firmware. What if we need to power cycle this thing?

Speaker 3:

Well, in that case, your management network link is gonna go down as well, which means that your service processor, your board management function, does not have a network attached to it. And so you now lose the ability to do this out-of-band thing, and you might potentially not be able to connect at all. And so, urged by Rick, we did look at, okay, let's build another switch into this rack. We'll have a separate switch ASIC, another separate switch. Initially, we were talking about a, you know, little 24-port or whatever rack switch.

Speaker 3:

Well, it turns out that 24 ports is not enough, because we wanted to have 32 servers, and then, you know, things spiraled quickly out of control from there. So we ended up with this quite elaborate industrial Ethernet switch from Microchip that is on this board, so we have a completely separate set of links that we can control, separate from the main switch. And that is a dedicated segment that we can use for out-of-band management tasks. And so all these concerns are separate.

Speaker 3:

The service processors have their own link, and the host operating system cannot interfere with these links simply because they're invisible to it. It can't get there. And so that is a layer of robustness, of resilience, that we added in this way.

Speaker 5:

Specifically because I had done research in the past finding exploits in network card firmware where the host could intercept NC-SI traffic and do nasty things. So we wanted to make sure that we had acknowledged that that was a thing.

Speaker 1:

Yeah. Yeah. And so NC-SI is, I mean, Arjen is calling it a side channel, and it is practically in the name. So this is a sideband interface, and

Speaker 3:

Well, I was pretty disappointed when we decided not to do it, because it now added all this extra cabling, and all this extra complexity, and I had to grieve a little bit about the fact that I had to add it to the system. And I think ultimately it is the right choice to make, but I was still very much in camp NC-SI: you know, let's try and just make it work, let's work with one of these vendors and just get it

Speaker 1:

Get it done. Well, it feels so much simpler, of course. So everyone had their own NC-SI horror stories, and Rick definitely came in with a bunch of new ones as well.

Speaker 3:

I didn't have any, because I never used it.

Speaker 1:

Oh, really? Oh, okay. Interesting.

Speaker 3:

I was still positive.

Speaker 1:

That's right. And I think Rick was just like, no. Just no. Not again. No.

Speaker 1:

No. Because, Rick, I mean, you've really seen this thing... I think we all have to a certain degree, but I feel like you had to see different dimensions of how sideways this can go. And you've got no control over it. That's the problem. When that thing misbehaves, you've got very little insight into why. You're trying to deal with opaque firmware that you don't have the source code to, and these problems can be very transient.

Speaker 1:

It can be really nasty.

Speaker 5:

Yeah. I mean, there were all sorts of implementation issues just getting it to behave correctly. But then there's also the whole... it's part of an attack surface of the system that isn't particularly well thought through. A lot of the, you know, BMC-style management functionality just kinda got tacked onto PC systems. And this is one area where you have to really scrutinize exactly how that's implemented inside of the NIC to know whether you actually have isolation between your management network and your host system, and what happens in a case where you have a malicious host.

Speaker 5:

You know, these are circumstances you don't ever want to have to experience, but they're also things that you need to look through and consider and assess. And, unfortunately, in many of the NICs, the host owns the device, and that's sort of a flipped ownership model from what you would hope for.

Speaker 1:

And we really were hoping, I think, and of course hope is not a strategy, but hoping we could make NC-SI work. And it was not helping NC-SI's case that we were asking very detailed questions about the implementation and getting back more or less crickets, even from the most forthcoming vendors. We're like, oh, yeah, we're exposed here.

Speaker 1:

It's like, okay. We've got a lot of... yeah.

Speaker 5:

It's it's clear that

Speaker 3:

it was clear that all this stuff is designed, not as an afterthought per se, but it is definitely not the main course when they design these devices. And so that did not inspire confidence that we were gonna get out of this what we really needed. Because, yeah, as Rick points out, if it slips up and it is just not designed properly, you're exposed now. And you can't fix it, because it's kind of there. You designed it in.

Speaker 1:

So we don't have one switch, we have two. Just on that one board, there are two switches, each complicated devices in their own right. Simeon, you've had your hand up for a while. Did you wanna jump in here with a question?

Speaker 4:

Yeah. I just wanted to note that that pattern, of having a separate switch in a large box that has many systems, is also something that you see in routers, in big, you know, chassis routers, where each line card is a system on its own with its own CPU. It boots an operating system. So if you look at, like, a big Juniper router, for example, yes, you do have a separate switch and a separate management network, and if it's designed correctly, the user, the customer, never gets to put any systems on there, but it is an Ethernet switch at the end

Speaker 1:

of the day. Yeah. Right.

Speaker 3:

Right. Before I worked here in Silicon Valley, I was back in the Netherlands, and I worked on some of these systems from Cisco and Juniper. And so, you're right, you don't see that.

Speaker 3:

It's hidden from you as the end user. And it wasn't actually until after we'd already done the high-level design of this thing... it so happens that one of the electrical engineers that we worked with on this project was actually at Foundry Networks for a while, where they also did these large switches. And when I first explained what we were building and sketched what this was gonna look like, he's like, oh, that's exactly what we built then. And it worked the same way. It had a 100-megabit management interface on the line card, separate from the main network.

Speaker 3:

There was a board processor on the line card that, you know, managed power, etcetera, for the line card. And then there was even a third thing. Here, that is really like a serial link, but we're doing some things with differential signaling to make it look a little bit more like what we're doing with the rest of these signals, so that the wiring makes sense. That is driven by an FPGA, and even that, apparently, is done by these large chassis vendors, because you need to power cycle these line cards even when the board controller might be jammed up and not responsive. And so there's sort of a third power button and a bunch of low-level status that you want out of these things. Basically, you need an out-of-band management network for the out-of-band management network.

Speaker 2:

And, Jason, did you have a question? Were you trying to get in earlier?

Speaker 6:

Yeah. My question was really around the custom firmware, custom P4 discussion, but I might have dropped off at the wrong time there with my client. It's just that you've got, of course, the potential to run sort of SDN-style customization on the management node that you're connecting by PCIe. So I was looking forward to when that discussion was kinda launched. Was that part of what I missed there?

Speaker 3:

We didn't really get into that. So if you wanna get into that, we can address that now.

Speaker 6:

Well, yeah. I mean, just the P4 layer: if you think about it in terms of getting a merchant silicon switch, which does SDN, that would be hidden from you in a, you know, standard white-box switch like you were talking about for the hyperscalers, but you would still then get SDN as a customization layer on top of that. And I was wondering, there's sort of a tension there between where you want to implement what. But another question: if you want to do SDN, you want to have, I'm guessing, as a non-network-engineer guy, low-latency response from the management node that is actually hooked to this thing. So were you gonna be running the control plane stuff in the hypervisor's base operating system, or as a VM, doing good tricks to get really low interrupt latency into a VM?

Speaker 3:

So, yeah. Because this device is connected to one of our compute nodes, we will be running that software on our compute nodes. It is still a little bit in the air. So there's a component of this: this ASIC needs attention over PCI Express, because that's just the way that you manage this ASIC. So that is fixed to the node.

Speaker 3:

Like, that has to run on the machines that we are connecting these to. So there's no negotiation there. Whether or not we want the SDN function to live somewhere else, that's up to us, because basically any traffic that the switch doesn't know what to do with, you direct to what they call the CPU board. And that CPU board, in this case, can be any of our servers. And then we are implementing a piece that can pick up those packets and then process them accordingly.

Speaker 3:

Whether that runs directly on the host or in a VM is still to be determined. I'm not actually too concerned about latency, but I might be proven wrong there.
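For readers following along, here is a rough Rust sketch of the punt-path shape described above: packets the ASIC has no entry for are handed to software on the attached compute node, which decides what to do and typically installs state so later packets stay in hardware. Everything here, the types, the trait, the function names, is hypothetical; it is not the Tofino SDE API and not Oxide's control plane.

```rust
// Hypothetical sketch of the control-plane "punt path": the switch ASIC hands
// packets it has no entry for to software on the attached compute node, and
// software pushes state back down so subsequent packets stay in hardware.

struct PuntedPacket {
    ingress_port: u16,
    bytes: Vec<u8>,
}

enum Verdict {
    // Program the ASIC so future packets to this prefix are forwarded in hardware.
    InstallRoute { prefix: [u8; 4], prefix_len: u8, egress_port: u16 },
    // Answer entirely in software (think ARP/NDP) and inject the reply.
    ReplyFromSoftware(Vec<u8>),
    Ignore,
}

// Stand-in for whatever interface actually programs the ASIC over PCI Express.
trait DataplaneTables {
    fn install_route(&mut self, prefix: [u8; 4], prefix_len: u8, egress_port: u16);
    fn inject(&mut self, port: u16, frame: &[u8]);
}

fn handle_punt(pkt: &PuntedPacket, tables: &mut dyn DataplaneTables) {
    match classify(pkt) {
        Verdict::InstallRoute { prefix, prefix_len, egress_port } => {
            tables.install_route(prefix, prefix_len, egress_port);
        }
        Verdict::ReplyFromSoftware(frame) => tables.inject(pkt.ingress_port, &frame),
        Verdict::Ignore => {}
    }
}

fn classify(pkt: &PuntedPacket) -> Verdict {
    // Real logic would parse the frame; this placeholder just illustrates the shapes.
    if pkt.bytes.is_empty() {
        Verdict::Ignore
    } else if pkt.bytes.len() < 64 {
        Verdict::ReplyFromSoftware(pkt.bytes.clone())
    } else {
        Verdict::InstallRoute { prefix: [10, 0, 0, 0], prefix_len: 24, egress_port: 1 }
    }
}

// A logging implementation, standing in for the real PCIe-attached device.
struct LoggingTables;

impl DataplaneTables for LoggingTables {
    fn install_route(&mut self, prefix: [u8; 4], prefix_len: u8, egress_port: u16) {
        println!("install {:?}/{} -> port {}", prefix, prefix_len, egress_port);
    }
    fn inject(&mut self, port: u16, frame: &[u8]) {
        println!("inject {} bytes out port {}", frame.len(), port);
    }
}

fn main() {
    let mut tables = LoggingTables;
    let pkt = PuntedPacket { ingress_port: 9, bytes: vec![0u8; 64] };
    handle_punt(&pkt, &mut tables);
}
```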

Speaker 6:

Well, I guess, yeah, at NCI, they did a trick where they wanted Lustre endpoints on one host, because they had enough PCI cards in a Lustre router box that they actually wanted to be able to do multiple Lustre routers inside one box. So they actually virtualized the Lustre router, and then they did tricks to basically get the interrupt latency down on the Lustre router VMs, so that it was, you know, basically good enough that it was like running it on the bare metal of the machine, but you just got 2 of them. I've been out of this space for a bit, but Intel had a thing about network function virtualization where I think there was a default assumption in hypervisors that if you do an MWAIT or something like that, or, you know, wait for something to happen in the VM, you must be wanting control taken away from you, because the hypervisor can rob Peter to pay Paul and run another VM in that time.

Speaker 6:

Whereas if you want minimum interrupt latency, you wanna deliver the interrupts straight down into the VM and have it go as quickly as possible, not need to be re-dispatched from the hypervisor. And we got around that at NCI because, basically, the whole machine acts like a big resistor anyway, because that's what HPC does: it just turns, you know, current and voltage into a thermal energy and an IO bandwidth problem. But the thing is that they were actually running those guest VMs doing the routing with a spin-type idle, rather than a primitive that was trapping to the hypervisor.

Speaker 3:

Well, one of the options we certainly have is that the AMD CPUs we're using have a significant number of cores, so we can always dedicate some number of cores to this.

Speaker 6:

Yeah. Yeah.

Speaker 3:

And then you can spin on these. The VM can just wholly own that core, effectively. And so whether that's a single core or multiple cores, if you can meaningfully multithread that workload, that might just be the way to go.

Speaker 6:

Yeah. Well, Intel was definitely sort of saying, well, if you want a network function virtualized in a VM, you are not interested in, you know, robbing Peter to pay Paul. You're actually just interested in managing the complexity of having, you know, a network stack inside of a VM, and being able to then split them out and have multiple of them if you so choose, rather than have it all just running in the base kernel. So, yeah, it just seems like something where the hypervisor authors seem to have been making some assumptions of, why would you wanna do that? And so we're sort of fighting that, but it's probably changed since I was looking at it.

Speaker 3:

The good thing is that we have the flexibility to do this. If we need to run it directly in the main OS, we can, because we control that layer as well. And if we wanna run it in a, you know, traditional zone of sorts, or in a full-featured VM, we can. We're not definitive on what that needs to look like.

Speaker 1:

So, in terms of the software side of this, I love the simulator that we got. I feel like the tooling that we've gotten, you know, has been really interesting.

Speaker 3:

What was really impressive is that with the SDE that Intel provides for this device comes a simulation model that they have extracted from the actual RTL they designed the ASIC with. And so you get a simulation model that can run a P4 program. And it's only the lowest level of the driver that attaches to either your real PCI Express device or to the simulation model. And so the entire driver stack and runtime that they've built, that you use, runs on top of this, unknowing that it is not actually the physical device. And the simulation model can actually be mapped onto real interfaces.

Speaker 3:

And so you can simulate a Tofino-based system using a regular box with a couple of NICs, and you can write a P4 program and then run that program. And you can trace every step. You can see how the parser in the pipeline works, how the data is extracted from your packet, and then, as it travels through these match-action pipelines, it will log what's happening, to a level of granularity configurable by what you need. And it will actually put these packets out onto virtual Linux interfaces.
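
A rough sketch of the layering being described, assuming a hypothetical register-level interface (none of these type or method names are the SDE's actual API): only the lowest backend knows whether it is the real PCIe device or the simulation model, and everything above it is written against the same trait.

```rust
// Hedged sketch of the layering described above: only the lowest layer knows
// whether it is talking to the real PCIe device or to the RTL-derived
// simulation model. All names here are illustrative.
trait SwitchBackend {
    fn read_reg(&self, offset: u64) -> u32;
    fn write_reg(&mut self, offset: u64, value: u32);
}

/// Talks to the physical ASIC over a mapped PCIe BAR (stubbed here).
struct PcieBackend { /* mapped BAR, interrupt handles, ... */ }

impl SwitchBackend for PcieBackend {
    fn read_reg(&self, _offset: u64) -> u32 { 0 }
    fn write_reg(&mut self, _offset: u64, _value: u32) {}
}

/// Talks to the simulation model, e.g. over a local socket, with packets
/// surfaced on virtual (veth-style) Linux interfaces.
struct SimBackend { /* connection to the model process, ... */ }

impl SwitchBackend for SimBackend {
    fn read_reg(&self, _offset: u64) -> u32 { 0 }
    fn write_reg(&mut self, _offset: u64, _value: u32) {}
}

/// Everything above this point (table programming, the P4 runtime, the
/// control plane) is generic over the backend and cannot tell them apart.
struct SwitchDriver<B: SwitchBackend> {
    backend: B,
}

impl<B: SwitchBackend> SwitchDriver<B> {
    fn new(backend: B) -> Self { Self { backend } }

    fn program_table_entry(&mut self, table_base: u64, entry: u32) {
        self.backend.write_reg(table_base, entry);
    }
}

fn main() {
    // Same driver stack, different lowest layer:
    let mut on_hardware = SwitchDriver::new(PcieBackend {});
    let mut on_model = SwitchDriver::new(SimBackend {});
    on_hardware.program_table_entry(0x1000, 42);
    on_model.program_table_entry(0x1000, 42);
}
```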

Speaker 3:

And so you can actually build a smaller version that can process, you know

Speaker 1:

Tens.

Speaker 3:

A couple thousand. Yeah. A couple thousand.

Speaker 1:

It's 40 tens.

Speaker 3:

You get a couple thousand packets per second, because this is a proper RTL model. I'm not sure if it's quite cycle-accurate, but it is close enough for the purposes of this. And so, yeah, it's absolutely very slow. You're never going to deploy that as a quote-unquote software switch, because that just doesn't work. But in terms of development, it's amazing.

Speaker 3:

Yeah. We've been working on this hardware platform for a year, and Nils and our team have been writing software for this for just as long, even though we didn't even have a development platform from them until, I don't know, a couple of months ago. And so we've already been able to build significant pieces of infrastructure with just the simulation model, by running the model in a VM with a couple of other VMs attached to it simulating a network. So that has been quite good.

Speaker 1:

It's worth saying that the development vehicle we got for this thing is in a very traditional switch form factor, and our poor colleague, Josh Clulow, is, like, rediscovering the software that's

Speaker 4:

not Tofino that's on this thing

Speaker 1:

is not pleasant to deal with. And, yeah.

Speaker 3:

Because there's a full-featured BMC and everything on there.

Speaker 1:

The number of times Josh has been like, we really need to start a computer company and solve this problem. Poor Josh. I'm sorry, Josh. But the simulator has been amazing. And I think that when you're doing hardware-software co-design, it is really, really important that you find ways to unhook the software engineering from the hardware engineering, and that's been a really good one for us, I feel.

Speaker 1:

You know, all credit to Intel.

Speaker 3:

What's great is that the Barefoot team recognized that same philosophy of hardware-software co-design, because nowadays taping out an ASIC like this takes months: six months at least from getting masks made to actually getting the silicon done, the wafer cut up and packaged, and the first set of tests run before it is in your lab. You're quickly looking at six months, nine months sometimes. So for your customers to be able to build applications on top of an RTL-derived model that is as close as is needed, that's hugely valuable. And what makes it kind of cool is that they decided to expose it to the customers as well, so that the customers can work with this thing as they are bringing up their hardware, or as the silicon becomes available.

Speaker 1:

Yeah. It's one of those things that really validated the direction we've gone. Thomas, you had your hand up.

Speaker 3:

Yeah. I was wondering how this switch fits in with the whole trust model, if it's controlled by one of your VM hosts. So the board has the same root of trust and board controller, or service processor, that we have on our main compute system. And so we're using essentially the same foundation that we're using for our host CPUs to build this chain of trust. We do the same with this board, and that allows us to bring up the management network switch that is part of the service processor software payload.

Speaker 3:

And so we can gain some confidence that the management network is running the intended configuration. And then the service processor will initiate power and reset for the Tofino ASIC, and that then attaches over PCI Express to one of our compute nodes that has the driver and the payload, the P4 program, for the switch. And then once the thing is on PCI, the host CPU in that compute node, and the confidence in the integrity of the software it runs, would now extend into the program and the control plane that is controlling the main data flow for this device.
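
A minimal sketch of that hand-off chain, assuming a measure-then-run model; the stage names, the images, and the toy checksum are all illustrative, not the actual attestation scheme.

```rust
// Hedged sketch of the chain of trust described above, under the assumption
// that each stage measures (hashes) the next before handing off. The real
// system uses its own roots of trust and digests; this is only the shape.
#[derive(Debug)]
enum Stage {
    RootOfTrust,
    ServiceProcessor,         // also hosts the management-network payload
    HostOs,                   // compute node that owns the Tofino over PCIe
    P4ProgramAndControlPlane, // programs the main data flow
}

fn measure(image: &[u8]) -> u64 {
    // Stand-in "measurement": a trivial checksum instead of a real digest.
    image.iter().fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

fn main() {
    // Each stage's image is measured before it runs; the resulting log is what
    // lets you later verify exactly which (open) software executed.
    let stages = [
        (Stage::RootOfTrust, b"rot image".as_slice()),
        (Stage::ServiceProcessor, b"sp + mgmt switch payload".as_slice()),
        (Stage::HostOs, b"host os".as_slice()),
        (Stage::P4ProgramAndControlPlane, b"p4 program + dataplane daemon".as_slice()),
    ];
    for (stage, image) in stages {
        println!("{:?} measured as {:#x}", stage, measure(image));
    }
}
```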

Speaker 1:

And then all of that software will be open, so you'd be able to know exactly what software is executing all the way up that chain of trust. And the fact that it's the same service processor and root of trust as we're using on the compute node really simplifies the system, which is nice, because there are so many things that have complicated the system, necessarily. So we believe it's really nice to have these things that actually simplify it by reusing some of these components.

Speaker 3:

Yeah. Having a couple of these now more, quote-unquote, standard building blocks that we've more or less arrived at does make it a little bit more tractable, because if we had to design yet more pieces for this, it would have taken so much longer.

Speaker 1:

And, teaser: we are open sourcing Hubris, the operating system. We're open sourcing it tomorrow, with Cliff Biffle's talk at OSFC. We're excited about that.

Speaker 7:

Sorry to

Speaker 3:

So the

Speaker 7:

to butt in without putting my hand up. I don't actually know how to put my hand up in this app.

Speaker 1:

No problem, Edward. Go ahead. What's

Speaker 7:

up? So just a question on on failure domain. So it sound it sounds like one of the compute nodes is special. So, you know, what what happens when that compute node fails? And then the secondary question is, what happens if you you load a bad P4 program on the switch?

Speaker 7:

Have you cut yourself off from fixing it, or do you have a separate connection to one of the nodes? I'm sure you've thought of these things, but these are the things that are bugging me, hearing what I've heard so far.

Speaker 3:

So the first layer of redundancy here is that we have two switches in the rack. That's where we start. So if one switch-slash-node fails, we still have another one available. And the overlay network that we're using to actually ship VM traffic around will be able to deal with one of these switches going away. If the node itself fails, it depends a little bit on how it fails, but the Tofino can actually run disconnected; like, PCI Express can go away.

Speaker 3:

You can reset the PCI Express peripheral in the ASIC independent of the data fabric that this thing has. And so we can actually, well, this is still to be validated; we still have to validate that.

Speaker 1:

But we still have the power of this thing on.

Speaker 3:

We've been promised that you can reboot the node and pick the PCI Express link back up, and pick up state out of the ASIC in such a way that you can continue running and don't have to cycle the ASIC. That was one of the pain points with the Broadcom ASICs we used at Facebook, because a reset of the SDK meant that you had to reset the ASIC, which at that point meant that you would interrupt all the traffic through the data plane. Now the catch is that once the host goes away, you can't program the tables that this thing uses anymore. Existing flows will keep running, but for the other, more SDN-like steering functions, say you're doing NAT, you do the address translation here, you can't establish new sessions.
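
A hedged sketch of that warm re-attach idea; the Asic methods here are hypothetical control flow, not the real Tofino or SDE interface. The point is only that a restarted host adopts the state that kept forwarding while it was down, and falls back to a disruptive full reset only when it has to.

```rust
// Hedged sketch of "pick the link back up without cycling the ASIC".
// Everything here is hypothetical control flow, not a real device API.
struct Asic;

impl Asic {
    /// Hypothetical: is the data plane still forwarding existing flows?
    fn dataplane_running(&self) -> bool { true }
    /// Hypothetical: read back the currently installed table entries.
    fn read_back_tables(&self) -> Vec<(u64, u32)> { vec![(0x1000, 42)] }
    /// Full reset: interrupts all traffic through the data plane, last resort.
    fn hard_reset(&mut self) {}
}

struct ControlPlane {
    shadow_tables: Vec<(u64, u32)>,
}

impl ControlPlane {
    fn attach(asic: &mut Asic) -> Self {
        if asic.dataplane_running() {
            // Warm attach: adopt the state that kept forwarding while the
            // host was down, then resume programming new entries.
            Self { shadow_tables: asic.read_back_tables() }
        } else {
            // Cold attach: only now do we pay the cost of a full reset.
            asic.hard_reset();
            Self { shadow_tables: Vec::new() }
        }
    }
}

fn main() {
    let mut asic = Asic;
    let cp = ControlPlane::attach(&mut asic);
    println!("adopted {} table entries", cp.shadow_tables.len());
}
```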

Speaker 3:

So this is where the other switch would have to take over, and we would have to detect that this situation is happening. That's what the out-of-band management network would be used for: to make sure that the hosts know to go through the other switch. Because we run these switches independently, every host in our network, using ECMP and some other trickery, will be able to decide individually which switching path to take, and so we can direct traffic the other way. But existing flows will continue, and we could then migrate those flows off if we wanted to. So there's a possibility for graceful degradation.
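
A small sketch of the per-host path choice described here: a hash-based ECMP pick restricted to uplinks currently reported healthy. The hashing and health signaling are illustrative, not Oxide's actual implementation.

```rust
// Hedged sketch: each host hashes a flow onto one of the rack's two switches
// and drops an unhealthy switch out of the candidate set when it is reported
// down over the out-of-band management network.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug)]
enum Uplink {
    Switch0,
    Switch1,
}

#[derive(Hash)]
struct Flow {
    src: [u8; 16],
    dst: [u8; 16],
    src_port: u16,
    dst_port: u16,
    proto: u8,
}

fn pick_uplink(flow: &Flow, healthy: &[Uplink]) -> Option<Uplink> {
    if healthy.is_empty() {
        return None; // no usable path
    }
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    // ECMP-style selection: the same flow always maps to the same healthy
    // uplink, so existing flows stay put unless their switch is marked down.
    Some(healthy[(h.finish() as usize) % healthy.len()])
}

fn main() {
    let flow = Flow { src: [1; 16], dst: [2; 16], src_port: 443, dst_port: 50000, proto: 6 };
    let both = [Uplink::Switch0, Uplink::Switch1];
    let degraded = [Uplink::Switch1]; // e.g. Switch0's host has failed
    println!("normal: {:?}", pick_uplink(&flow, &both));
    println!("degraded: {:?}", pick_uplink(&flow, &degraded));
}
```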

Speaker 3:

It's up to us to go and implement it. Some of that will exist at the very beginning of our product, and some of that will be refined over time as we learn how to better control this thing.

Speaker 1:

Adam, does that make sense?

Speaker 3:

It is no worse, though, than an existing 1U switch that you have in your rack that has a Xeon D on a little daughterboard plugged into the board, because that thing can fail in exactly the same way as our compute nodes would fail. So in terms of failure domains, I would consider those somewhat equivalent from a hardware perspective. Now, our compute node runs maybe a lot more software than the dedicated management CPU in the chassis of most of these off-the-shelf, white-label switches does, so there's more chance for failure there. And insofar as loading a wrong P4 program, we will have to build, you know, the continuous integration and testing capability to make sure that we just don't ship broken P4 programs.

Speaker 3:

There will have to be a process by which these are vetted and, you know, tested before they go out in updates to customers.
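
A minimal sketch of what that vetting could look like, assuming a hypothetical harness around the simulation model (SimSwitch and its methods are placeholders): each candidate P4 build has to pass a table of packet-in/packet-out expectations before it is allowed into an update.

```rust
// Hedged sketch of pre-ship vetting for a compiled P4 program. The harness
// types are stand-ins, not the actual SDE or simulation-model interface.
struct SimSwitch;

impl SimSwitch {
    fn load_program(_compiled_p4: &[u8]) -> Self { SimSwitch }
    /// Inject a packet on in_port; returns (out_port, packet) or None if dropped.
    fn inject(&mut self, _in_port: u16, pkt: &[u8]) -> Option<(u16, Vec<u8>)> {
        Some((1, pkt.to_vec())) // stand-in behavior
    }
}

struct Case {
    name: &'static str,
    in_port: u16,
    pkt: Vec<u8>,
    expect_out_port: Option<u16>, // None means "expect a drop"
}

fn vet(compiled_p4: &[u8], cases: &[Case]) -> bool {
    let mut sw = SimSwitch::load_program(compiled_p4);
    cases.iter().all(|c| {
        let got = sw.inject(c.in_port, &c.pkt).map(|(port, _)| port);
        let ok = got == c.expect_out_port;
        if !ok {
            eprintln!("FAIL {}: expected {:?}, got {:?}", c.name, c.expect_out_port, got);
        }
        ok
    })
}

fn main() {
    let cases = vec![Case {
        name: "forward port0 -> port1",
        in_port: 0,
        pkt: vec![0u8; 64],
        expect_out_port: Some(1),
    }];
    let shippable = vet(b"compiled p4 artifact", &cases);
    println!("shippable: {}", shippable);
}
```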

Speaker 1:

Boy, this is where the simulator is huge. Right? To be able to, well, we've got lots of things we can go do.

Speaker 3:

Yeah. That will be a combination of simulation work, but also obviously actual racks that we will run these on before software updates are blessed and go out. And this goes back to that earlier discussion about having multiple P4 programs. If we're going to be building P4 programs for every customer, that will quickly spiral out of control, so I don't think we can. But we might be able to do different verticals at some point.

Speaker 3:

But a part of that is our ability to build up enough automation to appropriately test all these things. Because you can't just simulate or run with one rack; we will have to rely on even more network simulation than just a model of the switch. We will have to figure out what happens in routing scenarios if you're running multi-rack situations: tens, hundreds of racks, multiple AZs. What does that kind of look like?

Speaker 3:

How do these failure models work? And how do we make sure that whatever we build for that works as intended in every release? So, yeah, we'll have to build a lot more simulation capability.

Speaker 1:

A lot of fun software to build, for sure. Ari, this has been awesome. Immediate congratulations to you and everyone who has worked on this. This is a big, aggressive undertaking. I mean, as we have joked, and I can't remember, Adam, how frequently we've joked about it here, Oxide is nine startups within a startup, and this is definitely one of them, maybe two of them. And the fact that you and the team pulled this off, and you got some serious curveballs en route that you managed to field, it's just incredible to see.

Speaker 1:

And I think it's just great that there's this great aesthetic beauty to the artifact, going back to what you originally tweeted. And you and I had a conversation before you tweeted it, and I told you, people are gonna really gravitate to this because it's beautiful and it's sophisticated and it's impressive. It's a lot of fun to see.

Speaker 3:

Yeah. When you're knee deep, or waist deep, or even neck deep in engineering trying to make this thing work, it's easy to forget about the art aspects, or even to marvel at the timescales at which this is happening. Because on the one hand, you hear numbers like 12 terabits per second, 6 billion packets per second. But then on the other end of the scale, there's a clock generator generating clocks for this chip with 150 femtoseconds of worst-case jitter, with the right oscillators and so on selected for that.

Speaker 3:

And so you're going from, you know, billions down to nanoseconds, picoseconds, and femtoseconds. Because when we're talking about these traces, you're talking about signals that propagate across the PCB in hundreds of picoseconds. And so it's really cool. It's interesting to contemplate that sometimes.
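
A quick back-of-the-envelope on the span of timescales, using only the round numbers quoted above:

```latex
% Rough arithmetic only; all inputs are the round figures mentioned in the discussion.
\begin{align*}
\text{per-packet budget at } 6\times10^{9}\ \text{pkt/s} &:\quad
  \frac{1}{6\times10^{9}\ \text{s}^{-1}} \approx 1.7\times10^{-10}\ \text{s} \approx 170\ \text{ps}\\
\text{worst-case clock jitter} &:\quad 150\ \text{fs} = 1.5\times10^{-13}\ \text{s}\\
\text{ratio} &:\quad 170\ \text{ps} \,/\, 150\ \text{fs} \approx 10^{3}
\end{align*}
```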

Speaker 1:

It's an absolute marvel. And, yeah, looking forward to getting our boards back. As your tweet pointed out, our boards have shipped; we shipped the design out to our fab, and we're looking forward to bringing this thing up, and on to the next chapter.

Speaker 3:

Yeah. Well, I guess we can promise: we'll put some pictures up. We expect assembled systems in early January, so we'll tweet a little bit more as the system comes together.

Speaker 1:

And hopefully we don't blow any up, but, Jesus Christ, this thing draws so much current.

Speaker 3:

Well, in those pictures, you'll see the silly large heat sink that goes on top of this thing in order to keep it within the thermal envelope for the whole system. Which was a challenge in itself, because the heat sink is so heavy that you need to really start thinking about shock and vibe as this thing gets transported. And this die, this chip, has no heat spreader. It's an open die directly exposed to that heat sink that might, you know, rattle along on top. So that's what I mean by so many details to get right.

Speaker 3:

There's so many things that we have to

Speaker 1:

Adam, were you clued in on that crisis at all, the moment arm crisis?

Speaker 2:

Yeah. The fact that it's running without, I mean, that it's right on top of it, is pretty crazy.

Speaker 1:

Well, and the heat sink was so large that you get very concerned about the moment arm and the ability to crack the PCB, because it is a galactic heat sink on top of this, like, nuclear reactor postage stamp. Yeah.

Speaker 2:

And I understood that they sent us, I guess, sort of inert PCBs to test some of the physical properties, so that when we crack something, we're not cracking the real PCB.

Speaker 3:

Well, yeah.

Speaker 2:

I mean, the real chip.

Speaker 3:

We're doing some assembly process development. Intel has also shipped us non-functional units that are mechanically the same as the actual thing. And so you assemble a board with just the chip, and then you can mount the heat sink on top, and then you can do shock and vibe tests, actually until the thing cracks, so that you understand where the limit is: when do you start to damage this part? And then we can design enclosures and build sufficient specifications around it, so that in reality this won't happen when we start shipping systems. But there's, yeah, there's a lot of these things that

Speaker 1:

you just don't A lot of complexity. Yeah. Yeah. It's a lot of complexity. A lot of things have to come together, and a long way to go too.

Speaker 1:

We're really only at a way station here. So we've got a long way to go, I'm sure.

Speaker 3:

Like, we're concerned enough about this that the first boards are actually not shipping with the heat sinks on them, because we just don't trust existing shipping. Once we've validated that the power works, and done whatever rework we need to do to get these things to run (hopefully there's not that much rework), we'll do that first before the heat sink goes on, because it's an almost 3 kilogram copper thing that sits on top. It makes the whole thing a lot less easy to handle. But yeah.

Speaker 1:

And we might trust existing shipping services a little bit more if they would stop throwing our packages in the bushes. Yeah.

Speaker 3:

That would be nice.

Speaker 1:

That would be. So, those existing shipping services, if you're listening: you're in control. Stop throwing our packages at us. But, Ariane, awesome work. And thanks again for joining us today. Thanks, everyone, for asking a lot of really interesting questions, and on to bring-up.

Speaker 1:

Looking forward to it.

Speaker 3:

Yeah, likewise. Thanks, everyone. Good night.

Speaker 2:

Good night. Yeah. Thanks, man.


Speaker 1:

Yep. Good night.

Speaker 7:

Thanks. Happy Hanukkah.
