Hell is other networks
And then we are expecting some Oxide folks who may or may not be... oh, there you go. Hey, Levon.
Bryan Cantrill:Hey. Yeah.
Adam Leventhal:Hey, Bryan.
Bryan Cantrill:May or may... go ahead. Finish your sentence. May or may not be turkeys? May or may not be... it just feels like we're
Adam Leventhal:May or may not be able to let themselves on stage.
Bryan Cantrill:Oh, okay.
Adam Leventhal:That's that's ...
Bryan Cantrill:Well, look, you know I struggle with this.
Adam Leventhal:Let's calm down here. No. I just mean, you know, I don't wanna snow you with details here, but we have to give people particular roles in order for them to let themselves on stage.
Alan Hanson:What are the roles?
Adam Leventhal:Well, now you have it. You are now an oxidase member.
Levon Tarver:Yeah. I mean, a misconception is that we're good at all forms of technology. We're not.
Bryan Cantrill:No. Oh god. We're definitely not. I know. No.
Bryan Cantrill:It's and
Adam Leventhal:there is there we go. And here we go. Are you yeah. You letting those folks up?
Bryan Cantrill:You should do it.
Adam Leventhal:It's probably for the best.
Bryan Cantrill:It's just for the best. I'm trying to, and it's like offering me Nitro subscriptions or something. I don't know. The what what's going on? Is this thing beginning to lose the plot a little bit, by the way?
Bryan Cantrill:What's going on with that?
Adam Leventhal:You know, we I have a Nitro subscription for this. This is why we get
Bryan Cantrill:Oh. Better k.
Adam Leventhal:Yeah. We have better quality audio than just the free tier. But it's so ridiculous how we have to get it. Like, I can't like, what I wanna do is just, like, swipe the credit card like an old person and pay for the thing I want, but I can't do that.
Bryan Cantrill:What I have to do We won't be doing that. Is it is it is it a call sales thing? No.
Adam Leventhal:No. No. I no. It's it's it's the opposite of the enterprise edition. I have to buy Nitro, which gives me boosts, which I can give to the server of my choice, which I have deigned to to bless upon this Discord server.
Bryan Cantrill:Are you concerned that you are violating child endangerment laws? I mean, just by I mean, it just feels like this feels like something. It's like, why are you using this if not to get in to be to creep in on someone's Snapchat? We we we buy boosts. I mean, are you should we
Adam Leventhal:buy that?
Levon Tarver:Do you
Alan Hanson:use Robux to buy those?
Bryan Cantrill:Yeah. Do you use yes.
Adam Leventhal:Yes. Okay. Yes. I've become everything we've all always hated. And, I buy boosts.
Adam Leventhal:I buy Nitro or whatever and distribute the boosts. Oh my god. And I also get, like, certain perks on my Steam account, which is a thing I now have. Thank
Alan Hanson:you very much.
Bryan Cantrill:I will thank you for all of your service and sacrifices. Yes. This is definitely hub of get kind of vibes over here. I'm not sure. Like, are we will we be looking at this on at this day?
Bryan Cantrill:Like, oh, god. Do you remember when you, like, had to tell me what boosts were? And now we all get paid in boosts. Now it's like your W-2 shows boosts on it.
Adam Leventhal:It's like the Federal Reserve has been disbanded. Disbanded? I mean, boosts are the reserve currency of choice internationally.
Bryan Cantrill:Tell me that's not a headline from tomorrow. I mean, just like, tell me that's impossible right now.
Levon Tarver:You were worried
Alan Hanson:about Bitcoin and here we are.
Bryan Cantrill:Yeah. Exactly. No. I I feel the if you could tell me that's impossible, I'd appreciate it. But
Adam Leventhal:It's impossible. Before we get started, I do have good news, which is not the good news you know about, but I have different good news. Maybe you already know about it. But my son, previously introduced to this podcast as a C++ partisan...
Adam Leventhal:He has now taken up Rust.
Bryan Cantrill:Look at that. It's a natural phase. You just have to let them go through it. As a parent of a C++ programmer, you're just holding your breath thinking, like, I'm trusting this kid to make his own decisions, but I'm not at the party where we're overloading the comma operator.
Bryan Cantrill:He's got to figure out that... and look, I'm sure he experimented with it, and it's okay. He's coming out of the
Levon Tarver:other end.
Adam Leventhal:The thing, apropos of Discord: he sent me a screenshot where he's getting torched by his friends in Discord for needing to, quote, unquote, memorize 3,000 built-ins and having megabytes-large target files, to which I have bad news, which is that I routinely have gigabytes, like tens of gigabytes, of target files I need to delete. But
Bryan Cantrill:Yeah. There are legitimate criticisms of Rust, and I feel the Discord has got it kinda half right. They've got some legitimate ones, but maybe some less so. But this is actually interesting to know. So what goes on in the teenage Discords bullying the Rust programmer?
Bryan Cantrill:I'm kind of interested in the cyber bullying of of young Rust programmers. Is that a Yeah.
Adam Leventhal:No. I mean, I have been sent exactly one screenshot of this Discord, which I know he has been on for a long time. And it was this one. And
Bryan Cantrill:it talks about He obviously didn't give you the rider that I hear quite a bit, which is I'm gonna tell you this, but it's not for the podcast.
Alan Hanson:It's like,
Bryan Cantrill:you know, I can well, I'll I'll be the first of all, I'll be the judge of what's for the podcast. It was not for the podcast. Thank you very much.
Adam Leventhal:Yeah. I didn't get that rider. So I assume implicitly... Here
Bryan Cantrill:we are.
Adam Leventhal:Podcast fodder. Exactly. I'm sure all all of his buddies don't mind being quoted in the podcast as well. So, yeah, we're good.
Bryan Cantrill:We're good. Alright. I have to tell you, I LOL'd at your "No Egress." I thought it was very funny.
Adam Leventhal:Okay. I I You're you're you're gonna you're gonna fall over and die now, so not my joke. Not really? ChatGPT joke.
Bryan Cantrill:That's ChatGPT's joke. It's a good It's a good joke. Oh my god. It's a good joke.
Adam Leventhal:So I was like, I I was I I felt like there was something right there, and I really liked I mean, I know we went back and forth with ChatGPT. We did go back
Bryan Cantrill:and forth with ChatGPT on the title. Yeah.
Adam Leventhal:Yeah. The title, which which I which I came up with.
Bryan Cantrill:It's good. It was actually human. Yeah.
Adam Leventhal:Human crafted. Human hand. Yes. But I was like, oh, there's gotta be a pun here. And I asked ChatGPT for a pun, and its first option was "No Egress."
Adam Leventhal:I'm like, nailed it. Like that. Yeah.
Bryan Cantrill:Oh my god. Okay. So everything we've said is wrong and they actually are gonna replace us. So actually, I I sorry. I would like to, I would like to have an updated message to our future AI overlords.
Bryan Cantrill:That's pretty funny. Well, "No Egress," of course, is a Sartre reference to No Exit, the one-act play: hell is other people. And in this case, hell is not other people, hell is other networks. But I wanna get into it. So we've got a bunch of folks here today that worked on this issue.
Bryan Cantrill:So we had a customer that was complaining of a particular... I think that's the wrong word. They were observing actually pathological behavior. Sorry about that. I'm not
Adam Leventhal:complaining at all.
Bryan Cantrill:My god.
Adam Leventhal:Love this. Like, look, there's pathological behavior. What about your whiners? Am I right?
Bryan Cantrill:Oh, come on. What is come on. Yes. So there's smoke in the cabin. Come on.
Bryan Cantrill:You know, it's
Adam Leventhal:Come on. We have masks there for a reason.
Bryan Cantrill:Exactly. Put on your mask if you're so worried about it. So worried about your flow of oxygen. No, they observed some really pathological behavior.
Bryan Cantrill:And, Alan, Will... Will, were you kind of the first person to field the report of the pathology here?
Will Chandler:Yeah. Yeah. I think so. I think they were just doing some tests to try to upload a disk image from the rack to the rack. So they're in a VM on the rack, and they're just trying to do the upload to the control plane.
Will Chandler:And then it would run for thirty seconds, and it would just error out with a vague and unclear error.
Bryan Cantrill:So that's obviously bad. So what they've observed, Will, correct me if I'm wrong, is that networking is fine going in and out of the rack and fine between two VMs on the same network in the rack. But when they cross VPCs in the rack, the network performance is deeply pathological. Is that a fair summary?
Will Chandler:Yeah. That's where that's where we landed eventually. Initially, we thought it was just image uploads and then further experimentation showed that it was anything that's crossing VPCs.
Bryan Cantrill:Oh, okay. Actually, so I I shouldn't rush the story then because fast forwarding. Yeah. Exactly. Okay.
Bryan Cantrill:So let me let me back up. So okay. So it's just image upload and so you're trying to figure out how do we what's the kind of process of of getting to to that discovery?
Will Chandler:Well, I mean, so obviously, you want to blame Alan first, but... Oh my god.
Bryan Cantrill:That was almost Diet Coke out the nose onto the mic. You know, are there cameras in here?
Alan Hanson:How did
Bryan Cantrill:you time that so perfectly? Yes. Go ahead.
Will Chandler:Yeah. So it was really just a process of elimination. Right? So we started with just trying to reproduce it on our own racks, and we couldn't. And so then we started to think, well, what's different about uploading an image to the control plane versus going from one VM within the same VPC to another?
Will Chandler:And I think it might have been Angela or maybe it was Trey or Levon who said, well, you know, what's different about that is that we're exiting the rack, and then we're hairpinning back in when we cross the VPC or go to the control plane. And then that kind of led us to measure things more accurately. And I don't know. Maybe Levon or Trey can discuss in more detail, like, what we found. Or if we wanna tease more, I'll leave it up to the podcasting experts.
Alan Hanson:Well, even I mean, to tease a little more, we did have a problem with image upload a year before or something.
Bryan Cantrill:Right. So this it's not like
Alan Hanson:It was in our mind of, oh, it's that, it's come back. You know? So we spent some time wandering around in the weeds trying to figure out if it was that old problem in some new manifestation.
Bryan Cantrill:Which is always frustrating. It's like, we fixed this. Image uploads have to be fast because we fixed this problem. And, I mean, of course, our first instinct is: can we reproduce this? No, can't reproduce this.
Bryan Cantrill:So now we've got to figure out the differences in environment. And it feels like it's a bit of a breakthrough, because it wouldn't necessarily be obvious where the traffic is coming from. Because it did sound like they did not do the experiment of uploading an image from not on the rack.
Alan Hanson:Oh, they did. They did try it from somewhere
Bryan Cantrill:else. Okay.
Alan Hanson:And initially, they were like, we're having a problem with image upload. And then as we went back and forth, then I think it was like, oh, well, if I try it from somewhere else, then it works okay.
Bryan Cantrill:Yeah. Interesting.
Alan Hanson:Fine. That was part that was another piece of the confusing puzzle.
Bryan Cantrill:Okay. So then, Levon and Trey, are all four of you kinda on this from the get-go? Sounds like
Alan Hanson:Well, Will was certainly on from the beginning. And I think he started, like, bringing in more people as the first round of stuff wasn't getting anywhere. And I don't know how long before Levon or Trey jumped in.
Levon Tarver:Might have
Will Chandler:been a day or two. It was escalated pretty quickly because it was so weird and it was so reproducible as well. It wasn't like this was a one-off. It's like, this happens every time I do this, and this is something really basic. So something was clearly... Yeah.
Bryan Cantrill:Right. It's always helpful when these problems are kind of embarrassing, frankly. We're just like, okay, something is obviously wrong here. So, Levon, Trey, do you wanna hop in here with kinda how you got to the observation of, oh, wait a minute.
Bryan Cantrill:This is traffic that's hairpinning?
Levon Tarver:Yeah, so I also don't remember who exactly said the magic words, but the conversation essentially boiled down to as we're gathering details and trying to figure out exactly which system was talking to which system and like where was it located? Because with networking, there's always a black box and you're just trying to get as much data as possible. And it was either Alan or Will that informed us that, yes, when the customer is uploading an image or performing a file transfer from like their laptop or another system, external system that's authorized to talk to the rack, it's fine. But when it's happening from a virtual machine on the rack, it's experiencing an error. And that was the thing that was okay.
Levon Tarver:The only thing that's really different there is that the traffic is hairpinning, because our Oxide control plane runs in its own VPC, separate from the instances and other services. That to me, you know, immediately was like, well, okay, this traffic must leave the rack, transit the customer's connected switches or routers, and then come back down into the rack.
Bryan Cantrill:Okay. I did not realize that about this problem. Yeah. I thought they had reproduced this by by by with two VPCs that they created but in fact it's the fact that the control plane is in its own VPC that by nature if you are on the rack talking to the control plane it's VPC to VPC traffic.
Levon Tarver:Correct. Well, and also they were performing a file transfer from one VM to a VM in a different VPC. So it was because both of those had the common failure. That's what really led us to lean into, okay, VPC to VPC communication seems to be the common issue. This traffic pattern, what's unique to it versus something coming from outside the rack is that it's hairpinning through the upstream device.
Levon Tarver:So, like, that's that's really what made us suspicious. Yeah.
Bryan Cantrill:Yeah. And could you just describe, like, when you say VPC, you're talking about a virtual private cloud. Could you just elaborate a little bit on what that is and then a bit of how we've implemented that inside the Oxide rack?
Levon Tarver:Sure. So VPCs, virtual private clouds: I think most people are familiar with the term now, but it's kind of ambiguous, and it's implementation specific based on who you're working with and how they decided to do it on their platform. But I think the most common element of a virtual private cloud is that it's essentially a private network that runs in a cloud-like service. So if you're using AWS or Google or Azure, you're going to have a private network that can be stretched across a cluster of systems or machines that you don't have to worry about. And as you create virtual machines, they can participate in this private network of systems.
Levon Tarver:And then you can allow this private network, if desired, to communicate with the outside world through some form of gateway. So that's a very high-level, simplified explanation that seems to be common across most cloud-like systems. And so on the Oxide rack, because we're trying to bring that cloud-like experience to customers that buy this system, we have implemented it, going a little bit into how the sausage is made, using Geneve packet encapsulation, and VPCs are roughly associated with a VNI. And so that VNI creates a private network inside of our rack. Systems or VMs that are on the same VPC have their traffic labeled with the same VNI, and they're able to communicate with each other over IPv6 with Geneve encapsulation inside the rack.
Levon Tarver:So that's the high level of how that stuff works today. But we have a lot of information about this if people want to go deeper. I believe it's talked about on other podcasts, and we have some documentation about that stuff written elsewhere too.
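As a rough illustration of the "traffic is labeled with a VNI" idea Levon describes, here is a minimal Python sketch that builds and parses the 8-byte fixed Geneve header from RFC 8926. It is not Oxide's implementation; the function names `geneve_header`, `vni_of`, and `same_virtual_network` are made up for this example, and no Geneve options are modeled.

```python
import struct

GENEVE_UDP_PORT = 6081  # IANA-assigned UDP destination port for Geneve
ETH_P_TEB = 0x6558      # protocol type: the inner payload is an Ethernet frame


def geneve_header(vni: int) -> bytes:
    """Build the 8-byte fixed Geneve header (version 0, no options)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit value")
    ver_optlen = 0  # 2-bit version = 0, 6-bit option length = 0
    flags = 0       # O (control) and C (critical) bits clear
    # The 24-bit VNI fills the top three bytes of the last 32-bit word;
    # the low byte is reserved and left as zero.
    return struct.pack("!BBHI", ver_optlen, flags, ETH_P_TEB, vni << 8)


def vni_of(header: bytes) -> int:
    """Extract the VNI from a fixed Geneve header."""
    _, _, _, word = struct.unpack("!BBHI", header[:8])
    return word >> 8


def same_virtual_network(a: bytes, b: bytes) -> bool:
    """Two encapsulated packets share a virtual network if their VNIs match."""
    return vni_of(a) == vni_of(b)
```

In this sketch, two instances in the same VPC would emit packets carrying the same VNI, while an instance in another VPC would carry a different one.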
Bryan Cantrill:There's an RFD or two written on that one.
Levon Tarver:Oh, For sure.
Bryan Cantrill:Drop some references too. So we and then so so why does VPC to VPC traffic, why does that leave the rack? What what's the
Levon Tarver:So okay. So currently, the reason it leaves the rack is because of the way our system works today. If you want a VM inside of a VPC to communicate with another VM in the same VPC, our virtual NIC (I describe it that way to our customers, but it's our virtual switch interface, called OPTE) has some rules in it where it can tell that where you're trying to go is on the same VPC, and so it knows how to get that traffic to another sled in our rack. But if it sees that what you're trying to get to is not in the same VPC, it sends it to the rack switches to leave the rack by default. That's just how things are today. Things might change in the future as we, you know, enhance the feature set of the rack.
Levon Tarver:We have had some customers say that they kind of like this because they go, hey, if something's trying to communicate from one VPC to another VPC, we want to have, like, our security appliances in the middle of the path, and we don't want that traffic to ever leak from one VPC to another VPC without us being able to look at that stuff. And that could be a very valid reason to do that. Some people will ask, well, why can't I just make it go directly from one VPC to another? And this customer was actually asking us this, and we said, well, no one's really needed to do that yet. And it's not that we're not willing to do it.
Levon Tarver:This is how it works today.
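The forwarding rule Levon describes can be sketched in a few lines of Python. This is only a toy model of the decision as stated in the conversation, not OPTE's actual rule engine; the `Instance` type, the `next_hop` function, and the sled and VPC names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    vpc: str   # which VPC the instance belongs to
    sled: str  # which sled in the rack hosts it


def next_hop(src: Instance, dst: Instance) -> str:
    """Same VPC: deliver straight to the destination's sled.
    Different VPC: hand the packet to the rack switches, so it leaves the
    rack and has to be routed back in (hairpinned) by the upstream network."""
    if src.vpc == dst.vpc:
        return f"sled:{dst.sled}"
    return "rack-switch"


vm_a = Instance("vm-a", vpc="tenant-1", sled="sled-3")
vm_b = Instance("vm-b", vpc="tenant-1", sled="sled-9")
nexus = Instance("control-plane", vpc="oxide-services", sled="sled-1")

print(next_hop(vm_a, vm_b))   # stays inside the rack
print(next_hop(vm_a, nexus))  # leaves via the rack switch
```

The second call is the image-upload case from the story: a guest talking to the control plane is VPC-to-VPC traffic even if the customer created only one VPC.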
Bryan Cantrill:Well, I mean, the abstraction that you're really trying to implement here is this idea that we have called a silo, which really is the idea of a tenant. And just as you were saying earlier, there have been customers for whom, actually, if one tenant is talking to another tenant, I actually do want that to go through, just as you said, Levon, I wanna be able to put some policy around that. I actually want this to go through my network because I don't expect tenants to be talking to tenants, really. I expect tenants to be talking to other things in the network.
Bryan Cantrill:So there are reasons why this is desirable behavior, and one could envision a world in which, certainly if we had a customer who expected to have a lot of tenant-to-tenant, silo-to-silo communication, you could implement hairpinning inside the rack, but it'd be a feature we'd be adding, effectively.
Levon Tarver:Right. Yeah.
Bryan Cantrill:Okay. So we know that, fine. I mean, it feels like we've determined that when traffic leaves the rack, it's pathological. When it doesn't, it's not. And now, I mean, does this feel like a breakthrough, or does this feel like the floor is giving way beneath us?
Bryan Cantrill:Because they're like, oh, man. Now it feels like the problem has become a lot more complicated. Is that a fair read?
Levon Tarver:So anytime we can find a way to, like, split the problem down to something more specific on the networking side of things, we kinda are smiling. Even though, like, the world is still on fire, we
Bryan Cantrill:are now getting
Levon Tarver:an idea of where the fire is. You know, we're not like indiscriminately spraying water. We're just kind of like, oh, well, it's got to be over here, right? Somewhere.
Alan Hanson:I think that's a that's a really great point that when you're able to just narrow it a little bit, it gives you some sort of hope that you have, like, some control over what's going on. Yeah. Because some of these when they're, like, kind of intermittent or you're not sure, like, anytime you can do that you're like, okay, we're we're we're getting somewhere.
Bryan Cantrill:We are actually making progress on this. Even if we are yeah. Even if the progress we have made has has now like this thing is now venturing out into the ether and not well, poorly venturing out into the ether. So now we need to understand the ether lowercase e I guess much better in terms of what could possibly be going on.
Levon Tarver:For sure. And so it was exactly kind of like this thing we became sure of, or we're like, all right, we're pretty sure it is only affecting hairpin traffic and an additional detail. It only happened when the traffic rates went over a very vague kind of like, we couldn't find like a hard number, but it was only happening with high bit rate traffic or high packet per second traffic. They were they were able to like, just do standard HTTP requests between VMs or from a VM to the control plane. So, and that stuff was happening like 100% of the time, no problems.
Levon Tarver:But it was only with larger file transfers and like prolonged sessions that we were seeing any sort of failures. And so it was that plus the hairpin detail.
Alan Hanson:That they were seeing failures. We weren't seeing any failures yet.
Bryan Cantrill:Okay. Yeah, was going to ask.
Levon Tarver:The customer is seeing failure. So it was those two details. I think Trey overheard this, and that's what kind of gave him an idea of where to start looking in more detail. Does that sound right, Trey? I think it was like right at that point where you kind of put us on the trail of the thing we eventually labbed out.
Trey Aspelund:Yeah, that sounds right. I guess for context, in a past life, I spent, I don't know, six or seven years just doing network troubleshooting a whole lot. So with a lot of those kinds of issues, you tend to develop an intuition, and things have a smell. And hearing that it's only affecting hairpin traffic, it's only coming up when there's high enough throughput, it's only happening for certain flows at different parts of the network: those are all kind of clues that give a certain smell to the problem.
Trey Aspelund:That was definitely kind of the point where I was like, I have a hunch here as to what's going on.
Bryan Cantrill:And then what what is the actual failure mode that you're seeing? So are we are we seeing packets being dropped? Are we seeing actual high rates of packet loss? Are we seeing resets? What are we seeing?
Trey Aspelund:Yeah, that's a great question. During that process, I don't remember if it was me or Levon or both of us, but we asked to get some packet captures from the two different VMs. So we figured it'd be easier if the customer just had a couple of Linux boxes: throw one in one VPC, throw one in the other, run tcpdump, and get a PCAP on both sides. And so we threw that through Wireshark.
Trey Aspelund:Just at a glance, you can immediately see that one side is saturated with retransmissions, and the other side is saturated with dup ACKs. So essentially, what you're seeing in TCP is the signal for packet loss. One side's saying, yeah, this is the last one I got from you. You got anything else? The other one's like, dude, I keep sending that same packet.
Trey Aspelund:What is going on?
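To make the "retransmissions on one side, dup ACKs on the other" signature concrete, here is a deliberately simplified Python sketch that counts both from lists of sequence and acknowledgment numbers. It is not how Wireshark's TCP analysis works (that also considers payload lengths, window updates, SACK, and timing); the function names and the sample numbers are invented for illustration.

```python
from collections import Counter


def count_retransmissions(sent_seqs):
    """Count data segments whose starting sequence number was already sent."""
    seen = Counter(sent_seqs)
    return sum(n - 1 for n in seen.values())


def count_dup_acks(acks):
    """Count ACKs that repeat the immediately preceding ACK number."""
    return sum(1 for prev, cur in zip(acks, acks[1:]) if cur == prev)


# Sender's view: the segment starting at 2000 keeps getting resent.
sender_seqs = [1000, 2000, 3000, 2000, 2000]
# Receiver's view: it keeps asking for 2000 until the hole is filled.
receiver_acks = [2000, 2000, 2000, 2000, 4000]

print(count_retransmissions(sender_seqs))  # 2
print(count_dup_acks(receiver_acks))       # 3
```

When both counters are large relative to the number of packets in the capture, that points to loss somewhere on the path between the two capture points.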
Bryan Cantrill:Okay. So we're seeing very high rates of packet loss under some conditions. When we saturate the network, or we get a high bit rate or high packet rate, we see high rates of packet loss.
Levon Tarver:Yep, yep, exactly.
Alan Hanson:And still that's not us.
Bryan Cantrill:It's still that right. We do not see this and we don't see this in our environment. We don't see it
Alan Hanson:on our dogfood rack. We tried it on our colo rack. We had some other test rack where we tried it, and it wasn't reproducing there.
Bryan Cantrill:And not something we'd heard from other customers either. So this is something that we've only, which feels like, I mean, it feels like there's a kind of a broken leg here somewhere, but it
Alan Hanson:Nobody else has that same leg, whatever.
Bryan Cantrill:Right. Exactly. So this is what we're talking about, like, what does their infrastructure look like in terms of what are we actually talking about here? Where does that packet go when it when it leaves the oxide rack?
Levon Tarver:Yeah. So the environment was a pretty straightforward deployment. We had the Oxide rack connected to a redundant pair of switches. So we have two switches in the Oxide rack, and each switch in the Oxide rack had an uplink to each upstream switch. So both of the customer's switches had redundant connections to both of the Oxide rack's switches. And we were peering with those switches over BGP and sharing the same prefix information to both.
Levon Tarver:So that was the overall design of the environment. Okay. The switches were, like, Cisco Nexus 9Ks. Okay. Yeah.
Levon Tarver:And and
Bryan Cantrill:so, okay. So one of the challenges here is, like, that's infrastructure that... can we replicate that infrastructure exactly? And can we do it without millions of dollars of Cisco gear? I guess that's the challenge.
Levon Tarver:Yeah. So fortunately, we do try to have some example of, like, the common vendor routers and switches in our labs. So we were able to find a Cisco Nexus 9K in our lab and get it set up. I think Alan jumped on that, and we roughly replicated part of their environment.
Bryan Cantrill:Okay. Now were you able to do this with the full redundancy, Alan? Or how are you...
Alan Hanson:I was worried about that at first because I was like, we don't have that many that much hardware to reproduce. But let's just try with a single switch and do it that way with just one uplink.
Trey Aspelund:That raises a good point. Do we want to talk about what happened when we tried doing that in their environment?
Bryan Cantrill:Yeah, for sure.
Trey Aspelund:Levon, do you want to go for that one? Because I think you were leading the charge on dropping all the links and everything.
Levon Tarver:So yeah, like Trey, my background is also... a decent chunk of those years was spent helping operate data center networks for different data center and cloud companies. So, if there's a good term for it, this is my favorite kind of pain. Like, you know, I feel like we get paid for the pain that we're good at managing. And I've kind of got a little Stockholm syndrome going on where I'm like, oh, maybe I can look into this. Maybe I can help with this.
Levon Tarver:This could be cool. So I see a lot of comments in the chat for today's Oxide and Friends conversation. So I see my fellow network nerds, and we were thinking a lot of these things, like, okay, could there be some multipath issues going on? Could there be some buffer issues going on, etcetera, etcetera. So I was like, okay, to make it very easy for us to eliminate some of these potential issues or scenarios from this equation, let's take the system all the way down to one uplink, so we're just watching traffic go into one and come back down the same one, and get an idea of what's going on.
Levon Tarver:This ultimately kind of revealed a few other issues that the customer was having with regards to configuration.
Bryan Cantrill:Oh, interesting.
Levon Tarver:You know? But, fortunately, those were things that we were able to get sorted out pretty quickly. But this is what gave us the confidence to know that it was not a multipathing issue, because we were still seeing the problem present even on a single uplink. And that is what allowed us to confidently test the scenario in our lab just using a single 9K switch.
Bryan Cantrill:So on the one hand, that's a pretty invasive test for them to do, I imagine. But on the other hand, Alan, to your point earlier, that feels like it's really bifurcating the problem space. And Levon, you must have been relieved at some level. It's like, it happens without multiple uplinks. That's gonna make the reproduction of this hopefully much simpler.
Levon Tarver:Absolutely. Like, you know, that's where things get... it's very much a black box. Most of us have zero idea what a Cisco switch is really doing under the hood, or any vendor switch. It's not just them, you know, I'm not trying to beat up on them. With a lot of vendor switches, you're just kind of guessing as to what the actual behavior is if they don't have, like, an exact command that you somehow acquired the knowledge of, you know, foreshadowing.
Levon Tarver:And so just boiling it down to something like, okay, this is something that will occur even with a single uplink: this is really helping us narrow down the issue. And again, this was something that I went to Trey with, or I think Trey was on the call, and we started spitballing some ideas, and I think it put Trey on the path that ultimately led us to testing the right things to get the resolution.
Bryan Cantrill:Trey, what was that? What did he tee you up with?
Trey Aspelund:I mean, really just a really nice repro.
Bryan Cantrill:Interesting.
Trey Aspelund:Honestly, the whole situation there... one of the bits of knowledge that I'd taken with me when I came to Oxide was that I've had a lot of issues where packets go into a switching ASIC and they take a wrong turn, or you just don't quite understand how the ASIC is supposed to work. And one of the things I'm aware of that a lot of ASICs do is they have exception paths, meaning that if a packet has certain attributes or certain qualities about it (maybe it's a result of a lookup, maybe it's something about the contents of some of the headers), the ASIC considers it an exception and will fire it off to the CPU to be handled, rather than forwarding it or handling it entirely in the switching ASIC. So my hunch there was that we were hitting some kind of exception path, specifically, you know, one for ICMP redirects.
Trey Aspelund:Now the circumstance that you would see an ICMP redirect packet get ejected by the ASIC is if a packet comes in, it needs to be routed. We do the route lookup and the result of the route lookup has the same egress interface as was the ingress interface. So what that tells the ASIC is, hey, somebody's routing packets to me when the shortest path to get to that destination is not actually through me. So I should generate myself an ICMP redirect packet, send that back to the sender, and then they'll just update their forwarding tables. Naturally, that was a very kind of nineties idea in terms of like network security and trusting devices to redirect where you're sending traffic and all those things.
Trey Aspelund:So not really a very modern thing that people use anymore. But the exception path is still live in the ASIC. And so that was one where I was like, okay, I think this is what's going on. Let's kind of take another swing at this and see what we can find out. And so this is problematic for a couple of reasons.
Trey Aspelund:Because one, the ASIC is meant for high throughput, low latency, very consistent latency profiles, all those kinds of things. Whereas the CPU, you have no idea how that thing's going to get scheduled by the operating system. You have no idea what other processes or threads are competing for those resources. Not only that, but the connection between the CPU and the ASIC is typically just a PCIe link.
Bryan Cantrill:Right. A cocktail straw. Yeah.
Trey Aspelund:So, you know
Bryan Cantrill:And the CPU itself is often grossly underpowered. I mean, it's like, this is like, this thing is just not designed to be it's designed to like work at all. It's not designed to work quickly if you if you've been tossed out of the ASIC.
Trey Aspelund:Yeah, exactly. I think in this case, the Nexus had, I think, even a Xeon on it. So it wasn't too bad of a processor, but still you don't want it going through that CPU. I mean, the other thing about it is these network operating systems, they have policers in place for the punt path. So that control plane policer, they have several in charge of saying like, well, I need to make sure that if I'm running a routing protocol, that doesn't get swamped by somebody who doesn't know which way to route their packets or somebody who keeps sending me things to be routed where the TTL has expired and things like that.
Trey Aspelund:So that was really the hunch that we started going down. And I kind of phoned a friend and asked an old colleague of mine who used to do BU escalation for Nexus at Cisco and was like, hey, what commands do I need to go look at this? So we kind of went from that, started digging around in like the control plane policers, looking for counters. And actually, totally spun us the wrong direction because we started looking at it in the customer environment, and those counters weren't even incrementing.
Bryan Cantrill:Okay, interesting.
Trey Aspelund:Yeah. Well, we took it back to our lab, and counters were incrementing, but not the counter we expected. So when we looked at, like, the control plane policer policy, what we expected was that the packets would be classified as those IP redirects. But what ended up happening is they just got classified in, like, the general classifier. So anything that's left over that doesn't have a more specific rule.
Trey Aspelund:So after we saw that in our lab, I'm like, okay. So if our version of Nexus has a bug where packets aren't being classified correctly, maybe theirs is too.
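A rough sketch of why watching only the expected counter can mislead: if classification is first-match with a catch-all at the end, a rule that fails to match silently moves the traffic into the default class. The class names, the packet fields (`protocol`, `punt_reason`), and the "buggy" policy below are all hypothetical and are not taken from any real Nexus configuration.

```python
from collections import Counter


def classify(packet, classes):
    """Return the first class whose predicate matches, else the catch-all."""
    for name, matches in classes:
        if matches(packet):
            return name
    return "class-default"


def count_by_class(packets, classes):
    """Tally packets per class, like reading per-class policer counters."""
    counters = Counter()
    for packet in packets:
        counters[classify(packet, classes)] += 1
    return counters


# Hypothetical policy: packets are plain dicts, "punt_reason" is an invented field.
expected_policy = [
    ("routing-protocol", lambda p: p.get("protocol") == "bgp"),
    ("ip-redirect", lambda p: p.get("punt_reason") == "redirect"),
]

# Same policy with the redirect rule never matching, standing in for a
# classification bug of the kind suspected in the lab.
buggy_policy = [
    ("routing-protocol", lambda p: p.get("protocol") == "bgp"),
    ("ip-redirect", lambda p: False),
]

punted = [{"protocol": "tcp", "punt_reason": "redirect"}] * 5
```

Under `expected_policy` all five packets land in `ip-redirect`; under `buggy_policy` that counter stays at zero and `class-default` takes all five, which matches the shape of what was seen in the lab.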
Bryan Cantrill:Oh, man. This is like where, I mean, okay. So now you're having to, like, superimpose that, okay, this thing could now have a bug that would change its behavior. We're not seeing the like-for-like behavior. So the counter we're seeing, just to restate what you said.
Bryan Cantrill:The counter we're seeing we're seeing a counter bump internally, but it is it's not the counter that we expect. On the customer side, are we not seeing a counter bump at all? Or are we what are we seeing on on the customer side?
Trey Aspelund:I think they had enough traffic on the control plane that they were seeing various counters increment, but we weren't sure which one our packets were actually mapping to. Levon, you were even starting to tell. Yeah.
Levon Tarver:Yeah. Yeah. Because, like, Trey brought up, you know, the theory that he just kind of explained to everyone. And they checked the counters and the counters were zero. That may have been like on a Friday or something.
Levon Tarver:And then the whole weekend, I was just thinking about it and then come back and we're looking at some other things. And I was just like, dude, no. I think what you said just makes too much sense. It's like that Dr. House episode where he's like, no.
Levon Tarver:Symptoms fit.
Adam Leventhal:It's just hiding.
Bryan Cantrill:It's like, no. I'd like this hypothesis can't die right now. It's too no. No. No.
Bryan Cantrill:I refuse to let this hypothesis die.
Levon Tarver:I was like, man, it's just too perfect. But I just know how funky some of this stuff can be. So, you know, we went and chased it out again. I was just like, I'm just gonna look at all the counters. And I was like, I'm gonna clear all the counters.
Levon Tarver:This is our lab. I'm gonna clear all the counters. This is the only thing running on it. And Alan's running the file transfer in the lab. And so to be clear, these counters that we're watching, they are packet drop counters for the control plane policer.
Levon Tarver:So this is the control plane policing saying, I am dropping traffic for violating these rules I have in place to protect the CPU.
Bryan Cantrill:Can I just ask? So because you both use is it control plane policing? Sounds like a technical Cisco term. Is that correct?
Levon Tarver:It might be a Cisco term.
Trey Aspelund:They have an acronym called CoPP, like C-o-P-P. So it's control plane policing, I guess, is the one that they use. But in general, like, would it be helpful to explain a network policer?
Bryan Cantrill:Yes, please. It would be very helpful.
Trey Aspelund:Yeah. So, I mean, under the umbrella of, like, QoS, or quality of service, there's a few different mechanisms that you can use. There's shaping, which is basically like, hey, I'm gonna buffer this packet for a little bit, until there's enough availability on the egress interface to be able to transmit the packet. And then there's policing, which is saying like, okay, I'm giving you a committed rate of X packets per second. And maybe I'll let you burst into like X plus Y packets per second.
Trey Aspelund:And if you go anything beyond that, I'm dropping every packet that goes beyond that rate limit.
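As a minimal sketch of the policing idea, here is a token-bucket policer in Python. A token bucket is one common way to express "committed rate plus some burst"; it is an assumption for illustration, not a description of how the Nexus control plane policer is actually built. The names `Policer`, `rate_pps`, `burst`, and `offered_load` are invented for the example, and time is passed in explicitly so the behavior is deterministic.

```python
class Policer:
    """Token-bucket policer: out-of-profile packets are dropped, never queued."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps    # committed rate: tokens added per second
        self.burst = burst          # bucket depth: largest tolerated burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous packet
        self.dropped = 0            # drop counter, like the ones watched in the lab

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) conforms."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


def offered_load(policer, pps, seconds):
    """Send evenly spaced packets at `pps` for `seconds`; return how many were dropped."""
    start = policer.dropped
    for i in range(int(pps * seconds)):
        policer.allow(i / pps)
    return policer.dropped - start
```

For example, `offered_load(Policer(rate_pps=100, burst=10), pps=50, seconds=1)` drops nothing, while offering 1000 packets per second to the same policer drops most of them. Unlike a shaper, nothing is buffered: traffic under the cap is untouched and traffic over it is simply lost, which is why the problem only shows up once enough work is being done.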
Bryan Cantrill:Okay.
Trey Aspelund:So essentially you're setting a hard cap for what the bandwidth
Bryan Cantrill:Hard cap. Okay. Yeah. Woof. Okay.
Bryan Cantrill:I mean, I definitely understand it, but the consequence so that means the consequences of exceeding that cap are grave.
Alan Hanson:So if you're just doing, like, one image upload, you may not hit that limit. You have to do a lot. You have to do enough work to get to the point where things start to break.
Bryan Cantrill:Yeah. Interesting.
Levon Tarver:So that was like one of the things that really made it smell like that, because we're going, well, you know, low traffic rates are fine, higher traffic rates, we start seeing packet loss. So this smells like, you know, getting punted to the CPU and violating this. I mean, it is a hunch, but I was just like, we're just gonna look at all the counters, you know, and see if we can find some that are incrementing. And we found, as Trey said, we found some counters that we did not expect to be incrementing, but they were. And then this led us to a part of the scenario that I think Bryan finds very, very intriguing.
Levon Tarver:Which is, so to solve this problem, you can go to the interface that is configured as the layer three interface, as they say, or basically it's the interface that is enabled with an IP address that can do IP to IP communication for, like, your routing protocols or receive, you know, layer three traffic and be used to perform, like, routing decisions. So you can go to whatever interface it is, whatever, like, switch port that is on your Cisco device and issue the no IP redirects command, which will turn off the behavior of punting it to the CPU and instead will just allow that forwarding to be handled by the ASIC like it was any other packet. And so when we issued that command on the relevant interfaces, packet loss stopped. We removed the command, packet loss started again. I was like, all right, we have a strong amount of smoke here.
Levon Tarver:This could be where the fire is.
Bryan Cantrill:Levon, I mean, the one thing that I do find kind of nuts is that, so you got this IP redirect policy, and if you have set no IP redirects, like, it doesn't report this as configuration. Right?
Levon Tarver:Yeah, so this was the thing that you and I talked about that you found very funny, because one of the things was like, why isn't this kind of stuff more obvious? And it's just a weird quirk that happens with network vendors and network devices. So on a Cisco device, you can do, like, show running configuration, and it'll show you all the things that have been configured that are not, like, default configuration. So if you get a Cisco switch or router and start changing some things on the configuration, it'll show all the changes you made. Now one would ask, well, what if I wanna know what the default configuration is?
Levon Tarver:They have a command called show running configuration all, or, you know, those of us that do this all the time, we just do show run all, and it'll show you
Bryan Cantrill:I'm gonna force you to put "all" in quotes for this command.
Levon Tarver:Yeah. But there is a quirk. For some reason, you don't know if IP redirects are enabled or not because it doesn't show up if you do show run or if you do show run all. It does not show this command. You just gotta know. It's a thing because there, you know
Adam Leventhal:That's a different BU that made that.
Bryan Cantrill:That's exactly right. They didn't come to the right architecture committee.
Alan Hanson:And we did get the customer to give us their output, you know, as we're trying to look for things. We're like, well, can you send us your config? And they sent us this. And of course, you know, it was the sum of it.
Levon Tarver:So the only way you know this is configured by default is if you already have the experience, someone told you this, or you went and read the Cisco documentation for the IP redirects feature for Cisco Nexus. That's the only way you'll know it is configured by default on this platform. Ironically enough, Cisco's official documentation for IP redirects on Cisco Nexus says they recommend disabling it on all of your layer three interfaces. Even though it is enabled by default, they deem it no longer necessary, potentially problematic, and essentially kind of outline it as one of the things that was kind of like a holdover from an era where we had low bandwidth interfaces and you didn't want any unnecessary communication coming over low bandwidth interfaces. But now we live in a world where we have 100 gigabit interfaces. So like some of this is maybe not necessary anymore.
Levon Tarver:It's just moments like this, you know, like the reason this topic came up between Bryan and I, it's a big moment of empathy for anyone running infrastructure these days because you can have someone who's very knowledgeable in technology and is used to integrating systems, but there's just still a little weird stuff like this where even when you're trying to look at everything and look at all the details, there may be details that are there that you just don't know about. And this is just like one of those, well, you know, wow.
Bryan Cantrill:And you also have to know, as someone else is saying in the chat, like, you have to know that an ICMP redirect is not happening in the ASIC. That it actually kicks out to the CPU, which, I mean, I guess is understandable.
Adam Leventhal:If you say so.
Bryan Cantrill:Understandable. Yeah. And I guess, Trey, this is where, look, Levon, you described it as you and Trey kinda summoning the dark knowledge.
Levon Tarver:Yeah. That's what it was. Because this is exactly the reason we couldn't reproduce it on dogfood or on our other systems that we run 24/7 and beat on to try to find, like, any problems. Like, the reason we didn't hit them there is because the switch platforms that are handling the traffic there don't have that behavior by default. And yeah, so you just couldn't know.
Levon Tarver:And it was kind of like the summoning of dark knowledge, like Trey tapped into some of his and then he presented it to me and I was just like, yeah, I think this thing's lying to us or there's some other little quirk that we just don't know about. So I'm just going to be a stubborn mule and, like, brute force, you know, my way and look through this stuff until I am, like, 100% sure that it's not, like, an IP redirects punting to CPU issue. Because I was just like, that theory, it just fits too well. And I was like, if it's not that, I have no idea what it is.
Bryan Cantrill:Right. So if you were able to get this, I mean, so there are a couple of key breakthroughs here. One of them, as we said earlier, is that kind of bifurcation of knowing that this happened with only a single link, so that we could eliminate that. As someone else pointed out in the chat, on the one hand that was a very invasive experiment for the customer to run. On the other hand, it is a very concrete ask in terms of, like, you are helping us bifurcate the problem.
Bryan Cantrill:I assume, I mean, certainly when we've been on the other end of that, where we've got a mysterious problem and an engineer from a partner wants us to do something invasive to help them understand the problem, I'm always very, very willing to do that because it's like, okay, you're not just telling me to, like, reboot it and see if it goes away. You're not trying to magically solve my problem. You're trying to actually understand my problem, which feels like a very different disposition and footing. Was that kind of their reaction, Levon, as we were asking them to do some of these experiments?
Levon Tarver:Yeah, in general, the customer was really fantastic about kind of helping us understand what things they were willing to do and not willing to do in terms of like, hey, making changes X, Y, and Z are too invasive or violate policies we have, But yeah, we can do A, B, and C. And so we kind of approach the situation very transparently saying, Hey, we haven't seen this problem before. We want to gather as much information as we can while we have everyone here together so that we can bother you less as we go and replicate it in our lab.
Bryan Cantrill:Okay. So can I actually just pause you for a second, Levon? Because I just wanna like I I wanna underline something. You said Sure. We haven't seen this problem before.
Bryan Cantrill:Right. That is such a great way of phrasing it. As opposed to you're the only ones seeing this problem. You know, I mean, it feels like it's it is such a subtle difference. And the like, we haven't seen this problem before.
Bryan Cantrill:Says my disposition is to this is a problem and it is my disposition to understand it. And I wanna work with you to understand it. Versus the one that I've heard a lot of is you're the only one seeing this problem. And I'm like, that feels much more like
Levon Tarver:Yeah.
Bryan Cantrill:I'm on my own, aren't I?
Adam Leventhal:Important pronoun distinction in those like you versus we.
Bryan Cantrill:It really is. It really is. Because I think that, you know, I have always said to our customers, like, we are never gonna tell you that. Because then also I feel like you should never tell someone they're the only person seeing this problem because, like, what does that even mean? Like, how is that even actionable for anybody?
Bryan Cantrill:I mean, it's like
Adam Leventhal:It helps them feel special? I don't know.
Bryan Cantrill:Yeah, exactly. So sorry, Levon. I just had to stop you because I love the way you say we have not seen this problem before and being transparent with them about that. So we really need to understand what it is that is different about this environment. Why haven't we seen this problem before? And working with them collaboratively, and I also love what you're saying about, like, being transparent about here are the kinds of experiments we want to go do, and them being like, look, we can't do these others because they're too invasive or they violate these other policies.
Bryan Cantrill:So we can collaborate and find other ways to get you some information, it sounds like.
Levon Tarver:For sure. And like it is tough sometimes to have conversations when novel problems show up because they did ask us if other customers were using this in a similar way and I had to be honest and go, well, not that I'm aware of.
Bryan Cantrill:At the
Levon Tarver:same time, I had to immediately follow it up with, but I'm not telling you that you're not supposed to be able to use it this way. Because based on like how we've designed it, you should be able to use the rack the way you're using it. I don't I can't see any reason why you shouldn't. So like, if it's not working for you, like we wanna figure out why.
Bryan Cantrill:We wanna figure it out. Yeah, exactly.
Alan Hanson:Like, this is we're proving ourselves to you. We're, you know, a new company with a new product, and we really wanna make sure if there's a problem, we figure out what it is. If it's us, we wanna know. And if it's not us, we wanna help you figure out, you know, where where it is.
Levon Tarver:Yep. Yep. Yeah. And, they were really good at partnering with us and, helping us get any information we needed as quickly as, like, practical.
Bryan Cantrill:So, Levon, so in this, and then so we make the kind of the breakthrough that at least on our system
Alan Hanson:With our down rev Cisco software.
Bryan Cantrill:With our down rev Cisco software. So, I mean, kind of not exactly like for like, but we've been able to reproduce, and obviously when you've reproduced symptoms, there's always a little bit of, like, God, different problems can obviously manifest the same symptoms, and, well, we're not even manifesting the same symptoms because we've got a different counter that's kind of incrementing.
Trey Aspelund:Speaking of the down rev Cisco version, my friend who I called in, when he heard the version, he's like, oh, it's like you guys like pain or something.
Bryan Cantrill:And you're like, actually, you've met Levon. When Levon was describing, like, his favorite kind of pain, I almost thought, like, we should add that to the Oxide candidate materials. You know, what is your favorite kind of pain? Describe. So we are running maybe a slightly down rev version.
Levon Tarver:I believe we may have been a major revision behind. But it was like, in theory, it should, you know, and I mean, this is like a very load bearing "in theory". We were just working on what we had and trying to get as close to the same behavior as possible. And so, we forwarded our findings to the customer and said, hey, in our testing, here are the exact commands we ran, here's the scenario we had set up, here's what we saw, here's how you can replicate it on your side to see if you see the same things. If you see the same things, here is the command you can issue on the specific interfaces connected to our rack. If it's the exact same problem, this was what resolved the problem for us.
Levon Tarver:And the customer was very prompt in giving that a try, and they reported back pretty quickly that the problem cleared up. So we were all leaning on the edges of our chairs waiting to hear back because I really, really, really was like, I just got to know, was this it? Was it not it? Because if it's not it, I have no idea what it is. If it is, I'm very, very, very happy that it's this.
Levon Tarver:Because it's a simple fix that doesn't require anything intrusive. It doesn't require crazy rearchitecture, yada yada yada. Yeah, and also like through the rest of the kind of troubleshooting we did, we actually helped them find some other things that they were able to address to make everything work, to ensure that everything would be pretty robust throughout various predictable misbehaviors, like if a link went down or something like that, we knew that traffic was gonna forward correctly over the other links and stuff like that.
Bryan Cantrill:That's great. So we just we found some other issues in terms of this, hey, these these are things that could become issues later, and we can get that kind of resolved at the same time, it sounds like.
Levon Tarver:Yeah. They were able to resolve that stuff pretty quickly too. So it it was just overall something that ended up working out pretty well.
Bryan Cantrill:Can I ask a question that maybe I shouldn't? This is like the kind of thing where, you know, is this time to stop the recording? No. Well, maybe. Just so we know that we're not just picking on Cisco: when we were at another customer and we were dealing with some Junos commands and I was asking some questions, Ry was having to be like, oh god, I'm gonna have to explain some Junos history.
Bryan Cantrill:It's like, no, first of all, it's not you. But okay. So can I ask you, like, why is this IP redirects from a Cisco perspective when, aren't we talking about ICMP redirects? Are they using IP as, like, an abbreviation for ICMP, or, like, sorry. Or should I, like, or the bottom line, you're like, if you're gonna get hung up on this, like, you're just not ready for this.
Bryan Cantrill:If you are such a fragile little flower, that thing is gonna wilt you, like, you just need to, like, let other people deal with this because this is just not you're not ready
Adam Leventhal:for this. You can't phrase the question, and you're definitely not ready for the answer.
Bryan Cantrill:Right. You're not ready for the answer.
Levon Tarver:Yeah. I don't think any of us can handle the truth when it comes to that. There was a podcast I listened to, I think the guy's name is Terry Slattery. He was like a very influential guy early in the Cisco days, but apparently there was a lot of interesting history in the development of like the Cisco router, the Cisco CLI, how things are named, how things weren't supposed to be named. Apparently these guys still have emails so they have the receipts.
Levon Tarver:It was very interesting. It was very entertaining. But yeah, like, one of the big things is that if you talk to any person who works frequently with network infrastructure, there are a lot of opinions about the various CLIs and commands and the structure of those commands and the naming of those commands. Yeah, because you're like, ICMP redirects is part of the ICMP protocol suite, so why is the command IP redirects?
Levon Tarver:The answer is, I don't know. If you configure routing, you know, they call it IP routing and, like, when you put OSPF on interfaces, it's like, IP OSPF, you know, enable blah blah blah. Because, like, they just have, like, a command hierarchy that they choose to stuff things under. I don't want to, like, hand wave over it and pretend, like, difficult decisions aren't made there, but I just haven't seen a single CLI that everyone is happy with, that everyone agrees with.
Bryan Cantrill:Yeah. That's fair.
Levon Tarver:And I've seen most of them, you know, from Junos to Arista to Cisco to Nokia SR Linux, they're all a little different. Sometimes they're different because the lawyers make them have to be different. I think, you know, certain vendors have sued other vendors for being too similar to their stuff. You know? So it's like
Bryan Cantrill:Oh, that's, thanks for sending the industry forward, everybody. My bad. That's super helpful.
Levon Tarver:Yeah. So it's it's a it's a pretty incredible thing to to learn about and to behold. Okay.
Bryan Cantrill:But it seems like I shouldn't stare at too much of this. As I was remembering it, it was the difference between Junos and Junos Evolved that I was asking Ry for. Yeah. And it was, oh god, you're not ready for this. You just go home.
Bryan Cantrill:Okay. So, Levon, we get them to run the no IP redirects, and, you know, I think you and I were talking after you all had made this discovery, but before we had run it at the customer site. And Alan was like, great news. Like, we got it.
Bryan Cantrill:We've got this thing nailed. And Levon, you and I are talking like, we have something that is quite consistent with the behavior that we're seeing. And we are opposite. I was getting a much more, like, you know, much more of the
Alan Hanson:Levon has more legal training than me.
Bryan Cantrill:Think that's what it was.
Adam Leventhal:Certainly better counseling. That's right.
Levon Tarver:Yeah. I have a lot of experience in managing people's disappointment. So, you know That
Trey Aspelund:is the best way to describe vendor support I've ever heard.
Levon Tarver:Yeah. Well, it was even better because I was like, you know, so for history, I used to work at, I worked at Rackspace and I worked at Equinix. And so you are now the vendor in front of a vendor. So, you know, back in the glory days of Rackspace where, you know, they basically say, Hey, do whatever it takes to help the customer be successful, you know, super fun time. But being on the call with a customer, was like, I had to just tell them like, Hey, some funky stuff is happening and it's gonna take us some time to figure out.
Levon Tarver:I can only promise that I will keep you up to date with everything I see until I get the resolution. And I can promise you that I'm gonna be working on this resolution until I find it, or until I hand it over to someone else at the end of my shift. And so it's like over promising, like I just found that customers just wanted to know the truth so they can manage things better on their side. When you kind of just don't tell them the truth, you're not straightforward with them, you're messing with their ability to manage what they need to do on their side of the things. That's just pretty whack, you know?
Levon Tarver:It is really eyebrow raising when I see things don't happen that way. And most of the time, you know, like, customers are not going to just show up with an 18 wheeler and start yanking servers out of the data center and, like, I'm going to another hosting provider. Like, they want you to fix it. You know, they're like, hey, yeah, anything we can do to help you fix the problem and get us back up, we're happy to help you with too. Like, they don't want to, like, you know, burn you at the stake.
Levon Tarver:And as heretical as it may be to say this, most of the time the vendors, like, they're not wanting the customer to be unhappy either.
Bryan Cantrill:Yeah, I think that's definitely true. That's right. Yeah, yeah, yeah. I mean for a long time I thought Dell was lying to me and then I realized that actually Dell was not lying to me. Actually, they definitely want this problem to go away.
Bryan Cantrill:And they are they are Yeah.
Levon Tarver:But sometimes they just genuinely don't know.
Bryan Cantrill:They don't know. We're just Well, they don't know. And then but then it's really hard to say we don't know. We Right. And the I mean, I just think that there's a lot of wisdom and empathy in the kind of what you were saying there about managing expectations appropriately.
Bryan Cantrill:I guess mine in some cases, but it just the the the expectations of of the customer Because I do feel that the I mean, kind of the opposite extreme, which folks in the chat were also alluding to is when you're just tossing fixes at them. Do this, do this, do this. We think it's this, We think it's this, do this. And you're not bifurcating the problem. You're not it's just kind of Hail Marys.
Bryan Cantrill:And, I mean, you can really fray trust pretty quickly if those are not really well informed. And I think this is where we've all got kind of our scar tissue, but a specific case from Sun that I don't think we've told here, but Adam, maybe just tell me if we have: we had a customer in London and they were seeing the E-cache parity error. This CPU issue that had, like, six different root causes that has cast this, like, unbelievably long shadow over my own career. And, I mean, Sun Support was trying to do the right thing, but they were guessing. And they were just hawking guesses out there.
Bryan Cantrill:And one of the things is like, it's vibration. You're seeing these because of vibration and we think your data center is too close to a tube station because you're within a quarter of a mile of a tube station. I mean talk about what I mean and the customer had to map out all the data centers in London and pointed out that there was no data center that was not within a quarter of a mile from the tube because the tube is everywhere in London. And then Sun said, actually, you know what, it's not that it's your you're in a dusty environment.
Alan Hanson:Dirty. Remember the dirty one.
Bryan Cantrill:Yeah. And you're in a dusty environment, and they spent a million pounds on a new HVAC system to make the environment cleaner, and they did make it much cleaner. But the rate of CPU errors didn't drop at all. And Sun, in a kind of moment of desperation, invited the customer up to the facility in Scotland to see where these things were boxed and where they were actually, like, manufactured. And they would come in from one facility and they were deboxing in the same room that they were doing the burn-in testing, and the room itself in which their machines were having the burn-in testing was incredibly dusty, and famously or infamously, as the story was told to me, the exec from the customer wiped their finger on a surface, looked at his finger, which was black with dirt, and held up his fingertip to the Sun execs and said, I spent a million quid on what exactly?
Bryan Cantrill:And I mean, Sun's only redeeming quality in that moment was a total sense of shame. And I mean, it's really important, when you're asking someone to experiment with a change in their environment, you need to give them, Levon, just as you described, like, total transparency about here's where we are, here's what we know about the problem. And here's why we believe that this experiment is either going to advance our understanding, or, where you were at the point, Levon, where it's like, we actually believe this experiment is gonna advance our understanding quite a bit, because if this doesn't solve it, then we know we're down the wrong path here. If this is not the issue, maybe you would not have given up the hypothesis, maybe this is too beautiful a hypothesis to let die. So you would have to
Adam Leventhal:Bryan, this is a great point. One of the things when I was running support over at Delphix, we had a problem with performance issues. And I'm sure this is your experience too. Performance issues can be very challenging. Right?
Adam Leventhal:Like hard to know.
Bryan Cantrill:Super challenging.
Adam Leventhal:Yeah. And we instituted a rule, which was three strikes and you're out for the support team. Because what what would happen is we we would see things that were out of place. We think we'd see things that we knew changing it would be an improvement to the overall health of the system. But every time we we ask the customer to make a change, even if we moderated those expectations, the fact that we're making a change at all just felt like an opportunity for it to be fixed.
Adam Leventhal:And when it wasn't, it went away. So we only advocated for changes that we had high confidence were going to impact that core problem. As you're saying, Bryan, like, be able to explain: we think it might be this, we changed this thing, and then that helps narrow it down. At least, even if it doesn't benefit, we know not to look in some particular vicinity. But it took some real deliberate effort to not make changes that we knew would be improvements, but to kind of defer those, or to not make changes where even making the change was easy and validating whether the change was going to be beneficial was hard.
Adam Leventhal:Still, you know, those cycles of hope and then dashed hope were really tough and eroded customer trust, like, even when you provide context.
Bryan Cantrill:Yeah. So important to kind of communicate all of this. And so, Will, I'm curious from your perspective, because, I mean, you work on a team where so many are kind of in support along with you. I mean, what is it like? I mean, this is a real team effort to get this debugged.
Will Chandler:Yeah. Yeah. You know, it was great. Yeah. In terms of, you know, there was never any, like, grumbling of, like, well, gosh, I gotta work on this feature or anything like that.
Will Chandler:You know? Like, the instant it was clear there was a customer issue, we had as many people as we wanted, maybe more people than we wanted. Well, that's not true. As many people who needed to want it.
Will Chandler:Issue jumping on and providing,
Levon Tarver:you know
Alan Hanson:Do people know that when we have a problem like this, we'll open up a meet and we'll, like, send out a call and, you know, try to snipe our coworkers and say, oh, we're seeing a problem? Give, like, a basic idea, and then people will just join the meet if they think they can contribute or they wanna make fun of us or whatever. And we'll, like, live debug, and people will throw out ideas, and we'll talk about it. And that's, like, part of our culture almost, to have these, like, meetings.
Bryan Cantrill:Well, and we talked about this in the culture idiosyncrasies episode. Well, no. No. No. But just in terms of like this is we didn't talk about this specifically.
Bryan Cantrill:We talked about this idea of kind of debugging when you have a remote team Mhmm. And being able to leverage that for debugging. But having a remote team where you can leverage it for customer support is really great, where it just makes it much faster to get the right expertise in the room, I imagine.
Alan Hanson:Yeah. And certainly, like, without Levon and Trey, we would have been dead. I don't know that we ever would have found this.
Bryan Cantrill:Well, no. We we we obviously know that. As I tell everyone, I mean, just about everyone, I mean, or everyone at Oxide, it's like, if you feel like we would be screwed without you, it's because we would be screwed without you. It's not actually your imagination. So yeah.
Bryan Cantrill:Sorry. Will, in terms of the... so getting the right folks, then also helping manage the customer expectations. And then, from your perspective, I assume that this is a new domain of darkness for you. I don't know how much time you've spent in the kind of networking world. I assume you were learning along with the rest of us on some of this stuff.
Will Chandler:Absolutely. Yeah. Yeah. That level of networking is definitely new to me. I think Trey was one of the people I interviewed with, and I was like, well, I've never seen BGP before.
Will Chandler:And he's like, oh, I I have. So yeah. So this is very interesting to be exposed to this level of networking. But, yeah, in terms of in terms of communication, it was really just, you know, being being transparent with the customer like we've already said. You know?
Will Chandler:Like, we don't know what the problem is. I don't know if we'd say we had theories. I think we were just trying to gather information and narrow things down. I don't think we really threw out ideas until we were pretty confident that we had, you know, a solid candidate. And just, you know, keeping in contact with them, not just, like, dropping off the face of the earth, making sure that they were aware and that we had covered any concerns they had.
Will Chandler:You know, pretty standard support stuff, I suppose.
Bryan Cantrill:Yeah. But I think it's also just important in turn. And then, so we get to, Levon, we have them now try the setting no IP redirects. Your belief is that it will be very interesting to see. The crops may fail, but certainly there was guarded optimism.
Levon Tarver:Yeah. But a few things. Like, Will was saying, oh, pretty standard support stuff. If I remember correctly, Will recompiled the CLI binary to enable additional debugging stuff and, like, sent them a new binary. It's like, hey, run this so I can see more detail. Like, that is not standard
Bryan Cantrill:for a lot
Levon Tarver:of people to do. So he was just as much, like, digging in there and being a superhero as anyone else. It actually helped to give us more detail, because I believe him doing that gave us the data that showed that the reason the file transfer to the control plane was failing was because of, like, some timeouts happening with some of the sockets in the background. So, like Yeah. We had to add
Will Chandler:some additional logging to the CLI, and that should be a lot better now too. So we shouldn't have to recompile that
Trey Aspelund:in the future.
Levon Tarver:Yeah. But he says it so casually. You know? I kinda have to, like, point that out there. Like, yeah.
Levon Tarver:This is pretty cool.
Bryan Cantrill:But yeah. Like Well, could you expand on that in terms of, like, because this resulted in an improved CLI, it sounds like.
Will Chandler:Yeah. Yeah. So, you know, right now, I actually need to get back to finishing up this PR, but I could give a customer a binary today if I needed to. So right now, you know, if you have a problem with the connection, unless it's an authentication request, we don't really log any details on it. The change that we have in flight will mean that for any API request we make, we'll get, you know, the correlation ID and the timing on the request. So it'll be a lot easier to nail this down, because we had to kinda manually hack that into the CLI for this customer issue, which wasn't a lot of work, but it would be nice if we could just tell the customer to do it off the bat instead of, like, you download this shady binary from my
Bryan Cantrill:ticket. You're okay.
Adam Leventhal:No. 100%. That should just be in the CLI, and it's great that you're adding that.
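[Editor's sketch: the kind of per-request logging Will describes, a correlation ID plus timing on every API request, can be illustrated in a few lines of Python. This is not the Oxide CLI's code or its API. The function name `logged_request`, the logger name, and the idea of generating the ID on the client side are all assumptions made for illustration only.]

```python
import logging
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical logger name; a real CLI would pick its own.
log = logging.getLogger("cli.api")


def logged_request(name: str, send: Callable[[str], T]) -> T:
    """Run one API request and log its correlation ID, outcome, and duration.

    `send` is any callable that performs the request; it receives the
    correlation ID so it can attach it to the outgoing request.
    """
    correlation_id = str(uuid.uuid4())  # assumption: ID is minted client-side
    start = time.monotonic()
    try:
        result = send(correlation_id)
    except Exception as exc:
        elapsed_ms = (time.monotonic() - start) * 1000
        # Failures (for example a socket timeout) are logged with the same
        # fields, so slow or failing requests can be matched to server logs.
        log.error("request=%s id=%s failed after %.1f ms: %r",
                  name, correlation_id, elapsed_ms, exc)
        raise
    elapsed_ms = (time.monotonic() - start) * 1000
    log.debug("request=%s id=%s ok in %.1f ms", name, correlation_id, elapsed_ms)
    return result
```

[With something like this wrapped around every call, a timeout during a file upload would show up in the client log with an ID and a duration, rather than requiring a one-off debug build.]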
Bryan Cantrill:It's also great that we're, like, taking that experience and not just solving an issue here, but finding other opportunities to improve everything. Improve a customer's network, but also improve the product, improve everything about this. Like, what are the other opportunities we can find to actually make this either less likely to happen in the future or more diagnosable in the future?
Alan Hanson:And even like more more tools like what are the tools that we don't have that we want right now to be able to answer this question or that question? And we get a lot of stuff like that that falls out of this too, where somebody goes off and improves this script and somebody goes off and writes this piece. And next time we want this binary included in the system rather than having to copy it over there.
Bryan Cantrill:As as someone says in the chat, it's like this is what people should mean when they talk about agile development. But it's like, this is definitely lowercase a agile. This is not uppercase a agile, but yeah, I would just exactly as you described.
Adam Leventhal:Yeah. The thing that we do, I think, better than any place I've ever been before is not just, like, building the thing and not just recording it, but closing that loop, so that the improved tool isn't the thing that runs out of Will's home directory, but it's the tool. And the new facility isn't just written down, but it's actually encoded somewhere. And kind of closing that loop. So it's not just the, you know, tradition of system administration, but actually, like, what is landed and available to everyone.
Bryan Cantrill:Totally. And then, Levon, the short answer is, like, so did it work? What actually happened when they set no IP redirects?
Levon Tarver:Yeah. So, you know, as you said, I managed my expectations and was doing my best to, like, you know, just not be overly excited about this, but I was like, man, there's just a ton of smoke here. So I think that's where the fire is. You know, sometimes that's just all you have. And, yeah, Will sent the information over to the customer.
Levon Tarver:Customer responded pretty quickly that things were looking a lot better and we double checked to see if they changed anything else. And they said, Nope, it was just the command you recommended. Everything's good now. So that put me in pretty high spirits. But we did have other contingencies for continued troubleshooting.
Levon Tarver:Like, we were still making more preparations, but it was just good to know that everyone's attention to detail and efforts really kinda got us down the right path pretty quickly for a fairly obscure problem. So that was pretty cool.
Bryan Cantrill:Yeah. And I think this is one of these things, and part of the reason I wanted to have this discussion, because I think that this is such an interesting case where, to me, you've got two sets of behaviors that are actually reasonable. I mean, it's kinda easy to criticize the Cisco implementation. And I certainly don't like the fact that it doesn't show you easily that it has IP redirects enabled, and the fact that it's handled by the CPU and not the ASIC.
Bryan Cantrill:I mean, there are a bunch of implementation details. But in the abstract, the idea that, like, hey, this switch, this piece of equipment, sent me a packet and it is the destination for this thing. And I should let somebody know that this feels like a misconfiguration of some sort, and oh, by the way, I've got an ICMP redirect that is actually for this purpose. Like, that feels in the abstract like reasonable behavior. And it also feels like reasonable behavior that we are not actually hairpinning these packets inside the Oxide rack; that we're sending them on to the customer's network to hairpin back doesn't seem like unreasonable behavior.
Bryan Cantrill:And again, you can argue with both of those things, but like the behaviors themselves by themselves are not completely unreasonable. But then you combine them and you get a completely unreasonable system. You get a system that is actually totally pathological. And this to me is like just a very vivid embodiment of the kinds of problems that just don't get resolved. That that it's the integrators of technology that have to resolve these problems on their own.
Bryan Cantrill:It's the customer that has to resolve these kinds of problems on their own. And I feel like this problem ultimately needed Oxide or Cisco. Someone is gonna have to get past the mindset of, like, you're the only one seeing this, and go into, like, I'm gonna take responsibility for the whole problem. And of course, I mean, we are always gonna strive to do that. But this problem really necessitated that kind of approach.
Bryan Cantrill:Is that a fair characterization? I mean, I think I even probably asked you this, but, like, what happens when these problems are seen? I guess people just suffer. I guess it's just pain when people see these. That's how the dark knowledge is formed, from the pain.
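[Editor's sketch: the redirect behavior Bryan describes above follows the generic rule in RFC 1812: a router may send an ICMP redirect when it forwards a packet back out the interface it arrived on and the sender shares a subnet with the next hop. The Python below encodes only that generic rule. It is not Cisco's implementation, and the interface name, the `redirects_enabled` flag (standing in for the effect of a setting like no IP redirects), and the addresses, taken from the 192.0.2.0/24 documentation range, are all hypothetical.]

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    name: str
    network: ipaddress.IPv4Network
    redirects_enabled: bool = True  # models the per-interface redirect setting


def should_send_redirect(in_if: Interface, out_if: Interface,
                         src: ipaddress.IPv4Address,
                         next_hop: ipaddress.IPv4Address) -> bool:
    """Generic RFC 1812-style check for emitting an ICMP redirect."""
    if not in_if.redirects_enabled:
        return False
    # The packet must hairpin: leave through the interface it arrived on.
    if in_if.name != out_if.name:
        return False
    # The sender could have reached the next hop directly on this subnet.
    return src in in_if.network and next_hop in in_if.network


# Hypothetical hairpin: the sender and the next hop sit on the same uplink subnet.
uplink = Interface("uplink0", ipaddress.IPv4Network("192.0.2.0/24"))
sender = ipaddress.IPv4Address("192.0.2.10")
hop = ipaddress.IPv4Address("192.0.2.20")

hairpin_redirect = should_send_redirect(uplink, uplink, sender, hop)

# Same traffic with redirects turned off on the interface.
quiet = Interface("uplink0", uplink.network, redirects_enabled=False)
after_fix = should_send_redirect(quiet, quiet, sender, hop)
```

[Each rule is defensible alone: the sender chooses to hairpin through the upstream network, and the upstream router reports what looks like a suboptimal path. The pathology only appears when every hairpinned packet trips the check.]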
Levon Tarver:Yeah, so someone in the chat kind of hinted at this earlier, right? When a problem is not well understood, it's oftentimes because of, like, hidden designs or assumptions or defaults or things like that. You don't have the people who know why things behave the way they do in the room. Like, that was the advantage that the customer had interacting with us: they brought me on the call, and I helped write some of the code that impacts how these packets are handled. And I can talk directly to some of the other people who write some of the code.
Levon Tarver:And so I'm very, very familiar, compared to what, you know, a customer would be, with how our system is behaving. So that's a whole part of the black box eliminated in terms of, like, what you can understand versus what you can't. So then having experience working with vendor gear helped fill in some of the other details, and we were able to chase a resolution. But unfortunately, if you don't have enough of that knowledge and experience in one place, people just have to work around the problem. The term workaround becoming a noun instead of, like, a verb.
Levon Tarver:Anyone who's dealt with a vendor at this level, workarounds are just way more common than one would think. Oh, here's a workaround for this. Here's a workaround for that. Here's a workaround for this. And the workarounds just get more and more and more and more bizarre.
Levon Tarver:But it's just kind of a sign that when you're trying to integrate things together, you get into integration hell really, really fast, once you start doing things that are at the intersection of assumptions that two different vendors made. Depending on where in the lifecycle those things happen. I've asked someone at a specific vendor, Hey, how on earth does this problem happen? And the problem I was dealing with at that time was someone did an SNMP poll to a certain MIB and it made the switch restart. I was like, How does that happen?
Levon Tarver:It's like, Dude, the person who wrote that code isn't even here anymore. I don't know. And I was like, Okay, fair enough. I get it. But you know, and so like, so this is how we live in a world of workarounds.
Levon Tarver:It's just really hard to know how things work anymore. And I think the key thing that stood out for us is we do know how a lot of the stuff on our part works. And just having that allowed us to make great strides in troubleshooting what was happening on the customer's infrastructure and driving a real resolution instead of a workaround, where we can say this is the exact thing that is causing the behavior, and as long as you don't need this thing for a specific reason, you can just change that, and Cisco even recommends not having this turned on. There are a couple of documents where they recommend not having this turned on. You can go, why is it not a default?
Levon Tarver:I don't know. Why can't you find it? I don't know. But it's like, that's a different question, maybe for a different business as Adam said.
Bryan Cantrill:Well, outstanding work. And I think it was really inspiring for all of us to see, and I think this was a real team effort to get this thing understood. And I loved the, I mean, all along being braced. Well, one, finding other problems along the way, but then also being braced for, like, this might not be it, and then we're going to take the next step and we're going to figure it out. And, you know, we have a firm belief that magic does not happen in these systems, and if there's behavior that is pathological, we're gonna understand it. It may take us a while, and it may take a bunch of us, and it may take a bunch of meets that we're all hopping on and nerd-sniping one another on, but we'll get there, and we'll get there with transparency and managing expectations properly with the customer.
Bryan Cantrill:Well, thank you all. Like I said, I really wanted to get you four on here to talk about this, because it was, I think, just so evocative of the kind of problems that we're trying to resolve at Oxide, and where we actually wanna end the vendor blame game for sure for those running on-prem compute.
Trey Aspelund:Before we wrap up, Bryan, I did just want to say, our customer, I wanted to give them a shout out. They were amazing to work with. They were very easygoing, very understanding with us, very willing both to try things that we were suggesting, but also to, like, just pull information for us and really just work with us, because they also wanted to understand it at that level. So I just wanted to make sure we got a shout out to them. If they're listening, they'll know exactly who they are.
Trey Aspelund:But
Bryan Cantrill:yeah, exactly. Someone else did have this problem. They had no no no no. Yeah, exactly. And I think, you know, when we started Oxide, we wanted to have a company that customers would love to buy from, and part of that is us loving our customers.
Bryan Cantrill:And I think that we really, really appreciate all of their efforts. And I know that, I mean, I think they loved to get this thing resolved and appreciated our transparency all along. So yeah, Trey, that's a very good point and deeply appreciated. Well, thank you again, Adam. We've got no podcast next week.
Bryan Cantrill:Correct?
Adam Leventhal:Right.
Bryan Cantrill:But boy, do we have a banger two weeks from now.
Adam Leventhal:I'm like vibrating. I've gotten confirmation from both guests and I still don't believe it. I'm only gonna believe it when it happens, but we
Bryan Cantrill:I also I too am only gonna believe when it happens and I don't mean to sound, yes. This this one this one it does sound too good to be true and no, it's not Morris Chang. But I am convinced that this is like you're you are really so, Adam, do wanna describe because this is Yeah. Stuart Guebet. Do you wanna describe yes.
Bryan Cantrill:It's Larry Ellison, Chad. That's who it is. That's Larry Ellison.
Adam Leventhal:We've both been talking about this book, Character Limit, for a while now. I think I got the recommendation from you. I think, Bryan, you might have gotten the recommendation either on the show or from the socials. I can't remember. But loved Character Limit, about Elon Musk's disastrous takeover of Twitter, which feels all the more relevant today.
Adam Leventhal:And so we've reached out to Kate Conger and Ryan Mac at The New York Times, the authors of this book, and they're gonna be joining us in two weeks. And I could not be more excited.
Bryan Cantrill:Maybe they're gonna be joining us. I mean, do you wanna qualify that at all? We we think we've got every reason to believe that they I mean, it just feels
Alan Hanson:like the Levon
Bryan Cantrill:I know. I'm just worried they're gonna be like, hey, we listen to a couple of episodes of you
Adam Leventhal:I know.
Bryan Cantrill:Turkeys. And I I sorry. Like, the the the actually, like, the the the PR agency won't let us
Will Chandler:do I'm sorry.
Bryan Cantrill:It's not us. We want to, but, you know, they do have actually standards as it turns out.
Adam Leventhal:We heard you teasing it two weeks ago, and you were obviously much too excited. So we're gonna pass. No. Let's hope they don't hear
Bryan Cantrill:One or both of us can be like, this is I think I blocked your cohost on Bluesky. They didn't, but I'm like, no. I think that they I we're very excited.
Trey Aspelund:Yeah. This is
Bryan Cantrill:yeah. We have we we
Adam Leventhal:have I mean, they're in. Go read the book. Go buy the book and read it. And it's gonna be a great discussion. Could not look forward to that more.
Adam Leventhal:Just Yeah.
Bryan Cantrill:And I think that not in our wildest dreams did we think that this book, I mean, obviously, we'll talk to Kate and Ryan about it. But this book has now become much more important than just Twitter. And we've got, I think, a lot to talk about. And as we said at the time, Adam, this is not just a great title. I feel like we're always having to explain that books are great despite their title.
Bryan Cantrill:This is a great book with a great title.
Adam Leventhal:Yeah.
Bryan Cantrill:Very well written, very well researched and cannot wait to have these two on. So Yeah. Amazing if true.
Adam Leventhal:Amazing if true.
Bryan Cantrill:And yeah. So that'll be in two weeks. So folks have got two weeks to And
Adam Leventhal:an hour earlier. But oh, someone posted the audiobook. I didn't listen to the audiobook. I read this one, you know, with my eyes.
Adam Leventhal:But I've heard that the audiobook has, like, a very sarcastic voicing of Elon Musk. So I'm, like, kind of regretting not doing the audiobook version for this one. So, you know, a plug for the audiobook there.
Bryan Cantrill:Okay. I mean, I've read the book, I'm gonna go listen to the audiobook. I mean, obviously. Nice. Nice.
Bryan Cantrill:This is terrific. Well, anyway, I can't wait. This is gonna be great. And obviously, pressure's on to get Morris Chang. There's just no other way to say it.
Bryan Cantrill:I think Yeah. Awesome. Alright. Well, thanks again. Thanks again, Will and Trey, Levon.
Bryan Cantrill:Alan, thank you very much. Thank you for all your terrific work on this problem. Thank you, Oxide customer, for all your hard work to get this resolved and
Alan Hanson:look forward Patience with us.
Bryan Cantrill:Yes, absolutely. And look forward to seeing you all next time. Thanks everyone.
Creators and Guests
