Rack-scale Networking

Bryan and Adam are joined by a number of members of the Oxide networking team to talk about the networking software that drives the Oxide rack. It turns out that rack-scale networking is hard... and has enormous benefits!


We've been hosting a live show weekly on Mondays at 5p for about an hour, and recording them all; here is the recording from February 27th, 2023.

In addition to Bryan Cantrill and Adam Leventhal, speakers included Ryan Goodfellow, Levon Tarver, Ben Naecker, and Arjen Roodselaar.

Links
Here's (much of) the live chat from the show:
  • ahl https://github.com/oxidecomputer/oxide-and-friends/blob/master/2021_11_29.md
  • ahl That's the Sidecar switch episode
  • bcantrill https://p4.org/
  • admchl What does "at line rate" mean?
  • Riking Line rate = As fast as the packets could possibly come. 1Gbit, 10Gbit, 100Gbit, etc
  • admchl Do you need ASICs to hit that speed? I assume x86_64 is not going to be fast enough for these specialised operations?
  • levon Yes, the Tofino 2 is the ASIC
  • bcantrill You need ASICs
  • bnaecker Yes, you really can't do these kinds of operations on a general purpose CPU.
  • rng_drizzt Yeah, you need specialized silicon here.
  • JustinAzoff Right, also often across all ports at the same time in both directions. A 48-port 10Gbps switch will have a line rate of 960Gbps (10 × 48 × 2)
  • duckman So the advantage is being able to offload compute to the switch?
  • bnaecker Yes, and specifically that you can separate the data plane (operations on the packets) from the control plane (decisions about what operations to allow or make).
  • tahnok What's TCAM?
  • levon Ternary Content Addressable Memory
  • bnaecker https://en.wikipedia.org/wiki/Content-addressable_memory#Ternary_CAMs
  • ryaeng Sure beats logging into a number of Cisco switches and making changes at the console.
  • admchl This is my favourite episode in a long time, this is all really fascinating.
  • rng_drizzt the first Sidecar episode was nearly 1.5 years ago 🤯, right after we cut the first rev
  • levon That episode blew my mind
  • duckman This sounds like a big deal on the scale of ebpf
  • duckman Or bigger
  • bnaecker It is extremely useful for understanding the processing pipelines. As long as you only run single-packet integration tests 🙂
  • od0 just want to go out and find things to write P4 code for
  • JustinAzoff yeah one way to think about that sort of thing is that xdp can be used to run little programs on a nic, where p4 is kind of like that, but running on effectively a nic with 48+ ports
  • bcantrill https://github.com/oxidecomputer/p4
  • SyntheticGate sidecar is the "codename" of our switch box
  • SyntheticGate "gimlet" is our server sled
  • bcantrill https://github.com/oxidecomputer/propolis
  • wmf So you have P4 and OPTE in the hypervisor at the same time?
  • bnaecker OPTE is in the host kernel.
  • arjenroodselaar The P4 runtime Ry described only exists in the test bed, where it simulates the switches at a high level. OPTE is part of the production environment.
  • arjenroodselaar The rough difference between P4 and OPTE is that P4 works on individual packets without much concept of a session (so it can't reason about TCP streams, packet order, etc., so no firewall-like functionality), while OPTE aims to operate on streams of packets.
  • JustinAzoff So you can run 100 VMs on a test system and wire them up to your virtual switch compiled by x4c?
  • arjenroodselaar Correct.
  • bcantrill OPTE == Oxide Packet Transformation Engine
  • admchl Gimlet?
  • rng_drizzt Compute server
  • rng_drizzt The Sidecar switch is actually just a PCIe peripheral to a Gimlet.
  • bnaecker The Gimlet managing the Sidecar is often called a "Scrimlet" for "Sidecar attached Gimlet"
  • Riking and "how do i reconfigure this giant network without hosing my ability to reconfigure this giant network"
  • ShaunO can identify with that - we seriously struggle to keep our own products inter-operating, let alone anyone else's
  • levon It can feel like a Sisyphean task.
  • a172 Set up a much smaller/simpler network in parallel that is accessible from "not your network" that gets you to the management interface.
  • levon It's a whole new world when you can look at the actual table definitions in P4
  • rng_drizzt Owning all the layers here is immensely beneficial
  • levon Those DTrace probes have been very helpful
  • bnaecker Those probes turned out to be everywhere. They are in: SQL queries, HTTP queries, log messages, Propolis hypervisor state, virtual storage system, networking protocol messages, the P4 emulator, and probably more that I'm forgetting about.
  • levon For those unfamiliar with the DTrace tool, or the rationale behind leveraging DTrace over other tracing / debugging tools: https://www.cs.princeton.edu/courses/archive/fall05/cos518/papers/dtrace.pdf
  • bcantrill https://github.com/oxidecomputer/progenitor
  • ahl some notes on rust codegen: https://github.com/ahl/codegen-template
  • arjenroodselaar DDM! Bring us home!
  • a172 it astonishes me how many "cloud" type architectures are built on v4 only or v4 first.
  • a172 IPv6 is older than Wi-Fi
  • a172 It solves real problems. PLEASE use it.
  • nyanotech yessss finally someone realizes broadcast domains are also failure domains
  • JustinAzoff the worst part of v6 is trying to run dual stack v4+v6, v6 only networks are fairly simple
  • levon And the bigger the broadcast domain, the more irritating it is to troubleshoot it
  • bcantrill "Hash and pray"
  • arjenroodselaar FWIW while DDM is a cool thing we're building, one of the "simple" tasks Tofino does for us is NAT between our customers' networks and the VPC networks they implement on our platform.
  • arjenroodselaar Simple NAT is still surprisingly expensive and being able to do that at line rate is pretty nice.
  • Riking TCP retransmits in steady state seems like an obvious observation point?
  • arjenroodselaar Yes, you see TCP retransmits.
  • arjenroodselaar But if you're running, say, Memcache over UDP and you get a sudden burst of incoming data as a result of a large number of cache queries, you drop those packets (because the buffers can't keep up) and you see cache request timeouts.
  • arjenroodselaar FB did some work on this about 10 years ago to avoid this ingest problem and the dropped packets that hurt your p99 latency.
  • Riking yeah smartnic is pushing the intelligence to the machine
  • levon I know someone who basically polled all of the switches for buffer drops in an attempt to divine which paths were dropping packets due to micro-congestion
  • admchl I feel like I'm in a secret society meeting learning The Hidden Truth behind Reality of The Network
  • wmf I would argue if the entire hypervisor is on the smart NIC then you're no worse off than the Oxide architecture
  • a172 I once stumbled on a bug where the vendor's custom protocol for monitoring (because snmp/syslog just can't keep up) had a trace log on the process that could not be turned off. Some sort of race condition enabled it, and it happened on 1/3 of system boots. It was ~20k logs/s, iirc.
  • a172 (I'm going to look up those numbers)
  • levon I haven't worked with a SmartNIC fast enough to do this well
  • JustinAzoff We use an FPGA NIC in our products for fast packet capturing. The service that bootstraps it had an issue that caused it to log an error... for every single packet...
  • JustinAzoff that managed to log the same error something like 250,000 times a second
  • arjenroodselaar The problem with SmartNICs is that their power features are way less advanced than the power scaling that x86 CPUs do. So you either run them or you don't, and they come with a 50-75W penalty. Unless you can really get useful work done for that 50W budget, an x86 CPU is much more flexible.
  • arjenroodselaar What we really want is an AMD Epyc SoC with some amount of FPGA fabric. That would let you build whatever makes sense there while still having much of the flexibility with respect to how/where you consume power.
  • a172 It was enough to mess us up. 250k would have killed us even faster.
  • JustinAzoff Yeah, it happily wrote that error message until the multi-TB data array filled up. We reworked how log rate limiting and log rotation worked after that.
  • a172 I was mostly amused that the process that existed because snmp/syslog couldn't keep up was getting a syslog for every iteration of a loop in the process
  • a172 of course, if you are sending a packet for every packet you send, that sounds like it quickly becomes an exponential problem.
  • JustinAzoff and to circle back around, this was code inside of the vendor SDK, which is not open source and which we couldn't fix ourselves. It's one of the only components of our system that we don't control. I wish we had our own NIC (that would probably run something like P4)
  • levon And thus, this is how we become the way we are (at Oxide)
  • a172 ours was on production network hardware (wireless controller). There is no hope of having source or true observability into it. (edit: saying there was no insight is a little harsh)
  • JustinAzoff one thing that came up before was if p4 was like ebpf.. there's actually a ebpf backend for p4 that supports some of the features: https://github.com/p4lang/p4c/blob/main/backends/ebpf/README.md
  • bcantrill Thanks, all!
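
For readers wondering what the match-action tables, TCAM lookups, and data-plane/control-plane split discussed in the chat actually look like, here is a minimal, purely illustrative Rust sketch of a ternary match-action table. Every name and type in it is hypothetical and greatly simplified relative to a real P4 pipeline or to anything Oxide ships; the point is only the shape of the abstraction: the control plane installs prioritized ternary entries, and the data plane applies the best match to each packet (a hardware TCAM compares against all entries in parallel, which is what makes line-rate lookups possible).

```rust
// Illustrative sketch only: a toy software model of the ternary match-action
// tables that P4 programs describe and that TCAMs implement in hardware.
// Names and structures are hypothetical, not Oxide's implementation.

/// A ternary key: bits match if (packet_bits & mask) == (value & mask);
/// "don't care" bits are simply left out of the mask.
#[derive(Clone, Copy)]
struct TernaryKey {
    value: u32,
    mask: u32,
}

/// Actions a matching entry can trigger in the data plane.
#[derive(Clone, Copy, Debug)]
enum Action {
    Forward { port: u16 },
    Drop,
}

/// One table entry, installed by the control plane.
struct Entry {
    key: TernaryKey,
    priority: u32,
    action: Action,
}

/// The data-plane side: given some header bits, return the highest-priority
/// matching entry's action. Software has to iterate; a TCAM effectively does
/// all of these comparisons at once.
fn lookup(table: &[Entry], header_bits: u32) -> Action {
    table
        .iter()
        .filter(|e| (header_bits & e.key.mask) == (e.key.value & e.key.mask))
        .max_by_key(|e| e.priority)
        .map(|e| e.action)
        .unwrap_or(Action::Drop) // default action when nothing matches
}

fn main() {
    // The control plane decides what the rules are...
    let table = vec![
        Entry {
            // Destination prefix 10.1.0.0/16, encoded as raw bits.
            key: TernaryKey { value: 0x0A01_0000, mask: 0xFFFF_0000 },
            priority: 10,
            action: Action::Forward { port: 7 },
        },
        Entry {
            // A more specific 10.1.42.0/24 entry wins via higher priority.
            key: TernaryKey { value: 0x0A01_2A00, mask: 0xFFFF_FF00 },
            priority: 20,
            action: Action::Forward { port: 3 },
        },
    ];

    // ...while the data plane applies them to every packet.
    let dst = 0x0A01_2A05; // 10.1.42.5
    println!("{:?}", lookup(&table, dst)); // Forward { port: 3 }
}
```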
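In the same spirit, the distinction drawn in the chat between P4-style per-packet processing and OPTE-style stream processing can be sketched as the difference between a stateless rewrite (it needs only the packet in front of it, which is part of why it can run at line rate) and a flow table keyed by the 5-tuple (which is what firewall-like decisions require). Again, this is a hedged toy model; the names and structures below are invented for illustration and are not Oxide's code.

```rust
// Illustrative sketch only: per-packet vs. per-flow packet handling.
use std::collections::HashMap;

/// The classic 5-tuple identifying a flow.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct FlowKey {
    src_ip: u32,
    dst_ip: u32,
    src_port: u16,
    dst_port: u16,
    proto: u8,
}

struct Packet {
    key: FlowKey,
    syn: bool, // the TCP SYN flag is enough state for this toy example
}

/// Stateless, per-packet rewrite (e.g. a NAT step): it looks only at the
/// packet it was handed, never at any history.
fn stateless_nat(pkt: &mut Packet, external_ip: u32) {
    pkt.key.src_ip = external_ip;
}

/// Stateful, per-flow policy: only packets belonging to a flow we have seen
/// a SYN for are admitted, which requires remembering flows across packets.
struct FlowTable {
    established: HashMap<FlowKey, ()>,
}

impl FlowTable {
    fn allow(&mut self, pkt: &Packet) -> bool {
        if pkt.syn {
            self.established.insert(pkt.key, ());
            true
        } else {
            self.established.contains_key(&pkt.key)
        }
    }
}

fn main() {
    let key = FlowKey {
        src_ip: 0x0A00_0001, // 10.0.0.1
        dst_ip: 0x0A00_0002, // 10.0.0.2
        src_port: 43210,
        dst_port: 443,
        proto: 6, // TCP
    };
    let mut table = FlowTable { established: HashMap::new() };

    let syn = Packet { key, syn: true };
    let data = Packet { key, syn: false };

    // The stateful path must have seen the SYN to admit later packets.
    println!("syn allowed: {}", table.allow(&syn));   // true, records the flow
    println!("data allowed: {}", table.allow(&data)); // true, flow is known

    // The stateless path rewrites each packet with no memory of the others.
    let mut outbound = Packet { key, syn: false };
    stateless_nat(&mut outbound, 0xC633_6401); // 198.51.100.1 (doc address)
    println!("rewritten src: {:#010x}", outbound.key.src_ip);
}
```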
If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!