- Why should we have multiple server uplinks?
- Should we use LAG or individual uplinks between servers and ToR switches?
- How could a server detect upstream path failure?
- Multihoming a server into a layer-2 domain that stretches beyond a single switch (EVPN/VXLAN, upside-down STP-based tree, etc.)
- Running a routing process such as FRR on the server itself. Here the service IP is a loopback address and the server NICs are nothing more than BGP-speaking transit interfaces. BGP then signals to the server which path is active end-to-end.
- SmartNICs (aka DPUs), where the NIC runs a routing protocol like BGP northbound but presents plain traditional L2 to the OS. The advantage is that the OS needs only a single NIC presented to it, and the DPU is effectively a 3- or 4-port switch: one or two ports are presented to the OS (in case the server admin truly believes they want to bond something for HA), and ports 3 and 4 run routing to the ToR switch.
- Then there is special consideration to be given to hypervisors such as ESXi. On ESXi I can neither monitor gateway ARP nor run FRR, so unless VMware has built something, the only possible solution to the problem is a SmartNIC.
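To make the routing-on-the-host option from the list above more concrete, here is a minimal Python sketch of the decision it implies: the server forwards only over uplinks whose BGP session is up and still delivers a route toward the rest of the network. The `Uplink` type, its fields, and the assumption that an isolated ToR switch withdraws its default route are all hypothetical illustrations, not FRR's actual data model or behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    """Hypothetical view of one server NIC running BGP to a ToR switch."""
    name: str
    session_established: bool     # BGP session to the ToR switch is up
    default_route_received: bool  # the ToR switch still advertises a default route


def usable_next_hops(uplinks):
    """Return the uplinks that BGP currently signals as working end-to-end."""
    return [
        u.name
        for u in uplinks
        if u.session_established and u.default_route_received
    ]


# Assumed scenario: ToR A lost all of its own uplinks. Its link and BGP
# session to the server stay up, but it no longer has a default route to send.
uplinks = [
    Uplink("eth0", session_established=True, default_route_received=False),
    Uplink("eth1", session_established=True, default_route_received=True),
]
active = usable_next_hops(uplinks)
print(active)
```

In this sketch the isolated switch drops out of the forwarding set even though the physical link never went down, which is the failure case plain link-state tracking cannot see.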
I know you said LAG/MLAG/bonding is a dead horse, but it seems like that’s not the case. In my company we put a moratorium on MLAG. We may permit LACP to a single switch, but we discourage it. The bond we promote is Linux bonding mode 1 (active-backup, aka active/passive), and we ask the server admins to track ARP for the gateway. I’m sure you realize why, but I’m going to state it anyway: ARP tracking helps the server realize that the “active” switch lost all its uplinks. By default the bonding driver tracks only link state, so it will not notice that the switch has become isolated.
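The difference between link-state tracking and gateway ARP tracking can be sketched in a few lines of Python. This is a toy model of the decision, not the bonding driver's code; `Member`, `pick_active`, and the interface names are made up for illustration. (In the real Linux bonding driver, ARP monitoring is configured with options such as `arp_interval` and `arp_ip_target`.)

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical model of one bond member in an active-backup bond."""
    name: str
    link_up: bool            # carrier state of the server-to-switch link
    gateway_reachable: bool  # would an ARP probe for the gateway be answered?


def pick_active(members, current, arp_monitoring):
    """Return the name of the member that should carry traffic, or None."""
    def usable(m):
        if not m.link_up:
            return False
        # Without ARP monitoring, a link that is up is assumed to be good.
        return m.gateway_reachable if arp_monitoring else True

    by_name = {m.name: m for m in members}
    # Stay on the current member while it is usable (no needless failover).
    if current in by_name and usable(by_name[current]):
        return current
    for m in members:
        if usable(m):
            return m.name
    return None


# Switch A lost all its uplinks, but its link to the server is still up.
members = [
    Member("eth0", link_up=True, gateway_reachable=False),
    Member("eth1", link_up=True, gateway_reachable=True),
]
link_only = pick_active(members, "eth0", arp_monitoring=False)
with_arp = pick_active(members, "eth0", arp_monitoring=True)
print(link_only, with_arp)
```

With link-state tracking only, the model keeps sending traffic through the isolated switch (`eth0`); with ARP tracking it moves to `eth1`.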
The challenge with that approach is that there are bugs in the ARP tracking code in the bonding driver. I also think that conceptually this is not a long-term play. More hosts means more ARP, and more ARP means more load on the switch control plane, which usually runs on a craptastic CPU that is easy to overrun. More load on the control plane means the rate policers need to be tweaked, or ARP requests get dropped and the bonding drivers trigger a failover.
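A back-of-the-envelope Python sketch shows how quickly the probe load grows. All numbers below (host counts, probe intervals, the policer limit) are hypothetical, and the model counts only one probe per host per interval per target, so treat the result as a rough lower bound rather than a measurement.

```python
def arp_probe_rate(hosts, probe_interval_ms, targets_per_host=1):
    """ARP probes per second arriving at the gateway's control plane."""
    return hosts * targets_per_host * 1000.0 / probe_interval_ms


def policer_headroom(hosts, probe_interval_ms, policer_pps, targets_per_host=1):
    """Remaining policer budget in packets per second (negative means drops)."""
    return policer_pps - arp_probe_rate(hosts, probe_interval_ms, targets_per_host)


# Hypothetical example: 40 hosts behind a switch pair, ARP policer at 300 pps.
relaxed = arp_probe_rate(hosts=40, probe_interval_ms=1000)     # one probe per second
aggressive = arp_probe_rate(hosts=40, probe_interval_ms=100)   # ten probes per second
headroom = policer_headroom(hosts=40, probe_interval_ms=100, policer_pps=300)
print(relaxed, aggressive, headroom)
```

Under these assumed numbers, shortening the probe interval to get faster failover pushes the probe rate past the policer, which is exactly the point at which probes get dropped and healthy bonds start failing over.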
Long story short, I’d like to discuss what other folks are doing about dual-homing hosts that is not L2 bonding. We are trying to get the FRR stack installed on Linux boxes so we can route to the host, but so far the server admins are strongly resisting it.
The worst offenders
Special mention goes to ESXi, which does not even support ARP tracking, so it’s stuck in the link-state tracking world. (There is a feature called beacon probing, but it’s super weird as it requires 3 switches for some reason, and who does 3 switches per host?)
And an even more special mention goes to some storage vendors like Pure, who will work only over MLAG. Our storage admins purchased an array without consulting the network team, and it’s been sitting there collecting dust for the past 1.5 years, since the only way to connect it to the network in HA mode is to MLAG the A/B side switches, and we don’t support that (not to mention that MLAG between A/B switches totally destroys the whole idea of SAN-A/SAN-B separation).