How do we merge storage networks into a leaf-and-spine fabric (40/100G)? It makes sense in terms of throughput, but it poses additional risks compared to a separate FC SAN: a fabric outage causes disks to stop responding.
We are running a multi-tenant DC network spanning three sites. The network has three layers (per site): DC-LAN, DC-EDGE, and DC-WAN-CORE. DC-EDGE is used for terminating all incoming circuits (internet and all sorts of WAN connections). DC-LAN is used for connecting workloads. DC-WAN-CORE connects the three sites together.
Right now we are using MPLS L3VPNs between DC-LAN, DC-EDGE, and DC-WAN-CORE. The L2 part of DC-LAN is a traditional network with vPC and OTV.
We want to move to VXLAN with BGP EVPN for DC-LAN. But what should we do with DC-EDGE and DC-WAN-CORE: keep using MPLS (but with SR instead of LDP), or use VXLAN/BGP-EVPN there as well? With the latter option we could use only (Cisco) N9K boxes. With the former we could also use only N9K boxes, but then using real routers in DC-EDGE makes more sense, to get more routing capabilities and insight at the border of the network.
Can we do a generic DC deployment: a site with a pair of ISPs, a segmentation requirement, and connectivity to the cloud and other locations?
Besides MLAG and EVPN, is there anything else available as an option?
Can you talk about how you would design a branch office these days? Would you necessarily do EVPN or SD-Access, which would be overkill for an office with 100-200 people in it (I am not talking about college campuses)?
Why don’t we use a leaf-and-spine design for service provider transport networks? Is the Clos design only for data centers, and why can’t we use it outside the DC?
It would be interesting to hear about leaf and spine architectures in an enterprise/campus network. I assume that the service provider case is very similar, but a large enterprise or campus has different operational and security models. Possibly a place to discuss how Charles Clos has taken over the networking world now that switches are non-blocking. Bell Labs and the carriers strike back! :-)
It would be interesting to go over adapting leaf-and-spine for ISPs, or perhaps IXPs. Most of the materials currently available focus on data centers full of compute and storage nodes, not on ISPs and service-based fabrics. What do you think?
- Why should we have multiple server uplinks?
- Should we use LAG or individual uplinks between servers and ToR switches?
- How could a server detect upstream path failure?
- Multihoming a server into a layer-2 domain that stretches beyond a single switch (EVPN/VXLAN, upside-down STP-based tree, etc.)
- Running a routing process such as FRR on the server itself. Here the service IP is a loopback, and the server NICs are nothing more than BGP-speaking transit interfaces; BGP signals to the server which path is active end-to-end.
- SmartNICs (aka DPUs), where the NIC runs a routing process like BGP northbound but presents basic traditional L2 to the OS. The advantage is that the OS only needs a single NIC presented to it, with the DPU effectively acting as a three- or four-port switch: one or two ports presented to the OS (in case the server admin truly believes they want to bond something for HA), and the remaining ports running routing toward the ToR switch.
- Then there is special consideration to be given to hypervisors such as ESXi. On ESXi I can neither monitor gateway ARP nor run FRR, so unless VMware builds something, the only possible solution to the problem is a SmartNIC.
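The routing-on-the-host option above can be sketched with FRR. A minimal, hypothetical config fragment, assuming unnumbered eBGP sessions toward two ToR switches; the service IP (192.0.2.10), interface names, and ASN are placeholders:

```
! Hedged sketch, not a production config: the service IP lives on a
! loopback and is advertised over two BGP-unnumbered transit uplinks.
router bgp 65001
 neighbor eth0 interface remote-as external
 neighbor eth1 interface remote-as external
 address-family ipv4 unicast
  network 192.0.2.10/32
```

With this, losing one uplink (or the ToR behind it) withdraws only that BGP path; the service IP stays reachable via the other uplink, with no bonding or L2 multihoming involved.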
I know you said LAG/MLAG/bonding is a dead horse, but it seems that’s not the case. In my company we put a moratorium on MLAG. We may permit LACP to a single switch, but we discourage it. The bond we promote is “Linux type 1” (aka active/passive), and we ask the server admins to track ARP for the gateway. I’m sure you realize why, but I’m going to state it anyway: the ARP tracking helps the server realize that the active switch has lost all of its uplinks, since by default the bonding driver only tracks link state and will not notice that the switch became isolated.
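The active/passive bond with gateway ARP tracking described above can be set up with iproute2. A minimal sketch (requires root); the interface names and addresses are placeholders:

```shell
# Hedged sketch: Linux active-backup bond ("type 1") that ARP-probes the
# gateway every second instead of relying on link state alone.
# eth0/eth1, 192.0.2.1 (gateway), and 192.0.2.10 are placeholders.
modprobe bonding
ip link add bond0 type bond mode active-backup \
    arp_interval 1000 arp_ip_target 192.0.2.1
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip addr add 192.0.2.10/24 dev bond0
```

If the active switch loses its uplinks but keeps its access port up, the ARP probes stop getting answers and the bond driver fails over to the standby slave, which plain link-state monitoring (`miimon`) would never catch.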
The challenge is that there are bugs in the ARP tracking code in the bond driver. I also think that, conceptually, this is not a long-term play: more hosts mean more ARP, and more ARP means more load on the control plane, which usually runs on a craptastic CPU that is easy to overrun. More load on the control plane means the rate policers need to be tweaked, or the ARP requests get dropped and the bond drivers trigger failover.
Long story short, I’d like to discuss what other folks are doing about dual-homing hosts without bonding at L2. We are trying to get the FRR stack installed on our Linux boxes to route to the host, but so far the server admins are strongly resisting it.
The worst offenders:
Special mention goes to ESXi, which does not even support ARP tracking, so it’s stuck in the link-state-tracking world (there is a feature called beacon probing, but it’s super weird, as it requires three switches for some reason, and who does three switches per host?).
An even more special mention goes to some storage vendors like Pure that will only work over MLAG. Our storage admins purchased an array without consulting the network team, and it has been sitting there collecting dust for the past 1.5 years, since the only way to connect it to the network in HA mode is to MLAG the A-side and B-side switches, and we don’t support that (not to mention that an MLAG between the A and B switches totally destroys the whole idea of SAN-A/SAN-B separation).