Provision support for IRF MAD LACP Split brain detection

When you use IRF to group multiple Comware switches into one logical device, it is generally recommended to enable some form of split-brain detection. A split brain happens when all the stacking links are down, so the members can no longer see each other.

Until now, only a Comware switch could act as the peer for the MAD LACP method. The Provision switch firmware has now been updated, so an LACP link between a Provision switch and a Comware IRF can be used for MAD LACP.

Background

The split brain detection mechanism, which is known as Multiple Active Detection (MAD), is available through LACP, BFD, ARP and ND. I personally only use the LACP and BFD methods.

The LACP method is easy, since you can use an existing link-aggregation to a peer switch for MAD. However, it relies on an extended LACP PDU that carries an additional TLV. The TLV contains the active master ID of the IRF system.

When the peer LACP device receives this information, it should proxy this information back to the original IRF system over all the other ports of the link-aggregation.

As a result, the IRF system will receive its own ID information back on the other ports of the link-aggregation.

When the ID is the same, everything is OK. When the IDs are different however, it means there is a split brain.
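The exchange described above can be sketched in a few lines of Python. This is only an illustrative model, not the Comware or Provision implementation: the function names, the data layout and the ID values are hypothetical, and a real LACP PDU carries far more than a single ID.

```python
# Illustrative model of MAD LACP; every name here is hypothetical.

def proxy_mad_tlv(trunk_ports, rx_port, active_id):
    """Peer switch: reflect the received active ID on every OTHER trunk port."""
    return {port: active_id for port in trunk_ports if port != rx_port}


def split_brain_detected(own_active_id, reflected_ids):
    """IRF member: a reflected ID that differs from its own means split brain."""
    return any(rid != own_active_id for rid in reflected_ids)


# The peer receives active ID 1 on port 23 and reflects it on port 24.
trunk = [23, 24]
reflected = proxy_mad_tlv(trunk, rx_port=23, active_id=1)

# Healthy IRF: the member behind port 24 also has active ID 1.
healthy = split_brain_detected(1, reflected.values())

# Split IRF: the member behind port 24 has elected itself master (ID 2)
# but still sees ID 1 reflected by the peer.
split = split_brain_detected(2, reflected.values())

print(healthy, split)
```

The point of the sketch is that the peer does not need to interpret the ID at all; it only has to reflect it on the other member ports, and the comparison happens on the IRF side.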

Provision support

So far, only Comware devices could be used as the peer detection device, since other vendors' LACP implementations would not proxy the additional TLV back over the other link-aggregation ports.

Now the Provision firmware has received an update and Provision switches can be used to provide the MAD LACP support.

This is good news for mixed Provision-Comware networks, where MAD BFD may not have been possible for whatever reason.

Example configuration

The example uses a 3500 running K.15.16.0004; the 5400/3500/3800/2920 switches all support the feature. The 2620 can also be used with current firmware (check the release notes).
The Comware IRF is based on a 3600, a 100Mbit Comware switch (the ports used are Ex/0/x, as opposed to Gx/0/x for a Gigabit switch).

The example assumes an IRF system is running already, so it shows only the steps to enable MAD LACP between the Comware IRF and Provision switch.

Steps:

  1. On Comware IRF, define an LACP link-aggregation to Provision
  2. On Comware IRF, enable the link-aggregation for MAD LACP
  3. On Provision, define an LACP link-aggregation to Comware IRF
  4. On Provision, enable the link-aggregation to perform MAD LACP pass-through

IRF: define LACP link-aggregation to Provision

 # Define Bridge Aggregation 24
[switch-irf] interface bridge 24
 # Enable LACP
[switch-irf-Bridge-Aggregation24] link-aggregation mode dynamic
[switch-irf-Bridge-Aggregation24] quit

 # Assign 2 physical interfaces to BAGG 24
[switch-irf] int range e1/0/24 e2/0/24
[switch-irf-if-range] port link-aggregation group 24
%Jan  1 00:23:34:889 2010 switch-irf LAGG/5/LAGG_ACTIVE: Member port Ethernet1/0/24 of aggregation group BAGG24 becomes ACTIVE.
%Jan  1 00:23:34:919 2010 switch-irf IFNET/3/LINK_UPDOWN: Bridge-Aggregation24 link status is UP.
[switch-irf-if-range]
%Jan  1 00:23:36:479 2010 switch-irf LAGG/5/LAGG_ACTIVE: Member port Ethernet2/0/24 of aggregation group BAGG24 becomes ACTIVE.
[switch-irf-if-range] quit
[switch-irf]

IRF: enable link-aggregation for MAD LACP

 # Enter the BAGG
[switch-irf] int bridge 24
 # Enable MAD LACP, assign a domain ID (should be unique per IRF system in your network)
[switch-irf-Bridge-Aggregation24] mad enable
 You need to assign a domain ID (range: 0-4294967295)
 [Current domain is: 0]: 1
 The assigned  domain ID is: 1
 Info: MAD LACP only enable on dynamic aggregation interface.
[switch-irf-Bridge-Aggregation24]quit

 # Review MAD Configured methods
[switch-irf] display mad
MAD ARP disabled.
MAD LACP enabled.
MAD BFD disabled.

 # Review MAD verbose configuration
[switch-irf] display mad verbose
Current MAD status: Detect
Excluded ports(configurable):
Excluded ports(can not be configured):
  GigabitEthernet1/0/25
  GigabitEthernet1/0/26
  GigabitEthernet2/0/25
  GigabitEthernet2/0/26
MAD ARP disabled.
MAD enabled aggregation port:
  Bridge-Aggregation24
MAD BFD disabled.
[switch-irf]

Provision: define LACP link-aggregation to Comware

 # Create trk object with LACP protocol enabled
HP-3500-24(config)# trunk 23,24 trk1 lacp

 # Review LACP status
HP-3500-24(config)# show lacp

                                    LACP

           LACP      Trunk     Port                LACP      Admin   Oper
   Port    Enabled   Group     Status    Partner   Status    Key     Key
   -----   -------   -------   -------   -------   -------   ------  ------
   23      Active    Trk1      Up        Yes       Success   0       290
   24      Active    Trk1      Up        Yes       Success   0       290


 # Review LACP detailed peer information
HP-3500-24(config)# show lacp peer

LACP Peer Information.


System ID: 2c27d7-79dc80


  Local  Local                       Port      Oper    LACP     Tx
  Port   Trunk  System ID      Port  Priority  Key     Mode     Timer
  ------ ------ -------------- ----- --------- ------- -------- -----
  23     Trk1   b8af67-38764b  24    32768     1       Active   Slow
  24     Trk1   b8af67-38764b  54    32768     1       Active   Slow


HP-3500-24(config)#

 # On Comware, verify LACP detailed peer information
[switch-irf] display link-aggregation verbose Bridge-Aggregation 24
Loadsharing Type: Shar -- Loadsharing, NonS -- Non-Loadsharing
Port Status: S -- Selected, U -- Unselected
Flags:  A -- LACP_Activity, B -- LACP_Timeout, C -- Aggregation,
        D -- Synchronization, E -- Collecting, F -- Distributing,
        G -- Defaulted, H -- Expired

Aggregation Interface: Bridge-Aggregation24
Aggregation Mode: Dynamic
Loadsharing Type: Shar
System ID: 0x8000, b8af-6738-764b
Local:
  Port             Status  Priority Oper-Key  Flag
--------------------------------------------------------------------------------
  Eth1/0/24        S       32768    1         {ACDEF}
  Eth2/0/24        S       32768    1         {ACDEF}
Remote:
  Actor            Partner Priority Oper-Key  SystemID               Flag
--------------------------------------------------------------------------------
  Eth1/0/24        23      0        290       0xdc80, 2c27-d779-dc80 {ACDEF}
  Eth2/0/24        24      0        290       0xdc80, 2c27-d779-dc80 {ACDEF}
[switch-irf]
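The two switches print the same system IDs in different notations (b8af67-38764b on Provision, b8af-6738-764b on Comware), which makes a visual comparison error-prone. A minimal sketch of how the values could be normalized before comparing them; the helper name `normalize_mac` is hypothetical, and the four strings are copied from the outputs above.

```python
def normalize_mac(text):
    """Strip separators and lowercase a MAC written in any common notation."""
    hex_digits = "".join(ch for ch in text.lower() if ch in "0123456789abcdef")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return hex_digits


# IRF system ID as seen by Provision ("show lacp peer") and by Comware itself.
irf_seen_by_provision = "b8af67-38764b"
irf_local = "b8af-6738-764b"

# Provision system ID as printed locally and as seen by Comware.
provision_local = "2c27d7-79dc80"
provision_seen_by_irf = "2c27-d779-dc80"

same_irf = normalize_mac(irf_seen_by_provision) == normalize_mac(irf_local)
same_provision = normalize_mac(provision_local) == normalize_mac(provision_seen_by_irf)

print(same_irf, same_provision)
```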

Provision: enable MAD LACP Pass-through

 # enable MAD LACP TLV pass-through (not enabled by default)
HP-3500-24(config)# interface trk1 lacp mad-passthrough enable

 # Review MAD LACP configuration
HP-3500-24(config)# show lacp mad-passthrough

  Trunk-Group  LACP-MAD-PASSTHROUGH
  ------------ ---------------------
  Trk1         Enabled

 # Review MAD LACP counters
HP-3500-24(config)# show lacp mad-passthrough counters

                MAD Passthrough  MAD Passthrough  MAD Passthrough
  Port   Trunk  PDUs Tx          PDUs Rx          PDUs Dropped
  ------ ------ ---------------- ---------------- ----------------
  23     Trk1   4                10               6
  24     Trk1   4                11               7
HP-3500-24(config)#
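If you want to check the pass-through counters from a script, for example to confirm that they keep increasing between two runs, the table can be parsed with a small regular expression. This is a rough sketch that assumes the column layout shown above; the names `ROW` and `parse_mad_counters` are hypothetical, and how the text is collected from the switch is left out.

```python
import re

# Sample copied from the "show lacp mad-passthrough counters" output above.
SAMPLE = """\
                MAD Passthrough  MAD Passthrough  MAD Passthrough
  Port   Trunk  PDUs Tx          PDUs Rx          PDUs Dropped
  ------ ------ ---------------- ---------------- ----------------
  23     Trk1   4                10               6
  24     Trk1   4                11               7
"""

# A data row: port, trunk name, then three integer counters.
ROW = re.compile(r"^\s*(\S+)\s+(Trk\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_mad_counters(text):
    """Return {port: {"trunk", "tx", "rx", "dropped"}} for every data row."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            port, trunk, tx, rx, dropped = match.groups()
            counters[port] = {
                "trunk": trunk,
                "tx": int(tx),
                "rx": int(rx),
                "dropped": int(dropped),
            }
    return counters


counters = parse_mad_counters(SAMPLE)
for port, values in counters.items():
    print(port, values)
```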

Validation

The setup was validated by shutting down the IRF links to force a split brain. The commands are executed on the IRF member 2 console, which will shut down its ports as a result of the detection.

 # Forced shutdown of IRF stacking links
[switch-irf] int range g1/0/25 g1/0/26
[switch-irf-if-range] shutdown

 # Console is logged out, since new Master is selected for this partition
<switch-irf>
#Jan  1 00:39:15:197 2010 switch-irf SHELL/4/LOGIN:
 Trap 1.3.6.1.4.1.25506.2.2.1.1.3.0.1: login from Console
%Jan  1 00:39:15:353 2010 switch-irf SHELL/5/SHELL_LOGIN: Console logged in from aux1.


 # The first ENTER key presses have to wait until management becomes available again
 System is busy in recovering configuration, please wait a moment...
 System is busy in recovering configuration, please wait a moment...

 # New console login is now effective
<switch-irf>


 # No console messages have been seen, since the console was not active yet
 # so review the log file
<switch-irf> dis logbuffer reverse
Logging buffer configuration and contents:enabled
Allowed max buffer size : 1024
Actual buffer size : 512
Channel number : 4 , Channel name : logbuffer
Dropped messages : 0
Overwritten messages : 0
Current messages : 10

%Jan  1 00:39:30:497 2010 switch-irf SHELL/6/SHELL_CMD: -Task=au1-IPAddr=**-User=**; Command is dis logbuffer reverse
%Jan  1 00:39:15:513 2010 switch-irf SHELL/5/SHELL_LOGIN: Console logged in from aux1.
%Jan  1 00:39:14:878 2010 switch-irf IFNET/3/LINK_UPDOWN: Bridge-Aggregation24 link status is DOWN.
%Jan  1 00:39:14:863 2010 switch-irf LAGG/5/LAGG_INACTIVE_PHYSTATE: Member port Ethernet2/0/24 of aggregation group BAGG24 becomes INACTIVE because the port's physical state (down) is improper for being attached.
%Jan  1 00:39:14:862 2010 switch-irf LAGG/5/LAGG_INACTIVE_CONFIGURATION: Member port Ethernet1/0/24 of aggregation group BAGG24 becomes INACTIVE because the port's configuration is improper for being attached.
%Jan  1 00:39:14:862 2010 switch-irf IFNET/3/LINK_UPDOWN: Ethernet2/0/24 link status is DOWN.
%Jan  1 00:39:14:862 2010 switch-irf MAD/1/MAD_COLLISION_DETECTED: Multi-active devices detected, please fix it.
%Jan  1 00:39:14:607 2010 switch-irf STM/3/STM_LINK_STATUS_DOWN:
 IRF port 2 is down.
%Jan  1 00:39:14:607 2010 switch-irf HA/5/HA_SLAVE_TO_MASTER: Slave board in slot 2 changes to master.
%Jan  1 00:00:33:320 2010 switch-irf IC/6/SYS_RESTART: -Slot=1; System restarted --
HP Platform Software.
<switch-irf>

 # Verify that the MAD process has shut down all interfaces,
 # so only the other IRF member remains online on the network
<switch-irf> dis interface brief down
The brief information of interface(s) under bridge mode:
Link: ADM - administratively down; Stby - standby
Interface            Link Cause
BAGG24               DOWN MAD ShutDown
Eth2/0/1             DOWN MAD ShutDown
Eth2/0/2             DOWN MAD ShutDown
Eth2/0/3             DOWN MAD ShutDown
Eth2/0/4             DOWN MAD ShutDown
Eth2/0/5             DOWN MAD ShutDown
Eth2/0/6             DOWN MAD ShutDown
Eth2/0/7             DOWN MAD ShutDown
Eth2/0/8             DOWN MAD ShutDown
Eth2/0/9             DOWN MAD ShutDown
Eth2/0/10            DOWN MAD ShutDown
Eth2/0/11            DOWN MAD ShutDown
Eth2/0/12            DOWN MAD ShutDown
Eth2/0/13            DOWN MAD ShutDown
Eth2/0/14            DOWN MAD ShutDown
Eth2/0/15            DOWN MAD ShutDown
Eth2/0/16            DOWN MAD ShutDown
Eth2/0/17            DOWN MAD ShutDown
Eth2/0/18            DOWN MAD ShutDown
Eth2/0/19            DOWN MAD ShutDown
Eth2/0/20            DOWN MAD ShutDown
Eth2/0/21            DOWN MAD ShutDown
Eth2/0/22            DOWN MAD ShutDown
Eth2/0/23            DOWN MAD ShutDown
Eth2/0/24            DOWN Link-Aggregation interface down
GE2/0/25             DOWN Not connected
GE2/0/26             DOWN Not connected
GE2/0/27             DOWN MAD ShutDown
GE2/0/28             DOWN MAD ShutDown

<switch-irf>
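With a long interface list it is easy to miss a port that is down for another reason. The following sketch groups the interfaces by their down cause; it assumes the three-column layout shown above, uses an abbreviated copy of that output as sample data, and the function name `down_causes` is hypothetical.

```python
from collections import Counter

# Abbreviated copy of the "dis interface brief down" output above.
SAMPLE = """\
Interface            Link Cause
BAGG24               DOWN MAD ShutDown
Eth2/0/1             DOWN MAD ShutDown
Eth2/0/24            DOWN Link-Aggregation interface down
GE2/0/25             DOWN Not connected
GE2/0/27             DOWN MAD ShutDown
"""


def down_causes(text):
    """Return {interface: cause} for every row whose Link column is DOWN."""
    causes = {}
    for line in text.splitlines():
        parts = line.split(None, 2)  # interface, link state, free-text cause
        if len(parts) == 3 and parts[1] == "DOWN":
            causes[parts[0]] = parts[2].strip()
    return causes


causes = down_causes(SAMPLE)
summary = Counter(causes.values())
not_mad = sorted(name for name, cause in causes.items() if cause != "MAD ShutDown")

print(summary["MAD ShutDown"], not_mad)
```

In the full output above, the ports that are not in "MAD ShutDown" state are the link-aggregation member port and the two IRF stacking ports.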

On the Provision side, verify that the link-aggregation has only 1 active port remaining:

HP-3500-24(config)# show lacp

                                    LACP

           LACP      Trunk     Port                LACP      Admin   Oper
   Port    Enabled   Group     Status    Partner   Status    Key     Key
   -----   -------   -------   -------   -------   -------   ------  ------
   23      Active    Trk1      Up        Yes       Success   0       290
   24      Active    Trk1      Down      No        Success   0       290

Supplemental validation

Run another split-brain check, to see the console output. This can be forced using the mad restore command.

The mad restore command was originally intended for this rare situation:

* IRF configured between switches (example SW1/SW2)
* Split brain occurs, SW2 MAD detects it and shuts down all interfaces
* SW1 is the only surviving node, network still ok
* SW1 encounters a power failure, so the network is down (SW2 has all ports down, so no more network)
* Instead of performing a full reboot of SW2 to get it online again, the admin can use mad restore on SW2 to enable the interfaces again. Network will be back online after this command, since there is no more split brain condition (SW1 is powered down).

You can abuse this functionality to run multiple split brain tests without having to do a full reboot of the switches. This is what is done in this example.

Since the mad restore will be done on member 2, the interfaces will come UP, MAD LACP will detect the split brain again, and all interfaces will be SHUTDOWN again. This time, however, you can follow the process in the console log output as well.

[switch-irf] mad restore
This command will restore the device from multi-active conflict state. Continue? [Y/N]:y
Restoring from multi-active conflict state, please wait...
[switch-irf]
%Jan  1 00:52:42:448 2010 switch-irf IFNET/3/LINK_UPDOWN: Ethernet2/0/24 link status is UP.
%Jan  1 00:52:42:568 2010 switch-irf MAD/1/MAD_COLLISION_DETECTED: Multi-active devices detected, please fix it.
%Jan  1 00:52:42:709 2010 switch-irf IFNET/3/LINK_UPDOWN: Ethernet2/0/24 link status is DOWN.
[switch-irf]
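The three log lines above also give a rough idea of how quickly the detection reacts. The sketch below parses the Comware-style timestamps (hh:mm:ss:ms) and computes the gaps between the events. It assumes all lines fall on the same day, the names `LINE` and `parse_events` are hypothetical, and the result reflects when the messages were logged, not exact protocol timing.

```python
import re

# Copied from the console output above.
LOG = """\
%Jan  1 00:52:42:448 2010 switch-irf IFNET/3/LINK_UPDOWN: Ethernet2/0/24 link status is UP.
%Jan  1 00:52:42:568 2010 switch-irf MAD/1/MAD_COLLISION_DETECTED: Multi-active devices detected, please fix it.
%Jan  1 00:52:42:709 2010 switch-irf IFNET/3/LINK_UPDOWN: Ethernet2/0/24 link status is DOWN.
"""

# %Mon day hh:mm:ss:ms year hostname MODULE/severity/MNEMONIC: message
LINE = re.compile(
    r"^%\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+\d{4}\s+\S+\s+([^:\s]+):\s+(.*)$"
)


def parse_events(text):
    """Return a list of (milliseconds since midnight, tag, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
            stamp = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
            events.append((stamp, match.group(5), match.group(6)))
    return events


events = parse_events(LOG)
link_up, collision, link_down = events

detect_ms = collision[0] - link_up[0]      # link up -> collision logged
shutdown_ms = link_down[0] - collision[0]  # collision logged -> link down

print(detect_ms, shutdown_ms)
```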

Conclusion

This example shows how a Provision LACP link-aggregation can be used to assist a Comware IRF system with split brain detection.


4 Responses to Provision support for IRF MAD LACP Split brain detection

  1. jonathan says:

    Hello, quick question about enabling LACP MAD on a bridge aggregation link: it prompts you for the domain ID, which I leave as the ID of the IRF pair I am on. Why does it prompt for this, and in what scenarios would you choose a different domain ID to the switch you are on?

  2. jonathan says:

    OK, figured it out. I thought I was setting a domain per bridge-aggregation interface, but it seems to change the whole switch's IRF domain.

  3. Rob says:

    Hi, I have a couple of questions.
    1. I am using route aggregation interfaces exclusively between the core/distribution layers. Can MAD LACP be enabled on a route aggregation interface in the same way as a bridge aggregation interface? The command is accepted, but I cannot find a reference to a MAD/RAGG config example.
    2. Some references state that MAD needs to be enabled on both ends of a link and others do not. Is this necessary for the comware LACP extensions to work properly?

    • Hi Rob,
      1. the MAD information is part of the LACP packet exchange, so if LACP is ok, MAD is ok. In fact, from a link-agg point of view, there is no difference between a BAGG with LACP and RAGG with LACP. This is only a switch local config difference with regards to routing/switching over the link-aggregation, so it has no impact on the MAD LACP process.
      2. Both ends must understand the MAD LACP extensions. Comware switches understand these LACP extensions by default, no config required except for enabling LACP (so target does not need to be an IRF, just a Comware device). When you have a non-Comware device, such as the ArubaOS-switch (Provision), you need to enable support for the MAD LACP extensions manually.
