2. Subnet Manager
A physical connection alone is not enough: the IB network is not usable until a Subnet Manager (SM) is running. The Subnet Manager actively configures the IB fabric. It runs either as a service on one of the cluster nodes or on a managed IB network switch.
The SM initializes and configures all devices on the fabric, which includes assigning local identifiers (LIDs) and establishing traffic paths within the network. It can also isolate faults. A fabric can have redundant subnet managers, but only one can be active at a time; the others remain in standby.
Similar to MAC addresses in Ethernet, each InfiniBand device has several GUIDs (Globally Unique Identifiers). Together with the LID assigned by the Subnet Manager, these are the identifiers you will encounter:
- Node GUID
identifies the HCA, Switch or Router
- Port GUID
identifies the port on an HCA, Switch or Router
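- System GUID
groups several nodes that belong to the same physical system under one identifier (ibstat reports it as System image GUID)
- LID (Local Identifier)
16-bit address assigned by the Subnet Manager and used to route packets within the subnet; unlike a GUID, it is not persistent across reboots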
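These identifiers appear as labelled fields in ibstat output (Node GUID, System image GUID, Port GUID, Base lid). The following is a minimal sketch for pulling one of them out, assuming the "Label: value" layout that ibstat prints; ib_field is a hypothetical helper name and not part of any InfiniBand package.

```shell
# ib_field NAME (hypothetical helper): read ibstat-style text on stdin and
# print the value of every "NAME: value" line, one value per line.
ib_field() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            prefix = name ":"
            if (index(line, prefix) == 1) {   # line starts with "NAME:"
                value = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", value)     # drop space after the colon
                print value
            }
        }'
}

# Example use on a node with the InfiniBand tools installed:
#   ibstat | ib_field "Port GUID"
#   ibstat | ib_field "Base lid"
```

Because the match is anchored at the start of the label, asking for "State" does not also return the "Physical state" line.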
Warning
Since all clusters in Temple share the same IB fabric, we will configure only a single subnet manager for all training clusters. The following only illustrates how this would be done. DO NOT INSTALL OPENSM ON YOUR OWN CLUSTER!
This is how you can install the OpenSM daemon on a system to act as subnet manager:
# DO NOT EXECUTE!
[root@master ~]# yum install opensm
[root@master ~]# systemctl enable opensm
[root@master ~]# systemctl start opensm
[root@master ~]# systemctl status opensm
● opensm.service - Starts the OpenSM InfiniBand fabric Subnet Manager
Loaded: loaded (/usr/lib/systemd/system/opensm.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-02-12 10:48:18 EST; 13s ago
Docs: man:opensm
Process: 24400 ExecStart=/usr/libexec/opensm-launch (code=exited, status=0/SUCCESS)
Main PID: 24401 (opensm-launch)
CGroup: /system.slice/opensm.service
├─24401 /bin/bash /usr/libexec/opensm-launch
└─24402 /usr/sbin/opensm
Feb 12 10:48:18 master opensm-launch[24400]: Log File: /var/log/opensm.log
Feb 12 10:48:18 master opensm-launch[24400]: -------------------------------------------------
Feb 12 10:48:18 master OpenSM[24402]: /var/log/opensm.log log file opened
Feb 12 10:48:18 master OpenSM[24402]: OpenSM 3.3.21
Feb 12 10:48:18 master opensm-launch[24400]: OpenSM 3.3.21
Feb 12 10:48:18 master opensm-launch[24400]: Using default GUID 0x2c903000b86f9
Feb 12 10:48:18 master OpenSM[24402]: Entering DISCOVERING state
Feb 12 10:48:18 master opensm-launch[24400]: Entering DISCOVERING state
Feb 12 10:48:18 master OpenSM[24402]: Entering MASTER state
Feb 12 10:48:18 master opensm-launch[24400]: Entering MASTER state
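The state transitions in the log above can be pulled out with a small filter, which is handy when checking whether a given SM instance is the active one. This is a minimal sketch, assuming log lines of the form "Entering <STATE> state" as shown above; sm_state is a hypothetical helper name. It only reads text and does not install or start anything.

```shell
# sm_state (hypothetical helper): read OpenSM log text on stdin and print
# the most recent state named in an "Entering <STATE> state" message.
# Prints nothing if no such message is present.
sm_state() {
    sed -n 's/.*Entering \([A-Z_]*\) state.*/\1/p' | tail -n 1
}

# Example use, on the host that runs the subnet manager only:
#   journalctl -u opensm | sm_state
```

For the log shown above, the last matching message is "Entering MASTER state", so the filter would print MASTER.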
Once there is an active SM, the state of your IB port will change to Active:
[root@master ~]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x0002c903000b86f8
System image GUID: 0x0002c903000b86fb
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 10
LMC: 0
SM lid: 10
Capability mask: 0x0259086a
Port GUID: 0x0002c903000b86f9
Link layer: InfiniBand
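The two fields to look at in this output are State and Physical state. The following is a minimal sketch for scripting that check, assuming the ibstat layout shown above; port_ready is a hypothetical helper name.

```shell
# port_ready (hypothetical helper): read ibstat-style text on stdin.
# Succeeds only if at least one port is listed and every listed port
# reports "State: Active" and "Physical state: LinkUp".
port_ready() {
    awk '
        { sub(/^[ \t]+/, "") }                # drop leading indentation
        /^State:/          { ports++; if ($2 != "Active") bad++ }
        /^Physical state:/ { if ($3 != "LinkUp") bad++ }
        END {
            if (ports > 0 && bad == 0) exit 0
            exit 1
        }'
}

# Example use:
#   if ibstat | port_ready; then echo "IB port is ready"; fi
```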
Verify the same is true on your compute nodes:
[root@master ~]# ssh c01
[root@c01 ~]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x0002c903000b8964
System image GUID: 0x0002c903000b8967
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 18
LMC: 0
SM lid: 10
Capability mask: 0x02590868
Port GUID: 0x0002c903000b8965
Link layer: InfiniBand
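Both outputs report SM lid 10, and only the port on master also has Base lid 10, which matches master being the node that runs OpenSM in this illustration. The following is a minimal sketch of that comparison, assuming a single-port ibstat listing like the ones above; runs_sm is a hypothetical helper name.

```shell
# runs_sm (hypothetical helper): read ibstat-style text on stdin.
# Succeeds if the port's own LID ("Base lid") equals the LID of the
# active subnet manager ("SM lid"), i.e. the SM is reached at this port.
# With several ports in the input, only the last port is considered.
runs_sm() {
    awk '
        { sub(/^[ \t]+/, "") }                # drop leading indentation
        /^Base lid:/ { base = $3 }
        /^SM lid:/   { sm = $3 }
        END {
            if (base != "" && base == sm) exit 0
            exit 1
        }'
}

# Example use:
#   ibstat | runs_sm && echo "this port hosts the active SM"
#   ssh c01 ibstat | runs_sm || echo "c01 is served by an SM elsewhere"
```

Every node on the same fabric should report the same SM lid; a node that shows a different value, or a port that is not Active, is worth a closer look.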