On systems equipped with NVIDIA NVSwitch or NVLink GPUs, the Graid service (graid) fails to initialize and start properly. This issue specifically affects high-performance computing (HPC) and data center environments that use NVIDIA's GPU interconnect technologies.
When fabric routing and topology management fail, the GPUs are not fully initialized. Because Graid relies on full GPU initialization, the Graid service then fails to start.
Figure 3 shows a simplified HGX-2 GPU baseboard.
Figure 3 Simplified HGX-2 Baseboard
The HGX-2 baseboard contains eight V100 GPUs and six corresponding first-generation NVSwitches. The eight GPUs and six NVSwitches appear on the host system's PCIe tree as PCIe devices.
Here is an example:
lspci | grep -i nvidia
34:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
36:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
39:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
57:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
59:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
5c:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
61:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
62:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
63:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
65:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
66:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
67:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
Figure 4 shows a simplified NVIDIA HGX A100 baseboard diagram.
Figure 4 Simplified HGX A100 Baseboard
The NVIDIA HGX A100 baseboard PCIe topology is similar to that of the HGX-2 baseboard, with eight A100 GPUs and six corresponding second-generation NVSwitches. The eight GPUs and six NVSwitches appear on the host system's PCIe tree as PCIe devices.
Here is an example:
lspci | grep -i nvidia
36:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
41:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
45:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
59:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
5d:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
63:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
67:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
6d:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
6e:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
6f:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
70:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
71:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
72:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
Figure 5 shows a simplified NVIDIA HGX H100 GPU baseboard.
Figure 5 Simplified HGX H100 Baseboard
On this baseboard, the PCIe tree lists four NVSwitches and eight GPUs. Here is an example:
lspci | grep -i nvidia
07:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
08:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
09:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
0a:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
1b:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
43:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
52:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
9d:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
c3:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
d1:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
df:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
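To confirm that a host belongs to the affected class of systems, the listings above can be summarized automatically. The following is a minimal sketch, not part of the Graid tooling: the function name count_nvidia_devices is hypothetical, and it assumes that GPUs are reported as "3D controller" and NVSwitches as "Bridge", as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count NVIDIA GPUs and NVSwitch bridges in lspci
# output read from stdin.
# Assumption: GPUs show up as "3D controller" and NVSwitches as "Bridge",
# matching the listings in this article.
count_nvidia_devices() {
    awk '
        /NVIDIA/ && /3D controller/ { gpus++ }
        /NVIDIA/ && /Bridge/        { switches++ }
        END { printf "gpus=%d nvswitches=%d\n", gpus, switches }
    '
}

# Example on a live system (not run here):
#   lspci | count_nvidia_devices
```

A non-zero nvswitches count suggests that the host uses an NVSwitch fabric and that the steps below apply.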
Note: nvidia-fabricmanager-550 supports driver version 550.90.
1. Remove the Default NVIDIA Driver (Version 550.67)
If the NVIDIA driver version installed by the GRAID pre-install script (550.67) does not match the version required by the NVIDIA Fabric Manager (e.g., 550.90), you need to uninstall it.
sudo /usr/bin/nvidia-uninstall -s -q
-s: Runs the uninstaller in silent mode.
-q: Suppresses interactive questions by assuming the default answers.
3. Install the Compatible NVIDIA Driver
a. Download the Required NVIDIA Driver
Note: Replace <driver-version> with the required version (e.g., 550.90.07).
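As a sketch of the download step, the helper below builds an installer URL from the version string. The function name driver_url is hypothetical, and the URL pattern (NVIDIA's data center driver download host, x86_64 package) is an assumption rather than something stated in this article. Confirm the link on NVIDIA's driver download page before using it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the .run installer URL for a driver version.
# Assumption: x86_64 host and NVIDIA's data center ("tesla") download path.
driver_url() {
    local version="$1"
    printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

# Example (not run here):
#   wget "$(driver_url 550.90.07)"
```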
b. Install the NVIDIA Driver
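A minimal sketch of the install command, assembled from the options described in this step. The function name driver_install_cmd is hypothetical, and the installer file name assumes that the x86_64 .run package was downloaded to the current directory. The helper only prints the command so that it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the driver install command for a version.
# The flags are the ones described in this step; the file name assumes the
# x86_64 .run package sits in the current directory.
driver_install_cmd() {
    local version="$1"
    printf 'sudo bash ./NVIDIA-Linux-x86_64-%s.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms\n' \
        "$version"
}

# Example: print the command for review, then run it manually.
#   driver_install_cmd 550.90.07
```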
-s: Silent installation mode.
--no-systemd: Skip installing systemd services.
--no-opengl-files: Do not install OpenGL libraries (suitable for headless servers).
--no-nvidia-modprobe: Do not install the nvidia-modprobe utility.
--dkms: Enable DKMS module for automatic recompilation with kernel updates.
4. Install NVIDIA Fabric Manager
Ensure that the Fabric Manager version matches the NVIDIA driver version installed.
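The version check can be scripted. The sketch below is illustrative only: versions_match is a hypothetical helper, and the commented usage assumes that nvidia-smi is available and that Fabric Manager was installed from a dpkg-based package named nvidia-fabricmanager-550.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the Fabric Manager package version
# equals the driver version. A Debian-style revision suffix such as "-1"
# is ignored in the comparison.
versions_match() {
    local driver="$1" fm_pkg="$2"
    [ "$driver" = "${fm_pkg%%-*}" ]
}

# Example on a live system (assumptions: nvidia-smi present, dpkg-based OS):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   fm="$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)"
#   if versions_match "$driver" "$fm"; then echo "versions match"; else echo "MISMATCH"; fi
```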
a. Update Package Repository
b. Install Fabric Manager Package (refer to the NVIDIA Fabric Manager User Guide)
Note: Replace <driver-version> with your NVIDIA driver version (e.g., 550).
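Steps a and b might look like the following sketch. It assumes an Ubuntu or Debian host that uses apt and already has NVIDIA's package repository configured. The helper names fm_package_name and fm_install_cmds are hypothetical. The commands are printed for review rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the Fabric Manager package name from a full
# driver version, e.g. 550.90.07 -> nvidia-fabricmanager-550.
fm_package_name() {
    local version="$1"
    printf 'nvidia-fabricmanager-%s\n' "${version%%.*}"
}

# Print the repository update and install commands for review.
# Assumption: apt-based distribution with NVIDIA's repository configured.
fm_install_cmds() {
    local pkg
    pkg="$(fm_package_name "$1")"
    printf 'sudo apt-get update\n'
    printf 'sudo apt-get install -y %s\n' "$pkg"
}

# Example:
#   fm_install_cmds 550.90.07
```

If the newest package in the repository is ahead of the installed driver, apt's package=version syntax can be used to request a specific package version instead.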
6. Adjust Graid Service Dependencies
Modify the Graid service unit file to ensure it starts after the Fabric Manager service.
a. Open the Graid Service Unit File
b. Edit the [Unit] Section
Add nvidia-fabricmanager.service to the Wants and After directives.
Remove sysinit.target from the Before directive, if present.
Updated [Unit] Section Example:
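The same edit can also be applied with a filter instead of an interactive editor. The sketch below is one possible approach, not the procedure from this article: patch_unit_section is a hypothetical helper that reads a unit file on stdin and writes the patched file to stdout. Rather than extending existing Wants= and After= lines, it adds separate lines directly below the [Unit] header, which systemd merges with any existing ones. It is not idempotent, so run it only once per file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: patch the [Unit] section of a unit file (stdin -> stdout).
#  - adds Wants= and After= lines for nvidia-fabricmanager.service
#  - removes sysinit.target from any Before= line in [Unit]
patch_unit_section() {
    awk -v dep="nvidia-fabricmanager.service" '
        /^\[/ { in_unit = ($0 == "[Unit]") }   # track the current section
        in_unit && $0 == "[Unit]" {
            print
            print "Wants=" dep
            print "After=" dep
            next
        }
        in_unit && /^Before=/ {
            # Rebuild the target list without sysinit.target.
            n = split(substr($0, 8), targets)
            kept = ""
            for (i = 1; i <= n; i++)
                if (targets[i] != "sysinit.target")
                    kept = kept (kept == "" ? "" : " ") targets[i]
            if (kept != "") print "Before=" kept   # drop the line if now empty
            next
        }
        { print }
    '
}

# Example (UNIT_FILE is the Graid unit file opened in step a):
#   patch_unit_section < "$UNIT_FILE" > graid.service.new
```

Review the generated file before replacing the original unit file with it.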
c. Save and Exit
Press Esc to switch to command mode.
Type :wq and press Enter to save changes and exit.
d. Reload Systemd Daemon
7. Restart the Graid Service
a. Stop the Graid Service
8. Verify Graid Service Status
Check that the Graid service is running without issues.
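The reload, restart, and verification steps can be sketched as one sequence. This is an illustrative script under stated assumptions: the unit names graid and nvidia-fabricmanager are taken from the names used in this article, and the run and restart_graid helpers are hypothetical. By default the script only prints each command. Set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Assumption: the units are named graid.service and
# nvidia-fabricmanager.service.
restart_graid() {
    run sudo systemctl daemon-reload      # pick up the edited unit file
    run sudo systemctl stop graid
    run sudo systemctl start graid
    run systemctl is-active nvidia-fabricmanager graid
    run systemctl status graid --no-pager
}

# Example:
#   restart_graid              # preview the commands
#   DRY_RUN=0 restart_graid    # execute them
```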
GRAID Pre-install Script Default Driver Version:
Importance of NVIDIA Fabric Manager:
Version Compatibility:
Updating Repositories:
Consult Documentation: