Resolving Graid Service Startup Issue on Systems with NVSwitch/NVLink GPUs

Environment

RAID Model: SR1000, SR1010, etc.

Host Hardware: AMD/Intel

Operating System: Linux (Ubuntu, RHEL-based, SUSE)


Issue

On systems equipped with NVIDIA NVSwitch or NVLink GPUs, the Graid service (graid) fails to initialize and start. This issue specifically affects high-performance computing (HPC) and data center environments that use NVIDIA's GPU interconnect technologies.

Root Cause

NVIDIA Fabric Manager is required only when NVSwitch hardware is present.
Without the Fabric Manager:

  • The NVSwitch/NVLink fabric cannot initialize, so GPU-to-GPU communication does not become available.
  • Fabric routing and topology management fail, which causes the Graid service startup to fail because Graid relies on full GPU initialization.

How to Check Whether Fabric Manager is Required

Fabric Manager is needed ONLY IF your system contains NVSwitch or NVLink bridges.
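
A quick way to apply this check is to count the NVIDIA bridge devices that lspci reports. This is a minimal sketch, assuming NVSwitches enumerate as "Bridge: NVIDIA Corporation ..." entries, as in the sample outputs later in this article. The helper name count_nvidia_bridges is hypothetical.

```shell
# Count NVIDIA devices that lspci reports with the class "Bridge".
# Assumption: NVSwitches appear as "Bridge: NVIDIA Corporation ..." lines,
# as in the sample outputs in this article.
count_nvidia_bridges() {
    # Reads lspci-style text on stdin and prints the number of matching lines.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable in scripts that run with "set -e".
    grep -i 'nvidia' | grep -ci 'bridge:' || true
}

# Usage on a live host (not executed here):
#   lspci | count_nvidia_bridges
```

A non-zero count suggests that NVSwitch hardware is present. Once a driver is installed, nvidia-smi topo -m gives a complementary view of the GPU interconnect.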

GPU Baseboard Topologies
The following section provides information about different baseboard PCIe topologies, with a focus on GPU and NVSwitches, and how the topologies will appear on a host system.


The HGX-2 GPU Baseboard

Figure 3 shows a simplified HGX-2 GPU baseboard.

Figure 3 Simplified HGX-2 Baseboard

The HGX-2 baseboard contains eight V100 GPUs and six corresponding first generation NVSwitches. From a PCIe tree perspective, the eight GPUs and six NVSwitches will appear on the PCIe tree as PCIe devices on the host system.

Here is an example:

lspci | grep -i nvidia
34:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
36:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
39:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
57:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
59:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
5c:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM3 32GB] (rev a1)
61:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
62:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
63:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
65:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
66:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)
67:00.0 Bridge: NVIDIA Corporation Device 1ac2 (rev a1)

The NVIDIA HGX A100 GPU Baseboard

Figure 4 shows a simplified NVIDIA HGX A100 baseboard diagram.

Figure 4 Simplified HGX A100 Baseboard

The NVIDIA HGX A100 baseboard PCIe topology is like an HGX-2 baseboard with eight A100 GPUs and six corresponding second-generation NVSwitches. The eight GPUs and six NVSwitches will appear on the PCIe tree as PCIe devices on the host system.

Here is an example:

lspci | grep -i nvidia
36:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
41:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
45:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
59:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
5d:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
63:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
67:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
6d:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
6e:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
6f:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
70:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
71:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
72:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)

The NVIDIA HGX H100 GPU Baseboard

Figure 5 shows an NVIDIA HGX H100 GPU baseboard.

Figure 5 Simplified NVIDIA HGX H100 Baseboard Diagram


The NVIDIA HGX H100 baseboard PCIe topology has eight GPUs and four NVSwitches, which appear on the PCIe tree as PCIe devices on the host system.

Here is an example:

lspci | grep -i nvidia
07:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
08:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
09:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
0a:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
1b:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
43:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
52:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
9d:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
c3:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
d1:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
df:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)

NVIDIA HGX B200/B300 GPU Baseboard

Figure 6 shows a simplified NVIDIA HGX B200/B300 baseboard diagram.




Resolution

The Graid pre-install script installs NVIDIA driver version 550.67 by default, which may not match the NVIDIA Fabric Manager version available in the repositories (e.g., nvidia-fabricmanager-550 supports 550.90).

To resolve this issue, install the NVIDIA Fabric Manager before starting the Graid service. Follow the steps below:

  • Uninstall the default NVIDIA driver installed by the Graid pre-install script if there is a version mismatch.
  • Install the compatible NVIDIA driver version.
  • Install the NVIDIA Fabric Manager corresponding to the driver version.
  • Adjust the Graid service dependencies to ensure it starts after the Fabric Manager.
  • Restart the Graid service and verify its status.

    1. Remove the Default NVIDIA Driver (Version 550.67)

    If the NVIDIA driver version installed by the Graid pre-install script (550.67) does not match the version required by the NVIDIA Fabric Manager (e.g., 550.90), uninstall it.

    sudo /usr/bin/nvidia-uninstall -s -q
    • The -s option runs the uninstaller in silent mode.
    • The -q option (--no-questions) assumes the default answer to every prompt instead of asking.

    2. Reboot the System

    Restart the system to ensure the driver is completely removed.

    sudo reboot

    3. Install the Compatible NVIDIA Driver

    a. Download the Required NVIDIA Driver

      Note: Replace <driver-version> with the required version (e.g., 550.90.07).

    Example:
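
One possible way to fetch the runfile is sketched here. The URL pattern (a versioned directory under /tesla/ on NVIDIA's download server) and the helper name nvidia_runfile_url are assumptions, not taken from this article. Confirm the actual link on NVIDIA's driver download page before relying on it.

```shell
# Build the download URL for a given driver version.
# Assumption: the runfile is published at /tesla/<version>/ on
# us.download.nvidia.com; verify this on NVIDIA's driver download page.
nvidia_runfile_url() {
    ver="$1"
    printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$ver" "$ver"
}

# Example (not executed here):
#   wget "$(nvidia_runfile_url 550.90.07)"
```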

    b. Install the NVIDIA Driver

    sudo bash NVIDIA-Linux-x86_64-<driver-version>.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Example:
    sudo bash NVIDIA-Linux-x86_64-550.90.07.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Installer Options Explanation:
    • -s: Silent installation mode.
    • --no-systemd: Skip installing systemd services.
    • --no-opengl-files: Do not install OpenGL libraries (suitable for headless servers).
    • --no-nvidia-modprobe: Do not install the nvidia-modprobe utility.
    • --dkms: Register the kernel module with DKMS so that it is rebuilt automatically after kernel updates.

    4. Install NVIDIA Fabric Manager

    Ensure that the Fabric Manager version matches the NVIDIA driver version installed.

    a. Update Package Repository

    Ubuntu
    sudo apt-get update
    RHEL based
    sudo yum update

    b. Install Fabric Manager Package (refer to fabric-manager-user-guide)

    Note: Replace <driver-version> (Ubuntu) or <driver-branch> (RHEL based) with the major branch of your NVIDIA driver (e.g., 550).

    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-<driver-version>
    RHEL based
    sudo dnf module install nvidia-driver:<driver-branch>/fm
    Example:
    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-550
    RHEL based (refer to nvidia-driver-installation-guide)
    sudo rpm --erase gpg-pubkey-7fa2af80*
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean expire-cache
    sudo dnf module install nvidia-driver:550/fm
    Note: If the required version is not available in the repository, download it from the fabric-manager-user-guide.
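
After installation, the version match required above can be checked with a small comparison. This is a sketch only: the helper names extract_version and versions_match are hypothetical, and it assumes the package version string starts with the driver version (for example "550.90.07-1"), with no epoch prefix.

```shell
# Extract the leading dotted version (e.g. 550.90.07) from a string such as
# the package version "550.90.07-1".
extract_version() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\(\.[0-9][0-9]*\)*\).*/\1/p'
}

# Succeeds only when both strings carry the same non-empty dotted version.
versions_match() {
    a=$(extract_version "$1")
    b=$(extract_version "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on an Ubuntu host (not executed here; package name from this article):
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   fm=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)
#   versions_match "$driver" "$fm" && echo "versions match" || echo "version mismatch"
```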

    5. Enable and Start NVIDIA Fabric Manager Service
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl start nvidia-fabricmanager
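
Before moving on, it can help to confirm that the Fabric Manager service actually reached the active state. The sketch below uses a generic retry helper; the name wait_until and the 60-second timeout in the usage example are illustrative choices, not values from NVIDIA or Graid documentation.

```shell
# Retry a command once per second until it succeeds or the attempts run out.
# Usage: wait_until <max-attempts> <command> [args...]
wait_until() {
    remaining="$1"
    shift
    while ! "$@"; do
        remaining=$((remaining - 1))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example (not executed here): wait for the service, show recent logs on failure.
#   wait_until 60 systemctl is-active --quiet nvidia-fabricmanager \
#       || journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 20
```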

    6. Adjust Graid Service Dependencies

    Modify the Graid service unit file to ensure it starts after the Fabric Manager service.

    a. Open the Graid Service Unit File

    sudo vim /lib/systemd/system/graid.service

    b. Edit the [Unit] Section

    • Add nvidia-fabricmanager.service to the Wants and After directives.
    • Remove sysinit.target from the Before directive, if present.

    Updated [Unit] Section Example:

    [Unit]
    Description=Graid Service
    Wants=graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    After=local-fs.target graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    Before=shutdown.target



    c. Save and Exit the Editor
    • Press Esc to switch to command mode.
    • Type :wq and press Enter to save changes and exit.

    d. Reload Systemd Daemon

    sudo systemctl daemon-reload
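
To confirm that systemd picked up the edited dependencies, the properties of the loaded unit can be inspected with systemctl show. The following is a minimal sketch; the helper name has_unit_dep is hypothetical, and it simply searches the "Key=value ..." lines that systemctl show prints.

```shell
# Succeeds if the given property line (e.g. "After") lists the given unit.
# Reads "Key=unit1 unit2 ..." lines, as printed by `systemctl show`, on stdin.
has_unit_dep() {
    prop="$1"
    unit="$2"
    # Keep only the requested property, split it into one token per line,
    # then look for an exact match of the unit name.
    grep "^${prop}=" | tr ' =' '\n\n' | grep -Fxq "$unit"
}

# Usage (not executed here):
#   systemctl show graid.service -p After -p Wants \
#       | has_unit_dep After nvidia-fabricmanager.service && echo "ordering OK"
```

Note that systemctl show also lists implicit dependencies, so the After line on a live system is usually longer than the one in the unit file.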

    7. Restart the Graid Service

    a. Stop the Graid Service

    sudo systemctl stop graid
    b. Reset NVIDIA GPUs
    sudo nvidia-smi -r
    c. Start the Graid Service
    sudo systemctl start graid
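
The three commands above can be wrapped in a small script so that the sequence stops at the first failure. This is a sketch under stated assumptions: the run and restart_graid helpers and the DRY_RUN variable are hypothetical conveniences, not part of the Graid tooling.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Stop Graid, reset the GPUs, and start Graid again, aborting on the first error.
restart_graid() {
    run sudo systemctl stop graid &&
    run sudo nvidia-smi -r &&
    run sudo systemctl start graid
}

# Example (prints the three commands without executing them):
#   DRY_RUN=1 restart_graid
```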

    8. Verify Graid Service Status

    Check that the Graid service is running without issues.

    sudo systemctl status graid


    Additional Notes

    • Graid Pre-install Script Default Driver Version:

      • The Graid pre-install script installs NVIDIA driver version 550.67 by default.
      • If this version does not match the version required by the NVIDIA Fabric Manager, you must uninstall it and install the correct version as outlined above.
    • Importance of NVIDIA Fabric Manager:

      • The Fabric Manager is essential for systems using NVSwitch or NVLink technology, as it manages the high-speed GPU interconnects.
      • Without it, GPU communication required by the Graid service cannot be established.
    • Version Compatibility:

      • Ensure that the NVIDIA driver and Fabric Manager versions match exactly (e.g., both are 550.90.07).
      • Mismatched versions can lead to incompatibility and service failures.
    • Updating Repositories:

      • If the required Fabric Manager version is not available, update your package repositories or download the package directly from NVIDIA's website.
    • Consult Documentation:



    By following these steps, you should resolve the Graid service startup issue on systems with NVSwitch or NVLink GPUs, ensuring that the Graid service operates correctly with the appropriate NVIDIA driver and dependencies.