Resolving Graid Service Startup Issue on Systems with NVSwitch/NVLink GPUs

Environment

RAID Model: SR1000, SR1010, etc.

Host Hardware: AMD/Intel

Operating System: Linux (Ubuntu, RHEL-based, SUSE)


Issue

On systems equipped with NVIDIA NVSwitch or NVLink GPUs, the Graid service (graid) fails to initialize and start properly. This issue specifically affects high-performance computing (HPC) and data center environments that use NVIDIA's GPU interconnect technologies.

Root Cause

The primary cause of this failure is the absence of the NVIDIA Fabric Manager, a critical software component required for managing NVSwitch and NVLink interconnects. The Fabric Manager is essential for:
  1. Initializing and configuring the NVSwitch/NVLink fabric
  2. Managing GPU-to-GPU communication over the high-speed interconnects
  3. Optimizing data routing and load balancing across the GPU cluster

Resolution

The Graid pre-install script installs NVIDIA driver version 550.67 by default, which may not match the NVIDIA Fabric Manager version available in the package repositories (e.g., nvidia-fabricmanager-550 supports 550.90).
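One way to check for a mismatch before changing anything is to compare the installed driver version with the Fabric Manager version the repository would install. The following is a minimal sketch, not part of the Graid or NVIDIA tooling: all helper names are hypothetical, and candidate_package_version assumes an Ubuntu host with apt.

```shell
#!/bin/sh
# Hypothetical helpers for comparing driver and Fabric Manager versions.

# Reduce a package version such as "550.90.07-1" to its upstream part "550.90.07".
normalize_version() {
    nv_v="$1"
    nv_v="${nv_v#*:}"     # drop an epoch prefix such as "1:" if present
    nv_v="${nv_v%%-*}"    # drop the packaging revision after the first "-"
    printf '%s\n' "$nv_v"
}

# Succeed only when both versions are identical after normalization.
versions_match() {
    [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

# Driver version reported by the first GPU.
installed_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Version apt would install for the given package name (Ubuntu only).
candidate_package_version() {
    apt-cache policy "$1" | awk '/Candidate:/ {print $2}'
}

# Example (not run automatically):
#   versions_match "$(installed_driver_version)" \
#       "$(candidate_package_version nvidia-fabricmanager-550)" || echo "mismatch"
```

If the comparison reports a mismatch, continue with the driver removal and reinstallation steps that follow.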

To resolve this issue, install the NVIDIA Fabric Manager before starting the GRAID service. Follow the steps below:

  • Uninstall the default NVIDIA driver installed by the Graid pre-install script if there is a version mismatch.
  • Install the compatible NVIDIA driver version.
  • Install the NVIDIA Fabric Manager corresponding to the driver version.
  • Adjust the GRAID service dependencies to ensure it starts after the Fabric Manager.
  • Restart the GRAID service and verify its status.

    1. Remove the Default NVIDIA Driver (Version 550.67)

    If the NVIDIA driver version installed by the GRAID pre-install script (550.67) does not match the version required by the NVIDIA Fabric Manager (e.g., 550.90), you need to uninstall it.


    sudo /usr/bin/nvidia-uninstall -s -q
    • The -s option runs the uninstaller in silent mode (no prompts; only errors are printed).
    • The -q option (--no-questions) accepts the default answer for every prompt.
    2. Reboot the System
    Restart the system to ensure the driver is completely removed.
    sudo reboot

    3. Install the Compatible NVIDIA Driver

    a. Download the Required NVIDIA Driver

      Note: Replace <driver-version> with the required version (e.g., 550.90.07).

    Example:
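A minimal sketch of the download follows. The base URL is an assumption about how NVIDIA publishes data center driver runfiles and should be confirmed on NVIDIA's driver download page before use; the helper names are hypothetical. The file name matches the one expected by the install command in the next sub-step.

```shell
#!/bin/sh
# Assumption: data center driver runfiles are published under this base URL.
NVIDIA_BASE_URL="${NVIDIA_BASE_URL:-https://us.download.nvidia.com/tesla}"

# Runfile name for a given driver version.
driver_runfile() {
    printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Full download URL for a given driver version (assumed layout).
driver_url() {
    printf '%s/%s/%s\n' "$NVIDIA_BASE_URL" "$1" "$(driver_runfile "$1")"
}

# Example (not run automatically):
#   wget "$(driver_url 550.90.07)"
```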

    b. Install the NVIDIA Driver

    sudo bash NVIDIA-Linux-x86_64-<driver-version>.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Example:
    sudo bash NVIDIA-Linux-x86_64-550.90.07.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Installer Options Explanation:
    • -s: Silent installation mode.
    • --no-systemd: Skip installing systemd services.
    • --no-opengl-files: Do not install OpenGL libraries (suitable for headless servers).
    • --no-nvidia-modprobe: Do not install the nvidia-modprobe utility.
    • --dkms: Register the kernel modules with DKMS so they are rebuilt automatically when the kernel is updated.
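After the installer finishes, it can be worth confirming that every GPU reports the expected driver version before installing the Fabric Manager. The sketch below assumes nvidia-smi is on the PATH; both function names are hypothetical.

```shell
#!/bin/sh
# Read one driver version per line on stdin. Succeed only if at least one
# line is present and every line equals the expected version ($1).
all_versions_equal() {
    awk -v want="$1" '
        { seen = 1; if (($0 "") != (want "")) bad = 1 }   # force string comparison
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Hypothetical wrapper: query all GPUs and compare with the expected version.
check_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader |
        all_versions_equal "$1"
}

# Example (not run automatically):
#   check_driver_version 550.90.07 && echo "driver OK"
```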

    4. Install NVIDIA Fabric Manager

    Ensure that the Fabric Manager version matches the NVIDIA driver version installed.

    a. Update Package Repository

    Ubuntu
    sudo apt-get update
    RHEL based
    sudo yum update

    b. Install Fabric Manager Package (refer to fabric-manager-user-guide)

    Note: Replace <driver-branch> with the major branch of your NVIDIA driver version (e.g., 550 for driver 550.90.07).

    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-<driver-branch>
    RHEL based
    sudo dnf module install nvidia-driver:<driver-branch>/fm
    Example:
    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-550
    RHEL based (refer to nvidia-driver-installation-guide)
    sudo rpm --erase gpg-pubkey-7fa2af80*
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean expire-cache
    sudo dnf module install nvidia-driver:550/fm
    Note: If the required version is not available in the repository, download it from the fabric-manager-user-guide.
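The package and module names in the commands above are derived from the driver branch, which is the part of the version before the first dot. A small sketch of that derivation, using hypothetical helper names, could look like this:

```shell
#!/bin/sh
# Major branch of a driver version: "550.90.07" -> "550".
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Ubuntu package name for the branch, following the apt-get example.
fabricmanager_apt_package() {
    printf 'nvidia-fabricmanager-%s\n' "$(driver_branch "$1")"
}

# dnf module spec for the branch, following the RHEL example.
fabricmanager_dnf_module() {
    printf 'nvidia-driver:%s/fm\n' "$(driver_branch "$1")"
}

# Example (not run automatically):
#   sudo apt-get install -y "$(fabricmanager_apt_package 550.90.07)"
```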

    5. Enable and Start NVIDIA Fabric Manager Service
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl start nvidia-fabricmanager
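The Fabric Manager may need a moment to become active after it is started, so a script that continues with the Graid steps can wait for it first. The following is a generic retry sketch; wait_until is a hypothetical helper, and the 60-second timeout in the example is an arbitrary choice.

```shell
#!/bin/sh
# Retry a command about once per second until it succeeds or the number of
# attempts given in $1 is used up. The remaining arguments are the command.
wait_until() {
    wu_remaining="$1"
    shift
    while ! "$@"; do
        wu_remaining=$((wu_remaining - 1))
        [ "$wu_remaining" -gt 0 ] || return 1
        sleep 1
    done
    return 0
}

# Example (not run automatically): wait for the service to report active.
#   wait_until 60 systemctl is-active --quiet nvidia-fabricmanager
```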

    6. Adjust Graid Service Dependencies

    Modify the Graid service unit file to ensure it starts after the Fabric Manager service.

    a. Open the Graid Service Unit File

    sudo vim /lib/systemd/system/graid.service

    b. Edit the [Unit] Section

    • Add nvidia-fabricmanager.service to the Wants and After directives.
    • Remove sysinit.target from the Before directive, if present.

    Updated [Unit] Section Example:

    [Unit]
    Description=Graid Service
    Wants=graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    After=local-fs.target graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    Before=shutdown.target



    c. Save and Exit the Editor
    • Press Esc to switch to command mode.
    • Type :wq and press Enter to save changes and exit.

    d. Reload Systemd Daemon

    sudo systemctl daemon-reload
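Once systemd has reloaded, the new ordering can be confirmed by inspecting the After property of the unit. This sketch assumes a systemd version that supports the --value option of systemctl show; both function names are hypothetical.

```shell
#!/bin/sh
# Succeed if the space-separated unit list in $1 contains the unit named $2.
list_has_unit() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: is graid.service ordered after the unit named in $1?
graid_starts_after() {
    list_has_unit "$(systemctl show -p After --value graid.service)" "$1"
}

# Example (not run automatically):
#   graid_starts_after nvidia-fabricmanager.service && echo "ordering applied"
```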

    7. Restart the Graid Service

    a. Stop the Graid Service

    sudo systemctl stop graid
    b. Reset NVIDIA GPUs
    sudo nvidia-smi -r
    c. Start the Graid Service
    sudo systemctl start graid

    8. Verify Graid Service Status

    Check that the Graid service is running without issues.

    sudo systemctl status graid
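For a scripted check, the state of both services can be summarized in one pass. The sketch below is illustrative: report_units is a hypothetical helper, and the SYSTEMCTL variable exists only so the command can be swapped out during testing.

```shell
#!/bin/sh
# Command used to query systemd; overridable for testing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print "<unit>: <state>" for each unit and fail if any is not "active".
report_units() {
    ru_rc=0
    for ru_unit in "$@"; do
        ru_state="$("$SYSTEMCTL" is-active "$ru_unit" 2>/dev/null || true)"
        [ -n "$ru_state" ] || ru_state="unknown"
        printf '%s: %s\n' "$ru_unit" "$ru_state"
        [ "$ru_state" = "active" ] || ru_rc=1
    done
    return "$ru_rc"
}

# Example (not run automatically): show recent Graid logs if a unit is down.
#   report_units nvidia-fabricmanager graid || journalctl -u graid -b --no-pager | tail -n 50
```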


    Additional Notes

    • GRAID Pre-install Script Default Driver Version:

      • The GRAID pre-install script installs NVIDIA driver version 550.67 by default.
      • If this version does not match the version required by the NVIDIA Fabric Manager, you must uninstall it and install the correct version as outlined above.
    • Importance of NVIDIA Fabric Manager:

      • The Fabric Manager is essential for systems using NVSwitch or NVLink technology, as it manages the high-speed GPU interconnects.
      • Without it, GPU communication required by the GRAID service cannot be established.
    • Version Compatibility:

      • Ensure that the NVIDIA driver and Fabric Manager versions match exactly (e.g., both are 550.90.07).
      • Mismatched versions can lead to incompatibility and service failures.
    • Updating Repositories:

      • If the required Fabric Manager version is not available, update your package repositories or download the package directly from NVIDIA's website.
    • Consult Documentation:



    By following these steps, you should resolve the Graid service startup issue on systems with NVSwitch or NVLink GPUs, ensuring that the Graid service operates correctly with the appropriate NVIDIA driver and dependencies.