Resolving Graid Service Startup Issue on Systems with NVSwitch/NVLink GPUs

Environment

RAID Model: SR1000 or SR1010 etc

Host Hardware: AMD/Intel

Operating System: Linux (Ubuntu/RHEL-based/SUSE)


Issue

On systems equipped with NVIDIA NVSwitch or NVLink GPUs, the Graid service (graid) fails to initialize and start. The issue affects high-performance computing (HPC) and data center environments that use NVIDIA's GPU interconnect technologies.

Root Cause

The primary cause of this failure is the absence of the NVIDIA Fabric Manager, a critical software component required for managing NVSwitch and NVLink interconnects. The Fabric Manager is essential for:
  1. Initializing and configuring the NVSwitch/NVLink fabric
  2. Managing GPU-to-GPU communication over the high-speed interconnects
  3. Optimizing data routing and load balancing across the GPU cluster

Resolution

The Graid pre-install script installs NVIDIA driver version 550.67 by default, which may not match the NVIDIA Fabric Manager version available in the repositories (e.g., nvidia-fabricmanager-550 supports 550.90).
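One way to check for the mismatch described above is to compare the driver version reported by nvidia-smi with the Fabric Manager version you intend to install. The following is a minimal sketch, not part of the Graid tooling: the helper names versions_match and installed_driver_version are hypothetical, and FM_VERSION is a value you supply yourself.

```shell
# Sketch: compare the installed NVIDIA driver version with a target
# Fabric Manager version. Helper names here are illustrative assumptions.

# versions_match DRIVER_VERSION FM_VERSION
# Returns 0 when the two version strings are identical, 1 otherwise.
versions_match() {
    [ "$1" = "$2" ]
}

# installed_driver_version
# Prints the driver version reported by nvidia-smi (first GPU only).
installed_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Example usage (FM_VERSION is an assumed value, replace it with yours):
# FM_VERSION="550.90.07"
# if versions_match "$(installed_driver_version)" "$FM_VERSION"; then
#     echo "Driver and Fabric Manager versions match"
# else
#     echo "Version mismatch: reinstall the driver as described below"
# fi
```

The comparison is an exact string match because the driver and Fabric Manager versions are expected to be identical, not merely on the same branch.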

To resolve this issue, install the NVIDIA Fabric Manager before starting the GRAID service. Follow the steps below:

  • Uninstall the default NVIDIA driver installed by the Graid pre-install script if there is a version mismatch.
  • Install the compatible NVIDIA driver version.
  • Install the NVIDIA Fabric Manager corresponding to the driver version.
  • Adjust the GRAID service dependencies to ensure it starts after the Fabric Manager.
  • Restart the GRAID service and verify its status.

    1. Remove the Default NVIDIA Driver (Version 550.67)

    If the NVIDIA driver version installed by the GRAID pre-install script (550.67) does not match the version required by the NVIDIA Fabric Manager (e.g., 550.90), you need to uninstall it.


    sudo /usr/bin/nvidia-uninstall -s -q
    • The -s option (--silent) runs the uninstaller without prompts and prints only error messages.
    • The -q option (--no-questions) accepts the default answer to every prompt.
    2. Reboot the System
    Restart the system to ensure the driver is completely removed.
    sudo reboot

    3. Install the Compatible NVIDIA Driver

    a. Download the Required NVIDIA Driver

      Note: Replace <driver-version> with the required version (e.g., 550.90.07).

    Example:

    b. Install the NVIDIA Driver

    sudo bash NVIDIA-Linux-x86_64-<driver-version>.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Example:
    sudo bash NVIDIA-Linux-x86_64-550.90.07.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
    Installer Options Explanation:
    • -s: Silent installation mode.
    • --no-systemd: Skip installing systemd services.
    • --no-opengl-files: Do not install OpenGL libraries (suitable for headless servers).
    • --no-nvidia-modprobe: Do not install the nvidia-modprobe utility.
    • --dkms: Enable DKMS module for automatic recompilation with kernel updates.

    4. Install NVIDIA Fabric Manager

    Ensure that the Fabric Manager version matches the NVIDIA driver version installed.

    a. Update Package Repository

    Ubuntu
    sudo apt-get update
    RHEL based
    sudo yum update

    b. Install Fabric Manager Package (refer to fabric-manager-user-guide)

    Note: Replace <driver-version> (Ubuntu) or <driver-branch> (RHEL based) with your NVIDIA driver branch (e.g., 550).

    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-<driver-version>
    RHEL based
    sudo dnf module install nvidia-driver:<driver-branch>/fm
    Example:
    Ubuntu
    sudo apt-get install -y nvidia-fabricmanager-550
    RHEL based (refer to nvidia-driver-installation-guide)
    sudo rpm --erase gpg-pubkey-7fa2af80*
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean expire-cache
    sudo dnf module install nvidia-driver:550/fm
    Note: If the required version is not available in the repository, download it from the fabric-manager-user-guide.

    5. Enable and Start NVIDIA Fabric Manager Service
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl start nvidia-fabricmanager
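Before moving on, it can help to confirm that the Fabric Manager service actually reached the active state. The sketch below polls a command until it succeeds. The wait_until helper and the WAIT_TRIES variable are hypothetical names introduced for illustration; only systemctl is-active is an existing command.

```shell
# Sketch: retry a command once per second until it succeeds.
# wait_until and WAIT_TRIES are illustrative names, not part of any tool.

# wait_until CMD [ARGS...]
# Runs CMD up to WAIT_TRIES times (default 30), sleeping 1 second between
# attempts. Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_until() {
    tries="${WAIT_TRIES:-30}"
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        # Sleep only when another attempt remains.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example usage:
# if wait_until systemctl is-active --quiet nvidia-fabricmanager; then
#     echo "Fabric Manager is active"
# else
#     journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 20
# fi
```

If the service never becomes active, its journal output is usually the quickest place to look for a driver/Fabric Manager version mismatch.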

    6. Adjust Graid Service Dependencies

    Modify the Graid service unit file to ensure it starts after the Fabric Manager service.

    a. Open the Graid Service Unit File

    sudo vim /lib/systemd/system/graid.service

    b. Edit the [Unit] Section

    • Add nvidia-fabricmanager.service to the Wants and After directives.
    • Remove sysinit.target from the Before directive, if present.

    Updated [Unit] Section Example:

    [Unit]
    Description=Graid Service
    Wants=graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    After=local-fs.target graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
    Before=shutdown.target



    c. Save and Exit the Editor
    • Press Esc to switch to command mode.
    • Type :wq and press Enter to save changes and exit.

    d. Reload Systemd Daemon

    sudo systemctl daemon-reload
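As an alternative to editing the packaged unit file, the added Wants and After dependencies can be placed in a systemd drop-in, which is not overwritten when the package is updated. This is a different technique from the one described above, shown as a sketch only: the write_graid_dropin helper is a hypothetical name, and the drop-in directory follows the standard systemd convention rather than anything Graid-specific. Per the systemd.unit documentation, drop-ins can only add dependencies, so removing sysinit.target from Before still requires editing the unit file itself.

```shell
# Sketch: write a systemd drop-in that adds the Fabric Manager dependency.
# write_graid_dropin is an illustrative helper name.

# write_graid_dropin DIR
# Creates DIR if needed and writes override.conf with the extra
# Wants/After entries. Existing entries in graid.service are kept,
# because dependency settings in drop-ins are additive.
write_graid_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Unit]
Wants=nvidia-fabricmanager.service
After=nvidia-fabricmanager.service
EOF
}

# Example usage (run as root):
# write_graid_dropin /etc/systemd/system/graid.service.d
# systemctl daemon-reload
```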

    7. Restart the Graid Service

    a. Stop the Graid Service

    sudo systemctl stop graid
    b. Reset NVIDIA GPUs
    sudo nvidia-smi -r
    c. Start the Graid Service
    sudo systemctl start graid

    8. Verify Graid Service Status

    Check that the Graid service is running without issues.

    sudo systemctl status graid
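Beyond the status output, you may want to confirm that systemd picked up the new ordering. The sketch below checks whether a unit name appears in a space-separated dependency list such as the one printed by systemctl show. The list_contains helper is a hypothetical name used for illustration.

```shell
# Sketch: check that a unit appears in a dependency list.
# list_contains is an illustrative helper name.

# list_contains "SPACE SEPARATED LIST" ITEM
# Returns 0 if ITEM is one of the whitespace-separated entries, 1 otherwise.
list_contains() {
    for entry in $1; do
        if [ "$entry" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

# Example usage:
# after="$(systemctl show -p After --value graid)"
# if list_contains "$after" nvidia-fabricmanager.service; then
#     echo "graid is ordered after the Fabric Manager"
# else
#     echo "Ordering missing: re-check the [Unit] section and run daemon-reload"
# fi
```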


    Additional Notes

    • GRAID Pre-install Script Default Driver Version:

      • The GRAID pre-install script installs NVIDIA driver version 550.67 by default.
      • If this version does not match the version required by the NVIDIA Fabric Manager, you must uninstall it and install the correct version as outlined above.
    • Importance of NVIDIA Fabric Manager:

      • The Fabric Manager is essential for systems using NVSwitch or NVLink technology, as it manages the high-speed GPU interconnects.
      • Without it, GPU communication required by the GRAID service cannot be established.
    • Version Compatibility:

      • Ensure that the NVIDIA driver and Fabric Manager versions match exactly (e.g., both are 550.90.07).
      • Mismatched versions can lead to incompatibility and service failures.
    • Updating Repositories:

      • If the required Fabric Manager version is not available, update your package repositories or download the package directly from NVIDIA's website.
    • Consult Documentation:



    By following these steps, you should resolve the Graid service startup issue on systems with NVSwitch or NVLink GPUs, ensuring that the Graid service operates correctly with the appropriate NVIDIA driver and dependencies.