Environment
RAID Model: SR1000 or SR1010 etc
Host Hardware: AMD/Intel
Operating System: Linux (Ubuntu, RHEL-based, SUSE)
Issue
On systems equipped with NVIDIA NVSwitch or NVLink GPUs, the Graid service (graid) fails to initialize and start properly. This issue specifically affects high-performance computing (HPC) and data center environments that use NVIDIA's GPU interconnect technologies.
Root Cause
The primary cause of this failure is the absence of the NVIDIA Fabric Manager, a critical software component required for managing NVSwitch and NVLink interconnects. The Fabric Manager is essential for:
- Initializing and configuring the NVSwitch/NVLink fabric
- Managing GPU-to-GPU communication over the high-speed interconnects
- Optimizing data routing and load balancing across the GPU cluster
Resolution
The Graid pre-install script installs NVIDIA driver version 550.67 by default, which may not match the NVIDIA Fabric Manager version available in the repositories (e.g., nvidia-fabricmanager-550 supports 550.90).
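A minimal sketch of how such a mismatch could be detected before changing anything. The helper name fm_matches_driver is hypothetical, and the assumption that a packaging suffix such as "-1" can be ignored when comparing versions is not taken from this article.

```shell
# Hypothetical helper: succeeds when the Fabric Manager package version
# (e.g. "550.90.07-1") refers to the same release as the driver ("550.90.07").
# ASSUMPTION: anything after the first "-" is a packaging suffix.
fm_matches_driver() {
    driver_version="$1"
    fm_version="${2%%-*}"   # drop a packaging suffix such as "-1"
    [ "$driver_version" = "$fm_version" ]
}

# Example with literal versions. On a live system the first argument could come from:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
if fm_matches_driver "550.67" "550.90.07-1"; then
    echo "match"
else
    echo "mismatch"
fi
```

On Ubuntu, the candidate Fabric Manager version for the second argument can be read from the output of apt-cache policy nvidia-fabricmanager-550.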
To resolve this issue, install the NVIDIA Fabric Manager before starting the GRAID service. Follow the steps below:
- Uninstall the default NVIDIA driver installed by the Graid pre-install script if there is a version mismatch.
- Install the compatible NVIDIA driver version.
- Install the NVIDIA Fabric Manager corresponding to the driver version.
- Adjust the Graid service dependencies to ensure it starts after the Fabric Manager.
- Restart the Graid service and verify its status.
1. Remove the Default NVIDIA Driver (Version 550.67)
If the NVIDIA driver version installed by the GRAID pre-install script (550.67) does not match the version required by the NVIDIA Fabric Manager (e.g., 550.90), you need to uninstall it.
sudo /usr/bin/nvidia-uninstall -s -q
- The -s option runs the uninstaller in silent mode.
- The -q option (--no-questions) skips interactive prompts by assuming the default answers.
2. Reboot the System
Restart the system to ensure the driver is completely removed.
3. Install the Compatible NVIDIA Driver
a. Download the Required NVIDIA Driver
Note: Replace <driver-version> with the required version (e.g., 550.90.07).
Example:
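A minimal sketch of the download step. The helper name driver_url is hypothetical, and the host and path layout it prints are an assumption that is not stated in this article; confirm the actual link on NVIDIA's driver download page before using it.

```shell
# Hypothetical helper: prints a download URL for the given driver version.
# ASSUMPTION: the host and path layout below are not taken from this article.
driver_url() {
    printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

driver_url 550.90.07
```

If the URL is valid for the chosen version, the installer could then be fetched with wget "$(driver_url 550.90.07)".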
b. Install the NVIDIA Driver
- sudo bash NVIDIA-Linux-x86_64-<driver-version>.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
Example:
- sudo bash NVIDIA-Linux-x86_64-550.90.07.run -s --no-systemd --no-opengl-files --no-nvidia-modprobe --dkms
Installer Options Explanation:
- -s: Silent installation mode.
- --no-systemd: Skip installing systemd services.
- --no-opengl-files: Do not install OpenGL libraries (suitable for headless servers).
- --no-nvidia-modprobe: Do not install the nvidia-modprobe utility.
- --dkms: Enable DKMS module for automatic recompilation with kernel updates.
4. Install NVIDIA Fabric Manager
Ensure that the Fabric Manager version matches the NVIDIA driver version installed.
a. Update Package Repository
Ubuntu
RHEL based
b. Install Fabric Manager Package (refer to the NVIDIA Fabric Manager User Guide)
Note: Replace <driver-version> (or <driver-branch> on RHEL-based systems) with your NVIDIA driver branch (e.g., 550).
Ubuntu
- sudo apt-get install -y nvidia-fabricmanager-<driver-version>
RHEL based
- sudo dnf module install nvidia-driver:<driver-branch>/fm
Example:
Ubuntu
- sudo apt-get install -y nvidia-fabricmanager-550
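The package name uses only the driver branch (550), not the full version (550.90.07). A minimal sketch of deriving it, assuming the branch is the part of the version before the first dot; the helper name fm_package_name is hypothetical.

```shell
# Hypothetical helper: maps a full driver version to the Ubuntu package name
# used in this article, e.g. 550.90.07 -> nvidia-fabricmanager-550.
# ASSUMPTION: the branch is everything before the first dot.
fm_package_name() {
    branch="${1%%.*}"       # keep only the part before the first dot
    printf 'nvidia-fabricmanager-%s\n' "$branch"
}

fm_package_name 550.90.07
```

The result can be passed straight to the installer, for example sudo apt-get install -y "$(fm_package_name 550.90.07)".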
5. Enable and Start NVIDIA Fabric Manager Service
- sudo systemctl enable nvidia-fabricmanager
- sudo systemctl start nvidia-fabricmanager
6. Adjust Graid Service Dependencies
Modify the Graid service unit file to ensure it starts after the Fabric Manager service.
a. Open the Graid Service Unit File
- sudo vim /lib/systemd/system/graid.service
b. Edit the [Unit] Section
- Add nvidia-fabricmanager.service to the Wants and After directives.
- Remove sysinit.target from the Before directive, if present.
Updated [Unit] Section Example:
- [Unit]
- Description=Graid Service
- Wants=graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
- After=local-fs.target graidcore@0.service graidcore@1.service nvidia-fabricmanager.service
- Before=shutdown.target
c. Save and Exit the Editor
- Press Esc to switch to command mode.
- Type :wq and press Enter to save changes and exit.
d. Reload Systemd Daemon
- sudo systemctl daemon-reload
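Before restarting the service, the edit can be sanity-checked. The sketch below is one possible check, not part of the Graid tooling: the helper name unit_orders_after_fm is hypothetical, and it only confirms that nvidia-fabricmanager.service appears on a Wants= line and an After= line of the unit file.

```shell
# Hypothetical helper: succeeds only when the given unit file names
# nvidia-fabricmanager.service on both a Wants= line and an After= line.
unit_orders_after_fm() {
    unit_file="$1"
    grep -Eq '^Wants=.*nvidia-fabricmanager\.service' "$unit_file" &&
    grep -Eq '^After=.*nvidia-fabricmanager\.service' "$unit_file"
}

# Check the unit file edited in step 6a; prints nothing if the file is absent.
if [ -f /lib/systemd/system/graid.service ]; then
    if unit_orders_after_fm /lib/systemd/system/graid.service; then
        echo "graid.service is ordered after nvidia-fabricmanager.service"
    else
        echo "graid.service is missing the Fabric Manager dependency"
    fi
fi
```

To see the values systemd has actually loaded after the reload, systemctl show graid -p Wants -p After prints the effective dependency lists.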
7. Restart the Graid Service
a. Stop the Graid Service
- sudo systemctl stop graid
b. Reset NVIDIA GPUs
c. Start the Graid Service
- sudo systemctl start graid
8. Verify Graid Service Status
Check that the Graid service is running without issues.
- sudo systemctl status graid
By following these steps, you should resolve the Graid service startup issue on systems with NVSwitch or NVLink GPUs, ensuring that the Graid service operates correctly with the appropriate NVIDIA driver and dependencies.