MicroK8s GPU Validator CrashLoopBackOff when Using Graid (SupremeRAID) Cards

Environment

  • Kubernetes distribution: MicroK8s (with the gpu addon)

  • GPU Management: NVIDIA Device Plugin for Kubernetes + NVIDIA GPU Operator

  • Hardware: Systems with both Graid (SupremeRAID) cards and NVIDIA GPUs installed

  • CUDA Toolkit version: 12.8.x


Background: MicroK8s with NVIDIA Device Plugin for Kubernetes

When you enable the gpu addon in MicroK8s, it deploys the NVIDIA device plugin for Kubernetes and the NVIDIA GPU Operator. Together, they manage GPU discovery, driver/runtime installation, and validation across the cluster.

How it works

  1. Enable GPU Addon

    microk8s enable gpu

    This installs:

    • NVIDIA device plugin (DaemonSet) to advertise GPU resources to kubelet.

    • NVIDIA GPU Operator, which deploys supporting components including:

      • nvidia-driver-daemonset

      • nvidia-device-plugin-daemonset

      • nvidia-cuda-validator and nvidia-operator-validator jobs

  2. Node labeling
    Nodes with GPUs are automatically labeled:

    nvidia.com/gpu.present=true
  3. Pod Scheduling
    Workloads can request GPUs using standard Kubernetes resource requests:

    resources:
      limits:
        nvidia.com/gpu: 1

    Pods are scheduled to nodes with GPU resources advertised by the device plugin (a complete example manifest is sketched after this list).

  4. Validator Jobs
    The GPU Operator runs two validator jobs:

    • nvidia-cuda-validator

    • nvidia-operator-validator

    These validators run simple CUDA workloads (e.g., vector addition) to confirm the GPUs are healthy and the runtime environment is set up correctly. A successful run looks like:

    [Vector addition of 50000 elements] Success!
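
To confirm the addon deployed these components, you can list the operator pods and check the node label. The commands below are a minimal sketch using the gpu-operator-resources namespace referenced later in this article:

# Operator components (driver, device plugin, validators) should be Running or Completed
microk8s kubectl get pods -n gpu-operator-resources

# The GPU node label described above
microk8s kubectl get nodes -L nvidia.com/gpu.present

An example pod manifest requesting a single GPU (the pod name and image tag are placeholders; any CUDA-capable image works) can be applied with microk8s kubectl apply -f gpu-test.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-test                                       # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04    # assumed tag; substitute your own
      command: ["nvidia-smi"]                               # prints the GPU(s) allocated to the pod
      resources:
        limits:
          nvidia.com/gpu: 1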

Issue

When running MicroK8s with both SupremeRAID (Graid) cards and NVIDIA GPUs, the validator jobs may fail and enter a CrashLoopBackOff state. The validator logs often show:

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
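
You can confirm you are hitting this issue by checking the validator pod status and pulling the job logs (the same commands are used again in the Verification section):

# Validator pods show CrashLoopBackOff in the STATUS column
microk8s kubectl get pods -n gpu-operator-resources

# The log should contain the "busy or unavailable" error shown above
microk8s kubectl -n gpu-operator-resources logs job/nvidia-cuda-validator --all-containers --tail=200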

Root Cause

  1. Validator sees all GPUs by default
    By default, the validator jobs run with NVIDIA_VISIBLE_DEVICES=all, so they attempt to use every GPU on the host, including the one held exclusively by the Graid service.
    Since that GPU is busy, the validator fails with CUDA-capable device(s) is/are busy or unavailable (see the diagnostic commands after this list).

  2. CUDA 12.8.x + KASLR/HMM bug
    According to the CUDA 12.8.1 Release Notes, certain kernels with KASLR enabled can trigger a UVM HMM initialization failure, causing CUDA to fail during startup. Validator jobs running on such kernels fail even on otherwise available GPUs.
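
One way to confirm which GPU the Graid service is holding is to query the compute processes per GPU. The commands below are a diagnostic sketch using standard nvidia-smi query options; the exact process name the SupremeRAID service appears under may differ on your system.

# Show which processes are using which GPU (the Graid service shows up on the GPU it owns)
nvidia-smi --query-compute-apps=pid,process_name,gpu_uuid --format=csv

# Cross-reference with the full GPU list to find the UUID to exclude
nvidia-smi --query-gpu=index,uuid,name --format=csv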


Resolution

A) Exclude the Graid GPU from the validator using NVIDIA_VISIBLE_DEVICES

Goal: Prevent the validator from touching the GPU used by the Graid card.

  1. Identify GPU UUIDs and find the Graid GPU

    nvidia-smi --query-gpu=index,uuid,name,serial --format=csv

    Example (placeholder values; the GPU occupied by the Graid service is the one to exclude):

    index, uuid, name, serial
    0, GPU-AAAAAAAA-..., <compute GPU>, <serial>
    1, GPU-BBBBBBBB-..., <Graid GPU>, <serial>    # ← Graid GPU
    2, GPU-CCCCCCCC-..., <compute GPU>, <serial>
  2. Patch the validator jobs to whitelist only the compute GPUs

    # nvidia-cuda-validator
    microk8s kubectl -n gpu-operator-resources \
      set env job/nvidia-cuda-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

    # nvidia-operator-validator (if needed)
    microk8s kubectl -n gpu-operator-resources \
      set env job/nvidia-operator-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

    # Restart validator jobs
    microk8s kubectl -n gpu-operator-resources delete job nvidia-cuda-validator --ignore-not-found
    microk8s kubectl -n gpu-operator-resources delete job nvidia-operator-validator --ignore-not-found
  3. Optional: Persist in your workloads
    For workloads that need GPUs, set (a complete pod example follows this section):

    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: "GPU-AAAAAAAA-...,GPU-CCCCCCCC-..."

Do not run GPU workloads with privileged: true. This bypasses NVIDIA_VISIBLE_DEVICES and exposes all GPUs to the container.
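
A complete workload manifest following step 3 might look like the sketch below. The pod name, image tag, and command are placeholders, and the sketch assumes the NVIDIA container runtime set up by the gpu addon is the default runtime for the cluster (otherwise a runtimeClassName such as nvidia may be required):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                                        # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04    # assumed tag; use your workload image
      command: ["nvidia-smi"]                               # placeholder; output should list only whitelisted GPUs
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-AAAAAAAA-...,GPU-CCCCCCCC-..."        # UUIDs from step 1, with the Graid GPU excluded

Apply it with microk8s kubectl apply -f gpu-workload.yaml; the Verification section below reuses this example.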


B) Disable HMM or KASLR to fix CUDA initialization failure

Option 1 – Disable KASLR (NVIDIA recommended):

sudo sed -ri 's/GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="nokaslr /' /etc/default/grub
sudo update-grub
sudo reboot
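
After the reboot, a quick check (assuming a standard GRUB setup) confirms the flag took effect:

# nokaslr should appear in the kernel command line
grep -o nokaslr /proc/cmdline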

Option 2 – Disable HMM for nvidia_uvm (no reboot required):

echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo modprobe -r nvidia_uvm || true
sudo modprobe nvidia_uvm

Afterward, delete the validator jobs so they are recreated (see the check and commands below).
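
To confirm the module option is active, you can read the parameter back from sysfs (the path below is the standard location for nvidia_uvm module parameters, assuming the module exposes it), then re-trigger the validators with the same delete commands used in Resolution A, step 2:

# Should print 1 once nvidia_uvm is reloaded with HMM disabled
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Re-run the validators
microk8s kubectl -n gpu-operator-resources delete job nvidia-cuda-validator --ignore-not-found
microk8s kubectl -n gpu-operator-resources delete job nvidia-operator-validator --ignore-not-found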


Verification

  1. Confirm validator pods are no longer crashing:

    microk8s kubectl get pods -n gpu-operator-resources
  2. Check validator logs for successful CUDA tests:

    microk8s kubectl -n gpu-operator-resources logs job/nvidia-cuda-validator --all-containers --tail=200
  3. Run a simple CUDA workload on a whitelisted GPU to confirm proper initialization (see the example below).
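
If you saved the whitelisted pod sketch from Resolution A as gpu-workload.yaml (both the file name and pod name are placeholders from that example), an end-to-end check might look like:

# Launch the example workload on a whitelisted GPU
microk8s kubectl apply -f gpu-workload.yaml

# Wait for the pod to reach Completed, then inspect its output
microk8s kubectl get pod gpu-workload
microk8s kubectl logs pod/gpu-workload

# Clean up
microk8s kubectl delete pod gpu-workload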


Quick Reference

# List GPU UUIDs
nvidia-smi --query-gpu=index,uuid,name,serial --format=csv

# Patch validator to exclude Graid GPU
microk8s kubectl -n gpu-operator-resources \
  set env job/nvidia-cuda-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

# Disable HMM
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo modprobe -r nvidia_uvm || true
sudo modprobe nvidia_uvm

References

  • NVIDIA CUDA 12.8.1 Release Notes (documents the UVM HMM initialization failure on kernels with KASLR enabled)

Summary:
When using MicroK8s with the NVIDIA device plugin and SupremeRAID, exclude the Graid GPU from the validator jobs and GPU workloads using NVIDIA_VISIBLE_DEVICES, and apply the CUDA HMM/KASLR workaround, to avoid validator CrashLoopBackOff errors.

    • Related Articles

    • [Linux] How to Resolve Graid Driver Failure After NVIDIA Upgrade
    • Graid Performance Benchmarking 2025 - Linux
    • [Linux] Controller Shows "MISSING" After License Application
    • Resolving GPU Allocation Issues for Xorg in Multi-GPU Systems
    • [Linux] OS booting got the error message after GPU DMA allocated