MicroK8s GPU Validator CrashLoopBackOff when Using Graid (SupremeRAID) Cards

Environment

  • Kubernetes distribution: MicroK8s (with the gpu addon)

  • GPU Management: NVIDIA Device Plugin for Kubernetes + NVIDIA GPU Operator

  • Hardware: Systems with both Graid (SupremeRAID) cards and NVIDIA GPUs installed

  • CUDA Toolkit version: 12.8.x


Background: MicroK8s with NVIDIA Device Plugin for Kubernetes

When you enable the gpu addon in MicroK8s, it deploys the NVIDIA device plugin for Kubernetes and the NVIDIA GPU Operator. Together, they manage GPU discovery, driver/runtime installation, and validation across the cluster.

How it works

  1. Enable GPU Addon

    microk8s enable gpu

    This installs:

    • NVIDIA device plugin (DaemonSet) to advertise GPU resources to kubelet.

    • NVIDIA GPU Operator, which deploys supporting components including:

      • nvidia-driver-daemonset

      • nvidia-device-plugin-daemonset

      • nvidia-cuda-validator and nvidia-operator-validator jobs

  2. Node labeling
    Nodes with GPUs are automatically labeled:

    nvidia.com/gpu.present=true
  3. Pod Scheduling
    Workloads can request GPUs using standard Kubernetes resource requests:

    resources:
      limits:
        nvidia.com/gpu: 1

    Pods are scheduled to nodes with GPU resources advertised by the device plugin (a complete example manifest is sketched after this list).

  4. Validator Jobs
    The GPU Operator runs two validator jobs:

    • nvidia-cuda-validator

    • nvidia-operator-validator

    These validators run simple CUDA workloads (e.g., vector addition) to confirm the GPUs are healthy and the runtime environment is set up correctly. A successful run looks like:

    [Vector addition of 50000 elements] Success!
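
To confirm the addon deployed these components, you can list the operator pods and check the node label. The commands below are a minimal sketch using the gpu-operator-resources namespace referenced later in this article:

# Operator components (driver, device plugin, validators) should be Running or Completed
microk8s kubectl get pods -n gpu-operator-resources

# The GPU node label described above
microk8s kubectl get nodes -L nvidia.com/gpu.present

An example pod manifest requesting a single GPU (the pod name and image tag are placeholders; any CUDA-capable image works) can be applied with microk8s kubectl apply -f gpu-test.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-test                                       # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04    # assumed tag; substitute your own
      command: ["nvidia-smi"]                               # prints the GPU(s) allocated to the pod
      resources:
        limits:
          nvidia.com/gpu: 1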

Issue

When running MicroK8s with both SupremeRAID (Graid) cards and NVIDIA GPUs, the validator jobs may fail and enter a CrashLoopBackOff state. The validator logs often show:

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
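
You can confirm you are hitting this issue by checking the validator pod status and pulling the job logs (the same commands are used again in the Verification section):

# Validator pods show CrashLoopBackOff in the STATUS column
microk8s kubectl get pods -n gpu-operator-resources

# The log should contain the "busy or unavailable" error shown above
microk8s kubectl -n gpu-operator-resources logs job/nvidia-cuda-validator --all-containers --tail=200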

Root Cause

  1. Validator sees all GPUs by default
    By default, the validator jobs run with NVIDIA_VISIBLE_DEVICES=all, so they attempt to use every GPU on the host, including the one held exclusively by the Graid service.
    Since that GPU is busy, the validator fails with CUDA-capable device(s) is/are busy or unavailable (see the diagnostic commands after this list).

  2. CUDA 12.8.x + KASLR/HMM bug
    According to the CUDA 12.8.1 Release Notes, certain kernels with KASLR enabled can trigger a UVM HMM initialization failure, causing CUDA to fail during startup. Validator jobs running on such kernels fail even on otherwise available GPUs.
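
One way to confirm which GPU the Graid service is holding is to query the compute processes per GPU. The commands below are a diagnostic sketch using standard nvidia-smi query options; the exact process name the SupremeRAID service appears under may differ on your system.

# Show which processes are using which GPU (the Graid service shows up on the GPU it owns)
nvidia-smi --query-compute-apps=pid,process_name,gpu_uuid --format=csv

# Cross-reference with the full GPU list to find the UUID to exclude
nvidia-smi --query-gpu=index,uuid,name --format=csv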


Resolution

A) Exclude the Graid GPU from the validator using NVIDIA_VISIBLE_DEVICES

Goal: Prevent the validator from touching the GPU used by the Graid card.

  1. Identify GPU UUIDs and find the Graid GPU

    nvidia-smi --query-gpu=index,uuid,name,serial --format=csv

    Example (placeholder values; the GPU occupied by the Graid service is the one to exclude):

    index, uuid, name, serial
    0, GPU-AAAAAAAA-..., <compute GPU>, <serial>
    1, GPU-BBBBBBBB-..., <Graid GPU>, <serial>    # ← Graid GPU
    2, GPU-CCCCCCCC-..., <compute GPU>, <serial>
  2. Patch the validator jobs to whitelist only the compute GPUs

    # nvidia-cuda-validator
    microk8s kubectl -n gpu-operator-resources \
      set env job/nvidia-cuda-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

    # nvidia-operator-validator (if needed)
    microk8s kubectl -n gpu-operator-resources \
      set env job/nvidia-operator-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

    # Restart validator jobs
    microk8s kubectl -n gpu-operator-resources delete job nvidia-cuda-validator --ignore-not-found
    microk8s kubectl -n gpu-operator-resources delete job nvidia-operator-validator --ignore-not-found
  3. Optional: Persist in your workloads
    For workloads that need GPUs, set (a complete pod example follows this section):

    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: "GPU-AAAAAAAA-...,GPU-CCCCCCCC-..."

Do not run GPU workloads with privileged: true. This bypasses NVIDIA_VISIBLE_DEVICES and exposes all GPUs to the container.
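
A complete workload manifest following step 3 might look like the sketch below. The pod name, image tag, and command are placeholders, and the sketch assumes the NVIDIA container runtime set up by the gpu addon is the default runtime for the cluster (otherwise a runtimeClassName such as nvidia may be required):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                                        # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04    # assumed tag; use your workload image
      command: ["nvidia-smi"]                               # placeholder; output should list only whitelisted GPUs
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-AAAAAAAA-...,GPU-CCCCCCCC-..."        # UUIDs from step 1, with the Graid GPU excluded

Apply it with microk8s kubectl apply -f gpu-workload.yaml; the Verification section below reuses this example.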


B) Disable HMM or KASLR to fix CUDA initialization failure

Option 1 – Disable KASLR (NVIDIA recommended):

sudo sed -ri 's/GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="nokaslr /' /etc/default/grub
sudo update-grub
sudo reboot
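
After the reboot, a quick check (assuming a standard GRUB setup) confirms the flag took effect:

# nokaslr should appear in the kernel command line
grep -o nokaslr /proc/cmdline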

Option 2 – Disable HMM for nvidia_uvm (no reboot required):

echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo modprobe -r nvidia_uvm || true
sudo modprobe nvidia_uvm

Afterward, delete the validator jobs so they are recreated (see the check and commands below).
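
To confirm the module option is active, you can read the parameter back from sysfs (the path below is the standard location for nvidia_uvm module parameters, assuming the module exposes it), then re-trigger the validators with the same delete commands used in Resolution A, step 2:

# Should print 1 once nvidia_uvm is reloaded with HMM disabled
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Re-run the validators
microk8s kubectl -n gpu-operator-resources delete job nvidia-cuda-validator --ignore-not-found
microk8s kubectl -n gpu-operator-resources delete job nvidia-operator-validator --ignore-not-found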


Verification

  1. Confirm validator pods are no longer crashing:

    microk8s kubectl get pods -n gpu-operator-resources
  2. Check validator logs for successful CUDA tests:

    microk8s kubectl -n gpu-operator-resources logs job/nvidia-cuda-validator --all-containers --tail=200
  3. Run a simple CUDA workload on a whitelisted GPU to confirm proper initialization (see the example below).
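
If you saved the whitelisted pod sketch from Resolution A as gpu-workload.yaml (both the file name and pod name are placeholders from that example), an end-to-end check might look like:

# Launch the example workload on a whitelisted GPU
microk8s kubectl apply -f gpu-workload.yaml

# Wait for the pod to reach Completed, then inspect its output
microk8s kubectl get pod gpu-workload
microk8s kubectl logs pod/gpu-workload

# Clean up
microk8s kubectl delete pod gpu-workload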


Quick Reference

# List GPU UUIDs
nvidia-smi --query-gpu=index,uuid,name,serial --format=csv

# Patch validator to exclude Graid GPU
microk8s kubectl -n gpu-operator-resources \
  set env job/nvidia-cuda-validator NVIDIA_VISIBLE_DEVICES=GPU-AAAAAAAA-...,GPU-CCCCCCCC-...

# Disable HMM
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo modprobe -r nvidia_uvm || true
sudo modprobe nvidia_uvm

References

  • NVIDIA CUDA 12.8.1 Release Notes (documents the UVM HMM initialization failure on kernels with KASLR enabled)

Summary:
When using MicroK8s with the NVIDIA device plugin and SupremeRAID, exclude the Graid GPU from the validator jobs and GPU workloads using NVIDIA_VISIBLE_DEVICES, and apply the CUDA HMM/KASLR workaround, to avoid validator CrashLoopBackOff errors.

    • Related Articles

    • [Linux] How to Resolve Graid Driver Failure After NVIDIA Upgrade
    • Graid Performance Benchmarking 2025 - Linux
    • [Linux] Controller Shows "MISSING" After License Application
    • Resolving GPU Allocation Issues for Xorg in Multi-GPU Systems
    • [Linux] OS booting got the error message after GPU DMA allocated