Kubernetes distribution: MicroK8s (with the gpu addon)
GPU Management: NVIDIA Device Plugin for Kubernetes + NVIDIA GPU Operator
Hardware: Systems with both Graid (SupremeRAID) cards and NVIDIA GPUs installed
CUDA Toolkit version: 12.8.x
When you enable the gpu addon in MicroK8s, it deploys the NVIDIA device plugin for Kubernetes and the NVIDIA GPU Operator. Together, they manage GPU discovery, driver/runtime installation, and validation across the cluster.
Enable GPU Addon
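For example (in recent MicroK8s releases the addon may be named nvidia, with gpu kept as an alias):

```bash
# Enable the GPU addon; this deploys the device plugin and the GPU Operator.
microk8s enable gpu
```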
This installs:
NVIDIA device plugin (DaemonSet) to advertise GPU resources to kubelet.
NVIDIA GPU Operator, which deploys supporting components including:
nvidia-driver-daemonset
nvidia-device-plugin-daemonset
nvidia-cuda-validator and nvidia-operator-validator jobs
Node Labeling
Nodes with GPUs are automatically labeled:
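GPU Feature Discovery applies labels such as nvidia.com/gpu.present=true and nvidia.com/gpu.count (the exact label set varies by operator version). One way to list them:

```bash
# Show NVIDIA-related labels across all nodes.
microk8s kubectl get nodes --show-labels | tr ',' '\n' | grep nvidia.com
```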
Pod Scheduling
Workloads can request GPUs using standard Kubernetes resource requests:
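A minimal sketch, assuming the default nvidia.com/gpu resource name exposed by the device plugin (the image tag is illustrative):

```bash
# Request one GPU via the device plugin's extended resource.
cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```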
Pods are scheduled to nodes with GPU resources advertised by the device plugin.
Validator Jobs
The GPU Operator runs two validator jobs:
nvidia-cuda-validator
nvidia-operator-validator
These validators run simple CUDA workloads (e.g., vector addition) to confirm the GPUs are healthy and the runtime environment is set up correctly. A successful run looks like:
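One way to inspect the result, assuming the addon's default gpu-operator-resources namespace and pod labels:

```bash
# A passing run shows the CUDA vector-addition workload completing without errors.
microk8s kubectl logs -n gpu-operator-resources -l app=nvidia-cuda-validator
```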
When running MicroK8s with both SupremeRAID (Graid) cards and NVIDIA GPUs, the validator jobs may fail and enter a CrashLoopBackOff state, with the validator logs reporting that the CUDA devices are busy or unavailable.
Validator sees all GPUs by default
By default, the validator jobs run with NVIDIA_VISIBLE_DEVICES=all, meaning they attempt to use every GPU on the host, including the one used exclusively by the Graid service. Since that GPU is busy, the validator returns "CUDA-capable device(s) is/are busy or unavailable".
Workaround: Restrict NVIDIA_VISIBLE_DEVICES
Goal: Prevent the validator from touching the GPU used by the Graid card.
Identify GPU UUIDs and find the Graid GPU
Example:
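The commands below (a sketch) list the GPU UUIDs and then show which GPU the SupremeRAID service is holding; the exact graid process name in the nvidia-smi process list can vary by release:

```bash
# Print every GPU with its UUID.
nvidia-smi -L

# The GPU reserved by SupremeRAID shows the graid service in its process list.
nvidia-smi
```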
Patch the validator jobs to whitelist only compute GPUs
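One way to persist this is through the GPU Operator's ClusterPolicy rather than the validator pods themselves, since the operator reconciles direct pod edits away. The resource name cluster-policy and the spec.validator.env field path are assumptions to verify against your operator version; the UUID is a placeholder:

```bash
# Restrict the validator pods to the compute GPU(s) only (sketch).
microk8s kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"validator":{"env":[{"name":"NVIDIA_VISIBLE_DEVICES","value":"GPU-xxxxxxxx-aaaa-bbbb-cccc-dddddddddddd"}]}}}'
```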
Optional: Persist in your workloads
For workloads that need GPUs, set:
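A sketch of the pattern, with a placeholder UUID and an illustrative image tag. The device plugin may also inject this variable for the GPU it allocates; setting it explicitly simply guarantees the Graid GPU is never exposed:

```bash
cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]   # should list only the whitelisted GPU
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-xxxxxxxx-aaaa-bbbb-cccc-dddddddddddd"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```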
⚠ Do not run GPU workloads with privileged: true. This bypasses NVIDIA_VISIBLE_DEVICES and exposes all GPUs to the container.
CUDA HMM/KASLR Workaround
Option 1 – Disable KASLR (NVIDIA recommended):
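On a GRUB-based host, a sketch (paths assume Debian/Ubuntu; review /etc/default/grub before rebooting):

```bash
# Append nokaslr to the kernel command line, regenerate the GRUB config, and reboot.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&nokaslr /' /etc/default/grub
sudo update-grub
sudo reboot

# After the reboot, confirm the parameter took effect.
grep nokaslr /proc/cmdline
```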
Option 2 – Disable HMM for nvidia_uvm (no reboot required):
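A sketch using the uvm_disable_hmm module parameter, which recent NVIDIA drivers expose; confirm it exists on your driver first, and note the module only reloads while no CUDA workloads are using it:

```bash
# Confirm the parameter is available on this driver version.
modinfo nvidia_uvm | grep -i hmm

# Persist the setting and reload the module.
echo "options nvidia_uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/nvidia-uvm-hmm.conf
sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm
```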
Delete and recreate validator jobs afterward.
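For example (namespace and labels assume the MicroK8s defaults; the operator recreates the pods automatically):

```bash
# Remove the existing validator pods so they restart with the new settings.
microk8s kubectl delete pod -n gpu-operator-resources -l app=nvidia-operator-validator
microk8s kubectl delete pod -n gpu-operator-resources -l app=nvidia-cuda-validator
```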
Confirm validator pods are no longer crashing:
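For example (same namespace assumption as above):

```bash
# Both validators should report Completed or Running rather than CrashLoopBackOff.
microk8s kubectl get pods -n gpu-operator-resources | grep validator
```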
Check validator logs for successful CUDA tests:
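For example:

```bash
# The CUDA validator log should show the vector-addition workload passing.
microk8s kubectl logs -n gpu-operator-resources -l app=nvidia-cuda-validator --tail=20
```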
Run a simple CUDA workload on a whitelisted GPU to confirm proper initialization.
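A minimal test, reusing the workload pattern above with NVIDIA's CUDA vectorAdd sample image (the tag is illustrative and the UUID is a placeholder):

```bash
cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-xxxxxxxx-aaaa-bbbb-cccc-dddddddddddd"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Follow the logs; the sample should finish with the test passing.
microk8s kubectl logs -f cuda-vectoradd-test
```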
✅ Summary:
When using MicroK8s with the NVIDIA device plugin and SupremeRAID, you must exclude the Graid GPU from Kubernetes using NVIDIA_VISIBLE_DEVICES and apply the CUDA HMM/KASLR workaround to avoid validator CrashLoopBackOff errors.