GPU Node Preparation (Nvidia)#
GPU nodes need the following installed and configured:
- Nvidia Container Toolkit
- Nvidia Runtime Class
Check Container Toolkit#
To check that the container toolkit is installed correctly, run the following; you should see the connected GPU devices listed.
docker run --rm --gpus all ubuntu nvidia-smi
#+-----------------------------------------------------------------------------------------+
#| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
#|-----------------------------------------+------------------------+----------------------+
#| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
#| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
#|                                         |                        |               MIG M. |
#|=========================================+========================+======================|
#|   0  NVIDIA GeForce GTX 1050 Ti    Off  |   00000000:1D:00.0 On  |                  N/A |
#|  0%   40C    P0             N/A /  90W  |    1556MiB /  4096MiB  |      0%      Default |
#|                                         |                        |                  N/A |
#+-----------------------------------------+------------------------+----------------------+
#
#+-----------------------------------------------------------------------------------------+
#| Processes:                                                                              |
#|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
#|        ID   ID                                                               Usage      |
#|=========================================================================================|
#+-----------------------------------------------------------------------------------------+
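If Docker is not available on the node, a couple of host-level checks can stand in; this sketch assumes the driver and the container toolkit packages are already installed:
# Confirm the kernel driver works on the host itself
nvidia-smi

# Confirm the container toolkit's CLI can see the GPU
nvidia-container-cli info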
Install Nvidia Runtime Class#
Apply the following YAML:
echo 'apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia' | kubectl apply -f -
The Nvidia runtime class should now appear among the available runtime classes:
kubectl get runtimeclass
#NAME                  HANDLER               AGE
#crun                  crun                  13h
#lunatic               lunatic               13h
#nvidia                nvidia                13h
#nvidia-experimental   nvidia-experimental   13h
#slight                slight                13h
#spin                  spin                  13h
#wasmedge              wasmedge              13h
#wasmer                wasmer                13h
#wasmtime              wasmtime              13h
#wws                   wws                   13h
Enabling GPU Support in Kubernetes#
Before installing the Nvidia K8s Device Plugin, point k3s's embedded containerd at the Nvidia container runtime. Create a containerd config template that registers a runtime named nvidia, matching the handler of the runtime class created above:
echo '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true' > config.toml.tmpl
sudo mv config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo chmod 644 /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s   # on agent-only nodes the service is k3s-agent
Test containerd#
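K3s renders the template into its live containerd config on startup, so one simple sanity check is to confirm the nvidia runtime actually made it into the generated config. The path below assumes a default k3s install:
# The rendered config should now contain the nvidia runtime entries
sudo grep -A 2 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml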
Install the Device Plugin#
Install the Nvidia K8s Device Plugin, which advertises the node's GPUs to the kubelet:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yaml
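To confirm the plugin came up, check its pods in kube-system; in this manifest the daemonset is named nvidia-device-plugin-daemonset, though the name may differ in other versions:
kubectl get pods -n kube-system | grep nvidia-device-plugin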
The NVIDIA device plugin should now advertise the GPU as an allocatable nvidia.com/gpu resource on the node:
kubectl describe node | grep nvidia.com/gpu
You should see nvidia.com/gpu listed under the node's Capacity and Allocatable.
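On a single-GPU node the output typically looks something like:
#  nvidia.com/gpu:     1
#  nvidia.com/gpu:     1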
Deploy a Test Workload#
echo 'apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all' | kubectl apply -f -
If everything is working, the pod will run the benchmark and end up in the Completed state.
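You can watch the pod's status and read the benchmark results from its logs:
kubectl get pod nbody-gpu-benchmark
kubectl logs nbody-gpu-benchmark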