GPU Node Preparation (Nvidia)

GPU nodes will need the following installed; a sketch of the install commands follows the list.

  1. CUDA drivers

  2. Nvidia Container Toolkit

  3. Nvidia Runtime Class
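
As a minimal sketch, assuming an Ubuntu node (the driver package version is just an example matching the 550 driver shown in the output below; the toolkit commands follow NVIDIA's documented apt setup, so adjust for your distro). Item 3, the runtime class, is created later in this guide.

sudo apt-get update && sudo apt-get install -y nvidia-driver-550
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit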

Check Container Toolkit

To verify that the container toolkit is installed correctly, run the following; the output should list the connected GPU devices.

docker run --rm --gpus all ubuntu nvidia-smi
#+-----------------------------------------------------------------------------------------+
#| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
#|-----------------------------------------+------------------------+----------------------+
#| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
#| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
#|                                         |                        |               MIG M. |
#|=========================================+========================+======================|
#|   0  NVIDIA GeForce GTX 1050 Ti     Off |   00000000:1D:00.0  On |                  N/A |
#|  0%   40C    P0             N/A /   90W |    1556MiB /   4096MiB |      0%      Default |
#|                                         |                        |                  N/A |
#+-----------------------------------------+------------------------+----------------------+
#
#+-----------------------------------------------------------------------------------------+
#| Processes:                                                                              |
#|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
#|        ID   ID                                                               Usage      |
#|=========================================================================================|
#+-----------------------------------------------------------------------------------------+
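
If that command fails with a "could not select device driver" error, Docker has not been configured to use the NVIDIA runtime yet. The toolkit ships nvidia-ctk to wire this up:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker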

Install Nvidia Runtime Class

Apply the following YAML:

echo 'apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia' | kubectl apply -f -

The nvidia runtime class should now appear in the list of available runtime classes:

kubectl get runtimeclass
#NAME                  HANDLER               AGE
#crun                  crun                  13h
#lunatic               lunatic               13h
#nvidia                nvidia                13h
#nvidia-experimental   nvidia-experimental   13h
#slight                slight                13h
#spin                  spin                  13h
#wasmedge              wasmedge              13h
#wasmer                wasmer                13h
#wasmtime              wasmtime              13h
#wws                   wws                   13h

Enabling GPU Support in Kubernetes

Next, configure k3s's embedded containerd to use the NVIDIA container runtime. k3s generates its containerd configuration from a template file, so add the runtime definition as config.toml.tmpl and restart k3s (on agent-only nodes, restart k3s-agent instead):

echo '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true' > config.toml.tmpl
sudo mv config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo chmod 644 /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s

Test containerd
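
One way to exercise the new runtime directly, before involving Kubernetes, is through k3s's embedded containerd via ctr. This is a sketch under a couple of assumptions: the CUDA image tag is just an example, and --runc-binary points ctr at the NVIDIA runtime binary installed above.

sudo k3s ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
sudo k3s ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-test nvidia-smi

If the same nvidia-smi table appears as in the Docker check above, containerd can reach the GPU.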


Install the Device Plugin

Install the Nvidia K8s Device Plugin, which runs as a DaemonSet and advertises the node's GPUs to the Kubernetes scheduler:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
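
Before checking the node, you can confirm the plugin pods rolled out. The label below is the one used in the upstream manifest; verify it against the version you deployed.

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds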

Once the plugin is running, the node should advertise nvidia.com/gpu as a schedulable resource:

kubectl describe node | grep nvidia.com/gpu

You should see nvidia.com/gpu listed under both Capacity and Allocatable (for a single-GPU node, nvidia.com/gpu: 1).

Deploy a Test Workload

Finally, run the CUDA n-body benchmark as a pod that requests one GPU and the nvidia runtime class:

echo 'apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all' | kubectl apply -f -

If everything is working, the pod will run the benchmark and reach the Completed state.
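
To confirm, check the pod status and read its logs; the benchmark prints its performance results (GFLOP/s) on completion.

kubectl get pod nbody-gpu-benchmark
kubectl logs nbody-gpu-benchmark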