GPU Support

It’s been about five years since I last played with GPUs in any sort of container land, and I decided I’d like to get them working inside my cluster.

Let’s just assume you have a computer with an NVIDIA GPU installed and Docker on the new node.

Setup Procedure.

  1. Install the container toolkit. curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

  2. Install your drivers. sudo apt install -y nvidia-container-runtime cuda-drivers-fabricmanager-550 nvidia-headless-550-server

  3. Configure your runtimes. (Only containerd is needed, but I’m doing Docker as well because the GPU computer is a desktop; see the daemon.json sketch after the nvidia-smi output below.) sudo nvidia-ctk runtime configure --runtime=docker && sudo nvidia-ctk runtime configure --runtime=containerd

  4. Test that Docker can use the GPU. sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This should return something like:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:0B:00.0  On |                  N/A |
|  0%   53C    P3             32W /  170W |     316MiB /  12288MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
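
For reference, the Docker half of step 3 should leave an nvidia runtime entry in /etc/docker/daemon.json roughly like the following (a sketch only; the exact contents can vary if you already had a daemon.json):

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
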
  5. Install k3s on your node now (the two environment variables are sketched just below). curl -sfL https://get.k3s.io | K3S_URL=https://${PRIMARY_K3S_IP}:6443 K3S_TOKEN=${K3S_CLUSTER_TOKEN} sh -s -
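
Those two environment variables are placeholders for your own cluster; a minimal sketch of setting them, assuming a stock k3s server, looks like this (the join token lives at /var/lib/rancher/k3s/server/node-token on the server):

# On the existing k3s server: read the join token.
sudo cat /var/lib/rancher/k3s/server/node-token

# On the new GPU node: example values, substitute your own.
export PRIMARY_K3S_IP=192.168.1.10
export K3S_CLUSTER_TOKEN='K10<hash>::server:<password>'   # paste the token from above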

  6. Label the node as having a GPU. kubectl label nodes <your-node-name> gpu=true
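
A quick check that the label landed where you expect:

# Should list only your GPU node(s).
kubectl get nodes -l gpu=true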

  7. Create a RuntimeClass object on the cluster.

---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
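
Apply it however you normally ship manifests (kubectl apply or your GitOps tooling), then confirm it’s registered:

kubectl get runtimeclass nvidia
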
  8. Install the NVIDIA GPU Operator using the Flux manifests below.
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator

---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: nvidia
  namespace: gpu-operator
spec:
  interval: 1h
  url: https://helm.ngc.nvidia.com/nvidia

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  valuesFrom:
    - kind: ConfigMap
      name: values
      valuesKey: values.yaml
  interval: 5m
  chart:
    spec:
      chart: gpu-operator
      version: 23.9.2
      sourceRef:
        kind: HelmRepository
        name: nvidia
        namespace: gpu-operator
      interval: 1m
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: values
  namespace: gpu-operator
data:
  values.yaml: |-
    toolkit:
      env:
        - name: CONTAINERD_CONFIG
          value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
        - name: CONTAINERD_RUNTIME_CLASS
          value: nvidia
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"
  9. Test your new toy with the nbody problem.
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF

kubectl logs nbody-gpu-benchmark -n default

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3060]
28672 bodies, total time for 10 iterations: 22.099 ms
= 372.001 billion interactions per second
= 7440.026 single-precision GFLOP/s at 20 flops per interaction
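
Putting it all together for your own workloads: below is a minimal sketch (the name and image tag are placeholders, not anything from the steps above) that combines the gpu=true node label, the nvidia RuntimeClass, and a GPU resource request.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-app                  # placeholder name
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-gpu-app
  template:
    metadata:
      labels:
        app: my-gpu-app
    spec:
      runtimeClassName: nvidia      # RuntimeClass from step 7
      nodeSelector:
        gpu: "true"                 # label from step 6
      containers:
      - name: app
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
        command: ["sleep", "infinity"]                        # placeholder; replace with your workload
        resources:
          limits:
            nvidia.com/gpu: 1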