How to customize compute pod affinity, nodeSelector and tolerations
Context
In some cases, you may want to spawn the compute functions on a node other than the one hosting the worker, for instance when you want to dynamically provision a node with a GPU to run compute functions.
Warning
If you want to spawn the compute functions on a node different from the one hosting the worker, your provider must be able to provision volumes with the ReadWriteMany (RWX) access mode, and you need to set .Values.worker.accessModes to ["ReadWriteMany"].
We provide a way to set nodeSelector, affinity and tolerations through Helm values.
Default values
The default values for these fields are:
worker:
  ...
  computePod:
    nodeSelector: {}
    ## @param worker.computePod.tolerations Toleration labels for pod assignment
    ##
    tolerations: []
    ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].key Pod affinity rule definition.
    ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].operator Pod affinity rule definition.
    ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].values Pod affinity rule definition.
    ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey Pod affinity rule definition.
    ##
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: statefulset.kubernetes.io/pod-name
                  operator: In
                  values:
                    - $(POD_NAME)
            topologyKey: kubernetes.io/hostname
  ...
We can see that it sets empty values for nodeSelector and tolerations, but defines an affinity rule that forces the compute pod to be spawned on the same node as the worker.
Note
To allow flexibility in the way you define nodeSelector, affinity and tolerations, we provide the following environment variables to use as dependent environment variables:
POD_NAME: the name of the worker pod spawning the compute function
NODE_NAME: the name of the node hosting the worker that spawns the compute function
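For illustration only, here is a minimal sketch of Helm values (not part of the chart defaults) that pins the compute pods to the worker's node with a nodeSelector instead of the default podAffinity rule; it assumes the chart substitutes $(NODE_NAME) in these fields the same way it substitutes $(POD_NAME) in the default affinity rule:
worker:
  computePod:
    # Disable the default podAffinity rule shown above.
    affinity: null
    # Hypothetical example: $(NODE_NAME) resolves to the node hosting the
    # worker, so the compute pod is scheduled on that same node.
    nodeSelector:
      kubernetes.io/hostname: $(NODE_NAME)
    tolerations: []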
On-demand GPU
In this section, we use as an example a dedicated node pool to run the compute function pods, provisioning nodes only when required. This example also features GPU sharing.
Activating GPU
For the following example, we will assume that you have 2 node pools:
node-pool-default, the node pool without GPU
node-pool-gpu, the node pool with GPU
For our example, we assume node-pool-gpu has the following:
- labels:
  node-type=substra-tasks-gpu
  node_pool=node-pool-gpu
- taints:
  node-type=substra-tasks-gpu:NoSchedule
  nvidia.com/gpu=true:NoSchedule
In the values file, we add the following:
worker:
  computePod:
    affinity: null
    nodeSelector:
      node-type: substra-tasks-gpu
    tolerations:
      - effect: NoSchedule
        key: node-type
        operator: Equal
        value: substra-tasks-gpu
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
  ...
We explicitly set affinity to null to force a null value (instead of using the default value that we saw before). The nodeSelector corresponds to the labels we set on the node, and the tolerations correspond to the taints we added to the node.
Activating GPU on Google Kubernetes Engine + Nvidia
Using Terraform:
module "<cluster_name>" {
source = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
...
node_pools = [
...
{
...
name = "node-pool-gpu"
autoscaling = true
min_count = 0
max_count = 3
image_type = "COS_CONTAINERD"
auto_upgrade = true
accelerator_type = <accelerator_type>
gpu_driver_version = "LATEST"
accelerator_count = <accelerator_count>
...
},
...
]
...
}
with the following values:
<cluster_name>: the name of the cluster
<accelerator_type>: the ID of the GPU to use; you can find the list here
<accelerator_count>: the number of GPUs to attach to your node pool
We set autoscaling = true and min_count = 0 to allow having no GPU node when none is in use.
Warning
At this stage, your GPU will be available to only one pod at a time. In Substra, we keep the pods up until the end of the compute plan, and each function creates a pod. If you want to share the GPU between pods, please read the following section.
Other cloud providers
For other cloud providers, we recommend reading your provider's documentation directly. If you are using an NVIDIA GPU, you can read the reference on sharing a GPU between pods (time-slicing and Multi-Instance GPU (MIG)).
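As a rough illustration, the snippet below shows the time-slicing sharing configuration format documented for the NVIDIA Kubernetes device plugin / GPU Operator; the resource name and replicas value are examples, and how this configuration is applied (ConfigMap, Helm values, or a cloud-specific setting) depends on your setup:
# Example NVIDIA device plugin sharing configuration (time-slicing).
# With replicas: 4, each physical GPU is advertised as 4 schedulable
# nvidia.com/gpu resources, so up to 4 pods can share one GPU (no isolation).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4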
Other GPU providers
We did not test with other providers, but our understanding is that:
ROCm allows GPU sharing without isolation out of the box