Job concurrency in Kubernetes: LXD and CPU pinning to the rescue
Samuel Cozannet
on 27 March 2017
Tags: GPU , kubernetes , machine learning , TensorFlow
A few days ago, someone shared with me a project to run video transcoding jobs in Kubernetes.
During her tests, made on a default Kubernetes installation on bare metal servers with 40 cores & 512GB RAM, she allocated 5 full CPU cores to each of the transcoding pods, then scaled up to 6 concurrent tasks per node, thus loading the machines at 75% (on paper). The expectation was that these jobs would run at the same speed as a single task.
The result was a slightly underwelming: while concurrency was going up, performance on individual task was going down. At maximum concurrency, they actually observed 50% decrease on single task performance.
I did some research to understand this behavior. It is referenced in several Kubernetes issues such as #10570,#171, in general via a Google Search. The documentation itself sheds some light on how the default scheduler work and why the performance can be impacted by concurrency on intensive tasks.
There are different methods to allocate CPU time to containers:
- CPU Pining: each container gets a set of cores
If the host has enough CPU cores available, allocate 5 “physical cores” that will be dedicated to this pod/container
- Temporal slicing: considering the host has N cores collectively representing an amount of compute time which you allocate to containers. 5% of CPU time means that for every 100ms, 5ms of compute are dedicated to the task.
Temporal slicing, each container gets allocated randomly on all nodesObviously, pining CPUs can be interesting for some specific workloads but has a big problem of scale for the simple reason you could not run more pods than you have cores in your cluster.
As a result, Docker defaults to the second one, which also ensures you can have less than 1 CPU allocated to a container.
This has an impact on performance which also happens in HPC or any CPU intensive task.
Can we mitigate this risk? Maybe. Docker provides the cpuset option at the engine level. It’s not however leveraged by Kubernetes. However, LXD containers have the ability to be pined to physical cores via cpusets, in an automated fashion, as explained in this blog post by @stgraber.
This opens 2 new options for scheduling our workloads:
- Slice up our hosts in several LXD Kubernetes Workers and see if pining CPUs for workers can help up;
- Include a “burst” option with the native Kubernetes resource primitives, and see if that can help maximise compute throughput in our cluster.
Let’s see how these compare!
TL;DR
You don’t all have the time to read the whole thing so in a nutshell:
- If you always allocate less than 1 CPU to your pods, concurrency doesn’t impact CPU-bound performance;
- If you know in advance your max concurrency and it is not too high, then adding more workers with LXD and CPU pinning them always gets you better performance than native scheduling via Docker;
- The winning strategy is always to super provision CPU limits to the max so that every bit of performance is allocated instantly to your pods
Note: these results are in AWS, where there is a hypervisor between the metal and the units. I am waiting for hardware with enough cores to complete the task. If you have hardware you’d like to throw at this, be my guest and I’ll help you run the tests.
The Plan
In this blog post, we will do the following:
- Setup various Kubernetes clusters: pure bare metal, pure cloud, in LXD containers with strict CPU allocation.
- Design a minimalistic Helm chart to easily create parallelism
- Run benchmarks to scale concurrency (up to 32 threads/node)
- Extract and process logs from these runs to see how concurrency impacts performance per core
Requirements
For this blog post, it is assumed that:
- You are familiar with Kubernetes
- You have notions of Helm charting or of Go Templates, as well as using Helm to deploy stuff
- Having preliminary knowledge of the Canonical Distribution of Kubernetes (CDK) is a plus, but not required.
- Downloading the code for this post:
git clone https://github.com/madeden/blogposts
cd blogposts/k8s-transcode
Methodology
Our benchmark is a transcoding task. It uses a ffmpeg workload, designed to minimize time to encode by exhausting all the resources allocated to compute as fast as possible. We use a single video for the encoding, so that all transcoding tasks can be compared. To minimize bottlenecks other than pure compute, we use a relatively low bandwidth video, stored locally on each host.
The transcoding job is run multiple times, with the following variations:
- CPU allocation from 0.1 to 7 CPU Cores
- Memory from 0.5 to 8GB RAM
- Concurrency from 1 to 32 concurrent threads per host
- (Concurrency * CPU Allocation) never exceeds the number of cores of a single host
We measure for each pod how long the encoding takes, then look at correlations between that and our variables.
Charting a simple transcoder
Transcoding with ffmpeg and Docker
When I want to do something with a video, the first thing I do is call my friend Ronan. He knows everything about everything for transcoding (and more)!
So I asked him something pretty straightforward: I want the most CPU intensive ffmpeg transcoding one liner you can think of.
He came back (in less than 30min) with not only the one liner, but also found a very neat docker image for it, kudos to Julien for making this. All together you get:
docker run --rm -v $PWD:/tmp jrottenberg/ffmpeg:ubuntu \
-i /tmp/source.mp4 \
-stats -c:v libx264 \
-s 1920x1080 \
-crf 22 \
-profile:v main \
-pix_fmt yuv420p \
-threads 0 \
-f mp4 -ac 2 \
-c:a aac -b:a 128k \
-strict -2 \
/tmp/output.mp4
The key of this setup is the -threads 0 which tells ffmpeg that it’s an all you can eat buffet.
For test videos, HD Trailers or Sintel Trailers are great sources. I’m using a 1080p mp4 trailer for source.
Helm Chart
Transcoding maps directly to the notion of Job in Kubernetes. Jobs are batch processes that can be orchestrated very easily, and configured so that Kubernetes will not restart them when the job is done. The equivalent to Deployment Replicas is Job Parallelism.
To add concurrency, I initially use this notion. It proved a bad approach, making things more complicated than necessary to analyze the output logs. So I built a chart that creates many (numbered) jobs each running a single pod, so I can easily track them and their logs.
{{- $type := .Values.type -}}
{{- $parallelism := .Values.parallelism -}}
{{- $cpu := .Values.resources.requests.cpu -}}
{{- $memory := .Values.resources.requests.memory -}}
{{- $requests := .Values.resources.requests -}}
{{- $multiSrc := .Values.multiSource -}}
{{- $src := .Values.defaultSource -}}
{{- $burst := .Values.burst -}}
---
{{- range $job, $nb := until (int .Values.parallelism) }}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ $type | lower }}-{{ $parallelism }}-{{ $cpu | lower }}-{{ $memory | lower }}-{{ $job }}
spec:
parallelism: 1
template:
metadata:
labels:
role: transcoder
spec:
containers:
- name: transcoder-{{ $job }}
image: jrottenberg/ffmpeg:ubuntu
args: [
"-y",
"-i", "/data/{{ if $multiSrc }}source{{ add 1 (mod 23 (add 1 (mod $parallelism (add $job 1)))) }}.mp4{{ else }}{{ $src }}{{ end }}",
"-stats",
"-c:v",
"libx264",
"-s", "1920x1080",
"-crf", "22",
"-profile:v", "main",
"-pix_fmt", "yuv420p",
"-threads", "0",
"-f", "mp4",
"-ac", "2",
"-c:a", "aac",
"-b:a", "128k",
"-strict", "-2",
"/data/output-{{ $job }}.mp4"
]
volumeMounts:
- mountPath: /data
name: hostpath
resources:
requests:
{{ toYaml $requests | indent 12 }}
limits:
cpu: {{ if $burst }}{{ max (mul 2 (atoi $cpu)) 8 | quote }}{{ else }}{{ $cpu }}{{ end }}
memory: {{ $memory }}
restartPolicy: Never
volumes:
- name: hostpath
hostPath:
path: /mnt
---
{{- end }}
The values.yaml file that goes with this is very very simple:
# Number of // tasks
parallelism: 8
# Separator name
type: bm
# Do we want several input files
# if yes, the chart will use source${i}.mp4 with up to 24 sources
multiSource: false
# If not multi source, name of the default file
defaultSource: sintel_trailer-1080p.mp4
# Do we want to burst. If yes, resource limit will double request.
burst: false
resources:
requests:
cpu: "4"
memory: 8Gi
max:
cpu: "25"
That’s all you need. Of course, all sources are in the repo for your usage, you don’t have to copy paste this.
Creating test files
Now we need to generate a LOT of values.yaml files to cover many use cases. The reachable values would vary depending on your context. My home cluster has 6 workers with 4 cores and 32GB RAM each, so I used
- 1, 6, 12, 18, 24, 48, 96 and 192 concurrent jobs (up to 32/worker)
- reverse that for the CPUs (from 3 to 0.1 in case of parallelism=192)
- 1 to 16GB RAM
In the cloud, I had 16 core workers with 60GB RAM, so I did the tests only on 1 to 7 CPU cores per task.
I didn’t do anything clever here, just a few bash loops to generate all my tasks. They are in the repo if needed.
Deploying Kubernetes
MAAS / AWS
The method to deploy on MAAS is the same I described in my previous blog about DIY GPU Cluster. Once you have MAAS installed and Juju configured to talk to it, you can adapt and use the bundle file in src/juju/ via:
juju deploy src/juju/k8s-maas.yaml
for AWS, use the k8s-aws.yaml bundle, which specifies c4.4xlarge as the default instances. When it’s done, download he configuration for kubectl then initialize Helm with
juju show-status kubernetes-worker-cpu --format json | \
jq --raw-output '.applications."kubernetes-worker-cpu".units | keys[]' | \
xargs -I UNIT juju ssh UNIT "sudo wget https://download.blender.org/durian/trailer/sintel_trailer-1080p.mp4 -O /mnt/sintel_trailer-1080p.mp4"
done
juju scp kubernetes-master/0:config ~/.kube/config
helm init
Variation for LXD
LXD on AWS is a bit special, because of the network. It breaks some of the primitives that are frequently used with Kubernetes such as the proxying of pods, which have to go through 2 layers of networking instead of 1. As a result,
- kubectl proxy doesn’t work ootb
- more importantly, helm doesn’t work because it consumes a proxy to the Tiller pod by default
- However, transcoding doesn’t require network access but merely a pod doing some work on the file system, so that is not a problem.
The least expensive path to resolve the issue I found is to deploy a specific node that is NOT in LXD but a “normal” VM or node. This node will be labeled as a control plane node, and we modify the deployments for tiller-deploy and kubernetes-dashboard to force them on that node. Making this node small enough will ensure no transcoding get ever scheduled on it.
I could not find a way to fully automate this, so here is a sequence of actions to run:
juju deploy src/juju/k8s-lxd-c-.yaml
This deploys the whole thing and you need to wait until it’s done for the next step. Closely monitor juju status until you see that the deployment is OK, but flannel doesn’t start (this is expected, no worries).
Then adjust the LXD profile for each LXD node must to allow nested containers. In a near future (roadmapped for 2.3), Juju will gain the ability to declare the profiles it wants to use for LXD hosts. But for now, we need to build that manually:
NB_CORES_PER_LXD=4 #This is the same number used above to deploy
for MACHINE in 1 2
do
./src/bin/setup-worker.sh ${MACHINE} ${NB_CORES_PER_LXD}
done
If you’re watching juju status you will see that flannel suddenly starts working. All good! Now download he configuration for kubectl then initialize Helm with
juju scp kubernetes-master/0:config ~/.kube/config
helm init
We need to identify the Worker that is not a LXD container, then label it as our control plane node:
kubectl label $(kubectl get nodes -o name | grep -v lxd) controlPlane=true
kubectl label $(kubectl get nodes -o name | grep lxd) computePlane=true
Now this is where it become manual we need to edit successively rc/monitoring-influxdb-grafana-v4, deploy/heapster-v1.2.0.1, deploy/tiller-deploy and deploy/kubernetes-dashboard, to add
nodeSelector:
controlPlane: “true”
in the definition of the manifest. Use
kubectl edit -n kube-system rc/monitoring-influxdb-grafana-v4
After that, the cluster is ready to run!
Running transcoding jobs
Starting jobs
We have a lot of tests to run, and we do not want to spend too long managing them, so we build a simple automation around them
cd src
TYPE=aws
CPU_LIST="1 2 3"
MEM_LIST="1 2 3"
PARA_LIST="1 4 8 12 24 48"
for cpu in ${CPU_LIST}; do
for memory in ${CPU_LIST}; do
for para in ${PARA_LIST}; do
[ -f values/values-${para}-${TYPE}-${cpu}-${memory}.yaml ] && \
{ helm install transcoder --values values/values-${para}-${TYPE}-${cpu}-${memory}.yaml
sleep 60
while [ "$(kubectl get pods -l role=transcoder | wc -l)" -ne "0" ]; do
sleep 15
done
}
done
done
done
This will run the tests about as fast as possible. Adjust the variables to fit your local environment
First approach to Scheduling
Without any tuning or configuration, Kubernetes makes a decent job of spreading the load over the hosts. Essentially, all jobs being equal, it spreads them like a round robin on all nodes. Below is what we observe for a concurrency of 12.
NAME READY STATUS RESTARTS AGE IP NODE
bm-12–1–2gi-0–9j3sh 1/1 Running 0 9m 10.1.70.162 node06
bm-12–1–2gi-1–39fh4 1/1 Running 0 9m 10.1.65.210 node07
bm-12–1–2gi-11–261f0 1/1 Running 0 9m 10.1.22.165 node01
bm-12–1–2gi-2–1gb08 1/1 Running 0 9m 10.1.40.159 node05
bm-12–1–2gi-3-ltjx6 1/1 Running 0 9m 10.1.101.147 node04
bm-12–1–2gi-5–6xcp3 1/1 Running 0 9m 10.1.22.164 node01
bm-12–1–2gi-6–3sm8f 1/1 Running 0 9m 10.1.65.211 node07
bm-12–1–2gi-7–4mpxl 1/1 Running 0 9m 10.1.40.158 node05
bm-12–1–2gi-8–29mgd 1/1 Running 0 9m 10.1.101.146 node04
bm-12–1–2gi-9-mwzhq 1/1 Running 0 9m 10.1.70.163 node06
The same spread is realized also for larger concurrencies, and at 192, we observe 32 jobs per host in every case. Some screenshots of kubeUI and Grafana of my tests
KubeUI showing 192 concurrent pods
Compute Cycles at different concurrencies
LXD pining Kubernetes Workers to CPUs
Aoutch! About 100% on the whole machine
Collecting and aggregating results
Raw Logs
This is where it becomes a bit tricky. We could use an ELK stack and extract the logs there, but I couldn’t find a way to make it really easy to measure our KPIs.
Looking at what Docker does in terms of logging, you need to go on each machine and look into /var/lib/docker/containers//-json.log
Here we can see that each job generates exactly 82 lines of log, but only some of them are interesting:
- First line: gives us the start time of the log
{“log”:”ffmpeg version 3.1.2 Copyright © 2000–2016 the FFmpeg developers\n”,”stream”:”stderr”,”time”:”2017–03–17T10:24:35.927368842Z”}
- line 13: name of the source
{“log”:”Input #0, mov,mp4,m4a,3gp,3g2,mj2, from ‘/data/sintel_trailer-1080p.mp4’:\n”,”stream”:”stderr”,”time”:”2017–03–17T10:24:35.932373152Z”}
- last line: end of transcoding timestamp
{“log”:”[aac @ 0x3a99c60] Qavg: 658.896\n”,”stream”:”stderr”,”time”:”2017–03–17T10:39:13.956095233Z”}
For advanced performance geeks, line 64 also gives us the transcode speed per frame, which can help profile the complexity of the video. For now, we don’t really need that.
Mapping to jobs
The raw log is only a Docker uuid, and does not help use very much to understand to what job it relates. Kubernetes gracefully creates links in /var/log/containers/ mapping the pod names to the docker uuid:
bm-1–0.8–1gi-0-t8fs5_default_transcoder-0-a39fb10555134677defc6898addefe3e4b6b720e432b7d4de24ff8d1089aac3a.log
So here is what we do:
- Collect the list of logs on each host:
for i in $(seq 0 1 ${MAX_NODE_ID}); do
[ -d stats/node0${i} ] || mkdir -p node0${i}
juju ssh kubernetes-worker-cpu/${i} "ls /var/log/containers | grep -v POD | grep -v 'kube-system'" > stats/node0${i}/links.txt
juju ssh kubernetes-worker-cpu/${i} "sudo tar cfz logs.tgz /var/lib/docker/containers"
juju scp kubernetes-worker-cpu/${i}:logs.tgz stats/node0${i}/
cd node0${i}/
tar xfz logs.tgz --strip-components=5 -C ./
rm -rf config.v2.json host* resolv.conf* logs.tgz var shm
cd ..
done
2. Extract import log lines (adapt per environment for nb of nodes…)
ENVIRONMENT=lxd
MAX_NODE_ID=1
echo "Host,Type,Concurrency,CPU,Memory,JobID,PodID,JobPodID,DockerID,TimeIn,TimeOut,Source" | tee ../db-${ENVIRONMENT}.csv
for node in $(seq 0 1 ${MAX_NODE_ID}); do
cd node0${node}
while read line; do
echo "processing ${line}"
NODE="node0${node}"
CSV_LINE="$(echo ${line} | head -c-5 | tr '-' ',')" # node it's -c-6 for logs from bare metal or aws, -c-5 for lxd
UUID="$(echo ${CSV_LINE} | cut -f8 -d',')"
JSON="$(sed -ne '1p' -ne '13p' -ne '82p' ${UUID}-json.log)"
TIME_IN="$(echo $JSON | jq --raw-output '.time' | head -n1 | xargs -I {} date --date='{}' +%s)"
TIME_OUT="$(echo $JSON | jq --raw-output '.time' | tail -n1 | xargs -I {} date --date='{}' +%s)"
SOURCE=$(echo $JSON | grep from | cut -f2 -d"'")
echo "${NODE},${CSV_LINE},${TIME_IN},${TIME_OUT},${SOURCE}" | tee -a ../../db-${ENVIRONMENT}.csv
done < links.txt
cd ..
done
Once we have all the results, we load to Google Spreadsheet and look into the results…
Results Analysis
Impact of Memory
Once the allocation is above what is necessary for ffmpeg to transcode a video, memory is a non-impacting variable at the first approximation. However, at the second level we can see a slight increase in performance in the range of 0.5 to 1% between 1 and 4GB allocated.
Nevertheless, this factor was not taken into account.
RAM does not impact performance (or only marginally)
Impact of CPU allocation & Pinning
Regardless of the deployment method (AWS or Bare Metal), there is a change in behavior when allocating less or more than 1 CPU “equivalent”.
Being below or above the line
Running CPU allocation under 1 gives the best consistency across the board. The graph shows that the variations are contained, and what we see is an average variation of less than 4% in performance.
Low CPU per pod gives low influence of concurrencyRunning jobs with CPU request Interestingly, the heatmap shows that the worse performance is reached when ( Concurrency * CPU Counts ) ~ 1. I don’t know how to explain that behavior. Ideas?
If total CPU is about 1 the performance is the worse.
Being above the line
As soon as you allocate more than a CPU, concurrency directly impacts performance. Regardless of the allocation, there is an impact, with concurrency 3.5 leading to about 10 to 15% penalty. Using more workers with less cores will increase the impact, up to 40~50% at high concurrency
As the graphs show, not all concurrencies are made equal. The below graphs show duration function of concurrency for various setups.
AWS with or without LXD, 2 cores / job
With 4 Cores
and 5 cores / jobWhen concurrency is low and the performance is well profiled, then slicing hosts thanks to LXD CPU pinning is always a valid strategy.
By default, LXD CPU-pinning in this context will systematically outperform the native scheduling of Docker and Kubernetes. It seems a concurrency of 2.5 per host is the point where Kubernetes allocation becomes more efficient than forcing the spread via LXD.
However, unbounding CPU limits for the jobs will let Kubernetes use everything it can at any point in time, and result in an overall better performance.
When using this last strategy, the performance is the same regardless of the number of cores requested for the jobs. The below graph summarizes all results:
All results: unbounding CPU cores homogenizes performance
Impact of concurrency on individual performance
Concurrency impacts performance. The below table shows the % of performance lost because of concurrency, for various setups.
Performance is impacted from 10 to 20% when concurrency is 3 or more
Conclusion
In the context of transcoding or another CPU intensive task,
- If you always allocate less than 1 CPU to your pods, concurrency doesn’t impact CPU-bound performance; Still, be careful about the other aspects. Our use case doesn’t depend on memory or disk IO, yours could.
- If you know in advance your max concurrency and it is not too high, then adding more workers with LXD and CPU pinning them always gets you better performance than native scheduling via Docker. This has other interesting properties, such as dynamic resizing of workers with no downtime, and very fast provisioning of new workers. Essentially, you get a highly elastic cluster for the same number of physical nodes. Pretty awesome.
- The winning strategy is always to super provision CPU limits to the max so that every bit of performance is allocated instantly to your pods. Of course, this cannot work in every environment, so be careful when using this, and test if it fits with your use case before applying in production.
These results are in AWS, where there is a hypervisor between the metal and the units. I am waiting for hardware with enough cores to complete the task. If you have hardware you’d like to throw at this, be my guest and I’ll help you run the tests.
Finally and to open up a discussion, a next step could also be to use GPUs to perform this same task. The limitation will be the number of GPUs available in the cluster. I’m waiting for some new nVidia GPUs and Dell hardware, hopefully I’ll be able to put this to the test.
There are some unknowns that I wasn’t able to sort out. I made the result dataset of ~3000 jobs open here, so you can run your own analysis! Let me know if you find anything interesting!
What is Kubernetes?
Kubernetes, or K8s for short, is an open source platform pioneered by Google, which started as a simple container orchestration tool but has grown into a platform for deploying, monitoring and managing apps and services across clouds.
Newsletter signup
Related posts
Canonical accelerates AI Application Development with NVIDIA AI Enterprise
Charmed Kubernetes support comes to NVIDIA AI Enterprise Canonical’s Charmed Kubernetes is now supported on NVIDIA AI Enterprise 5.0. Organisations using...
Canonical Kubernetes enhances AI/ML development capabilities with NVIDIA integrations
In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have surged in importance. This rise can be attributed to a massive influx of data,...
How to build your first model using DSS
GenAI is transforming how we approach technology. This blog explores how you can use Canonical’s Data Science Stack (DSS) to set up your environment and dive...