Overview
While reviewing a library's implementation, I noticed that it inspects the number of CPU cores available to the process and imposes certain restrictions based on that count. That prompted me to look at how Go determines the number of CPU cores.
Golang Implementation
```go
[root@liqiang.io]# cat runtime/debug.go
// runtime.NumCPU() simply returns a cached value:
func NumCPU() int { return int(ncpu) }
```
Here, a global variable is returned. It is initialized once when the process starts up, so, as the comment in the runtime notes, changes to the process's CPU affinity made after startup will not take effect:
```go
[root@liqiang.io]# cat runtime/os_linux.go
func osinit() {
	ncpu = getproccount()
	...
}

func getproccount() int32 {
	...
	r := sched_getaffinity(0, unsafe.Sizeof(buf), &buf[0])
	... ...
	n := int32(0)
	for _, v := range buf[:r] {
		for v != 0 {
			n += int32(v & 1)
			v >>= 1
		}
	}
	if n == 0 {
		n = 1
	}
	return n
}
```
The sched_getaffinity here is not an ordinary Go function but a wrapper around a system call, so we need to look at the assembly. It is quite simple: it moves the arguments into registers and then invokes the system call:
```asm
[root@liqiang.io]# cat runtime/sys_linux_amd64.s
TEXT runtime·sched_getaffinity(SB),NOSPLIT,$0
	MOVQ	pid+0(FP), DI
	MOVQ	len+8(FP), SI
	MOVQ	buf+16(FP), DX
	MOVL	$SYS_sched_getaffinity, AX
	SYSCALL
	MOVL	AX, ret+24(FP)
	RET
```
sched_getaffinity System Call
From the Linux man page (https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html), we can see that this system call returns the process's CPU affinity as a mask. The returned cpuset is in fact a bitmap: a bit set to 1 means the corresponding CPU is available, and 0 means it is not. As the Go code shows, the runtime iterates over the returned buffer and checks each bit one by one (this could also be done with a single population-count CPU instruction), finally arriving at the number of CPU cores the current process is allowed to use.
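We can replicate the same logic in user code. The sketch below is my own illustration, not the runtime's API: it issues sched_getaffinity(2) directly through the raw syscall interface (Linux/amd64 assumed) and counts the set bits with math/bits.OnesCount8, i.e. the population-count instruction mentioned above.

```go
package main

import (
	"fmt"
	"math/bits"
	"syscall"
	"unsafe"
)

// countAffinityCPUs mirrors runtime.getproccount: it asks the kernel
// for the process's affinity bitmap and popcounts the set bits.
func countAffinityCPUs() int {
	// 1024 bytes = 8192 CPU bits, larger than typical kernel masks.
	var buf [1024]byte
	r, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_GETAFFINITY,
		0, // pid 0 means the calling process
		uintptr(len(buf)),
		uintptr(unsafe.Pointer(&buf[0])))
	if errno != 0 {
		return 1 // fall back to one CPU on error, like the runtime
	}
	n := 0
	// r is the number of mask bytes the kernel filled in.
	for _, v := range buf[:r] {
		n += bits.OnesCount8(v) // compiles to POPCNT where available
	}
	if n == 0 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(countAffinityCPUs())
}
```

Unlike the runtime's hand-written assembly, this goes through syscall.RawSyscall, but the kernel-side behavior is identical.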
We can also view the CPU topology that a machine (or container) exposes with a system command:

```shell
[root@liqiang.io]# lscpu | egrep -i 'core.*:|socket'
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
```
Some Simple Caveats
When using Docker, I once wanted to limit the number of CPUs (Limit Docker Container CPU and Memory Usage) and found that Docker has two different options:

- Simple: limit the number of CPU cores, which is easy to understand. Usage:

```shell
[root@liqiang.io]# docker run --cpus=4
```

This means the container is allowed to use 4 cores.

- Complex: limit based on CPU time slices. This is built on CFS scheduling and was the approach used in older versions; newer versions recommend the simpler option. Usage:

```shell
[root@liqiang.io]# docker run --cpu-period=100000 --cpu-quota=200000
```

This means that within each 100 ms scheduling period, the container may use at most 200 ms of CPU time (roughly equivalent to limiting it to 2 cores, though not strictly).
With the second, time-slice-based method, the container still sees all of the host's CPU cores, so Go will also assume it has that many cores available. This is a pitfall to watch out for.
Some System-Level Knowledge
cgroups v2 limit cpuset
Using cgroups, we can restrict the cpuset quite simply. For example, here I use cgroup v2 to demonstrate how to limit the cpuset:
```go
[root@liqiang.io]# cat /tmp/main.go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println(runtime.NumCPU())
}
```

```shell
[root@liqiang.io]# sudo cgcreate -g cpuset:/liqiang2
[root@liqiang.io]# sudo cgexec -g cpuset:/liqiang2 go run /tmp/main.go
16
[root@liqiang.io]# echo "0-1" | sudo tee /sys/fs/cgroup/liqiang2/cpuset.cpus
[root@liqiang.io]# sudo cgexec -g cpuset:/liqiang2 go run /tmp/main.go
2
```
NUMA
Okay, now we know that Go gets the number of available CPU cores from the cpuset. So what is a cpuset, and why does it exist? Once you understand containers, it feels natural that resource isolation between containers should be a standard feature. That is true, but cpuset actually predates the popularity of containers: it was commonly used in NUMA environments. Some high-performance servers have more than one CPU on a single host. For example, on this machine:
```shell
[root@liqiang.io]# lscpu | grep -i numa
NUMA node(s):          2
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
```
Traditionally, the CPU and memory communicated over a high-speed link through the Northbridge; how should that link be designed when there are multiple CPUs? In the NUMA architecture, CPUs and memory have a notion of proximity: each CPU has local memory that is fast to access, while accessing another CPU's local memory is slower. For program efficiency, we therefore want to bind both CPU and memory: pinning a process to specific CPUs and their corresponding memory guarantees the process will not migrate across CPUs, eliminating the penalty of the NUMA architecture.