Overview

While reviewing the implementation of a library, I noticed that it inspects the number of CPU cores available to the process and then imposes certain restrictions based on that number. So I took a look at how Go determines the number of CPU cores.

Golang Implementation

    [root@liqiang.io]# cat main.go
    runtime.NumCPU()
    --> runtime/debug.go: return int(ncpu)

Here, a global variable is returned. That variable is initialized when the process starts up, so, as the comment on NumCPU in the source notes, changes to the CPU affinity made after the process is already running will not take effect:

    [root@liqiang.io]# cat runtime/os_linux.go
    func osinit() {
        ncpu = getproccount()

    ---> func getproccount() int32 {
        r := sched_getaffinity(0, unsafe.Sizeof(buf), &buf[0])
        ... ...
        n := int32(0)
        for _, v := range buf[:r] {
            for v != 0 {
                n += int32(v & 1)
                v >>= 1
            }
        }
        if n == 0 {
            n = 1
        }
        return n
    }
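Since ncpu is computed only once in osinit(), restricting the affinity after startup does not change what runtime.NumCPU() reports, at least on the Go versions this post describes. Below is a minimal sketch to confirm that behavior, using the golang.org/x/sys/unix wrappers (a third-party package, not part of the runtime code shown above; Linux-only):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        fmt.Println("NumCPU at startup:", runtime.NumCPU())

        // Restrict the calling thread to CPUs 0 and 1 after startup
        // (sched_setaffinity with pid 0 applies to the calling thread).
        var set unix.CPUSet
        set.Set(0)
        set.Set(1)
        if err := unix.SchedSetaffinity(0, &set); err != nil {
            panic(err)
        }

        // The affinity mask has changed, but NumCPU() still returns the
        // value that was cached when the process started.
        fmt.Println("NumCPU after SchedSetaffinity:", runtime.NumCPU())
    }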

The sched_getaffinity called in getproccount is not an ordinary Go function but a thin assembly stub around a system call. The assembly is quite simple: it loads the arguments into registers and then invokes the system call:

    [root@liqiang.io]# cat runtime/sys_linux_amd64.s
    TEXT runtime·sched_getaffinity(SB),NOSPLIT,$0
        MOVQ    pid+0(FP), DI
        MOVQ    len+8(FP), SI
        MOVQ    buf+16(FP), DX
        MOVL    $SYS_sched_getaffinity, AX
        SYSCALL
        MOVL    AX, ret+24(FP)
        RET
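For illustration, here is a user-space mirror of that stub: it issues the same raw system call through package syscall (Linux-only; the buffer size is an arbitrary choice for this sketch, not something taken from the runtime):

    package main

    import (
        "fmt"
        "syscall"
        "unsafe"
    )

    func main() {
        // One bit per possible CPU; 1024 bytes covers up to 8192 CPUs,
        // which is plenty for this sketch.
        var mask [1024]byte

        // The same three arguments the assembly stub moves into DI, SI and DX:
        // pid (0 = the calling thread), mask length in bytes, mask pointer.
        n, _, errno := syscall.RawSyscall(
            syscall.SYS_SCHED_GETAFFINITY,
            0,
            uintptr(len(mask)),
            uintptr(unsafe.Pointer(&mask[0])),
        )
        if errno != 0 {
            panic(errno)
        }

        // On success the raw syscall returns the number of mask bytes written.
        fmt.Printf("kernel filled %d bytes; first byte of the mask: %08b\n", n, mask[0])
    }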

sched_getaffinity System Call

From the Linux man page (https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html), we can see that this system call returns the CPU affinity in the form of a mask. The returned cpuset is a bitmap: a bit set to 1 means the corresponding CPU is available, and 0 means it is not. As the Go code shows, the runtime iterates over the returned buffer and checks each bit one by one (this could also be done with an efficient population-count instruction), and thereby computes the number of CPU cores the current process may use.
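To illustrate that parenthetical: the runtime's bit-by-bit loop and the standard library's math/bits popcount compute the same thing. A small self-contained sketch with a made-up mask (the mask bytes below are invented for the example):

    package main

    import (
        "fmt"
        "math/bits"
    )

    // countManual mirrors the runtime's loop: test and shift one bit at a time.
    func countManual(mask []byte) int {
        n := 0
        for _, v := range mask {
            for v != 0 {
                n += int(v & 1)
                v >>= 1
            }
        }
        return n
    }

    // countFast does the same with the standard library's popcount.
    func countFast(mask []byte) int {
        n := 0
        for _, v := range mask {
            n += bits.OnesCount8(v)
        }
        return n
    }

    func main() {
        // A made-up affinity mask: CPUs 0, 1, 4, 5, 7 and 8 are usable.
        mask := []byte{0b10110011, 0b00000001}
        fmt.Println(countManual(mask), countFast(mask)) // 6 6
    }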

We can also check how many CPU cores a machine has with a system command:

    [root@liqiang.io]# lscpu | egrep -i 'core.*:|socket'
    Thread(s) per core:    1
    Core(s) per socket:    1
    Socket(s):             4

Some Simple Caveats

When using Docker, I once wanted to limit the number of CPUs (see Limit Docker Container CPU and Memory Usage) and found that Docker offers two different kinds of options: pinning the container to specific cores with a cpuset (--cpuset-cpus), or limiting how much CPU time the container may consume (--cpus, or --cpu-period/--cpu-quota).

With the second method of limiting CPU time slices, the container still sees all of the host's CPU cores, so Go will also assume it has that many cores. This is a pitfall to watch out for.
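If you only limit time slices, a common workaround is to cap GOMAXPROCS to match the quota yourself; the third-party go.uber.org/automaxprocs package does this automatically by reading the container's CFS quota at init time. A minimal sketch (this package is not part of the standard library):

    package main

    import (
        "fmt"
        "runtime"

        // Imported for its side effect: it adjusts GOMAXPROCS to the
        // container's CPU quota when the program starts.
        _ "go.uber.org/automaxprocs"
    )

    func main() {
        // NumCPU still reports the host's cores, but GOMAXPROCS now
        // reflects the quota configured for the container.
        fmt.Println("NumCPU:    ", runtime.NumCPU())
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }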

Some System-Level Knowledge

Limiting the cpuset with cgroups v2

Using cgroups, we can easily limit the cpuset. For example, here I use cgroup v2 to demonstrate how to restrict it:

    [root@liqiang.io]# cat /tmp/main.go
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        fmt.Println(runtime.NumCPU())
    }
    [root@liqiang.io]# sudo cgcreate -g cpuset:/liqiang2
    [root@liqiang.io]# sudo cgexec -g cpuset:/liqiang2 go run /tmp/main.go
    16
    [root@liqiang.io]# echo "0-1" | sudo tee /sys/fs/cgroup/liqiang2/cpuset.cpus
    [root@liqiang.io]# sudo cgexec -g cpuset:/liqiang2 go run /tmp/main.go
    2
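To cross-check from inside the process what the kernel actually granted, we can also read the cgroup v2 files directly. A rough sketch, assuming a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup with the cpuset controller enabled for the process's cgroup (the file paths are assumptions about a typical setup):

    package main

    import (
        "fmt"
        "os"
        "runtime"
        "strings"
    )

    func main() {
        // On a pure cgroup v2 host, /proc/self/cgroup contains a single line
        // such as "0::/liqiang2"; the effective cpuset for that group lives in
        // /sys/fs/cgroup/<path>/cpuset.cpus.effective.
        raw, err := os.ReadFile("/proc/self/cgroup")
        if err != nil {
            panic(err)
        }
        path := strings.TrimPrefix(strings.TrimSpace(string(raw)), "0::")

        cpus, err := os.ReadFile("/sys/fs/cgroup" + path + "/cpuset.cpus.effective")
        if err != nil {
            panic(err)
        }

        fmt.Println("cpuset.cpus.effective:", strings.TrimSpace(string(cpus)))
        fmt.Println("runtime.NumCPU():     ", runtime.NumCPU())
    }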

NUMA

Okay, now we know that Go gets the number of available CPU cores through the cpuset. But what is a cpuset, and why does it exist? Once you understand containers, it feels natural that resource isolation between containers should include CPUs, and that is correct, but cpusets actually predate the popularity of containers. Before containers, they were commonly used in NUMA environments: some high-performance servers have more than one CPU in a single host. For example, on this machine:

    [root@liqiang.io]# lscpu | grep -i numa
    NUMA node(s):          2
    NUMA node0 CPU(s):     0-23,48-71
    NUMA node1 CPU(s):     24-47,72-95

Traditionally, the CPU talks to memory through a high-speed link (the Northbridge). How should that link be designed when there are multiple CPUs? In the NUMA architecture, CPUs and memory have a notion of distance: each CPU has some local memory that it can access quickly, while accessing memory local to another CPU is slower. So, for the sake of execution efficiency, there is a need to bind both CPUs and memory: by pinning a process to specific CPUs and having it use the corresponding local memory, we ensure the process does not migrate across CPUs, avoiding the cross-node penalty of the NUMA architecture.
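As a sketch of the CPU side of such binding from Go (the CPU numbers below are hypothetical and chosen to match the node0 range shown above; memory placement is a separate knob that would need something like numactl --membind or the mbind(2) syscall):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    // pinToCPUs locks the calling goroutine to its OS thread and restricts
    // that thread to the given CPUs, e.g. the CPUs of one NUMA node.
    func pinToCPUs(cpus ...int) error {
        runtime.LockOSThread()
        var set unix.CPUSet
        for _, c := range cpus {
            set.Set(c)
        }
        // pid 0 applies the mask to the calling thread.
        return unix.SchedSetaffinity(0, &set)
    }

    func main() {
        // Hypothetical example: keep this goroutine on CPUs 0-3, which on
        // the machine above all belong to NUMA node0.
        if err := pinToCPUs(0, 1, 2, 3); err != nil {
            panic(err)
        }
        fmt.Println("pinned to node0 CPUs")
        // ... run the NUMA-sensitive work here ...
    }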