Go 程序关于 PIPE 信号的一个坑

0. 概述

今天在定位一个公司的问题的时候，遇到了一个无头绪的问题，问题比较简单，就是我用 systemd 运行的应用异常退出了，然后那么多的服务就只有某一个应用异常，其他进程都是正常的。而这个异常的进程在 systemd 的状态居然是被 PIPE 信号 kill 掉的，因为觉得这个问题很不错，所以写下来记录一下。

1. 问题描述

在我司的产品中，几乎所有的服务都是通过 systemd 来运行代码的，所以这个进程也不例外，我的 systemd 的配置大概是这样：

[[email protected]]# cat /usr/lib/systemd/system/kube-proxy.service
[Unit]
Description=Oscar.
After=network.target

[Service]
EnvironmentFile=-/etc/default/kube-proxy
ExecStart=/opt/k8s/bin/kube-proxy --config=/etc/kubernetes/kube-proxy-config.yaml --logtostderr=true --v=2
Restart=on-failure

[Install]
WantedBy=multi-user.target

然后发现服务不正常之后，就通过 systemctl 查看进程的状态，然后就傻眼了：

[[email protected]]# systemctl status kube-proxy
● kube-proxy.service - Kubernetes Kube-Proxy Server
   Loaded: loaded (/etc/systemd/system/kube-proxy.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since  Mon 2020-10-19 10:11:49 CST; 2 days ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
  Process: 1024 ExecStart=/opt/k8s/bin/kube-proxy --config=/etc/kubernetes/kube-proxy-config.yaml --logtostderr=true --v=2 (code=killed, signal=PIPE)
 Main PID: 1024 (code=killed, signal=PIPE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

问题就是这样的，没有日志，也没有其他东西，然后就只能靠自己瞎猜了。

2. 发生原因

在查找资料的过程中，看过一篇文章：systemd-journald的SIGPIPE信号BUG，这虽然看上去像是最根本的原因，但是，这里的逻辑不够全面（或者我认为就是一个错误的原因），所以我没有就拿这篇文章作为结论回复客户。

再继续探究下去，直到我看到 Go 1.6 的发布说明中，对写 PIPE 的出错情况进行了修改更正，在 go 1.6（不包含）之前，如果连续 10 次写任何一个 PIPE 出错，都会抛出一个 SIGPIPE 的信号，在 1.6 中对此进行了改进，主要可以认为有两个改进：

对文件异常进行了一层判断和封装（os.PathError 和 os.SyscallError）；
对写 os.Stdout 和 os.Stderr 进行了特殊处理，如果写入的是这两个 IO，那么会抛出一个 SIGPIPE 信号；
如果写入的是非 os.Stdout 和 os.Stderr，并且是未识别的异常，改进成了返回 syscall.EPIPE 给调用者；

On Unix-like systems, when a write to os.Stdout or os.Stderr (more precisely, an os.File opened for file descriptor 1 or 2) fails due to a broken pipe error, the program will raise a SIGPIPE signal. By default this will cause the program to exit; this may be changed by calling the os/signal Notify function for syscall.SIGPIPE. A write to a broken pipe on a file descriptor other 1 or 2 will simply return syscall.EPIPE (possibly wrapped in os.PathError and/or os.SyscallError) to the caller. The old behavior of raising an uncatchable SIGPIPE signal after 10 consecutive writes to a broken pipe no longer occurs.

这样的话，似乎就释然了，因为这里的服务的日志是直接写入标准输出的，所以如果标准输出有异常（这里提示到了：Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.），那么这个用 Go 写的应用就会产生 PIPE 信号（而不是由 systemd-journald 发出来的），然后应用会退出，而 systemd 接受到 PIPE 信号之后会忽略，证据是：

[[email protected]]# systemctl show kube-proxy | grep PIPE
IgnoreSIGPIPE=yes

所以在 systemd 的状态记录中就是显示着这样的历史：Main PID: 1024 (code=killed, signal=PIPE)，这样的话，整个链条就说得通了。

3. 解决方法

既然 Go 自身对 SIGPIPE 有特殊的反馈，并且 SIGPIPE 在 systemd 中是被忽略的，所以自然无法重启，所以这里我就放弃了用 sytemd-journald 的日志，而是将日志输出到指定的文件，例如：

[[email protected]]# /bin/sh -c 'exec /opt/k8s/bin/kube-proxy --config=/etc/kubernetes/kube-proxy-config.yaml --logtostderr=true --v=2'

这样，最终我的 systemd 配置方式就是这样的：

[[email protected]]# cat /usr/lib/systemd/system/kube-proxy.service
[Unit]
Description=Oscar.
After=network.target

[Service]
EnvironmentFile=-/etc/default/kube-proxy
ExecStart=/bin/sh -c 'exec /opt/k8s/bin/kube-proxy --config=/etc/kubernetes/kube-proxy-config.yaml --logtostderr=true --v=2 &>> /var/log/kube/kube-proxy.log'
Restart=on-failure

[Install]
WantedBy=multi-user.target

4. 总结

重点：不要直接在 systemd 中运行 Go 程序，一定要重定向一下标准输出啊！！！

5. Ref

感谢以下参考资料：

Node Exporter killed by SIGPIPE

格物致知

All Posts