A Memo on a Production Incident Caused by a Kafka Migration

Overview

In this article, I review a production incident that followed a Kafka migration at work. The root cause was a lack of rigor in our operational process, but the migration still offers some useful lessons. While I still remember the details, I want to document the sequence of events and the reasons behind the problem.

Problem Description

The issue surfaced while we were deploying a new version of a service. During the canary release, we noticed anomalies in the service. The release contained no business-related changes; we had only upgraded the Go version. Suspecting that the new Go version was at fault, we rolled back the deployment, but the rollback did not help and the failure persisted.

Troubleshooting Process

Under pressure, we began troubleshooting. The process was not complicated. The error logs showed that the service could not connect to Kafka, which immediately reminded us of the recent Kafka migration: the application was still configured with the old Kafka addresses. We updated the application’s configuration, replacing the IP list with DNS records, and redeployed the service. After the redeployment, the anomalies disappeared.
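For illustration, a mismatch like this can be confirmed by comparing the configured addresses with the brokers that are currently alive. The following Go sketch is a minimal example, not the check we actually ran; the function name StaleAddrs and all addresses are hypothetical, and in practice the live list would come from the cluster itself.

```go
package main

import "fmt"

// StaleAddrs returns the configured addresses that do not appear in the
// live broker list. Both inputs are plain "host:port" strings.
// (Hypothetical helper, for illustration only.)
func StaleAddrs(configured, live []string) []string {
	liveSet := make(map[string]struct{}, len(live))
	for _, addr := range live {
		liveSet[addr] = struct{}{}
	}
	var stale []string
	for _, addr := range configured {
		if _, ok := liveSet[addr]; !ok {
			stale = append(stale, addr)
		}
	}
	return stale
}

func main() {
	// Hypothetical addresses: the application still points at the old nodes.
	configured := []string{"old-1:9092", "old-2:9092", "old-3:9092"}
	live := []string{"new-1:9092", "new-2:9092", "new-3:9092"}

	fmt.Println("stale configured addresses:", StaleAddrs(configured, live))
}
```

If every configured address comes back as stale, as in this example, a freshly started instance has nothing left to connect to.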

Cause of the Problem

Kafka Migration Process

Recently, we were preparing to migrate a data center that hosted a Kafka service, so the service had to move to another data center. We chose a seamless migration method that did not require a restart:

1. Create a new set of Kafka nodes (3 nodes) and add them to the old cluster (3 nodes). At this point, clients should detect the change and obtain the latest cluster list (6 nodes).
2. Have the new nodes take over the replicas from the old nodes. During this period, the new and old nodes serve traffic together.
3. Gradually decommission the old nodes. Clients detect each change and update their cluster list accordingly.
4. Once all old nodes are decommissioned, the migration is complete.

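The problem occurred in step 4. We had completed the Kafka cluster migration but forgot to update the application’s Kafka address configuration, which still pointed at the old cluster’s addresses. Instances that were already running kept working, because they had picked up the new cluster list while the old nodes were still alive. A redeployed or restarted instance, however, started from the configured old addresses, which no longer existed, and failed to connect. This also explains why rolling back the Go upgrade did not help: the rollback was itself a redeployment with the same stale configuration.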
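To make the difference between a running client and a restarted one concrete, here is a toy Go simulation. It is only a sketch under simplified assumptions: Cluster, Client, Connect, and Refresh are hypothetical names, not the API of any Kafka library, and a real client refreshes its metadata over the network rather than through a method call.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ErrNoBrokers is returned when none of the known addresses is alive.
var ErrNoBrokers = errors.New("no reachable broker")

// Cluster is a toy stand-in for a Kafka cluster: the set of live broker addresses.
type Cluster struct {
	live map[string]bool
}

func NewCluster(addrs ...string) *Cluster {
	c := &Cluster{live: map[string]bool{}}
	c.Add(addrs...)
	return c
}

func (c *Cluster) Add(addrs ...string) {
	for _, a := range addrs {
		c.live[a] = true
	}
}

func (c *Cluster) Remove(addrs ...string) {
	for _, a := range addrs {
		delete(c.live, a)
	}
}

// Metadata returns the current broker list, but only if addr is itself alive.
func (c *Cluster) Metadata(addr string) ([]string, error) {
	if !c.live[addr] {
		return nil, errors.New("connection refused: " + addr)
	}
	brokers := make([]string, 0, len(c.live))
	for a := range c.live {
		brokers = append(brokers, a)
	}
	sort.Strings(brokers)
	return brokers, nil
}

// Client remembers the brokers it learned from the last metadata refresh.
type Client struct {
	known []string
}

// Connect uses the configured bootstrap addresses only for the first refresh.
func Connect(c *Cluster, bootstrap []string) (*Client, error) {
	cl := &Client{known: bootstrap}
	if err := cl.Refresh(c); err != nil {
		return nil, err
	}
	return cl, nil
}

// Refresh asks the first reachable known broker for the current broker list.
func (cl *Client) Refresh(c *Cluster) error {
	for _, addr := range cl.known {
		if brokers, err := c.Metadata(addr); err == nil {
			cl.known = brokers
			return nil
		}
	}
	return ErrNoBrokers
}

func main() {
	old := []string{"old-1:9092", "old-2:9092", "old-3:9092"} // hypothetical addresses
	cluster := NewCluster(old...)

	running, err := Connect(cluster, old)
	if err != nil {
		fmt.Println("initial connect failed:", err)
		return
	}

	cluster.Add("new-1:9092", "new-2:9092", "new-3:9092") // step 1
	_ = running.Refresh(cluster)                          // running client now knows 6 nodes
	cluster.Remove(old...)                                // steps 3 and 4

	fmt.Println("running client refresh error:", running.Refresh(cluster))

	_, err = Connect(cluster, old) // restart with the stale configuration
	fmt.Println("restarted client error:", err)
}
```

In this simulation the running client survives the decommissioning because it learned the new addresses while the old nodes were still alive, whereas the restarted client only has the configured old addresses and cannot reach any broker.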

Method of Accessing Kafka

A second issue was that the Kafka broker address was configured as a list of cluster IPs, which required a manual update to the application’s configuration whenever the cluster changed. This time, we switched the configuration to DNS records, which lets us rely on the company’s internal Kafka cluster management capabilities: any cluster change automatically updates the DNS records, so the application adapts to Kafka cluster changes without manual intervention.
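As a rough illustration of what a DNS-based configuration buys us, the Go sketch below turns a single DNS name into a list of bootstrap addresses at startup. The record name kafka.internal.example, the port, and the BootstrapAddrs helper are all assumptions made for this example; Kafka clients generally accept a hostname in the bootstrap list directly, so explicit resolution like this is only needed to show the idea or to log what the name currently points at.

```go
package main

import (
	"fmt"
	"net"
)

// lookupFunc has the same signature as net.LookupHost, so a fake resolver
// can be swapped in for tests.
type lookupFunc func(host string) ([]string, error)

// BootstrapAddrs resolves one DNS name into "ip:port" bootstrap addresses.
// (Hypothetical helper, for illustration only.)
func BootstrapAddrs(lookup lookupFunc, host, port string) ([]string, error) {
	ips, err := lookup(host)
	if err != nil {
		return nil, fmt.Errorf("resolve %q: %w", host, err)
	}
	if len(ips) == 0 {
		return nil, fmt.Errorf("resolve %q: no addresses", host)
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, net.JoinHostPort(ip, port))
	}
	return addrs, nil
}

func main() {
	// kafka.internal.example is a placeholder, not a real record.
	addrs, err := BootstrapAddrs(net.LookupHost, "kafka.internal.example", "9092")
	if err != nil {
		fmt.Println("cannot resolve Kafka bootstrap address:", err)
		return
	}
	fmt.Println("bootstrap brokers:", addrs)
}
```

Because the name is resolved each time the service starts, a restart after a cluster change picks up the current nodes without anyone editing the application’s configuration.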

Summary

In this article, I reviewed a Kafka-related incident that I encountered at work. The issue was not complex and came down to carelessness, but the experience of migrating Kafka and managing application configuration offers some valuable lessons, so I documented it for future reference.
