Resolving Critical Issues in a Customer’s ELK Cluster

Recently, we were called in to assist a customer whose business heavily depends on data served by an ELK (Elasticsearch, Logstash, Kibana) cluster. The customer’s 7-node ELK setup had started failing consistently, causing significant disruption to their operations. Our task was not only to restore functionality but also to ensure a long-term, scalable solution.

The Challenge

Upon investigation, we found that the cluster was overwhelmed by more than 15,000 shards. The customer had attempted to implement an Index Lifecycle Management (ILM) policy to move cold data to 3 designated cold nodes. However, the process backfired: it drove CPU usage up sharply and tripped circuit breakers, which ultimately led to constant node failures.
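To put the shard count in perspective, here is a rough back-of-the-envelope sketch in Python. It uses the figures from the 30/08 health output reproduced at the end of this post and compares them with Elasticsearch's default soft limit of 1,000 shards per data node (cluster.max_shards_per_node). We are assuming the default here; the value actually configured on the customer's cluster may have been different.

```python
# Figures copied from the 30/08 cluster health output in this post.
health = {
    "number_of_data_nodes": 7,
    "active_shards": 9420,
    "initializing_shards": 16,
    "unassigned_shards": 7218,
}

# Assumption: Elasticsearch's default cluster.max_shards_per_node (1000).
# The customer's real setting is not stated in this post.
DEFAULT_MAX_SHARDS_PER_NODE = 1000


def shard_pressure(h, limit=DEFAULT_MAX_SHARDS_PER_NODE):
    """Return (total shards, shards per data node, ratio to the per-node limit)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    per_node = total / h["number_of_data_nodes"]
    return total, per_node, per_node / limit


total, per_node, ratio = shard_pressure(health)
print(f"total shards: {total}")
print(f"shards per data node: {per_node:.0f}")
print(f"ratio to default limit: {ratio:.2f}x")
```

Under that assumption, the cluster was carrying well over twice the recommended number of shards per node.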

The situation was critical: there were no valid backups, and the cluster was in no state to take a snapshot. The risk of data loss was high, and urgent action was required.

Our Approach

Given the urgency, we first took snapshots of the virtual machines (VMs) hosting the nodes as an initial safeguard. We then backed up the nodes, ensuring that we could recover the cluster in the event of further failures.

Step 1: Easing the Load

The next step was to reduce the pressure on the cluster. We temporarily removed the replica shards of the data we had already backed up, which gave the strained resources immediate relief. With fewer shards to allocate and recover, the cluster could keep operating without being bogged down by excessive overhead.
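For readers who want to see what removing replicas looks like in practice, here is a minimal Python sketch. It builds, but deliberately does not send, an update-index-settings request that sets number_of_replicas to 0. The host and the index pattern are hypothetical placeholders and are not the customer's values.

```python
import json
import urllib.request


def build_replica_request(base_url, index_pattern, replicas):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index pattern, for illustration only.
req = build_replica_request("http://localhost:9200", "archived-logs-*", 0)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# Sending is left to the operator, for example:
# urllib.request.urlopen(req)
```

Dropping replicas trades redundancy for headroom, so it only makes sense for data that is already safely backed up, and the replica count should be restored once the cluster is healthy again.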

Step 2: Recovery and Relocation

Once the load was reduced, we initiated the recovery process. We allowed Elasticsearch to begin the recovery and relocation of data, enabling the cluster to complete its pending tasks. Gradually, the nodes started rerouting operations and resuming normal functions.
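One simple way to follow this phase is to poll the cluster health API (GET _cluster/health) and wait until nothing is left in motion. The Python sketch below shows such a check, using fields taken from the two health outputs at the end of this post. The function name and the exact stopping criteria are our own illustration, not a description of the tooling used on site.

```python
def recovery_finished(health):
    """True once the cluster is green, no shard is unassigned, initializing,
    or relocating, and the pending task queue is empty."""
    shards_in_motion = (
        health["unassigned_shards"]
        + health["initializing_shards"]
        + health["relocating_shards"]
    )
    return (
        health["status"] == "green"
        and shards_in_motion == 0
        and health["number_of_pending_tasks"] == 0
    )


# Relevant fields from the 30/08 health output.
before = {
    "status": "red",
    "relocating_shards": 0,
    "initializing_shards": 16,
    "unassigned_shards": 7218,
    "number_of_pending_tasks": 198950,
}

# Relevant fields from the health output after the recovery.
after = {
    "status": "green",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "number_of_pending_tasks": 0,
}

print(recovery_finished(before))  # False
print(recovery_finished(after))   # True
```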

Step 3: Restoration of Stability

After a period of stabilization, the cluster returned to a green, fully operational state. However, we knew that this was only a temporary fix. The root cause, an inefficient index and shard architecture, still needed to be addressed to prevent future issues.

Looking Forward: Redesigning for Stability and Scalability

We are now working with the customer to redesign their index and shard architecture. The focus is on efficient resource allocation, ensuring that the cluster remains stable and operational even as the dataset grows. This includes:

Optimizing shard count: Reducing the total number of shards to avoid overwhelming the cluster.
Cluster fine-tuning: Once the new architecture is implemented, the cluster will be fine-tuned for better performance and lower cost.
Enhancing ILM: Revising the Index Lifecycle Management policy to ensure smooth transitions of data between hot, warm, and cold nodes without overloading the system.
Implementing robust backups: Setting up reliable snapshot strategies to avoid future issues related to data recovery.
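As an illustration of the kind of ILM policy this redesign points towards, here is a hedged Python sketch that builds a policy body for PUT _ilm/policy/<name>. Every threshold in it is an assumption chosen for the example, not the customer's final configuration. The box_type node attribute is a hypothetical name for however the cold nodes are tagged, and max_primary_shard_size requires a reasonably recent Elasticsearch version (older versions use max_size instead).

```python
import json


def build_ilm_policy(rollover_size="50gb", rollover_age="30d",
                     warm_after="7d", cold_after="30d", delete_after="365d"):
    """Return an ILM policy body; all default thresholds are illustrative."""
    return {
        "policy": {
            "phases": {
                # Roll over on size or age so shards stay few and evenly sized.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": rollover_size,
                            "max_age": rollover_age,
                        }
                    }
                },
                # Reduce shard and segment counts before data goes cold.
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        "shrink": {"number_of_shards": 1},
                        "forcemerge": {"max_num_segments": 1},
                    },
                },
                # Hypothetical node attribute "box_type" marks the cold nodes.
                "cold": {
                    "min_age": cold_after,
                    "actions": {"allocate": {"require": {"box_type": "cold"}}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = build_ilm_policy()
print(json.dumps(policy, indent=2))
```

Shrinking and force-merging in the warm phase means that far fewer, smaller-overhead shards reach the cold nodes, which is the opposite of moving thousands of shards there in one go.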

Conclusion

This case underscores the importance of both preventative and reactive measures in maintaining critical infrastructure like an ELK cluster. While our immediate actions restored functionality, the long-term success lies in a well-architected solution that evolves with the customer’s data needs.

Cluster health on 30/08:

{
  "cluster_name": "eb-elasticsearch-production-cluster",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 7,
  "number_of_data_nodes": 7,
  "active_primary_shards": 8899,
  "active_shards": 9420,
  "relocating_shards": 0,
  "initializing_shards": 16,
  "unassigned_shards": 7218,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 198950,
  "number_of_in_flight_fetch": 1371,
  "task_max_waiting_in_queue_millis": 11890430,
  "active_shards_percent_as_number": 56.56298787078179
}

Cluster health today:

{
  "cluster_name": "eb-elasticsearch-production-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 5,
  "active_primary_shards": 53,
  "active_shards": 103,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
