Emergency Maintenance - NZ-AKL1-62PE07

Scheduled Maintenance Report for SiteHost

Postmortem

Timeline

On January 5th, at approximately 22:00 NZT, stability issues were detected on the hardware node NZ-AKL1-62PE07. This resulted in a degraded control plane, preventing standard management operations such as starting and stopping virtual machines (VMs).

Our team began investigating the issue immediately. As initial attempts to stabilise the node and restore the control plane were unsuccessful, we scheduled emergency maintenance for 00:30 on January 6th, 2025.

At 00:30 on January 6th, 2025, we were forced to restart the node. Once the node was back online, the guests were powered back on, which was completed at approximately 01:00. However, some issues with the control plane persisted, and our team continued to investigate them.

At approximately 06:40 we initiated a final restart of the node after resolving some lingering issues. Once the node came back up, all VMs were quickly brought online and the control plane was confirmed to be in a healthy state.

By 07:00 on January 6th, all issues had been resolved, and the control plane was fully operational again.

What Happened?

We believe that an internal service locked up on NZ-AKL1-62PE07, which caused our VM control plane to seize up as well, effectively blocking our ability to control VMs on the affected node.

After rebooting the node the first time, we regained access to the control plane, though in a somewhat degraded state. As a result, it took longer than we would have liked to bring all VMs on the node back up.

After working on stabilising the control plane, we needed to perform a final reboot later in the morning to get things back under control. After the second reboot, we had a fully functional control plane.

Improvements Made

Since the incident occurred we have hardened our monitoring systems to better detect early signs of instability or failure. This should improve our response times during similar events, ensuring issues can be addressed promptly and efficiently.

Our recovery processes have been refined to minimise downtime in future incidents. This includes the development of improved processes and targeted training to better equip our team to handle similar situations effectively.

Lastly, improvements have been made to our control plane, reducing the likelihood of a complete loss of management capabilities in the future.

Summary

We apologise for the disruption caused by this incident and appreciate your patience as we worked to resolve the issues.

Our team is committed to learning from this event and implementing the necessary changes to prevent recurrence. While outages are an inherent risk in managing a fleet of hypervisors, we strive to minimise their impact and respond effectively when they occur.

Thank you for your understanding.

Posted Jan 22, 2025 - 17:17 NZDT

Completed

The scheduled maintenance has been completed.

Posted Jan 06, 2025 - 01:00 NZDT

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Jan 06, 2025 - 00:00 NZDT

Scheduled

We have experienced some stability issues with the hardware node NZ-AKL1-62PE07. As a result, we need to perform emergency maintenance to reduce the likelihood of an extended outage. This will involve a reboot of servers on this node.

Posted Jan 05, 2025 - 21:21 NZDT

This scheduled maintenance affected: Cloud Containers (NZ – AKL01) and Linux Virtual Servers (NZ – AKL01).