Hardware Node Outage (Auckland North Shore)

Incident Report for SiteHost

Postmortem

In the early hours of Friday 8 November, we performed scheduled maintenance on some of our customers' servers, migrating them to a new hardware node, NZ-AKL2-62ROG0. The migration completed at 2:36am without incident.

At approximately 8:45am that morning, the destination node (NZ-AKL2-62ROG0) experienced an outage. Our monitoring detected the outage, and our engineers brought the node and its guests back up by 8:58am.

After this initial outage, our team began digging into the root cause. The outage presented as a full lockup of the node, and the logs leading up to it showed nothing untoward. Because we had migrated guests to the node earlier that morning, our initial suspicion was that the load from the influx of guests had overloaded the server and caused the crash. After the restart, performance had improved and looked normal.

At 2:23pm the node crashed again. We recovered the node by 2:29pm but, given the earlier outage, decided to begin a partial rollback of the migration performed earlier that morning: Cloud Container servers that had been migrated would stay on the new node, and all other servers would move back to their original node. We began this process and emailed affected customers, with the switchover scheduled to begin at midnight.

At 3:53pm the node went down again, and the team brought it back up by 4:01pm. A fourth outage occurred at 9:03pm, with service restored by 9:11pm. The emergency maintenance then proceeded after midnight. The node and its guests have remained stable since, and the node continues to host the Cloud Container servers that had previously been migrated across.

We're sorry for the downtime experienced as a result of these outages. We understand the impact an incident like this can have on your business, and we're confident that the node will remain stable, as it has since the partial rollback was completed. The team will be reassessing our maintenance processes to see if any improvements can be made. We will also continue investigating the root cause of the issue and will provide further information as it comes to light.

If you have any questions or concerns related to this incident, please get in touch at support@sitehost.co.nz or give our team a call at 0800 484 537.

Posted Nov 12, 2019 - 17:48 NZDT

Resolved

Servers should now be up and running. We have scheduled emergency maintenance for some affected servers, and the customers concerned will receive emails shortly.

Posted Nov 08, 2019 - 15:22 NZDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 08, 2019 - 14:36 NZDT

Investigating

We have identified a hardware node failure which is causing downtime for some customers. Engineers are currently working to bring this node back online.

Posted Nov 08, 2019 - 14:26 NZDT

This incident affected: Cloud Containers (NZ – AKL02).