Unscheduled Windows Node Outage - Auckland City

Incident Report for SiteHost

Postmortem

On Monday 8th of January at 07:08, one of our Windows hardware nodes (NZ-AKL1-V021D1) suffered multiple drive failures, which caused the storage array to stop working and took customer servers offline.

As soon as our monitoring systems alerted our engineers, they started investigating the cause. From their investigation, the drives appeared to be healthy and the issue appeared to be software related. Based on this information, the decision was made to re-add the drives and rebuild the storage array.

Rebuilding the array took longer than expected due to some complications, but at approximately 08:54 the hardware node was back online and all customer servers were operating again, with the exception of one server, which we worked directly with the customer to bring back online shortly after.

After a few hours of operation, the hardware node was considered to be stable. Unfortunately, at 10:38 on the following day, Tuesday 9th of January, the same drives failed, bringing down the storage array and the same customer servers.

Engineers quickly set about restoring the storage array, excluding the problem drives, and at 11:00 full service was restored. At this stage the decision was made to move all impacted customers to a different hardware node to allow us to investigate further without risking customer outages.

Emergency maintenance was scheduled, and that night at 00:00 we migrated all customers to a separate hardware node, with the maintenance complete by 01:01. In total, these two incidents accounted for roughly three hours of downtime for the impacted customers.
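
The three-hour figure can be checked against the timestamps in this report. The following is a minimal Python sketch, not part of our tooling. It assumes the entire maintenance window counted as downtime and that the migration began at midnight going into Wednesday 10th of January; the names `WINDOWS` and `total_downtime` are illustrative only.

```python
from datetime import datetime, timedelta

# (start, end) windows taken from the times quoted in this report (NZDT).
# Assumption: the whole emergency maintenance window is counted as downtime,
# and "that night at 00:00" is dated 10 January for illustration.
WINDOWS = [
    ("2018-01-08 07:08", "2018-01-08 08:54"),  # first outage
    ("2018-01-09 10:38", "2018-01-09 11:00"),  # second outage
    ("2018-01-10 00:00", "2018-01-10 01:01"),  # emergency migration
]

FMT = "%Y-%m-%d %H:%M"


def total_downtime(windows):
    """Sum the duration of each (start, end) window as a timedelta."""
    total = timedelta()
    for start, end in windows:
        total += datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return total


if __name__ == "__main__":
    print(total_downtime(WINDOWS))
```

Under those assumptions the windows sum to 3 hours 9 minutes (106 + 22 + 61 minutes), which is consistent with the rough figure quoted above.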

While issues like this do happen, we believe we can do better. Most notably we are considering changes to our storage arrays to make them more resilient to drive failures. We will also be improving our internal training and processes to ensure our engineers can recover from these incidents faster.

Finally, we would like to apologise to the impacted customers. We know outages like this are disruptive and not how anyone wanted to start the new year. If you have any questions about this incident or our response, please get in touch; we will be happy to help.

Posted Jan 11, 2018 - 16:01 NZDT

Resolved

We believe all customers impacted by this outage are back online. We will be conducting emergency maintenance to prevent further impact from this node and will be directly contacting impacted customers. We will also post a full post-mortem for these outages in due course.

Posted Jan 09, 2018 - 11:37 NZDT

Identified

Engineers have brought the node back online and are currently bringing customer servers back up. Engineers will now investigate steps to prevent further customer impact until the node is stable again.

Posted Jan 09, 2018 - 10:58 NZDT

Investigating

We have experienced an unscheduled outage on the same node as earlier this week - NZ-AKL1-V021D1 - in the Auckland City datacenter that is impacting a few customers. Our engineers are investigating this at the moment and we will update when we know more.

Posted Jan 09, 2018 - 10:44 NZDT