On Monday 8th of January at 07:08, one of our Windows hardware nodes (NZ-AKL1-V021D1) suffered multiple drive failures, which caused the storage array to stop working and took customer servers offline.
As soon as our monitoring systems alerted our engineers, they started investigating the cause. Their investigation indicated that the drives were healthy and that the issue was software related. Based on this information, the decision was made to re-add the drives and rebuild the storage array.
Rebuilding the array took longer than expected due to some complications, but at approximately 08:54 the hardware node was back online and all customer servers were operating again, with the exception of one server, which we worked directly with the customer to bring back online shortly after.
After a few hours of operation, the hardware node was considered to be stable. Unfortunately, at 10:38 on the following day, Tuesday 9th of January, the same drives failed, bringing down the storage array and the same customer servers.
Engineers quickly set about restoring the storage array, excluding the problem drives, and at 11:00 full service was restored. At this stage the decision was made to move all impacted customers to a different hardware node, allowing us to investigate further without risking customer outages.
Emergency maintenance was scheduled, and that night at 00:00 we migrated all customers to a separate hardware node, with the maintenance completed by 01:01. In total, these two incidents accounted for roughly three hours of downtime for the impacted customers.
While issues like this do happen, we believe we can do better. Most notably, we are considering changes to our storage arrays to make them more resilient to drive failures. We will also be improving our internal training and processes so that our engineers can recover from these incidents faster.
Finally, we would like to apologise to the impacted customers. We know outages like this are disruptive and not how anyone wanted to start the new year. If you have any questions about this incident or our response, please get in touch and we will be happy to help.