On Wednesday 9th January 2019 at approximately 2:30pm we experienced an unscheduled outage on one of our hardware nodes (NZ-AKL2-V4YXX6) in our North Shore datacenter. The outage resulted in customer servers on the node going offline. This initially presented as a failed disk, and our engineers were able to recover and restore service within 10 minutes with no data loss. Our engineers continued to monitor the node for further issues, and at approximately 3:30pm the outage reoccurred. Again this presented as a failed disk, and we began the process of moving customer servers to a new node.
During this process NZ-AKL2-V4YXX6 remained unstable, with periodic outages, until approximately 7:30pm. At that point our engineers were able to stabilise the node sufficiently to continue the migration through the evening and into the early morning of Thursday. By 3:40am on Thursday the migration had completed for all customer servers on the affected node.
Once the customer impact had been resolved, our engineers proceeded to investigate the cause of the outage. We were unable to replicate the fault under initial testing; however, when all disks were replaced with new disks, we immediately saw the outage reoccur. We have now removed the node from the datacenter entirely while we perform more comprehensive tests on the hardware. In the meantime, customer servers continue to be served without issue from the new nodes.
This series of events is unique in our experience: under normal circumstances the loss of a disk does not affect the hardware node in any noticeable way. Our current working theory is that the SATA controller on the hardware node is faulty, though this has yet to be confirmed. If this is indeed the case, it will be the first time we have seen a SATA controller fail on our hardware.
We apologise for the downtime experienced by customers as a result of this incident. We consider the repeated outages and the length of time taken to reach a resolution unacceptable. We will continue to evaluate our hardware fully before deployment, and we are working to improve the time taken to migrate servers between nodes in case of future incidents.
If you have any questions about this incident or need further detail, please get in touch at firstname.lastname@example.org.