Unscheduled Node Outage - North Shore
Incident Report for SiteHost
Postmortem

On Wednesday 9th January 2019 at approximately 2:30pm we experienced an unscheduled outage on one of our hardware nodes (NZ-AKL2-V4YXX6) in our North Shore datacenter. The outage resulted in customer servers on the node going offline. This initially presented as a failed disk and our engineers were able to recover from this and restore service within 10 minutes with no data loss. Our engineers continued to monitor the node for further issues, and at approximately 3:30pm the outage reoccurred. Again this presented as a failed disk and we began the process of moving customer servers to a new node.

During this process NZ-AKL2-V4YXX6 remained unstable, with periodic outages, until approximately 7:30pm. At that stage our engineers were able to largely stabilise the node, which allowed the migration to continue throughout the evening and into the early morning of Thursday. By 3:40am on Thursday the migration had completed for all customer servers on the affected node.

Once the customer impact had been addressed, our engineers proceeded to dig into the cause of the outage. We were unable to replicate the fault under initial testing; however, when all disks were replaced with new disks, we immediately saw the outage reoccur. We have now removed the node from the datacenter entirely while we perform more comprehensive tests on the hardware. In the meantime, customer servers continue to be served without issue from the new nodes.

This series of events has been unique in our experience, as in normal circumstances the loss of a disk won't affect the hardware node in any noticeable way. Our current working theory is that the SATA controller on the hardware node is faulty, though this has yet to be confirmed. If this is indeed the case, it will be the first time we have seen a SATA controller fail on our hardware.

We apologise for the downtime experienced by customers as a result of this incident. We consider the repeated outages and the length of time taken to reach a resolution unacceptable. We will continue to evaluate our hardware fully before deployment, and we are working to improve the time taken to migrate servers between nodes in case of future incidents.

If you have any questions about this incident or need further detail, please get in touch at support@sitehost.co.nz.

Posted 10 days ago. Jan 14, 2019 - 15:02 NZDT

Resolved
All servers on the affected node have been successfully migrated to new nodes and should be back up and running. If you're still experiencing issues as a result of this outage, please contact us at support@sitehost.co.nz.
Posted 15 days ago. Jan 10, 2019 - 03:41 NZDT
Monitoring
The node continues to be stable and we are monitoring performance closely, as well as continuing to migrate servers to new nodes.
Posted 15 days ago. Jan 09, 2019 - 22:28 NZDT
Update
While the node has remained stable for the last hour, we are continuing to migrate servers to new nodes. This ensures customer servers remain available while we restore the node to full working order.
Posted 15 days ago. Jan 09, 2019 - 20:39 NZDT
Update
As we are still experiencing issues with the node, we are moving customer servers to a new node in order to reduce the downtime customers are experiencing. This may result in additional short outages while the migrations are completed.
Posted 15 days ago. Jan 09, 2019 - 19:37 NZDT
Update
We're still experiencing instability with the node and are working on fixing the issue as quickly as possible, with this as our team's top priority.
Posted 15 days ago. Jan 09, 2019 - 18:37 NZDT
Update
The NZ-AKL2-V4YXX6 node is still experiencing some issues and we are working to resolve this as soon as possible. We apologise for the downtime and will provide a full post-mortem once the situation has been fully resolved. If you have any questions, please get in touch at support@sitehost.co.nz.
Posted 15 days ago. Jan 09, 2019 - 17:01 NZDT
Identified
We are seeing a resurgence in issues on the node and are working on a resolution.
Posted 15 days ago. Jan 09, 2019 - 15:30 NZDT
Monitoring
All servers are back up and running on the affected node and we are continuing to monitor for further issues.
Posted 15 days ago. Jan 09, 2019 - 14:46 NZDT
Investigating
We have experienced an unscheduled outage on one of our nodes - NZ-AKL2-V4YXX6 - in the North Shore datacenter that is impacting some customers. Our engineers are investigating this at the moment and we will update when we know more.
Posted 15 days ago. Jan 09, 2019 - 14:38 NZDT