At approximately 08:00 on Friday the 23rd of June, one of our Windows VPS nodes (NZ-AKL2-V5L1DJ) suffered multiple drive failures, taking some customer servers offline.
Our engineers were notified of the outage immediately and attempted to restore service by repairing the RAID array on the node. This ultimately failed, and we made the decision to restore onto fresh drives using our most recent backup.
This is when we discovered that the drives had started failing earlier that morning, during our standard backup window. As a result, the most recent backups were either incomplete or corrupt for roughly half of the impacted customers. We had to restore those customers from the previous night's backups, causing up to 24 hours of data loss in some cases.
By 10:05 we had restored the first batch of customer servers, and all customers were back online by 14:07.
——
Incidents such as this are a reality of what we do. We are picky about the hardware that we use in our nodes but parts fail – how we recover is the true test.
In this situation we want to do better. We lost more time than we would have liked simply waiting on data to transfer from our backup servers to the new drives.
As such we would like to speed up restores as part of a larger backup project that we are currently undertaking.
Finally we would like to say sorry to the impacted customers, especially to those whose servers had to be recovered from an older backup. You were all extremely understanding during the outage and for that we’re thankful.
As always, if you have any questions regarding this outage or want to discuss disaster recovery options, please get in touch and we will be happy to help.
Quintin Russ