On September 1st at approximately 1:00pm NZ time, we experienced an outage on one of our nodes - NZ-AKL2-61EY00 - in our Auckland North Shore location. A restart of the node was required, and servers were back up and running within approximately 30 minutes. As the cause of the instability was not immediately apparent, and this incident was part of a pattern we had seen on this node in recent weeks, we scheduled an emergency maintenance for early the next morning (September 2nd) to decommission this hardware node for further investigation. A new hardware node - NZ-AKL2-6Y23RO - was commissioned as the destination for this migration.
The maintenance began as scheduled early on September 2nd, with servers being migrated to NZ-AKL2-6Y23RO. During the maintenance, at approximately 5am, our engineers observed a storage device failure on the new node, at which point we failed out the suspect storage device for replacement. Later that morning, at approximately 10:50am, we experienced a full outage on NZ-AKL2-6Y23RO, resulting in an hour of downtime. Our engineers determined that this was an extended storage hardware failure on the new node, and after stabilising the server and completing a full suite of filesystem checks, the VMs were booted. We then scheduled another emergency maintenance for early on September 3rd to migrate off this node onto an existing and stable hardware node in the pool - NZ-AKL2-V8R4ZG. This second maintenance was completed without issue early this morning and servers have remained stable since. Our team have since removed the two problematic hardware nodes from our provisioning pools completely.
These two problematic hardware nodes come from different hardware generations, both of which are widely deployed in our network and which we consider to be very stable. The two nodes exhibited very different faults, so it is fair to say we have been bitten twice here. We have begun fully investigating the hardware issues on both NZ-AKL2-61EY00 and NZ-AKL2-6Y23RO. Additionally, as a result of these outages, we will be reviewing our testing processes to see if there is more we could have done before NZ-AKL2-6Y23RO went into production, and to ensure we are doing everything possible to prevent this happening again.
We are sorry for the repeated and extended outages you experienced this week. Although the issues only affected a relatively small number of customers, this hasn’t been up to our standard, and we will continue to work hard to restore the trust you have placed in us.
If you have any questions about the incident or need further information, please get in touch with us at firstname.lastname@example.org.