Unscheduled Network Outage
Incident Report for SiteHost
Resolved
At approximately 07:20 on Friday the 4th of May, our monitoring alerted our operations engineers to problems accessing our Auckland City network, which is colocated in the ICONZ datacenter, and investigation started straight away. Initial indications suggested a power failure on multiple core network switches, causing a complete network outage. Availability of these devices was restored at approximately 07:40, and most customers would have noticed service restored at this point. After service was restored, some customers may have noticed intermittent packet loss causing slow network performance; this was corrected at approximately 08:40.

The intermittent packet loss was due to a complete loss of communications between the high availability systems in our routers, which caused high load and several other network-related anomalies. Once these network issues were resolved, our engineers began investigating the root cause. It was discovered that our colocation provider had scheduled maintenance between 05:00 and 09:00 that morning. Our engineering team had been advised that this maintenance was network related and, due to the redundant nature of the SiteHost network, it was not expected to have any significant impact.

Unknown to our engineers, the maintenance also included restarting a single (1) access switch which is located inside our non-shared cabinets. It is understood, following several conversations, that our colocation provider's technician mistakenly restarted multiple (2) SiteHost-managed switches which are not in any way owned or operated by the provider. This basic oversight had significant consequences for us: while our team has worked hard building a fault-tolerant network, it was not possible for the network to recover from the removal of multiple devices.

SiteHost have strict processes in place for access to equipment in our cabinets, and we have been let down by our provider in communicating what work was being carried out. We have reiterated to our suppliers that we expect to be informed of any and all access to our locked cabinets, and we will be looking at any changes we can make to ensure this occurs in the future.

In addition to this, our team has been upgrading our core network over the last few months. This work has been completed at our North Shore facility, and we expect to complete the upgrade in the next few weeks. The upgrade will see our core network components housed in separate cabinets which, when combined with a number of other planned improvements, should make it significantly harder to repeat the service interruption you experienced last Friday.

We would like to take the time to apologise for this extremely inconvenient service interruption. While no one can guarantee their service will never be affected by downtime, what we can promise is that whenever something like this occurs, we will have a plan for restoring service and will carry it out as quickly as possible, while keeping you informed in a timely fashion via email and Twitter (please feel free to follow @sitehostnz on Twitter for regular updates).

SiteHost sincerely apologise to all impacted customers for any inconvenience caused.
Posted May 04, 2012 - 07:20 NZST