Unscheduled Network Outage
Incident Report for SiteHost
Resolved
At approximately 23:36:43 on Tuesday the 21st of May, the network at one of our Auckland City Data Centers experienced a complete loss of service.

SiteHost engineers were on-site conducting what they considered to be a low-risk network change, which involved installing a line card into each of our redundant aggregation switches. This change is a small part of a larger network upgrade we are undertaking to increase our network capacity and improve our reliability.

Almost immediately following the installation of the first line card, our aggregation switches began exhibiting issues and did not fail over automatically. As a result, engineers on-site immediately began manually intervening to resolve the connectivity issues, and service was restored at approximately 00:08:20.

According to the vendor's supporting documentation, these line cards are hot swappable, and other similar deployments showed no indication of issues when installing them in a production environment. Our experience overnight differs greatly from what has been documented, and we are working with our vendor to identify the root cause and to explain why we experienced a loss of service in this situation.

It quickly became apparent that, as a result of the impact to the aggregation switches, loop detection mechanisms in our access switches had disabled their uplink ports. To restore service, our engineers needed to log in to every switch and re-enable these ports. We are investigating further improvements that can be made in this area.

This was a fairly major event for us, our last being just over six months ago, and as always we will work hard to ensure we learn from this and make sure it doesn’t happen again. That being said, sometimes even with the best planning and intentions, things don’t go as smoothly as we would like. We are confident, though, that these network upgrades are well worth any short-term pain. The increased speed and capacity of our network has enabled us to offer better national links and to lower costs for international bandwidth by 80%, and we expect to see increased reliability when this work is complete.

Even so, we are very sorry for the inconvenience this caused, and we appreciate the support from our customers as we continue to work hard to improve the services we offer you.

Quintin Russ
Technical Director - SiteHost
Posted May 21, 2014 - 00:08 NZST