Unscheduled Network Outage
Incident Report for SiteHost
Resolved
At approximately 01:08 on Thursday the 7th of March, our monitoring alerted our operations engineers to problems affecting our network. We began investigating and discovered network issues resulting in intermittent packet loss, and at times complete loss of connectivity, affecting all customers at our Auckland City facility.

At the time, SiteHost engineers spoke to our upstream suppliers' support staff, who advised us that the problems were potentially related to a single upstream network provider whose network faults were causing a range of issues across these diverse upstream networks. SiteHost maintain an out-of-band management network, and access to this was also affected, making diagnostics more difficult than we would have liked.

Although some connectivity was restored at around 04:00, our engineers continued to observe issues on some paths, which we began attempting to diagnose. While diagnosing these issues we continued to experience brief periods where connectivity was degraded or completely lost. After some troubleshooting, our engineers restored connectivity at approximately 06:00.

Upon further investigation the following day, we discovered that at least two completely separate network devices on the SiteHost network were running but not allowing traffic to pass through. We are still working with our partners to understand why this happened. We are also still waiting on an incident report from one of our upstream suppliers to help us understand why their network went down.

The SiteHost network is fault tolerant, using diverse paths, and will fail over in the event that a network device completely fails. Unfortunately, it was not possible for our network to fail over when this behaviour was observed, as the affected devices were still running and had not completely failed. Restarting the affected process restored these devices to full functionality. We have been monitoring them since, and these issues have not reoccurred.
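
To illustrate why a device that is running but not passing traffic can escape failover, here is a minimal Python sketch. It is not SiteHost's monitoring or failover logic; the names `DeviceCheck`, `classify` and `max_loss` are hypothetical, and the sketch simply assumes that a monitor has both a liveness signal and the results of test packets sent through the device.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DeviceCheck:
    """One round of (hypothetical) checks against a network device."""
    responds_to_management: bool   # liveness: the device itself answers
    probe_results: Sequence[bool]  # True for each test packet that passed through


def classify(check: DeviceCheck, max_loss: float = 0.5) -> str:
    """Separate 'device is up' from 'device is forwarding traffic'."""
    if not check.responds_to_management:
        # A complete failure: the case that liveness-based failover handles.
        return "down"
    if not check.probe_results:
        # No forwarding evidence either way.
        return "unknown"
    lost = sum(1 for ok in check.probe_results if not ok)
    if lost / len(check.probe_results) > max_loss:
        # Device looks alive but is dropping traffic: a silent failure.
        return "silent-failure"
    return "healthy"


if __name__ == "__main__":
    running_but_not_forwarding = DeviceCheck(True, [False, False, False, False])
    print(classify(running_but_not_forwarding))
```

A check that looks only at `responds_to_management` would treat the device in the usage example as healthy, which is the gap that a probe through the device is meant to close.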

This was perhaps the most challenging and certainly the longest network incident we have experienced since we began managing our own network 4 years ago, for the following reasons:

- Multiple upstream networks failed concurrently, delaying our access to begin diagnosing issues.
- SiteHost out-of-band management network access was affected.
- SiteHost network devices malfunctioned silently.

We have learnt a lot from this incident and have a number of changes planned to improve the diversity, monitoring and ultimately the reliability of our network. Some of these changes were easy and have already been made; others will take us several weeks to complete.

We would like to take the time to apologise for this extremely inconvenient service interruption. We would also like to remind customers that they can follow @sitehostnz on Twitter for regular updates in the event of any network issues in the future.

If you made it this far, thank you for reading and thank you for your continued support.

Quintin Russ
Technical Director - SiteHost
Posted Mar 07, 2013 - 06:00 NZDT