Infrastructure Outage

Incident Report for SiteHost

Postmortem

Timeline

On November 14th at approximately 11:00 a network event occurred within our AKL01 facility. This event resulted in simultaneous loss of service across four network devices that make up our distribution network inside the AKL01 facility.

The issue was observed immediately and our team began investigating. We confirmed access to servers in both AKL02 and AKL03; however, no servers were accessible at AKL01.

After a short period of investigation we discovered that OSPF (a routing protocol we use within our network) had disconnected from all AKL01 distribution switches simultaneously. The physical links that OSPF runs over were still up, which is not normal behaviour.

Upon learning this, we decided to restart dist1, followed shortly after by dist2.

At approximately 11:33 the reboot of dist1 completed successfully. This temporarily resolved the issue, with all network services resuming correctly except our IPMI network, which runs on a separate pair of distribution switches: idist1 and idist2. This segment of our network provides out-of-band access so customers can reach and reboot their physical servers when they are otherwise unreachable.

Approximately five minutes later, at 11:38, the reboot of dist2 completed successfully and it began communicating with both the non-functional idist layer and dist1. This caused a network loop or similar behaviour in the distribution layer again.

We quickly noted that the issue had only recurred once dist2 came back online, and decided to shut down dist2 and isolate it from the network. This restored network service at approximately 11:44.

Our team then temporarily shifted to clearing our monitoring alerts and to identifying and correcting secondary issues caused by the outage, such as incorrectly configured default gateways and VPN services that did not restart correctly.

Finally, between 00:00 and 02:00 on the 15th of November, all redundant network services were brought back into service with additional control plane protection. The network has continued operating normally since.

What happened?

We believe that a switch in the distribution layer of AKL01 began forwarding all packets in an uncontrolled manner. This caused a loop in the network and resulted in the failure of critical control plane services. All distribution switches in AKL01 were affected, and this included our out-of-band management access via console.

Further investigation of our logs has allowed us to observe two separate OSPF neighbours going down almost simultaneously, and other observed events imply that four separate switches lost all control plane functionality within a very small window of time (likely seconds). This is not typical behaviour.
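As an illustration of the kind of log analysis described above, the following is a minimal sketch of how near-simultaneous neighbour-down events could be grouped by timestamp. It is not the tooling we used: the function name, the 5-second window, the event format and the sample data (generic switch names and an arbitrary date) are all assumptions made purely for the example.

```python
from datetime import datetime, timedelta


def find_simultaneous_failures(events, window=timedelta(seconds=5), min_devices=2):
    """Group (timestamp, device) events that start within `window` of each other.

    A group is reported only if it involves at least `min_devices`
    distinct devices. All names and thresholds here are hypothetical.
    """
    ordered = sorted(events, key=lambda event: event[0])
    clusters = []
    current = []

    def flush():
        # Keep the group only if enough distinct devices were involved.
        if len({device for _, device in current}) >= min_devices:
            clusters.append(list(current))

    for event in ordered:
        # Start a new group once we move past the window of the first event.
        if current and event[0] - current[0][0] > window:
            flush()
            current.clear()
        current.append(event)
    if current:
        flush()
    return clusters


# Made-up sample data: one isolated event and three events seconds apart.
sample_events = [
    (datetime(2000, 1, 1, 10, 15, 0), "switch-d"),
    (datetime(2000, 1, 1, 11, 0, 1), "switch-a"),
    (datetime(2000, 1, 1, 11, 0, 2), "switch-b"),
    (datetime(2000, 1, 1, 11, 0, 3), "switch-c"),
]

clusters = find_simultaneous_failures(sample_events)
for cluster in clusters:
    devices = ", ".join(device for _, device in cluster)
    print(f"{cluster[0][0].isoformat()}: {len(cluster)} devices affected ({devices})")
```

With the sample data, the isolated event is ignored and the three events that fall within the window are reported as a single group.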

For some context here, the affected networking equipment at AKL01 has been operational and stable for over 4 years with no major topology or architecture changes since deployment. We are operating hardware and software from two separate vendors. There were also no recent config changes made around the time of the event.

As a result of the events on Monday, we have already made a number of small changes to further help protect our network's control plane. We will be making further changes after we have completed a thorough review of both our hardware and software configurations in conjunction with our networking vendors.

Areas for improvement

As mentioned, this was the first outage of its type for us in a long time and as a result it tested our response in a number of unique ways. This meant we had a few other unforeseen challenges that you may or may not be aware of. We want to highlight these and tell you what we’re going to be changing to ensure we can respond better in the future.

Firstly, our status page (www.sitehost-status.net), while completely externally hosted, did not automatically update itself to highlight an issue. Our status page is an incredibly important communication tool and it needs to provide accurate and timely information; on this occasion it was below the level we strive for.

Upon investigation, this appears to be due to a change in the configuration of the endpoint alerts our monitoring service sends to our status page provider. This has now been fixed, but we are still investigating how this change occurred in the first place. We will also be expanding the level of detail around some services and infrastructure on our status page.
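One way to catch this class of problem earlier is to periodically compare the live alert configuration against a known-good copy. The sketch below is only an illustration of that idea, not a description of our monitoring setup: the configuration keys, the example URLs and both function names are hypothetical.

```python
import hashlib
import json

_MISSING = object()  # Sentinel so a missing key differs from an explicit None.


def config_fingerprint(config):
    """Return a stable SHA-256 fingerprint of a JSON-serialisable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(expected, actual):
    """Return the sorted keys whose values differ between the two configs."""
    keys = set(expected) | set(actual)
    return sorted(
        key
        for key in keys
        if expected.get(key, _MISSING) != actual.get(key, _MISSING)
    )


# Hypothetical known-good configuration and a drifted live copy.
expected_config = {
    "alert_endpoint": "https://status.example.invalid/hooks/alerts",
    "enabled": True,
}
actual_config = {
    "alert_endpoint": "https://status.example.invalid/hooks/old",
    "enabled": True,
}

drifted_keys = detect_drift(expected_config, actual_config)
if config_fingerprint(expected_config) != config_fingerprint(actual_config):
    print("Alert configuration drift detected in:", ", ".join(drifted_keys))
```

A check like this only reports that a change happened; finding out how and why it happened still needs a separate audit trail.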

We also believe there is room for improvement in manually updating the status page sooner than we did. Our incident response plan is being reviewed to ensure this and other communication channels are more clearly assigned and prioritised.

Our office Internet connection went offline. We have a separate, out-of-band fibre connection in the building which worked throughout the incident; however, this event has highlighted the need to be able to quickly fail over our office between connections to assist operationally.

Our phone system was also affected by the network issue and failed over as we expected it to; however, the volume of calls was far higher than expected (in part due to the status page not updating automatically) and this caused some delays in customers being able to reach us. We now have far more customers than we had when this failure mode was designed and it will need rethinking.

This was the first major outage for us in a number of years and a number of staff had not seen how we respond to outages of this scale. For one person it was their first day at SiteHost! Our team is now routinely split between in-office and remote working, which inherently slows down communication. There are a number of improvements we have highlighted here which we will be working into our incident response plan.

Summary

Finally, I would like to apologise for the outage and thank you all for the understanding you showed. We’ve worked hard throughout our history to build a reputation for being a partner you can rely on and days like this are where we put that reputation on the line. Outages are part and parcel of being a cloud provider and it’s how we handle them that should be the true measure. While we have plenty to learn and improve upon as always, we hope we showed your trust in us is well placed.

Thanks for reading,

Quintin Russ

Posted Nov 16, 2022 - 15:52 NZDT

Resolved

All infrastructure and services have continued to be stable for the last few hours and we're confident this issue is resolved.

Thank you everyone for your patience today and we are sorry for the disruption. We'll be writing up a full incident post-mortem to explain exactly what happened and what we've learned later this week.

Posted Nov 14, 2022 - 15:55 NZDT

Update

We are continuing to monitor for any further issues.

Posted Nov 14, 2022 - 13:00 NZDT

Monitoring

Services continue to be stable and engineers are monitoring the situation closely.

Posted Nov 14, 2022 - 12:42 NZDT

Identified

Services have been restored and impacted customers should be back online but we're still working to ensure things stay stable.

Posted Nov 14, 2022 - 12:25 NZDT

Update

Engineers are still working to resolve this outage and we appreciate your patience. We are also impacted by the outage in some ways, so ticket replies and answering calls will take longer than expected.

Posted Nov 14, 2022 - 12:00 NZDT

Update

We are continuing to investigate this issue.

Posted Nov 14, 2022 - 11:27 NZDT

Investigating

We are currently experiencing an outage impacting a variety of infrastructure and customer servers. Engineers are investigating and we'll update here regularly.

Posted Nov 14, 2022 - 11:19 NZDT

This incident affected: API, Control Panel, Windows Virtual Servers (NZ – AKL01), Dedicated Servers (NZ – AKL01), Cloud Containers (NZ – AKL01), and Linux Virtual Servers (NZ – AKL01).