For the past 15 months we have been working extremely hard on patching a completely new class of vulnerability. The security issues we deal with are usually software in nature; more recently, however, a number of vulnerabilities have been found within Intel CPUs released over the last 10 years. Like most cloud providers, we are affected almost across the board: these issues impact nearly every piece of hardware we have in production. There have been plenty of write-ups on the internet about this, but essentially, because these vulnerabilities are inherently hardware in nature, the fixes have often resulted in severe performance regressions.
This has been an extremely difficult process for the team and our customers because the performance overhead has varied greatly depending on the generation of equipment, the operating system, and the kernel running on the virtual server. Even if we replaced all of our equipment (a hugely expensive undertaking), that still wouldn't guarantee a resolution – as with any new class of vulnerability, a number of lookalike issues have been found, with seven new variants discovered as recently as four months ago. These were the variants we were attempting to patch in early March, and from experience I would be surprised if more issues were not uncovered in the near future.
As with any change of this magnitude, we conducted significant testing internally and with a small number of customers ahead of a wider rollout. Unfortunately, despite this, a number of customers saw a large performance regression after we applied the update. The impact varied from customer to customer, and despite our attempts to work through the issues one by one, we ultimately had to roll the patches back and continue analysing the data collected during that time.
The timeline of events was as follows:
13th – 14th March: Security patches rolled out to customer hardware nodes.
15th March: Intermittent reports of performance degradation, internal testing begins.
18th March: More reports surface over the weekend; testing continues.
19th March: Security patches identified as source of degradation, workarounds tested.
20th March: Decision made to roll back patches, emergency maintenance scheduled.
21st – 22nd March: Hardware nodes rolled back to previous build.
What our review of the security updates missed was just how expensive the mitigations themselves are. To fix the issue fully, we need to do the following:
1) Update the Server BIOS in order to ship a new microcode for the CPU.
2) Update the Hypervisor / Virtualization Software to ensure your guest is safe from its neighbours within the environment.
3) Update the VM kernel to a patched version so it does not attempt execution paths that are now severely slow on a patched hypervisor.
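All three layers have to line up before the mitigations are both effective and cheap. On a Linux guest running kernel 4.15 or later, you can see what your own kernel reports by reading the standard sysfs vulnerability files; the sketch below is illustrative (the `check_mitigation` helper is our own naming, but the sysfs path and the status strings such as "Mitigation: PTI" and "Vulnerable" are standard Linux):

```shell
# The kernel reports one status line per issue under this sysfs directory,
# e.g. "Mitigation: PTI", "Not affected", or "Vulnerable".
VULN_DIR=/sys/devices/system/cpu/vulnerabilities

# check_mitigation STATUS_LINE -> "vulnerable" or "patched"
# A line starting with "Vulnerable" means the kernel has no active fix.
check_mitigation() {
    case "$1" in
        Vulnerable*) echo "vulnerable" ;;
        *)           echo "patched" ;;
    esac
}

# Live check, where the directory exists (kernel 4.15+):
if [ -d "$VULN_DIR" ]; then
    for f in "$VULN_DIR"/*; do
        status=$(cat "$f")
        printf '%-28s %-40s %s\n' \
            "$(basename "$f")" "$status" "$(check_mitigation "$status")"
    done
fi
```

Note that this only reflects the guest kernel's view; a "patched" line does not tell you whether the hypervisor or microcode underneath has also been updated.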
It is worth mentioning that we have been running a number of AMD hardware nodes which are not vulnerable to this specific bug. We have been very happy with the performance of these hardware nodes, but as with any new build we have been taking a measured approach to rolling them into production. After reviewing all of the above, it is clear that these AMD nodes are currently our best option, so we will be increasing the rate at which we rack them. In the next few weeks we will be migrating some customers onto the new AMD hardware nodes and working through migrating the remaining customers onto hardware nodes with updated BIOSes and hypervisor software.
We are taking this approach to ensure that we have a pathway back for any customers who have not updated their kernel, and are also working with a number of the worst affected customers to ensure we do not experience the same performance regression again.
In the meantime, if you are a VPS customer, please make sure your kernel patching is up to date so you are not affected when the security fixes are rolled out again. Cloud Container customers need not worry; we will be updating the kernel version for your servers as part of a scheduled maintenance in the near future.
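A quick way to sanity-check this on a Linux VPS is to compare the running kernel (`uname -r`) against the minimum version your distribution shipped the fixes in. The sketch below uses GNU `sort -V` for natural version ordering; the `4.15` threshold is only an example, not a statement of which kernel your particular distribution needs:

```shell
# kernel_at_least RUNNING REQUIRED -> "ok" if RUNNING >= REQUIRED,
# else "update needed". Relies on sort -V (GNU version sort).
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ] \
        && echo "ok" || echo "update needed"
}

# Compare the running kernel against an example threshold:
kernel_at_least "$(uname -r)" "4.15"
```

If this reports "update needed", update via your distribution's package manager and reboot so the new kernel is actually loaded.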
We apologise for the inconvenience this work caused. While the security of our platforms and your servers will always be our top priority, it's clear we missed the mark here, and we will do better in the future.