Cloudflare, one of the world’s largest cloud network platforms, experienced an outage of around 30 minutes yesterday that took down websites for some of the world’s biggest companies and left visitors seeing 502 errors, caused by a massive spike in CPU utilisation on its network. Cloudflare claims to provide services to over 16,000,000 “internet properties”.
Starting at 13:42 UTC on Tuesday, the global outage across the Cloudflare network resulted in visitors to Cloudflare-proxied domains being shown 502 errors (“Bad Gateway”), according to a post on the company blog. “The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.”
It’s the second major outage in the last 10 days with ThousandEyes reporting that “for roughly two hours between 07:00 and 09:00 U.S. Eastern Time on June 24th, ThousandEyes detected a significant route leak event that impacted users trying to access services fronted by CDN provider Cloudflare, including gaming platforms Discord and Nintendo Life. The route leak also affected access to some AWS services.” In this case, analysis by ThousandEyes found that Verizon was the prime offender but many Cloudflare users were impacted.
“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.”
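To illustrate the failure class being described, here is a minimal sketch in Python. The pattern below is illustrative only, not Cloudflare’s actual rule: a regex with nested quantifiers such as `(a+)+` backtracks exponentially on inputs that almost match, pinning a CPU core in exactly this way.

```python
import re
import time

# Illustrative pattern (NOT Cloudflare's actual rule): nested quantifiers
# make the engine try every way of splitting the a's between the two + loops.
pattern = re.compile(r"^(a+)+b$")

for n in (10, 14, 18):
    s = "a" * n  # no trailing "b", so every backtracking path is explored
    start = time.perf_counter()
    assert pattern.match(s) is None
    # runtime roughly doubles with each extra character
    print(f"n={n}: {time.perf_counter() - start:.5f}s")
```

Each failed match forces the engine to exhaust an exponential number of backtracking states, which is why a single such rule evaluated against live traffic can drive CPU to 100%.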
It was, says Cloudflare, an unprecedented CPU exhaustion event: the first global CPU exhaustion the company had ever experienced.
The outage followed a software deployment. Cloudflare constantly deploys software across its network, with automated systems to run test suites and a procedure for deploying changes progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go, causing the outage.
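The progressive-deployment safeguard these rules bypassed can be sketched as a simple gate. The stage names below are hypothetical, not Cloudflare’s actual pipeline: a change reaches successively larger slices of the fleet, and each stage must report healthy before the next one unlocks.

```python
from typing import Optional

# Illustrative rollout stages (hypothetical, not Cloudflare's real pipeline)
STAGES = ["canary", "single-pop", "region", "global"]

def next_stage(current: str, healthy: bool) -> Optional[str]:
    """Advance to the next rollout stage, or halt (None) on bad health."""
    if not healthy:
        return None  # halt the rollout and trigger a rollback
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

A deployment that skips straight to the last stage, as the WAF rules did here, forfeits every chance for an early health check to catch the problem on a small slice of traffic.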
At 14:02 UTC Cloudflare understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 14:09 UTC.
Cloudflare then reviewed the offending pull request, rolled back the specific rules, tested the change to be certain they had the correct fix, and re-enabled the WAF Managed Rulesets at 14:52 UTC.
Cloudflare admitted their testing processes were insufficient in this case, and they are reviewing and changing their testing and deployment processes to avoid similar incidents in the future.
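One standard hardening step for rules like this (a sketch of the general technique, not necessarily Cloudflare’s chosen fix) is to rewrite ambiguous nested quantifiers into an equivalent unambiguous pattern before a rule ever ships. For example, `(a+)+b` and `a+b` accept exactly the same strings, but only the latter is immune to catastrophic backtracking:

```python
import re

# Pathological and safe patterns that accept the same language
pathological = re.compile(r"^(a+)+b$")  # nested quantifiers: exponential backtracking
linear = re.compile(r"^a+b$")           # unambiguous: linear-time on these inputs

# The two patterns agree on matching and non-matching samples alike
for s in ("ab", "aaab", "b", "aaa", "aba"):
    assert bool(pathological.match(s)) == bool(linear.match(s))
print("patterns agree on all samples")
```

A pre-deployment test suite can enforce this kind of check, or avoid the problem class entirely by using a non-backtracking engine that guarantees linear-time matching.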