Global internet services provider Cloudflare had trouble, and when it has problems, the internet has trouble, too. For about an hour, websites around the globe went down with 502 error messages.
The problem has now been fixed, and the service appears to be running normally. It’s still not entirely clear what happened.
In a short blog post, Cloudflare CTO John Graham-Cumming explained:
“For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.”
Cloudflare CEO Matthew Prince subsequently explained the failure happened because:
“[A] bug on our side caused Firewall process to consume excessive CPU. Initially appeared like an attack. We were able to shut down process and get systems restored to normal. Putting in place systems so never happens again.”
Both Graham-Cumming and Prince emphasized this service disruption was not caused by an attack. Nor, Prince tweeted, was this a repeat of the Verizon Border Gateway Protocol network problem, which troubled Cloudflare and the internet last week.
How could this simple mistake cause so many problems? Cloudflare operates an extremely popular content delivery network (CDN). When it works right, its services protect website owners from peak loads, comment spam attacks, and Distributed Denial of Service (DDoS) attacks. When it doesn’t work right, well, we get problems like this one.
Cloudflare’s CDN works by optimizing the delivery of your website’s resources to your visitors. It does this by serving your website’s static content to visitors from its global data centers, so your web server only has to deliver dynamic content. In addition, generally speaking, Cloudflare’s global network provides a faster route to your site than a visitor would get by connecting to it directly.
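To make the static/dynamic split concrete, here is a minimal sketch of an edge cache sitting in front of an origin web server. It is a toy model, not Cloudflare's implementation: the names `Origin`, `EdgeCache`, and `STATIC_EXTENSIONS`, and the rule of deciding by file extension, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed list of extensions treated as static; a real CDN's rules differ.
STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".woff2")


@dataclass
class Origin:
    """Hypothetical stand-in for the site owner's web server."""
    hits: int = 0  # how many requests actually reached the origin

    def fetch(self, path: str) -> str:
        self.hits += 1
        return f"origin response for {path}"


@dataclass
class EdgeCache:
    """Hypothetical edge node: caches static paths, forwards the rest."""
    origin: Origin
    store: dict = field(default_factory=dict)

    def handle(self, path: str) -> str:
        # Dynamic content always goes back to the origin server.
        if not path.endswith(STATIC_EXTENSIONS):
            return self.origin.fetch(path)
        # Static content is fetched once, then served from the edge.
        if path not in self.store:
            self.store[path] = self.origin.fetch(path)
        return self.store[path]


origin = Origin()
edge = EdgeCache(origin)
edge.handle("/style.css")   # first request: fetched from origin, then cached
edge.handle("/style.css")   # second request: served from the edge cache
edge.handle("/cart")        # dynamic: forwarded to origin every time
```

The sketch also shows why an edge failure is so visible: every request, static or dynamic, passes through `handle`, so if the edge layer itself stops responding, visitors cannot reach the site even when the origin server is healthy.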
Its CDN is the most popular such service, with 34.55% of the market. Amazon CloudFront is second with 28.84%. Cloudflare protects over 16 million sites, including BuzzFeed, Sling TV, Pinterest, and Dropbox, so when it has trouble, many of these websites are knocked off the internet.
Prince admitted this was the biggest internal problem Cloudflare has ever had. He tweeted:
“This was unique in that it impacted primary and all fail-over systems in a way we haven’t seen before. Will ensure better isolation and backstops in the future. Still getting to the bottom of the root cause.”
The problem affected Cloudflare’s DNS service as well as its CDN.
To Cloudflare’s credit, the company is taking the blame and being transparent about what went wrong. At the same time, the episode emphasizes how much the internet now depends on a few important companies instead of many peer-to-peer businesses and institutions.