Security Bits logo - a green padlock with the words Security Bits to the right, and in tiny letters below that it says 10101010, indicating a digital lock

Security Bits — 23 November 2025

Feedback & Followups

Deep Dive — that Cloudflare Outage

There was quite a bit of disruption on the internet for a few hours this week when Cloudflare suffered a system-wide outage on some of its services. Big-name websites were affected, but so were lots of smaller websites like bartbusschots.ie and podfeet.com. Some people are assuming we Cloudflare users must really be regretting our choice, but I, for one, am absolutely not. If anything, the way the company responded to this incident has strengthened my trust in them and their services.

I think a lot of the criticism comes down to measuring against the wrong yardstick. The question isn’t “is Cloudflare perfect?”, but “is Cloudflare better than I could achieve alone?”. When I measure my reliance on the service by that metric, I can give a full-throated “YES!” in response!

Cloudflare offer many services, but their three most prominent are:

  1. Authoritative DNS hosting
  2. Website proxying, caching, and protection
  3. Public DNS resolution (1.1.1.1)

The 1.1.1.1 public DNS resolver was not affected by this outage at all, so we can ignore that for the remainder of this discussion.

When you own your own domain, two or more DNS servers somewhere on the internet need to act as the authoritative source of DNS records for that domain. For your domain to continue to exist on the internet, the control panel powering those servers needs to be secure, and the servers themselves need to be secure and resilient.

A quick DNS query shows that both Allison and I use Cloudflare’s authoritative DNS service for our domains:

$ dig +short bartbusschots.ie NS
aaron.ns.cloudflare.com.
savanna.ns.cloudflare.com.
$ dig +short podfeet.com NS
elmo.ns.cloudflare.com.
pat.ns.cloudflare.com.
$
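
If you wanted to script that check rather than eyeball it, something like the following would do. It’s only a minimal sketch: the function name all_cloudflare_ns is made up for illustration, and it assumes Cloudflare-hosted nameservers always end in .ns.cloudflare.com, which matches the output above but isn’t guaranteed.

```shell
# Sketch: succeed only if every nameserver read from stdin is a Cloudflare one.
# Input: NS hostnames, one per line, as printed by `dig +short DOMAIN NS`.
# Assumption: Cloudflare nameservers end in .ns.cloudflare.com (lowercase,
# with or without the trailing dot).
all_cloudflare_ns() (
  count=0
  while IFS= read -r ns || [ -n "$ns" ]; do
    # Skip blank lines
    [ -z "$ns" ] && continue
    count=$((count + 1))
    case "$ns" in
      *.ns.cloudflare.com.|*.ns.cloudflare.com) ;;  # looks like Cloudflare
      *) exit 1 ;;                                  # anything else fails the check
    esac
  done
  # An empty list (e.g. a failed lookup) does not count as a pass
  [ "$count" -gt 0 ]
)

# Example with live data (needs network access, so left commented out):
# dig +short podfeet.com NS | all_cloudflare_ns && echo "DNS is on Cloudflare"
```

The function only looks at hostnames, so it says nothing about whether the nameservers are actually answering queries.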

Like the public DNS resolver, this authoritative DNS service didn’t go down either. In fact, I’ve never experienced an outage on this service. You also never hear about Cloudflare security flaws compromising people’s domains or anything like that. The simple fact is that Cloudflare are the authoritative DNS provider for many major websites because they’ve earned a stellar reputation!

That leaves just one other major service — their website proxy service. This is where this week’s disruption happened.

When you run a website, you need to put the content on a web server somewhere on the internet so browsers can access it. The simplest thing to do is to run your own server, which Allison and I have done for decades now. Like just about everyone else, we published our websites to the world directly from our servers for years, but today we don’t. Neither of our websites is accessed directly from the servers powering them; instead, the DNS records for our websites point to Cloudflare IP addresses, adding them as an intermediary between the internet and our websites:

$ dig +short www.bartbusschots.ie
172.67.198.200
104.21.13.74
$ whois 172.67.198.200 | grep 'Organization:'
Organization:   Cloudflare, Inc. (CLOUD14)
$ whois 104.21.13.74 | grep 'Organization:'
Organization:   Cloudflare, Inc. (CLOUD14)
$
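
The same check can be done offline by comparing an address against Cloudflare’s published IP ranges instead of running whois. The sketch below is illustrative only: the function names are hypothetical, and the two ranges are a small subset of Cloudflare’s published IPv4 list that may be incomplete or out of date by the time you read this.

```shell
# Convert a dotted-quad IPv4 address (e.g. 104.21.13.74) to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1   # split the address into its four octets
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if the IP in $1 falls inside the CIDR block in $2 (e.g. 172.64.0.0/13).
in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network address, before the slash
  bits=${2#*/}                 # prefix length, after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

# ASSUMPTION: an illustrative, incomplete subset of Cloudflare's IPv4 ranges.
CLOUDFLARE_RANGES="104.16.0.0/13 172.64.0.0/13"

# Succeed if the IP in $1 is inside any of the ranges listed above.
is_cloudflare_ip() (
  for range in $CLOUDFLARE_RANGES; do
    in_cidr "$1" "$range" && exit 0
  done
  exit 1
)
```

With these definitions, is_cloudflare_ip 172.67.198.200 and is_cloudflare_ip 104.21.13.74 both succeed, which agrees with the whois output above.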

We’ve clearly complicated things for ourselves by adding this extra layer, so why did we make that choice?

The big reason for me is that the internet is now awash with resource-hogging bots, and only some of them are coded ethically. The ethical ones respect the site owner’s bot policies as expressed in the site’s robots.txt file, but the unethical ones don’t. This literally costs us site owners money as our servers get overloaded, forcing us to upgrade to beefier, more expensive servers, or move our sites behind proxy services like Cloudflare’s.
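
For anyone who hasn’t seen one, a bot policy is just a small text file served from the root of the site. The sketch below prints a sample; the bot name ExampleBot and the /admin/ path are hypothetical, and, as described above, nothing forces a badly behaved bot to honour any of it.

```shell
# Print a sample robots.txt. "ExampleBot" and the paths are made up for illustration.
sample_robots_txt() {
  cat <<'EOF'
# Ask one specific crawler to stay away from the whole site
User-agent: ExampleBot
Disallow: /

# Ask every other crawler to skip the admin area only
User-agent: *
Disallow: /admin/
EOF
}

sample_robots_txt
```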

Proxy services like Cloudflare’s save server resources in two ways:

  1. They just block the worst of the bots, period!
  2. They cache our server’s responses, so lots of requests get answered by Cloudflare without ever being sent to our servers at all
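
You can see the caching at work in the response headers, because Cloudflare adds a cf-cache-status header to proxied responses. The sketch below pulls that value out of a set of headers; the function name cache_status is hypothetical, and it assumes the headers arrive on standard input in the form printed by curl -sI.

```shell
# Sketch: read HTTP response headers on stdin and print the value of the
# cf-cache-status header in upper case, or NONE if the header is absent.
cache_status() (
  status=$(
    tr -d '\r' |                      # HTTP headers use CRLF line endings
    awk -F': *' 'tolower($1) == "cf-cache-status" { print toupper($2); exit }'
  )
  echo "${status:-NONE}"
)

# Example with live data (needs network access, so left commented out):
# curl -sI https://www.podfeet.com/ | cache_status
```

A value of HIT means Cloudflare answered from its cache without troubling the origin server, while MISS means the request had to be passed through.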

As a bonus, Cloudflare also act as a Web Application Firewall (WAF), blocking malicious requests including those matching the most common vulnerabilities catalogued in the OWASP Top 10.

Also, if you’re unfortunate enough to be targeted by a denial of service (DoS/DDoS) attack, Cloudflare is invaluable because their infrastructure can soak up even the biggest attacks and save your site and your server from being blasted off the net!

Finally, if you choose to put in a bit more work and enable enough caching, you can actually mask server outages for a while. That’s a more advanced feature I don’t use with my Bartificer Creations hat on, and not something Allison has invested in for this site either, but lots of organisations use Cloudflare as part of their disaster recovery (DR) plan.

The bigger your site, the bigger the wins, which explains why John Gruber is also not even slightly tempted to move his daringfireball.net site off Cloudflare:

“DF’s overall uptime and the frequency of any sort of performance problems went from good to great when I started relying on Cloudflare as a proxy. Also, in recent years, bot traffic has exploded. (Thanks, AI.) I’m pretty sure my server could handle those bursts of traffic on its own, but I sleep better not having to worry about it, because Cloudflare handles mind-boggling amounts of traffic.”

John Gruber — daringfireball.net/…

So this week, Cloudflare’s web proxying service went down for a while. That’s really rare! The key takeaway is not that Cloudflare are unreliable, it’s that they’re so reliable most of us can’t remember the last time they suffered a global outage like this! (They say their last global outage was in 2019.)

Just like no software can possibly be bug-free, no infrastructure can possibly be perfect. What matters is your provider’s competency relative to their rivals, and their response when things go wrong.

There’s no one better than Cloudflare at doing what Cloudflare does. They have a stellar reputation, and they’ve earned that through more than a decade of hard work. Cloudflare get an A+ on relative competence!

Because they’re so good at what they do, there are very few opportunities to judge Cloudflare by how they respond to problems. This week, we got one of those rare opportunities, so how did they do?

  1. They had regular updates on their status page throughout the outage
  2. Their error screen very clearly showed the problem was on their side, not the client or server ends, so users knew they were not the problem, and site owners knew they didn’t need to start troubleshooting their stuff.
  3. They got their most critical services back quite quickly, with about 90 minutes of downtime for most sites, and everything was back within just a few hours.
  4. They had a detailed postmortem published within 12 hours that was simply excellent — blog.cloudflare.com/…
    1. It starts with a human-friendly but accurate description of what actually happened
    2. It gives technical people all the detail they could want
    3. It ends with a genuine apology
    4. It was published and signed by the CEO, no buck-passing!

When I say the postmortem started with a human-friendly explanation, I really do mean it. These are the opening few paragraphs:

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file. Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

That’s followed by this frank and direct apology from the CEO:

We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare’s importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today.

The technical detail is interesting for those into that kind of thing, and surprisingly revealing for such a major provider.
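
To make the failure a little more concrete, here is a purely illustrative sketch of the kind of guard that protects against an oversized generated file. It is emphatically not Cloudflare’s code: the function name, the one-entry-per-line file format, and the limit are all hypothetical. The idea is simply to check a new file against the limit before loading it, and to fall back to the last known good version if the new one is too big.

```shell
# Sketch: decide which feature file should be loaded.
#   $1 = newly generated candidate file (assumed: one entry per line)
#   $2 = last known good file
#   $3 = maximum number of entries the consumer can handle (hypothetical limit)
# Prints the path of the file to load; explains any rejection on stderr.
choose_feature_file() (
  candidate=$1 last_good=$2 max_entries=$3

  # An unreadable candidate is treated the same as a rejected one
  [ -r "$candidate" ] || { echo "$last_good"; exit 0; }

  # The arithmetic expansion strips the padding some wc implementations add
  entries=$(( $(wc -l < "$candidate") ))

  if [ "$entries" -le "$max_entries" ]; then
    echo "$candidate"
  else
    echo "rejected $candidate: $entries entries exceeds limit of $max_entries" >&2
    echo "$last_good"
  fi
)
```

The design choice worth noticing is that an oversized file results in a warning and a fallback rather than a crash, which is the general spirit of treating internally generated files with the same suspicion as user input.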

The post ends with the section I was most interested in — Cloudflare’s reaction to this outage:

Now that our systems are back online and functioning normally, work has already begun on how we will harden them against failures like this in the future. In particular we are:

  • Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
  • Enabling more global kill switches for features
  • Eliminating the ability for core dumps or other error reports to overwhelm system resources
  • Reviewing failure modes for error conditions across all core proxy modules

Today was Cloudflare’s worst outage since 2019. We’ve had outages that have made our dashboard unavailable. Some that have caused newer features to not be available for a period of time. But in the last 6+ years we’ve not had another outage that has caused the majority of core traffic to stop flowing through our network.

An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past it’s always led to us building new, more resilient systems.

I can’t imagine a better response. Few companies manage to get a response like that together in a week; getting that published in less than 12 hours is astonishing. They get a resounding A+ from me on this metric too!

If every major provider responded this well to an outage, the internet would be a much more reliable place!

Finally, some levity courtesy of Randall Munroe at XKCD: Service Outage — xkcd.com/…

Links

❗ Action Alerts

Worthy Warnings

  • 🧯 Public anxiety about Apple’s Digital ID greatly overstates actual risks — appleinsider.com/…
  • There is an uptick in scams targeting people whose iPhones have been stolen – cyberinsider.com/…
    • The attack is impactful because the attackers really do have the victim’s lost/stolen phones
    • The problem is they’re not trying to help, they’re trying to trick the victims into releasing the activation lock so they can resell the stolen phone as new
    • If you’re unlucky enough to lose your phone, don’t follow any instructions in any unsolicited messages, and definitely don’t reply with any codes or passwords or log into any websites you are asked to!
    • Contact Apple support directly yourself before doing anything!
  • An ingenious Apple Service hoax is convincing users their account is under attack — appleinsider.com/…
    • Many of the alerts were genuine Apple alerts because the attackers were trying to use Apple’s actual account recovery features to take over the victim’s account
    • The clever part was how the attackers intermixed scam SMS messages with the legitimate Apple messages (they were triggering the Apple messages so they could control the timings of everything)
    • The SMS messages from a random Atlanta number and the non-Apple domain name should have been red flags.
  • Beware cheap Android photo frames 🙁 — www.bleepingcomputer.com/…
    • > “Uhale Android-based digital picture frames come with multiple critical security vulnerabilities and some of them download and execute malware at boot time.”
    • > “It is recommended that consumers only buy electronic devices from reputable brands that use official Android images without firmware modifications, Google Play services, and built-in malware protections.”
  • WhatsApp flaw allowed researchers to scrape data of 3.5 billion users — cyberinsider.com/…
    • > “the company claimed the exposed data was already public and emphasized that message content remained protected by encryption. Nonetheless, the researchers argue that the ability to generate a global user database, including cryptographic keys, poses substantial risks to user safety, especially in repressive regimes.”
    • The real takeaway is that if you make any part of your profile public, it really is public, even if you don’t think of WhatsApp as social messaging
    • If you need privacy, use Signal!
  • Google are starting to train their AI on user email content, and in much of the world, it’s opt-out! — appleinsider.com/…
    • The US is opt-out
    • Privacy laws make a real difference, though, because in the EU, Japan, Switzerland, and the UK it’s opt-in (as it should be everywhere!)
    • Allison’s guide (made with Folge) to opting out of Google’s AI training on Gmail and other Google Workspace services – share with your friends and family: https://www.podfeet.com/misc/gmail-opt-out-ai-training-A.pdf

Notable News

Top Tips

Interesting Insights

Palate Cleansers

Legend

When the textual description of a link is part of the link, it is the title of the page being linked to. When the text describing a link is not part of the link, it is a description written by Bart.

Emoji Meaning
🎧 A link to audio content, probably a podcast.
❗ A call to action.
flag The story is particularly relevant to people living in a specific country, or, the organisation the story is about is affiliated with the government of a specific country.
📊 A link to graphical content, probably a chart, graph, or diagram.
🧯 A story that has been over-hyped in the media, or, “no need to light your hair on fire” 🙂
💵 A link to an article behind a paywall.
📌 A pinned story, i.e. one to keep an eye on that’s likely to develop into something significant in the future.
🎩 A tip of the hat to thank a member of the community for bringing the story to our attention.
🎦 A link to video content.
