
Security Bits — 17 October 2021

Feedback & Followups

Deep Dive — Facebook’s Very Bad Day

Unless you’ve been living under a rock, you know Facebook went down for 6 hours recently. While the outage was going on there was all kinds of speculation about what might be happening, not helped by the fact that the outage roughly coincided with a whistleblower giving evidence to the US Congress.

My gut feeling was that it was either a rogue employee making a point and/or a dramatic exit, or a sysadmin having a really bad day. It turns out it was a sysadmin making a small mistake that led to a cascade of failures that was extremely difficult to recover from.

At the root of the outage is some automation around one of the backbone technologies underpinning the core of the internet: BGP, the Border Gateway Protocol. This is the routing protocol that allows the routers that actually power the internet to build up a map of how the actual cables that carry internet traffic are connected to each other, and which IP addresses are where.

BGP is the absolute work-horse of the internet, but it flies under the radar of most regular folk most of the time because there is no equivalent of it within our home networks. Routing within a typical home network is trivial, even if you set up three routers in a Y-shaped configuration to segregate off your IoT devices. What makes it trivial is that there is exactly one path between any two devices on the network, and between the internet and any device on the network. There are no choices to make, and there is no possibility of a loop.
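To make the "exactly one path" point concrete, here is a minimal Python sketch that counts the loop-free paths between two nodes. The device and router names, and both little topologies, are made up purely for illustration.

```python
def count_simple_paths(links, start, goal):
    """Count the loop-free paths from start to goal in a two-way network."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(node, visited):
        if node == goal:
            return 1
        total = 0
        for nxt in neighbours.get(node, ()):
            if nxt not in visited:  # never revisit a node, so no loops
                total += walk(nxt, visited | {nxt})
        return total

    return walk(start, {start})


# Hypothetical Y-shaped home network: one edge router feeding two inner routers.
home = [("internet", "edge"), ("edge", "main"), ("edge", "iot"),
        ("main", "laptop"), ("iot", "camera")]

# Hypothetical four-router mesh where every router links to every other one.
mesh = [("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D")]

home_paths = count_simple_paths(home, "laptop", "camera")
mesh_paths = count_simple_paths(mesh, "A", "D")
print(home_paths, mesh_paths)
```

Even with only four fully meshed routers there are already several candidate paths between two of them, which is the kind of choice a home network never has to make.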

The core of the internet is much more complicated: it’s made up of a massively interconnected grid of routers. Each router connects to many other routers, there are many possible paths between any two routers, and bad routing decisions could easily set up loops that trap traffic. What’s worse is that routes come and go constantly as cables are added, removed, taken offline for maintenance, or broken: cut by machinery, eaten by rodents (that happens a lot!), snapped by underwater landslides, or cut through by errant ship anchors.

No human could manage the chaos, so the routers have to figure it all out for themselves. This is the problem BGP solves, and a big part of the solution is that all routers are effectively gossips, telling all their neighbours everything they know. This means that information ripples through the internet as salacious news does through a village!

The source of all this gossip is announcements from routers with responsibility for specific blocks of IP addresses advertising (telling everyone that’s listening) that they’ve just come online and are ready to accept packets for their IP ranges, or, that they’ve changed their minds, and no longer want packets for those IPs (retractions).
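The gossip idea can be sketched in a few lines of Python. This is a toy model, not real BGP: the router names are invented, the prefix is a documentation-only range, the shortest path always wins, and a retraction is modelled by simply re-running the gossip without the origin's announcement (real routers exchange incremental withdrawal messages and apply policy as well).

```python
from collections import deque


def propagate(links, origins):
    """Flood announcements outward from the routers that originate them.

    links   -- iterable of (router, router) pairs, treated as two-way
    origins -- dict mapping a prefix to the router currently announcing it
    Returns {router: {prefix: path}}, where path ends at the origin.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    tables = {router: {} for router in neighbours}
    queue = deque()
    for prefix, origin in origins.items():
        tables.setdefault(origin, {})[prefix] = [origin]
        queue.append((origin, prefix))

    while queue:
        router, prefix = queue.popleft()
        path = tables[router][prefix]
        for peer in sorted(neighbours.get(router, ())):
            if peer in path:
                continue  # peer is already on this path: ignore it to avoid a loop
            candidate = [peer] + path
            known = tables[peer].get(prefix)
            if known is None or len(candidate) < len(known):
                tables[peer][prefix] = candidate
                queue.append((peer, prefix))  # peer gossips the news onward
    return tables


# Hypothetical topology and prefix (192.0.2.0/24 is reserved for documentation).
PREFIX = "192.0.2.0/24"
links = [("origin", "r1"), ("r1", "r2"), ("r2", "r3"), ("r1", "r3")]

before = propagate(links, {PREFIX: "origin"})  # origin advertises the prefix
after = propagate(links, {})                   # origin has retracted it

print(before["r3"][PREFIX])
print(PREFIX in after["r3"])
```

Once the only origin retracts its announcement, no router has any path left for the prefix, even though every cable in the toy network is still plugged in.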

Finally, we think of IPs as belonging to single devices, but out on the internet, that’s not true. Large content delivery networks (CDNs) use BGP to offer multiple possible end-points for a given IP address. This is how CDNs allow for fast downloads: the DNS records for the servers map the name of the content-hosting server to a given IP, and BGP then offers lots of possible paths to that IP, each leading to a different server in a different part of the world that has a copy of the content. Each router uses the shortest path it knows about, so Irish customers end up at a server in one of the data centres ringing Dublin, and someone in Australia ends up talking to a server in Sydney or Melbourne etc.
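Here is a rough Python sketch of that "same IP, nearest server" behaviour. It assumes a made-up topology and uses a plain hop count as the measure of "shortest", which is a simplification: real BGP compares paths between networks and applies each operator's policies.

```python
from collections import deque


def nearest_site(links, sites, client):
    """Return (site, hops) for the announcing site closest to client, or None."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {client}
    queue = deque([(client, 0)])
    while queue:  # breadth-first search reaches the closest site first
        node, hops = queue.popleft()
        if node in sites:
            return node, hops
        for peer in sorted(neighbours.get(node, ())):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, hops + 1))
    return None


# Hypothetical network: two data centres announce the same IP address.
links = [("dublin-isp", "dublin-dc"), ("dublin-isp", "london-ix"),
         ("london-ix", "singapore-ix"), ("singapore-ix", "sydney-isp"),
         ("sydney-isp", "sydney-dc")]
sites = {"dublin-dc", "sydney-dc"}

irish = nearest_site(links, sites, "dublin-isp")
aussie = nearest_site(links, sites, "sydney-isp")
# If the Dublin site stops announcing, Irish traffic still arrives, just further away.
fallback = nearest_site(links, {"sydney-dc"}, "dublin-isp")
print(irish, aussie, fallback)
```

The fallback case shows why this design is attractive: taking one site out of the pool degrades performance rather than availability, provided at least one site keeps announcing.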

When you have multiple servers powering a single IP, you need to update your advertised routes as servers are added to the pool, or removed from it.

Facebook decided to automate this process with software running on their DNS infrastructure, and through a whoopsie, accidentally caused their DNS servers to send out BGP advertisements retracting all routes to the IP addresses of their DNS servers. This means their DNS servers took themselves off the internet, and all Facebook domains became impossible to translate from human-friendly name to IP address, including the internal DNS records powering the infrastructure employees needed to securely connect from home. In effect, Facebook knocked themselves off the internet in such a way that the only solution was to physically get into the data centres, connect directly to the routers, and send out updated BGP advertisements. Because so many people are working from home, and because Facebook’s data centres need superb physical security, it took hours to figure out what happened, get physically to the data centres, get into the right rooms, and get the routes published.

Basically, it was a cascading failure. It reminded me of the worst day of my professional career, when a swan had an even worse day and shorted some high-voltage cables, knocking out electricity in most of our county. That cascaded with a battery failure in a UPS, which took down our entire infrastructure for the first time in years, and when we went to power up our private cloud we discovered its startup procedures depended on our DNS VMs, which were hosted on our private cloud. One circular dependency, one very bad day! (The fix was a few hard-coded /etc/hosts files based on a document someone found on their computer that listed the IP addresses belonging to the critical DNS names.)
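Circular dependencies like that one are easy to check for ahead of time if the startup dependencies are written down somewhere. The Python sketch below does a depth-first search over a dependency map; the service names are invented stand-ins for the story above, not a record of the real setup.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    stack = []

    def visit(node):
        state[node] = IN_PROGRESS
        stack.append(node)
        for dep in depends_on.get(node, ()):
            if state.get(dep, UNSEEN) == IN_PROGRESS:
                # dep is still being started further up the stack: a cycle
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[node] = DONE
        return None

    for name in depends_on:
        if state.get(name, UNSEEN) == UNSEEN:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical startup dependencies mirroring the story: cloud needs DNS, DNS lives on cloud.
startup = {"private-cloud": ["dns-vms"], "dns-vms": ["private-cloud"], "ups": []}

# The same map after the cloud is pointed at hard-coded hosts files instead of DNS.
fixed = {"private-cloud": ["hosts-file"], "dns-vms": ["private-cloud"], "hosts-file": []}

cycle = find_cycle(startup)
print(cycle)
print(find_cycle(fixed))
```

The second map has no cycle because the hosts files depend on nothing, which is exactly why they worked as an emergency fix.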

I felt really sorry for the Facebook sysadmins — they now have one heck of a war story to regale fellow nerds with in the pub at tech conferences 🙂

Links

❗ Action Alerts

Worthy Warnings

  • A massive infrastructure provider for phone carriers around the world (about 235), including big names like AT&T, T-Mobile, Verizon, Vodafone, and China Mobile, has revealed that hackers were active in their systems from May 2016 until May this year. Attackers could see call metadata, like who called whom and for how long, as well as the contents of SMS messages. Yet another reason to avoid SMS for 2FA when you have other options — www.vice.com/…
  • Huge Twitch leak exposes source code, passwords – what you need to do — www.imore.com/…
  • 🇺🇸 A breach at Verizon carrier Visible has resulted in fraudulent orders of iPhones being charged to people’s connected payment methods — www.imore.com/…

Notable News

Top Tips

Interesting Insights

Palate Cleansers

  • NASA astronaut Shane Kimbrough has been tweeting up a storm from the ISS, including some lovely photos of the Earth at night. Two personal highlights:
    • 🇮🇪 Dublin — twitter.com/…
    • 🇧🇪 Brussels — twitter.com/…
  • A great explanation of what the chance of rain percentage on your weather apps actually means, and why it can say 100%, you can stay totally dry, and the app can still be completely correct — www.macobserver.com/…

Legend

When the textual description of a link is part of the link, it is the title of the page being linked to; when the text describing a link is not part of the link, it is a description written by Bart.

Emoji Meaning
🎧 A link to audio content, probably a podcast.
❗ A call to action.
A flag (e.g. 🇺🇸) The story is particularly relevant to people living in a specific country, or, the organisation the story is about is affiliated with the government of a specific country.
📊 A link to graphical content, probably a chart, graph, or diagram.
🧯 A story that has been over-hyped in the media, or, “no need to light your hair on fire” 🙂
💵 A link to an article behind a paywall.
📌 A pinned story, i.e. one to keep an eye on that’s likely to develop into something significant in the future.
🎩 A tip of the hat to thank a member of the community for bringing the story to our attention.
