mdlbear: (technonerdmonster)
[personal profile] mdlbear

If you're sensible enough not to use Facebook, WhatsApp, or Instagram, and never set up "log in with Facebook" on any site you use regularly, you might not have noticed that they all disappeared from the internet for about six hours yesterday. Or if you noticed, you might not have cared. But you might have read some of the news about it, and wondered what the heck BGP and DNS are, and what they had to do with it all.

And if not, I'm going to tell you anyway.

You're more likely to have heard of DNS: that's the Internet's phone book. Your web browser, and every other program that connects to anything over the Internet, uses the Domain Name System to look up a "domain name" like, say, "www.facebook.com", and find the numerical IP address that it refers to. DNS works by splitting the name into parts, and looking them up in a series of "name servers". First it looks in a "root server" to find the address of the Top-Level Domain (TLD) server that holds the lookup table for the last part of the name, e.g., "com". From the TLD server it gets the address of the "authoritative name server" that holds the lookup table for the next part of the name, e.g., "facebook", and looks there for any subdomains (e.g., "www").
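
For the programmers in the audience, here's a toy sketch of that chain of lookups. It's just three Python dictionaries standing in for the root, TLD, and authoritative servers; the server names and the IP address (taken from a range reserved for documentation) are made up, and real DNS involves network queries and a lot more record types than this.

```python
# Toy model of the lookup chain: root -> TLD -> authoritative server.
# Every server name and the IP address below is invented for illustration.
ROOT = {"com": "tld-com"}                               # root: TLD -> TLD server
TLD_SERVERS = {"tld-com": {"facebook": "ns-facebook"}}  # TLD server: domain -> authoritative server
AUTHORITATIVE = {"ns-facebook": {"www": "192.0.2.10"}}  # authoritative server: subdomain -> IP


def resolve(name):
    """Resolve a three-part name like 'www.facebook.com', working right to left."""
    host, domain, tld = name.split(".")
    tld_server = ROOT[tld]                        # step 1: ask the root about "com"
    auth_server = TLD_SERVERS[tld_server][domain]  # step 2: ask the TLD server about "facebook"
    return AUTHORITATIVE[auth_server][host]        # step 3: ask the authoritative server about "www"


print(resolve("www.facebook.com"))
```

If any one of those three tables can't be reached, the whole lookup fails, which is worth keeping in mind for what comes later.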

(When you buy a "domain name", what you're actually buying is a line in the TLD servers that points to the DNS server for your domain. You also have to get somebody to "host" that server; that's usually also the company that hosts your website, but it doesn't have to be.)

All this takes a while, so the network stack on your computer passes the whole process off to a "caching name server" which remembers every domain name it looks up, for a time which is called the name's "time to live" (TTL). Your ISP has a caching name server they would like you to use, but I'd recommend telling your router (if you have full control over it) to use Cloudflare's or Google's nameserver, at the IP address 1.1.1.1 or 8.8.8.8 respectively. Your router will also keep track of the names of the computers attached to your local network.
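
Here's a minimal sketch of what a caching name server does with the TTL. The class name, the `lookup` function, and the fake clock are all my own inventions for the example; a real caching resolver is considerably more involved.

```python
import time


class CachingResolver:
    """Remembers each answer until its time to live (TTL) runs out."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup      # hypothetical function: name -> (address, ttl_seconds)
        self._clock = clock        # injectable so the example can fake the passage of time
        self._cache = {}           # name -> (address, expires_at)
        self.upstream_queries = 0  # how often we had to do the slow lookup

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry and entry[1] > now:
            return entry[0]        # still fresh: answer from the cache
        self.upstream_queries += 1
        address, ttl = self._lookup(name)
        self._cache[name] = (address, now + ttl)
        return address


fake_now = [0.0]
resolver = CachingResolver(lambda name: ("192.0.2.10", 300), clock=lambda: fake_now[0])
resolver.resolve("www.example.com")  # not cached yet: goes upstream
resolver.resolve("www.example.com")  # answered from the cache
fake_now[0] = 301                    # pretend the 300-second TTL has expired
resolver.resolve("www.example.com")  # goes upstream again
print(resolver.upstream_queries)
```

The thing to notice is that once the TTL runs out, the cache has to go back to the authoritative server, so a cache only hides an outage for as long as its entries stay fresh.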

Finally, we get to the Border Gateway Protocol (BGP). If DNS is the phone book where you look up street addresses, BGP is the road map that tells your packets how to get there from your house, and in particular what route to take.

The Internet is a network of networks, and it's split up into "autonomous systems" (AS), each of which is a large pool of routers belonging to a single organization. Each AS exchanges messages with its neighbors, using BGP to determine the "best" route between itself and every other AS in the Internet. (The best route isn't always the shortest; the protocol can also take things like the cost of messages into account.) BGP isn't entirely automatic -- there's some manual configuration involved.
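
To make "best isn't always shortest" concrete, here's a heavily simplified sketch of route selection. It looks at only two things: a locally configured preference (that's the manual part), and the length of the list of autonomous systems the route passes through. The real BGP decision process has quite a few more tie-breakers, and the neighbor names and AS numbers here are made up (the numbers come from a range reserved for documentation).

```python
def best_route(routes):
    """Pick the route with the highest local preference, then the shortest AS path."""
    return min(routes, key=lambda r: (-r["local_pref"], len(r["as_path"])))


# Two hypothetical routes to the same destination AS (64510).
routes_to_destination = [
    {"via": "neighbor-A", "as_path": [64500, 64510], "local_pref": 100},
    {"via": "neighbor-B", "as_path": [64501, 64502, 64510], "local_pref": 200},
]

chosen = best_route(routes_to_destination)
print(chosen["via"])
```

The longer route through neighbor-B wins here, because somebody configured a higher preference for it, perhaps because that link is cheaper.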

What happened yesterday was that somebody at Facebook accidentally gave a command that resulted in all the routes leading to Facebook's data centers being withdrawn. In less than a minute Facebook's DNS servers noticed that their network was "unhealthy", and took themselves offline. At that point Facebook had basically shot themselves in the foot with a cannon.
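
As a rough sketch of what "withdrawn" means from the point of view of everybody else's routers: each router keeps a table of who has announced a route to which block of addresses, and when the last announcement for a block is withdrawn, that block simply stops being reachable. The class, the neighbor names, and the address block (a documentation range, not one of Facebook's) are all assumptions for the example.

```python
class RoutingTable:
    """Tracks which neighbors have announced a route to each address prefix."""

    def __init__(self):
        self._routes = {}  # prefix -> set of neighbors announcing it

    def announce(self, prefix, neighbor):
        self._routes.setdefault(prefix, set()).add(neighbor)

    def withdraw(self, prefix, neighbor):
        neighbors = self._routes.get(prefix, set())
        neighbors.discard(neighbor)
        if not neighbors:
            # No announcements left: forget the prefix entirely.
            self._routes.pop(prefix, None)

    def reachable(self, prefix):
        return prefix in self._routes


table = RoutingTable()
table.announce("192.0.2.0/24", "neighbor-A")
table.announce("192.0.2.0/24", "neighbor-B")
before = table.reachable("192.0.2.0/24")

table.withdraw("192.0.2.0/24", "neighbor-A")
table.withdraw("192.0.2.0/24", "neighbor-B")
after = table.reachable("192.0.2.0/24")
print(before, after)
```

Nothing at the destination has to be broken for this to happen; the servers can be running perfectly well, and the rest of the Internet just no longer knows how to get to them.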

Normally, engineers can fix server configuration problems like this by connecting to the servers over the internet. But Facebook's servers weren't connected to the internet anymore. To make matters worse, the computers that control access to Facebook's buildings -- offices as well as data centers -- weren't able to connect to the database that told them whose badges were valid.

Meanwhile, computers that wanted to look up Facebook or any of its other domains (like WhatsApp and Instagram) kept getting DNS failures. There isn't a good way for an app or a computer to determine whether a DNS lookup failure is temporary or permanent, so they keep re-trying, sometimes (as Cloudflare's blog post puts it) "aggressively". Users don't usually take an error for an answer either, so they keep reloading pages, restarting their browsers, and so on. "Sometimes also aggressively." Traffic to Facebook's DNS servers increased to 30 times normal, and traffic to alternatives like Signal, Twitter, Telegram, and TikTok nearly doubled.

Altogether a nice demonstration of Facebook's monopoly power, and great fun to read about if you weren't relying on it.

Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).
Donation buttons in profile.

Date: 2021-10-06 11:58 pm (UTC)
madfilkentist: My cat Florestan (gray shorthair) (Default)
From: [personal profile] madfilkentist
I like the part where the computers to let people into the building to fix the problem couldn't function. It's as close to an "I can't let you do that, Dave" scenario as I've heard of yet.

Date: 2021-10-07 04:55 am (UTC)
melchar: (zombies)
From: [personal profile] melchar
*BIG grin* That was my favorite part, too.

Date: 2021-10-07 08:56 am (UTC)
From: [personal profile] spiffyvoxel

I have never been more glad that I no longer use Facebook or Instagram. (I have WhatsApp for family stuff, but don't need to rely on it thankfully.) The sad part is that it will probably happen again because the solution would involve reversing the centralisation of all Facebook services, which in turn would make it easier for parts of the empire to be divested, either willingly or unwillingly.

Date: 2021-10-07 10:35 am (UTC)
freyjaw: (communicator)
From: [personal profile] freyjaw
I'm glad I never used any of those.

Date: 2021-10-07 04:22 pm (UTC)
dreamshark: (Default)
From: [personal profile] dreamshark
Thanks for the detailed explanation.

I do know what BGP is, but I hadn't heard that it was the start of the cascading clusterfuck. If all you have to do to disable that entire core network is briefly shut down BGP, I'm surprised this is the first time this has happened. I mean, you do have to upgrade the router firmware occasionally, which means shutting things down. I can only assume that the correct maintenance procedure was to shift the backbone traffic to an alternative backbone before shutting down the main one, and somebody typed the commands in the wrong order? And it sounds like they were performing the maintenance remotely OVER THE INTERNET which seems like a TERRIBLE idea. Sometimes you really do need to be within arm's reach of the equipment, just in case.

The "I can't let you do that, Dave" scenario is kind of hilarious but in some ways it reminds me more of "Mr. Robot." When I watched that show I kept thinking (hoping) that their scenarios were unrealistic because who would design a major core network in such a way that the internal maintenance and building security functions were on the same network as the data traffic?? Well, apparently Facebook, for one. Yikes. I hope they are seriously rethinking their security philosophy.

Date: 2021-10-07 06:16 pm (UTC)
dreamshark: (Default)
From: [personal profile] dreamshark
What I find really incomprehensible is that there was no failure mode for letting people into the building if the network was down. What if there was a fire???

Even on Mr. Robot, where they were setting data centers on fire by hacking into the building systems from the Internet, I don't think the fire department ended up locked out of the buildings.

jesse_the_k: room full of women keypunching (keypunchers)
From: [personal profile] jesse_the_k

I'll admit my initial response was smugness that I rarely use those sites.

But then Nancy Lebovitz on Metafilter linked me to an excellent blog post by Jim Wright at Stonekettle. I was reminded that the infrastructure Facebook makes available is crucial for folks our age who didn't happen to learn computers when they were young.

andrewducker: (Default)
From: [personal profile] andrewducker
Same here. It'd be lovely to go back to a pre-Facebook world, but alas the network effects have well and truly kicked in.
