What happened to Facebook yesterday?
2021-10-06 04:06 pm
If you're sensible enough not to use Facebook, WhatsApp, or Instagram, or to have set up "log in with Facebook" on any site you use regularly, you might not have noticed that they all disappeared from the internet for about six hours yesterday. Or if you noticed, you might not have cared. But you might have read some of the news about it, and wondered what the heck BGP and DNS are, and what they had to do with it all.
And if not, I'm going to tell you anyway.
You're more likely to have heard of DNS: that's the Internet's phone book. Your web browser, and every other program that connects to anything over the Internet, uses the Domain Name System to look up a "domain name" like, say, "www.facebook.com", and find the numerical IP address that it refers to. DNS works by splitting the name into parts, and looking them up in a series of "name servers". First it looks in a "root server" to find the address of the Top-Level Domain (TLD) server that holds the lookup table for the last part of the name, e.g., "com". From the TLD server it gets the address of the "authoritative name server" that holds the lookup table for the next part of the name, e.g., "facebook", and looks there for any subdomains (e.g. "www").
(When you buy a "domain name", what you're actually buying is a line in the TLD servers that points to the DNS server for your domain. You also have to get somebody to "host" that server; that's usually also the company that hosts your website, but it doesn't have to be.)
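If it helps to see the lookup chain as code, here is a toy sketch of the root, TLD, authoritative walk described above. Everything in it is made up for illustration: the three tables, the `resolve` function, and the addresses (which come from the ranges reserved for documentation). A real resolver talks to real servers over the network and handles many more cases.

```python
# Toy model of a DNS lookup. All tables and addresses are hypothetical;
# the addresses come from documentation-only ranges, not real servers.
ROOT = {"com": "192.0.2.1"}  # root server: TLD -> TLD server
TLD_SERVERS = {"192.0.2.1": {"facebook": "198.51.100.1"}}  # TLD server: domain -> authoritative server
AUTHORITATIVE = {"198.51.100.1": {"www": "203.0.113.10"}}  # authoritative server: host -> address


def resolve(name):
    """Walk root -> TLD -> authoritative and return (address, steps)."""
    host, domain, tld = name.split(".")  # only handles three-part names
    steps = []

    tld_server = ROOT[tld]
    steps.append(("root", tld, tld_server))

    auth_server = TLD_SERVERS[tld_server][domain]
    steps.append(("tld", domain, auth_server))

    address = AUTHORITATIVE[auth_server][host]
    steps.append(("authoritative", host, address))
    return address, steps


if __name__ == "__main__":
    address, steps = resolve("www.facebook.com")
    for server, label, answer in steps:
        print(f"asked {server} server about {label!r}, got {answer}")
```

Note that the name is read right to left: "com" first, then "facebook", then "www".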
All this takes a while, so the network stack on your computer passes the whole process off to a "caching name server" which remembers every domain name it looks up, for a time which is called the name's "time to live" (TTL). Your ISP has a caching name server they would like you to use, but I'd recommend telling your router (if you have full control over it) to use Cloudflare's or Google's nameserver, at the IP address 1.1.1.1 or 8.8.8.8 respectively. Your router will also keep track of the names of the computers attached to your local network.
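A caching name server can be sketched in a few lines. This is only an illustration of the TTL idea, not how any real resolver is written; `CachingResolver`, its `lookup` callback, and the injectable `clock` are all names I've invented for the example.

```python
import time


class CachingResolver:
    """Remember each answer until its time to live (TTL) runs out."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup      # hypothetical: name -> (address, ttl_seconds)
        self._clock = clock        # injectable so the example can be tested
        self._cache = {}           # name -> (address, expiry time)
        self.upstream_queries = 0  # how often we had to do the full lookup

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]        # still fresh: answer from the cache
        self.upstream_queries += 1
        address, ttl = self._lookup(name)
        self._cache[name] = (address, now + ttl)
        return address
```

The point of the TTL is that a change (or a failure) upstream only becomes visible once the cached entry expires, which is why short TTLs make outages show up quickly everywhere.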
Finally, we get to the Border Gateway Protocol (BGP). If DNS is the phone book where you look up street addresses, BGP is the road map that tells your packets how to get there from your house, and in particular what route to take.
The Internet is a network of networks, and it's split up into "autonomous systems" (AS), each of which is a large pool of routers belonging to a single organization. Each AS exchanges messages with its neighbors, using BGP to determine the "best" route between itself and every other AS in the Internet. (The best route isn't always the shortest; the protocol can also take things like the cost of messages into account.) BGP isn't entirely automatic -- there's some manual configuration involved.
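Here is a very simplified sketch of how one AS might keep track of the routes its neighbors announce and pick a "best" one. It is an assumption-laden toy: real BGP has a much longer decision process, and the `Route` and `RouteTable` names, the two-step preference rule (local preference first, then shortest AS path), the AS numbers, and the prefix are all chosen for illustration from ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str            # block of addresses, e.g. "203.0.113.0/24"
    as_path: tuple         # AS numbers the route passes through
    local_pref: int = 100  # manually configured preference (higher wins)


class RouteTable:
    """Routes one AS has heard from its neighbors (toy model)."""

    def __init__(self):
        self._routes = {}  # prefix -> {neighbor AS: Route}

    def announce(self, neighbor, route):
        self._routes.setdefault(route.prefix, {})[neighbor] = route

    def withdraw(self, neighbor, prefix):
        self._routes.get(prefix, {}).pop(neighbor, None)

    def best(self, prefix):
        candidates = list(self._routes.get(prefix, {}).values())
        if not candidates:
            return None    # nobody knows how to get there
        # Prefer the configured preference, then the shorter AS path.
        return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))


if __name__ == "__main__":
    table = RouteTable()
    table.announce(64501, Route("203.0.113.0/24", (64501, 64510)))
    table.announce(64502, Route("203.0.113.0/24", (64502, 64505, 64510), local_pref=200))
    print(table.best("203.0.113.0/24"))  # the longer but preferred route
```

The longer path wins in the example because of its higher `local_pref`, which is the "not always the shortest" part. Once every neighbor has withdrawn a prefix, `best` returns `None`: there is simply no road on the map any more.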
What happened yesterday was that somebody at Facebook accidentally gave a command that resulted in all the routes leading to Facebook's data centers being withdrawn. In less than a minute Facebook's DNS servers noticed that their network was "unhealthy", and took themselves offline. At that point Facebook had basically shot themselves in the foot with a cannon.
Normally, engineers can fix server configuration problems like this by connecting to the servers over the internet. But Facebook's servers weren't connected to the internet anymore. To make matters worse, the computers that control access to Facebook's buildings -- offices as well as data centers -- weren't able to connect to the database that told them whose badges were valid.
Meanwhile, computers that wanted to look up Facebook or any of its other domains (like WhatsApp and Instagram) kept getting DNS failures. There isn't a good way for an app or a computer to determine whether a DNS lookup failure is temporary or permanent, so they keep re-trying, sometimes (as Cloudflare's blog post puts it) "aggressively". Users don't usually take an error for an answer either, so they keep reloading pages, restarting their browsers, and so on. "Sometimes also aggressively." DNS queries for Facebook's domains reaching public resolvers increased to 30 times normal, and traffic to alternatives like Signal, Twitter, Telegram, and TikTok nearly doubled.
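To get a feel for why "aggressive" retrying adds up so fast, here is a small back-of-the-envelope sketch. It counts how many lookups a single client makes in a given window if every lookup fails, comparing a fixed retry delay with a delay that doubles after each failure. The `attempts_within` function and the numbers are mine, purely for illustration; they aren't taken from any of the outage reports.

```python
def attempts_within(window_seconds, first_delay, multiplier=1.0):
    """Count the lookups one client makes in a window if every one fails.

    multiplier=1.0 means a fixed delay between retries; multiplier=2.0
    doubles the delay after each failure (exponential backoff).
    """
    attempts = 1     # the initial lookup
    elapsed = 0.0
    delay = first_delay
    while elapsed + delay <= window_seconds:
        elapsed += delay
        attempts += 1
        delay *= multiplier
    return attempts


if __name__ == "__main__":
    print("retry every second:", attempts_within(60, 1.0))
    print("doubling backoff:  ", attempts_within(60, 1.0, multiplier=2.0))
```

Multiply the difference by a few billion devices and apps and the jump in DNS traffic stops being surprising.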
Altogether a nice demonstration of Facebook's monopoly power, and great fun to read about if you weren't relying on it.
Resources
- Understanding How Facebook Disappeared from the Internet - The Cloudflare Blog
- Update about the October 4th outage - Facebook Engineering
- What Happened to Facebook, Instagram, & WhatsApp? – Krebs on Security
- More details about the October 4 outage - Facebook Engineering
- Beginner's Guide to Understanding BGP
- What is DNS? | How DNS works | Cloudflare
Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).
no subject
Date: 2021-10-06 11:58 pm (UTC)
no subject
Date: 2021-10-07 04:55 am (UTC)
no subject
Date: 2021-10-07 08:56 am (UTC)
I have never been more glad that I no longer use Facebook or Instagram. (I have WhatsApp for family stuff, but don't need to rely on it thankfully.) The sad part is that it will probably happen again because the solution would involve reversing the centralisation of all Facebook services, which in turn would make it easier for parts of the empire to be divested, either willingly or unwillingly.
no subject
Date: 2021-10-07 01:54 pm (UTC)
I use Facebook, carefully, for keeping in touch with family and friends who aren't on DW. It sucks, but... Wouldn't touch WhatsApp with a barge pole -- I use Signal for text.
no subject
Date: 2021-10-07 10:35 am (UTC)
no subject
Date: 2021-10-07 04:22 pm (UTC)
I do know what BGP is, but I hadn't heard that it was the start of the cascading clusterfuck. If all you have to do to disable that entire core network is briefly shut down BGP, I'm surprised this is the first time this has happened. I mean, you do have to upgrade the router firmware occasionally, which means shutting things down. I can only assume that the correct maintenance procedure was to shift the backbone traffic to an alternative backbone before shutting down the main one, and somebody typed the commands in the wrong order? And it sounds like they were performing the maintenance remotely OVER THE INTERNET, which seems like a TERRIBLE idea. Sometimes you really do need to be within arm's reach of the equipment, just in case.
The "I can't let you do that, Dave" scenario is kind of hilarious but in some ways it reminds me more of "Mr. Robot." When I watched that show I kept thinking (hoping) that their scenarios were unrealistic because who would design a major core network in such a way that the internal maintenance and building security functions were on the same network as the data traffic?? Well, apparently Facebook, for one. Yikes. I hope they are seriously rethinking their security philosophy.
no subject
Date: 2021-10-07 05:07 pm (UTC)
Once an organization gets to the point where it needs multiple data centers, it's no longer possible to work on the servers hands-on. Long before that point, actually -- all of the servers for my personal websites are hosted, at Dreamhost and GitHub. But it does show a serious lack of redundancy on FB's part.
But any sufficiently large system is likely to have cascading failure modes. Beginner's Guide to Understanding BGP has a good example, where Pakistan tried to ban YouTube by abusing BGP. And of course it's not just BGP; the same kind of thing happens in electrical grids (https://en.wikipedia.org/wiki/Northeast_blackout_of_2003) and (as we're seeing now) international shipping.
There are lots more examples in Cascading failure - Wikipedia.
no subject
Date: 2021-10-07 06:16 pm (UTC)
Even on Mr. Robot, where they were setting data centers on fire by hacking into the building systems from the Internet, I don't think the fire department ended up locked out of the buildings.
no subject
Date: 2021-10-07 09:46 pm (UTC)
One of the FB blog posts says that they did have a way to get access, but it was slow. My guess is that it involved finding someone with a key. That was certainly required for their office buildings -- there were posts to that effect.
Fire alarms often unlock the doors automatically, and I think there's usually a master key that fire and police departments can use.
Ahh, I knew you'd explain this in a way I could comprehend.
Date: 2021-10-09 09:45 pm (UTC)
I'll admit my initial response was smugness that I rarely use those sites.
But then Nancy Lebovitz on Metafilter linked me to an excellent blog post by Jim Wright at Stonekettle. I was reminded that the infrastructure Facebook makes available is crucial for folks our age who didn't happen to learn computers when they were young.
Re: Ahh, I knew you'd explain this in a way I could comprehend.
Date: 2021-10-10 05:07 am (UTC)
Yeah -- I have a foot in both worlds. Sure, my main social medium is DW, but if I want to keep up with what the people in my family are doing, I have to use FB, because it's all they know. Sucks.
Re: Ahh, I knew you'd explain this in a way I could comprehend.
Date: 2021-10-11 11:01 am (UTC)