
Did Facebook just implement some web crawler? My website has crashed a couple of times over the past few days, severely overloaded by IPs that I've traced back to Facebook.

I have tried googling around but can't find any definitive resource on controlling Facebook's crawler bot via robots.txt. There is a reference that suggests adding the following:

    User-agent: facebookexternalhit/1.1
    Crawl-delay: 5

    User-agent: facebookexternalhit/1.0
    Crawl-delay: 5

    User-agent: facebookexternalhit/*
    Crawl-delay: 5

But I can't find any specific reference on whether the Facebook bot respects robots.txt. According to older sources, Facebook "does not crawl your site". But this is definitely false, as my server logs show it crawling my site from a dozen-plus IPs in the ranges 69.171.237.0/24 and 69.171.229.115/24, at a rate of many pages per second.

And I can't find any literature on this. I suspect it is something new that FB implemented over the past few days, since my server never crashed like this before.

Can someone please advise?

Mongrel Jedi
  • Yes, something has recently changed, as it started crashing us for the first time in the 8 years we have been around. Supposedly they are "updating their OpenGraph". However, looking at the pages it is requesting (very old, obscure pages), I'm wondering if a legit bot is executing JavaScript and pulling in the Like buttons, triggering an FB OpenGraph update. That is just a hunch... – Stickley Nov 09 '12 at 17:00
  • Related questions: http://stackoverflow.com/questions/11521798/excessive-traffic-from-facebookexternalhit-bot?lq=1 and http://stackoverflow.com/questions/7716531/facebook-and-crawl-delay-in-robots-txt?lq=1 – Stickley Nov 09 '12 at 17:07
  • Thanks for your suggestions and references, Hank. In a twist of events, my site was overwhelmed with dozens of accesses per second for a couple of hours on Nov 8th or 9th. But this time it wasn't Facebook, but Amazon. It suddenly started massively spidering a huge bunch of links within the site, but there doesn't seem to be any obvious pattern - some pages accessed are obscure/old pages, while others are the latest ones. I wonder if they are refreshing their own search engine database. – Mongrel Jedi Nov 11 '12 at 08:14
  • The same fix will work for Amazon as well as facebookexternalhit. See http://stackoverflow.com/questions/11521798/excessive-traffic-from-facebookexternalhit-bot/13276722#13276722 and just add some conditional ORs to check for a couple of user agents. – Stickley Nov 13 '12 at 05:14
  • Thank you, Hank. Btw, perhaps you could further optimize the code by removing the need to read/write into the log file, and directly use/update the file's timestamp for comparison. – Mongrel Jedi Nov 13 '12 at 08:22

3 Answers


As discussed in this similar question on Facebook and Crawl-delay, Facebook does not consider itself a bot, and doesn't even request your robots.txt, much less pay attention to its contents.

You can implement your own rate-limiting code as shown in the similar question linked above. The idea is to simply return HTTP code 503 when your server is over capacity, or when it is being inundated by a particular user agent.
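For illustration, here is a minimal PHP sketch of that idea (not the exact code from the linked answer): it assumes it runs before any expensive work, e.g. near the top of a WordPress index.php, and the marker-file name and two-second threshold are arbitrary choices. Following the suggestion in the comments above, it compares against a marker file's timestamp instead of reading and writing a log file.

    <?php
    // Minimal sketch: throttle a heavy crawler by answering 503 when its
    // requests arrive too close together. Run this before any expensive work.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Add other user agents (e.g. Amazon's bot) to this check as needed.
    if (stripos($ua, 'facebookexternalhit') !== false) {
        $marker       = sys_get_temp_dir() . '/crawler_throttle';  // arbitrary marker file
        $min_interval = 2;                                          // seconds between allowed crawler hits

        // Compare against the marker file's mtime rather than a log file.
        if (file_exists($marker) && (time() - filemtime($marker)) < $min_interval) {
            header('HTTP/1.1 503 Service Temporarily Unavailable');
            header('Retry-After: 60');
            exit;
        }
        touch($marker);  // record the time of this allowed crawler request
    }

Regular visitors are unaffected because the check only fires when the user agent matches; well-behaved bots should back off when they see the 503 and the Retry-After header.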

It appears those working for huge tech companies don't understand that "improve your caching" is something small companies don't have the budget to handle. We are focused on serving our customers who actually pay money, and don't have time to fend off rampaging web bots from "friendly" companies.

Stickley

We saw the same behaviour at about the same time (mid-October) - floods of requests from Facebook that caused queued requests and slowness across the system. To begin with, it was every 90 minutes; over a few days this increased in frequency and became randomly distributed.

The requests appeared not to respect robots.txt, so we were forced to think of a different solution. In the end we set up nginx to forward all requests with a Facebook user agent to a dedicated pair of backend servers. If we had been using nginx > v0.9.6 we could have used a nice regex for this, but we weren't, so we used a mapping along the lines of:

    map $http_user_agent $fb_backend_http {
        "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)"
            127.0.0.1:80;
    }
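
For completeness, here is a rough sketch of how the mapped variable might then be consumed in a server block (the listen port and the fallback backend address are placeholders, not our actual config):

    server {
        listen 80;

        location / {
            # $fb_backend_http is set by the map above for the matching
            # Facebook user agent and is empty otherwise, so only those
            # requests are proxied to the dedicated backend. (In practice the
            # address in the map must point at the dedicated backend pair,
            # not back at this front-end server.)
            if ($fb_backend_http) {
                proxy_pass http://$fb_backend_http;
            }

            # Placeholder for the normal application backend.
            proxy_pass http://127.0.0.1:8080;
        }
    }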

This has worked nicely for us; during the couple of weeks that we were getting hammered this partitioning of requests kept the heavy traffic away from the rest of the system.

It seems to have largely died down for us now - we're just seeing intermittent spikes.

As to why this happened, I'm still not sure - there seems to have been a similar incident in April that was attributed to a bug http://developers.facebook.com/bugs/409818929057013/ but I'm not aware of anything similar more recently.

annaken
  • Thank you for sharing. I am using Apache - hopefully it has a similar approach to re-mapping requests by user agent. But that would presume I have another good server to offload these dynamic accesses to, as they are not static pages; otherwise I'll have to discard the requests entirely and hope that FB doesn't treat my site as invalid. Similar to what you observed, the incident stopped shortly thereafter. It could be some haywire FB process - but it is certainly bad practice on their end not to respect robots.txt. – Mongrel Jedi Nov 04 '12 at 06:31
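In case it helps, a rough Apache sketch of the same user-agent partitioning idea (a sketch only, assuming mod_rewrite, mod_proxy and mod_proxy_http are enabled in a virtual host; the server name, document root and backend address are all placeholders):

    <VirtualHost *:80>
        # Placeholder server name and document root for the normal site.
        ServerName example.com
        DocumentRoot /var/www/html

        RewriteEngine On
        # Requests whose User-Agent contains facebookexternalhit are proxied
        # to a dedicated backend (placeholder address); everything else is
        # served from the normal site as usual.
        RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC]
        RewriteRule ^(.*)$ http://127.0.0.1:8080$1 [P,L]
    </VirtualHost>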

Whatever Facebook invented, you definitely need to fix your server, as it is possible to crash it with external requests.

Also, here is the first hit on Google for facebookexternalhit: http://www.facebook.com/externalhit_uatext.php

Serge
  • Thanks. I did check out that FB uatext page, although it didn't offer anything specific. The pages that are crashing my server are from the WordPress blog section, which contains a few thousand posts. Unfortunately, the engine is not efficient enough even with all the tweaks and QuickCache installed, and the only quick fix I could think of was to implement a robots.txt crawl delay, but I don't know if FB respects it. I've not had problems with Google's crawl though, as it is spread throughout the day. FB pounces on tons of pages all in one go and kills the server. – Mongrel Jedi Oct 14 '12 at 09:19
  • I got one more reason why I don't like FB :) – Serge Oct 14 '12 at 09:23