Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
-
robots.txt has nothing to do with what the "site accepts". It's just a posted list of rules that well-behaved agents are expected to obey. Your only recourse when the rules are broken is to use a different mechanism to ban by IP or user-agent. – Eclipse Mar 22 '09 at 20:09
-
I agree with you: I couldn't express the concept any better because of my rather poor English. – Mar 29 '09 at 08:08
-
As spiders generate a lot of activity on your server, I'm interested in allowing access only to those from the major search engines (mainly Google) that bring visits to my website. The reason is that I'm going to start an Amazon EC2 VPS and don't want to pay for the traffic and CPU usage that so many spiders can cause. Maybe it's not significant, but the idea seems quite reasonable to me. – Jan 30 '13 at 22:30
-
@user2027230 You have clearly not grasped the intent of the internet, which is to make your data publicly available (to all). – Marcus Aug 25 '16 at 18:04
-
@Marcus not to those who scrape your site, who consume your server resources, who crash your server and render your site unusable. – yenren Oct 01 '16 at 05:30
-
I respect the bots that respect my `robots.txt`. Don't block the good people who respect your `robots.txt`, because the first thing that bad people do is *ignore* your `robots.txt`. – Accountant م Aug 22 '18 at 14:37
5 Answers
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Slurp
Allow: /
User-Agent: msnbot
Disallow:
Slurp is Yahoo's robot
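One way to sanity-check rules like these is Python's standard-library `urllib.robotparser`. The following is a minimal sketch, assuming the rules are written one directive per line; the bot name `ExampleBot` and the path `/page.html` are made up for illustration, and real crawlers may match user-agent tokens differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from this answer, one directive per line.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-Agent: msnbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str = "/page.html") -> bool:
    """Return True if `agent` may fetch `path` under RULES."""
    return parser.can_fetch(agent, path)


# "ExampleBot" is a hypothetical crawler that is not listed in RULES.
for agent in ("Googlebot", "Slurp", "msnbot", "bingbot", "ExampleBot"):
    print(f"{agent:12} {'allowed' if allowed(agent) else 'blocked'}")
```

Under this parser, `bingbot` matches none of the named groups and falls through to the `*` group, so it is treated like any other unlisted agent unless it gets a group of its own.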

-
Google, MSN, and Yahoo have other spiders that you may want to `Allow` as well (e.g. msnbot-media, bingbot). Also, bingbot is the Microsoft spider that I see the most in logs for sites I operate. – T. Brian Jones Sep 07 '13 at 19:39
-
My website is getting more visits per day, and I think they are bot visits. I want to block these visits from bots. Will the robots.txt above block all other visitors while still allowing Google, Yahoo and MSN? Will this work for me? – Bhavin Dec 03 '18 at 05:22
Why?
Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary.
But, if you insist on doing it anyway, that's what the User-Agent: line in robots.txt is for.
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /
With lines for all the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.
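The `*` group can come before or after the named groups; a compliant parser is expected to pick the most specific matching group regardless of position. Here is a small sketch using Python's `urllib.robotparser` to compare both orderings. `ExampleBot` is a hypothetical unlisted crawler, and this only demonstrates the behaviour of this one parser.

```python
from urllib.robotparser import RobotFileParser

# Same two groups, in opposite order.
SPECIFIC_FIRST = [
    "User-agent: googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]
WILDCARD_FIRST = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: googlebot",
    "Disallow:",
]


def verdicts(lines, agents=("Googlebot", "ExampleBot")):
    """Map each agent name to whether it may fetch "/" under `lines`."""
    parser = RobotFileParser()
    parser.parse(lines)
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


print(verdicts(SPECIFIC_FIRST))
print(verdicts(WILDCARD_FIRST))
```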

-
"I'm only OK with big players scraping my site" is not nice to the smaller, up-and-coming players. I wish I could upvote your "Why?" a thousand times more. I mean, if you're fine with the current state of things, i.e. everyone's in Google's lap, then by all means, go ahead and exclude all other crawlers. – Marcus Aug 25 '16 at 17:37
-
I have to disagree. The thing is, there are many up-and-coming players, and they put too much pressure on bandwidth, especially if you have a large website with thousands of new links every day... then you may want to get rid of those that account for barely 1% of internet searches and go with the big 3 instead. – jjj Jul 21 '17 at 18:28
-
@jjj if some particular bot is scraping your site too aggressively, you can use robots.txt to ask it to stop. And of course if it's just one site blocking everyone but Google, no one will care. But if a notable portion of sites followed your advice, then robots.txt would become the standard for locking in Google's monopoly, and every other bot would either ignore it or alternatively pretend to be Googlebot. – derobert Jul 21 '17 at 19:42
There are more than 3 major search engines, depending on which country you are talking about. Facebook seems to be doing a good job of listing only legitimate ones: https://facebook.com/robots.txt
So your robots.txt can be something like:
User-agent: Applebot
Allow: /
User-agent: baiduspider
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Facebot
Allow: /
User-agent: Googlebot
Allow: /
User-agent: msnbot
Allow: /
User-agent: Naverbot
Allow: /
User-agent: seznambot
Allow: /
User-agent: Slurp
Allow: /
User-agent: teoma
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: Yandex
Allow: /
User-agent: Yeti
Allow: /
User-agent: *
Disallow: /
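A list like this is tedious to maintain by hand, so it can be generated from a plain list of names. Below is a minimal sketch; the helper `build_robots_txt` and the test name `UnlistedCrawler` are made up for illustration, and the agent list is simply the one from this answer.

```python
from urllib.robotparser import RobotFileParser

# Agent tokens taken from the robots.txt above.
ALLOWED_AGENTS = [
    "Applebot", "baiduspider", "Bingbot", "Facebot", "Googlebot",
    "msnbot", "Naverbot", "seznambot", "Slurp", "teoma",
    "Twitterbot", "Yandex", "Yeti",
]


def build_robots_txt(allowed_agents):
    """Build an allowlist robots.txt: one Allow group per agent, then a catch-all Disallow."""
    blocks = [f"User-agent: {agent}\nAllow: /\n" for agent in allowed_agents]
    blocks.append("User-agent: *\nDisallow: /\n")
    # Joining with "\n" leaves one blank line between groups.
    return "\n".join(blocks)


ROBOTS_TXT = build_robots_txt(ALLOWED_AGENTS)

# Check the generated file with the standard-library parser.
checker = RobotFileParser()
checker.parse(ROBOTS_TXT.splitlines())

print(ROBOTS_TXT)
print("Yandex allowed:", checker.can_fetch("Yandex", "/"))
print("UnlistedCrawler allowed:", checker.can_fetch("UnlistedCrawler", "/"))
```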

As everyone knows, robots.txt is a standard that crawlers are expected to obey, so only well-behaved agents do so. For badly behaved agents, it doesn't matter whether you put one up or not.
If you have some data that you do not show on the site either, you can just change its permissions to improve security.
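Since robots.txt is only advisory, actual enforcement has to happen on the server. As a rough sketch of the "ban by user-agent" idea, here is a small WSGI middleware that refuses requests whose User-Agent looks like a bot but is not on an allowlist. The token lists, the names `is_blocked` and `block_unlisted_bots`, and the demo app are all assumptions for illustration. User-Agent strings are trivially spoofed, so this only stops bots that identify themselves honestly.

```python
# Assumed allowlist of crawler tokens (lowercase substrings of the User-Agent).
ALLOWED_BOT_TOKENS = ("googlebot", "slurp", "msnbot", "bingbot")
# Crude, assumed heuristic for "this User-Agent is some kind of crawler".
BOT_HINTS = ("bot", "spider", "crawl", "slurp")


def is_blocked(user_agent: str) -> bool:
    """True if the User-Agent looks like a crawler that is not on the allowlist."""
    ua = user_agent.lower()
    looks_like_bot = any(hint in ua for hint in BOT_HINTS)
    on_allowlist = any(token in ua for token in ALLOWED_BOT_TOKENS)
    return looks_like_bot and not on_allowlist


def block_unlisted_bots(app):
    """Wrap a WSGI app so that unlisted crawlers receive 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


protected_app = block_unlisted_bots(demo_app)
```

The wrapped `protected_app` can be served by any WSGI server, for example `wsgiref.simple_server.make_server` from the standard library.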

Crawl-Delay could also help if bandwidth is an issue
User-agent: *
Disallow: /
Crawl-Delay: 10
Sitemap: https://yoursite.com/sitemapindex.xml
User-agent: Googlebot
Allow: /
User-agent: Slurp
Allow: /
User-Agent: msnbot
Allow: /
User-agent: Applebot
Allow: /
User-agent: baiduspider
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
Disallow:
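To see which agents the `Crawl-Delay` line actually applies to, the file can be fed to Python's `urllib.robotparser` (the `site_maps()` call needs Python 3.8 or later). This is a sketch using only the first few groups from above; `UnlistedCrawler` is a made-up name, and other parsers may read the file differently.

```python
from urllib.robotparser import RobotFileParser

# First few groups of the robots.txt above.
RULES = """\
User-agent: *
Disallow: /
Crawl-Delay: 10
Sitemap: https://yoursite.com/sitemapindex.xml
User-agent: Googlebot
Allow: /
User-agent: Slurp
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def summary(agent: str) -> dict:
    """Report whether `agent` may fetch "/" and which crawl delay applies to it."""
    return {
        "can_fetch": parser.can_fetch(agent, "/"),
        "crawl_delay": parser.crawl_delay(agent),
    }


print("Googlebot:", summary("Googlebot"))
print("UnlistedCrawler:", summary("UnlistedCrawler"))
print("Sitemaps:", parser.site_maps())
```

Under this parser the delay attaches only to the `*` group, which is already fully disallowed, so the named bots get no delay. Throttling a named bot would seem to require a `Crawl-Delay` line inside that bot's own group, and since the directive is not part of the core robots.txt standard, support for it varies between crawlers.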
