10

Is it possible in robots.txt to give one instruction to multiple bots without repeatedly having to mention it?

Example:

User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/
elhombre
  • I posted a very similar question on Webmasters.stackexchange because I figured it would be more appropriate there. Then I saw this question was already asked here, so just figured I'd backlink in case anybody else wants to read additional responses: http://webmasters.stackexchange.com/questions/59560/combine-user-agents-in-robots-txt?noredirect=1#comment62106_59560 – davidcondrey Mar 20 '14 at 00:53

5 Answers

17

Note: since this answer was originally written, Google's description has been substantially rewritten and no longer has any ambiguity on this topic. Moreover, there is finally a formal standard in the form of RFC 9309.

The conclusion below stands: there is a recognised way of grouping user agents, but you might still wish to use the simplest possible format to accommodate unsophisticated crawlers.

Original answer follows.


It's actually pretty hard to give a definitive answer to this, as there isn't a very well-defined standard for robots.txt, and a lot of the documentation out there is vague or contradictory.

The description of the format understood by Google's bots is quite comprehensive, and includes this slightly garbled sentence:

Muiltiple start-of-group lines directly after each other will follow the group-member records following the final start-of-group line.

Which seems to be groping at something shown in the following example:

user-agent: e
user-agent: f
disallow: /g

According to the explanation below it, this constitutes a single "group", disallowing the same URL for two different User Agents.

So the correct syntax for what you want (at least for any bot that works the same way as Google's) would then be:

User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/

However, as Jim Mischel points out, there is no point in a robots.txt file which some bots will interpret correctly, but others may choke on, so it may be best to go with the "lowest common denominator" of repeating the blocks, perhaps by dynamically generating the file with a simple "recipe" and update script.
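For illustration, here is a minimal sketch of that "recipe" idea in Python (the bot names and path are just the ones from the question; the helper name and output file are my own choices):

# Regenerate robots.txt by repeating the same rule block once per bot,
# so even the most unsophisticated crawler only ever sees the
# one-agent-per-group form.
BOTS = ["googlebot", "yahoobot", "microsoftbot"]
DISALLOWED_PATHS = ["/boringstuff/"]

def build_robots_txt(bots, disallowed_paths):
    blocks = []
    for bot in bots:
        lines = [f"User-agent: {bot}"]
        lines += [f"Disallow: {path}" for path in disallowed_paths]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

if __name__ == "__main__":
    with open("robots.txt", "w") as f:
        f.write(build_robots_txt(BOTS, DISALLOWED_PATHS))

Running this writes three identical blocks, one per bot, which is exactly the lowest-common-denominator output described above.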

IMSoP
  • I don't have time to rewrite the answer right now, but for anyone finding this, note that [Google's description has been substantially rewritten](https://developers.google.com/search/reference/robots_txt) and [there is finally an attempt to make a formal standard](https://webmasters.googleblog.com/2019/07/rep-id.html)! – IMSoP Mar 09 '20 at 11:38
7

I think the original robots.txt specification defines it unambiguously: one User-agent line can only have one value.

A record (aka. a block, a group) consists of lines. Each line has the form

<field>:<optionalspace><value><optionalspace>

User-agent is a field. Its value:

The value of this field is the name of the robot the record is describing access policy for.

It’s singular ("name of the robot"), not plural ("the names of the robots").

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If several values were allowed, how could parsers possibly be liberal? Whatever the delimiting character might be (a comma, a space, a semicolon, …), it could itself be part of a robot name.
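As a minimal sketch (the function name is mine), this is roughly what the recommended liberal matching looks like, and what a hypothetical multi-value line would do to it:

# Liberal matching as the original spec recommends: a case-insensitive
# substring match of the record's User-agent value against the robot's
# own name, ignoring version information.
def record_applies(record_value: str, robot_name: str) -> bool:
    return record_value.lower() in robot_name.lower()

# A single value per line works as intended:
print(record_applies("googlebot", "Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True

# A hypothetical space-delimited multi-value line matches nothing, because
# the combined string is never a substring of any one robot's name:
print(record_applies("googlebot yahoobot microsoftbot", "Googlebot/2.1"))  # False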

The record starts with one or more User-agent lines

Why would the spec provide for several User-agent lines per record if you could simply put several values in one line?

In addition:

  • the specification doesn’t define a delimiting character to provide several values in one line
  • it doesn’t define/allow it for Disallow either

So instead of

User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/

you should use

User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/

or (probably safer, as you can’t be sure whether all relevant parsers support the less common approach of having several User-agent lines per record)

User-agent: googlebot
Disallow: /boringstuff/

User-agent: yahoobot
Disallow: /boringstuff/

User-agent: microsoftbot
Disallow: /boringstuff/

(or, of course, simply User-agent: * if the policy should apply to every bot)
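If you want to check how a given parser actually treats the grouped form, Python's standard-library urllib.robotparser is a quick way to experiment (the bot names are just the ones from the question; the expected results in the comments assume a grouping-aware parser):

from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Expected False if the parser honours the multi-line group:
print(parser.can_fetch("googlebot", "/boringstuff/page.html"))
# No group matches and there is no "*" group, so this should fall back to allow:
print(parser.can_fetch("somebot", "/boringstuff/page.html"))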

unor
  • Aha, I skim read that doc and missed the crucial phrase - "The record starts with one or more User-agent lines" - which is actually a lot clearer than the Google document I quoted. – IMSoP Dec 01 '13 at 15:21
2

According to the original robots.txt exclusion protocol:

User-agent

The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

I have never seen multiple bots listed in a single line. And it's likely that my web crawler would not have correctly handled such a thing. But according to the spec above, it should be legal.

Note also that even if Google were to support multiple user agents in a single directive, or the multiple User-agent lines described in IMSoP's answer (interesting find, by the way ... I didn't know that one), not all other crawlers will. You need to decide whether you want the convenient syntax that quite possibly only Google and Bing bots will support, or the more cumbersome but simpler syntax that all polite bots support.

Jim Mischel
  • I read through that before I left my answer, and was frankly none the wiser: On the one hand, it could be describing multiple user agents on one line, although if so, it doesn't say how to separate them (whitespace? commas?). On the other hand, "record" here could be what Google calls a "group", and "more than one User-agent field" could mean "more than one line beginning 'User-agent:'". – IMSoP Nov 30 '13 at 14:51
  • I would have to agree that the "lowest common denominator" approach is the sensible one here, though. I'm somewhat surprised there's still no better definition out there; [this "better documentation" from Bing](http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx) is laughable. – IMSoP Nov 30 '13 at 14:52
0

You have to put each bot on a different line.

http://en.wikipedia.org/wiki/Robots_exclusion_standard

SamV
  • The only thing I read there was "It is also possible to list multiple robots **with their own rules.**)", but it doesn't tell me if I can specify one rule for all robots without having to repeat myself with the same directive for each robot – elhombre Nov 29 '13 at 23:12
-1

As mentioned in the accepted answer, the safest approach is to add a new entry for each bot.

This repo has a good robots.txt file for blocking a lot of bad bots: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Code on the Rocks
  • I would not recommend using that robots.txt file (or any other huge list of bad bot robots.txt). Just add problematic bots to robots.txt as they cause problems for your site. There will probably only be a few. – Stephen Ostermiller Nov 02 '22 at 17:18
  • Is there any technical downside to using this one? I think most of these bots are notoriously bad – Code on the Rocks Nov 02 '22 at 19:04
  • 1. Bad bots don't typically obey robots.txt, so it isn't usually the right tool for the job. Only the good bots actually honor it, so listing bad bots in it is typically pointless. – Stephen Ostermiller Nov 03 '22 at 12:42
  • 2. Most bots (including search engine bots) have a line or size limit for how much robots.txt they are willing to process. Having a huge robots.txt makes the entire file invalid to many bots. – Stephen Ostermiller Nov 03 '22 at 12:43
  • 3. That list has some dubious entries. For example TurnitinBot crawls the web to identify plagiarism. If you block the bot, plagiarism from your website won't be detected by their tool. You may want to allow them to crawl to make it harder to copy your website undetected. You should know the reason for each and every bot you block and what the benefit and cost will be. – Stephen Ostermiller Nov 03 '22 at 12:48
  • It also lists several SEO analysis tools that won't work if you try to use them with their crawler blocked. – Stephen Ostermiller Nov 03 '22 at 12:49
  • Wow thank you for the detailed response. I am convinced that I should just add problematic bots - thanks! – Code on the Rocks Nov 03 '22 at 13:05