
It seems that some bots are not following my robots.txt file, including MJ12bot, the crawler from majestic.com, which is supposed to honor these instructions.

The file looks like this:

User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30

User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30

What I aim to tell the bots is that:

  • Only Google may crawl URLs containing /travel/, /viajar/ or /reisen/.
  • No bot should access any URL containing /results/.
  • The time between two requests should be at least 30 seconds.

However, MJ12bot is crawling URLs containing /travel/, /viajar/ or /reisen/ anyway, and in addition it does not wait 30 seconds between requests.

mydomain.com/robots.txt serves the file as expected.

Is there anything wrong with the file?
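For reference, a strict parser seems to agree with my reading of the file. This is a quick local check with Python's standard urllib.robotparser (a minimal sketch; the test URLs are made up):

import urllib.robotparser

# Parse the exact file content shown above.
robots = """
User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30

User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots.splitlines())

# Googlebot matches the first group, which only blocks /results/:
print(rp.can_fetch("googlebot", "https://mydomain.com/travel/paris"))  # True
# MJ12bot falls through to the * group, so /travel/ is disallowed:
print(rp.can_fetch("MJ12bot", "https://mydomain.com/travel/paris"))    # False
print(rp.can_fetch("MJ12bot", "https://mydomain.com/results/x"))       # False
print(rp.crawl_delay("MJ12bot"))                                       # 30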


1 Answer


Your robots.txt is correct.

For example, MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.

If you have checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message; they state:

We are keen to see any reports of potential violations of robots.txt by MJ12bot.

If you don’t want to wait, you could try targeting MJ12bot directly and see whether that works:

User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20

(I changed the Crawl-Delay to 20 because that’s the maximum value they support. Specifying a higher value should be no problem, though; they round it down.)
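As a quick sanity check, such a dedicated group is matched via the MJ12bot product token rather than via *. A minimal sketch with Python’s standard urllib.robotparser (the version suffix in the user-agent string is a made-up example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20
""".splitlines())

# The parser compares group names against the product token before the "/":
print(rp.can_fetch("MJ12bot/v1.4.8", "https://example.com/travel/x"))  # False
print(rp.crawl_delay("MJ12bot/v1.4.8"))                                # 20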

Update

Why might they crawl http://example.com/42/reisen/? That might actually be my problem, since the URL has the form example.com/de/reisen/ or example.com/en/travel/... Should I change to */travel/ then?

A Disallow value always matches the beginning of the URL path: it is a plain prefix, not a pattern.

If you want to disallow crawling of http://example.com/de/reisen/, each of the following lines would achieve that:

Disallow: /
Disallow: /d
Disallow: /de
Disallow: /de/
Disallow: /de/r

etc.
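To make the prefix matching concrete, here is a small sketch using Python's standard urllib.robotparser, which implements exactly this original prefix behavior (the helper function is mine):

import urllib.robotparser

def blocked(disallow_value, url_path):
    # Build a one-rule robots.txt and test whether url_path is disallowed.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    return not rp.can_fetch("MJ12bot", "https://example.com" + url_path)

print(blocked("/reisen/", "/de/reisen/"))  # False - prefix does not match
print(blocked("/de/", "/de/reisen/"))      # True
print(blocked("/d", "/de/reisen/"))        # True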

In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would literally block http://example.com/*/travel/.

Some bots support it, though (including Googlebot). The MJ12bot documentation says:

Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification

I don’t know the Yahoo spec they refer to, but it seems likely that they’d support it, too.
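For comparison, Python's standard-library parser follows the original specification and treats the * literally (a quick sketch; the paths are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /*/travel/"])

# No wildcard support here: /en/travel/ does not begin with the literal /*/travel/.
print(rp.can_fetch("MJ12bot", "https://example.com/en/travel/"))  # True (not blocked)
print(rp.can_fetch("MJ12bot", "https://example.com/*/travel/"))   # False (literal match)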

But if possible, it would of course be better to rely on the standard features, e.g.:

User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/
  • Why might they crawl `http://example.com/42/reisen/`? That might actually be my problem, since the URL has the form `example.com/de/reisen/` or `example.com/en/travel/`... Should I change to `*/travel/` then? – J0ANMM Jul 10 '18 at 10:32
  • I found some more information [here](https://geoffkenyon.com/how-to-use-wildcards-robots-txt/). It seems that the proper way would be `/*/travel/`. Did I understand that correctly? If so, you can adapt the answer and I will gladly mark it as accepted. – J0ANMM Jul 10 '18 at 10:41