Your robots.txt is correct.
For example, MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.
If you checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message:
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
If you don’t want to wait, you could try whether it works when you target MJ12bot directly:
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20
(I changed the Crawl-Delay to 20 because that’s the maximum value they support; specifying a higher value should be no problem, as they round it down.)
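If you want to sanity-check such a record locally, Python’s standard-library urllib.robotparser implements the original robots.txt rules and can evaluate it. A minimal sketch, reusing the example host and paths from above:

```python
from urllib.robotparser import RobotFileParser

# Parse the MJ12bot-specific record shown above.
rp = RobotFileParser()
rp.parse("""\
User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-delay: 20
""".splitlines())

# The record applies only to MJ12bot; since there is no "User-agent: *"
# record, other bots fall through to the default (everything allowed).
print(rp.can_fetch("MJ12bot", "http://example.com/reisen/42/"))      # False
print(rp.can_fetch("MJ12bot", "http://example.com/42/reisen/"))      # True
print(rp.can_fetch("SomeOtherBot", "http://example.com/reisen/42/")) # True
print(rp.crawl_delay("MJ12bot"))                                     # 20
```

Note that this only tells you what a spec-following parser would do; whether MJ12bot itself honors the record is up to Majestic.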
Update
Why might they crawl http://example.com/42/reisen/? That might actually be my problem, since the URL has the form example.com/de/reisen/ or example.com/en/travel/ ... Should I change to */travel/ then?
A Disallow value always matches from the beginning of the URL path. If you want to disallow crawling of http://example.com/de/reisen/, any of the following lines would achieve it:
Disallow: /
Disallow: /d
Disallow: /de
Disallow: /de/
Disallow: /de/r
etc.
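You can see this prefix behavior with Python’s standard-library robots.txt parser, which follows the original specification. A sketch, using the partial prefix /de/r from the list above:

```python
from urllib.robotparser import RobotFileParser

# A Disallow value matches any URL path that starts with it,
# even a partial path segment like "/de/r".
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /de/r
""".splitlines())

print(rp.can_fetch("*", "http://example.com/de/reisen/"))  # False
print(rp.can_fetch("*", "http://example.com/de/rom/"))     # False (same prefix)
print(rp.can_fetch("*", "http://example.com/de/"))         # True
```

The /de/rom/ case (a hypothetical path, not from the question) shows why a short prefix can block more than you intend.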
In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would only block URLs whose path literally starts with /*/travel/.
Some bots support it, though (including Googlebot). The documentation for MJ12bot says:
Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
I don’t know which Yahoo spec they refer to, but it seems likely that they support the * wildcard, too.
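For what it’s worth, a parser that implements only the original spec really does treat the * literally. A sketch with Python’s urllib.robotparser (which does not implement wildcards):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /*/travel/
""".splitlines())

# No wildcard expansion: the language-prefixed path stays crawlable,
# and only a path literally starting with /*/travel/ is blocked.
print(rp.can_fetch("*", "http://example.com/en/travel/"))  # True
print(rp.can_fetch("*", "http://example.com/*/travel/"))   # False
```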
But if possible, it would of course be better to rely on the standard features, e.g.:
User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/