I would like Google to ignore URLs like this:

http://www.mydomain.example/new-printers?dir=asc&order=price&p=3

In other words, all the URLs that have the parameters dir, order, and p should be ignored. How do I do so with robots.txt?

– Luis Valencia

3 Answers

Here's a solution if you want to disallow all query strings:

Disallow: /*?*

Or, if you want to be more precise about the query string:

Disallow: /*?dir=*&order=*&p=*

You can also add an Allow line to robots.txt to specify which URL should stay crawlable:

Allow: /new-printer$

The $ anchors the pattern to the end of the URL, so only /new-printer itself is allowed, and not, say, /new-printers or /new-printer?dir=asc.
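Putting these together, a minimal robots.txt for this case might look like the following sketch (assuming the rules should apply to all crawlers; note, per the comments below, that Allow is not part of the original standard, even though the major engines support it):

User-agent: *
Allow: /new-printer$
Disallow: /*?dir=*&order=*&p=*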

More info:

http://code.google.com/web/controlcrawlindex/docs/robots_txt.html

http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

– Book Of Zeus
  • this will disallow new-printers; I only want to disallow the query string part – Luis Valencia Feb 05 '12 at 15:02
  • so you want to allow `/new-printer` but not `/new-printers?dir=*&order=*&p=*`? – Book Of Zeus Feb 05 '12 at 15:05
  • Are those advanced wildcards and the allow directive supported well? – Tony McCreath Jan 15 '13 at 14:34
  • According to http://www.robotstxt.org/robotstxt.html - "there is no "Allow" field" – Jamie Edwards Apr 22 '13 at 09:38
  • Taking the new-printers example a bit further, what if different combinations and orders of parameters are possible on that file. Can you specify in a single query that a specific file should be disallowed if any kind of parameters are added to it without explicitly specifying them? Would... Disallow: /new-printer?* work? – AdamJones Aug 27 '14 at 21:20
  • @AdamJones the last command should work. It will follow the same logic as the first condition. I never tried it so I can't guarantee it will work. – Book Of Zeus Aug 27 '14 at 22:25
  • @JamieEdwards it's true that "Allow" is technically speaking not part of the standard, but most of the popular search engines do support it. Allow lines should be *before* Disallow lines though. – Andy Madge Oct 06 '14 at 16:36
  • @BookOfZeus Will the page be crawled or not, if we add the said condition in `robots.txt`? – Pranav Bilurkar Aug 02 '17 at 08:11
  • There is now (as of 2019) a proposed standard undergoing ratification, and it does include Allow lines - https://datatracker.ietf.org/doc/html/draft-koster-rep - perhaps surprisingly, it appears there was no formal "standard" previous to this, and search engines were left to their own devices to operate "by convention" with a "de facto" standard that led to spotty support for Allow lines except for the big ones (eg Google and Bing). – Matt Wagner Jun 04 '21 at 13:56

You can block those specific query string parameters with the following lines:

Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=

So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
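If you want to sanity-check which URLs these Google-style wildcard rules would catch, here is a small Python sketch (the rule_to_regex helper is illustrative, not a library function; Python's built-in urllib.robotparser follows the original standard and does not understand * or $ wildcards, so the pattern is translated into a regular expression by hand):

import re

def rule_to_regex(pattern):
    # Google-style robots.txt matching: '*' matches any run of
    # characters, and a trailing '$' anchors the match to the end
    # of the URL; otherwise a rule is a prefix match.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile(".*".join(parts) + ("$" if anchored else ""))

rules = ["/*?*dir=", "/*?*order=", "/*?*p="]

for url in [
    "/new-printers?dir=asc&order=price&p=3",
    "/new-printers?order=price",
    "/new-printers",
]:
    blocked = any(rule_to_regex(r).match(url) for r in rules)
    print(url, "->", "blocked" if blocked else "allowed")

This reports the first two URLs as blocked and /new-printers as allowed, since only URLs whose query string contains dir=, order=, or p= match a rule.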

– Nick Rolando

Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.

Site Configuration -> URL Parameters

You should also have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag.
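For example, a minimal sketch of such a tag in each page's <head> (use noindex, nofollow instead if you also want crawlers to ignore the links on those pages):

<meta name="robots" content="noindex">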

– Tony McCreath
  • While the original question mentions Google specifically, it's important to note that the Google WebMaster Tools would only block Google. Adding the Disallow rules in the robots.txt file would address other search engines as well. – Matt V. Jan 14 '13 at 20:37
  • True. It should also be clarified that robots.txt does not stop Google indexing pages but stops it reading their content. The best solution is using the robots meta tag on the page itself. This is supported by all systems. – Tony McCreath Jan 15 '13 at 14:35
  • Note that this doesn't work anymore since they removed that functionality, see https://developers.google.com/search/blog/2022/03/url-parameters-tool-deprecated – Joël Aug 04 '22 at 08:15