
I would like for Google to ignore URLs like this:

https://www.example.com/blog/category/web-development?page=2

My links are getting indexed by Google, and I need to stop that. What should I use to keep them from being indexed?

This is my current robots.txt file:

Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /privacy
Disallow: /404.html
Disallow: /500.html
Disallow: /tweets
Disallow: /tweet/

Can I use this to disallow them?

Disallow: /blog/category/*?*
  • @Machavity: I don’t think that this question asks for SEO advice. It’s a plain specification-based question (to answer it, only the robots.txt spec + Google’s extension of it are relevant). – unor Jul 11 '18 at 15:50
  • @Machavity it's rare that I disagree with you, but... what unor said. – Paul Roub Jul 11 '18 at 15:55
  • Close vote retracted – Machavity Jul 11 '18 at 15:59

1 Answer


With robots.txt, you can prevent crawling, not necessarily indexing. To keep pages out of Google’s index, you would typically need a noindex robots meta tag or X-Robots-Tag HTTP header instead, and Google can only see those if it is allowed to crawl the page.

If you want to disallow Google from crawling URLs

  • whose paths start with /blog/category/, and
  • that contain a query component (e.g., ?, ?page, ?page=2, or ?foo=bar&page=2)

then you can use this:

Disallow: /blog/category/*?

You don’t need another * at the end because Disallow values represent the start of the URL (beginning from the path).
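
To illustrate how that rule behaves, here is a minimal sketch in Python (my own illustration, not Google’s actual matcher; the function name is made up, and Google’s $ end-anchor is deliberately left out): * is translated to “any sequence of characters” and the value is anchored at the start of the path.

import re

# Hypothetical helper approximating Google's Disallow matching:
# '*' matches any sequence of characters, and the value is anchored
# at the start of the path. (Google's '$' end-anchor is not handled.)
def google_style_match(disallow_value, path_and_query):
    pattern = re.escape(disallow_value).replace(r"\*", ".*")
    # re.match anchors at the start only, so no trailing wildcard
    # is needed -- matching the behavior described above.
    return re.match(pattern, path_and_query) is not None

rule = "/blog/category/*?"

print(google_style_match(rule, "/blog/category/web-development?page=2"))  # True (blocked)
print(google_style_match(rule, "/blog/category/web-development"))         # False (no query)
print(google_style_match(rule, "/blog/other?page=2"))                     # False (different path)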

But note that this is not supported by all bots. According to the original robots.txt spec, the * has no special meaning, so conforming bots would interpret the above line literally (with * as part of the path). If you were to follow only the rules from the original specification, you would have to list each category path explicitly:

Disallow: /blog/category/c1?
Disallow: /blog/category/c2?
Disallow: /blog/category/c3?
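
A quick sketch of that stricter original-spec behavior (again my own illustration, not any bot’s real code): matching is a plain literal prefix test, so the wildcard rule never matches a real URL, while the enumerated rules do.

# Original robots.txt spec: a Disallow value is a literal path prefix;
# '*' has no special meaning.
def original_spec_match(disallow_value, path_and_query):
    return path_and_query.startswith(disallow_value)

# The wildcard rule is taken literally, so real URLs never match it:
print(original_spec_match("/blog/category/*?", "/blog/category/c1?page=2"))   # False

# The enumerated rules above do match:
print(original_spec_match("/blog/category/c1?", "/blog/category/c1?page=2"))  # True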
unor