
My intention is to fetch the link with PHP and parse the content with Simple HTML DOM Parser (or something similar) to look for H1-H6 tags. But before that, I need to find out whether the page is being indexed at all.
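Roughly, I have something like this in mind (a sketch using PHP's built-in DOMDocument instead of Simple HTML DOM; the URL is just a placeholder):

    <?php
    // Fetch the page and list its H1-H6 headings (rough sketch, placeholder URL)
    $html = file_get_contents('https://www.example.com/some-page');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings on imperfect HTML
    $doc->loadHTML($html);

    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            echo strtoupper($tag) . ': ' . trim($node->textContent) . PHP_EOL;
        }
    }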

Other than parsing the content and searching for <meta name="robots" content="noindex"> or similar, is there also a way to check whether a page is set to noindex in robots.txt?

Ivan Topić

1 Answer


There are two ways pages specify noindex: via a robots meta tag in the <head> section (as you noted), or via the X-Robots-Tag HTTP header in the response.

On top of that, there are two values that mean noindex: one is "noindex", and the other is "none" (which is equivalent to "noindex, nofollow").

The meta tags can target specific crawlers, and could look like this:

<meta name="robots" content="noindex" />

or

<meta name="googlebot" content="noindex" />

or

<meta name="AdsBot-Google" content="noindex" />

or others.

Google has a pretty good write-up of these in its robots meta tag and X-Robots-Tag documentation.

So the way to check for noindex is to do both:

  1. Check for an X-Robots-Tag header containing "noindex" or "none" in the HTTP response (try curl -I https://www.example.com to see what the headers look like).
  2. Get the HTML and scan the meta tags in the <head> for "noindex" or "none" in the content attribute (a sketch combining both checks follows below).
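As a rough, untested sketch (the URL and the list of bot names are just examples), both checks could look something like this in PHP:

    <?php
    $url = 'https://www.example.com/';

    // 1. Check the X-Robots-Tag HTTP header
    $headers = array_change_key_case(get_headers($url, 1), CASE_LOWER);
    $xRobots = $headers['x-robots-tag'] ?? '';
    // The header may appear more than once, in which case get_headers() returns an array
    $xRobots = is_array($xRobots) ? implode(', ', $xRobots) : $xRobots;
    $noindexHeader = preg_match('/\b(noindex|none)\b/i', $xRobots) === 1;

    // 2. Check robots meta tags in the HTML
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML(file_get_contents($url));

    $noindexMeta = false;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name    = strtolower($meta->getAttribute('name'));
        $content = strtolower($meta->getAttribute('content'));
        // "robots" targets all crawlers; bot-specific names also count
        if (in_array($name, ['robots', 'googlebot', 'adsbot-google'], true)
            && preg_match('/\b(noindex|none)\b/', $content)) {
            $noindexMeta = true;
            break;
        }
    }

    echo ($noindexHeader || $noindexMeta) ? 'noindex' : 'indexable';

In practice you would want to fetch the URL once and reuse the response for both checks (e.g. with cURL), but the two steps are kept separate here for clarity.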
mhandis