
My intention is to fetch the link with PHP and parse the content with Simple HTML DOM Parser (or something similar) to look for H1-H6 tags. But before that, I need to find out whether the page is being indexed at all.
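Roughly, I have something like this in mind (a sketch using PHP's built-in DOMDocument instead of Simple HTML DOM; the URL is just a placeholder):

    <?php
    // Fetch the page and list its H1-H6 headings (rough sketch, placeholder URL)
    $html = file_get_contents('https://www.example.com/some-page');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings on imperfect HTML
    $doc->loadHTML($html);

    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            echo strtoupper($tag) . ': ' . trim($node->textContent) . PHP_EOL;
        }
    }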

Other than parsing the content and searching for <meta name="robots" content="noindex"> or similar, is there also a way to check whether a page is set to noindex in robots.txt?

Ivan Topić

1 Answer


There are two ways pages specify noindex: via a robots meta tag in the <head> section (as you noted), or via the X-Robots-Tag HTTP header in the response.

On top of that, there are two values that mean noindex: one is "noindex", and the other is "none" (which is equivalent to "noindex, nofollow").

The meta tags can target specific crawlers, and could look like this:

<meta name="robots" content="noindex" />

or

<meta name="googlebot" content="noindex" />

or

<meta name="AdsBot-Google" content="noindex" />

or others.

Google has a pretty good write-up of these in its robots meta tag and X-Robots-Tag documentation.

So the way to check for noindex is to do both:

  1. Check for an X-Robots-Tag header containing "noindex" or "none" in the HTTP response (try curl -I https://www.example.com to see what the headers look like).
  2. Get the HTML and scan the meta tags in the <head> for "noindex" or "none" in the content attribute (a sketch combining both checks follows below).
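As a rough, untested sketch (the URL and the list of bot names are just examples), both checks could look something like this in PHP:

    <?php
    $url = 'https://www.example.com/';

    // 1. Check the X-Robots-Tag HTTP header
    $headers = array_change_key_case(get_headers($url, 1), CASE_LOWER);
    $xRobots = $headers['x-robots-tag'] ?? '';
    // The header may appear more than once, in which case get_headers() returns an array
    $xRobots = is_array($xRobots) ? implode(', ', $xRobots) : $xRobots;
    $noindexHeader = preg_match('/\b(noindex|none)\b/i', $xRobots) === 1;

    // 2. Check robots meta tags in the HTML
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML(file_get_contents($url));

    $noindexMeta = false;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name    = strtolower($meta->getAttribute('name'));
        $content = strtolower($meta->getAttribute('content'));
        // "robots" targets all crawlers; bot-specific names also count
        if (in_array($name, ['robots', 'googlebot', 'adsbot-google'], true)
            && preg_match('/\b(noindex|none)\b/', $content)) {
            $noindexMeta = true;
            break;
        }
    }

    echo ($noindexHeader || $noindexMeta) ? 'noindex' : 'indexable';

In practice you would want to fetch the URL once and reuse the response for both checks (e.g. with cURL), but the two steps are kept separate here for clarity.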
mhandis