
I'm trying to set X-Robots-Tag to allow Googlebot to index my website. I don't have a robots.txt file and I don't have any meta tags relating to X-Robots-Tag in any of my html files. The Apache server is returning a header with X-Robots-Tag set to "noindex, nofollow". How do I unset this tag by editing the .htaccess file?

This is what I get when using the Chrome addon "Robots Exclusion Checker":

X-Robots status BLOCKED noindex,nofollow.

Date: Thu, 23 Jul 2020 20:27:46 GMT
Content-Type: text/html
Content-Length: 1272
Connection: keep-alive
Keep-Alive: timeout=30
Server: Apache/2
X-Robots-Tag: noindex, nofollow
Last-Modified: Fri, 09 Mar 2018 19:26:43 GMT
ETag: "ae0-xxxxxxxxxx-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=3600
Expires: Thu, 23 Jul 2020 21:27:46 GMT

Contents of my .htaccess file:

# compress text, html, javascript, css, xml:
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE application/rss+xml
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/x-javascript

# Or, compress certain file types by extension:
<files *.html>
SetOutputFilter DEFLATE
</files>

Header onsuccess unset X-Robots-Tag
Header always set X-Robots-Tag "index,follow"

I've tried adding this to the bottom of the .htaccess file:

<files *.html>
Header set X-Robots-Tag "index,follow"
</files>

I then get this response from the Chrome extension:

X-Robots BLOCKED noindex,nofollow,index,follow.

(Notice the X-Robots-Tag header now appears twice in the response below.)

Date: Thu, 23 Jul 2020 20:39:42 GMT
Content-Type: text/html
Content-Length: 1272
Connection: keep-alive
Keep-Alive: timeout=30
Server: Apache/2
X-Robots-Tag: noindex, nofollow
Last-Modified: Fri, 09 Mar 2018 19:26:43 GMT
ETag: "ae0-xxxxxxxxxxxxx-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=3600
Expires: Thu, 23 Jul 2020 21:39:42 GMT
X-Robots-Tag: index,follow

Is there a way to delete the original X-Robots-Tag header and replace it with the new one? I tried `Header unset X-Robots-Tag`, but no luck (it still shows "BLOCKED noindex,nofollow").


Solution: What worked for me was to add a robots.txt file and to ensure all internal hyperlinks end with a trailing slash. Without the trailing slash the server issues a 301 redirect, and that redirect response carries the offending noindex,nofollow header.
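For reference, a minimal robots.txt that permits all crawling can be as simple as the following (an empty Disallow matches nothing, so everything remains crawlable):

```
User-agent: *
Disallow:
```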

TheMarkster
    "How do I unset this tag by editing the .htaccess file?" - you really shouldn't need to - you need to find where this header is being set in the first place. "The apache server is returning a header" - yes, but it's likely that it's your application that is _setting_ this header. – MrWhite Jul 23 '20 at 21:29
  • My index.html page is very, very simple and only hyperlinks inside the body to other parts of the site. Main Page ... – TheMarkster Jul 25 '20 at 02:06
  • What's in your server config? The `X-Robots-Tag` does not set itself - it must be explicitly set somewhere in your config. What kind of hosting do you have? – MrWhite Jul 25 '20 at 12:21
  • @MrWhite The site is hosted on freeyellow. Here is a link to the server information: https://mwganson.freeyellow.com/cgi-bin/server_information.php but I don't see anything in it related to x-robots-tag. I searched all my files for "robots", "noindex" and "nofollow" but found nothing. – TheMarkster Jul 25 '20 at 14:11

1 Answer


My index.html page is very, very simple and only hyperlinks inside the body to other parts of the site.
The site is hosted on ...

As noted in the comments, you should really identify the source that is setting this header in the first place, rather than trying to override (or unset) it. This is not something Apache does by default; the header must be explicitly set somewhere.

If you are not setting this header yourself (in a server-side script or in any .htaccess file along the filesystem path - even above the document root) then it must be set in the vHost/server config. If you don't have access to the server config, contact your web host to find out what's wrong.

<files *.html>
Header set X-Robots-Tag "index,follow"
</files>

This would ordinarily "work" - unless the header was previously set on the always table of response headers, in which case you need to set yours on the always table as well. For example:

Header always set X-Robots-Tag "index,follow"

You shouldn't need the <Files> wrapper - unless you specifically want to target only those requests that map to *.html files. I would imagine the "noindex,nofollow" header is being set on every request (e.g. images and other static resources too).
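Putting the two together, a sketch (one of the things also suggested in the later comments) that overrides the header on both tables, without the <Files> wrapper, might look like this in .htaccess:

```apache
# Sketch: override any previously set X-Robots-Tag on both
# response-header tables ("onsuccess" is the default table).
Header onsuccess set X-Robots-Tag "index,follow"
Header always set X-Robots-Tag "index,follow"
```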

However, you don't need to explicitly set "index,follow", since this is the default behaviour search engines apply whether the header is present or not. So in this case you just need to unset the header (as you also suggest) - but again, you'll need to use the always table of headers (if that is the table on which the header was originally set). For example:

Header always unset X-Robots-Tag

The "always" table is perhaps a bit misleadingly named, as the directive above might suggest (to the casual reader) that the header is always unset (as opposed to sometimes) - but that is not the case. There are two separate groups/tables of response headers: "always" and "onsuccess" (the default). The two are mutually exclusive. The difference is that the "always" group is applied even on errors and internal rewrites/subrequests; the default group is not.
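If you are unsure which table the header was originally set on, a belt-and-braces sketch that clears it from both tables would be:

```apache
# Remove X-Robots-Tag from both header tables,
# whichever one it was originally set on.
Header onsuccess unset X-Robots-Tag
Header always unset X-Robots-Tag
```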

Reference:
https://httpd.apache.org/docs/2.4/mod/mod_headers.html#header

MrWhite
  • Incidentally, you have also set an "index,follow" header on your "server information page" - this is something you obviously don't want indexed (in fact, it shouldn't be public at all). Whilst the page also has a "noindex" HTML meta tag, the HTTP response header will take priority. – MrWhite Jul 27 '20 at 11:57
  • I tried the always set and always unset suggestions, but neither worked. I noticed the chrome extension I'm using to test with showed the robots.txt file to not be well formatted. I didn't even have a robots.txt file, so I added one. That seems to have been good enough for googlebot, but I'm waiting to see the results of google's search console validation. The chrome extension still shows x-robots status as blocked. If the validation is successful I'll come back and mark as solved. – TheMarkster Jul 27 '20 at 16:14
  • Still not working. Got this when requesting indexing with google's URL Inspection from search console: Indexing allowed?  "No: 'noindex' detected in 'X-Robots-Tag' http header" It must be something out of my control. – TheMarkster Jul 31 '20 at 20:11
  • Which URL specifically are you testing/submitting? Is this related to the site you posted above in a comment to your question? `robots.txt` is a separate thing - you don't strictly need a `robots.txt` file if you want a site to be indexed, but you'll get a splattering of 404s if you don't have one. A single `Disallow:` (no slash) directive is preferable to `Allow: /` - but it doesn't really matter. – MrWhite Aug 01 '20 at 00:32
  • Same site, yes. I figured not having robots.txt and not having any meta tags was the way to go, but adding the robots.txt file has at least made the chrome extension happy. But google still insists there is a X-Robots-Tag set to 'noindex' when I try to submit for indexing. – TheMarkster Aug 01 '20 at 20:10
  • "google still insists there is a X-Robots-Tag set to 'noindex'" - But which URL? You can check all the response headers using the built-in browser tools. I've checked many of your pages (including the home page) and they all appear to "correctly" show `X-Robots-Tag: index,follow` - there is no "noindex" tag on the actual pages. Google has also indexed many (if not all) of your pages as you can find them in Google search. – MrWhite Aug 01 '20 at 22:48
  • However, your internal links are incorrect. You are linking to the non-canonical URL _without_ a trailing slash. This triggers a 301 redirect to append the trailing slash (a feature of mod_dir on Apache). This 301 redirect response does contain a "noindex, nofollow" `X-Robots-Tag` header together with your "index,follow" header. This isn't strictly correct, but it should not cause a problem. How _exactly_ are you setting the header? Are you still using a `<Files>` wrapper? You can also try setting it twice, both with and without the `always` argument. – MrWhite Aug 01 '20 at 22:59
  • I think you have it with the bit about the trailing slashes. When I enter a url in the inspection tool without the slash it reports noindex,nofollow, but if I enter it with a trailing slash it accepts the url for indexing. I'll do some research on fixing those 301 redirects. – TheMarkster Aug 02 '20 at 04:16
  • "fixing those 301 redirects." - Ideally, you would avoid the 301 redirect by appending the slash to the end of your internal URLs. You can avoid the trailing slash and still avoid the redirect, but this does require more work as it requires some URL rewriting with mod_rewrite. – MrWhite Aug 02 '20 at 13:51
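A hypothetical sketch of the mod_rewrite approach mentioned in the last comment - keeping slashless internal URLs while avoiding mod_dir's external 301 - might look like this in .htaccess. This is an assumption-laden sketch, not a tested config; note that DirectorySlash Off can expose directory listings if mod_autoindex is enabled:

```apache
# Stop mod_dir issuing the external 301 that appends the slash
# (the redirect that carried the unwanted X-Robots-Tag header).
DirectorySlash Off
RewriteEngine On
# If the request maps to a directory but lacks a trailing slash,
# append one internally - no redirect, no extra header.
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.*[^/])$ $1/ [L]
```

The simpler fix, as discussed above, is just to write the internal links with the trailing slash so no redirect is triggered at all.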