
I have a GitHub Pages site from my repository, username.github.io.

However, I do not want Google to crawl my website, and I absolutely do not want it to show up in search results.

Will just using robots.txt in GitHub Pages work? I know there are tutorials for stopping a GitHub repository from being indexed, but what about the actual GitHub Pages site?

user2961712

4 Answers


I don't know if it is still relevant, but Google says you can stop spiders with a meta tag:

<meta name="robots" content="noindex">

I'm not sure, however, whether that works for all spiders or only for Google.
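On a GitHub Pages site, the tag has to appear inside the <head> of every page you want de-indexed; with a Jekyll-based site, that usually means the default layout. A minimal sketch (the layout filename and structure are typical Jekyll conventions, not taken from the question):

```html
<!-- _layouts/default.html (hypothetical Jekyll layout) -->
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <!-- Ask compliant crawlers not to index any page using this layout -->
    <meta name="robots" content="noindex">
    <title>{{ page.title }}</title>
  </head>
  <body>
    {{ content }}
  </body>
</html>
```

Note that the crawler must be allowed to fetch the page in order to see the tag, so this should not be combined with a robots.txt Disallow for the same URL.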

Gumbo
    This is super useful when you don't have root access to the server, as in the case of Github pages. Thanks @Gumbo! – zool Oct 19 '16 at 12:30

Short answer:

You can use a robots.txt to stop indexing of your user's GitHub Pages by adding it to your User Page. This robots.txt will be the active robots.txt for all your project pages, as project pages are reachable as subdirectories (username.github.io/project) of your subdomain (username.github.io).


Longer answer:

You get your own subdomain for GitHub Pages (username.github.io). According to this question on MOZ and Google's reference, each subdomain has/needs its own robots.txt.

This means that the valid/active robots.txt for project projectname by user username lives at username.github.io/robots.txt. You can put a robots.txt file there by creating a GitHub Pages page for your user.

This is done by creating a new project/repository named username.github.io where username is your username. You can now create a robots.txt file in the master branch of this project/repository and it should be visible at username.github.io/robots.txt. More information about project, user and organization pages can be found here.

I have tested this with Google: I confirmed ownership of myusername.github.io by placing an HTML file in my repository https://github.com/myusername/myusername.github.io/tree/master, created a robots.txt file there, and then verified that it works using Google Search Console's webmaster tools (googlebot-fetch). Google does indeed list it as blocked, and Google Search Console's robots-testing-tool confirms it.

To block robots for one project's GitHub Page:

User-agent: *
Disallow: /projectname/

To block robots for all GitHub Pages for your user (User Page and all Project Pages):

User-agent: *
Disallow: /
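The effect of these rules can be sanity-checked offline with Python's standard urllib.robotparser module (the username and project name below are placeholders, not real sites):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served from a User Page at username.github.io,
# blocking crawlers from one project's subdirectory only.
rules = """\
User-agent: *
Disallow: /projectname/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The project page is blocked for all user agents...
print(parser.can_fetch("*", "https://username.github.io/projectname/"))  # False
# ...but the user page itself remains crawlable.
print(parser.can_fetch("*", "https://username.github.io/"))  # True
```

Swapping in `Disallow: /` makes both checks return False, matching the "block everything" variant above.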

Other options

olavimmanuel

Will just using robots.txt in github pages work?

If you're using the default GitHub Pages subdomain, then no, because Google would check https://github.io/robots.txt only.

You can make sure you don't have a master branch, or make your GitHub repo private, although, as commented on and detailed in olavimmanuel's answer, this would not change anything.

However, if you're using a custom domain with your GitHub Pages site, you can place a robots.txt file at the root of your repo and it will work as expected. One example of using this pattern is the repo for Bootstrap.

However, as bmaupin points out in the comments, Google's own documentation states:

A robots.txt file tells search engine crawlers which URLs the crawler can access on your site.

This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.

To keep a web page out of Google, block indexing with noindex or password-protect the page.

VonC
    Actually, it seems that placing `robots.txt` in a subdomain will [work](https://developers.google.com/search/reference/robots_txt#examples-of-valid-robotstxt-urls) according to Google's documentation, unless it's severely outdated. I've noticed a lot of web developers who use Github Pages and Jekyll to create their blogs that have `robots.txt` in their repositories even if they don't use custom domains. I haven't verified that this works, but it seems the evidence is in favor of it working as intended, at least for Google's crawler. – mechalynx Aug 07 '17 at 02:37
  • Thank you VonC, I am using github.io domain, I only have a master branch and my repo is public. But I still cannot search my blog from google? Is there anything else I have to confirm? – Summer Sun Sep 05 '17 at 07:59
  • "I still cannot search my blog from google": this question is about *not* searching a blog through Google. So it seems to be working in your case. – VonC Sep 05 '17 at 08:22
  • @VonC I don't believe making "sure you don't have a master branch, or that your GitHub repo is a private one" will make a difference to GitHub Pages, just the repository. According to GitHub's Help: ["Pages are always publicly accessible when published, even if their repository is private."](https://help.github.com/articles/user-organization-and-project-pages/) A Project Page can be published from a [variety of sources](https://help.github.com/articles/configuring-a-publishing-source-for-github-pages/). It would be weird if the existence of a master branch affected robots.txt or meta tags. – olavimmanuel Dec 05 '17 at 11:38
  • For now `https://github.io/robots.txt` is redirected to `https://pages.github.com/` and does not work. – Darren Ng May 01 '21 at 13:39
  • From [Google's own documentation](https://developers.google.com/search/docs/advanced/robots/intro): "A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, [block indexing with `noindex`](https://developers.google.com/search/docs/advanced/crawling/block-indexing) or password-protect the page." – bmaupin Jul 29 '21 at 18:13

Google doesn't recommend using a robots.txt file to keep a website (a GitHub page in this case) out of its index. In fact, most of the time the site gets indexed even if you block the Google bot.

Instead, you should add the following to your page's head, which is easy to control even if you are not using a custom domain.

<meta name='robots' content='noindex,nofollow' />

This tells Google NOT to index the page. If you only block the Google bot from accessing your website, it will still index it about 90% of the time; it just won't show a meta description.

Gurpreet Singh
  • Hi and thanks for this updated info about how it works in practice! Is this related to only blocking the google bot or all blocking via robots.txt? Do you have a source for this Google recommendation? – olavimmanuel Sep 09 '19 at 10:25
  • Here is a video from the official Google Webmasters YouTube channel: https://www.youtube.com/watch?v=KBdEwpRQRD0 – Gurpreet Singh Sep 15 '19 at 08:17