78

I use GitHub to store the text of one of my web sites, but the problem is that Google indexes the text on GitHub as well, so the same text shows up both on my site and on GitHub. For example, in this search the top hit is my site and the second hit is the GitHub repository.

I don't mind if people see the sources, but I don't want Google to index them (and maybe penalize me for duplicate content). Is there any way, besides making the repository private, to tell Google to stop indexing it?

What happens in the case of GitHub Pages? Those are sites whose source is in a GitHub repository. Do they have the same duplication problem?

Take this search: the topmost hit leads to the Marpa site, but I don't see the source listed in the search results. How?

szabgab
  • Looking at the robots.txt of GitHub, I see the blobs in the master branch are allowed but all the other branches are disallowed. That is probably the explanation for the Marpa content not being indexed. So maybe if I use a different branch, and remove the master branch from the repository, the indexing will stop. – szabgab Apr 05 '13 at 23:14
  • robots.txt directives summarized: http://antezeta.com/news/avoid-search-engine-indexing – LAFK 4Monica_banAI_modStrike Apr 06 '13 at 01:34

5 Answers

91

GitHub's https://github.com/robots.txt file allows indexing of the blobs in the 'master' branch but disallows all other branches. So if you don't have a 'master' branch, Google is not supposed to index your pages.
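You can check the current rules yourself; a quick sketch, assuming you have curl installed (the file changes over time, so don't rely on any snapshot):

curl -s https://github.com/robots.txt                    # print the current rules
curl -s https://github.com/robots.txt | grep -i master   # see whether 'master' is still mentioned at all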

How to remove the 'master' branch:

In your clone, create a new branch (let's call it 'main') and push it to GitHub:

git checkout -b main
git push -u origin main

On GitHub, change the default branch to 'main' (see the Settings section of your repository, or https://github.com/blog/421-pick-your-default-branch).
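If you prefer the command line, the GitHub CLI (gh, a much newer tool than this answer) can also change the default branch; a sketch, assuming gh is installed and authenticated, with your-user/your-repo as a placeholder:

gh repo edit your-user/your-repo --default-branch main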

Then remove the master branch from your clone and from GitHub:

git branch -d master
git push origin :master

Get other people who might have already forked your repository to do the same.

Alternatively, if you'd like to financially support GitHub, you can make the repository private: https://help.github.com/articles/making-a-public-repository-private

Dale
szabgab
  • Thanks. I followed the steps, but I did it directly from github.com – Gabriel Apr 18 '14 at 09:40
  • Interesting. I deleted the master branch on my GitHub website repos for hygienic reasons, not realizing it would have this nice side effect. – Jeffrey Kegler Mar 13 '15 at 17:54
  • How do you keep github pages rendering correctly if there is no master branch? – Bevan Jun 16 '16 at 21:08
  • @Bevan as far as I know the github pages are served from the gh-pages branch if it exists. https://help.github.com/articles/creating-project-pages-manually/ Nothing to do with the master branch. – szabgab Jun 24 '16 at 20:04
  • @szabgab the `username.github.io` repository is served from the `master` branch. Project repositories like `username.github.io/project-one` are served based on the `gh-pages` branch. See https://help.github.com/articles/user-organization-and-project-pages/ – David Jacquel Aug 08 '16 at 19:52
  • @Bevan, David Jacquel: you have been able to pick any branch to serve as GitHub Pages for a while now in the GitHub web UI. – Csaba Toth Jul 01 '17 at 23:47
  • @CsabaToth https://github.com/user/user.github.io/settings says "User pages must be built from the master branch." – olavimmanuel Dec 05 '17 at 14:23
  • Extra advice, if the repository contains product names or keywords you don't want to get hits for: be sure that the repository name and description do not contain the search keywords you want to exclude yourself from. – Csaba Toth Apr 10 '18 at 01:45
  • Be careful, this solution is not safe. Even if you replace master with main, this does not prevent someone from creating a fork that will be hosted on another website (with a link to your original repository or website) and indexed by search engines. So the best solution is not to include in your source code the keywords you don't want to be indexed by search engines like Google. Anyway, someone could do a search directly on GitHub instead of Google. In addition, GitHub recently changed their default branch from master to main. – baptx May 27 '21 at 11:10
  • I don't believe this answer is correct anymore. – Michael Mior Aug 05 '21 at 16:43
  • @MichaelMior Looking at the robots.txt archive, it seems that this answer is no longer correct since around June 2020. – Nicolas Oct 23 '21 at 15:30
7

I can think of two solutions that work at the present time:

  1. Rename your repo to start with tags. For example, instead of my-repo, rename it to tags-my-repo. OR:
  2. Create a new branch, but don't make it the default. Then, on the default branch, delete all files. This has the side effect of a) making the default branch useless beyond hiding the content from crawlers while remaining public, and b) forcing you to use the new branch as the de facto master. You can still rename the now-useless default branch and the new branch to whatever you want. (A command-line sketch follows this list.)
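A minimal sketch of solution 2 on the command line, assuming your current default branch is named main and the new branch holding the real content is named content (both names are placeholders):

git checkout -b content                                   # put the real content on a new, non-default branch
git push -u origin content
git checkout main
git rm -r .                                               # empty the default branch
git commit -m "remove content from the default branch"
git push origin main

The default branch stays public but empty, so the repository page that remains crawlable shows nothing worth indexing.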

Why I think the older solutions in this thread no longer work: https://github.com/robots.txt has changed since then. At the time of the original question in 2013, robots.txt looked like this:

User-agent: Googlebot
Allow: /*/*/tree/master
Allow: /*/*/blob/master
Disallow: /ekansa/Open-Context-Data
Disallow: /ekansa/opencontext-*
Disallow: /*/*/pulse
Disallow: /*/*/tree/*
...

whereas now there are no Allows but only Disallows:

User-agent: *

Disallow: /*/pulse
Disallow: /*/tree/
Disallow: /gist/
Disallow: /*/forks
...
Disallow: /*/branches
Disallow: /*/tags
...

If you simply create a new branch, make that default, and delete the old one, the URL https://github.com/user-name/repo-name will simply show your new default branch and remain crawlable under the current robots.txt.

How my solutions above work (they are based on how Google currently interprets robots.txt):

Solution 1 makes your repo's URL match Disallow: /*/tags, thereby excluding it from crawling. In fact, you can prefix your repo name with any word from a Disallow path of the form /*/word without a trailing slash (so tree doesn't work, since Disallow: /*/tree/ ends with a slash).

Solution 2 simply ensures that the default branch, which is the only branch crawled, doesn't contain anything that you don't want crawled. In other words, it "moves" all relevant content to a branch, so it lives under https://github.com/user-name/repo-name/tree/branch-name, which won't be crawled due to Disallow: /*/tree/.
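For illustration, here is how the current rules match a few URLs, assuming Google's usual wildcard handling where * matches any sequence of characters (user and repo names are placeholders):

https://github.com/user-name/tags-my-repo                blocked by Disallow: /*/tags
https://github.com/user-name/repo-name                   matched by no rule, so crawlable
https://github.com/user-name/repo-name/tree/branch-name  blocked by Disallow: /*/tree/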

Disclaimers

  • Obviously, my solutions depend heavily on what robots.txt looks like at any given point in time.
  • This doesn't guarantee it won't show up in search results.
  • This should be obvious: Since your repo is public, people who already know your user name can always navigate to your stuff. This fact has no bearing on the problem at hand, but I thought I should put this out there.
bsoo
  • Thanks for the answer. Do you have any idea how to prevent Github from redirecting my old username into a new one? I mean, my repo name remains the same, but the username has changed. However, when Googling with my old username, it exposes my new Github account through that repo. Is it also related to robots.txt? –  Jan 29 '22 at 01:33
  • Excellent analysis, I picked the word 'revisions' instead of 'tags' as I keep revisions of online books there that I don't want indexed. Fingers crossed for this to continue working in the future. – fmalina Jul 16 '22 at 23:39
0

If you want to stick to the master branch, there seems to be no way around using a private repo (and upgrading your GitHub account to a paid plan) or using another service that offers free private repos, such as Bitbucket.

iltempo
  • I already (about an hour ago) removed the 'master' branch and now I have a 'main' branch but I wonder, is this enough? – szabgab Apr 06 '13 at 08:41
0

Simple answer: make your repo private.

https://help.github.com/articles/making-a-public-repository-private

xero
-6

Short answer: yes, you can, with robots.txt.

If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.

You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).

While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
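For reference, a minimal robots.txt looks like the following (the path is a placeholder; the file has to live at the root of the domain, which, as the comments below point out, you cannot do on github.com):

User-agent: *
Disallow: /my-sources/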

Sources:

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

Carlos Neves
  • The robots.txt file needs to be in the root of the web site, and I don't have write access to http://github.com/robots.txt. Crawling can be restricted in the HTML header as well, but I don't think I can alter the pages generated by GitHub for my source code. – szabgab Apr 06 '13 at 06:11
  • In case someone is looking to disallow robots on their built GitHub Pages: people using GitHub Pages can add a robots.txt file to their User Page repository and use it to control robots on all the built pages (username.github.io/*). They can however not hide the source for their User Page, as it must be in ```master```. For project repositories, ```master``` can be deleted and another branch can be used for GitHub Pages. None of this applies to OP, as szabgab says he doesn't use GitHub Pages. – olavimmanuel Dec 05 '17 at 14:37