I respectfully believe this is an expectations issue. You say that you want to "syndicate carefully", but open-source software is basically the antithesis of that - allowing anyone to syndicate your code anywhere, outside of your control, restricted only by the terms of the OSS license.
When you search for something on Google, it returns what it believes to be the most authoritative, relevant result for your query, not necessarily the original source. Google isn't smart enough yet to know for sure what the "official" or "original" source of a piece of information is; it can only make educated guesses (first-seen date, backlinks, site authority), which sometimes land on the wrong answer.
Even if Google did know which repository/webpage was the "official" source, it might still have reasons to link to an alternate copy that the algorithm perceives as more "usable" or "fresh": e.g. preferring a recently updated mirror over an official repo that looks abandoned, has fewer backlinks, is a read-only archive, or lives on a less popular repo-hosting site.
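To make that concrete, here's a deliberately toy Python sketch of signal-weighted ranking. Everything in it is invented for illustration (the Candidate fields, the score weights, the example URLs); it is not how Google actually ranks pages, just a demonstration of how heuristics that favor freshness and backlinks can push a mirror above the original:

```python
# Toy illustration only: all signals, weights, and URLs here are invented,
# not any real search engine's ranking algorithm.
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    days_since_update: int  # freshness proxy
    backlinks: int          # authority proxy
    first_seen_rank: int    # 0 = the earliest copy the crawler saw

def score(c: Candidate) -> float:
    freshness = 1.0 / (1 + c.days_since_update)
    authority = c.backlinks ** 0.5
    originality = 1.0 / (1 + c.first_seen_rank)
    # When freshness and authority dominate the weights,
    # "being the original" barely moves the needle.
    return 3.0 * freshness + 1.0 * authority + 0.5 * originality

official = Candidate("https://example.org/official/puppet",
                     days_since_update=30, backlinks=50, first_seen_rank=0)
mirror = Candidate("https://example.com/mirror/puppet",
                   days_since_update=1, backlinks=900, first_seen_rank=3)

# The fresher, more heavily linked mirror outranks the original source.
for c in sorted([official, mirror], key=score, reverse=True):
    print(f"{score(c):6.2f}  {c.url}")
```

Real rankers use vastly more signals, but the failure mode is the same: "original" is just one weak signal among many, so it loses whenever the other signals favor a copy.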
If this were proprietary code, the solution would be to DMCA takedown the unofficial copies of your code, either at the source or with Google. But since this code's license presumably allows it to be copied freely, you have no control over syndication, and what Google perceives as the most useful result may not be the official source.
"Did we do something wrong with our repository browser, or with the mirror?"
There's no reason to believe that, afaik. This ranking issue is a classic foray into the strange world of SEO.
My advice is not to worry too much about where searches for random files in your project take you. Your GitHub mirror is already the top result for "wikimedia puppet", which is what I'd expect most users to search first if they needed to look at the up-to-date version of any files in your repo.
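If it does matter that people land on the copy you consider authoritative, one small mitigation is to point them at scoped searches: Google's site: operator restricts results to a host (and, in my experience, also accepts a path prefix), so a query like `puppet site:github.com/wikimedia` will only surface hits from that mirror.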