How do I get shared history back after a git repository has been copied?

Question

A long long time ago, in an office far far away, someone copied a github repository and uploaded it to Visual Studio Team Services (VSTS). We developers happily coded away, developing features and fixing bugs in VSTS. Now it's time to release our code back into the loving arms of the open source community...

Unfortunately our VSTS repository doesn't have a shared history with the github repository because it's a copy, not a clone. While we can add the github repository as a remote, merging our code back into the main branches is a nasty snarl of conflicts. Entire folder structures have been moved or renamed, and open source developers have committed changes to those files in the github repository.

Is there a way I can hook our branches back up to where they came from? Something like rebasing our entire branch tree onto the last commit that was on github when the repository was copied?

The best I've come up with is cherry-picking every commit in VSTS onto the GitHub history, and that sounds like some serious detective work figuring out where to insert the renames.

score 2 · Accepted Answer · answered Aug 09 '18 at 04:57

This—combining a non-clone with an actual clone—is difficult in general.

Let's write up a theoretical example, using git://github.com/repo.git as the original. Let's assume ssh://example.com/repo.git is the repository you set up using the following command sequence:

<download tarball or zip file from github.com/repo>
<extract tarball or zip file into directory D>
$ cd D
$ git init
$ git add .
$ git commit -m initial -m "" -m "imported from github.com/repo.git"

after which you created the --bare repository that lives at ssh://example.com/repo.git from this independent repository.

It's now some time later and you have realized that you would like to be working with an actual clone of github.com/repo.git. Alas, your ssh://example.com/repo.git has no shared history—no commits in common—with git://github.com/repo.git. Running:

$ git clone ssh://example.com/repo.git combine
$ cd combine
$ git remote add public git://github.com/repo.git
$ git fetch public

gets you all of the public commits, but trying to merge public/master with your own private master is a mess.

In some very specific cases, it's actually not too hard to fix this. The trick lies in comparing the root commit now sitting in your combine repository, reachable from your master, to all the commits in your combine repository reachable from all the public/* remote-tracking names. If you are lucky, exactly one commit's tree exactly matches your own root commit's tree because the tarball-or-zip-file you got produced an identical tree.
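
The tree comparison described above can be scripted. Here is a minimal sketch, assuming the GitHub remote is named public as in this example; find_tree_match is a hypothetical helper name, not a Git command:

```shell
# find_tree_match ROOT [REMOTE]   (hypothetical helper, not part of Git)
# Print each commit reachable from REMOTE's remote-tracking branches
# whose top-level tree is identical to the tree of commit ROOT.
find_tree_match() {
    root=$1
    remote=${2:-public}    # assumed remote name, as in the example above
    tree=$(git rev-parse --verify --quiet "$root^{tree}") || return 1
    # %H is the commit hash, %T is the hash of that commit's tree.
    git log --remotes="$remote" --format='%H %T' |
        awk -v want="$tree" '$2 == want { print $1 }'
}

# Usage, where <hash-of-A> is your own root commit:
#   find_tree_match <hash-of-A> public
```

If this prints exactly one hash, that is the matching commit. If it prints several, more than one upstream commit has the same tree and you will need to pick among them; if it prints nothing, there is no exact match.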

If you are not lucky, there is no such commit. In this case, you can perhaps find a commit that's "sufficiently close". But let's assume that you did find a commit, reachable from public/master, that exactly matches your own root commit:

A--B--...--o--o   <-- master (HEAD), origin/master
        \
         ... (there may be other branches)

C--...--R--...--o   <-- public/master

Here, the uppercase letter A stands in for the actual hash ID of your own root commit—the one you made from the downloaded tarball or zip file—and B is the commit just after that one. C stands for the (or some) root commit reachable from public/master and is mainly in the drawing just for illustration: all we know for certain is that there is at least one more such root (parentless) commit. The letter R stands in for the commit that exactly matches your commit A and this is the most interesting commit at the moment.
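
Before picking R, it can help to list the root commits on each side and to confirm which commit is B. A small sketch; list_roots and children_of are hypothetical helper names, and the remote name public is the one assumed in this example:

```shell
# list_roots REV...   (hypothetical helper)
# Print the parentless (root) commits reachable from the given revisions.
list_roots() {
    git rev-list --max-parents=0 "$@"
}

# children_of COMMIT REV...   (hypothetical helper)
# Print commits reachable from REV... that list COMMIT as a parent.
children_of() {
    parent=$(git rev-parse --verify --quiet "$1^{commit}") || return 1
    shift
    # Each line printed by --parents is: <commit> <parent1> <parent2> ...
    git rev-list --parents "$@" |
        awk -v p="$parent" '{ for (i = 2; i <= NF; i++) if ($i == p) print $1 }'
}

# Usage:
#   list_roots master                 # should print A
#   list_roots --remotes=public       # C, plus any other upstream roots
#   children_of <hash-of-A> master    # should print B
```

If the second command prints more than one hash, the upstream repository has several independent roots, and it is worth checking which of them your candidate R descends from.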

What we would like to do now is pretend that the parent of the second-most interesting commit, B, is commit R rather than commit A. We can do this! Git has a facility called git replace. What git replace does is copy an object while making some change. In our case, we want to copy commit B to a new commit B' that looks almost exactly like B, but has one thing changed: its parent. Instead of listing the hash ID of commit A as its parent, B' should list the hash ID of commit R.

In other words, we will have:

A---------B--...--o--o   <-- master (HEAD), origin/master

          B'
         /
C--...--R--...--o   <-- public/master

Now all we have to do is convince Git that when it looks up commit B, it should notice that there's this replacement commit, B', and quickly avert its eyes from B to look instead at B'. That's the rest of what git replace does. So having found commits R and B, we run:

git replace --graft <hash-of-B> <hash-of-R>

and now Git pretends that the graph reads:

          B'-...--o--o   <-- master (HEAD), origin/master
         /
C--...--R--...--o   <-- public/master

(well, Git pretends this unless we run git --no-replace-objects to see the reality).
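
Since git replace --graft prints nothing on success, a quick sanity check is useful. The following is a sketch only; check_graft is a hypothetical helper name, and the default branch names are the ones assumed in this example:

```shell
# check_graft B R [MINE] [THEIRS]   (hypothetical helper)
# Verify that grafting B onto R actually relates MINE to THEIRS.
check_graft() {
    b=$1 r=$2
    mine=${3:-master} theirs=${4:-public/master}
    # R must be in the history of the branch you intend to merge with.
    git merge-base --is-ancestor "$r" "$theirs" ||
        { echo "R is not reachable from $theirs" >&2; return 1; }
    # With replacements honored, B's parent should now resolve to R.
    [ "$(git rev-parse "$b^")" = "$(git rev-parse "$r^{commit}")" ] ||
        { echo "graft is not in effect" >&2; return 1; }
    # Print every merge base; exits non-zero if the histories are unrelated.
    git merge-base --all "$mine" "$theirs"
}

# Usage:
#   git replace --graft <hash-of-B> <hash-of-R>
#   check_graft <hash-of-B> <hash-of-R> master public/master
```

If the last command prints nothing, Git still considers the two branches unrelated, and git merge will refuse with "refusing to merge unrelated histories".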

The big, or maybe small, drawback

Aside from the rather tough job of locating commit R—finding A and B is very easy, they are the last two hash IDs listed by git rev-list --topo-order master—this git replace trick has a flaw. The replacement commit B' exists in our repository now, but it is located via a special name, refs/replace/hash, where hash is the hash ID of the original commit B. This replacement object (and its name) is not sent to new clones by default.

You can make clones that do have the replacement object and its name, and work with them, and everything works. But this means that every time someone clones your combine repository, they must run:

git config --add remote.origin.fetch '+refs/replace/*:refs/replace/*'

or similar (this particular rule forces your clone's refs/replace/ namespace to mirror origin's, overwriting any local entries, which is crude but effective).
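
Put together, sharing a replacement has two halves: publishing the refs/replace/ names from the repository where the graft was made, and fetching them in each clone. A sketch under some assumptions: the push step is not spelled out above and is shown here only as an ordinary refspec push, and both function names are hypothetical:

```shell
# publish_replacements [REMOTE]   (hypothetical helper)
# Run in the repository where `git replace --graft` was done: copies
# every refs/replace/* name, and the objects it needs, to REMOTE.
publish_replacements() {
    git push "${1:-origin}" 'refs/replace/*:refs/replace/*'
}

# adopt_replacements [REMOTE]   (hypothetical helper)
# Run once in each fresh clone: adds the fetch rule, then fetches.
adopt_replacements() {
    remote=${1:-origin}
    git config --add "remote.$remote.fetch" '+refs/replace/*:refs/replace/*' &&
        git fetch "$remote"
}
```

After adopt_replacements, git replace -l in the clone should list the hash ID of the original commit B.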

Alternatively, you can declare a flag day and run git filter-branch or similar to cement the replacement in place. I have described this elsewhere, though the best I can find at the moment is my answer to How can I attach an orphan branch to master "as-is"? Essentially, you make a new repository that has B' instead of B, does not have A, and has new copies of every commit that is a descendant of B' (with the same contents except for the parent hash ID). Then you have all of your users switch from the old repo.git to the new one. This is painful, but only one time.
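
As a rough sketch of the filter-branch route: git filter-branch honors refs/replace/ while it rewrites, so the new commits record the grafted parent for real. cement_graft is a hypothetical helper name; it assumes a clean working tree and that you list every branch that descends from B:

```shell
# cement_graft B BRANCH...   (hypothetical helper)
# Rewrite the listed branches so the graft on commit B becomes permanent.
cement_graft() {
    b=$1
    shift
    # The variable only silences filter-branch's startup warning.
    # filter-branch saves the old branch tips under refs/original/.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -- "$@" || return 1
    # The rewritten branches no longer need the replacement.
    git replace -d "$b"
}

# Usage:
#   cement_graft <hash-of-B> master
```

Every rewritten commit gets a new hash ID, which is why everyone has to switch over to the rewritten history at the same time.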

If you don't plan to keep using the combined repository very long, this may not matter.

Besides the above, you can also use the grafted history to produce merges—Git commands in general will follow the replacements—after which you may not need the replacement graft commit. In this case, the drawback is short-lived: it lasts only until you get your code merged.
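
The short-lived variant might look like the sketch below. merge_then_drop_graft is a hypothetical helper name, and the default public/master is the remote-tracking name assumed in this example; run it with your own branch checked out:

```shell
# merge_then_drop_graft B [THEIRS]   (hypothetical helper)
# Merge THEIRS into the current branch using the grafted history,
# then delete the replacement for commit B.
merge_then_drop_graft() {
    b=$1
    theirs=${2:-public/master}
    # Stops here, leaving the graft in place, if conflicts need resolving.
    git merge --no-edit "$theirs" || return 1
    # The merge commit itself now ties the two histories together.
    git replace -d "$b"
}
```

Once the merge commit exists, later merges from the public branch have a real common ancestor, so they no longer depend on the replacement.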

I ran git replace --graft and the command completed but didn't print anything. I did it again and it said the ref already exists, so it sounds like it worked. Unfortunately when I try to merge between branches I still get "fatal: refusing to merge unrelated histories". Git GUI seems to show that the branches are related. My internal branch is the green one below, and the red one goes to the rest of the github changes: https://i.imgur.com/txWVpzb.jpg — Ecnassianer, Aug 09 '18 at 21:22
Hm, the replacement object looks right (from what I can see and guess) so the histories *should* be related now. Running `git merge-base --all` on the two branch tips would print the hash IDs of all merge bases, and if Git is complaining that they're not related, this would print nothing. Is that a `gitk` snapshot? I could try reproducing it with a [mcve] I make up... — torek, Aug 09 '18 at 22:02
Ok, I ended up showing this to a coworker and we found the mistake! There's a THIRD (and FOURTH!) root in the combined repository, because there were two roots in the original github repository. The commit I picked as R was actually part of the third root, not the two I wanted to join! So git was not lying when it said there was no shared history between the first and second! I found a better R and everything worked out! Thank you a hundred times! — Ecnassianer, Aug 11 '18 at 00:55
Aha! Good, I have not had time to fuss with an example all day and I would not have created extra root commits. They do happen, though, especially if you combine repositories... — torek, Aug 11 '18 at 01:01

score 0 · Answer 2 · answered Aug 09 '18 at 04:53

Assuming the VSTS repo is a Git repo, you can:

1. Clone your GitHub repo.
2. Make a new branch from the right commit.
3. Overwrite the worktree content with a mirror copy of the first commit of your VSTS branch (to avoid any conflict resolution). Then add and commit.
4. git cherry-pick from VSTS (added as a remote and fetched) all commits of your VSTS master branch onto the new local branch (no conflicts).
5. Push the new branch back to the GitHub repo.
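
A minimal sketch of the branch, overwrite, and cherry-pick steps above, assuming the VSTS history is linear (no merge commits) and already fetched into the clone. replay_onto is a hypothetical helper name; R is the GitHub commit to start from, A is the first VSTS commit, and TIP is the VSTS branch:

```shell
# replay_onto R A TIP [NEW]   (hypothetical helper)
# Create branch NEW at R, snap its content to A's snapshot,
# then replay every commit after A up to TIP.
replay_onto() {
    r=$1 a=$2 tip=$3
    new=${4:-replayed}
    git checkout -q -b "$new" "$r" || return 1
    # Make the index and working tree identical to commit A's snapshot.
    git read-tree -u --reset "$a^{tree}" || return 1
    # --allow-empty covers the case where A's snapshot already equals R's.
    git commit -q --allow-empty -m "Sync to imported snapshot" || return 1
    # Replay everything after A; assumes a linear history without merges.
    git cherry-pick "$a..$tip"
}

# Usage (names are placeholders):
#   replay_onto <hash-of-R> <hash-of-A> vsts/master replayed
```

The resulting branch descends from R, so it shares history with the GitHub repository and can be pushed there as an ordinary branch.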
