Merge two distinct git repositories by interlacing commits

Question

We have two repositories that evolved in parallel: one for the code of our project, and one for the tests of this project. I would like to merge these two repositories in one repository, in such a way that, when I go back in history, I still have both directory structures.

Suppose that our current structure is the following, where project and tests are two separate git repositories:

project
    /src
    /include
tests
    /short
    /long

I would like to end up with one git repository that has two directories project and tests.

I can't simply merge these two repositories using the techniques described in this answer, this one, or this site: they result in repositories that have two distinct histories before the merge, and when checking out a past commit, you have either src and include, or short and long, but you don't have all four of them as they appeared at that time.

If I check out a commit that was created in project 4 months ago, I would like to see project/src and project/include as they appeared in this commit, but I would also like to have tests/short and tests/long as they were at the same time in the (then separate) tests repository.

I understand that the ordering of the commits between both repositories will only depend on time, and may not be very precise. But that's good enough for me. And of course I know that I can't keep the original git ids from each repo. That's fine, because these two repos are actually fresh imports from another RCS, and so there is no git id that was ever recorded anywhere.

It should be doable to checkout one by one all the commits from each repo, ordered by time across repositories, and commit the resulting files. Is there already a tool that would do this?

torek · Accepted Answer · 2019-04-30T05:32:01.583

Edit: for a date-based approach that makes this pretty easy but assumes one of the two repositories is going to be "in control" of which commits come from the other repository, see jthill's answer. You end up with a commit history that exactly matches the "project" history, possibly squashing some of the "tests" history. The answer below is more appropriate if you need to add a prefix to both sets of histories, or want to interleave them (e.g., need two different "tests" updates for the same "project" commit).

phd's answer is fine, but if I were doing this myself and wanted to make it really neat and clean, I would use a different approach.

If the trees for the two repositories don't overlap, it's certainly possible to do this—and by bypassing the usual Git mechanisms, going straight to underlying git read-tree commands, you can automate it. (This is where VonC's recent comment rejecting my claim that Git and Mercurial are very much alike is true: if you bypass the top level Git commands, you get something you cannot get nearly as easily in Mercurial.)

Just as in phd's answer, you would start this process by combining the two repository commit databases via git fetch. (You can do this in a third repo, which I'd recommend since it makes it easier to restart the process from scratch if you decide you want to tweak some parameters. Alternatively, you can add repo A to repo B, or repo B to repo A.) But after that, everything diverges.

You now have two disjoint commit DAGs:

        D--...--K
       /         \
A--B--C           M--N   <-- repoA/master
       \         /
        E--...--L

O--P--Q--...--Z   <-- repoB/master

(If repoA and repoB both have more than one branch tip, draw whatever simplified diagram of their commits is more appropriate.)

Your next step is to enumerate all the commits in each of the two disjoint DAGs, using git rev-list --topo-order --reverse and whatever other sorting options you like. When and whether --topo-order is required depends on the topology and other sorting information, but in general you will want a parent commit listed before any of its children.
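As a rough sketch of that enumeration step, the two lists can be produced with git rev-list and merged by commit time. The function name list_by_time and the A/B side labels are invented for this example, and it assumes committer dates are a good enough ordering key (if dates inside one repository go backwards, a child can sort before its parent):

```shell
# list_by_time REF_A REF_B
# Prints one "<timestamp> <hash> <side>" line per commit reachable from the
# two refs, oldest first. The side labels A and B are made up for this sketch.
list_by_time() {
    {
        # --timestamp prefixes each hash with the raw committer timestamp
        git rev-list --topo-order --reverse --timestamp "$1" | sed 's/$/ A/'
        git rev-list --topo-order --reverse --timestamp "$2" | sed 's/$/ B/'
    } | sort -s -n -k1,1   # stable sort: ties keep the input order
}
```

Calling list_by_time repoA/master repoB/master in the combined repository prints the interleaved order to inspect before building anything.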

Given these two linearized lists of commit hash IDs, you now have the hard part: constructing the graph of new, combined trees you wish to commit. Every new commit will be made by combining one commit from each of the two old graphs. If one of the graphs is complex (as for repoA above) with branches and merges, and one isn't (as for repoB above), this can be particularly tricky.

I've made my own setup for this, where I have a very simple graph:

A--B   <-- A/master

O--P   <-- B/master

In my simplified setup, I'd like to make my first commit on my new master be commit C that combines the trees of A and O:

C   <-- master

Then I'd like to make, as my second commit on master, the combination of A and P (not A and O and not B and O either), and as my last commit, the combination of B and P, so that I end up with:

C--D--E   <-- master

with:
    C = A+O
    D = A+P
    E = B+P

So, here we are in a new empty repository, except that we've read in projects A and B:

$ git log --all --graph --decorate --format='%h%d %s' --name-status | sed '/^[| ] $/d'
* 7b9921a (B/master) commit-P
| A B/another
* 51955b1 commit O
  A B/start
* 69597d3 (A/master) commit-B
| A A/new
* ff40069 commit-A
  A A/file

(I accidentally didn't hyphenate commit O, but did hyphenate all the others. The sed is to remove some blank lines that don't really help reading, in this case.)

$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Now we build the new commits, one at a time, using git read-tree to populate the index to make the commits. We start with an empty index (which we have right now):

$ git status
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)

We want our first commit to combine A and O, so let's read those two commits into the index now. If we had to add a prefix to the tree in A we could do that here:

$ git read-tree --prefix= ff40069
$ git ls-files --stage
100644 7a1c6130c652b6ea92f4d19183693727e32c9ac4 0       A/file
$ git read-tree --prefix= 51955b1
$ git ls-files --stage
100644 7a1c6130c652b6ea92f4d19183693727e32c9ac4 0       A/file
100644 f6284744575ecfc520293b33122d4a99548045e4 0       B/start

We can make the commit we need now:

$ git commit -m combine-A-and-O
[master (root-commit) 7c629d8] combine-A-and-O
 2 files changed, 2 insertions(+)
 create mode 100644 A/file
 create mode 100644 B/start

Now we need to make the next commit, which means we need to build up the correct tree in the index. To do that we first have to clean it out; otherwise the next git read-tree --prefix will fail with a complaint about overlapping files and Cannot bind. So now we empty the index, then read commits A and P:

$ git read-tree --empty
$ git read-tree --prefix= ff40069
$ git read-tree --prefix= 7b9921a

If you like, you can examine the result using git ls-files --stage again:

$ git ls-files --stage
100644 7a1c6130c652b6ea92f4d19183693727e32c9ac4 0       A/file
100644 d7941926464291df213061d48784da98f8602d6c 0       B/another
100644 f6284744575ecfc520293b33122d4a99548045e4 0       B/start

In any case they can now be committed as the new commit:

$ git commit -m 'combine A and P'
[master eb8fa3c] combine A and P
 1 file changed, 1 insertion(+)
 create mode 100644 B/another

(you can see now how I end up with inconsistent hyphenation :-) ). Last, we repeat the process by emptying the index, reading in the two desired commits (B+P), and committing the result:

$ git read-tree --empty
$ git read-tree --prefix= A/master
$ git read-tree --prefix= B/master
$ git ls-files --stage
100644 7a1c6130c652b6ea92f4d19183693727e32c9ac4 0       A/file
100644 8e0c97794a6e80c2d371f9bd37174b836351f6b4 0       A/new
100644 d7941926464291df213061d48784da98f8602d6c 0       B/another
100644 f6284744575ecfc520293b33122d4a99548045e4 0       B/start
$ git commit -m 'combine B and P'
[master fad84f8] combine B and P
 1 file changed, 1 insertion(+)
 create mode 100644 A/new

(I used symbolic names here to get the last two commits, but hash IDs from git rev-list would of course work well.) We can now see the three commits, all on master:

$ git log --decorate --oneline --graph
* fad84f8 (HEAD -> master) combine B and P
* eb8fa3c combine A and P
* 7c629d8 combine-A-and-O

and it's now safe to delete the A/master and B/master references (and the two remotes). There's one peculiarity: since we did all the work directly in the index, without bothering with a work-tree, the work-tree is still completely empty:

$ ls
$ git status -s
 D A/file
 D A/new
 D B/another
 D B/start

To fix that at the end, we should just run git checkout HEAD -- .:

$ git checkout HEAD -- .
$ git status -s
$ git status
On branch master
nothing to commit, working tree clean
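If the files in the two repositories sit at the top level (src and include in one, short and long in the other), the same index-building steps can take non-empty prefixes instead. This is only a sketch: the function name stage_combined and the project/ and tests/ directory names are assumptions taken from the question's desired layout, not something I ran in the session above.

```shell
# stage_combined COMMIT_A COMMIT_B
# Fills the index with COMMIT_A's tree under project/ and COMMIT_B's tree
# under tests/. Nothing is committed and the work-tree is not touched.
stage_combined() {
    git read-tree --empty &&                    # start from an empty index
    git read-tree --prefix=project/ "$1" &&     # the trailing slash matters
    git read-tree --prefix=tests/ "$2"
}
```

After stage_combined, git ls-files --stage shows the combined tree, and git commit (or git write-tree plus git commit-tree) records it.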

How to write your own automation script

In practice, you will probably want to use git write-tree and git commit-tree, rather than git commit, to make the new commits. You would write a little script (in whatever language you like) to run git rev-list to collect the hash IDs of commits to combine. The script must inspect those commits (e.g., by looking at authorship and dates, or file contents, or whatever) to decide how to interweave the commits. Then, having made the decisions about interweaving and what branch-and-merge structures to provide, the script can begin the process of repeatedly doing these steps:

1. Empty the index.
2. Yank in a tree from a commit in the sub-graph from repo-A, with whatever --prefix option is appropriate (in your case this is --prefix=, i.e., the empty string, but in other cases it would be a directory name with a trailing slash).
3. Yank in a tree from a commit in the sub-graph from repo-B, with another appropriate --prefix, so that there are no collisions between entries from A and B.
4. Use git write-tree to write the tree. Its output is the tree hash ID for the next step.
5. Use git commit-tree with appropriate -p argument(s) to set the parent(s) of the new commit. Feed it the appropriate (combined or whatever) commit message text. Use the environment variables GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_AUTHOR_DATE, GIT_COMMITTER_NAME, GIT_COMMITTER_EMAIL, and GIT_COMMITTER_DATE to control the author and committer names and dates. The output from git commit-tree is the hash ID of the new commit, which becomes the parent of some subsequent commit.

When the whole thing finishes, the last commits made for any particular branch or set of branches are the hash IDs that go into those branches, so you can now run:

git branch <name> <hash>

for each such hash ID.
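To make the loop concrete, here is a minimal sketch for the simplest case. It is not a finished tool, and several things in it are assumptions of mine: the function name interleave, the project/ and tests/ prefixes, ordering purely by committer date, flattening each input into a single line of commits (merges are not reproduced), and copying message, author, and committer from whichever source commit triggered the new commit. It uses git update-ref, the plumbing counterpart of git branch, to create the final branch.

```shell
# interleave REF_A REF_B BRANCH
# Builds a linear history on BRANCH; each new commit pairs the newest
# REF_A-side commit seen so far with the newest REF_B-side commit seen so far.
interleave() {
    list=$(mktemp) || return 1
    {
        git rev-list --topo-order --reverse --timestamp "$1" | sed 's/$/ A/'
        git rev-list --topo-order --reverse --timestamp "$2" | sed 's/$/ B/'
    } | sort -s -n -k1,1 > "$list"      # oldest first, stable on ties

    cur_a= cur_b= parent=
    while read -r _ts hash side; do
        case $side in
            A) cur_a=$hash ;;
            B) cur_b=$hash ;;
        esac
        # steps 1-3: empty the index, then read in one tree from each side
        git read-tree --empty || return 1
        if [ -n "$cur_a" ]; then git read-tree --prefix=project/ "$cur_a" || return 1; fi
        if [ -n "$cur_b" ]; then git read-tree --prefix=tests/ "$cur_b" || return 1; fi
        # step 4: write the combined tree
        tree=$(git write-tree) || return 1
        # step 5: commit it, reusing the source commit's metadata
        parent=$(
            GIT_AUTHOR_NAME=$(git log -1 --format=%an "$hash")
            GIT_AUTHOR_EMAIL=$(git log -1 --format=%ae "$hash")
            GIT_AUTHOR_DATE=$(git log -1 --format=%ai "$hash")
            GIT_COMMITTER_NAME=$(git log -1 --format=%cn "$hash")
            GIT_COMMITTER_EMAIL=$(git log -1 --format=%ce "$hash")
            GIT_COMMITTER_DATE=$(git log -1 --format=%ci "$hash")
            export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE
            export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL GIT_COMMITTER_DATE
            git log -1 --format=%B "$hash" |
                git commit-tree ${parent:+-p "$parent"} "$tree"
        ) || return 1
    done < "$list"
    rm -f "$list"
    # point the branch at the last commit made
    [ -n "$parent" ] && git update-ref "refs/heads/$3" "$parent"
}
```

A call such as interleave A/master B/master combined leaves the work-tree untouched, so a checkout is still needed afterwards to see the files. The variables are global to the shell, which is fine for a throwaway script but worth knowing if you paste this into something larger.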

I was more alluding to the "pull only versus shared push": http://hgbook.red-bean.com/read/collaborating-with-other-people.html#id372641. With rebase and pull request early on, the GitHub model caught on, while the BitBucket one (initially based on Subversion, then Mercurial) played catch-up. I still remember my debates with Ry4an (his actual name!) regarding rebase and Mercurial indelible changesets! (https://stackoverflow.com/a/2672489/6309) — VonC, Apr 28 '19 at 06:31
@VonC: even better than rebase is Mercurial's "evolve" extension. Unfortunately that's *still* not in official Hg (not even as a bundled extension). Before rebase and histedit became bundled extensions, Mercurial was kind of deficient: you could graft-and-strip but that was extremely crude. — torek, Apr 28 '19 at 16:49

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

[given all project content is in src and include and all tests content is in short and long,]

If I checkout a commit that was created in project 4 months ago, I would like to see project/src and project/include as they appeared in this commit, but I would like also to have tests/short and tests/long as they were at the same time in the (then separate) test repository. […]

Is there already a tool that would do this?

There is: it's named git filter-branch. By far the simplest approach to implement is to walk the project history and hunt up "the" corresponding tests commit's content. Here's a sketch:

git init junk
cd junk
git remote add project /path/to/project
git remote add tests /path/to/tests
git remote update

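git filter-branch --index-filter '
        # author date of the project commit being rewritten, as "<seconds> <tz>"
        mydate=$(git show -s --date=raw --pretty=%ad "$GIT_COMMIT")
        # newest tests commit at or before that date, if any
        thetest=$(git rev-list -1 --before="$mydate" --remotes=tests)
        # plain "if" works in POSIX sh and still succeeds when there is no match
        if [ -n "$thetest" ]; then git read-tree --prefix= "$thetest"; fi
' -- --remotes=project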

This will get slow if your "tests" history has many thousands of commits. If you're talking about the Linux repo or something on that scale, it would wind up cheaper to pregenerate a date-sorted tests list and step through that.
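A pregenerated list of that kind might be sketched as follows. The helper names build_tests_list and tests_commit_at are hypothetical, the lookup is a plain linear scan rather than a true step-through, and it keys on committer timestamps (which is what git rev-list --timestamp prints):

```shell
# build_tests_list OUTFILE REV...
# Writes "<timestamp> <hash>" lines for the given revisions, oldest first.
build_tests_list() {
    out=$1
    shift
    git rev-list --timestamp "$@" | sort -n -k1,1 > "$out"
}

# tests_commit_at LISTFILE TIMESTAMP
# Prints the hash of the newest listed commit at or before TIMESTAMP,
# or nothing if every listed commit is newer.
tests_commit_at() {
    awk -v t="$2" '$1 <= t { h = $2 } END { if (h != "") print h }' "$1"
}
```

For example, build_tests_list /tmp/tests.list --remotes=tests runs once up front, and the per-commit lookup then becomes tests_commit_at /tmp/tests.list followed by the commit's timestamp in seconds.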

Adding `git commit --allow-empty -m "Empty commit before filter-branch"` seems necessary after `git remote update`. Otherwise `git filter-branch` errors out with `fatal: Needed a single revision`. — Xavier Nodet, Apr 29 '19 at 05:48
The effect of this method can be described as such: rewriting the commits in `project` such that they also contain the changes that happened in `test` since the last commit. In other words, the commits in `test` are squashed and the changes added to commits in `project`. I would prefer to keep the commits coming from `test` separate from those in `project`. On the other hand, a single command is much simpler than all other answers so far... — Xavier Nodet, Apr 29 '19 at 05:53
If you want to preserve the tests history structure, the easiest way would be to add the commit as a submodule: instead of read-tree, use `git update-index --cacheinfo 160000,$thetest,tests`. — jthill, May 03 '19 at 22:36

score 2 · Answer 3 · answered Apr 27 '19 at 13:09

I think you should combine the two repositories creating 2 branches (git fetch without merge). Then interactively rebase one branch, stop at every commit and do git cherry-pick the corresponding commit into the current branch. Then continue interactive rebase to the next commit (this saves the "edited" commit without modifications).

Perhaps that can even be automated. Instead of an interactive rebase with manual cherry-picking, you can probably use git rebase --interactive -x to execute git cherry-pick after every commit. The problem is how to find out which commit to cherry-pick. I think it should be second-branch~count. The count can be edited in the rebase-todo file before the rebase starts running.
