Merging two git repositories of the same project, linking file history

Question

I have a project which I started a long time ago, and made a number of commits to. The project was then abandoned for about two years, during which time I forgot I had been using git version control on the project. I picked it up, copying all files to a new machine, and started a new git repo with ~100,000 lines of code and dozens of files, which now has its own lengthy commit history. I recently rediscovered the old repo, and attempted to merge the commit history of both repos together, using the instructions here.

However, the result was incomplete. If I look at the commit history on github, commits from the old and new repository are intact, but each individual file history does not extend back to the old repository's series of commits, still showing them as simply created during the commit made at the creation of the new repository. A couple of files which were not transferred when I manually copied everything over to start the new repo don't show up at all.

The project's file structure and naming conventions have changed significantly since the end of the old repository's history, and some file associations may not be obvious. If I have to link the old files with the new ones manually, one at a time, I can do that, but an automatic solution would be better.

*but each individual file history does not extend back to the old repository's series of commits*: hmm, this happens because you did *copying all files to a new machine, and started a new git repo*; it's very likely that git sees them as two different files. At that point, I'd give up already and use the old repo just as a museum. — Bagus Tesa, Nov 09 '18 at 01:06
How much did you change between the last commit of the old repo and the first commit of the new? — Daniel H, Nov 09 '18 at 01:14
Note that what you want to do will change all the new repo's hash values, no matter how you do it; combining everything might not be worth that cost. — Daniel H, Nov 09 '18 at 01:17
Little to nothing changed between the last old/first new commits -- the structural overhaul I referred to was quite recent, and what prompted me to go looking for the older history, but the files at the first commit of the new should be fairly analogous to the files in the old. — Joe Gallagher, Nov 09 '18 at 01:22
If the new repo's history is linear (no merges), you can probably use `git rebase`. Otherwise it's probably better to use `git filter-branch` to preserve merges. I'm not confident enough in either one to give a sure answer (especially not `filter-branch`); I'd back up both repos before trying anything. — Daniel H, Nov 09 '18 at 01:28
`rebase` is not a good option for this use case. Compared to re-parenting (using `filter-branch`) the only thing `rebase` adds is ways for things to go wrong. — Mark Adelsberger, Nov 09 '18 at 18:55
@MarkAdelsberger It also adds being easier to work with. If I knew that the new repo's history was simple and that the first commit of the new repo were *exactly* identical to the last commit of the old, I could write an answer using `rebase`. I don't know these things, which is why I haven't answered yet because I don't know `filter-branch` as well. — Daniel H, Nov 10 '18 at 01:57
@DanielH - No, being more familiar is not the same as being easier to work with. Reparenting is safer (as in rebasing will almost never work correctly for this use case) and just as easy. — Mark Adelsberger, Nov 10 '18 at 21:12
@MarkAdelsberger I know that being more familiar isn't the same as being easier to work with, but every other reference I've seen to `git filter-branch` has said it is difficult to work with. If I'm wrong, that's good; I know that using `rebase` here is the "if all you have is a hammer" solution. — Daniel H, Nov 11 '18 at 00:25

score 1 · Accepted Answer · answered Nov 09 '18 at 04:20

I assume you followed the steps from the top answer to the question you linked. Those are not the best steps for this situation.

You have two segments of history for your project. If we suppose the first segment had commits

A -- B -- C <--(master)

and the second segment had commits

D -- E -- F <--(master)

then a complete history which behaves as expected would look like

A -- B -- C -- D' -- E' -- F' <--(master)

(A note on notation: I've replaced D with D' in the combined history, etc. The reasons for this are arguably technical and probably not immediately important; in summary, it just means that in terms of commit identity, D' is distinct from D because D' has C as a parent whereas D does not. But the letter is kept the same, to show that D' represents the same state of the code - i.e. the same content or TREE - as D.)

The answer you linked does not accomplish that. It meets the two most basic goals - putting the commits in one repo, and combining them into one graph - but it does not meet the most valuable one: making a coherent history of them. Instead it gives you

   A -- B -- C
              \
D -- E -- F -- f*

where f* is a merge commit (i.e. a commit with multiple parents) whose content matches F, but which also lists C as part of its history.

The problem with this is that C is not then recognized as part of D's history. In fact, git's default history filtering rules (e.g. for log output) will exclude A, B, and C entirely, because from git's point of view the state of the code can be explained without them.
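The effect can be reproduced on throwaway repositories. The following is only a sketch under stated assumptions: the repo names `old` and `new`, the single file `file.txt`, and the use of `-s ours` to produce a merge whose content matches F are all hypothetical stand-ins, not the steps from the linked answer. It shows that all commits are reachable, yet a log limited to one file stops at D.

```shell
#!/bin/sh
# Toy reproduction; every repo name, file name and message here is hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"

# commits REPO MSG...: append each MSG to file.txt and commit it
commits() {
    repo=$1; shift
    for msg in "$@"; do
        echo "$msg" >> "$repo/file.txt"
        git -C "$repo" add file.txt
        git -C "$repo" commit -q -m "$msg"
    done
}

for repo in old new; do
    git init -q "$repo"
    git -C "$repo" symbolic-ref HEAD refs/heads/master
done
commits old A B C
cp old/file.txt new/file.txt      # the manual copy to the new machine
commits new D E F

cd new
git remote add old ../old
git fetch -q old
# Assumption: "-s ours" stands in for however the merge was resolved. It gives
# a merge commit f* whose tree equals F and whose parents are F and C.
git merge -q --allow-unrelated-histories -s ours -m 'f*' old/master

all=$(git rev-list --count HEAD)                   # every reachable commit
for_file=$(git rev-list --count HEAD -- file.txt)  # commits kept for file.txt
echo "reachable: $all, shown for file.txt: $for_file"
git log --format=%s -- file.txt                    # lists F, E, D only
```

Because f* is identical to F for that path, the path-limited walk follows only the F side of the merge and never visits C.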

(Most of the current comments on your question, which talk about things like the similarity heuristic, are red herrings. It seems to me those comments were written by people who didn't really look closely at the steps you had followed.)

There are a couple different ways to get to the desired state. If this is a repo that only you use, or if you can coordinate with all repo users to do a history rewrite, then a "re-parenting" operation would be a good solution. This is a permanent fix that will create a seamless history; but, because it will change the history of the current repo's branches, coordination with any other users is important. The issue of rewriting shared histories is generally described in the git rebase documentation in the section about "recovering from upstream rebase".
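One possible way to do the re-parenting is `git filter-branch` with a `--parent-filter`. The sketch below runs on throwaway repositories; the repo names, `file.txt`, and the commit messages are hypothetical, and it assumes a linear `master` with a single root commit. Back up a real repository before trying anything like it.

```shell
#!/bin/sh
# Toy re-parenting demo; every repo name, file name and message is hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning

work=$(mktemp -d) && cd "$work"

# commits REPO MSG...: append each MSG to file.txt and commit it
commits() {
    repo=$1; shift
    for msg in "$@"; do
        echo "$msg" >> "$repo/file.txt"
        git -C "$repo" add file.txt
        git -C "$repo" commit -q -m "$msg"
    done
}

for repo in old new; do
    git init -q "$repo"
    git -C "$repo" symbolic-ref HEAD refs/heads/master
done
commits old A B C
cp old/file.txt new/file.txt      # the manual copy to the new machine
commits new D E F

cd new
git remote add old ../old
git fetch -q old
old_tip=$(git rev-parse old/master)            # C
root=$(git rev-list --max-parents=0 master)    # D
orig_tip=$(git rev-parse master)               # F, before the rewrite

# Give the root commit D the parent C; every other commit keeps its parents.
git filter-branch --parent-filter \
    "test \$GIT_COMMIT = $root && echo '-p $old_tip' || cat" master >/dev/null

git log --format=%s -- file.txt    # file history now runs F E D C B A
```

After the rewrite, `master` points at F', which has a different commit ID from F but the same tree, matching the D versus D' distinction in the notation above. The pre-rewrite tip is kept under `refs/original/` until it is deleted.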

Another alternative is to use git replace. This has the advantage that it is not a history rewrite, but it does have some known issues, and it requires a little special setup in each clone. (If the setup isn't done, it just means that particular clone doesn't see the full history.)
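A rough sketch of the `git replace` route, again on throwaway repositories with hypothetical names (`old`, `new`, `file.txt`) and assuming a single root commit on `master`. It uses `git replace --graft` to record a stand-in for D that has C as its parent.

```shell
#!/bin/sh
# Toy git-replace demo; every repo name, file name and message is hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"

# commits REPO MSG...: append each MSG to file.txt and commit it
commits() {
    repo=$1; shift
    for msg in "$@"; do
        echo "$msg" >> "$repo/file.txt"
        git -C "$repo" add file.txt
        git -C "$repo" commit -q -m "$msg"
    done
}

for repo in old new; do
    git init -q "$repo"
    git -C "$repo" symbolic-ref HEAD refs/heads/master
done
commits old A B C
cp old/file.txt new/file.txt      # the manual copy to the new machine
commits new D E F

cd new
git remote add old ../old
git fetch -q old
root=$(git rev-list --max-parents=0 master)    # D
tip_before=$(git rev-parse master)             # F

# Record a replacement for D that has C as its parent; D itself is untouched.
git replace --graft "$root" "$(git rev-parse old/master)"

git log --format=%s -- file.txt                    # F E D C B A
git --no-replace-objects rev-list --count master   # 3: the graft is ignored
```

The branch tip keeps its original hash, so nothing is rewritten. The replacement is stored under `refs/replace/`, which a default clone or fetch does not transfer; this is presumably the per-clone setup mentioned above, and fetching that namespace explicitly (for example with a refspec such as `refs/replace/*:refs/replace/*`) is one way to share it.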

Here is a post that discusses ways to do each of these: Git: Copy history of file from one repository to another

There are other variations as well, and it's hard to say which would best suit your situation. If you want to more generally explore the possibilities, you might consult the documentation for git filter-branch and git replace.

I asked about the difference between (with your labels) `C` and `D` so that I could tell if the desired history was `A -> B -> C -> D' -> E' -> F'`, as you say, or `A -> B -> C -> E' -> F'`; if `C` and `D` have the same history, then `C = D'` and there's no need to include both, so the `rebase` command would be different. I didn't follow up on that because I then realized that `git rebase` might not preserve all the important aspects of history (e.g., merges), and I couldn't give an answer that would, but I still think the question is relevant for determining how to rewrite history. — Daniel H, Nov 09 '18 at 09:25
