How to repair a git history and correctly merge unrelated histories

Question

I have a Git repo that contains two unrelated histories as shown in this graph:

* commit a577995ec16ae05c2f81adfdba5ce28e7b8ba150
(A)
|
*   commit d89ddb17122ab9eea72e7006461cb04a5a879770
|\  Merge: 95febfb f85c1bb
| |
| * commit 97b8dc2f7cf7e81d75fee5565423b554d191e4f3
| |
| * commit c86ff8d4695f63c30ba096a5a71ab8f50536a31c
| (B)
|
* commit a577995ec16ae05c2f81adfdba5ce28e7b8ba150
|
* commit 53c3a6a895c2732c8262e6467b586284fbe7c79d
(C)

Notice I have labeled three points in the Git history, A, B & C, in the graph, just so that I can refer to them here in the text.

The two histories beginning at C and B are unrelated, and began their lives as two entirely separate Git repos. The histories were then combined in a way that is unknown (but is known to have involved a git filter-branch).

My problem

If I checkout at point (A) in the history, the files that were created at point C are shown in git log and other commands, incorrectly, as if they were added by the merge commit fe63b2f. Thus, git commands like git show, git blame and others all fail to tell me about the true history of those files. But they do show me the true history of files that were added at point (B).

Further observations

Git show --name-status

As suggested in comments, if I run this on the merge commit, I see:

$ git show --oneline -m --name-status d89ddb17122ab9eea72e7006461cb04a5a879770
d89ddb1 (from 95febfb) Merge branch 'master' of ../jenkins into alex/jenkins
A       files_from_C
A       more_files_from_C
D       jenkins/files_from_B
D       jenkins/more_files_from_B
d89ddb1 (from f85c1bb) Merge branch 'master' of ../jenkins into alex/jenkins
A       files_from_C
A       more_files_from_C

I have included the real output as a Gist here.

Git log --name-status --full-history --follow v Git log

More information on the Git log of files created at C. A plain git log:

$ git log TODO.md
commit 989011e2dee59f9502c369d8fac58b2b947ab4e6
Author: Alex Harvey <redacted>
Date:   Sat Sep 14 00:51:53 2019 +1000

    Some other commit

commit d89ddb17122ab9eea72e7006461cb04a5a879770
Merge: 95febfb f85c1bb
Author: Alex Harvey <redacted>
Date:   Wed Sep 11 12:18:54 2019 +1000

    Merge branch 'master' of ../jenkins into alex/jenkins

But with --all --name-status --full-history --follow -- I see all the history:

$ git log --all --oneline --name-status --full-history --follow -- TODO.md
989011e Add autogenerated helpers documentation
M       TODO.md
8766c98 Remove shunit2/_include.sh
M       TODO.md
15bf859 Remove shunit2/_include.sh
M       TODO.md
65ee7e5 Remove shunit2/_include.sh
M       TODO.md
2c601af Remove unused invalid_ami_stacks variable
M       TODO.md
d347137 Resolve unprintable characters in README
M       TODO.md
dc7068d TODO.md
M       TODO.md
etc

My question

Is there a way that I can edit this history so that git show, git blame, and others will reveal the true history of all this repo's files, whether they were created at (B) or (C)?

"are shown, incorrectly", by what? Please, show the actual commands you're issuing and the actual output they produce, what you've supplied is nothing more than characterization and edited "evidence". — jthill, Sep 20 '19 at 03:28
@jthill, If I run, for example, git log on a file that was created at C, the history tells me, incorrectly, that it was added by the merge commit. If I then checkout at point C, however, I can see that file's true history. Is that clearer? — Alex Harvey, Sep 20 '19 at 03:32
No, because when I do what you say, `git log -- C`, on a graph that created C in the first commit on the first-parent line, I don't see what you say you see. I see exactly what I should: the commit that added C, the one that created it. I have no idea what you're looking at. Post a link to the actual repo or start showing your actual command output, give people trying to help you something concrete to work with. — jthill, Sep 20 '19 at 04:16
@jthill. Yes, right. Clearly, this history has been "broken" somehow during the process that merged the second repo in. That process is documented [here](https://alexharv074.github.io/2017/10/04/merge-a-git-repository-and-its-history-into-a-subdirectory-of-a-second-git-repository.html). I have not, however, been able to reproduce this. My hope is someone with deep knowledge of the Git filesystem could offer an explanation _in theory_ on the kind of Git filesystem damage that could lead to this observed behaviour. — Alex Harvey, Sep 20 '19 at 04:20
In the mean time I'll have another go at seeing if I can reproduce something like this. If I succeed, I'll make it publicly available. — Alex Harvey, Sep 20 '19 at 04:22
What do you mean, you can't reproduce it? Either you've got a repo showing the symptoms or you don't. Try `git show --oneline -m --name-status fe63` and `git log --all --oneline --name-status --full-history --follow -- C` where C is that file in your current tip. — jthill, Sep 20 '19 at 04:30
Yes @jthill. It is a safe assumption that I would not be here asking this question if I did not have an actual repo that exhibited the problematic behaviour. But I can't share this repo on the Internet unfortunately. That's why it would be great if I could reproduce the problem in a repo that I can share. In the mean time, I'll update the question with details from those commands you've suggested. — Alex Harvey, Sep 20 '19 at 04:39
@jthill, I updated with the output of those commands you suggested. — Alex Harvey, Sep 20 '19 at 05:12

score 6 · Accepted Answer · answered Sep 20 '19 at 06:33

In Git, each commit is¹ a snapshot plus some metadata. Each commit is identified by its hash ID. The metadata in a commit include the hash ID(s) of its parent commit(s). This forms a graph—specifically a Directed Acyclic Graph, or DAG—whose vertices (or nodes) are the commits and whose edges are the one-way child-to-parent links from each node to its parent(s).

What this means is that the history in a repository is the commits. There is no file history. There are only commits.
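To make "the history is the commits" concrete, here is a minimal sketch of a content-addressed commit store. It is a toy model, not Git's real object format: the `commit` and `history` helpers, the `store` dictionary, and the snapshot names are all made up for illustration. The only point is that a commit's ID is computed from what it records, including its parents' IDs.

```python
import hashlib

store = {}  # toy object database: commit ID -> (snapshot ID, parent IDs, message)

def commit(tree, parents, message):
    """Store a toy commit and return its ID, a SHA-1 over everything it records."""
    payload = "tree {}\n".format(tree)
    payload += "".join("parent {}\n".format(p) for p in parents)
    payload += "\n" + message
    cid = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    store[cid] = (tree, list(parents), message)
    return cid

def history(cid):
    """Everything reachable from cid by following child-to-parent links."""
    seen, todo = [], [cid]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.append(current)
            todo.extend(store[current][1])
    return seen

root = commit("snapshot-1", [], "root")
middle = commit("snapshot-2", [root], "middle")
tip = commit("snapshot-3", [middle], "tip")

# The history of tip is just the set of commits reachable from it.
print(history(tip) == [tip, middle, root])   # True

# "Editing" middle's parent list yields a different ID, i.e. a new commit.
# tip still records the old ID, so its history is unchanged.
regrafted = commit("snapshot-2", [], "middle")
print(regrafted != middle)                   # True
```

Nothing in this model stores a per-file history; anything that looks like one has to be derived by comparing snapshots.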

While git log will show you a purported file history, if you ask it, it's really just making it up. It does so by comparing each commit to its parent(s). For ordinary single-parent commits, this works well. For merges, this sort of mostly kind of works for some or most cases, except when it doesn't. Your particular merge is one of the ones where it doesn't work very well.

You can use the -m flag, as you are doing, to "split" a merge. Instead of doing a combined diff (as with -c or --cc), or no diff at all (as is the default), the -m flag tells git log that, upon encountering the merge—commit d89ddb17122ab9eea72e7006461cb04a5a879770 in your example above—it should first do a diff using parent #1 and the merge. Then it does a second diff, using parent #2 and the merge. In your case parent #1 is either 95febfb or a577995ec16ae05c2f81adfdba5ce28e7b8ba150 (these cannot both be true—you must be omitting something here, or having git log omit something here), and parent #2 is either f85c1bb or 97b8dc2f7cf7e81d75fee5565423b554d191e4f3.

(The git show command is like git log except that it defaults to --cc rather than showing nothing, and stops after showing the named commit. Based on your git show it looks like the shorter hash IDs are the actual ones.)

Now, the fact that one particular git show (or git diff --name-status) output shows:

A       files_from_C
A       more_files_from_C
D       jenkins/files_from_B
D       jenkins/more_files_from_B

just means that in the parent, there were files whose names were the D names, and in the child, there were files whose names were the A names. It's likely that you have rename detection turned off here—rename detection is off by default in Git versions predating 2.9.0, and on by default in 2.9.0 and later. If you turn it on, Git might show these as "renamed" rather than deleted-and-added, if the contents are similar enough.

The same holds for the second git diff --name-status output from git show. This one is comparing the snapshot in parent #2 vs that in the merge-child. It's important to realize that these comparisons are valid on their own, but only give you a small-picture view. The true case is that there are two parents with two snapshots and one child—the merge commit—with one snapshot, and the three snapshots differ in various ways.
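As a rough illustration of what `-m` does at a merge, the sketch below compares the merge snapshot with each parent snapshot separately, with rename detection off. Snapshots are modelled as plain dictionaries. The file names come from the question's output, but the file contents, the `README` entry, and the exact make-up of each parent are assumptions; the question's output was abbreviated.

```python
def name_status(parent, child):
    """Compare two snapshots (dicts of path -> content) without rename detection."""
    lines = []
    for path in sorted(set(parent) | set(child)):
        if path not in parent:
            lines.append("A\t" + path)      # only in the child: added
        elif path not in child:
            lines.append("D\t" + path)      # only in the parent: deleted
        elif parent[path] != child[path]:
            lines.append("M\t" + path)      # in both, contents differ: modified
    return lines

# Assumed snapshots, chosen so the per-parent diffs resemble the question's.
parent1 = {"README": "r", "jenkins/files_from_B": "b1", "jenkins/more_files_from_B": "b2"}
parent2 = {"README": "r"}
merge = {"README": "r", "files_from_C": "c1", "more_files_from_C": "c2"}

for label, parent in (("parent #1", parent1), ("parent #2", parent2)):
    print("merge vs " + label)
    for line in name_status(parent, merge):
        print(line)
```

Each comparison is correct by itself, but neither one alone says where `files_from_C` came from: it is simply absent from both parents in this model.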

... with --all --name-status --full-history --follow -- I see all the history:

--follow turns on rename-finding, but it is a terrible hack. It can only look at one file. You tell git log a starting name. It looks at the first commit that git log looks at,² fetching that commit's parent(s). If there is just one parent, the job is easier: as before, Git diffs the parent vs the child. No file other than the named one is interesting. One of three things now happens:

If the diff (remember: with rename-finding turned on) shows that the file is modified in place, git log shows the commit, and moves on.
If the diff shows that the file is unchanged, git log does not show the commit, and moves on.
If the diff shows that the file is renamed—whether modified or not—git log shows the commit. Then it changes which name it's looking for, to use the "source" name from the parent commit. Then it moves on as before.
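The three cases above can be sketched for a purely linear history as follows. This is a simplified model, not Git's code: it only detects renames where the content is 100% identical (Git uses a similarity threshold), and the snapshots and the `notes.txt` name are made up for the example.

```python
def rename_source(parent, child, path):
    """Exact-content rename detection: an old name that vanished with identical content."""
    for old, content in sorted(parent.items()):
        if old not in child and content == child[path]:
            return old
    return None

def follow(snapshots, path):
    """Walk a linear history (oldest first) backwards, following `path`."""
    shown = []
    for i in range(len(snapshots) - 1, -1, -1):
        child = snapshots[i]
        parent = snapshots[i - 1] if i > 0 else {}  # a root is compared with an empty tree
        if path not in child:
            continue                                # nothing to compare for this name
        if path in parent:
            if parent[path] != child[path]:
                shown.append((i, "M", path))        # modified in place: show
            continue                                # unchanged: show nothing
        old = rename_source(parent, child, path)
        if old is None:
            shown.append((i, "A", path))            # added from scratch: show
        else:
            shown.append((i, "R100", old + " -> " + path))
            path = old                              # renamed: show, then look for the old name
    return shown

snapshots = [
    {"notes.txt": "v1"},                 # 0: file created
    {"notes.txt": "v2"},                 # 1: modified
    {"notes.txt": "v2", "other": "x"},   # 2: unrelated change
    {"TODO.md": "v2", "other": "x"},     # 3: renamed to TODO.md
]
print(follow(snapshots, "TODO.md"))
# [(3, 'R100', 'notes.txt -> TODO.md'), (1, 'M', 'notes.txt'), (0, 'A', 'notes.txt')]
```

Commit 2 is skipped because the followed file is unchanged there, which is the second case in the list.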

This same pattern is also used for merge commits! However, merge commits have very ... interesting git log behavior, which leads us to the next point. (It's time to stop for footnotes now.)

¹More precisely, the commit refers to a snapshot. If two different commits have 100% identical snapshots, they just re-use the same one.

²The order in which commits are walked, when git log is given --all, is somewhat tricky.

How `git log` works when there is more than one commit to show

We already mentioned that history is commits. When a commit chain is linear:

... <-F <-G <-H ...

it's pretty easy for Git to show commit H (by diffing G and H) and then just move on to show G (by diffing F and G) and then move on to show F, and so on. There's only one commit at a time to show: you start at the last one, identified by some branch name, and work backwards, one commit at a time.

This breaks down at merges. It also is a problem when you tell git log to start at two or more commits, as git log --all typically does.

The algorithm git log actually uses here involves a priority queue. You give git log some set of starting points:

git log master develop origin/feature

for instance resolves each of the three names, master, develop, and origin/feature to hash IDs (presumably commits—and if these are branch and remote-tracking names, they are commits). Assuming there are three different commit hash IDs,³ all three commit IDs go into the priority queue.

Now that the priority queue is non-empty, Git picks the first commit from the queue. Which one is first? That depends on the sort options you supply on the command line: --author-date-order, --topo-order, and so on. Giving no options means that the priority is by committer date: later dates have higher priority. To see what each sorting option does, see the git log documentation, but note that this sorting only happens when the queue has more than one commit in it.

The git log command now shows, or doesn't show, the commit it picked, based on the rest of the criteria from the command line. It typically then places all of the commit's parents into the priority queue, unless those parents have already been visited. However, several options, including listing a file name like TODO.md, change this behavior by turning on history simplification. When history simplification is on, some parents are omitted. Adding --full-history forces all parents to be inserted into the priority queue.

With --follow, this—--full-history—is not always helpful, as we're about to see. But let's finish up with the graph-walk algorithm first.

We can now look at how git log really works, in much more detail:

Place command-line arguments, as translated into raw commit hash IDs, into priority queue. If no command-line argument is used to select one or more starting commits, use HEAD to select the starting commit.
While the queue is not empty:
1. Take the first element off the queue. (This commit is now visited.)
2. Decide whether to show this commit. If so, show it (doing parent rewriting as well, if that is enabled—that's another topic entirely; it only matters if you are using --parents or --graph).
3. Enumerate this commit's parents, applying history simplification if enabled. Place chosen parent(s) into priority queue unless already present or already visited. If the commit has no parents, or they're skipped, the queue becomes shorter. If multiple parents go into the queue, the queue becomes longer. The "priority" part of the priority queue determines which commit will be at the front when we get back to step 1.

That's pretty much the whole algorithm. A lot of weirdness follows from steps 2 and 3. History simplification at merges, unless disabled with --full-history, consists of following some (randomly-chosen) TREESAME parent, if there is one! (Understanding this requires defining TREESAME. Fortunately you're using --full-history so we don't have to do that.)
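Here is a small sketch of that loop, assuming the default ordering (later committer date first) and no history simplification, so every parent is queued. Step 2 is reduced to "show everything". The commit names F, G and H echo the linear chain above; commit D, the integer dates, and the `walk` function are invented for the example.

```python
import heapq

def walk(commits, starts):
    """commits maps id -> (committer date, [parent ids]); returns ids in visiting order."""
    queue, seen, order = [], set(), []

    def enqueue(cid):
        if cid not in seen:                  # skip commits already queued or visited
            seen.add(cid)
            heapq.heappush(queue, (-commits[cid][0], cid))  # later date = higher priority

    for cid in starts:                       # command-line starting points
        enqueue(cid)
    while queue:
        _, cid = heapq.heappop(queue)        # step 1: take the front of the queue
        order.append(cid)                    # step 2: "show" it (no filtering here)
        for parent in commits[cid][1]:       # step 3: queue all parents
            enqueue(parent)
    return order

commits = {
    "F": (1, []),
    "G": (2, ["F"]),
    "D": (3, ["G"]),   # tip of a second branch
    "H": (4, ["G"]),   # tip of the first branch
}
print(walk(commits, ["H", "D"]))   # ['H', 'D', 'G', 'F']
```

With two starting points, H and D are interleaved by date before their shared parent G is shown once.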

³If you name tag objects, git log translates the tag name to a commit hash ID, almost as if you'd used tag^{commit}; see the git rev-parse documentation for details. The git log command is fundamentally interested in commits, so it ignores attempts to log blob hashes and the like.

Think about how the rename-detection interacts with the priority queue

Suppose we're looking at the following very simple history, with commit M as the HEAD on our single branch master:

M     (merge commit)
|\
| B   (parent #2)
A     (parent #1)

Suppose further that there's exactly one file in M, named final. Its contents exactly match those of the only file—which is named A—in commit A, and the only file—which is named B—in commit B.

(Here's the actual git log --oneline ... output:

*   f11ea2a (HEAD -> master) merge A and B to final
|\  
| * 811819b (B) B
* 50d92c7 A

which will be useful below. My hash IDs are of course mine.)

We run:

git log --name-status --oneline --follow --full-history -m -- final

(the -m is required in this case, as I found out via testing). Git extracts M and the first of the two parents and diffs them. It finds that, from A to M, there's a rename from A to final. So it will show commit M. Then it changes its file-following: it is no longer looking for final, but rather for A. Now it diffs commits B and M. There is no file named A so it shows nothing here.

The next commit in the queue is B (because it has a later date). To compare a no-parent (root) commit, Git will diff it against the empty tree. Git diffs nothing-vs-commit-B and finds that we added file B. This is not the file we are looking for, so Git says nothing.

Git now moves on to consider commit A. Here, it finds that commit A adds a file A, which is the one file it is looking for.

The final output is this:

$ git log --name-status --oneline --follow --full-history -m -- final
f11ea2a (from 50d92c7) (HEAD -> master) merge A and B to final
R100    A       final
50d92c7 A
A       A

The message f11ea2a (from 50d92c7) tells us that the commit being shown in the next line is virtual-split-f11ea2a with parent 50d92c7 (merge M with parent A). The R line tells us file A was renamed to final in the merge.

The virtual-split-f11ea2a for B is not printed because neither of these commits has file A in it, and we're already looking for A instead of final.

Next, 50d92c7 is commit A itself. The subsequent A line tells us file A was added in commit 50d92c7 (commit A).

Commit B is omitted, even though it too created B from scratch, and B was then renamed to final. Or was it A that was renamed to final? Well, both are true, or maybe neither: maybe I created file final from scratch, throwing away the two files A and B.

The real point of all this exercise is that there isn't one "real" file history. The only history in this Git repository is the set of commits in the repository, with their parent/child relationships. Everything else is fiction! We can, to some extent, get a useful fiction from git log, but that extent has limitations.
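The walk-through above can be mimicked with a toy model that combines the priority queue, the per-parent split from -m, and the name switch from --follow. This is a sketch under stated assumptions, not Git's implementation: renames are detected only for identical content, the integer dates are invented (B is simply later than A, as above), and `diff_one` and `log_follow` are hypothetical helpers.

```python
import heapq

commits = {
    # id: (committer date, parents, snapshot as path -> content)
    "f11ea2a": (3, ["50d92c7", "811819b"], {"final": "same"}),  # merge M
    "811819b": (2, [], {"B": "same"}),                          # commit B
    "50d92c7": (1, [], {"A": "same"}),                          # commit A
}

def diff_one(parent_tree, child_tree, path):
    """Return (status, old name) for `path`, or None if there is nothing to show."""
    if path not in child_tree:
        return None
    if path in parent_tree:
        return ("M", path) if parent_tree[path] != child_tree[path] else None
    for old, content in sorted(parent_tree.items()):
        if old not in child_tree and content == child_tree[path]:
            return ("R100", old)                    # exact-content rename
    return ("A", path)

def log_follow(start, path):
    out, seen = [], {start}
    queue = [(-commits[start][0], start)]
    while queue:
        _, cid = heapq.heappop(queue)
        _, parents, tree = commits[cid]
        for parent in parents or [None]:            # a root is diffed against an empty tree
            parent_tree = commits[parent][2] if parent else {}
            result = diff_one(parent_tree, tree, path)
            if result is None:
                continue
            status, old = result
            label = cid + (" (from " + parent + ")" if len(parents) > 1 else "")
            if status == "R100":
                out.append(label + " R100 " + old + " " + path)
                path = old                          # from now on, look for the old name
            else:
                out.append(label + " " + status + " " + path)
        for parent in parents:                      # full history: queue every parent
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(queue, (-commits[parent][0], parent))
    return out

for line in log_follow("f11ea2a", "final"):
    print(line)
# f11ea2a (from 50d92c7) R100 A final
# 50d92c7 A A
```

Swapping the parent order in this model makes the walk follow B instead and omit A, which is one way to see that the "file history" depends on how the graph is walked.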

Conclusion

Is there a way that I can edit this history so that git show, git blame, and others will reveal the true history of all this repo's files, whether they were created at (B) or (C)?

Not really, no. The problem is there isn't a file history. You can arrange the commit history however you like, knowing now what (for instance) git log does in terms of looking for rename operations, whether that's because you've used -M or set diff.renames to true or are using a Git that is 2.9 or later or are using --follow to fake up a file history with Git's rather poor, but sometimes barely adequate, methods.

The git show command is the same as git log except that when generating diff output, it defaults to using --cc to produce combined diffs. A combined diff omits any file that is the same in any parent as it is in the child commit. Suppose merge commit M has parents P1 and P2, and all but two of the files in M exactly match those in P1. Suppose further that the two files in M that don't match P1 do exactly match those in P2. The combined diff will therefore show no files changed.
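The file-level rule can be sketched like this. It models only which files get listed (real combined diffs also work at the hunk level), and the snapshots and the `combined_diff_paths` helper are made up to match the M, P1, P2 scenario just described.

```python
def combined_diff_paths(parent_trees, merge_tree):
    """Paths whose state in the merge differs from that in every parent."""
    paths = set(merge_tree)
    for tree in parent_trees:
        paths |= set(tree)
    return sorted(
        p for p in paths
        if all(tree.get(p) != merge_tree.get(p) for tree in parent_trees)
    )

P1 = {"a": "a1", "b": "b1", "x": "x-old", "y": "y-old"}
P2 = {"a": "a2", "b": "b2", "x": "x-new", "y": "y-new"}

# a and b were taken from P1, x and y from P2: every file matches some parent.
M = {"a": "a1", "b": "b1", "x": "x-new", "y": "y-new"}
print(combined_diff_paths([P1, P2], M))          # []

# A file edited by hand during the merge matches neither parent, so it is listed.
M_edited = dict(M, x="x-hand-edited")
print(combined_diff_paths([P1, P2], M_edited))   # ['x']
```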

The git blame command is more complex. It can look for lines that were copied or moved from any file in the parent: see the -C option. I have never delved into what it does at merge commits (does it look for lines copied or moved from any parent file?), but I assume that like git log, it is eventually forced to do some kind of history simplification, because it's impractical to follow every path backwards.

Thanks so much for this encyclopaedic answer! Isn't it possible though to edit the commit's parent/child relationships? I'm naively imagining that surely by changing a few pointers in the graph I could "fix" the history so I see what I need to see? — Alex Harvey, Sep 20 '19 at 06:50
You cannot change anything in any Git object (including a commit). Trying to do so just produces a new and different Git object. Each commit includes, as a raw hash ID, the hash ID(s) of its parent(s). This is how Git does its distributed-repository magic. What you *can* do is build a new history, by extracting commits, making changes, and committing the results: that's what `git filter-branch` and The BFG do, for instance. It's also what `git rebase` does, though the strategy in rebase is entirely different. — torek, Sep 20 '19 at 06:53
Note that Git does have the concept of a *replacement* object, via `git replace`. These replacements are not copied by `git clone` by default, though. They're really add-on objects, found via the `refs/replace/*` namespace: when Git is about to use some object with hash ID `<hash>`, it first checks to see if there's a `refs/replace/<hash>` and if so, uses the replacement instead, unless you disable this with `git --no-replace-objects`. — torek, Sep 20 '19 at 06:55
One sensible way to "edit" history is to use `git replace` to construct the history you want (in a temporary copy of the repository), then run an otherwise no-op `git filter-branch` to extract and re-commit everything but with the lookaside replacements inserted during the process. The result is a new repository, no longer compatible with the original, because many of the hash IDs are now different. — torek, Sep 20 '19 at 06:56
The reason `git replace` works pretty well is that you can *experiment*. Try changing some parent/child relationship (using a graft style replacement object; see the `git replace` docs), see if that's good. If it's bad, remove the graft. If it's good, keep it. Try another if appropriate. Repeat until you have what you like. Then have your red-letter / flag day: filter-branch, and get everyone to switch from existing clones of the repository you dislike, to new clones of the new repository you do like. — torek, Sep 20 '19 at 07:01
Ok, thanks- that sounds like a way forward. Thanks again for so much info! — Alex Harvey, Sep 20 '19 at 07:02

Alex Harvey · Answer 2 · 2019-09-20T15:38:19.063

Following @torek's observations I found a way to "fix" this that seems to be foolproof.

Using this example again:

* commit beb7ea3351f50dd29899baa878ea2fa29c437ecc
(A)
|
*   commit ed3fef629f8d7268fe29c37029977443eea46494
|\  Merge: d8cf79f 820bea7
| |
| * commit 820bea750c86c90443ca1068e08d6b72cbe317ca
| |
| * commit 19efa83244f1e19976b0e543bf391099bcc1b056
| (B)
|
* commit d8cf79f2103b7d25e6c4dbb96bbd3f672d30bae8
|
* commit fc4c8d8cb9df4c3ee892f0f0f691c71526668d55
(C)

I define:

pointA=beb7ea3351f50dd29899baa878ea2fa29c437ecc
pointC=d8cf79f2103b7d25e6c4dbb96bbd3f672d30bae8
merge_commit=ed3fef629f8d7268fe29c37029977443eea46494

Then I use git replace:

▶ git replace -f --graft "$pointA" "$pointC" "$merge_commit"

This says to make both point C and the merge commit parents of commit A.

My new graph looks like this:

▶ git log --graph 
*   commit beb7ea3351f50dd29899baa878ea2fa29c437ecc (HEAD -> master, replaced)
|\  Merge: d8cf79f ed3fef6
| | 
| *   commit ed3fef629f8d7268fe29c37029977443eea46494
| |\  Merge: d8cf79f 820bea7
|/ /  
| * commit 820bea750c86c90443ca1068e08d6b72cbe317ca
| | 
| * commit 19efa83244f1e19976b0e543bf391099bcc1b056
| 
* commit d8cf79f2103b7d25e6c4dbb96bbd3f672d30bae8
| 
* commit fc4c8d8cb9df4c3ee892f0f0f691c71526668d55

I suppose that is more complicated than it "needs" to be, but the good part is all my commands git log, git blame and so on show me corrected histories both in files at point B and at C.

Finally, as noted by @VonC and explained more fully in @torek's answer here, this has only replaced the local references.

Since I want to force push and force everyone to clone a new version of the history, I need to filter the branch using this:

▶ git filter-branch --tag-name-filter cat -- --all

Oh. You mean git replace won’t persist even if I force push? @VonC — Alex Harvey, Sep 20 '19 at 14:04
Yes: the replace generates references in the `refs/replace/` namespace, which is *not* pushed. Not pushed by default at least (https://stackoverflow.com/a/20072413/6309). See more at https://stackoverflow.com/a/44029527/6309. — VonC, Sep 20 '19 at 14:05
@VonC thanks for letting me know. I’ll try to figure out how to do that. — Alex Harvey, Sep 20 '19 at 14:07
See more at https://stackoverflow.com/a/44029527/6309 (another long and instructive answer from... torek, who else?) — VonC, Sep 20 '19 at 14:07

score 0 · Answer 3 · answered Sep 20 '19 at 06:31

Your gist shows the TODO.md file exists in the merge commit d89d and does not exist in either merge parent, exactly as git log and git show and so on have been telling you. The file was added manually in that merge, for reasons you'll have to find out from the merge author. So somewhere between the C root and the merge somebody made a TODO.md and somewhere between that and the merge somebody deleted it (for reasons you'll have to find out from that commit's author). Likewise with the B root. Then whoever did the merge made a new TODO.md during the merge.

That's what happened, that's what Git's been telling you happened, that's what's in the recorded history: the file was added in that merge. If whoever deleted the earlier ones should have done something different, if those files were deleted in error, go back and start from the last correct commit and then make the commits that should have been made. That's how you "fix" a history you don't want: you record the history you want.

I did the merge using the procedure I linked earlier [here](https://alexharv074.github.io/2017/10/04/merge-a-git-repository-and-its-history-into-a-subdirectory-of-a-second-git-repository.html). This has worked fine for me in the past so I assumed it was reliable. I'm guessing that the git filter-branch somehow edited something that it shouldn't have. Certainly no one added anything manually, if by that you mean they typed "git add" and then committed. — Alex Harvey, Sep 20 '19 at 07:05
