In Git, each commit is [1] a snapshot plus some metadata. Each commit is identified by its hash ID. The metadata in a commit include the hash ID(s) of its parent commit(s). This forms a graph (specifically a Directed Acyclic Graph, or DAG) whose vertices (or nodes) are the commits and whose edges are the one-way child-to-parent links from each node to its parent(s).
What this means is that the history in a repository is the commits. There is no file history. There are only commits.
While git log will show you a purported file history if you ask for one, it's really just making it up. It does so by comparing each commit to its parent(s). For ordinary single-parent commits, this works well. For merges, it works in many cases, but not all. Your particular merge is one of the cases where it doesn't work very well.
You can use the -m flag, as you are doing, to "split" a merge. Instead of doing a combined diff (as with -c or --cc), or no diff at all (as is the default), the -m flag tells git log that, upon encountering the merge (commit d89ddb17122ab9eea72e7006461cb04a5a879770 in your example above), it should first do a diff using parent #1 and the merge. Then it does a second diff, using parent #2 and the merge. In your case parent #1 is either 95febfb or a577995ec16ae05c2f81adfdba5ce28e7b8ba150 (these cannot both be true: you must be omitting something here, or having git log omit something here), and parent #2 is either f85c1bb or 97b8dc2f7cf7e81d75fee5565423b554d191e4f3.
(The git show command is like git log except that it defaults to --cc rather than showing nothing, and stops after showing the named commit. Based on your git show output, it looks like the shorter hash IDs are the actual ones.)
Now, the fact that one particular git show (or git diff --name-status) output shows:
A files_from_C
A more_files_from_C
D jenkins/files_from_B
D jenkins/more_files_from_B
just means that in the parent, there were files whose names were the D names, and in the child, there were files whose names were the A names. It's likely that you have rename detection turned off here: rename detection is off by default in Git versions predating 2.9.0, and on by default in 2.9.0 and later. If you turn it on, Git might show these as "renamed" rather than deleted-and-added, if the contents are similar enough.
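To make the deleted-and-added versus renamed distinction concrete, here is a minimal sketch that compares two snapshots modeled as plain dictionaries. It is not how Git is implemented: it only pairs files whose contents are byte-for-byte identical, whereas Git also scores partially similar files. The function name name_status and the file name build.sh are hypothetical, chosen for illustration.

```python
def name_status(parent, child, detect_renames=False):
    """Compare two snapshots (dicts mapping path -> contents).

    Returns a sorted list of tuples loosely modeled on the output of
    `git diff --name-status`. Assumption: only exact-content renames
    are detected; real Git also handles similar-but-modified files.
    """
    deleted = {p for p in parent if p not in child}
    added = {p for p in child if p not in parent}
    result = []
    if detect_renames:
        for new in sorted(added):
            for old in sorted(deleted):
                if parent[old] == child[new]:
                    # Pair a deleted path with an added path of equal content.
                    result.append(("R100", old, new))
                    deleted.discard(old)
                    break
        added -= {entry[2] for entry in result}
    result += [("D", p) for p in deleted]
    result += [("A", p) for p in added]
    result += [("M", p) for p in parent
               if p in child and parent[p] != child[p]]
    return sorted(result)


# Hypothetical snapshots: the same file, moved out of a subdirectory.
parent = {"jenkins/build.sh": "echo build\n"}
child = {"build.sh": "echo build\n"}

print(name_status(parent, child))
print(name_status(parent, child, detect_renames=True))
```

With detection off, the move is reported as one D entry and one A entry; with it on, the same pair of snapshots yields a single R100 entry. The snapshots are identical in both calls, so only the interpretation changes.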
The same holds for the second git diff --name-status output from git show. This one is comparing the snapshot in parent #2 vs that in the merge-child. It's important to realize that these comparisons are valid on their own, but only give you a small-picture view. The true case is that there are two parents with two snapshots and one child (the merge commit) with one snapshot, and the three snapshots differ in various ways.
... with --all --name-status --full-history --follow --
I see all the history:
--follow turns on rename-finding, but it is a terrible hack. It can only look at one file. You tell git log a starting name. It takes the first commit that git log visits [2] and fetches that commit's parent(s). If there is just one parent, the job is easier: as before, Git diffs the parent vs the child. No file other than the named one is interesting. One of three things now happens:
- If the diff (remember: with rename-finding turned on) shows that the file is modified in place, git log shows the commit, and moves on.
- If the diff shows that the file is unchanged, git log does not show the commit, and moves on.
- If the diff shows that the file is renamed (whether modified or not), git log shows the commit. Then it changes which name it's looking for, to use the "source" name from the parent commit. Then it moves on as before.
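The three cases above can be sketched for a purely linear history. This is an illustration under stated assumptions, not Git's code: each commit's diff against its single parent is supplied up front as name-status tuples, the function name follow and all commit and file names are hypothetical, and an "A" (added) entry is treated like a modification so that the commit that creates the file is shown too.

```python
def follow(commits, path):
    """Simulate a --follow-style walk over a linear history.

    commits: newest-first list of (commit_id, diff) pairs, where diff is
    a list of name-status tuples against the single parent, such as
    ("M", "a.txt"), ("A", "a.txt") or ("R100", "old.txt", "new.txt").
    Returns the (commit_id, entry) pairs that would be shown.
    """
    shown = []
    for commit_id, diff in commits:
        for entry in diff:
            status = entry[0]
            if status.startswith("R") and entry[2] == path:
                # Renamed: show the commit, then look for the older name.
                shown.append((commit_id, entry))
                path = entry[1]
                break
            if status in ("M", "A") and entry[1] == path:
                # Modified in place (or created): show the commit.
                shown.append((commit_id, entry))
                break
        # No matching entry means the file is unchanged: show nothing.
    return shown


history = [
    ("c3", [("M", "new.txt")]),
    ("c2", [("R100", "old.txt", "new.txt")]),
    ("c1", [("M", "other.txt")]),
    ("c0", [("A", "old.txt"), ("A", "other.txt")]),
]
for commit_id, entry in follow(history, "new.txt"):
    print(commit_id, *entry)
```

Commit c1 is skipped because it touches only other.txt, and c0 is found only because the walk switched from new.txt to old.txt at c2.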
This same pattern is also used for merge commits! However, merge commits have very ... interesting git log behavior, which leads us to the next point. (It's time to stop for footnotes now.)
[1] More precisely, the commit refers to a snapshot. If two different commits have 100% identical snapshots, they just re-use the same one.
[2] The order in which commits are walked, when git log is given --all, is somewhat tricky.
How git log works when there is more than one commit to show
We already mentioned that history is commits. When a commit chain is linear:
... <-F <-G <-H ...
it's pretty easy for Git to show commit H (by diffing G and H) and then just move on to show G (by diffing F and G) and then move on to show F, and so on. There's only one commit at a time to show: you start at the last one, identified by some branch name, and work backwards, one commit at a time.
This breaks down at merges. It is also a problem when you tell git log to start at two or more commits, as git log --all typically does.
The algorithm git log actually uses here involves a priority queue. You give git log some set of starting points:
git log master develop origin/feature
for instance resolves each of the three names, master, develop, and origin/feature, to hash IDs (presumably commits; if these are branch and remote-tracking names, they are commits). Assuming there are three different commit hash IDs, [3] all three commit IDs go into the priority queue.
Now that the priority queue is non-empty, Git picks the first commit from the queue. Which one is first? That depends on the sort options you supply on the command line: --author-date-order, --topo-order, and so on. Giving no options means that the priority is by committer date: later dates have higher priority. To see what each sorting option does, see the git log documentation, but note that this sorting only happens when the queue has more than one commit in it.
The git log command now shows, or doesn't show, the commit it picked, based on the rest of the criteria from the command line. It typically then places all of the commit's parents into the priority queue, unless those parents have already been visited. However, several options, including listing a file name like TODO.md, change this behavior by turning on history simplification. When history simplification is on, some parents are omitted. Adding --full-history forces all parents to be inserted into the priority queue.
With --follow, using --full-history is not always helpful, as we're about to see. But let's finish up with the graph-walk algorithm first.
We can now look at how git log really works, in much more detail:
Place command-line arguments, as translated into raw commit hash IDs, into the priority queue. If no command-line argument is used to select one or more starting commits, use HEAD to select the starting commit.
While the queue is not empty:
- Take the first element off the queue. (This commit is now visited.)
- Decide whether to show this commit. If so, show it (doing parent rewriting as well, if that is enabled; that's another topic entirely, and it only matters if you are using --parents or --graph).
- Enumerate this commit's parents, applying history simplification if enabled. Place chosen parent(s) into the priority queue unless already present or already visited. If the commit has no parents, or they're skipped, the queue becomes shorter. If multiple parents go into the queue, the queue becomes longer. The "priority" part of the priority queue determines which commit will be at the front when we get back to the first step of the loop.
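The loop described above can be sketched with Python's heapq module. This is a simplified model, not Git's implementation: it assumes the default ordering (later committer date first), always "shows" the commit, and does no history simplification. The function name walk, the data layout, and the integer dates are hypothetical.

```python
import heapq


def walk(commits, starts):
    """Yield commit IDs in the order a date-ordered walk would visit them.

    commits: dict mapping commit ID -> (committer_date, [parent IDs]).
    starts:  the commit IDs named on the command line.
    """
    queue = []      # min-heap; dates are negated so later dates pop first
    seen = set()    # commits already queued or visited

    def push(commit_id):
        if commit_id not in seen:
            seen.add(commit_id)
            heapq.heappush(queue, (-commits[commit_id][0], commit_id))

    for commit_id in starts:
        push(commit_id)

    while queue:
        _, commit_id = heapq.heappop(queue)   # take the first element
        yield commit_id                       # "show" it (always, here)
        for parent in commits[commit_id][1]:  # enqueue every parent
            push(parent)


# A merge M of two root commits; B has the later committer date.
commits = {
    "M": (3, ["A", "B"]),
    "B": (2, []),
    "A": (1, []),
}
print(list(walk(commits, ["M"])))
```

For this toy history the walk visits M, then B (the later date), then A. Swapping in a different sort key would model the other ordering options; dropping parents inside the final loop would model history simplification.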
That's pretty much the whole algorithm. A lot of weirdness follows from the last two steps of the loop: deciding whether to show a commit, and choosing which parents to enqueue. History simplification at merges, unless disabled with --full-history, consists of following just one TREESAME parent, if there is one, and ignoring the other parents! (Understanding this requires defining TREESAME. Fortunately you're using --full-history so we don't have to do that.)
[3] If you name tag objects, git log translates the tag name to a commit hash ID, almost as if you'd used tag^{commit}; see the git rev-parse documentation for details. The git log command is fundamentally interested in commits, so it ignores attempts to log blob hashes and the like.
Think about how the rename-detection interacts with the priority queue
Suppose we're looking at the following very simple history, with commit M as the HEAD on our single branch master:
M (merge commit)
|\
| B (parent #2)
A (parent #1)
Suppose further that there's exactly one file in M, named final. Its contents exactly match those of the only file in commit A (which is named A) and of the only file in commit B (which is named B).
(Here's the actual git log --oneline ... output:
* f11ea2a (HEAD -> master) merge A and B to final
|\
| * 811819b (B) B
* 50d92c7 A
which will be useful below. My hash IDs are of course mine.)
We run:
git log --name-status --oneline --follow --full-history -m -- final
(the -m is required in this case, as I found out via testing). Git extracts M and the first of the two parents and diffs them. It finds that, from A to M, there's a rename from A to final. So it will show commit M. Then it changes its file-following: it is no longer looking for final, but rather for A. Now it diffs commits B and M. Neither has a file named A, so it shows nothing here.
The next commit in the queue is B (because it has a later date). To compare a no-parent (root) commit, Git will diff it against the empty tree. Git diffs nothing-vs-commit-B and finds that we added file B. This is not the file we are looking for, so Git says nothing.
Git now moves on to consider commit A. Here, it finds that commit A adds a file A, which is the one file it is looking for.
The final output is this:
$ git log --name-status --oneline --follow --full-history -m -- final
f11ea2a (from 50d92c7) (HEAD -> master) merge A and B to final
R100 A final
50d92c7 A
A A
The message f11ea2a (from 50d92c7) tells us that the commit being shown in the next line is virtual-split-f11ea2a with parent 50d92c7 (merge M with parent A). The R line tells us file A was renamed to final in the merge.
The virtual-split-f11ea2a for B is not printed because neither of these commits has file A in it, and we're already looking for A instead of final.
Next, 50d92c7 is commit A itself. The subsequent A line tells us file A was added in commit 50d92c7 (commit A).
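The walkthrough above can be reproduced with a small simulation of the three-commit toy history. This is a sketch of the behavior as described here, not Git's actual code. It assumes that the followed name is a single piece of state that switches as soon as a rename is seen, that renames are detected only for identical contents, and that the visiting order M, B, A is supplied by hand. The names diff_entry and follow_with_m are hypothetical.

```python
def diff_entry(parent, child, path):
    """Return the name-status entry for `path` in child vs parent, or None.

    Snapshots are dicts mapping path -> contents. Deletions are ignored,
    and only exact-content renames are detected (a simplification).
    """
    if path not in child:
        return None
    if path in parent:
        return ("M", path) if parent[path] != child[path] else None
    for old, content in sorted(parent.items()):
        if old not in child and content == child[path]:
            return ("R100", old, path)
    return ("A", path)


def follow_with_m(snapshots, parents, order, path):
    """Diff each commit against each parent in turn (as -m does),
    tracking one file name that changes when a rename is found."""
    shown = []
    for commit_id in order:
        # A root commit is compared against the empty tree.
        for parent_id in parents[commit_id] or [None]:
            parent_snap = snapshots[parent_id] if parent_id else {}
            entry = diff_entry(parent_snap, snapshots[commit_id], path)
            if entry is None:
                continue
            shown.append((commit_id, parent_id, entry))
            if entry[0].startswith("R"):
                path = entry[1]   # from now on, look for the old name
    return shown


snapshots = {
    "A": {"A": "same\n"},
    "B": {"B": "same\n"},
    "M": {"final": "same\n"},
}
parents = {"M": ["A", "B"], "A": [], "B": []}

for commit_id, parent_id, entry in follow_with_m(
        snapshots, parents, ["M", "B", "A"], "final"):
    print(commit_id, "(from %s)" % parent_id if parent_id else "", *entry)
```

Only two entries come out: the rename seen when M is compared with A, and the creation of file A in commit A. Commit B never appears, because by the time it is examined the walk is looking for the name A. Listing B before A in parents["M"] would make the walk follow B instead, which is exactly the arbitrariness the example is meant to show.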
Commit B is omitted, even though it too created B from scratch, and B was then renamed to final. Or was it A that was renamed to final? Well, both are true, or maybe neither: maybe I created file final from scratch, throwing away the two files A and B.
The real point of all this exercise is that there isn't one "real" file history. The only history in this Git repository is the set of commits in the repository, with their parent/child relationships. Everything else is fiction! We can, to some extent, get a useful fiction from git log, but that extent has limitations.
Conclusion
Is there a way that I can edit this history so that git show, git blame, and others will reveal the true history of all this repo's files, whether they were created at (B) or (C)?
Not really, no. The problem is there isn't a file history. You can arrange the commit history however you like, knowing now what (for instance) git log does in terms of looking for rename operations, whether that's because you've used -M, or set diff.renames to true, or are using a Git that is 2.9 or later, or are using --follow to fake up a file history with Git's rather poor, but sometimes barely adequate, methods.
The git show command is the same as git log except that when generating diff output, it defaults to using --cc to produce combined diffs. A combined diff omits any file that is the same in any parent as it is in the child commit. Suppose merge commit M has parents P1 and P2, and all but two of the files in M exactly match those in P1. Suppose further that the two files in M that don't match P1 do exactly match those in P2. The combined diff will therefore show no files changed.
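The omission rule can be sketched as a filter over snapshots. This models only which files get listed, not the combined diff format itself, and it ignores files that were deleted in the merge. The function name combined_diff_files and the file names are hypothetical.

```python
def combined_diff_files(merge, parents):
    """Return the paths a combined diff would list for a merge.

    merge and each parent are dicts mapping path -> contents. A path is
    listed only if its contents in the merge differ from those in every
    parent; matching any single parent is enough to omit it.
    """
    return sorted(
        path
        for path, content in merge.items()
        if all(parent.get(path) != content for parent in parents)
    )


p1 = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
p2 = {"a.txt": "1", "b.txt": "2", "c.txt": "2"}

# b.txt and c.txt differ from P1 but match P2, so nothing is listed.
m = {"a.txt": "1", "b.txt": "2", "c.txt": "2"}
print(combined_diff_files(m, [p1, p2]))

# Here c.txt matches neither parent, so it is the only file listed.
m_edited = {"a.txt": "1", "b.txt": "2", "c.txt": "3"}
print(combined_diff_files(m_edited, [p1, p2]))
```

The first merge takes every file verbatim from one parent or the other, so the sketch lists no files, matching the P1/P2 scenario described above.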
The git blame command is more complex. It can look for lines that were copied or moved from any file in the parent: see the -C option. I have never delved into what it does at merge commits (does it look for lines copied or moved from any parent file?), but I assume that like git log, it is eventually forced to do some kind of history simplification, because it's impractical to follow every path backwards.