TL;DR: "Combined diffs"
When you invoke git diff this way, by naming three or more commits on the command line, you're invoking the same internal machinery that git show uses for showing a merge commit. This depends on the fact that each commit stores a single Git "tree" object. A merge commit has two or more parent commits, each of which also has a tree object, so when git show is handed a merge commit hash ID, it has three or more tree objects to compare, while the basic difference-engine algorithm can only take two at a time. It therefore does ... something, and knowing what that "something" is helps every time you see one of these diffs. Git calls this something a combined diff. You don't have to memorize the details of combined diffs: just look them up in the manuals. Do, however, remember that the Git documentation splits up two key facts about combined diffs:
- Combined diffs omit some files entirely.
- Combined diffs can omit some diff hunks entirely.
When reading the manual pages, remember to search for both sections about combined diffs.
Note that you get combined diffs for a plain, no-arguments-provided git diff command when you're in the middle of an incomplete merge operation, too. However, in this case, the sources for the to-be-diffed files are your working tree and Git's index, rather than multiple commits. This answer does not cover these details.
Short-ish
Git has special git diff logic for this: an attempt that (in my opinion at least) works sort-of-OK for some common cases where we want to see why a merge commit has the tree it has, and not some other tree we might have expected. This attempt has some flaws. Since merge-ort became the default merge strategy (in Git 2.34) and brings some new features, there may someday be a better way to do this, but for now, git show of a merge can sometimes help you figure out what happened. The mechanism it uses is to run git diff n times, where n is the number of parents of the merge commit, then combine parts of the results, and discard other parts of the results, to form a combined diff. This makes sense for showing a merge. It makes less sense for showing a non-merge commit (and no sense for your original cherry-pick purpose).
The files that get omitted entirely from a combined diff are those where the merge commit's version of the file exactly matches the version in at least one of the parent commits. The general idea here is that whoever did the merge must have thought that one of the two parents had the right code. The flaw in this general idea is that perhaps the person who made the merge was an idiot. (Or, to be a bit fairer, perhaps the person who made the merge didn't realize that they needed to look at the other commit's changes. This actually happens surprisingly often in real life, via people who were not properly taught how to use Git's merge operation.)
In any case, the git diff command line syntax has always allowed you to invoke the combined-diff code on any number of Git tree objects of your choice. When I was fixing a small bug in git diff, I updated the usage documentation (see commit b7e10b2ca210d6a3647910fdecea33581e4eaf0d) to mention that this is how you can get git diff to do what git show does.
The actual operation uses the commits' trees, and you can run git diff on what Git calls a <tree-ish>. That is, git diff HEAD HEAD^1 HEAD^2 operates using HEAD^{tree} as the merge-commit's tree and HEAD^1^{tree} and HEAD^2^{tree} as the other two trees. You can invoke it this way. But that's something of an accident of the implementation: if we documented it formally, we'd never be able to change this. There's some tension between documenting "what Git really does" and "what Git logically should be doing", and in this case, I felt that consistency favored the "logically should be doing", so that's what's in the linked commit.
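As a rough illustration of the two spellings just described, here is a sketch that builds a throwaway repository, resolves a conflicted merge by hand, and runs the multi-argument git diff both ways. Everything specific in it (the file name shown.txt, the branch names, the commit messages, the colors) is made up for the example, and it assumes git and a POSIX shell are installed.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.name "Example"
git config user.email "example@example.com"

printf 'color = red\n' > shown.txt
git add shown.txt
git commit -q -m H

git checkout -q -b br1
printf 'color = blue\n' > shown.txt
git commit -q -am J

git checkout -q -b br2 main
printf 'color = green\n' > shown.txt
git commit -q -am L

git checkout -q br1
git merge br2 >/dev/null 2>&1 || true    # expected to stop with a conflict
printf 'color = teal\n' > shown.txt      # hand-made resolution
git add shown.txt
git commit -q -m M

# Merge commit first, then its parents:
by_commit=$(git diff HEAD HEAD^1 HEAD^2)
# The same comparison, spelled with explicit tree objects:
by_tree=$(git diff 'HEAD^{tree}' 'HEAD^1^{tree}' 'HEAD^2^{tree}')

printf '%s\n' "$by_commit"
[ "$by_commit" = "$by_tree" ] && echo "both spellings print the same diff"
```

The printed diff should start with a "diff --cc shown.txt" header, which is the same style of output git show produces for the merge commit itself.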
Although I don't think anyone should use this mechanism with a non-merge commit, reading the long description below will allow you to understand what you see. You can decide whether what you see has any use to you: Git is a set of tools, not a solution, and you can plug the power screwdriver into the bandsaw even if that doesn't make sense. But by not documenting that you can do this, we try to keep people away from it. In a longer article like this one, though, I go right into the actual mechanism, so if you can do it, I show you how: I just tell you "don't trust it too much".
Long
You're ultimately interested in cherry-picking multiple commits, but you've asked about git diff. This is a bit of an XY problem: cherry-picking is actually a kind of merge, which is not just a diff, and cherry-picking multiple commits means doing multiple merges, not one big merge. Still, the question you are asking is a valid question. It has an answer. Here is that answer.
Warning: this gets a bit long. Let's start with the easy part:
Does this actually do the equivalent of showing what the changes are if all of the commits were done ...
No. All of the commits you feed into this kind of git diff are already done! There's no "if they were" about it. They are. They are not "proposed changes to make", they are existing commits. Moreover, no commit is a change in the first place!
Commits are snapshots; diffs are comparisons of snapshots
Let's take that first claim and refine it a bit. A commit is a snapshot plus metadata, so "commits are snapshots" is an incomplete statement: true as far as it goes, but missing the "plus metadata" part. The "snapshot" part is what we're concentrating on here though, so while we should keep the "plus metadata" in mind, let's go with the "snapshot" part:
Every commit has a full snapshot of every source file. More precisely, it has the source files it has: any source file it lacks means, in effect, "when extracting this snapshot, make sure to remove that file". Think of each commit as an archive (tarball, WinRAR, zip archive, whatever). If you downloaded and installed that archive, you would have those files, and no other files. That's the snapshot in the commit.
(The actual format of this snapshot is very special and Gitty, such that when we make thousands, or millions, of snapshots of some project, it hardly takes any more space than just one snapshot, or maybe a few tens or hundreds of snapshots. Git achieves this through de-duplication of snapshotted files, plus delta compression of internal Git objects that gets applied later in the process. We don't need to worry about any of this: that's all invisible to us, except in terms of savings on disk space and network bandwidth when we clone the repository.)
So, given any two commits, we have two snapshots. If the two commits are "near" each other, they are like two film frames. We can take those snapshots and place them side by side, and play a game of Spot the Difference. Did the dog move? Maybe his fur color changed! Look, the hands on the analog clock moved: an hour passed between the two snapshots!
Instead of completely writing down the two snapshots, we can express the difference between them, as a git diff. Git's git diff is inspired by the old context diff and unified diff formats from the traditional Unix diff command, descended from Doug McIlroy's original implementation (see the Hunt–Szymanski or Hunt–McIlroy algorithm, though Git uses a variant of the Myers algorithm: see "Myers diff algorithm vs Hunt–McIlroy algorithm"). If we use this algorithm on adjacent commits (commits with a parent/child relationship, in Git), we see a representation of the change that some human made.
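To make the snapshot-versus-diff distinction concrete, here is a small sketch, assuming git and a POSIX shell. The file name scene.txt and its contents are invented for the example. Each commit stores the whole file, and the "change" only appears when we ask Git to compare two snapshots.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'The dog is brown.\nThe clock says 1:00.\n' > scene.txt
git add scene.txt
git commit -q -m "first frame"

printf 'The dog is brown.\nThe clock says 2:00.\n' > scene.txt
git commit -q -am "second frame"

# Each commit holds the complete file, not a change:
git show HEAD^:scene.txt
git show HEAD:scene.txt

# The change is computed on demand by comparing the two snapshots:
changes=$(git diff HEAD^ HEAD)
printf '%s\n' "$changes"
```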
Sidebar: a diff isn't necessarily what a human did
Note that we don't necessarily see the actual change some human made. To take a trivial example, suppose someone has this as their original file:
Paris in the
the
the
spring
The human deletes one of the redundant copies of the word "the": perhaps line 2. The computer says: "delete the last of the three copies of the word the", i.e., delete line 3. That's not the same thing, but it produces the same result.
More commonly, when we have languages that use balanced braces and/or parentheses as part of their construction, we might have:
if repeatable_test {
    thing1 // may change the condition, so that testing again
           // produces a different result
}
if repeatable_test {
    thing2 // may change the condition
}
Someone might insert:
if repeatable_test {
    thing3
}
between the first and second test, and our diff algorithm might present this as a change of the form:
 if repeatable_test {
     thing1
 }
 if repeatable_test {
+    thing3
+}
+if repeatable_test {
     thing2
 }
This change achieves the same result. But it's not what the human did. To the machine, there's no obvious way to choose which diff to use. Git recently (version 2.14) picked up a default diff indent heuristic for display to help out here, but it is not (yet?) used in merge, and this can cause problems during cherry-picking when Git picks the wrong set of "changed lines". (It's not all that common for it to cause problems, and indeed, it's not all that common to see it in the first place. That, plus the fact that the indent heuristic is non-obvious and doesn't always work, is why it took until Git 2.14 for Git to acquire it.)
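The "Paris in the the the spring" example from this sidebar can be tried directly. This is only a sketch: the file names are invented, and which line the diff reports as deleted may depend on the diff tool and its options. The point it demonstrates is that two different human edits leave identical files, so no diff can recover which one actually happened.

```shell
set -e
cd "$(mktemp -d)"                       # scratch directory, safe to delete
printf 'Paris in the\nthe\nthe\nspring\n' > original.txt

# Two different human actions: delete line 2, or delete line 3.
sed '2d' original.txt > human-deleted-line-2.txt
sed '3d' original.txt > human-deleted-line-3.txt

# Both edits leave byte-for-byte identical files:
cmp human-deleted-line-2.txt human-deleted-line-3.txt && echo "same result"

# So the diff reports one plausible deletion, not the actual keystrokes.
# (git diff --no-index exits with status 1 when the files differ.)
git diff --no-index original.txt human-deleted-line-2.txt || true
```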
Git merges
Before we cover "combined diffs", we really need to note some things about git merge. The key insights are these two:
- Merging is about combining work. This means we need to define "work".
- A merge commit, in Git, is a commit with special metadata. In particular it has two or more parent commits. There is nothing special about the snapshot in a merge commit. It is just the same ordinary snapshot as in any other commit.
Let's look briefly now at some of the metadata in each commit.
Every commit has a unique hash ID. The hash ID is a big, ugly, random-looking string of letters and digits, such as e4a4b31577c7419497ac30cebe30d755b97752c5
. This is actually a very large number expressed in hexadecimal. The number isn't actually random: it's a cryptographic checksum of the raw commit data, so that every piece of Git software anywhere in the universe will compute the same hash ID for the same commit. That way, two separate implementations of Git, working with two separate repositories, can talk to each other and find out which repository or repositories has some particular commit, just by comparing hash IDs. This clever trick resides at the heart of Git's distributed nature, making it efficient to have distributed clones of repositories. All we really need to know, though, is that the hash ID uniquely identifies some particular commit. Git needs this hash ID; if we can give Git the hash ID, Git can tell if it has the commit, and if it does have the commit, Git can get, use, and display the commit. If our Git—our software working with our repository—doesn't have the commit, we hook ours up to some other Git that does, and get it, and then we're good.
So: each commit has a snapshot plus metadata, and in the metadata for any one given commit, Git stores a list of previous commit hash IDs. Most commits have exactly one previous-commit-hash-ID in this list. Such a commit is an ordinary commit: it has one parent, and Git uses the stored hash ID in the commit to get and use the parent.
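The parent list can be inspected directly. The following sketch (assuming git and a POSIX shell, with made-up file contents and commit messages) makes two commits and prints the raw metadata of the second, which should contain one tree line for the snapshot and one parent line naming the first commit. The first commit has no parent line at all.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "A (root commit)"
echo two > file.txt
git commit -q -am "B"

# Raw contents of commit B: a tree line (the snapshot), a parent line,
# then author, committer, and the log message.
git cat-file -p HEAD

parent_in_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
# grep -c exits non-zero when it counts zero matches, hence the "|| true".
root_parent_lines=$(git cat-file -p HEAD^ | grep -c '^parent ' || true)
echo "B's stored parent hash ID: $parent_in_metadata"
echo "parent lines in root commit A: $root_parent_lines"
```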
Being able to get the commit itself—the child—and the parent gives Git two commits, and now Git can play the Spot the Difference game and show us a diff. That's what we see when we run:
git show main
for instance. Git uses the name main (or in my case below, the special magic name HEAD) to find a hash ID like e4a4b31577c7419497ac30cebe30d755b97752c5, uses hash ID e4a4b31577c7419497ac30cebe30d755b97752c5 to find parent commit 49c837424a6152618aad42fa6d5083c6be1fa718, and uses the pair so that we get:
$ git show
commit e4a4b31577c7419497ac30cebe30d755b97752c5 ...
diff --git a/GIT-VERSION-GEN b/GIT-VERSION-GEN
index 120af376c1..b210b306b7 100755
--- a/GIT-VERSION-GEN
+++ b/GIT-VERSION-GEN
@@ -1,7 +1,7 @@
 #!/bin/sh
 
 GVF=GIT-VERSION-FILE
-DEF_VER=v2.37.0-rc2
+DEF_VER=v2.37.0
 
 LF='
 '
That's fine for an ordinary commit: the changes in the commit are those from the parent to the commit. But this also leads us to how merge works.
Let's draw a series of commits in a new, nearly-empty repository. Let's say we have just three of them, and for simplicity in our drawing, let's pretend their hash IDs are A, B, and C in that order. Then we have:
A <-B <-C <--main
The name main provides the hash ID of the latest commit C. Commit C stores a snapshot and metadata, and the metadata give Git the hash ID of commit B. Commit B stores a snapshot and metadata, and B's metadata give Git the hash ID of commit A. Commit A stores a snapshot and metadata ... well, you get the idea, but let's note that A is the very first commit. As such, it has no parent, so its list of parent hash IDs is just empty. This allows a program like git log, which works backwards from the end to the beginning, to stop.
Now suppose time has passed and we have more commits (perhaps as many as eight!) and our drawing now looks like this:
...--F--G--H <-- main
The name main
now locates commit H
, which points back to earlier commit G
, which points back to F
, and so on. For various reasons I've grown lazy about drawing the arrows between commits, but still use an arrow coming out of a branch name to show where the branch name points.
Let's now make a new branch name, br1
, that also points to commit H
, like this:
...--F--G--H <-- br1, main
Note that all the commits are on both branches. We do, however, now need a way to know which name we're using. To help out, Git uses the special name HEAD
, written in all uppercase like this: it "attaches" this special name to one branch name. If we are "on" main
—if we have run git checkout main
or git switch main
—then HEAD
is attached to main
:
...--F--G--H <-- br1, main (HEAD)
If we run git switch br1
, to switch to branch br1
, we get:
...--F--G--H <-- br1 (HEAD), main
Either way we're using commit H
, but we're using it through a different name.
Now suppose we add one new commit, in the usual way (modify some files, git add
, and git commit
). We get a new commit, with a new, unique hash ID: we'll call this I
for short, and draw it in:
             I   <-- br1 (HEAD)
            /
...--F--G--H   <-- main
Note how HEAD
is still attached to the name br1
, but the name br1
now points to I
instead of H
. If we make a second new commit we get:
             I--J   <-- br1 (HEAD)
            /
...--F--G--H   <-- main
If we "switch back" to main
(with git switch main
or git checkout main
), we get:
             I--J   <-- br1
            /
...--F--G--H   <-- main (HEAD)
Git removes the commit-I
files—they're safely archived in commit I
for later recovery—and installs the commit-H
files for us to work on / with. We can now create another branch name br2
, or just use main
.
I'll go ahead and create and switch to a new name br2
and then make yet another new commit K
, to get this:
             I--J   <-- br1
            /
...--F--G--H   <-- main
            \
             K   <-- br2 (HEAD)
Adding yet another commit L
gives me:
             I--J   <-- br1
            /
...--F--G--H   <-- main
            \
             K--L   <-- br2 (HEAD)
This is how branches work (and grow) in Git. But now that we have some branches, we might want to use git merge
. And, as we noted above, merging is about combining work. But what work did we do on branch br1
? How will we know? What work did we do on branch br2
? How do we combine this work?
We could try using git diff on commits J and L, but that's going to be wrong. Suppose that in H, we described, in some text file, a red ball, and by commit J we had changed it to a blue ball. Meanwhile in the K-L series of commits, we left it alone. A diff from J to L will say that we should change blue ball back to red ball. That's not right!
Meanwhile, maybe on the H-K-L line we found that RED and BLUE needed to be qualified: spelled out as COLOR_RED and COLOR_BLUE in some code files. We want to keep those changes too. If we compared L to J, it would say to change those back, and that's not right either.
What we need is to somehow compare what's in commit H (the starting point where we began "new work" on br1) to what's in commit J, to see what work we did on br1. Then, using the same starting commit, we can compare what's in H to what's in L, to see what work we did on br2.
Commit H in this case is the merge base, and doing these two sets of comparisons is how merging works. We diff H twice: once against J to see what changes happened on the H-I-J path, and once against L to see what changes happened on the H-K-L path.
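The two comparisons against the merge base can be run by hand. In this sketch (assuming git and a POSIX shell), the file names notes.txt and code.txt are invented, and a single commit on each branch stands in for the I-J and K-L series. git merge-base finds H, and the two diffs from H show each side's work, while the direct br1-to-br2 diff would "undo" the ball change.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'a red ball\n' > notes.txt
printf 'if color == RED\n' > code.txt
git add notes.txt code.txt
git commit -q -m H

git checkout -q -b br1
printf 'a blue ball\n' > notes.txt
git commit -q -am J

git checkout -q -b br2 main
printf 'if color == COLOR_RED\n' > code.txt
git commit -q -am L

base=$(git merge-base br1 br2)      # finds commit H
ours=$(git diff "$base" br1)        # work done on the br1 side
theirs=$(git diff "$base" br2)      # work done on the br2 side
wrong=$(git diff br1 br2)           # tempting, but says "blue back to red"

printf '%s\n' "$ours" "$theirs"
```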
We then simply (or complicatedly) have Git combine these two sets of changes. If we changed red ball to blue ball, and changed if color == RED to if color == COLOR_RED, and likewise BLUE to COLOR_BLUE, Git will try to keep both changes. In some cases, these two changes will overlap (touch the same lines of the same files) and Git will declare a merge conflict. If Git doesn't see any conflicts (if no diff lines overlap, more or less¹), Git will do the merge entirely on its own. If Git does see conflicts, it will stop in the middle of the merge. Git will make us fix up the files to contain the "right" final result, whatever we claim that is. Either way, whether we have to fix up the files ourselves or Git thinks it can do everything on its own, we eventually pick a final snapshot to use with our new commit, and we have Git make this new merge commit M, like this:
             I--J
            /    \
...--F--G--H      M
            \    /
             K--L
I took all the branch names out of this diagram for several reasons:
- it's hard to make main point to H and one of the two brs point to M without having the text overlap;
- the br name that points to M depends on which branch we're "on" when we run git merge, but the snapshot that goes in M depends only on the actual merge snapshot, and that's what we really care about here.
So let's just assume that, some time later, we have:
          I--J
         /    \
...--G--H      M--N   <-- somebranch (HEAD)
         \    /
          K--L
Our key concepts here (remember the "key insights" line from above?) are that merge commit M has two parents, J and L, and that the snapshot in M is the result of combining work. The "combining" took two diffs (a diff from H to J, and a diff from H to L), smashed them together, and applied the smashed-together changes to H to get M, possibly with human assistance.
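Here is a sketch of a merge that Git completes on its own, assuming git and a POSIX shell. The file names and contents are invented, with one commit per branch standing in for I-J and K-L. The resulting commit M should list J and L as its first and second parents, and its snapshot should contain both sides' work.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'a red ball\n' > notes.txt
printf 'if color == RED\n' > code.txt
git add notes.txt code.txt
git commit -q -m H

git checkout -q -b br1
printf 'a blue ball\n' > notes.txt
git commit -q -am J

git checkout -q -b br2 main
printf 'if color == COLOR_RED\n' > code.txt
git commit -q -am L

j=$(git rev-parse br1)
l=$(git rev-parse br2)

git checkout -q br1
git merge -q -m M br2                   # no overlap: Git finishes by itself

git cat-file -p HEAD | grep '^parent '  # two parent lines: J, then L
cat notes.txt code.txt                  # both changes are in the snapshot
```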
¹ Git considers two diffs to conflict with each other if they just "touch at the edges" (abut), too. This is an arbitrary choice: some merge algorithms don't call this a conflict, and some do.
Combined diffs, or, how can we "see" commit M?
When we look at an ordinary (single-parent) commit, we "see" it as a diff:
diff --git a/GIT-VERSION-GEN b/GIT-VERSION-GEN
index 120af376c1..b210b306b7 100755
--- a/GIT-VERSION-GEN
+++ b/GIT-VERSION-GEN
@@ -1,7 +1,7 @@
 #!/bin/sh
 
 GVF=GIT-VERSION-FILE
-DEF_VER=v2.37.0-rc2
+DEF_VER=v2.37.0
 
 LF='
 '
In this commit, just one file changed, GIT-VERSION-GEN; one line of that one file changed. A simple git diff shows us this. The diff algorithm itself can only compare two snapshots, but we only have two snapshots to compare.
But for a merge commit like M, we have at least three snapshots. We have J, L, and M. (We might even have H, if we care to find it again. Unfortunately Git doesn't record H's hash ID, which I think is a mistake: we can run the same algorithm again to find H, but Git also doesn't record the algorithm used, and does offer us a choice of algorithms, so we're SOL, and that's why it's a mistake. I think Git should have recorded the algorithm and the merge bases used, just for completeness, but certainly it should have saved at least one of these.)
Some commands, including git log
by default, just say, in effect: Oh, that's too hard. I just won't show any diff at all. They don't invoke the diff algorithm on merge commits.
Other commands, including git show
by default, have a different answer: they invoke a combined diff. A combined diff, in Git, takes a merge commit snapshot like M
, and runs more than one diff. For git show
in particular, Git will list out the parent commits, in order—the list has an order—and run one git diff
from the first parent to the merge, then a second git diff
from the second parent to the merge. A merge can, technically, have two or more parents, so if it has three or more, Git keeps going here, running a third diff from the third parent to the merge, and so on.
Each diff here can list one or more changes to one or more files. One way to combine such diffs would be to literally combine them all, but that's not what git diff's combined diff does. Instead, it takes a couple of short-cuts, based on the idea that it's showing a merge commit.² Specifically, Git looks at each parent-vs-merge comparison first. If any parent's copy of a file exactly matches the final copy of that file in the merge commit, Git throws that file out of the diff entirely!
For a merge commit, this leaves only those files where the merge commit's version of the file doesn't match any parent commit's version of the file. That is, for our merge commit M, file shown.txt doesn't match J's shown.txt and doesn't match L's shown.txt. Git took changes from both commits and combined them (hence the name "combined diff").
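The file-omission rule can be observed in a small experiment. In this sketch (assuming git and a POSIX shell), shown.txt is changed on both branches and resolved by hand, while other.txt, a name invented for the example, is changed on br1 only. The merge's copy of other.txt therefore matches its first parent, and the combined diff should leave that file out even though it differs from the second parent.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'color = red\n' > shown.txt
printf 'original\n' > other.txt
git add shown.txt other.txt
git commit -q -m H

git checkout -q -b br1
printf 'color = blue\n' > shown.txt
printf 'changed on br1 only\n' > other.txt
git commit -q -am J

git checkout -q -b br2 main
printf 'color = green\n' > shown.txt
git commit -q -am L

git checkout -q br1
git merge br2 >/dev/null 2>&1 || true    # expected to stop with a conflict
printf 'color = teal\n' > shown.txt      # hand-made resolution
git add shown.txt
git commit -q -m M

# Combined diff for M: only shown.txt should appear.
combined=$(git show --format= HEAD)
printf '%s\n' "$combined"

# Yet other.txt does differ between the second parent and M:
git diff --name-only HEAD^2 HEAD
```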
Now, maybe branch br1 changed lines 5 through 10 of shown.txt and branch br2 changed lines 105 through 110, so that there was no overlap at all. If that's the case, a -c combined diff shows those changes with single + and - markers. The column each marker sits in tells you which parent differs from the final result at that line. (The default --cc form drops such hunks, because in each of them the final result matches one of the parents; see the notes on -c vs --cc below.)
But maybe there was some overlap. Maybe br1 changed lines 5-10, and br2 changed line 7, right in the middle. Here, you'll see lines that carry multiple + and/or - markers, like this example from the documentation:
- static void describe(char *arg)
 -static void describe(struct commit *cmit, int last_one)
++static void describe(char *arg, int last_one)
Here, parent #1 said static void describe(char *arg). Parent #2 said static void describe(struct commit *cmit, int last_one). The merged result says static void describe(char *arg, int last_one). The diff output tells you that both of the two input lines were deleted and the final result effectively adds the new line to both input files. (We cannot see the merge base commit's copy of this file at all, as Git has no idea which commit(s) were the merge base(s).)
² This means that if you use it on something that isn't a merge commit, you're getting a deliberately defective diff. As long as you know this and take it into account, that's OK; just remember that Git does this.
-c vs --cc, and final notes
Note that when choosing a combined diff, you can either ask for -c or --cc. The difference between these is poorly documented: currently the main mention is in the git log documentation, under the --diff-merges option, which says this:
--diff-merges=combined
--diff-merges=c
-c
With this option, diff output for a merge commit shows the differences from each of the parents to the merge result simultaneously instead of showing pairwise diff between a parent and the result one at a time. Furthermore, it lists only files which were modified from all parents. -c implies -p.
--diff-merges=dense-combined
--diff-merges=cc
--cc
With this option the output produced by --diff-merges=combined is further compressed by omitting uninteresting hunks whose contents in the parents have only two variants and the merge result picks one of them without modification. --cc implies -p.
I'm not completely convinced that these descriptions are entirely accurate for all versions of Git, but they mean what they say: --cc "densifies" a combined diff by omitting any diff hunk where the merge result in that diff hunk matches any parent commit shown in that same hunk. This is useful with merges in that it shows us where a human did some picking and choosing during a merge conflict. (Note that it discards conflict cases where the human picked one of the two parents, but that already happens anyway, even without --cc, if the human did that for the entire file!)
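The difference between the two options shows up clearly on a merge that Git completed without help. In this sketch (assuming git and a POSIX shell; the file f.txt and its contents are invented), the two branches edit lines far apart in the same file. The merged file differs from both parents, so -c should show it, but every hunk matches one parent, so --cc should print no diff at all.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > f.txt
git add f.txt
git commit -q -m H

git checkout -q -b br1
sed 's/^line 2$/line 2 (br1)/' f.txt > f.tmp && mv f.tmp f.txt
git commit -q -am J

git checkout -q -b br2 main
sed 's/^line 11$/line 11 (br2)/' f.txt > f.tmp && mv f.tmp f.txt
git commit -q -am L

git checkout -q br1
git merge -q -m M br2                   # clean: the edits are far apart

plain=$(git show -c --format= HEAD)     # combined
dense=$(git show --cc --format= HEAD)   # dense combined (git show's default)
printf '%s\n' "--- with -c ---" "$plain" "--- with --cc ---" "$dense"
```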
For a more precise definition of "diff hunk", see the question "In the context of git (and diff), what is a 'hunk'". VonC's answer there also discusses the "indent heuristic" I mentioned above.
Last, remember that any combined diff deliberately throws away some information. It does so on the assumption that the "final" snapshot is that in a merge commit. You can avoid combined diffs, even for merge commits, by asking Git to "virtually split" the merge, using git show -m or git log -p -m for instance. When Git encounters a merge commit M here, it pretends, for diff-ing purposes, that there are two or more commits, one for each parent. Commit M (from J) gets shown as git diff J M, and commit M (from L) gets shown as git diff L M. For an octopus merge (which is what Git calls any merge with three or more parents) you'll get three or more "virtual splits". (However, if you add --first-parent to the git log options here, Git does the split, then only shows the first-parent-vs-merge-commit diff. For some workflows this is actually a very useful option.)
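As a sketch of the "virtual split", assuming git and a POSIX shell, the following builds a hand-resolved merge (file and branch names are invented) and compares M against each parent separately. Unlike the combined diff, the per-parent comparison should list other.txt, which was changed on br1 only, when M is compared with its second parent.

```shell
set -e
cd "$(mktemp -d)"                       # scratch repository, safe to delete
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'color = red\n' > shown.txt
printf 'original\n' > other.txt
git add shown.txt other.txt
git commit -q -m H

git checkout -q -b br1
printf 'color = blue\n' > shown.txt
printf 'changed on br1 only\n' > other.txt
git commit -q -am J

git checkout -q -b br2 main
printf 'color = green\n' > shown.txt
git commit -q -am L

git checkout -q br1
git merge br2 >/dev/null 2>&1 || true    # expected to stop with a conflict
printf 'color = teal\n' > shown.txt      # hand-made resolution
git add shown.txt
git commit -q -m M

# One entry per parent, each with its own list of changed files:
git show -m --name-only HEAD

# The same two comparisons, spelled out as ordinary two-commit diffs:
vs_first=$(git diff --name-only HEAD^1 HEAD)
vs_second=$(git diff --name-only HEAD^2 HEAD)

# Only the first-parent-vs-merge diff:
git log -1 -p -m --first-parent --format= HEAD
```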