git finding duplicate commits (by patch-id)

Question

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

This seems to be an intended use of patch-id:

git patch-id --help

IOW, you can use this thing to look for likely duplicate commits.

I imagine that stringing together "git log", "git patch-id" and uniq could do the job badly, but if someone has a command that does the job well, I'd appreciate it.

This is a fascinating feature. Out of curiosity, how far back in the past are you intending to look? I could see some creative integration uses for this (i.e. "my contributor doesn't know how to rebase"), but over long history it would be less effective...? — Christopher, Jul 23 '12 at 02:23
The issue appeared in a week long history of a single branch, so my use case was quite gentle (git log -p was enough). The patch-id comment got me curious though... Searching all history could be painful. — bsb, Jul 25 '12 at 00:43
`git patch-id` should now *properly* report all differences (attributes or binary) with Git 2.39 (Q4 2022). See my [updated answer below](https://stackoverflow.com/a/63674369/6309). — VonC, Oct 31 '22 at 16:11

score 12 · Answer 1 · answered Jul 23 '12 at 07:31

Because the duplicate changes are likely not to be on the same branch (except when there are reverts in between them), you could use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.
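As a rough illustration of what that looks like, the sketch below builds a throwaway repository and runs git cherry against it. The branch names (topic, upstream), file names and commit messages are invented for the example; only the git commands themselves come from Git.

```shell
#!/bin/sh
# Sketch: show how `git cherry` flags a change that exists on both branches.
# Repository layout, branch names and file names are made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > file.txt
git add file.txt
git commit -q -m "base"
base=$(git rev-parse HEAD)

# "topic" carries two changes.
git checkout -q -b topic
echo one > one.txt; git add one.txt; git commit -q -m "add one"
dup=$(git rev-parse HEAD)
echo two > two.txt; git add two.txt; git commit -q -m "add two"
uniq_commit=$(git rev-parse HEAD)

# "upstream" starts from the base, gets an unrelated commit, then
# cherry-picks the first change: a different commit with the same patch.
git checkout -q -b upstream "$base"
echo other > other.txt; git add other.txt; git commit -q -m "unrelated"
git cherry-pick "$dup" >/dev/null

# "-" marks commits on topic that have an equivalent change in upstream,
# "+" marks commits that upstream lacks.
result=$(git cherry upstream topic)
echo "$result"
```

Lines starting with "-" are the likely duplicates; adding -v also prints each commit's subject.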

Slipp D. Thompson · Answer 2 · 2016-12-06T22:07:33.107

For looking for duplicates of a specific commit, this may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id. Next, list the patch ids for every commit and filter out any that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00
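To see why this works, here is a small sketch on a throwaway repository: a cherry-picked commit ends up with a new commit SHA but the same patch-id. The branch names (feature, other) and file contents are assumptions made up for the example.

```shell
#!/bin/sh
# Sketch: a cherry-picked commit gets a new commit SHA but keeps its patch-id.
# Branch names and file contents are invented for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > file.txt; git add file.txt; git commit -q -m "base"
base=$(git rev-parse HEAD)

git checkout -q -b feature
echo change >> file.txt; git commit -q -am "change on feature"
original=$(git rev-parse HEAD)

# Same change, replayed on a branch with different history.
git checkout -q -b other "$base"
echo note > note.txt; git add note.txt; git commit -q -m "unrelated"
git cherry-pick "$original" >/dev/null
copy=$(git rev-parse HEAD)

# Column 1 of `git patch-id` output is the patch-id, column 2 the commit SHA.
id_original=$(git show "$original" | git patch-id | cut -d' ' -f1)
id_copy=$(git show "$copy" | git patch-id | cut -d' ' -f1)

echo "commits:   $original $copy"
echo "patch-ids: $id_original $id_copy"
```

The two commit SHAs differ while the two patch-ids match, which is what the grep above relies on.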

All together, with a few extra flags, and in the form of a utility script:

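#!/bin/sh
# Save as git-list-dupe-commits somewhere on your PATH and make it executable.
# Takes a commit ref or SHA as its argument; defaults to HEAD.
TARGET_COMMIT_SHA="${1:-HEAD}"

# Patch-id (first column) of the target commit
TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
# Commit SHAs (second column) of every commit with the same patch-id
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    grep -F "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"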

Usage:

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

I may be the only one confused, but just in case, "target-commit" is not a literal; replace it with the SHA of the commit you want to get a patch ID for. — Jimothy, Aug 20 '14 at 14:10
@Jimothy Yep, or a branch name or a tag name (any ref, I guess). I'll see if I can make it a bit clearer. — Slipp D. Thompson, Aug 20 '14 at 18:43

score 4 · Answer 3 · answered Jul 25 '12 at 04:30

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
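A minimal sketch of that situation, on a throwaway repository: the same line is added twice, but the second commit carries an extra change, so the patch-ids differ while "git log -S" still finds both commits. The file names and the searched string (retry_limit = 3) are invented for the example.

```shell
#!/bin/sh
# Sketch: `git log -S` finds commits that add or remove a string, even when
# the surrounding patch differs and the patch-ids therefore do not match.
# File names and the searched string are made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'alpha\n' > notes.txt
git add notes.txt; git commit -q -m "base"

printf 'alpha\nretry_limit = 3\n' > notes.txt
git commit -q -am "add retry_limit"
first=$(git rev-parse HEAD)

# The later commit adds the same line elsewhere, plus an extra change.
printf 'retry_limit = 3\nextra = 1\n' > other.txt
git add other.txt; git commit -q -m "add retry_limit again, plus extra"
second=$(git rev-parse HEAD)

id_first=$(git show "$first" | git patch-id | cut -d' ' -f1)
id_second=$(git show "$second" | git patch-id | cut -d' ' -f1)
echo "patch-ids: $id_first $id_second"

# -S lists commits where the number of occurrences of the string changes.
hits=$(git log -S'retry_limit = 3' --format=%H)
echo "$hits"
```

Here the patch-ids differ, yet both commits show up in the -S search.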

bsb


If you want to look just at raw diffs between a commit and its parent, you could do something like `git diff $c~1 $c | git patch-id`. It's going to misbehave on merge commits. Following both merge parents is a more complex problem. – Christopher Jul 25 '12 at 09:57
It looks like patch-id finds the same diff? `git diff HEAD~1 HEAD | git patch-id` gives `3318362fa07e580.. 000000000000..` and `git show HEAD | git patch-id` gives `3318362fa07e580.. c397c4cdc426..` – bsb Jul 26 '12 at 23:02
@bsb Are you sure you wanted to write `git show $c | git patch-id`? `git show` prints metadata, but `git patch-id` needs a patch as input... – Daniel Alder Apr 23 '14 at 16:02
@daniel-alder, I think I need the `show` rather than `diff` since that allows the Perl to print the duplicate commits (otherwise I just get a whole lot of zero shas). [The code](http://tinyurl.com/git-patch-id) skips non-diff input (although perhaps older versions don't support this, what version are you using?) – bsb Apr 28 '14 at 06:01
@bsb thx for this explanation. I checked again and saw the diff. `git show` and `git patch-id` seem to cooperate nicely, but only for normal commits. for merges it doesn't seem to show any diff, that was my problem. tested with 1.7.10.4 and 1.9.1 – Daniel Alder Apr 28 '14 at 19:52

unagi · Answer 4 · 2017-08-31T00:11:24.577

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-80 \
    | xargs -r git show -s --oneline

For searching the duplicate of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-80 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch specify its name instead of the option --all; or to search in the last 100 commits pass the arguments HEAD ^HEAD~100.

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).
The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).
The last line prints custom information about the duplicate commits.
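The -w and -D options of uniq may not be available outside GNU coreutils. As a hedged alternative, the sketch below does the grouping step with awk instead, on a throwaway repository whose branch names and contents are invented for the example. It is not the command from this answer, just the same idea with a different grouping tool.

```shell
#!/bin/sh
# Sketch: group commits by patch-id with awk instead of `uniq -w40 -D`.
# The toy repository (branches, files) is made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > file.txt; git add file.txt; git commit -q -m "base"
base=$(git rev-parse HEAD)

git checkout -q -b feature
echo change >> file.txt; git commit -q -am "change on feature"
original=$(git rev-parse HEAD)

git checkout -q -b other "$base"
echo note > note.txt; git add note.txt; git commit -q -m "unrelated"
git cherry-pick "$original" >/dev/null
copy=$(git rev-parse HEAD)

# Column 1 is the patch-id, column 2 the commit. Collect the commits per
# patch-id and print only the patch-ids seen more than once.
groups=$(git rev-list --no-merges --all | xargs git show | git patch-id |
    awk '{ n[$1]++; c[$1] = c[$1] " " $2 }
         END { for (p in n) if (n[p] > 1) print p ":" c[p] }')
echo "$groups"
```

Each output line holds one patch-id followed by all commits that share it, so a group of three duplicates appears on a single line.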

score 2 · Answer 5 · answered Jul 28 '17 at 12:26

The nifty command suggested by bsb requires a couple of small tweaks:

(1) Instead of git show, which runs git diff-tree --cc, the command should use

    git diff-tree -p

Otherwise git patch-id generates spurious null SHA1 hashes.

(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.

Here's an alias to go in ~/.gitconfig:

dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits

score 0 · Answer 6 · answered Jun 17 '19 at 16:07

For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

git rev-list --no-merges --all  | %{&git.exe show $_} | 
  git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
  Group-Object PatchId | Where-Object count -gt 1 | 
  %{$_.group.Commit + " "}

Gives an output like:

1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428

b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2

With the duplicate commit hashes grouped together.

CAUTION: On my repo this was a slow command so be sure to filter the call to rev-list appropriately!

VonC · Answer 7 · 2022-10-31T16:10:53.607

Make sure to use a recent version of Git (2.39 or more)

The git log --format=%H mentioned by the OP bsb's answer is not always unique.

That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.

See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)

patch-id: ignore newline at end of file in diff_flush_patch_id()

Reported-by: Tilman Vogel
Initial-test-by: Tilman Vogel
Signed-off-by: René Scharfe

Whitespace is ignored when calculating patch IDs.
This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line.

This goes against our goal of making patch IDs independent of whitespace.

Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.

A "patch ID" is nothing but a SHA-1 of the diff associated with a patch, with whitespace and line numbers ignored

Actually, git patch-id will evolve with Git 2.39 (Q4 2022).

A new "--include-whitespace" option is added to "git patch-id", and existing bugs in the internal patch-id logic that did not match what "git patch-id" produces have been corrected with Git 2.39 (Q4 2022).

See commit 0d32ae8, commit 2871f4d, commit 93105ab, commit 0df19eb, commit 51276c1, commit 0570be7 (24 Oct 2022) by Jerry Zhang (jerry-skydio).
(Merged by Taylor Blau -- ttaylorr -- in commit 160314e, 30 Oct 2022)

builtin: patch-id: add --verbatim as a command mode

Signed-off-by: Jerry Zhang
Signed-off-by: Junio C Hamano

There are situations where the user might not want the default setting where patch-id strips all whitespace.
They might be working in a language where white space is syntactically important, or they might have CI testing that enforces strict whitespace linting.
In these cases, a whitespace change would result in the patch fundamentally changing, and thus deserving of a different id.

Add a new mode that is exclusive of --stable and --unstable called --verbatim.
It also corresponds to the config patchid.verbatim = true.
In this mode, the stable algorithm is used and whitespace is not stripped from the patch text.

Users of --unstable mainly care about compatibility with old versions, which unstripping the whitespace would break.
Thus there isn't a use case for the combination of --verbatim and --unstable, and we don't expose this so as to not add maintenance burden.

fixes https://github.com/Skydio/revup/issues/2

git patch-id now includes in its man page:

--verbatim

Calculate the patch-id of the input as it is given, do not strip any whitespace.

This is the default if patchid.verbatim is true.
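A small sketch of the difference, on a throwaway repository: two commits that differ only in whitespace get the same patch-id by default. The --verbatim call is guarded because the option only exists in Git 2.39 or later; the branch names (tight, loose) and the file are invented for the example.

```shell
#!/bin/sh
# Sketch: whitespace-only differences do not change the default patch-id.
# Branch names and file contents are made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "base"
base=$(git rev-parse HEAD)

git checkout -q -b tight
printf 'x = 1\n' > conf.txt
git add conf.txt; git commit -q -m "add conf"
tight=$(git rev-parse HEAD)

git checkout -q -b loose "$base"
printf 'x   =   1\n' > conf.txt
git add conf.txt; git commit -q -m "add conf, spaced out"
loose=$(git rev-parse HEAD)

id_tight=$(git show "$tight" | git patch-id | cut -d' ' -f1)
id_loose=$(git show "$loose" | git patch-id | cut -d' ' -f1)
echo "default:  $id_tight $id_loose"

# --verbatim needs Git 2.39 or later; older versions reject the option.
if v_tight=$(git show "$tight" | git patch-id --verbatim 2>/dev/null) &&
   [ -n "$v_tight" ]; then
    v_loose=$(git show "$loose" | git patch-id --verbatim)
    echo "verbatim: ${v_tight%% *} ${v_loose%% *}"
else
    echo "verbatim: not supported by this git"
fi
```

With the default mode the two ids match; per the man page text above, --verbatim keeps the whitespace, so the ids on the "verbatim" line would be expected to differ.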

But that is not all.
From the OP:

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

That is also fixed with Git 2.39:

patch-id: fix patch-id for mode changes

Signed-off-by: Jerry Zhang

Currently patch-id as used in rebase and cherry-pick does not account for file modes if the file is modified.
One consequence of this is that if you have a local patch that changes modes, but upstream has applied an outdated version of the patch that doesn't include that mode change, "git rebase" will drop your local version of the patch along with your mode changes.
It also means that internal patch-id doesn't produce the same output as the builtin, which does account for mode changes due to them being part of diff output.

Fix by adding mode to the patch-id if it has changed, in the same format that would be produced by diff, so that it is compatible with builtin patch-id.
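The builtin side of that statement can be checked with a sketch like the one below, which compares "git show | git patch-id" for the same content change with and without a mode change. It only exercises the git patch-id command, not the internal patch-id used by rebase or cherry-pick; branch and file names are invented for the example.

```shell
#!/bin/sh
# Sketch: the `git patch-id` command sees mode changes because they are part
# of the diff text it reads. Branch and file names are made up.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf '#!/bin/sh\necho hi\n' > run.sh
git add run.sh; git commit -q -m "base"
base=$(git rev-parse HEAD)

# Content change only.
git checkout -q -b content-only
printf '#!/bin/sh\necho hello\n' > run.sh
git commit -q -am "edit"
plain=$(git rev-parse HEAD)

# Same content change, plus the executable bit (set in the index so the
# example does not depend on the filesystem honouring chmod).
git checkout -q -b with-mode "$base"
printf '#!/bin/sh\necho hello\n' > run.sh
git add run.sh
git update-index --chmod=+x run.sh
git commit -q -m "edit and make executable"
moded=$(git rev-parse HEAD)

id_plain=$(git show "$plain" | git patch-id | cut -d' ' -f1)
id_moded=$(git show "$moded" | git patch-id | cut -d' ' -f1)
echo "content only:   $id_plain"
echo "content + mode: $id_moded"
```

The two ids differ, because the second diff contains "old mode" and "new mode" header lines that the first one lacks.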

And last difference which was not properly detected/reported:

builtin: patch-id: fix patch-id with binary diffs

Signed-off-by: Jerry Zhang

"git patch-id" currently does not produce correct output if the incoming diff has any binary files.
Add logic to get_one_patchid to handle the different possible styles of binary diff.
This attempts to keep resulting patch-ids identical to what would be produced by the counterpart logic in diff.c, that is it produces the id by hashing the a and b oids in succession.

In general we handle binary diffs by first caching the object ids from the "index" line and using those if we then find an indication that the diff is binary.

The input could contain patches generated with "git diff --binary".
This currently breaks the parse logic and results in multiple patch-ids output for a single commit.
Here we have to skip the contents of the patch itself since those do not go into the patch id.
--binary implies --full-index so the object ids are always available.

When the diff is generated with --full-index there is no patch content to skip over.

When a diff is generated without --full-index or --binary, it will contain abbreviated object ids.
This will still result in a sufficiently unique patch-id when hashed, but does not match internal patch id output.
We'll call this OK for now as we already need specialized arguments to diff in order to match internal patch id (namely -U3).
