20

This is sort of a follow-up to this question.

If there are multiple blobs with the same contents, they are only stored once in the git repository because their SHA-1's will be identical. How would one go about finding all duplicate files for a given tree?

Would you have to walk the tree and look for duplicate hashes, or does git provide backlinks from each blob to all files in a tree that reference it?

Community
  • 1
  • 1
readonly
  • 343,444
  • 107
  • 203
  • 205

6 Answers6

27
[alias]
    # find duplicate files from root
    alldupes = !"git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"

    # find duplicate files from the current folder (can also be root)
    dupes = !"cd `pwd`/$GIT_PREFIX && git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"
madeso
  • 197
  • 3
  • 8
bsb
  • 1,847
  • 26
  • 24
  • 2
    The above alias don't work on macOS because macOS/BSD version of `uniq` uses slightly different parameters, specifically, `-D` => `-d` and `-w` => `-s`. See manpage here https://www.unix.com/man-page/bsd/1/UNIQ/ – Devy Apr 17 '20 at 23:07
9

Running this on the codebase I work on was an eye-opener I can tell you!

#!/usr/bin/perl

# usage: git ls-tree -r HEAD | $PROGRAM_NAME

use strict;
use warnings;

my $sha1_path = {};

while (my $line = <STDIN>) {
    chomp $line;

    if ($line =~ m{ \A \d+ \s+ \w+ \s+ (\w+) \s+ (\S+) \z }xms) {
        my $sha1 = $1;
        my $path = $2;

        push @{$sha1_path->{$sha1}}, $path;
    }
}

foreach my $sha1 (keys %$sha1_path) {
    if (scalar @{$sha1_path->{$sha1}} > 1) {
        foreach my $path (@{$sha1_path->{$sha1}}) {
            print "$sha1  $path\n";
        }

        print '-' x 40, "\n";
    }
}
lmop
  • 941
  • 6
  • 5
7

Just made a one-liner that highlights the duplicates rendered by git ls-tree.
Might be useful

git ls-tree -r HEAD |
    sort -t ' ' -k 3 |
    perl -ne '$1 && / $1\t/ && print "\e[0;31m" ; / ([0-9a-f]{40})\t/; print "$_\e[0m"'
Romuald Brunet
  • 5,595
  • 4
  • 38
  • 34
4

The scripting answers from your linked question pretty much apply here too.

Try the following git command from the root of your git repository.

git ls-tree -r HEAD

This generates a recursive list of all 'blobs' in the current HEAD, including their path and their sha1 id.

git doesn't maintain back links from a blob to tree so it would be a scripting task (perl, python?) to parse a git ls-tree -r output and create a summary report of all sha1s that appear more than once in the list.

CB Bailey
  • 755,051
  • 104
  • 632
  • 656
0

More general:

( for f in `find .`; do test -f $f && echo $(wc -c <$f) $(md5 -q $f)   ; done ) |sort |uniq -c |grep -vE '^\s*1\b' |sed 's/.* //' > ~/dup.md5 ; \
( for f in `find .`; do test -f $f && echo $(wc -c <$f) $(md5 -q $f) $f; done ) |fgrep -f ~/dup.md5 |sort
  • This doesn't seem to answer the question, in that it doesn't search in the Git history at all. Also be aware of `find -type f` and `du`; your current version is very inefficient (goes over files multiple times). Not downvoted because this could be useful I guess. – remram Oct 26 '17 at 22:57
0

For Windows / PowerShell users:

git ls-tree -r HEAD | group { $_ -replace '.{12}(.{40}).*', '$1' } | ? { $_.Count -gt 1 } | select -expand Group

This outputs something like:

100644 blob 8a49bcbae578c405ba2596c06f46fabbbc331c64    filename1
100644 blob 8a49bcbae578c405ba2596c06f46fabbbc331c64    filename2
100644 blob c1720b20bb3ad5761c1afb6a3113fbc2ba94994e    filename3
100644 blob c1720b20bb3ad5761c1afb6a3113fbc2ba94994e    filename4
bart
  • 1,100
  • 11
  • 14