27

I have a huge git repo that I eventually want to clean up with BFG.
But first, I want to track down and remove the files in HEAD that git treats as binary.

So what I'm looking for is a command to find all files in HEAD that git treats as binary.

These didn't help:

Thank you in advance for your help.

fabien

5 Answers

30
diff <(git grep -Ic '') <(git grep -c '') | grep '^>' | cut -d : -f 1 | cut -d ' ' -f 2-

Breaking it down:

  • git grep -c '' prints the name and line count of each tracked file in the repository. Adding the -I option makes the command skip binary files.
  • diff <(cmd1) <(cmd2) uses process substitution to give diff two named pipes carrying the output of cmd1 and cmd2.
  • The grep and cut commands are used to extract the filenames from the output of diff.
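To make the extraction step concrete, here is a small sketch that runs the same diff, grep and cut pipeline on two made-up lists instead of a real repository. The file names and counts are hypothetical stand-ins for the output of `git grep -Ic ''` and `git grep -c ''`, and temporary files replace process substitution so that it also runs in a plain POSIX shell.

```shell
# Worked example of the diff | grep | cut pipeline on hypothetical data.
demo_diff_pipeline() {
    dir=$(mktemp -d) || return 1
    # Stand-in for `git grep -Ic ''` (text files only).
    printf '%s\n' 'README.md:3' 'src/main.c:10' > "$dir/text_only"
    # Stand-in for `git grep -c ''` (all files, binary ones included).
    printf '%s\n' 'README.md:3' 'logo.png:2' 'src/main.c:10' > "$dir/all_files"
    # diff prefixes lines found only in the second file with "> ".
    # cut -d : -f 1 drops the ":count" suffix, cut -d ' ' -f 2- drops the "> " marker.
    diff "$dir/text_only" "$dir/all_files" | grep '^>' | cut -d : -f 1 | cut -d ' ' -f 2-
    rm -rf "$dir"
}

demo_diff_pipeline   # prints: logo.png
```

Because the first cut splits on the colon, a file name that itself contains a colon would be truncated by this pipeline.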
jangler
  • To deal with whitespace in file names correctly, you can change the `c` flag to `l` like so: `diff <(git grep -Il '' "$commit") <(git grep -l '' "$commit") | grep '^>' | cut -d':' -f2-` – sinelaw May 11 '16 at 12:08
  • How would you pipe this command to get the file size of each file? – Chris F Sep 12 '19 at 15:03
  • I got it, just use `| xargs du -ch {} +` – Chris F Sep 12 '19 at 15:51
  • How could we change this to a history version of the command ? That is to say get all the binary files that once existed in the repo history? – plalanne Aug 20 '21 at 10:03
16

A simplified solution based on the answer of @jangler (https://stackoverflow.com/a/30690662/808101)

comm -13 <(git grep -Il '' | sort -u) <(git grep -al '' | sort -u)

Explanation:

  1. git grep

    • -l prints only the names of files that match the pattern '' (which matches every line of every file)
    • -I makes the command skip binary files
    • -a forces binary files to be processed as if they were text
  2. sort -u sorts the result of each grep, since comm only works on sorted input

  3. comm -13 lists the entries that are unique to the second list (the git grep list of all files, including the binary ones)
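As a rough illustration of step 3, the sketch below feeds comm two small, already sorted lists. The file names are hypothetical stand-ins for the two `git grep -l` outputs, and `LC_ALL=C` is an added assumption so that comm's idea of sort order matches the byte order used for the sample lists.

```shell
# Worked example of comm -13 on hypothetical, pre-sorted lists.
demo_comm() {
    dir=$(mktemp -d) || return 1
    # Stand-in for `git grep -Il '' | sort -u` (text files only).
    printf '%s\n' 'README.md' 'src/main.c' > "$dir/text_only"
    # Stand-in for `git grep -al '' | sort -u` (all files).
    printf '%s\n' 'README.md' 'logo.png' 'src/main.c' > "$dir/all_files"
    # -1 hides lines unique to the first file and -3 hides lines common to both,
    # so only lines unique to the second file remain.
    LC_ALL=C comm -13 "$dir/text_only" "$dir/all_files"
    rm -rf "$dir"
}

demo_comm   # prints: logo.png
```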

benjarobin
9

Here is the same script for Windows using PowerShell:

$textFiles = git grep -Il .
$allFiles = git ls-files

foreach ($line in $allFiles){
    if ($textFiles -notcontains $line) {
        $line;
    }
}

Or in the short form:

$textFiles = git grep -Il .
git ls-files | where { $textFiles -notcontains $_ }

That takes O(n^2) time to complete; here is a faster approach using a hashtable:

$files = @{}
git ls-files | foreach { $files[$_] = 1 }
git grep -Il . | foreach { $files[$_] = 0 }
$files.GetEnumerator() | where Value -EQ 1 | sort Name | select -ExpandProperty Name

That takes O(n) time for the lookups (plus the final sort of the result).

tsul
7
grep -Fvxf <(git grep -Il '') <(git grep -al '')

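Explanation: -f reads the text-only list produced by git grep -Il as patterns, -F treats each pattern as a fixed string, -x requires a whole-line match, and -v inverts the match, so only names from the full list (git grep -al) that are missing from the text-only list are printed.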

To also consider files added with git add but not yet committed:

grep -Fvxf <(git grep --cached -Il '') <(git grep --cached -al '')

Or you could loop over the output of git ls-files and apply one of the checks from "How to determine if Git handles a file as binary or as text?"

Tested on Git 2.16.1 with this test repo.
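One possible per-file check, sketched here as an assumption rather than as the method from the linked question, relies on `git diff --numstat`, which prints `-` for both the added and deleted counts of a binary file. Diffing the index against the empty tree reports every tracked file in a single pass, so no explicit loop is needed. The function name `list_binary_numstat` is made up for this example, and paths with unusual characters may come out quoted.

```shell
# Sketch: list tracked paths that git diff reports as binary.
# Run inside a work tree: list_binary_numstat
list_binary_numstat() {
    # ID of the empty tree, computed rather than hard-coded.
    empty_tree=$(git hash-object -t tree /dev/null) || return 1
    # --cached compares the index with the empty tree, so every tracked file
    # appears as an addition; binary files get "-" for both counts.
    git diff --cached --numstat "$empty_tree" |
        awk -F '\t' '$1 == "-" && $2 == "-" { print $3 }'
}
```

Since this goes through the diff machinery instead of git grep, the two approaches should usually agree, but that is not guaranteed for every attribute setup.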

Ciro Santilli OurBigBook.com
0

Fast, easy, not sure how accurate:

git ls-files --eol

Anything tagged i/-text is likely treated as a binary file.
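A minimal sketch of filtering that output down to file names, assuming the documented layout in which the `i/`, `w/` and `attr/` fields are followed by a TAB and then the path. The function name `list_binary_eol` is hypothetical.

```shell
# Sketch: print tracked paths whose index content is tagged i/-text.
# Run inside a work tree: list_binary_eol
list_binary_eol() {
    # Each line looks like: i/<eolinfo> w/<eolinfo> attr/<attrs> <TAB> <path>
    git ls-files --eol | awk -F '\t' '$1 ~ /^i\/-text[ ]/ { print $2 }'
}
```

The end-of-line detection uses its own heuristic, so the result may differ from what `git grep -I` skips.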

Dmitri R117