Git find all binary files in history

Question

Sorry if this is a duplicate of a previous question, but I couldn't find quite what I'm looking for. I'm in the process of converting a large CVS codeset (20+ repositories with 15 years of history, 10-15 GB in size) to git. Much of the size is due to binaries that were committed along with the code in the past. While some of the binaries can be removed completely, it's desirable to keep many of them along with their history. However, we don't want the repo to bloat.

We are currently planning on using git-fat to store the binaries, and I'm in the process of writing a script to automatically convert the files. My first step is just to identify all the files in the repo (including deleted files) that are binaries. Are there any simple approaches to accomplishing this? Thanks for your help.

Edit

I actually think I found a reasonable approach where I just run

git log --numstat <first commit hash> HEAD

This prints out a list of all the files with two columns in front. The first contains the number of changes to the file (I'm not sure if it's in bytes or lines). The important part is that for binary files it is '-'. By selecting lines with this marker and "uniqueing" them, I believe I get the complete list of binary files.

Are there any flaws with this strategy?

We are in a similar position, and we've decided that the history of the project is to be kept in subversion and all new work is to be imported as a new project in git with no history. If anyone wants to view the history of a file they can do so with their existing tools, but if they want to work on the code then they'll have to get the new stuff out of git. We think that the history is valuable, but as it's already available 'somewhere' we don't need to worry about porting it and making it available through the new system (which will have its own history soon enough). — Software Engineer, Jan 13 '15 at 21:14
Yes, I think we will likely keep the CVS repository in read-only mode. I was hoping to do a complete port, but that may not be feasible. Good point though. — NotsoDarkMatters, Jan 13 '15 at 23:21

rubicks · Answer 1 · 2023-08-18T16:16:00.743

TL;DR

git log --all --numstat \
    | grep '^-' \
    | cut -f3 \
    | sed -E 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
    | sort -u

Explanation:

The git-log option --numstat

shows number of added and deleted lines in decimal notation and pathname without abbreviation, to make it more machine friendly. For binary files, outputs two - instead of saying 0 0.

Source: https://git-scm.com/docs/git-log, emphasis mine

This produces output entries like the following:

commit 0123456789012345678901234567890123456789
Author: Joe Example <jexample@domain.com>
Date:   Thu Mar 9 15:33:29 2017 +0000

    edit Dockerfile, add assets/foobar.jpg

1   1   Dockerfile
-   -   assets/foobar.jpg

The grep '^-' matches lines with a leading hyphen, the cut -f3 prints the third tab-delimited field, and the

sed -E 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g'

detects files that have been moved/renamed and prints both the source and destination; e.g., it would change this:

path/to/{foo => bar}/my-document.pdf

to this:

path/to/foo/my-document.pdf
path/to/bar/my-document.pdf

Finally, the sort -u will accumulate, sort, and uniquify the list of paths.

EDIT: This answer assumes the existence of a sed that supports extended regular expressions and capture groups; e.g., https://www.gnu.org/software/sed/ .
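
If a GNU-style sed is not available, one option is to sidestep the rename notation entirely instead of expanding it. The following is only a sketch under a few assumptions: that your git accepts --no-renames (so a rename shows up as a delete of the old path plus an add of the new one) and an empty --format= (to drop the commit headers). The function names are made up for illustration.

```shell
# Hypothetical helper: reads `git log --numstat` output on stdin and prints
# the unique paths whose added/deleted columns are both "-" (binary files).
binary_paths_from_numstat() {
    awk -F '\t' '$1 == "-" && $2 == "-" { print $3 }' | sort -u
}

# Hypothetical wrapper: walks every ref, with rename detection turned off so
# that no "{old => new}" notation appears in the path column.
list_binary_paths() {
    git log --all --numstat --no-renames --format= | binary_paths_from_numstat
}
```

Run list_binary_paths from inside the repository. Paths containing unusual characters are still quoted by git in this output, so treat the result as a starting point rather than a guaranteed-complete list.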

@DhruvaN, unless you've taken great pains to make `grep`, `sed`, `awk`, and `cut` accessible to that shell, I seriously doubt it. — rubicks, Oct 20 '20 at 15:48
Your explanation paragraph for `gsed -r ...` uses `sed -r` (not GNU Sed). I think that's incorrect? — Guildenstern, Aug 07 '23 at 08:13
@Guildenstern I have updated to be as `sed` agnostic as possible. — rubicks, Aug 18 '23 at 16:19

score 2 · Answer 2 · answered Jan 20 '15 at 17:27

One of the contributors to git-fat here.

If you're primarily concerned about the size of the file, and not specifically the type, then git-fat has a find command which allows you to find all the files in the git repository over a given size.

I currently contribute to cyaninc's fork, but both versions (Jed's and Cyan's) have the find command.

Also check out the retroactive import section of the READMEs; both versions support that as well.
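
For comparison, here is a rough plain-git sketch of the same "everything over a given size" idea. This is not git-fat's find command and makes no claim about how git-fat implements it; it relies on git rev-list --objects and git cat-file --batch-check, and the function names and the byte threshold argument are invented for this example.

```shell
# Hypothetical helper: reads "<type> <size> <path>" lines on stdin and prints
# "<size><TAB><path>" for blobs strictly larger than the given byte count,
# largest first.
blobs_over_size() {
    min_bytes=$1
    awk -v min="$min_bytes" '$1 == "blob" && $2 + 0 > min + 0 {
        size = $2
        sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the full path
        print size "\t" $0
    }' | sort -u | sort -rn
}

# Hypothetical wrapper: feeds every object reachable from any ref through
# cat-file to obtain its type and size.
list_large_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        blobs_over_size "$1"
}
```

For example, list_large_blobs 1000000 would list blobs larger than roughly 1 MB, including ones that only exist in old revisions.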

score 1 · Answer 3 · answered Jan 13 '15 at 22:14

One solution would be to iterate through all revisions, list the files in each revision, read the content of each file, and then determine the type of each file.

Here is how you can get a list of all revisions:

$ git rev-list HEAD
32a9b9158d73dc80b355993a5a5f8fc49ae25334
9946574838bf5f984f5f4a19b2fc524f0a60378c
3f82a5dcecde0028da21fb266c1bbd7e9ec762ec
...

Here is how you can get a list of all files in a revision:

$ git ls-tree -r 32a9b9158d73dc80b355993a5a5f8fc49ae25334
100644 blob dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8    .gitignore
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    README

You can get the content of each file by passing the blob hash of that file to git show:

$ git show dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8
.gitignore

*.pyc
rm_pyc.sh
aima/**/*.pyc
.idea

To test if a file is binary or not you can use /bin/file:

$ git show dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8 > file
$ /bin/file file
file: ASCII text
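
The steps above can be strung together roughly as follows. This is a sketch, not a tuned implementation. It swaps /bin/file for a simpler check (a NUL byte within the first 8000 bytes, which is close to the heuristic git itself uses for diffs), and it enumerates blobs with git rev-list --objects instead of calling git ls-tree per revision. The function names are hypothetical.

```shell
# Hypothetical helper: succeeds if stdin contains a NUL byte within its
# first 8000 bytes. This stands in for the /bin/file check.
is_binary_stream() {
    nul_count=$(head -c 8000 | LC_ALL=C tr -cd '\000' | wc -c)
    [ "$((nul_count))" -gt 0 ]
}

# Hypothetical driver: visits every blob reachable from any ref and prints
# the paths whose content looks binary. It spawns one process per blob, so
# expect it to be slow on a large history.
list_binary_blob_paths() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(rest)' |
        while read -r type hash path; do
            [ "$type" = blob ] || continue
            if git cat-file blob "$hash" </dev/null | is_binary_stream; then
                printf '%s\n' "$path"
            fi
        done | sort -u
}
```

Because this inspects the content itself, it may classify some files differently from git log --numstat, which also honors .gitattributes.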
