
I'd like to prune large files from my git repository. However, I'd like to be specific about it, so I want to see the sizes of all files anywhere in the repository's history.

I've created the following bash script, but it seems quite inefficient and may be missing files that have been deleted somewhere in history:

git log --pretty=tformat:%H | while read hash; do
   git show --stat --name-only $hash | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do
      if [ ! -z "$filename" ]; then
          git show "$hash:$filename" | wc -c | while read filesize; do
             if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
                printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
             fi
          done
      fi
   done
done

Any suggestions on a better way to go about it?

GaTechThomas

  • I just found Git Extensions, which has a "Find large files" plugin that does this via a GUI. So far it seems pretty fast. http://gitextensions.github.io/ – GaTechThomas Jan 13 '17 at 22:40
  • For others, be sure to check out the comments in the accepted answer. A nice gist had been produced for this effort. – GaTechThomas Mar 23 '19 at 12:59

3 Answers


The git ls-files command will give you a list of all the files. If you pass the --debug option, it will output additional data in the format:

path/filename.ext
  ctime: ${timestamp}:0
  mtime: ${timestamp}:0
  dev: 16777220 ino: 62244153
  uid: 1912685926   gid: 80
  size: ${bytes}    flags: 0

You can then parse the results for the size value and compare it to whatever maximum you are setting.
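
For example, here is a minimal sketch (untested) that pairs each path with the size: line that follows it and prints anything over 100000 bytes. Note, as the comment below points out, that this inspects only the index, not every commit in history:

git ls-files --debug |
awk '
    /^[^ ]/     { path = $0 }              # unindented lines are path names
    /^  size: / { if ($2 + 0 > 100000)     # "size: N    flags: 0" lines
                      printf "%11s %s\n", $2, path }
'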

Derek

  • `git ls-files` reads the index, so only includes files stored in the index, not all previous commits. – torek Jan 13 '17 at 00:34

You are most of the way there, really.

git log --pretty=tformat:%H

This should just be git rev-list <start-points>, e.g., git rev-list HEAD or git rev-list --all. You may want to add --topo-order --reverse for reasons we'll reach in a moment.

 | while read hash; do
   git show --stat --name-only $hash

Instead of git show --stat, you probably just want to use git ls-tree on the hash. Using a recursive git ls-tree you will find every tree and blob within the given commit, along with its corresponding path name.

The trees are probably not interesting, so we might drop down to blobs. Note, by the way, that git ls-tree will encode some problematic file names unless you use -z (but this makes it harder to read the items; bash can do it, plain sh can't).
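
For reference, each line of git ls-tree -r output carries four fields: a mode, an object type, an object hash, and then (after a tab) the path, which is what lets the read in the next snippet split them apart. It looks roughly like this (the hashes here are just illustrative):

100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    README
100644 blob 8f94139338f9404f26296befa88755fc2598c289    src/bigfile.bin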

 | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do

Using git ls-tree we can replace this with:

git ls-tree -r $hash | while read mode type objhash path; do

and then we'll skip anything whose type is not blob:

[ $type = blob ] || continue

  if [ ! -z "$filename" ]; then

We won't need this at all.

      git show "$hash:$filename" | wc -c | while read filesize; do
         if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
            printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
         fi

It's not clear to me why you have a while read filesize loop, nor the complex tests. In any case the easy way to get the size of the blob object is with git cat-file -s $objhash, and it's easy to test [ $blobsize -gt 100000 ] for instance:

    blobsize=$(git cat-file -s $objhash)
    if [ $blobsize -gt 100000 ]; then
       echo "$hash contains $filename size $blobsize"
    fi

However, by giving up git show in favor of git ls-tree -r, we see every copy of each file in every commit, rather than just seeing it once, in the first commit in which it appears. For instance, if commit f00f1e adds big file bigfile and it persists in commit baafba6 unchanged, we'll see it both times. Using git show --stat runs a variant of git diff to compare each commit against its parent(s), so that we omit the file if we have seen it before.

The slight defect (or maybe not-defect) is that we "re-see" a file if it comes back. For instance if that big file is removed in the third commit and restored in the fourth, we'll see it twice.

This is where we may want --topo-order --reverse. If we use this, we'll get all parent commits before their children. We can then save each diagnosed object hash, and suppress a repeat diagnostic. Here a nice programming language that has associative arrays (hash tables) would be handy, but we can do this in plain bash with a file or directory that contains previously-displayed object hashes:

#! /bin/sh

# get temporary file to hold viewed object hashes
TF=$(mktemp)
trap "rm -f $TF" 0 1 2 3 15

BIG=100000  # files up to (and including?) this size are not-big

git rev-list --all --topo-order --reverse |
while read commithash; do
    git ls-tree -r $commithash |
    while read mode type objhash path; do
        [ $type = blob ] || continue       # only look at files
        blobsize=$(git cat-file -s $objhash)
        [ $blobsize -lt $BIG ] && continue # or -le
        # found a big file - have we seen it yet?
        grep $objhash $TF >/dev/null && continue
        echo "$blobsize byte file added at commit $commithash as $path"
        echo $objhash >> $TF # don't print again under any path name
    done
done
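
To try it, save the script (the file name below is just an example) and run it from the top level of the repository. Since each output line starts with the size, a numeric sort puts the biggest files first:

sh find-big-files.sh | sort -rn | head -20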

Note that since we now remember large files by their hash ID, we won't re-announce them even if they re-appear under another name (e.g., get git mved, or are removed and then re-appear under the same or another name).

If you prefer the diff-invoking method that git show uses, we can use that instead of our hash-saving temporary file, but still avoid the clumsy grepping away of commit messages, by using the appropriate plumbing command, which is git diff-tree. It's also probably still wise to use --topo-order (just as a general rule), although it's no longer required. So this gives:

BIG=100000 # just as before

git rev-list --all --topo-order | while read commithash; do
    git diff-tree -r --name-only --diff-filter=AMT $commithash |
        tail -n +2 | while read path; do
            objsize=$(git cat-file -s "$commithash:$path")
            [ $objsize -lt $BIG ] && continue
            echo "$blobsize byte file added at commit $commithash as $path"
        done
done

git diff-tree needs -r to work recursively (same as git ls-tree), needs --name-only to print only file names, and needs --diff-filter=AMT to print only the names of files added, modified, or type-changed (from symlink to file or vice versa). Obnoxiously, git diff-tree prints the commit ID again as the first line. We can suppress the ID with --no-commit-id but then we get a blank line, so we might as well just use tail -n +2 to skip the first line.
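
For instance (commit hash and paths invented here), the raw output looks like this, which is where the tail -n +2 comes in:

git diff-tree -r --name-only --diff-filter=AMT $commithash
# prints:
#   9fceb02d0ae598e95dc970b74767f19372d61af8    <- commit ID, stripped by tail
#   src/bigfile.bin
#   docs/manual.pdf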

The rest of the script is the same as yours, except that we get the object's size the easy way, using git cat-file -s, and test it directly with the [ / test program.

Note that with merge commits, git diff-tree (like git show) uses a combined diff, showing only files that, in the merge result, don't match either parent. This should be OK since if file huge is 4GB in the merge result but is identical to file huge that was 4GB in one of the two merged commits, we'll see huge when it's added to that commit, instead of seeing it in the merge itself.

(If that's not desirable, you can add -m to the git diff-tree command. However, then you'll need to drop the tail -n +2 and put in the --no-commit-id, which behaves differently under -m. This particular behavior in Git is somewhat annoying, although it makes sense with the default output format, which is similar to git log --raw.)

(NB: code above is not tested - spotted and fixed $hash vs $commithash on last re-read.)

torek
  • This answer is awesome! I've been running the first one and getting good results for the past several hours now. Haven't tried the second one yet. I expect that for this sort of thing that there's just no fast way of doing it. – GaTechThomas Jan 13 '17 at 20:34
  • Well, each file tends to appear many, many times: if you commit something with 1000 files, then change 2 files and commit, you've repeated 998 files. Then you change 1 file and commit and you've repeated 999 files, of which either 998 or 997 are repeats from the first commit. We're getting the size of each file every time, rather than saving it, because sh is not a very clever programming language. So it's expensive. The diff-tree based version gets fewer sizes, so even though diffing is harder, it might be faster in the end. – torek Jan 13 '17 at 20:38
  • Use `tail -n +2` to skip first line – Miguel Angelo Mar 22 '19 at 22:51
  • @MiguelAngelo: indeed. Fixed (well, fix in progress, should be done by the time you read this) – torek Mar 22 '19 at 23:54
  • Hope you don't mind, I have created and published a gist with your code, giving credit of course: [git-find-big-files-in-hist.sh](https://gist.github.com/masbicudo/c87600d08ba32903b0e0863efd0966a8) – Miguel Angelo Mar 23 '19 at 00:46
  • @MiguelAngelo Seems fine. BTW you can encode a literal escape in many shells using `$'`, e.g., `dkgray=$'\e[90m'`. This works in bash, and in BSD sh, but not in dash-based sh where the `$` becomes a dollar sign. See also https://unix.stackexchange.com/a/266942/162084 – torek Mar 23 '19 at 00:58

git log --name-only --diff-filter=d --all --pretty=format:%H \
| awk '/^$/{c=""}!c{c=$1;next}{print c":"$0,c,$0}' \
| git cat-file --batch-check=$'%(rest)\t%(objectsize)'

That's: show all changed-but-not-deleted files after the commit id for every commit in history; reformat the list to

sha:path sha path

for each; and feed that to --batch-check for one-pass extraction of the sizes.
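
If you only want the large ones, you can append a filter stage. Since each output line here is sha path, then a tab, then the size, something like this (the 100000 threshold is just an example) keeps and ranks them:

... | awk -F'\t' '$2 > 100000' | sort -t$'\t' -k2,2rn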

jthill