You are most of the way there, really.
git log --pretty=tformat:%H
This should just be git rev-list <start-points>, e.g., git rev-list HEAD or git rev-list --all. You may want to add --topo-order --reverse for reasons we'll reach in a moment.
| while read hash; do
git show --stat --name-only $hash
Instead of git show --stat, you probably just want to use git ls-tree on the hash. Using a recursive git ls-tree you will find every blob within the given commit, along with its corresponding path name (tree entries are listed as well only if you add -t, and submodules appear with type commit).
Entries other than blobs are probably not interesting, so we restrict ourselves to blobs. Note, by the way, that git ls-tree will encode some problematic file names unless you use -z (but this makes it harder to read the items; bash can do it, plain sh can't).
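To make the -z point concrete, here is a minimal bash sketch of reading NUL-terminated git ls-tree records. The helper names parse_entry and list_blobs are made up for illustration, and the code relies on bash-only features (read -d '', process substitution, here-strings), so it will not run under plain sh.

```shell
#!/usr/bin/env bash
# Sketch only: parse_entry and list_blobs are hypothetical helper names.

# Split one record of `git ls-tree -z` output into mode, type, objhash, path.
# Each record looks like "<mode> <type> <objhash><TAB><path>".
parse_entry() {
    local tab=$'\t'
    local meta=${1%%"$tab"*}       # the three fields before the first tab
    path=${1#*"$tab"}              # everything after the first tab, unmodified
    read -r mode type objhash <<<"$meta"
}

# Print "objhash path" for every blob in the commit given as $1.
list_blobs() {
    local entry mode type objhash path
    while IFS= read -r -d '' entry; do
        parse_entry "$entry"
        [ "$type" = blob ] || continue
        printf '%s %s\n' "$objhash" "$path"
    done < <(git ls-tree -r -z "$1")
}
```

Inside a repository, list_blobs HEAD would print one line per blob. A path that itself contains a newline would still spill over several output lines, so a real script would probably want to quote the path before printing it.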
| grep -P '^(?:(?!commit|Author:|Date:|Merge:| ).)*$' | while read filename; do
Using git ls-tree we can replace this with:
git ls-tree -r $hash | while read mode type objhash path; do
and then we'll skip anything whose type is not blob:
[ "$type" = blob ] || continue
if [ ! -z "$filename" ]; then
We won't need this at all.
git show "$hash:$filename" | wc -c | while read filesize; do
if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
fi
It's not clear to me why you have a while read filesize loop, nor the complex tests. In any case the easy way to get the size of the blob object is with git cat-file -s $objhash, and it's easy to test [ $blobsize -gt 100000 ], for instance:
blobsize=$(git cat-file -s "$objhash")
if [ "$blobsize" -gt 100000 ]; then
    echo "$hash contains $path size $blobsize"
fi
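As an aside, git ls-tree also has a -l (--long) option that prints each blob's size as an extra column, which avoids one git cat-file call per file. The following is only a sketch of that variant: filter_big is a made-up helper name, and it assumes the documented "<mode> <type> <objhash> <size><TAB><path>" layout of the long format.

```shell
#!/bin/sh
# Sketch only: filter_big is a hypothetical helper name.
BIG=100000

# Read `git ls-tree -r -l` records on stdin and print "size objhash path"
# for every blob larger than $BIG bytes.
filter_big() {
    while read -r mode type objhash size path; do
        [ "$type" = blob ] || continue        # non-blobs show "-" as their size
        [ "$size" -gt "$BIG" ] || continue
        printf '%s %s %s\n' "$size" "$objhash" "$path"
    done
}
```

It would be used as git ls-tree -r -l "$hash" | filter_big. Like the plain read loop above, this does not cope with the quoted path names that git ls-tree emits for problematic file names.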
However, by giving up git show in favor of git ls-tree -r, we see every copy of each file in every commit, rather than just seeing it once, in the first commit in which it appears. For instance, if commit f00f1e adds big file bigfile and it persists in commit baafba6 unchanged, we'll see it both times. In contrast, git show --stat runs a variant of git diff to compare each commit against its parent(s), so that we omit the file if we have seen it before.
The slight defect (or maybe not-defect) is that we "re-see" a file if it comes back. For instance if that big file is removed in the third commit and restored in the fourth, we'll see it twice.
This is where we may want --topo-order --reverse. If we use this, we'll get all parent commits before their children. We can then save each diagnosed object hash, and suppress a repeat diagnostic. Here a nice programming language that has associative arrays (hash tables) would be handy, but we can do this in plain sh with a file or directory that contains previously-displayed object hashes:
#! /bin/sh
# get temporary file to hold viewed object hashes
TF=$(mktemp)
trap "rm -f $TF" 0 1 2 3 15
BIG=100000 # files smaller than this are not big (use -le below to exclude this exact size too)
git rev-list --all --topo-order --reverse |
while read -r commithash; do
    git ls-tree -r "$commithash" |
    while read -r mode type objhash path; do
        [ "$type" = blob ] || continue # only look at files
        blobsize=$(git cat-file -s "$objhash")
        [ "$blobsize" -lt "$BIG" ] && continue # or -le
        # found a big file - have we seen it yet?
        grep "$objhash" "$TF" >/dev/null && continue
        echo "$blobsize byte file added at commit $commithash as $path"
        echo "$objhash" >> "$TF" # don't print again under any path name
    done
done
Note that since we now remember large files by their hash ID, we won't re-announce them even if they re-appear under another name (e.g., get git mv-ed, or are removed and then re-appear under the same or another name).
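If bash 4 or later is available, the temporary file can be swapped for an associative array. This is only a sketch under that assumption: first_sighting and scan_history are made-up names, and the loops read from process substitutions rather than pipes so that they run in the current shell and the array survives from one commit to the next. Unlike the script above, it remembers every blob it has looked at, not just the big ones, which also saves repeated git cat-file calls.

```shell
#!/usr/bin/env bash
# Sketch only, assuming bash 4+; first_sighting and scan_history are
# hypothetical helper names.
declare -A seen

# Succeed only the first time a given object hash is passed in.
first_sighting() {
    [ -z "${seen[$1]+x}" ] || return 1
    seen[$1]=1
}

# Walk history parents-first and report each big blob once.
# $1 is the size threshold in bytes (default 100000).
scan_history() {
    local big=${1:-100000} commithash mode type objhash path blobsize
    while read -r commithash; do
        while read -r mode type objhash path; do
            [ "$type" = blob ] || continue
            first_sighting "$objhash" || continue
            blobsize=$(git cat-file -s "$objhash")
            [ "$blobsize" -lt "$big" ] && continue
            echo "$blobsize byte file added at commit $commithash as $path"
        done < <(git ls-tree -r "$commithash")
    done < <(git rev-list --all --topo-order --reverse)
}
```

Running scan_history 100000 inside a repository should produce the same kind of report as the temporary-file version.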
If you prefer the diff-invoking method that git show uses, we can use that instead of our hash-saving temporary file, but still avoid the clumsy grepping away of commit messages, by using the appropriate plumbing command, which is git diff-tree. It's also probably still wise to use --topo-order (just as a general rule), although it's no longer required. So this gives:
BIG=100000 # just as before
git rev-list --all --topo-order | while read -r commithash; do
    # --root makes the initial commit show up too, diffed against the empty tree
    git diff-tree -r --root --name-only --diff-filter=AMT "$commithash" |
    tail -n +2 | while read -r path; do
        objsize=$(git cat-file -s "$commithash:$path")
        [ "$objsize" -lt "$BIG" ] && continue
        echo "$objsize byte file added at commit $commithash as $path"
    done
done
git diff-tree needs -r to work recursively (same as git ls-tree), needs --name-only to print only file names, and needs --diff-filter=AMT to print only the names of files added, modified, or type-changed (from symlink to file or vice versa). Obnoxiously, git diff-tree prints the commit ID again as the first line. We can suppress the ID with --no-commit-id but then we get a blank line, so we might as well just use tail -n +2 to skip the first line.
The rest of the script is the same as yours, except that we get the object's size the easy way, using git cat-file -s, and test it directly with the [ / test program.
Note that with merge commits, git diff-tree (like git show) uses a combined diff, showing only files that, in the merge result, don't match either parent. This should be OK since if file huge is 4GB in the merge result but is identical to file huge that was 4GB in one of the two merged commits, we'll see huge when it's added to that commit, instead of seeing it in the merge itself.
(If that's not desirable, you can add -m to the git diff-tree command. However, then you'll need to drop the tail -n +2 and put in the --no-commit-id, which behaves differently under -m. This particular behavior in Git is somewhat annoying, although it makes sense with the default output format, which is similar to git log --raw.)
(NB: code above is not tested - spotted and fixed $hash vs $commithash on last re-read.)