616

I have a 300 MB git repo. The total size of my currently checked-out files is 2 MB, and the total size of the rest of the git repo is 298 MB. This is basically a code-only repo that should not be more than a few MB.

I suspect someone accidentally committed some large files (video, images, etc.), and then removed them... but not from git, so the history still contains useless large files. How can I find the large files in the git history? There are 400+ commits, so going one-by-one is not practical.

NOTE: my question is not about how to remove the file, but how to find it in the first place.

pants
    the blazingly fast one liner in the answer by @raphinesse should be marked as the answer instead nowadays. – soloturn Jul 07 '20 at 19:35

14 Answers

1354

A blazingly fast shell one-liner

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.

The Base Script

# Stage by stage:
#   git rev-list : list every object reachable from any ref, with its path where known
#   git cat-file : annotate each object with its type, hash, and size
#   sed          : keep only blob (file content) entries
#   sort         : order numerically by size, smallest first
#   cut          : abbreviate the 40-character hash to 12 characters
#   numfmt       : humanize the byte sizes (gnumfmt from coreutils on macOS)
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run the above code, you will get nice human-readable output like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.

Filtering

To achieve further filtering, insert any of the following lines before the sort line.

To exclude files that are present in HEAD, insert the following line:

grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |

To show only files exceeding a given size (e.g. 1 MiB = 2^20 B), insert the following line:

awk '$2 >= 2^20' |
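For example, with both filter lines inserted before the sort stage, the assembled pipeline that lists only blobs of at least 1 MiB that are not present in HEAD becomes:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
  awk '$2 >= 2^20' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest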

Output for Computers

To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:

...
0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4

Appendix

File Removal

For the actual file removal, check out this SO question on the topic.

Understanding the meaning of the displayed file size

What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize). However, mind that this metric also has its caveats, as is mentioned in the documentation.
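As a sketch, the machine-readable core of the base script with on-disk sizes would then read:

# list blobs with their compressed, on-disk size instead of their checkout size
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2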

More sophisticated size statistics

Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.

So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).
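For instance, a quick sketch of the disk-usage query (assuming Git 2.31 or later, where rev-list gained this option):

# print the total number of bytes of on-disk storage used by all reachable objects
git rev-list --disk-usage --objects --all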

raphinesse
  • 61
    To use this on Mac you need to `brew install coreutils` and then replace `cut` with `gcut` and `numfmt` with `gnumfmt`. – Nick Sweeting Sep 14 '17 at 18:42
  • Back to the problem itself, just one question remains: my repo is dramatically smaller but when I rerun the list command, I still see the large file. Is this expected? – Sridhar Sarnobat Oct 07 '17 at 00:53
  • 1
    @Sridhar-Sarnobat Well, properly removing files from a repo can be challenging. See if the [official checklist](https://git-scm.com/docs/git-filter-branch#_checklist_for_shrinking_a_repository) helps you. Alternatively check the [other question](https://stackoverflow.com/questions/2100907/how-to-purge-a-huge-file-from-commits-history-in-git) linked in this question. – raphinesse Oct 07 '17 at 09:15
  • 3
    I would suggest to use `objectsize:disk` instead of `objectsize`. – Victor Yarema Oct 17 '17 at 09:39
  • 2
    Thanks a lot. Worked for me on MacOs (with homebrew 'coreutils' package, with 'gcut', 'gnumfmt' instead of 'cut' and 'numfmt') – beefeather Dec 10 '17 at 23:25
  • with the `cut` and `numfmt` parts not working right on macOS, and considering how they are not really necessary for using the tool, I use this script without them and just look at the computer-friendly output. – Steven Lu Feb 09 '18 at 21:30
  • The `--with-default-names` flag with the `brew install coreutils` part no longer works as expected out of the box. Instead, like others have stated above, simply rename the `cut` to `gcut` and `numfmt` to `gnumfmt` in the shell command. – Mike Kormendy Nov 14 '18 at 05:53
  • 2
    When I run the 'The Base Script' I just get the error `error: option 'batch-check' takes no value` – botenvouwer Jan 16 '19 at 13:44
  • @botenvouwer Maybe an old git version? What version do you have? – raphinesse Jan 16 '19 at 18:21
  • 1
    @raphinesse That would be `git version 1.8.3.1`. And I use the command in the root of the git repo. – botenvouwer Jan 17 '19 at 11:53
  • 1
    @botenvouwer Well, as I suspected your git version is too old and does not support `batch-check` as it is used here (as can be seen in the docs: https://git-scm.com/docs/git-cat-file/1.8.3.1) – raphinesse Jan 17 '19 at 12:58
  • Well humm..., then I have to accept that some moron committed large binaries :(. Anyway, it works on my private system, thanks for looking this up mate! :) I and many more will need this in the future. – botenvouwer Jan 17 '19 at 13:46
  • @botenvouwer Ironically, 1.8.4 was the version that [introduced the feature](https://github.com/git/git/blob/77556354bb7ac50450e3b28999e3576969869068/Documentation/RelNotes/1.8.4.txt#L116-L119) – raphinesse Jan 17 '19 at 15:29
  • What happens if you have cloned the repo? Will pushes to or pulls from an un-filtered repo re-introduce the huge file? – Sridhar Sarnobat Jan 21 '19 at 03:52
  • 1
    @Sridhar-Sarnobat removing the file will rewrite the history up to the point where the file was introduced. See [The Perils of Rebasing](https://git-scm.com/book/en/v2/Git-Branching-Rebasing#The-Perils-of-Rebasing) for detailed information on what that means for collaboration. – raphinesse Jan 21 '19 at 10:36
  • What would one need to add to check whether git also sees it as a binary file? That is, to filter by "binary" only. – Gabriel May 20 '19 at 16:52
  • 1
    Now the obvious. How do you run this *AWESOME* script in Git Bash? For Git noobs like me, I needed these basics to run this script in Git Bash for Windows. Save code in a textfile labelled say `myscript.sh` in directory below your repo. Then to run inside Git Bash windows at $ sign enter command `sh ../myscript.sh`. (Thanks @JohnWrensby [answer](https://stackoverflow.com/a/45584884/4606130)) – micstr Jun 26 '19 at 09:32
  • Thank you! Will this show all files even if they are not checked out nor present in any branch anymore? – mvorisek Jul 28 '19 at 10:23
  • @Mvorisek this will show any files that exist in any commit of the repository – raphinesse Jul 28 '19 at 21:31
  • 1
    @raphinesse - thanks for the excellent script. If I may, I might suggest another suggestion re. filtering - that is, to find files that are not already managed by Git LFS, which can be done by `| grep -vF --file=<(git lfs ls-files -a | awk '{print $3}') \`. I might have suggested this as an edit, but for such a popular and worthily up-voted answer, that felt a bit intrusive - hence the suggestion by comment. – amaidment Jun 17 '20 at 15:33
  • @amaidment That's a good idea! But what about a file that has been moved to LFS but still has a large blob in the repo? Would such a file also be removed by your filter? It should not be removed IMHO. – raphinesse Jun 17 '20 at 15:45
  • @raphinesse - can that happen? I had thought that moving a file to LFS meant that it didn't leave a blob in the repo, but just a tiny pointer. My apologies if that's noob naivety, but I'm quite new to Git. In my use case, I'm trying to shrink a repo (exc. LFS), so I'm trying to work out what to remove (i.e. using git [filter-repo](https://github.com/newren/git-filter-repo/)) and what to migrate to LFS (using [git-lfs-migrate](https://github.com/git-lfs/git-lfs/blob/master/docs/man/git-lfs-migrate.1.ronn)), and wanted to find the large stuff that was left. – amaidment Jun 17 '20 at 15:54
  • 1
    @amaidment Well if you properly rewrite your history (what `filter-branch`, `filter-repo` and seemingly `lfs-migrate` are doing) then the big blobs will be gone from your object store. But if you just naively add the file to LFS the old blobs should be left behind AFAICT. Furthermore, I don't even know if LFS managed files would appear with their actual (as in checked-out) file size in the listing produced by my script. Unfortunately I have no repo at hand to test right now either. – raphinesse Jun 18 '20 at 09:17
  • @raphinesse - another thing - with your suggested filter for files not in `HEAD`, I think the `awk` should be `awk '{print $4}'`. I found (at least with git 2.27.0) that the filenames were the fourth output from `git ls-tree` (see [docs](https://git-scm.com/docs/git-ls-tree#_output_format)). – amaidment Jun 18 '20 at 16:43
  • @amaidment I really meant to use `$3` here. But I see your point. Eventually it comes down to the question if you want to exclude the exact version of any file present in HEAD (use `$3`), or all files that ever lived in the repository under any path that is present in HEAD (use `$4`). But be careful with `$4`: If you have `foo/bar` in HEAD you would also exclude `foo/bar.bin`, even if the latter wasn't in HEAD – raphinesse Jun 18 '20 at 21:47
  • What would cause this query to omit blobs? I queried using this one-liner. I removed those files from the repo history using https://github.com/newren/git-filter-repo. Then I re-ran the query and was shocked that it found more blobs that were not in the original output. How is that possible!? – chrisinmtown Aug 18 '20 at 14:26
  • @chrisinmtown From the info you've given, I can't tell what's happening. Above command should never omit any blobs. Removing files from the history should neither change existing blob SHAs (since those depend on file content only) nor add new blobs. What _can_ change is the path displayed for a blob. This will happen if a blob is used in different tree objects or under different names and you only delete a subset of those occurrences from the repo. See https://git-scm.com/book/en/v2/Git-Internals-Git-Objects for how objects are stored in git. – raphinesse Aug 20 '20 at 07:45
  • 4
    This answer seems to print object IDs and file names, not the commits that added them, right? How do I find the commits I have to remove, as the question asks? – oarfish Aug 28 '20 at 07:45
  • 1
    Really great work! I wrote a powershell version based on this for use on Windows if that helps anyone else: https://stackoverflow.com/a/66653426/887962 – SvenS Mar 16 '21 at 10:35
  • 2
    I wondered how it would list files managed by Git LFS. So, I created [a repository with two large files](https://github.com/brandizzi/big), `wrong.iso` added/committed before enabling Git LFS, and `xubuntu-18.04.2-desktop-amd64.iso`, added and committed with Git LFS. (They're the same file, BTW). The script prints this for the one added before LFS: `177485aecd84 1,4GiB wrong.iso`. For the one added after LFS, this is the result: `c381232ed0de 135B xubuntu-18.04.2-desktop-amd64.iso`. So **LFS files are not listed with full size** (which is the behavior I wanted anyway.) – brandizzi Aug 12 '21 at 14:54
  • This is cool, but not an answer, as the question looks for the commits, not blob ids. Anyone have another oneliner to map to the commit? – oarfish Mar 20 '22 at 17:30
  • 1
    See also: [How to remove all files from the git history that are not currently present](https://stackoverflow.com/a/68368158/562769) – Martin Thoma Mar 26 '22 at 07:54
  • Note that for myself, this command would blow up on zsh 5.8.1 on mac with the error `__vsc_command_output_start:printf:2: %(: invalid directive`. If you get that error, just try running it in `bash` instead. – jgawrych Aug 07 '22 at 06:01
  • 1
    @oarfish https://stackoverflow.com/questions/223678/which-commit-has-this-blob has this topic covered. – Imperishable Night May 25 '23 at 22:03
196

I've found a one-liner solution on the ETH Zurich Department of Physics wiki page (close to the end of that page). Just do a git gc to remove stale junk, and then

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

will give you the 10 largest files in the repository.

There's also a lazier solution now available: GitExtensions has a plugin that does this in the UI (and handles history rewrites as well).

GitExtensions 'Find large files' dialog

bschlueter
skolima
  • 8
    That one-liner only works if you want to get the single biggest file (i.e., use tail -1). Newlines get in the way for anything bigger. You can use sed to convert the newlines so grep will play nice: ``git rev-list --objects --all | grep -E `git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -10 | awk '{print$1}' | sed ':a;N;$!ba;s/\n/|/g'` `` – Throctukes Jun 04 '14 at 13:58
  • 10
    grep: a70783fca9bfbec1ade1519a41b6cc4ee36faea0: No such file or directory – Jonathan Allard Jan 27 '15 at 21:16
  • Both one-liners do not terminate on large repositories. – user3072843 Oct 19 '15 at 13:29
  • 1
    The wiki link moved to: https://readme.phys.ethz.ch/documentation/git_advanced_hints/ – outsmartin Jan 06 '16 at 12:44
  • 18
    Finding GitExtensions is like finding the pot of gold and the end of the rainbow -- thank you! – ckapilla Jun 18 '16 at 15:11
  • 3
    Is there also an extension which prints the size of the files? – Michael Apr 03 '17 at 13:00
  • Michael see the answer below from @raphinesse - worked best for me (also fastest) – Gregor Mar 26 '18 at 10:39
  • The GitExtensions solution is nice, however it is so slow that it cannot really be used on large repositories (it would take several hours to finish on the repository where I tested it). – Étienne Feb 19 '20 at 14:06
174

I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:


#!/bin/bash
#set -x

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# use the C locale so the numeric sort is not confused by locale-specific digit separators
export LC_ALL=C

# list all objects including their size, sort by size, take top 10
objects=$(git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head)

echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
allObjects=$(git rev-list --all --objects)
for y in $objects
do
    # extract the size in bytes and convert to kB
    size=$(($(echo $y | cut -f 5 -d ' ')/1024))
    # extract the compressed (packed) size in bytes and convert to kB
    compressedSize=$(($(echo $y | cut -f 6 -d ' ')/1024))
    # extract the SHA
    sha=$(echo $y | cut -f 1 -d ' ')
    # find the object's location in the repository tree
    other=$(echo "${allObjects}" | grep $sha)
    output="${output}\n${size},${compressedSize},${other}"
done

echo -e $output | column -t -s ', '

That will give you the object name (SHA-1 sum) of each blob, and then you can use a script like this one:

... to find the commit that points to each of those blobs.
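Alternatively, on recent Git versions, a minimal sketch using git log's --find-object option does the blob-to-commit mapping without an extra script (the abbreviated blob hash below is hypothetical):

# list all commits whose diff adds or removes the blob with the given hash
git log --all --oneline --find-object=0d99bb931299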

D. Ben Knoble
Mark Longair
  • 37
    This answer was really helpful, because it sent me to the post above. While the post's script worked, I found it painfully slow. So I rewrote it, and it's now significantly faster on large repositories. Have a look: https://gist.github.com/nk9/b150542ef72abc7974cb – Nick K9 Jun 23 '14 at 19:46
  • 11
    Please include full instructions in your answers and not just offsite links; What do we do when stubbisms.wordpress.com inevitably goes down eh? – ThorSummoner Sep 03 '14 at 19:44
  • @NickK9 interestingly I get different output from your script and the other. there's a bunch of bigger objects that yours seems to miss. Is there something I'm missing? – UpAndAdam Jan 05 '16 at 17:54
  • Oh cool! Thanks for making my script faster @NickK9 :D @UpAndAdam, are you saying my script produced incorrect output? – Antony Stubbs Oct 27 '16 at 07:31
  • Problem: this fails on locales where space is a digit separator, as this confuses sorting, for example when `LANG=fr_FR.UTF-8`. Observed behavior: it returns a list of objects which aren't the top biggest at all, and they are reported with zero size. Workaround: setting `LC_ALL` to `C` before calling the script works. Solution: it would be more robust if the script did that itself. – Stéphane Gourichon Nov 24 '16 at 10:25
  • @NickK9 any insight on UpAndAdam's experience with the script missing some files? Antony's script isn't producing any output and yours is, but I want to make sure it isn't missing anything. – indigo Mar 05 '17 at 23:50
  • I have posted a comparison of the linux kernel to show the differences. The repo is 1.83GB. The python script was faster. Importantly however, it would seem either the python script is missing some files or the bash one is somehow finding files it shouldn't... https://gist.github.com/akrueger/850115a0ce8e32ca4acdd9ab61a6753e – indigo Mar 06 '17 at 06:09
  • @A.Krueger @UpAndAdam The bash script sorts by physical size, whereas the python script's default is to sort by the compressed pack size. To compare apples to apples, you should pass `-p`. Does that fix the discrepancy? – Nick K9 Mar 07 '17 at 00:21
  • good approach and flexible one here: http://blog.jessitron.com/2013/08/finding-and-removing-large-files-in-git.html – herve Jun 29 '17 at 14:13
  • 1
    These comments make it sound like we're reporting size in bytes, but I get kilobytes. – Kat Nov 06 '17 at 15:45
  • Doesn't format well if there are spaces in the file path. Changing the `column` field separators to just `','` makes the output look a little better, except the 'location' column header is off – villapx Oct 12 '20 at 19:07
  • @NickK9 has also written a python 3 version of "git-largest-files" https://gist.github.com/malcolmgreaves/39e33e9b161916cb92ae0fdcfea91d64 – jerrymouse Nov 30 '20 at 15:54
38

Step 1: Write all file SHA-1s to a text file:

git rev-list --objects --all | sort -k 2 > allfileshas.txt

Step 2: Sort the blobs from biggest to smallest and write the results to a text file:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Step 3a: Combine both text files to get file name/SHA-1/size information:

for SHA in $(cut -f 1 -d\  < bigobjects.txt); do
    echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done

Step 3b: If you have file names or path names containing spaces, try this variation of Step 3a. It uses cut instead of awk to get the desired columns, including spaces, from column 7 to the end of the line:

for SHA in $(cut -f 1 -d\  < bigobjects.txt); do
    echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | cut -d ' ' -f'1,3,7-' >> bigtosmall.txt
done

Now you can look at the file bigtosmall.txt in order to decide which files you want to remove from your Git history.

Step 4: To perform the removal (note this part is slow, since it examines every commit in your history for data about the file you identified):

git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD
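Note that git filter-branch is deprecated in current Git. As an alternative sketch (not part of the original article, and requiring a separate install), the same removal with git-filter-repo would look like:

# rewrite history, dropping myLargeFile.log from every commit
git filter-repo --invert-paths --path myLargeFile.log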

Source

Steps 1-3a were copied from Finding and Purging Big Files From Git History

EDIT

The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.

Sridhar Sarnobat
friederbluemle
  • 6
    One-liner to do the same thing: `git gc && join -e ERROR -a 2 -j 1 -o 2.1,2.3,1.2 --check-order <( git rev-list --objects --all | sort -k 1 ) <( git verify-pack -v .git/objects/pack/pack-*.idx | gawk '( NF == 5 && $2 == "blob" ){print}' | sort -k1 ) | sort -k2gr` – Iwan Aucamp Mar 05 '15 at 14:35
  • 1
    @Iwan, thanks for the one-liner! It doesn't handle filenames with spaces in them, this seems to: `join -t' ' -e ERROR -a 2 -j 1 -o 2.1,2.3,1.2 --check-order <( git rev-list --objects --all | sed 's/[[:space:]]/\t/' | sort -k 1 ) <( git verify-pack -v .git/objects/pack/pack-*.idx | gawk '( NF == 5 && $2 == "blob" ){print}' | sort -k1 | sed 's/[[:space:]]\+/\t/g' ) | sort -k2gr | less`. Note that you have to enter the actual TAB character after `join -t'` with CTRL+V per http://geekbraindump.blogspot.ru/2009/04/unix-join-with-tabs.html – Nickolay Jul 02 '15 at 09:11
  • 2
    @Nickolay with bash `$'\t'` should give you a tab. `echo -n $'\t' | xxd -ps` -> `09` – Iwan Aucamp Jul 02 '15 at 10:36
  • 1
    @IwanAucamp: even better, thanks for the tip! (Too bad I can't edit the previous comment.. oh well.) – Nickolay Jul 02 '15 at 23:07
  • Still my preferred option even if it's slower since you can see what the files are before removing. Unfortunately that article has vanished from the web. – Sridhar Sarnobat Oct 03 '17 at 21:12
  • But how do you actually perform the deletion now that you know the large file and its SHA? Oh wait, answer: `git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD` – Sridhar Sarnobat Oct 03 '17 at 21:17
  • 1
    @Sridhar-Sarnobat The article was saved by the Wayback Machine! :) https://web.archive.org/web/20170621125743/http://www.naleid.com/blog/2012/01/17/finding-and-purging-big-files-from-git-history – friederbluemle Oct 04 '17 at 11:04
  • How do we propagate the cleanup we did locally to the remote repo? – Pratik Khadloya Jan 10 '20 at 22:44
16

You should use BFG Repo-Cleaner.

According to the website:

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

The classic procedure for reducing the size of a repository would be:

git clone --mirror git://example.com/some-big-repo.git          # bare mirror clone, as BFG requires
java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git   # delete the 500 largest blobs from history
cd some-big-repo.git
git reflog expire --expire=now --all                            # drop reflog entries that keep old objects alive
git gc --prune=now --aggressive                                 # discard the now-unreferenced objects
git push                                                        # push the rewritten history
Warren Seine
  • 4
    BFG Repo-Cleaner is very good. It's lightning fast and works very reliably. – fschmitt Jan 20 '15 at 12:19
  • 33
    This doesn't tell you how to list all the largest files though. – Andi Jay Dec 01 '16 at 16:35
  • 5
    The problem with this is you can't just SEE what are the big files without actually removing them. I don't feel comfortable doing this without a dry run first that simply lists the big files. – Sridhar Sarnobat Oct 03 '17 at 21:11
  • What does `--strip-biggest-blobs 500` do? – 2540625 May 08 '20 at 22:31
  • in the end using git push didn't clean up the remote repository. I was still able to download the previous huge git .pack file – Sambit Swain Aug 13 '20 at 12:22
  • 3
    As of 2020 I would avoid bfg. It only accepts file basenames ("foo.out") not the path, so you cannot restrict it meaningfully. It has no -dryrun option. The last commit was 2015. Essentially it's dead. Downvoted (sorry). – chrisinmtown Aug 18 '20 at 14:29
15

If you only want to have a list of large files, then I'd like to provide you with the following one-liner:

join -o "1.1 1.2 2.3" <(git rev-list --objects --all | sort) <(git verify-pack -v objects/pack/*.idx | sort -k3 -n | tail -5 | sort) | sort -k3 -n

Its output will look like this:

blob SHA     file name                                  size in bytes

72e1e6d20... db/players.sql 818314
ea20b964a... app/assets/images/background_final2.png 6739212
f8344b9b5... data_test/pg_xlog/000000010000000000000001 1625545
1ecc2395c... data_development/pg_xlog/000000010000000000000001 16777216
bc83d216d... app/assets/images/background_1forfinal.psd 95533848

The last entry in the list points to the largest file in your git history.

You can use this output to make sure that you're not deleting stuff with BFG that you still need in your history.

Be aware that you need to clone your repository with --mirror for this to work.

schmijos
  • 2
    Awesome!! However, you should note that you need to clone the repo with the --mirror options before running this command. – Andi Jay Dec 01 '16 at 16:38
  • I'm curious, what are the `1.1, 1.2, 2.3` numbers for? – ympostor Jan 06 '17 at 06:41
  • The numbers are a list of `<file>.<field>` specifiers giving the order of the combination. See http://man.cx/join for more information. – schmijos Jan 09 '17 at 21:02
  • This isn't working properly for files with spaces in the path; the `join` command as-is is only taking the first "word" of the file path, as separated by whitespace – villapx Oct 12 '20 at 19:04
8

If you are on Windows, here is a PowerShell script that will print the 10 largest files in your repository:

# list every path mentioned anywhere in the history
$revision_objects = git rev-list --objects --all;
# keep only entries that still exist as files in the current working tree
$files = $revision_objects.Split() | Where-Object {$_.Length -gt 0 -and $(Test-Path -Path $_ -PathType Leaf) };
# report the 10 largest by their current on-disk size (history-only files are missed)
$files | Get-Item -Force | select fullname, length | sort -Descending -Property Length | select -First 10
Julia Schwarz
  • 1
    This produces an answer different to @raphinesse, missing a bunch of the largest files on my repository. Also when one large file has a lot of modifications, only the largest size is reported. – kristianp Jul 19 '17 at 23:29
  • This script failed for me, with the error: `You cannot call a method on a null-valued expression. At line: 2 char: 1`. However, this answer worked: https://stackoverflow.com/a/57793716/2441655 (it's also shorter) – Venryx Dec 30 '19 at 14:53
7

For Windows, I wrote a PowerShell version of this answer:

function Get-BiggestBlobs {
  param ([Parameter(Mandatory)][String]$RepoFolder, [int]$Count = 10)
  Write-Host ("{0} biggest files:" -f $Count)
  git -C $RepoFolder rev-list --objects --all | git -C $RepoFolder cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | ForEach-Object {
    $Element = $_.Trim() -Split '\s+'
    $ItemType = $Element[0]
    if ($ItemType -eq 'blob') {
      New-Object -TypeName PSCustomObject -Property @{
          ObjectName = $Element[1]
          Size = [int]([int]$Element[2] / 1kB)
          Path = $Element[3]
      }
    }
  } | Sort-Object Size | Select-Object -last $Count | Format-Table ObjectName, @{L='Size [kB]';E={$_.Size}}, Path -AutoSize
}

You'll probably want to fine-tune whether it's displaying kB or MB or just Bytes depending on your own situation.

There's probably potential for performance optimization, so feel free to experiment if that's a concern for you.

To get all changes, just omit | Select-Object -last $Count.
To get a more machine-readable version, just omit | Format-Table @{L='Size [kB]';E={$_.Size}}, Path -AutoSize.
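A usage sketch, assuming the function has been loaded into your session and using a hypothetical repository path:

Get-BiggestBlobs -RepoFolder 'C:\src\my-repo' -Count 20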

SvenS
  • 1
    Interesting to see a PowerShell version of my script! I have not tried it but from the code it looks like you do not output the `objectname` field. I really think you should though, since the path:objectname relationship is n:m not 1:1. – raphinesse Mar 16 '21 at 11:11
  • 1
    @raphinesse Yeah my use-case is to create an ignore-regex to migrate from TFVC to git without too many big files, so I was only interested in the paths of the files that I need to ignore ;) But you're right, I'll add it. Thanks for the edit by the way :) – SvenS Mar 16 '21 at 15:21
5

PowerShell solution for Windows git to find the largest files currently in HEAD:

# ls-tree -l columns are: mode, type, object hash, size, path
git ls-tree -r -t -l --full-name HEAD | Where-Object {
    $_ -match '(.+)\s+(.+)\s+(.+)\s+(\d+)\s+(.*)'
} | ForEach-Object {
    New-Object -Type PSObject -Property @{
        'Mode' = $matches[1]
        'Type' = $matches[2]
        'Hash' = $matches[3]
        'Size' = [int]$matches[4]
        'Path' = $matches[5]
    }
} | sort -Property Size -Top 10 -Descending
Aaron
4

Try git ls-files | xargs du -hs --threshold=1M.

We use the command below in our CI pipeline; it halts the build if it finds any big files among the tracked files in the working tree:

test $(git ls-files | xargs du -hs --threshold=1M 2>/dev/null | tee /dev/stderr | wc -l) -gt 0 && { echo; echo "Aborting due to big files in the git repository."; exit 1; } || true
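Unrolled for readability, the same check reads roughly as follows (a sketch, not a drop-in replacement):

# count tracked files of 1 MiB or more; du's listing goes to stderr so it shows in the CI log
big_files=$(git ls-files | xargs du -hs --threshold=1M 2>/dev/null | tee /dev/stderr | wc -l)
if [ "$big_files" -gt 0 ]; then
    echo
    echo "Aborting due to big files in the git repository."
    exit 1
fi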
Vojtech Vitek - golang.cz
3

I was unable to make use of the most popular answer because the --batch-check command-line switch of Git 1.8.3 (which I have to use) does not accept any arguments. The following steps were tried on CentOS 6.5 with Bash 4.1.2.

Key Concepts

In Git, the term blob refers to the contents of a file. Note that a commit may change a file's contents or its pathname, so the same path can refer to different blobs depending on the commit. A certain file could be the biggest in the directory hierarchy in one commit, but not in another. Therefore, the question of finding large commits instead of large files puts matters in the correct perspective.
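To illustrate (a sketch; README.md is a hypothetical path), the same path can resolve to different blob hashes in different commits:

# print the blob hash that README.md resolves to in two adjacent commits
git rev-parse HEAD:README.md
git rev-parse HEAD~1:README.md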

For The Impatient

The command to print the list of blobs in descending order of size is:

git cat-file --batch-check < <(git rev-list --all --objects  | \
awk '{print $1}')  | grep blob  | sort -n -r -k 3

Sample output:

3a51a45e12d4aedcad53d3a0d4cf42079c62958e blob 305971200
7c357f2c2a7b33f939f9b7125b155adbd7890be2 blob 289163620

To remove such blobs, use the BFG Repo Cleaner, as mentioned in other answers. Given a file blobs.txt that just contains the blob hashes, for example:

3a51a45e12d4aedcad53d3a0d4cf42079c62958e
7c357f2c2a7b33f939f9b7125b155adbd7890be2

Do:

java -jar bfg.jar -bi blobs.txt <repo_dir>

The question is about finding the commits, which is more work than finding blobs. To learn how, please read on.

Further Work

Given a commit hash, a command that prints hashes of all objects associated with it, including blobs, is:

git ls-tree -r --full-tree <commit_hash>

So, if we have such output available for all commits in the repo, then, given a blob hash, the relevant commits are the ones whose output contains that hash. This idea is encoded in the following script:

#!/bin/bash
DB_DIR='trees-db'

# print the names (commit hashes) of all cached tree listings that mention $1
find_commit() {
    cd ${DB_DIR}
    for f in *; do
        if grep -q $1 ${f}; then
            echo ${f}
        fi
    done
    cd - > /dev/null
}

# cache the recursive tree listing of every commit, one file per commit
create_db() {
    local tfile='/tmp/commits.txt'
    mkdir -p ${DB_DIR} && cd ${DB_DIR}
    git rev-list --all > ${tfile}

    while read commit_hash; do
        if [[ ! -e ${commit_hash} ]]; then
            git ls-tree -r --full-tree ${commit_hash} > ${commit_hash}
        fi
    done < ${tfile}
    cd - > /dev/null
    rm -f ${tfile}
}

create_db

# read blob hashes from stdin and report the commits that contain them
while read id; do
    find_commit ${id};
done

If the contents are saved in a file named find-commits.sh, then a typical invocation will be as follows:

cat blobs.txt | bash find-commits.sh

As earlier, the file blobs.txt lists blob hashes, one per line. The create_db() function saves a cache of all commit listings in a sub-directory of the current directory.

Some stats from my experiments on a system with two Intel(R) Xeon(R) CPU E5-2620 2.00GHz processors presented by the OS as 24 virtual cores:

  • Total number of commits in the repo = almost 11,000
  • File creation speed = 126 files/s. The script creates a single file per commit. This occurs only when the cache is being created for the first time.
  • Cache creation overhead = 87 s.
  • Average search speed = 522 commits/s. The cache optimization resulted in an 80% reduction in running time.

Note that the script is single threaded. Therefore, only one core would be used at any one time.

pdp
2

Use the --analyze feature of git-filter-repo like this:

$ cd my-repo-folder
$ git-filter-repo --analyze
$ less .git/filter-repo/analysis/path-all-sizes.txt
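The analysis directory contains further reports besides path-all-sizes.txt. For example (file names as produced by recent git-filter-repo versions; treat them as an assumption if your version differs):

$ less .git/filter-repo/analysis/directories-all-sizes.txt
$ less .git/filter-repo/analysis/blob-shas-and-paths.txt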
Windel
0

I stumbled across this for the same reason as everyone else. But the quoted scripts didn't quite work for me. I've made one that is more of a hybrid of those I've seen, and it now lives here: https://gitlab.com/inorton/git-size-calc

IanNorton
-1

To get a feeling for the "diff size" of the last commits in the git history, run:

git log --stat

This will show the diff size in lines: lines added and lines removed.
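Binary files appear in the --stat output with byte counts instead (e.g. Bin 0 -> 12345 bytes), which can help spot large additions. To limit the output to the most recent commits across all branches, a sketch:

git log --stat -20 --all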

milahu