Context
We are migrating a Subversion repository to multiple git repositories. The content of one of these repositories is smaller than 100MB, but its .git directory is over 5GB. The aim is to preserve the git history while removing the large files from it. We do not want a .git directory larger than 300MB, because otherwise cloning the repository takes too long.
Current
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs/original/refs/heads
4.0K .git/refs/original/refs
4.0K .git/refs/original
8.0K .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
808M .git/objects/pack
4.0K .git/objects/info
4.0K .git/objects/2d
4.0K .git/objects/14
4.0K .git/objects/a8
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/
Expected
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs/original/refs/heads
4.0K .git/refs/original/refs
4.0K .git/refs/original
8.0K .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
1M .git/objects/pack
4.0K .git/objects/info
4.0K .git/objects/2d
4.0K .git/objects/14
4.0K .git/objects/a8
1M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
1M .git/
Problem statement
As described in the context paragraph, the goal is to get rid of files that are too large in the .git directory. It turned out that, back in the day, some isos were committed. It was possible to migrate an svn folder into a git repository, i.e. the history looks identical, but the .git directory is over 5GB while the content of the repo is less than 100MB. If the large files are removed from the git repo, will the history still be correct, or will it be corrupted? In summary, the .git directory should not be 5GB when the content is smaller than 100MB.
Does anybody have experience with such migrations? Another solution I can think of is to ignore the .git history and commit the files as they are to a new repository, but then all history will be gone. So the preference is to preserve the history but remove the files that are too large. How can these files be found, by the way? The repository was created in 2013, and it is unclear which files should be removed from the history and how to do that without corrupting it.
Sample code and data
In order to reproduce this, a new git repository was created locally by running mkdir testGitMigration, cd testGitMigration and git init.
With a git repository now in place, two text files were created and an iso was downloaded:
[user@localhost testGitMigration]$ du -h *
4.0K hello
825M ubuntu-16.04.3-server-amd64.iso
4.0K world
As you can see, there is one large file, ubuntu-16.04.3-server-amd64.iso, that is larger than 800MB. Several such large files were probably added back in the day in the repository we are actually dealing with. As the .git directory contains all the history, its size will probably be more than 800MB:
[user@localhost testGitMigration]$ du -h .git/
4.0K .git/refs/heads
0 .git/refs/tags
4.0K .git/refs
0 .git/branches
40K .git/hooks
4.0K .git/info
808M .git/objects/pack
0 .git/objects/info
4.0K .git/objects/ce
4.0K .git/objects/b4
4.0K .git/objects/3b
4.0K .git/objects/55
4.0K .git/objects/53
4.0K .git/objects/cc
4.0K .git/objects/a1
4.0K .git/objects/5c
4.0K .git/objects/c7
4.0K .git/objects/fe
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/
Let's see what happens when the iso is removed:
[user@localhost testGitMigration]$ git log
commit fe7455c0eb6964772526eb848255a6eb11f2283a
Author: user <user@user.user>
Date: Wed Dec 20 21:28:00 2017 +0100

    removed iso

commit 5ce0fed4ebe891accd9a1fc3f0ee8ebd3af8d7f0
Author: user <user@user.user>
Date: Wed Dec 20 21:22:54 2017 +0100

    third file

commit 53dd97210f2b7b8270d66698bb0438d5071b0038
Author: user <user@user.user>
Date: Wed Dec 20 21:22:40 2017 +0100

    second file

commit 3b1baf9d65f051b4fc402d7375f3ff199ddd2dab
Author: user <user@user.user>
Date: Wed Dec 20 21:19:24 2017 +0100

    first file
Although the iso has been removed, the repository is still larger than 800MB. This suggests that, in the real repository, multiple large files were added back in the day and removed later, as the content of the repo is smaller than 100MB while the .git directory is larger than 5GB.
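The test setup described above can be approximated with a sketch like the following. It is not the exact sequence of commands that was run: big.iso is a hypothetical 5 MB file of random data standing in for the 825 MB Ubuntu iso, the order in which the files were committed is an assumption, and only the commit messages are taken from the git log output.

```shell
#!/bin/bash
set -e

# Throwaway repository in a temporary directory (assumption: not testGitMigration itself).
repoDir=$(mktemp -d)
cd "$repoDir"
git init --quiet
git config user.name "user"
git config user.email "user@user.user"

echo hello >hello
git add hello
git commit --quiet -m "first file"

echo world >world
git add world
git commit --quiet -m "second file"

# Hypothetical stand-in for the iso: 5 MB of random (incompressible) data.
head -c 5242880 /dev/urandom >big.iso
git add big.iso
git commit --quiet -m "third file"

# Remove the large file again in a normal commit.
git rm --quiet big.iso
git commit --quiet -m "removed iso"

# The working tree is tiny, but .git still stores the blob.
gitKB=$(du -sk .git | cut -f1)
du -sk hello world
echo ".git: ${gitKB}K"
```

With this in place, the effect from the question shows up on a small scale: the large file is gone from the working tree, yet the .git directory stays at roughly the size of that file.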
So how to get rid of these large files? In this test scenario, the expectation is that the repo ends up smaller than 1MB, as the disk usage of the files in this repo without the iso is as follows:
[user@localhost testGitMigration]$ du -h *
4.0K hello
4.0K world
The output of git log does not show the size of each commit. So how to get the size of each commit?
The following script was found at https://gist.github.com/magnetikonline/dd5837d597722c9c2d5dfa16d8efe5b9:
#!/bin/bash -e

# work over each commit and append all files in tree to $tempFile
tempFile=$(mktemp)
for commitSHA1 in $(git rev-list --all); do
    git ls-tree -r --long "$commitSHA1" >>"$tempFile"
done

# sort files by SHA1, de-dupe list and finally re-sort by filesize
sort --key 3 "$tempFile" | \
    uniq | \
    sort --key 4 --numeric-sort --reverse

# remove temp file
rm "$tempFile"
When run, it shows the following output:
[user@localhost testGitMigration]$ ./gitlistobjectbysize.sh
100644 blob 02b6feb032c58dc07eb18af81a4067fbf154cc30 865075200 ubuntu-16.04.3-server-amd64.iso
100644 blob ce013625030ba8dba906f756967f9e9ca394464a 6 hello
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 6 world
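For a repository with a long history, walking every tree of every commit could become slow. A possible alternative, given here only as a sketch, asks git for every object reachable from any ref and lets git cat-file report the sizes. The function name list_blobs_by_size is made up for this example; the git subcommands and format placeholders are standard ones.

```shell
# Hypothetical helper: print "<sha1> <size in bytes> <path>" for every blob
# reachable from any ref, largest first. Run it inside the repository.
list_blobs_by_size() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort --key 2 --numeric-sort --reverse
}
```

Calling list_blobs_by_size | head inside the repository should then show the largest blobs at the top, including files that were deleted in a later commit.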
So the large file was found! Let's remove it!
[user@localhost testGitMigration]$ git filter-branch --tree-filter 'rm -f ubuntu-16.04.3-server-amd64.iso' -- --all
Rewrite fe7455c0eb6964772526eb848255a6eb11f2283a (4/4)
Ref 'refs/heads/master' was rewritten
Now the .git size should be smaller than 1MB, right? Let's check:
[user@localhost testGitMigration]$ du -h .
4.0K ./.git/refs/heads
0 ./.git/refs/tags
4.0K ./.git/refs/original/refs/heads
4.0K ./.git/refs/original/refs
4.0K ./.git/refs/original
8.0K ./.git/refs
0 ./.git/branches
40K ./.git/hooks
8.0K ./.git/info
808M ./.git/objects/pack
4.0K ./.git/objects/info
4.0K ./.git/objects/2d
4.0K ./.git/objects/14
4.0K ./.git/objects/a8
808M ./.git/objects
4.0K ./.git/logs/refs/heads
4.0K ./.git/logs/refs
8.0K ./.git/logs
808M ./.git
808M .
Still the same? How is that possible?
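One thing that might be worth checking at this point is whether any ref still reaches the iso. The du output above lists a .git/refs/original directory, so the assumption here is that a backup ref from before the rewrite still points at the old commits. The helper below is only a sketch and its name, refs_with_path, is made up for this example.

```shell
# Hypothetical helper: print every ref whose history still contains
# a commit that touched the given path.
refs_with_path() {
    local path=$1
    git for-each-ref --format='%(refname)' |
        while read -r ref; do
            # rev-list prints nothing when no commit in this ref's history touched the path
            if [ -n "$(git rev-list -n 1 "$ref" -- "$path")" ]; then
                echo "$ref"
            fi
        done
}
```

Running refs_with_path ubuntu-16.04.3-server-amd64.iso in the test repository would show whether a ref under refs/original is still holding on to the file.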
So what about removing the entire commit? That is not an option, because it is likely that a commit containing a large file also contains other files. If the whole commit is removed, the large file is gone, but so are all the other files in that commit that need to be kept.
Perhaps a garbage collection should be done once the iso has been removed?
[user@localhost testGitMigration]$ git gc
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (14/14), done.
Total 14 (delta 5), reused 9 (delta 1)
Is the .git directory now smaller than 1MB? No:
[user@localhost testGitMigration]$ du -h .git/
0 .git/refs/heads
0 .git/refs/tags
0 .git/refs/original
0 .git/refs
0 .git/branches
40K .git/hooks
8.0K .git/info
808M .git/objects/pack
4.0K .git/objects/info
808M .git/objects
4.0K .git/logs/refs/heads
4.0K .git/logs/refs
8.0K .git/logs
808M .git/