
I have this site hosted on GitHub (gh-pages): http://ktsaou.github.io/blocklist-ipsets/. It is one repo with 2 branches: master (where the source files are stored) and gh-pages (mostly JSON files for the site). Hundreds of files are updated automatically every few minutes, and about 200 commits are made per day to both branches. Some files change on every commit; others never change.

The problem is that the repo grows very quickly. In just a few days each branch takes about 1GB on disk. I now run git gc --aggressive --prune=now daily to keep the size down, and when a branch reaches 1GB I wipe everything and start from zero.
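For reference, the daily routine looks roughly like the sketch below. The function names and the 1 GiB default are illustrative assumptions, not the exact job I run; the sketch only reports when a reset is due and leaves the reset itself to the caller.

```bash
#!/bin/bash
# Hypothetical helper: print the size of a repo's .git directory in KiB.
git_dir_kb() {
    du -sk "$1/.git" | cut -f1
}

# Hypothetical helper: succeed when .git is at or above the limit (in KiB).
needs_reset() {
    [ "$(git_dir_kb "$1")" -ge "$2" ]
}

# Hypothetical daily job: repack aggressively, then report whether
# the repo has reached the size at which I would start from zero.
daily_maintenance() {
    local repo="$1"
    local limit_kb="${2:-1048576}"   # assumed default limit: 1 GiB

    git -C "$repo" gc --aggressive --prune=now --quiet

    if needs_reset "$repo" "$limit_kb"
    then
        echo "reset needed"
    else
        echo "ok"
    fi
}
```

Calling `daily_maintenance /path/to/repo` from cron would print either "ok" or "reset needed" once the garbage collection has finished.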

I have tried every method I could find to shrink the repo, without luck.

Check this example:

#!/bin/bash
cd /tmp/test1 || exit 1
[ -d .git ] && rm -rf .git

git init
touch file.txt
git add file.txt

for x in {1..20}
do
        # start the file over with a line unique to this commit
        echo "commit $x" >file.txt

        # append a big file with its lines in random order,
        # so the content differs on every commit
        sort -R /var/log/messages >>file.txt

        # commit it
        git commit file.txt -m "commit $x"

        if [ $x -eq 1 ]
        then
                echo
                echo "Size after $x commits:"
                du -s -h .git

                git gc --aggressive --prune="now"

                echo
                echo "Size after $x commits and aggressive garbage collection:"
                du -s -h .git
        fi
done

#git log | cat

echo
echo "Size after $x commits:"
du -s -h .git

git gc --aggressive --prune="now"

echo
echo "Size after $x commits and aggressive garbage collection:"
du -s -h .git

# graft: list the latest commit with no parents, cutting off all older history
git rev-parse HEAD >.git/info/grafts

# rewrite the branch so the graft becomes permanent
git filter-branch -- --all

echo
echo "Size after $x commits and graft:"
du -s -h .git

git gc --aggressive --prune="now"

echo
echo "Size after $x commits, graft and aggressive garbage collection:"
du -s -h .git

Here is the output:

# bash test.sh
Initialized empty Git repository in /tmp/test1/.git/
[master (root-commit) e63c3a5] commit 1
 1 file changed, 11926 insertions(+)
 create mode 100644 file.txt

Size after 1 commits:
560K    .git
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)

Size after 1 commits and aggressive garbage collection:
252K    .git
[master b054c76] commit 2
 1 file changed, 11825 insertions(+), 11825 deletions(-)
[master ba5eae5] commit 3
 1 file changed, 11774 insertions(+), 11774 deletions(-)
[master ad5842f] commit 4
 1 file changed, 11795 insertions(+), 11795 deletions(-)
[master 8edcf5f] commit 5
 1 file changed, 11797 insertions(+), 11797 deletions(-)
[master 09fefb6] commit 6
 1 file changed, 11793 insertions(+), 11793 deletions(-)
[master 26a89b9] commit 7
 1 file changed, 11791 insertions(+), 11791 deletions(-)
[master a5569ae] commit 8
 1 file changed, 11810 insertions(+), 11810 deletions(-)
[master 9120440] commit 9
 1 file changed, 11785 insertions(+), 11785 deletions(-)
[master b6c17ed] commit 10
 1 file changed, 11815 insertions(+), 11815 deletions(-)
[master 493ea14] commit 11
 1 file changed, 11838 insertions(+), 11838 deletions(-)
[master f41e066] commit 12
 1 file changed, 11832 insertions(+), 11832 deletions(-)
[master 9cb0c1a] commit 13
 1 file changed, 11803 insertions(+), 11803 deletions(-)
[master 8160cf1] commit 14
 1 file changed, 11803 insertions(+), 11803 deletions(-)
[master c7563a8] commit 15
 1 file changed, 11796 insertions(+), 11796 deletions(-)
[master e57c5e1] commit 16
 1 file changed, 11824 insertions(+), 11824 deletions(-)
[master 4a55c03] commit 17
 1 file changed, 11807 insertions(+), 11805 deletions(-)
[master a23ad81] commit 18
 1 file changed, 11791 insertions(+), 11791 deletions(-)
[master f504fe8] commit 19
 1 file changed, 11817 insertions(+), 11817 deletions(-)
[master 3f10dde] commit 20
 1 file changed, 11783 insertions(+), 11783 deletions(-)

Size after 20 commits:
4,9M    .git
Counting objects: 60, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (60/60), done.
Total 60 (delta 19), reused 2 (delta 0)

Size after 20 commits and aggressive garbage collection:
1,9M    .git
Rewrite 3f10ddeca786824d43988becd99990ab039b34d3 (1/1)
Ref 'refs/heads/master' was rewritten

Size after 20 commits and graft:
1,9M    .git
Counting objects: 61, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (41/41), done.
Writing objects: 100% (61/61), done.
Total 61 (delta 20), reused 41 (delta 0)

Size after 20 commits, graft and aggressive garbage collection:
1,9M    .git

As you can see, grafting did not solve the issue. The size of the repo after grafting is exactly the same as before grafting.
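One thing I have not ruled out: git filter-branch saves the pre-rewrite refs under refs/original/, so the old commits may still be referenced after the rewrite. The object count in the last gc above (61, one more than before) is at least consistent with that. Below is a minimal sketch to list those backup refs; the function name is hypothetical, and whether removing them is enough to shrink the repo is an assumption I have not verified.

```bash
#!/bin/bash
# Hypothetical helper: list the backup refs that git filter-branch
# leaves under refs/original/ (prints nothing when there are none).
list_backup_refs() {
    local repo="$1"
    git -C "$repo" for-each-ref --format='%(refname)' refs/original/
}
```

Running `list_backup_refs /tmp/test1` right after the filter-branch step would show whether a copy of the old master ref is still around.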

Any ideas how to shrink a repo?

Costa Tsaousis
  • I edited the question. Truncating the commit log does not shrink the repo. It remains exactly the same. The question you suggested talks about truncating the commit log, which does not help in my case (example script given to reproduce it). – Costa Tsaousis Aug 10 '15 at 14:52
  • How many developers are working in parallel to have hundreds of files updated every few minutes? I have the feeling you're not using git to hold your source code, but your data. Why don't you use (and back up) a database to hold the data? Or just files on your server, backed up. – JB Nizet Aug 10 '15 at 16:50
  • As I said, it holds a site. Browse it at http://ktsaou.github.io/blocklist-ipsets/. The site's data is not that big, just a few MB, but it is updated frequently. From your response, I understand that it is not possible to shrink a git repo. Is zapping it and starting from zero the only solution? – Costa Tsaousis Aug 10 '15 at 21:14
  • No, my answer didn't imply that. My answer implied that git is typically used to hold source code, not data, especially if you don't care about the history of the data. You could thus keep these few MB off GitHub, as files on some host or in a database, and not in your git repo. Stack Overflow doesn't keep all the questions and answers as files in a git repo. It keeps them in a database. The git repo contains the source code used to build the application. What I'm suggesting is that you do the same. – JB Nizet Aug 10 '15 at 21:26
  • Thank you for your suggestions. I disagree, though. If a site can be made static, it is a total waste of resources to make it dynamic. It makes a huge difference in operations, in development, in maintenance, and in customer experience. Even at Stack Overflow, the questions and answers are most probably stored in a NoSQL DB, which is closer to a git repo, or a filesystem, than to a relational database. Anyway, I already know exactly what I am asking. If there were no size limitation on GitHub, a git repo would be ideal for tracking the changes of IP lists. It would be a nightmare to do it otherwise. – Costa Tsaousis Aug 11 '15 at 06:59
