
I have found the answer to "How to shrink a Git repo"; my question is when the right point in time to do it is. Here is some context that may help:

  • We have a small project with ~10 people working on the code: 4 locally in Germany, 6 remotely in China.
  • The repository was created anew a year ago (no history) with the source code (mostly Java) of our project.
  • We have a relatively straightforward process:
    • Developers work locally on a feature branch (shared with others). If necessary, they add a developer branch (that is pushed as well) to avoid loss of data.
    • When the feature is finished, the feature branch is merged into master and removed some time later (see the sketch after this list).
  • The repository is now 4.5 GB in size, which is a burden on our local network but even worse when working remotely.
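
For clarity, here is a rough sketch of that workflow in git commands (the branch names are invented for illustration):

```
# Sketch of the workflow described above; branch names are examples only
git checkout -b feature/foo           # shared feature branch
git push -u origin feature/foo
git checkout -b dev/alice-foo         # optional developer branch, pushed to avoid data loss
git push -u origin dev/alice-foo

# once the feature is finished:
git checkout master
git merge feature/foo
git push origin master
git push origin --delete feature/foo  # the feature branch is removed some time later
```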

When is the right time to shrink the repository?

  • Why exactly is the overall size of the repository considered a burden when you almost never need to clone it as a whole? Don't you have some issues in your workflow? Why is it so large? Don't you store large non-source-code data in there? How large is the data itself? Also, I'm afraid the question of *when* to do it can only be answered by you. – Pavel Šimerda Sep 21 '14 at 16:19
  • I am not working on the project, so I only heard about it lately. Cloning over the internet (1 MBit connection) is a burden, so the remote workers have to use different strategies. I will invest some time in finding out where the size comes from, but that is a different question. Do you think there are strategies available that help manage that? – mliebelt Sep 21 '14 at 16:30
  • That sounds like a clone would take more or less ten hours. That sounds unbearable even for a large project. The best strategy is to only put source code in your git repository. If you need version control of large blobs, use one or more separate repositories or submodules for those. I'm just looking at a repo of the linux kernel and the `.git` directory has less than a gigabyte. I guess it's the largest project currently cloned in my computer. If you have a huge project with lots of developers and gigabytes of code, it probably deserves splitting as well. – Pavel Šimerda Sep 21 '14 at 16:38
  • Or you can live with the ten hours for a clone, as the developers need to do it only once anyway. Just instruct them never to delete their local copy and instead carry it over across installations and hardware replacements. – Pavel Šimerda Sep 21 '14 at 16:39
  • I don't think that the question (or answer) is based on opinions. There are hard facts here: remote development (> 20,000 km, 4.5 GB repo). But perhaps it is only a problem we face ... Of course, we have to check the reason for the size and how to shrink it, but when the right point is (after a release, ...) could be a valid question. – mliebelt Sep 22 '14 at 18:56
  • When the right point is, is certainly an opinion-based question, and so is which files belong in git. The question is certainly valid but, as I learned recently, SE is trying to avoid this type of question. – Pavel Šimerda Sep 23 '14 at 11:43

1 Answer


For comparison: the Linux kernel repository, which is the largest Git repository I know of, has almost 470k commits and more than 4k contributors. It took 1.15 GB when I checked it out. After a `git gc --aggressive` its size went down to 858 MB.
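
If you want to run the same comparison on your repository, something like this reports the size of the object database before and after repacking (a minimal sketch; try it on a throwaway clone first):

```
git count-objects -vH     # size of the object database before
git gc --aggressive       # repack and prune loose objects
git count-objects -vH     # size after
```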

You certainly have files in your repository that don't belong there. I'm primarily thinking of various binary files. These should be stored elsewhere if they take too much space.
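
To find out which blobs actually take up the space, you can list the largest objects anywhere in the history, for example along these lines (a sketch using standard git plumbing; adjust the number of results as needed):

```
# List the 20 largest blobs in the whole history, with their paths
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '/^blob/ {print $3, $4}' |
  sort -rn |
  head -20
```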

If you happen to store compiled files, you should remove them from the repository and add the corresponding patterns to your `.gitignore` file. As a rule of thumb, files that can be generated from other files in the repository and that take up space or are binary shouldn't be committed.
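
For a typical Java build the change could look roughly like this (the directory names `target/` and `build/` are assumptions about your build tool; adapt them):

```
# Ignore generated build output from now on (example patterns for Maven/Gradle)
cat >> .gitignore <<'EOF'
target/
build/
*.class
*.jar
EOF

# Stop tracking already-committed build output without deleting it from disk
git rm -r --cached target/ build/
git commit -m "Stop tracking generated build output"
```

Note that this only keeps future commits clean; the old copies remain in the history, which is where a history rewrite (see below) comes in.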

I just found this tool: BFG Repo-cleaner. It's a helper tool that lets you rewrite your history by removing problematic files. You could use it to remove the files that don't belong there.
Take care though: rewriting history means most commits will get a different SHA-1 hash, so everyone on your team will have to switch repositories at the same time. You generate the new repo, and then everyone has to abandon the old one and use the new one from then on.
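
A typical BFG run, roughly as described in its documentation, looks like this (the repository URL and the 100M threshold are just examples):

```
# Work on a fresh mirror clone
git clone --mirror git://example.com/project.git
java -jar bfg.jar --strip-blobs-bigger-than 100M project.git
# or: java -jar bfg.jar --delete-files '*.jar' project.git

# Expire the old objects and repack, then publish the rewritten history
cd project.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
```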

But: cloning a repository shouldn't be problematic in the first place. You are supposed to clone a repository only once. If you need a second copy for whatever reason, clone it from the first one or just copy its `.git` directory.
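
For example, a second working copy can be cloned from the one you already have instead of from the server (the paths are examples):

```
# Clone from an existing local clone; on the same filesystem git hardlinks
# the objects, so this is fast and costs almost no extra disk space
git clone /path/to/existing/project /path/to/second-copy
```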

Likewise, the remote people only need to clone the repository once (so you transfer these 4.5 GB only once between Germany and China). After that, the people in China can clone it from each other locally and just switch the upstream remote afterwards.
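
As a sketch (host names and URLs are invented), a colleague in China could bootstrap from a nearby machine and then point the remote back at the central server:

```
# Clone from a colleague's machine on the local network
git clone ssh://colleague-host/srv/git/project.git project
cd project

# Afterwards, switch "origin" back to the central repository
git remote set-url origin ssh://git.example.com/project.git
git fetch origin
```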

In conclusion, I don't know if cleaning the repository is worth it in the first place, since you're not supposed to clone it very often.

Lucas Trzesniewski