
I have a git repository that, freshly checked out, takes around 2.3 GiB even in the shallowest configuration, of which 1.9 GiB is inside .git/objects/pack. The working tree files themselves are only about 0.5 GiB.
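
Roughly how those numbers were measured (GNU du, from the top of the working tree):

du -sh .git/objects/pack      # ~1.9 GiB
du -sh --exclude=.git .       # ~0.5 GiB of working tree files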

Considering I have a remote from which I can re-fetch all the objects if needed, the question is:

  • What (and how) can I delete from inside .git everything that I could then re-fetch safely, with simple git commands, from the remote?

Testing a bit, I found out that if I delete everything under .git/objects/pack/, it will be re-downloaded from the remote with a simple git fetch.
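
Roughly, that test was (run from the top of the working tree):

rm -rf .git/objects/pack      # destructive: drop the local packfiles
git fetch                     # re-downloads them from the remote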

There are some complaints like:

error: refs/heads/master does not point to a valid object!
error: refs/remotes/origin/master does not point to a valid object!
error: refs/remotes/origin/HEAD does not point to a valid object!

But then .git/objects/pack gets repopulated and further calls to git fetch don't complain anymore.

Is it safe to nuke .git/objects/pack* like this?

Assumptions:

  • There are no local-only commits in the repo or any form of git manipulation (like adding/removing objects from the stage), just checking out a specific branch in shallow mode.
  • The remote won't be rewriting history for the checked out branches.
  • I have no control whatsoever over the contents of the remote repository itself. It's a dependency of my project, but a fast changing one that is only available as git, and I want instructions for automated use in a continuous integration setting. Tips on how to modify the repository itself to make it take less space aren't going to help.
  • As I mentioned earlier, 1.9 GiB is for a shallow clone of the one branch I'm interested in. It's a lot bigger than that when non-shallow, due to its long history (an open-source project that is over 10 years old).
  • There are other repositories checked out in the same continuous-integration pipeline and I'd like to apply the same reduction of redundant-with-remote info in all of them.

The intent is to reduce as much as possible the amount of space taken by artifacts from a continuous-integration pipeline, while retaining enough information so that those artifacts can be downloaded and restored to working order on a developer workstation with as few (and as normal) commands as possible.

LeoRochael
  • Large hard drives are pretty cheap compared to a developer's time trying to micro-manage what Git is doing. – crashmstr Feb 08 '17 at 20:58
  • I don't remember the details but iirc you can basically only clone the most recent history, but I don't know if you can switch back and forth. – MikeMB Feb 08 '17 at 21:05
  • What sort of stuff are you putting into this repository? Are there images and video and office files? Are you compressing things? All of these things can bloat a repository. – Schwern Feb 08 '17 at 21:17
  • Possible duplicate of [Converting git repository to shallow?](http://stackoverflow.com/questions/4698759/converting-git-repository-to-shallow) – Jonas Schäfer Feb 08 '17 at 21:23
  • @crashmstr, large drives are cheap until you start managing a CI pipeline that is creating lots of artifact bundles, one for each build, with not much freedom to control the pipeline infrastructure itself or the remote repositories I'm fetching. And I don't want to micromanage manually what git is doing. I want automated commands I can run blindly that shrink `.git` beyond what `--depth` can do, but are recoverable later. – LeoRochael Feb 09 '17 at 21:37
    "working tree files are just about .5 GiB" - was it 0.5? So, several times _less_ than shallow pack size? I wonder how it could be ever be so. But in case they are even equal, what's the point to use git at all? Download archive then. – max630 Feb 09 '17 at 23:48

3 Answers


Deleting stuff in .git/ is just going to break things. That contains the complete history of your project and those pack files are how git saves space. There are far, far better ways to reduce the size of your repo.

First is to run garbage collection, git gc. This will do a number of things to reduce the size of the repository on your disk. You shouldn't normally have to run it yourself, since Git runs it periodically, but it might help.
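
For example, a more aggressive manual pass (it can be slow on large repositories):

git gc --aggressive --prune=now    # repack everything and drop unreachable objects right away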

If it doesn't, try a shallow clone where you only get part of the history. This clones only the latest 100 commits of the default branch.

git clone --depth=100 <remote>

Similarly, you can just clone one branch.

git clone --single-branch --branch master <remote>

These can be "deepened" later with git-fetch.
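
For example, deepening or fully un-shallowing an existing shallow clone might look like:

git fetch --deepen=100    # fetch 100 more commits of history (the depth value is just an example)
git fetch --unshallow     # or fetch the entire remaining history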

But the best thing to do is to reduce the size of your repo. Git is very efficient on space, and 2 gigs is enormous. It suggests there are a lot of very large binary files in the repository: images, videos, spreadsheets, and compressed files... which Git cannot compress efficiently. To handle this there are two tools, git-lfs (Large File Storage) and the BFG Repo Cleaner.

git-lfs lets you store old versions of large files in cloud storage rather than in everyone's .git directory. This can immensely reduce the size of the repository... going forward.
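
For new commits, tracking large files with git-lfs looks roughly like this (the *.mp4 pattern is only an example):

git lfs install           # one-time setup per machine
git lfs track '*.mp4'     # records the pattern in .gitattributes
git add .gitattributes
git commit -m "Track *.mp4 files with git-lfs"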

BFG Repo Cleaner lets you easily rewrite history, including options to remove large files.

Put them together, and you can use the BFG Repo Cleaner to change existing large files to use git-lfs. This can immensely reduce the size of your repository. For example, this would change all *.mp4 to use git-lfs.

$ java -jar ~/bfg-1.12.15.jar --convert-to-git-lfs '*.mp4' --no-blob-protection

Instructions for that can be found here.

The other important thing is to not compress files. You mentioned continuous integration artifacts, and I'm willing to bet they're compressed. Git will do its own more efficient compression, and it can ensure there's only ever one copy of a file through history, but it can only do it on text. Unpack tarballs and zipfiles before committing them.
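
As a sketch (the archive name is hypothetical), commit the expanded contents instead of the archive:

mkdir -p artifacts
tar -xzf artifacts.tar.gz -C artifacts   # unpack so Git can deduplicate and delta-compress the contents
rm artifacts.tar.gz                      # don't commit the compressed archive itself
git add artifacts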


If you absolutely cannot reduce the size of the repository, your remaining option is to have everyone share one .git directory. You can do this with the --git-dir option, or by setting GIT_DIR.

git --git-dir=/path/to/the/.git log

This is a terrible idea. While everyone can have their own checkout, they'll all be sharing the same repository state. If one dev makes a change, the other devs will see it, but now with a different working directory.

For example, dev1 adds a file.

$ touch this
$ GIT_DIR=~/tmp/foo/.git git add this
$ GIT_DIR=~/tmp/foo/.git git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   this

Then dev2 suddenly sees this.

$ GIT_DIR=~/tmp/foo/.git git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   this

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    deleted:    this

They're sharing the same staging area, but not the same working copy. Devs will be stumbling over each other constantly.


If git clone --depth=1 is still producing repos that are too big, then there's simply a lot of data in each checkout. There isn't much you can do about that. If .git on a shallow clone is 2 gigs, then the checkout is going to be even larger.

As for the idea of performing surgery on .git, maybe you could get away with deleting some objects and hope that a git fetch --deepen can fix it, but maintaining this across multiple devs... it's a maintenance nightmare.

At that point, you might as well just delete .git entirely. Now you've effectively exported the latest commit. There are various ways to do this directly.
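
For example, git archive exports a commit with no repository metadata at all (the output file name is arbitrary):

git archive -o snapshot.zip HEAD    # export the current HEAD as a plain zip, without any .git directory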

Or just stop wasting time and money and buy bigger hard drives. Every person-hour spent on this is a hard drive you could have bought.

Schwern
  • Thank you for your detailed answer, but as I have now clarified in the question, I have no control over contents of the repository, and shallow-cloning has already been tried. – LeoRochael Feb 09 '17 at 12:40
  • @LeoRochael If you can `git push --force` you can do everything I mentioned above. You can even go back in history and unpack compressed files with [`git filter-branch`](https://git-scm.com/docs/git-filter-branch). If `git clone --depth=1` is giving you a 1.9 GB checkout, then that's how much data there is. Git can't change that. You don't have much choice but to change how the repo is being used or get a bigger hard drive. – Schwern Feb 09 '17 at 20:29
  • I can't do `git push --force`, it's not my repo, and I have many repos to which I want to apply this operation. And as I clarified, I'm interested in what destructive operations I can do to `.git` in an automated manner to reduce its size in a way that is later recoverable. Tips on how to reduce the size of the remote won't help me here, unfortunately. – LeoRochael Feb 09 '17 at 21:33
  • @LeoRochael If `git clone --depth=1` and `git gc` don't help, there isn't much you can do. `.git` has to at least have a compressed copy of the complete checkout. There is one option that I'll edit in: have everyone use a shared `.git` directory by setting the `GIT_DIR` environment variable. This is going to cost you a lot more hassle than just buying a bigger hard drive. – Schwern Feb 09 '17 at 21:49
  • The space saving I want to achieve is inside the artifacts generated by a continuous-integration server. Artifacts are compressed archives (tar.gz or zip) of the result of the continuous integration run. Sharing repositories between developers is not the issue here, nor is it possible to share repositories between c.i. runs as I don't control directly the c.i. environment. Deleting the `.git` and recording the hash of the .git is one solution (would work like `git submodules`). I'm wondering if there is an easier one. – LeoRochael Feb 10 '17 at 13:16

reduce as much as possible the amount of space taken by artifacts from a continuous-integration pipeline, but retaining enough information so that those artifacts could be downloaded and restored to working order on a developer workstation with as few (and as normal) commands as possible

I don't fully understand your case, but one often forgotten way of reducing network data size and the server's memory use is to:

  • distribute some stable repository (which includes only branches which are not rewritten), and then
  • use --reference <path> when cloning.

In normal development conditions (text files, not all of them updated in each commit) it's way more efficient than using shallow clones.
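
A minimal sketch of that setup (the cache path and URL below are hypothetical):

# Keep a stable mirror of the repository somewhere on the host:
git clone --mirror https://example.com/big/repo.git /srv/git-cache/repo.git

# Later clones borrow objects from it instead of re-downloading them:
git clone --reference /srv/git-cache/repo.git https://example.com/big/repo.git workdir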

As for the thing you asked, I think it makes no sense to try to save space by removing anything from the repository. Most of the data is in the pack, which is needed, and the rest is insignificant.

PS: the repository can be initialized in temporary storage just by git itself:

CACHE_REPO=/tmp/repo
if [ ! -d "$CACHE_REPO" ]; then
  git clone --single-branch --no-checkout --branch=_BRANCH_ _REMOTE_ "$CACHE_REPO"
fi

_BRANCH_ is master or some other branch you are sure will not be force-pushed. You can try making it shallow; it might or might not work, I'm not sure about it.
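
An untested sketch of that shallow variant, simply adding --depth=1 to the same clone:

# Untested: a shallow cache clone; it may not be usable for every later operation
git clone --single-branch --no-checkout --depth=1 --branch=_BRANCH_ _REMOTE_ "$CACHE_REPO"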

max630
  • Using `--reference` would be nice, but I don't control the environment where the CI pipeline is run. – LeoRochael Feb 10 '17 at 12:54
  • But you can specify the batch program to run there? Can you use something like `CACHE_DIR=/tmp/foo; if [ ! -d "$CACHE_DIR" ]; then curl blabla.tar.gz | tar -xzf - -C /tmp/foo; fi` to fill the cache automatically? – max630 Feb 10 '17 at 13:10
  • I can specify what gets run during the pipeline build, but I'd then have to host `blabla.tar.gz` somewhere. I was hoping to limit my interventions to the commands run inside the pipeline. – LeoRochael Feb 10 '17 at 13:18
  • Actually, you can just clone it. I updated the answer – max630 Feb 10 '17 at 13:27

  • What (and how) can I delete from inside .git everything that I could then re-fetch safely, with simple git commands, from the remote?

How about everything?

If you don't want to worry about the internals of .git and whether something is recoverable or not, you can save just enough information to check it all out again and restore the workspace to a state functionally similar to the one it had when the C.I. pipeline ran.

In the C.I. Pipeline

Add a file like this somewhere (let's call it degit.sh):

#!/bin/bash
set -ex
GIT_REMOTE=$( git remote get-url origin )
GIT_BRANCH=$( git rev-parse --abbrev-ref HEAD )
GIT_COMMIT=$( git rev-parse HEAD )

# TABs, not spaces, indenting the block below:
cat <<-EOF > .gitrestore
    set -ex
    test ! -e .git
    tmpclone=\$( mktemp -d --tmpdir=. )
    git clone $GIT_REMOTE -n --branch=$GIT_BRANCH \$tmpclone
    ( cd \$tmpclone ; git reset --hard $GIT_COMMIT )
    mv \$tmpclone/.git .
    rm -rf "\$tmpclone"
    rm -f \$0
EOF

rm -rf .git

Then, inside the root of each of the git repos in your Continuous Integration (C.I.) workspace, you call it so that it generates a .gitrestore file.

It will look something like this:

set -ex
test ! -e .git
tmpclone=$( mktemp -d --tmpdir=. )
git clone git@example.com:example/repo.git -n --branch=example-branch $tmpclone
( cd $tmpclone ; git reset --hard example-commit-hash )
mv $tmpclone/.git .
rm -rf "$tmpclone"
rm -f $0

Notice that it self-destructs after running successfully. You don't want to run it twice.

On the Developer Machine

Now your developer can fetch the C.I. artifacts and run, inside each repository:

bash .gitrestore

And she will have a repository that looks very much like what the C.I. pipeline had, except for an updated view of the remotes, which allows her to compare what the C.I. had with what she has.
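
For instance, after the restore she could check how far upstream has moved since the C.I. build (branch name as in the example above):

git fetch origin
git log --oneline HEAD..origin/example-branch   # commits upstream has that the artifact doesn't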

Other considerations

This assumes that only the C.I. machine is constrained for space, not the developer machine (nor her bandwidth).

If you want to save space/bandwidth on the developer end, you can pass --depth=1 to the git clone call, which will clone only the specified branch (i.e., it implies --single-branch) and will restrict the history to a single commit.
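
Inside the generated .gitrestore that change would look roughly like this (same placeholders as before):

# --depth=1 implies --single-branch; note the later "git reset --hard" only works
# while example-commit-hash is still the tip of example-branch on the remote
git clone git@example.com:example/repo.git -n --depth=1 --branch=example-branch $tmpclone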

LeoRochael