
I have a clone. I want to reduce the history on it, without cloning from scratch with a reduced depth. Worked example:

$ git clone git@github.com:apache/spark.git
# ...
$ cd spark/
$ du -hs .git
193M    .git

OK, so that's not so bad, but it'll serve for this discussion. If I try gc it gets smaller:

$ git gc --aggressive
Counting objects: 380616, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (278136/278136), done.
Writing objects: 100% (380616/380616), done.
Total 380616 (delta 182748), reused 192702 (delta 0)
Checking connectivity: 380616, done.
$ du -hs .git
108M    .git

Still, pretty big though (git pull suggests that it's still push/pullable to the remote). How about repack?

$ git repack -a -d --depth=5
Counting objects: 380616, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (95388/95388), done.
Writing objects: 100% (380616/380616), done.
Total 380616 (delta 182748), reused 380616 (delta 182748)
$ du -hs .git
108M    .git

Yup, didn't get any smaller. --depth for repack isn't the same as --depth for clone:

$ git clone --depth 1 git@github.com:apache/spark.git
Cloning into 'spark'...
remote: Counting objects: 8520, done.
remote: Compressing objects: 100% (6611/6611), done.
remote: Total 8520 (delta 1448), reused 5101 (delta 710), pack-reused 0
Receiving objects: 100% (8520/8520), 14.82 MiB | 3.63 MiB/s, done.
Resolving deltas: 100% (1448/1448), done.
Checking connectivity... done.
Checking out files: 100% (13386/13386), done.
$ cd spark
$ du -hs .git
17M .git

Git pull says it's still in step with the remote, which surprises nobody.

OK, so how do I change an existing clone to a shallow clone, without nixing it and checking it out afresh?

paul_h
  • What do you wish to do? you wish to work on multiple branches simultaneously? this is why you do the re-clone? – CodeWizard Jul 03 '16 at 17:27
  • What's wrong with cloning it again? That said, see [Section 7.13](https://git-scm.com/book/en/v2/Git-Tools-Replace) of the Pro Git book. It walks you through splitting a repository into two, one with recent commits only, the other retaining historical data. – chepner Jul 03 '16 at 20:26
  • @CodeWizard I've filled my SSD drive, and want some space back, without deleting whole clones. – paul_h Jul 05 '16 at 00:25
  • i made a [git-shallow-maker](https://github.com/milahu/random/blob/master/git/git-shallow-maker) to copy all local branches to a new local repo. this will copy only the needed commits, so the new repo is shallow – milahu Feb 12 '23 at 13:58

5 Answers

git fetch --depth 10

this will fetch all newer commits from origin and then cut off the local history to depth of 10.

for normal purposes your local git history is now at length of 10. but beware that the files of the old commits still occupy space on your disk and that the commits still exist in the remote repository.

if your aim was to have a shorter log because you currently don't need years worth of commit history then you are done. your log will be short and most common git commands now only see 10 commits.
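A quick way to see the effect is to try it on a throwaway repository first. The sketch below is only an illustration under made-up assumptions: the scratch directory, the 20-commit linear history and the `demo` identity are all hypothetical, and the clone goes through file:// so that it behaves like a real remote.

```shell
#!/bin/sh
set -e

# Throwaway identity so commits work without any global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Build a stand-in "remote" with a linear history of 20 commits.
git init -q "$work/origin"
i=1
while [ "$i" -le 20 ]; do
  echo "$i" > "$work/origin/file.txt"
  git -C "$work/origin" add file.txt
  git -C "$work/origin" commit -q -m "commit $i"
  i=$((i + 1))
done

# Full clone first, then cut the local history down to depth 10.
git clone -q "file://$work/origin" "$work/clone"
before=$(git -C "$work/clone" rev-list --count HEAD)
git -C "$work/clone" fetch -q --depth 10
after=$(git -C "$work/clone" rev-list --count HEAD)

echo "commits visible before: $before, after: $after"
```

With a linear history the count after the fetch matches the depth. In a history with merges, more commits than the depth value can remain visible, since the depth is measured along each line of ancestry.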

if your aim was to free disk space because older commits have huge binary blobs which you don't need to work now then you have to actually remove the files from your disk. see below for a short description how to do so.

if your aim was to completely remove the old commits (for example to remove a password from old commits) then this is not the correct command to do so. the commits are still visible and accessible for all who have access to the remote repository. you need to remove the commits from the remote repository. see below for links with more info on how to remove commits from a remote repo.

to undo a --depth and get the entire history again:

git fetch --unshallow

how to free disk space

data loss warning! read the notes and pay attention to what you are doing.

after a git fetch --depth xx the files of the old commits still hang around on disk. git won't remove those files as long as some references are still holding on to those commits. so you need to remove those references. those references are, roughly in order of data pertinence: the reflog, stashes, tags, and branches.
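Before removing anything, it may help to list what is actually there. The following is a read-only sketch on a hypothetical scratch repository (the repo, the `v1` tag and the `demo` identity are invented for the example); the three listing commands at the end are the part that would be run in a real clone.

```shell
#!/bin/sh
set -e

# Throwaway identity so commits and stashes work without global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Scratch repository with one commit, one tag and one stash.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo one > file.txt
git add file.txt
git commit -q -m "first"
git tag v1
echo two > file.txt
git stash push -q

# Branches, tags and the stash ref, one per line.
refs=$(git for-each-ref --format='%(refname)')
# Number of stash entries.
stash_count=$(git stash list | wc -l)
# Number of entries in the reflog of HEAD.
reflog_count=$(git reflog | wc -l)

echo "$refs"
echo "stashes: $stash_count, reflog entries: $reflog_count"
```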

the reflog is typically safe to clear. read the notes below for when you might want to think twice before clearing the reflog.

to clear the reflog:

git reflog expire --expire=all --all

stashes should be temporary anyways. so just drop them like it's hot. note that git stash drop removes only the most recent stash. to remove all of them:

git stash clear

tags and branches typically hold data you want to keep. so be careful with the next two commands. read the notes below for more information.

to remove all tags:

git tag -l | xargs git tag -d

to remove a branch:

git branch -d branchname

beware of data loss! read the notes below and think before you delete.

once you have removed all references you can call the git garbage collector to actually remove the files of the old commits:

git gc --prune=now

now the files should be removed from disk.
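The whole sequence can be rehearsed on a throwaway clone before touching a real one. This is a minimal sketch under stated assumptions: the scratch "remote", its 20 commits and the `demo` identity are hypothetical, and the scratch clone has no tags, stashes or extra branches, so only the reflog step is needed here.

```shell
#!/bin/sh
set -e

# Throwaway identity so commits work without any global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in "remote" with 20 commits.
git init -q "$work/origin"
i=1
while [ "$i" -le 20 ]; do
  echo "$i" > "$work/origin/file.txt"
  git -C "$work/origin" add file.txt
  git -C "$work/origin" commit -q -m "commit $i"
  i=$((i + 1))
done

git clone -q "file://$work/origin" "$work/clone"
cd "$work/clone"

# Remember the oldest commit so we can check later whether it is still on disk.
oldest=$(git rev-list HEAD | tail -n 1)

# Shallow the history, clear the reflog, then let gc delete unreachable objects.
git fetch -q --depth 1
git reflog expire --expire=all --all
git gc -q --prune=now

# cat-file -e exits non-zero when the object no longer exists.
if git cat-file -e "$oldest" 2>/dev/null; then
  oldest_present=yes
else
  oldest_present=no
fi
remaining=$(git rev-list --count HEAD)
echo "oldest commit still on disk: $oldest_present, commits visible: $remaining"
```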


notes

tags and branches are often synced with the remote repo. but they can also exist in your local repo only. those that exist on the remote repo can always be fetched again if needed. those that exist only locally will be lost if you delete them.

the easiest way to backup your local tags and branches is to copy your entire local repo to another disk. you can also clone your repo locally. but make sure to include all tags and branches as a simple clone will not. see below for a link explaining how to do so.
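As one possible way to take such a backup, a mirror clone or a bundle made with --all copies every branch and tag. The sketch below uses a hypothetical scratch repository (the paths, the `feature` branch, the `v1` tag and the `demo` identity are assumptions); the two backup commands in the middle are the relevant part.

```shell
#!/bin/sh
set -e

# Throwaway identity so commits work without any global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
src="$work/repo"

# Scratch repository with an extra local branch and a local tag.
git init -q "$src"
echo one > "$src/file.txt"
git -C "$src" add file.txt
git -C "$src" commit -q -m "first"
git -C "$src" branch feature
git -C "$src" tag v1

# Option 1: a mirror clone copies all refs, not just the current branch.
git clone -q --mirror "$src" "$work/backup.git"

# Option 2: a single-file bundle containing all refs.
git -C "$src" bundle create "$work/backup.bundle" --all 2>/dev/null

# Compare the ref lists of the source and the mirror.
src_refs=$(git -C "$src" for-each-ref --format='%(refname) %(objectname)')
backup_refs=$(git -C "$work/backup.git" for-each-ref --format='%(refname) %(objectname)')
echo "$backup_refs"
```

Neither option saves stashes or the reflog, so copying the whole directory remains the most complete backup.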

the reflog is something like a local history of past local repository states. it is entirely local to your local repository. many git commands will record the previous state of the local repository in the reflog. with the reflog you can undo some commands or at least retrieve lost data if you made a mistake. so think before you clear the reflog.

old reflog entries are cleared automatically after a certain time by git garbage collector (about 90 days IIRC). tags and branches however will stay around forever. so if you want to free disk space you have to at least remove the tags and branches manually.


see also

https://linuxhint.com/git-shallow-clone-and-clone-depth/

http://gitready.com/intermediate/2009/02/09/reflog-your-safety-net.html

How do I edit past git commits to remove my password from the commit logs?

Delete all local git branches

Fully backup a git repo?

Lesmana
  • `git fetch --depth X` allows not only reduce but also increase depth of a repository – ephemerr Oct 08 '18 at 11:04
  • This works perfectly. To restore to full history, use `git fetch --unshallow`. – iwat Dec 12 '18 at 18:27
  • I got here from https://stackoverflow.com/questions/4698759/converting-git-repository-to-shallow/40452701 which is basically the same question stated more concisely. I use the reflog for recovering from mistakes a lot, and am surprised to see it recommended to remove it when it clears itself anyway. Is it important to remove? I feel like people should at least be given a warning of its importance. – fuzzyTew Aug 16 '20 at 15:54
  • it is only important to clear the reflog if you want the old commits to be removed now. which is presumably the case if you want to shallow your git repo. i mentioned it here because people will likely be confused why the disk space is still occupied after shallowing the repo. if you have a link to an article highlighting the importance of the reflog then please post it here. i will include it in my answer. – Lesmana Aug 16 '20 at 17:29
  • Here's a link from google: http://gitready.com/intermediate/2009/02/09/reflog-your-safety-net.html . Usually the reflog expires after 90 days which is pretty long ... but shouldn't use that much space unless you're storing big binary files in your repo or something. – fuzzyTew Aug 30 '20 at 18:22
  • I don't see `--prune=all` in the docs. Should it be `--prune=now` instead? – Paul Aug 12 '23 at 07:35
  • thanks for spotting that. i updated my answer. the git devs changed "all" to "now". see here https://www.spinics.net/lists/git/msg354409.html – Lesmana Aug 12 '23 at 11:31
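# shallow-clone the current repo locally (file:// is needed, --depth is ignored for plain local paths)
git clone --mirror --depth=5 "file://$PWD" ../temp
# replace the full object store with the shallow one and take over the shallow-boundary file
rm -rf .git/objects
mv ../temp/{shallow,objects} .git
rm -rf ../temp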

This really isn't cloning "from scratch": it's purely local work, and it creates virtually nothing more than the shallowed-out pack files, probably in the tens of kilobytes total. I'd venture you're not going to get more efficient than this; you'd wind up with custom work that uses more space in the form of scripts and test work than this does in the form of a few KB of temporary repo overhead.

jthill
  • I like this (and upvoted it), but it does seem a bit "chummy with the implementation", as DMR once said about something else. In particular it assumes quite a bit about the object storage and the `shallow` file, both of which are implementation details. – torek Jul 09 '16 at 03:01
  • I like or just trust the design boundaries in git and the way they support its "[full access to internals](https://www.kernel.org/pub/software/scm/git/docs/#_description)". If it's that prominent in the main description, I'll rely on it. I'd be leerier about the _contents_ of a second-order feature like shallow, but however git does its bookkeeping for that there's no reason to avoid and still less change keeping it in a file named "shallow". @torek – jthill Jul 09 '16 at 03:52
  • With this method I'm left with a lot of broken refs. It doesn't really break anything, but it seems less than ideal. I "fixed" it for me by doing a bare clone from the remote. – Jochem Fuchs Nov 17 '16 at 11:02
  • Correction, that also failed. I ended up doing a new clean clone anyway. Still I'd like to know how to prevent this. As the suggested method seems less than ideal – Jochem Fuchs Nov 17 '16 at 12:31
  • I neglected to think about the remote-tracking refs, sorry. Adding `--mirror` will take care of that. – jthill Nov 17 '16 at 15:36
  • I could fix the 'does not point to a valid object!' problems by issuing these two commands before running the code in the answer: `git tag -l | xargs git tag -d`, `git branch -rl --format '%(refname)' | sed 's|refs/remotes/||g' | xargs git branch -rd` – vsz Aug 14 '19 at 00:10

Edit, Feb 2017: this answer is now outdated / wrong. Git can make a shallow clone shallower, at least internally. Git 2.11 also has --deepen to increase the depth of a clone, and it looks as though there are eventual plans to allow negative values (though right now they are rejected). It's not clear how well this works in the real world, and your best bet is still to clone the clone, as in jthill's answer.


You can only deepen a repository. This is primarily because Git is built around adding new stuff. The way shallow clones work is that your (receiving) Git gets the sender (another Git) to stop sending "new stuff" upon reaching the shallow-clone-depth argument, and coordinates with the sender so as to understand why they have stopped at that point even though more history is obviously required. They then write the IDs of "truncated" commits into a special file, .git/shallow, that both marks the repository as shallow, and notes which commits are truncated.

Note that during this process, your Git is still adding new stuff. (Also, when it has finished cloning and exits, Git forgets what the depth was, and over time it becomes impossible even to figure out what it was. All Git can tell is that this is a shallow clone, because the .git/shallow file containing commit IDs still exists.)
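To make the role of .git/shallow concrete, here is a small sketch that creates a shallow clone of a scratch repository and prints the file. The scratch "remote", its 20 commits, the depth of 3 and the `demo` identity are assumptions for the example only.

```shell
#!/bin/sh
set -e

# Throwaway identity so commits work without any global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in "remote" with a linear history of 20 commits.
git init -q "$work/origin"
i=1
while [ "$i" -le 20 ]; do
  echo "$i" > "$work/origin/file.txt"
  git -C "$work/origin" add file.txt
  git -C "$work/origin" commit -q -m "commit $i"
  i=$((i + 1))
done

# Shallow clone: only the 3 most recent commits are transferred.
git clone -q --depth 3 "file://$work/origin" "$work/clone"

# The file lists the commits whose parents were cut off.
boundary=$(cat "$work/clone/.git/shallow")
# In the full repository, the third commit from the tip is HEAD~2.
expected=$(git -C "$work/origin" rev-parse HEAD~2)

echo "recorded in .git/shallow: $boundary"
echo "HEAD~2 in the full repo:  $expected"
```

Note that the file records commit IDs, not the depth number itself, which matches the point above that Git forgets what the depth was.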

The rest of Git continues to be built around this "add new stuff" concept, so you can deepen the clone, but not increase its shallowness. (There's no good, agreed-upon verb for this: the opposite of deepening a pit is filling it in, but fill has the wrong connotation. Diminish might work; I think I'll use that.)

In theory, git gc, which is the only part of Git that ever actually throws anything out,[1] could perhaps diminish a repository, even converting a full clone into a shallow one, but no one has written code to do that. There are some tricky bits, e.g., do you discard tags? Shallow clones start out sans tags for implementation reasons, so converting a repository to shallow, or diminishing an existing shallow repository, might call for discarding at least some tags. Certainly any tag pointing to a commit wiped out by the diminish action would have to go.


Meanwhile, the --depth argument to git-pack-objects (passed through from git repack) means something else entirely: it's the maximum length of a delta chain, when Git uses its modified xdelta compression on Git objects stored in each pack-file. This has nothing to do with the depth of particular parts of the commit DAG (as computed from each branch head).
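The difference can be checked on a scratch repository: repacking with a small --depth leaves the number of commits untouched and only affects how objects are delta-compressed inside the pack. A sketch, assuming a hypothetical 20-commit repo and `demo` identity:

```shell
#!/bin/sh
set -e

# Throwaway identity so commits work without any global config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# 20 commits, each appending one line to the same file.
i=1
while [ "$i" -le 20 ]; do
  echo "line $i" >> file.txt
  git add file.txt
  git commit -q -m "commit $i"
  i=$((i + 1))
done

before=$(git rev-list --count HEAD)
# --depth here limits delta chain length, not commit history.
git repack -a -d -q --depth=5
after=$(git rev-list --count HEAD)
echo "commits before repack: $before, after: $after"

# Summary lines of verify-pack show how many objects sit at each delta chain length.
git verify-pack -v .git/objects/pack/pack-*.idx | grep -E 'chain length|non delta' || true
```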


[1] Well, git repack winds up throwing things out as a side effect, depending on which flags are used, but it's invoked this way by git gc. This is also true of git prune. For these two commands to really do their job properly, they need git reflog expire run first. The "normal user" end of the clean-things-up sequence is git gc; it deals with all of this. So we can say that git gc is how you discard accumulated "new stuff" that turned out to be unwanted after all.

Community
torek

OK, here's an attempt to script it. It ignores non-default branches, and also assumes the remote is called 'origin':

#!/bin/sh

set -e

# Path to the working tree whose .git directory should be slimmed down.
repo=$1

# Inspect the existing clone. git -C avoids cd-ing in and out, which broke for nested paths.
changed_lines=$(git -C "$repo" status --porcelain | wc -l)
ahead_of_remote=$(git -C "$repo" status | grep "Your branch is ahead" | wc -l)
remote_url=$(git -C "$repo" remote get-url origin)
latest_sha=$(git -C "$repo" rev-parse HEAD)

if [ "$changed_lines" -gt 0 ]
then
  echo "Uncommitted or untracked changes - won't make the clone slimmer in that situation"
  exit 1
fi

if [ "$ahead_of_remote" -gt 0 ]
then
  echo "Local commits not in the remote - won't make the clone slimmer in that situation"
  exit 1
fi

# Make a fresh depth-1 clone in a scratch directory, without checking out any files.
mkdir .git_slimmer
git clone --no-checkout --depth 1 "$remote_url" .git_slimmer/foo
latest_sha_for_new=$(git -C .git_slimmer/foo rev-parse HEAD)

# POSIX sh compares strings with "=", not "==".
if [ "$latest_sha" = "$latest_sha_for_new" ]
then
  # Swap the shallow .git in for the full one, then rebuild the index from the working tree.
  mv "$repo/.git" "$repo/.gitOLD"
  mv ".git_slimmer/foo/.git" "$repo/"
  rm -rf "$repo/.gitOLD"
  git -C "$repo" add .
else
  echo "SHA from head of existing git clone does not match the latest one from the remote: do a git pull first"
  rm -rf .git_slimmer
  exit 1
fi

rm -rf .git_slimmer

Use: 'git-slimmer.sh <folder_containing_git_repo>'

paul_h

I followed the steps of the top answer. The repo size was reduced, but not as much as with git clone --depth. Eventually I figured out that the remote branch references were stopping git gc from doing its job; deleting the remote-tracking branches works like a charm:

git branch -rd $(git branch -r | grep -v 'origin/HEAD')

Friendly note: I posted this as a new answer since I don't have enough reputation to comment on the original answer. Anyone is welcome to copy or link this answer as a comment to improve the original answer.

Wooork