New repo with copied history of only currently tracked files

Question

Our current repo has tens of thousands of commits, and a fresh clone transfers nearly a gigabyte of data (the history contains lots of jar files that have since been deleted). We'd like to cut this size down by making a new repo that keeps the full history for just the files that are currently active in the repo, or possibly by modifying the current repo to clear the deleted-file history. But I'm not sure how to do this in a practical manner.

I've tried the script in Remove deleted files from git history:

for del in `cat deleted.txt`
do
    git filter-branch --index-filter "git rm --cached --ignore-unmatch $del" --prune-empty -- --all
    # The following seems to be necessary every time
    # because otherwise git won't overwrite refs/original
    git reset --hard
    git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now
done;

But given that we have tens of thousands of deleted files in the history and tens of thousands of commits, running the script would take an eternity. I started running this for just ONE deleted file 2 hours ago and the filter-branch command is still running. It's going through each of the 40,000+ commits one at a time, and this is on a new MacBook Pro with an SSD.

I've also read the page https://help.github.com/articles/remove-sensitive-data but this only works for removing single files.

Has anyone been able to do this? I really want to preserve the history of currently tracked files; I'm not sure the space savings would be worth creating a new repo if we can't keep the history.

You might be able to do something with `git filter-branch --prune-empty --tree-filter` using a script that compares every file in the tree against the list of files you want to keep (i.e. the currently tracked ones) and does a `git rm -f` on any files that you don't want. That will remove unwanted files at each commit in the history. — Jonathan Wakely, Jul 27 '13 at 19:45
@Brent please add to your question ***the exact script*** that you mentioned you tried. The `--index-filter` option to `git filter-branch` is supposed to run fast, so I'm surprised that you find it to be too slow. — , Jul 27 '13 at 21:26
If you have 10s of 1000s of deleted files - the script you're using will run git filter-branch 10s of 1000s of times. If you also have 10s of 1000s of commits - that means you're currently trying to (re)process many-millions of commits. — AD7six, Jul 27 '13 at 21:37
@Cupcake Yes that is the script I'm running, I've updated my question to include that. The filter-branch command is still running for my first deleted file, and it's been more than 2 hours since I started it. I'm on a new Macbook pro with SSD. Given that this command goes through each commit in the repo one by one I don't know how it could be expected to run fast. — Brent Sowers, Jul 27 '13 at 21:38
@BrentSowers another question, JAR files are binary, right? Do you know that Git is ill-suited for versioning binary files, because it has to keep each version in the repo every time it changes? Is it actually necessary to version these JAR files in Git? Are these external libraries? — , Jul 27 '13 at 21:45
@BrentSowers as [AD7six points out](http://stackoverflow.com/questions/17901588/new-repo-with-copied-history-of-only-currently-tracked-files?noredirect=1#comment26150310_17901588), you're running `filter-branch` multiple times in a bash script. That's probably why it's taking so long. It will probably run faster if you execute it once, and pass in a command that will have it remove the JAR files you don't want in one go. You might even have better luck with the `--tree-filter` option, compared to what you're currently doing. — , Jul 27 '13 at 21:52
You can find out more about the `--tree-filter` and `--index-filter` options of `filter-branch` at the [official Linux Kernel Git documentation](https://www.kernel.org/pub/software/scm/git/docs/git-filter-branch.html). — , Jul 27 '13 at 21:54
@Cupcake Yes jar files are binary. The jar files are libraries that our code uses, not our compiled output, so the jar files themselves never change, they are added once and potentially removed later when upgraded to a new version which has a new file name and hence is tracked separately. We have removed most of the jar files from our repo now, which is what prompted me to look for ways to purge the history. — Brent Sowers, Jul 28 '13 at 00:28
@BrentSowers newer .NET projects avoid this sort of problem by using NuGet package manager, which only versions a text configuration file specifying what libraries are required, and downloads the necessary libraries for the project if they are missing. The binaries themselves are never added to Git though, only the text config file is, so if there are any library upgrades, the config file is the only thing that changes in Git. Maybe there is something similar you could find for Java. — , Jul 28 '13 at 00:32
@Cupcake Yeah we're now using SBT to manage this in our project so the vast majority of the jars are downloaded from central repos. This has its own set of issues, but I don't want to stray too far from the topic at hand, we are stuck with the large repo because of not using SBT in the past — Brent Sowers, Jul 28 '13 at 00:39
Nowadays, I'd recommend considering changing the accepted answer to [this one](https://stackoverflow.com/a/61107746/184546) that uses git-filter-repo. — TTT, Dec 13 '21 at 20:25

AD7six · Accepted Answer · 2021-09-22T07:48:57.537

48

Delete everything and restore what you want

Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.

Like so:

# for Linux (GNU xargs, which supports -d)

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm  --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -d '\0' git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all

# for macOS (BSD xargs, which uses -0 instead of -d)

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm  --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -0 git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all

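This runs filter-branch only once over the history, so it may be considerably faster than removing the deleted files one at a time.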
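Before spending hours on a real repository, it can help to try the command on a throwaway one. The following is only a sketch under stated assumptions: the repository, the file names (`app.txt`, `old.jar`) and the commit messages are made up for illustration, and it uses the `xargs -0` form. `FILTER_BRANCH_SQUELCH_WARNING=1` only silences the deprecation notice that newer Git versions print.

```shell
#!/bin/sh
# Throwaway demo; every path and file name here is hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Three commits: add a source file and a jar, delete the jar, edit the source.
echo "code" > app.txt
echo "binary" > old.jar
git add app.txt old.jar
git commit -q -m "add app and jar"
git rm -q old.jar
git commit -q -m "drop jar"
echo "more code" >> app.txt
git commit -q -am "edit app"

# The keep-list is whatever is tracked right now (only app.txt).
git ls-files > keep-these.txt

# Same filter as above: empty the index, then restore only the listed paths
# from the commit being rewritten.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  "git rm --ignore-unmatch --cached -qr . ; \
   tr '\n' '\0' < '$PWD/keep-these.txt' | xargs -0 git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# Check the result before running any cleanup steps.
commits_left=$(git rev-list --count master)
jar_commits=$(git log --format=%H master -- old.jar | wc -l | tr -d ' ')
if git diff --quiet refs/original/refs/heads/master master; then
    tip_unchanged=yes
else
    tip_unchanged=no
fi
echo "commits left: $commits_left, touching old.jar: $jar_commits, tip unchanged: $tip_unchanged"
```

On this toy history the "drop jar" commit becomes empty and should be pruned, leaving two commits, none of which mention `old.jar`. The `git diff --quiet` against `refs/original/` is a cheap way to confirm that the rewritten tip still has exactly the files you started with; it only works before the backup refs are deleted.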

Cleanup steps

Once the whole process has finished, clean up:

$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
$ git gc --aggressive --prune=now

Comparing the repository size before and after should show a sizable reduction. The history will now contain only commits that touch the kept files, plus merge commits, even empty ones (because that's how --prune-empty works).

$GIT_COMMIT?

The use of $GIT_COMMIT seems to have caused some confusion. From the git filter-branch documentation (emphasis added):

The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.

That means git filter-branch provides the variable at run time; you don't supply it beforehand. If there's any doubt, this can be demonstrated with this no-op filter-branch command:

$ git filter-branch --index-filter "echo current commit is \$GIT_COMMIT"
Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
...
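The quoting is what makes this work, and it can be seen without running git at all. In the sketch below, `$PWD` is expanded by your own shell when the string is built, while `\$GIT_COMMIT` stays literal until the string is evaluated later. The commit id `abc123` is a made-up stand-in, and `sh -c` only roughly imitates what filter-branch does with `eval`.

```shell
#!/bin/sh
# Inside double quotes, $PWD is expanded immediately by your shell,
# while \$GIT_COMMIT is kept as the literal text $GIT_COMMIT.
filter="echo list=$PWD/keep-these.txt commit=\$GIT_COMMIT"

# This is the string filter-branch would receive as its filter argument:
printf '%s\n' "$filter"

# Rough imitation of one rewrite step: set the variable, then evaluate the string.
# "abc123" is a hypothetical commit id.
result=$(GIT_COMMIT=abc123 sh -c "$filter")
printf '%s\n' "$result"
```

The first line printed still contains the text `$GIT_COMMIT`; the second has it replaced by `abc123`. Dropping the backslash would make your shell substitute the (probably empty) value at the time you type the command, which is the mistake discussed in the comments.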

edited Sep 22 '21 at 07:48

answered Jul 28 '13 at 14:59


You might want to use `xargs` instead of looping over the lines one by one. It will try to fit as many arguments as possible in each run. – Hasturkun Jul 28 '13 at 15:26
@AD7six I'm running it now with xargs: "git filter-branch --force --index-filter "cat /path/to/keep-these.txt | xargs git rm --ignore-unmatch --cached -qr . && git reset -q $GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all" Wouldn't the git reset -q command reset the file to the current state for each commit, instead of the state that it was in at each commit, therefore losing my actual change history? – Brent Sowers Jul 28 '13 at 18:48
So what should GIT_COMMIT be? The most recent commit? – Brent Sowers Jul 28 '13 at 19:02
@BrentSowers I've simplified the answer to try and avoid confusion. The command you're running (according to the above comment) isn't part of any answer I've provided, and will .. keep only the first commit of one file - the last file listed in `keep-these.txt`. – AD7six Jul 28 '13 at 22:02
@AD7six Thanks, I got an error that the argument list was too long so I threw xargs in there, but yeah I definitely did it wrong. I am running your command now, it might take a day or two to run through all commits but that is doable. I'll let you know how it looks when it finishes. – Brent Sowers Jul 29 '13 at 00:24
@AD7six It worked! Thanks for all of the help. It took about 10 hours to run. For those who are interested, to pull new commits from origin (or any remote) without growing the repo in size, you'll have to cherry-pick them: run "git fetch origin master", then "git cherry-pick commitid" for each new commit, then run the git gc commands listed above again. – Brent Sowers Jul 29 '13 at 14:33
I tried to run the filter-branch command without luck until I changed it to this (note xargs -0): git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr . ; cat $PWD/keep-these.txt | xargs -0 git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all – Alex R. R. Dec 19 '13 at 16:30
And what about the history of a file that may have been renamed? Since its name was different before, I suppose it will be removed. Or is there a way to keep it? – SeB.Fr Mar 10 '14 at 15:47
I don't see any complexity - keep the existing file and rename it as a separate commit. If that's not what you want to hear: ask another question :). – AD7six Mar 10 '14 at 17:14
@AD7six I've asked another question for it: http://stackoverflow.com/questions/33865637/clean-git-history-of-deleted-files-keeping-renamed-files-history – Cœur Nov 23 '15 at 07:25
@SeB.Fr the answer to keep history of renamed files is to add `git ls-files | while read -r line; do (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do if ! read hash; then break; fi; IFS=$'\t' read mode_etc oldname newname; read blankline; echo $oldname; done); done >> keep-these.txt` between second and third command – Cœur Nov 23 '15 at 13:54
As it is, it deletes files with spaces in them. You have to quote the list of files to avoid this. – qwazix Jan 26 '16 at 14:40
Even though the commands can handle spaces now, I still recommend anyone check their results before the clean up steps. You can easily restore things using `git reflog` before that. – keithyip Jun 27 '18 at 03:28

score 20 · Answer 2 · edited May 23 '17 at 12:10

Based on AD7six's answer, with the history of renamed files preserved (you can skip the optional preliminary section).

Optional

remove all remotes:

git remote | while read -r line; do (git remote rm "$line"); done

remove all tags:

git tag | xargs git tag -d

remove all other branches:

git branch | grep -v \* | xargs git branch -D

remove all stashes:

git stash clear

remove all submodules configuration and cache:

git config --local -l | grep submodule | sed -e 's/^\(submodule\.[^.]*\)\(.*\)/\1/g' | while read -r line; do (git config --local --remove-section "$line"); done
rm -rf .git/modules/

Pruning untracked files history, keeping tracked files history & renames

git ls-files | sed -e 's/^/"/g' -e 's/$/"/g' > keep-these.txt
git ls-files | while read -r line; do (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do if ! read hash; then break; fi; IFS=$'\t' read mode_etc oldname newname; read blankline; echo $oldname; done); done | sed -e 's/^/"/g' -e 's/$/"/g' >> keep-these.txt
git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr .; cat \"$PWD/keep-these.txt\" | xargs git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all
rm keep-these.txt
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now

The first two commands list the tracked files and their old names, using quotes to preserve paths with spaces.
The third command rewrites the commits so that they contain only those files.
The remaining commands clean up: they delete the list, drop the backup refs and reflog entries, and garbage-collect.
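An alternative to wrapping each path in quotes is to pass the list NUL-delimited, as the accepted answer does with `tr` and `xargs -0` (`git ls-files -z` can also emit NUL-terminated paths directly). The difference is easy to see without touching a repository. This is just an illustration; the two file names are invented.

```shell
#!/bin/sh
# Two hypothetical paths, one containing a space.
list=$(printf 'my file.txt\nplain.txt\n')

# Plain xargs splits unquoted input on any whitespace: three arguments.
plain_count=$(printf '%s\n' "$list" | xargs sh -c 'echo "$#"' _)

# NUL-delimited input keeps each line as one argument: two arguments.
nul_count=$(printf '%s\n' "$list" | tr '\n' '\0' | xargs -0 sh -c 'echo "$#"' _)

echo "plain xargs: $plain_count arguments, xargs -0: $nul_count arguments"
```

With plain `xargs` and no quoting, `my file.txt` would be handed to `git reset` as two separate paths, neither of which exists, so the file would silently drop out of the rewritten history.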

Optional (not recommended)

repack (from the-woes-of-git-gc-aggressive):

git repack -a -d --depth=250 --window=250

Is there a reason why you are not using --aggressive with git gc? — Jazaret, Apr 29 '19 at 13:06
@Jazaret it was too long time ago for me to remember. But if you follow the link at the end of the post (the-woes-of-git-gc-aggressive), there seems to be a reasoning against using `--aggressive`. Maybe 3.5 years ago I got influenced by it. — Cœur, Apr 29 '19 at 13:47
Doesn't matter but: the `g` flag on those `sed` commands is harmless but not necessary (it means "global" i.e. replace *all* on the line, which makes no difference when you're replacing `^` or `$` since there will only ever be one of these on each line anyway) — Silas S. Brown, Feb 25 '20 at 11:30

score 13 · Answer 3 · answered Apr 08 '20 at 18:39

As of April 2020, git produces the following warning when using git filter-branch:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.

I'm sure there's a safe way to use git filter-branch, but for those (like myself) unaware of how to avoid the gotchas mentioned above, git-filter-repo makes it pretty easy to retain the history of only currently tracked files:

$ git checkout master
$ git ls-files > /tmp/keep-these.txt
$ git filter-repo --paths-from-file /tmp/keep-these.txt

While git filter-branch took about 5 minutes to run on my repo, git filter-repo ran and repacked the repo in a little under a second!

It can be installed by following the instructions on its GitHub page. Alternatively, on a Mac you can just run brew install git-filter-repo.

Nice. Nowadays this should probably be the accepted answer. – TTT Dec 13 '21 at 20:24

AD7six · Answer 4 · 2013-07-28T21:59:22.133

6

Run git filter-branch only once

The script in the question is going to be processing thousands of commits, thousands of times - and it's doing various (very slow) things once per iteration that ordinarily you'll only do at the end. That really is going to take forever.

Instead, run filter-branch once, removing all files in one go:

del=`cat deleted.txt`
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch $del" \
  --prune-empty --tag-name-filter cat -- --all

Once the process has finished, clean up:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
git gc --aggressive --prune=now

If the above fails due to the number of files

If deleted.txt lists so many files that the above command is too long to run, it can be rewritten like so:

git filter-branch --force --index-filter \
  'cat /abs/path/to/deleted.txt | xargs git rm --cached --ignore-unmatch' \
  --prune-empty --tag-name-filter cat -- --all

(cleanup steps are the same)

This has the same effect as the version above, but the files are passed to git rm by xargs in batches that fit within the argument-length limit, instead of as one single argument list.
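Note that xargs normally packs as many arguments as will fit into each invocation and only starts another one when it has to. A small sketch of that behavior, where `-n 2` is used purely to force tiny batches so the effect is visible (the real command above does not need `-n`), `echo` stands in for actually running git, and the jar names are made up:

```shell
#!/bin/sh
# Five hypothetical file names, at most two per invocation.
batches=$(printf '%s\n' a.jar b.jar c.jar d.jar e.jar |
    xargs -n 2 echo git rm --cached --ignore-unmatch)

# Each output line is one command that xargs would have run.
printf '%s\n' "$batches"

batch_count=$(printf '%s\n' "$batches" | wc -l | tr -d ' ')
echo "invocations: $batch_count"
```

Five names with a limit of two per call give three invocations. Without `-n`, the limit is the system's maximum command-line length, so tens of thousands of paths typically end up in a handful of `git rm` calls per commit.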

edited Jul 28 '13 at 21:59

answered Jul 27 '13 at 21:59


That second part under the heading "No worky", you're not passing a Git command to `--index-filter`, will that actually work? You don't need to use `--tree-filter` instead? – Jul 27 '13 at 22:05
it expects a command - it can be any command. If tree-filter is appropriate I can't say - it's only relevant for cutting a dir-slice out of a repository, I don't think the OP is doing that. – AD7six Jul 27 '13 at 22:06
Oh, okay, I guess as long as you don't actually need to do any shell commands on files in the working copy, since there won't be one to work on with `--index-filter`. – Jul 27 '13 at 22:07
Note that all I've done is put the standard command in a loop in a function. However: while the idea is I'm sure fine, I just tested and it failed - will likely need to edit it. – AD7six Jul 27 '13 at 22:08
You've got the right idea though, it's so close, just need to work out a few minor kinks! – Jul 27 '13 at 22:09
it runs such that the function I declared wasn't in scope - just wrote it inline and it works. – AD7six Jul 27 '13 at 22:16
Is the `git gc --aggressive` actually necessary though, or even a good idea? If you want to get rid of dangling commits, just `git gc --prune=now` will probably be sufficient. I've [read bad things (1)](http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/) [about using `git gc --aggressive` (2)](http://gcc.gnu.org/ml/gcc/2007-12/msg00165.html). – Jul 27 '13 at 22:27
I'm not that knowledgeable about what it's doing it's just [a standard practice](https://help.github.com/articles/remove-sensitive-data) when deleting something sensitive or sizable. I don't know what has changed in git in the past 5 years - but the command still exists. I'll add a comment to it. – AD7six Jul 27 '13 at 22:36
@AD7six Thanks for the response. I tried the second filter-branch command listed (the first failed because the argument list was too long), and it's been sitting at "Rewrite firstcommithashID (1/58968)" for several minutes now. We have an enormous number of deleted files in the history, I wasn't exaggerating when I said tens of thousands. Perhaps I will try only putting jar files in deleted.txt since they are the largest deleted files, although there are probably still hundreds of those. – Brent Sowers Jul 28 '13 at 00:32
@BrentSowers the command [`git ls-files`](https://www.kernel.org/pub/software/scm/git/docs/git-ls-files.html) can be used to get a list of files from the index, instead of having to iterate over an external file. Maybe it will be faster if you passed something like this to `--index-filter`: `git ls-files | grep .jar | xargs git rm --cached`. That command will delete ***all*** JAR files from your history though...maybe you can commit the ones you still want back in afterwards. – Jul 28 '13 at 03:19

score 0 · Answer 5 · answered Apr 02 '20 at 16:04

Adding to the accepted answer by AD7six (since I do not have enough reputation to comment on the answer):

If you want to keep more than just master, first remove the tags and branches you no longer need, then create a list of the files referenced in all the branches and tags you want to keep:

for tag in `git for-each-ref refs/tags --format='%(refname)' | cut -d / -f 3`
do
    echo $tag; sleep 3 # sleep to avoid: fatal: Unable to create '.git/index.lock': File exists.
    git checkout "$tag"
    git ls-files > ../keep_files_tag_$tag.txt
    git ls-files >> ../keep_files_all.txt
done
for branch in `git for-each-ref refs/heads --format='%(refname)' | cut -d / -f 3`
do
    echo $branch; sleep 3 # sleep to avoid: fatal: Unable to create '.git/index.lock': File exists.
    git checkout "$branch"
    git ls-files > ../keep_files_branch_$branch.txt
    git ls-files >> ../keep_files_all.txt
done
sort ../keep_files_all.txt | uniq > keep_files_uniqe.txt
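A possible variation, not taken from the answer above: `git ls-tree -r --name-only <ref>` lists the files of any ref without checking it out, which sidesteps both the `sleep` and the `index.lock` error and also copes with branch names that contain slashes. This is a sketch only; the function name `list_kept_paths` and the output file name are made up.

```shell
#!/bin/sh
# Print the union of all paths tracked at the tip of any branch or tag.
# Nothing is checked out, so the index and working tree are left alone.
list_kept_paths() {
    git for-each-ref --format='%(refname)' refs/heads refs/tags |
    while read -r ref; do
        git ls-tree -r --name-only "$ref"
    done | sort -u
}

# Usage (run inside the repository; the file name is hypothetical):
#   list_kept_paths > ../keep_files_all_refs.txt
```

Because it works on full ref names such as `refs/heads/feature/x`, it does not depend on `cut -d / -f 3`, which would truncate that branch name to `feature`.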