Is it possible to remove old commits in Git without losing data?

Question

We are in the process of migrating from Mercurial to Git.

In the process, I'd like to do a little housekeeping on some of our older and larger repositories.

We have one particular project that has almost 5 years of history and commits in it.

I can see no use case that would require us to revert back to a commit 3 years ago.

This particular project also has a commit that occurred 4 years ago in which a developer committed over 200,000 small text files that were used in a series of tests. This number of files killed the performance of our systems, so a few commits later these files were removed. While this helped with the overall performance of the local systems, all of these files are still contained within the repository history.

My goal in this exercise is to get rid of these files and the overall bloat they cause when cloning this repository.

So what I would like to learn is if there is a way that I can effectively trim old commits from our history in Git, without losing the changes that were made in those previous commits? In other words, resetting what will become the first commit in the repository to be what the working folder was at a particular point in time?

EDIT: Since I am concerned about removing the bloat caused by the addition and later deletion of a large number of files, I don't consider this to be a direct duplicate of Remove an old Git commit from a branch without using a reverse patch? -- however the solution might turn out to be the same (I just don't know that at this point)

How does rebase work if you apply a commit that adds 100,000 files, then a commit which removes the 100,000 files? Do these files still stick around in the .git folder, or are they actually removed since the net effect is that they are no longer needed? — Richard West, Sep 22 '15 at 18:58
Removing the commit that added the files means that you undo adding the files, essentially removing them. I believe some sort of reference is kept for around 14 days in case you want to use the reflog to undo the rebase (not 100% sure about that last part) — Tim, Sep 22 '15 at 19:02

score 4 · Answer 1 · answered Sep 22 '15 at 22:11

Git, unlike many other version-control systems, does not store deltas.¹

This means that each commit is, in an important sense, totally independent of every other commit, so you are free to "pluck out" any commit(s) without affecting any other commit, as long as you know what you're doing.

There's one important sense in which a commit is totally dependent on every predecessor commit, but that's in terms of its SHA-1 "true name", not the source tree associated with it. In other words, as long as you know what you're doing, this doesn't affect you.²

As for how to remove particular commit(s), well, you have several options. The answer associated with the question you linked to uses an interactive rebase. This can work, although it only deals with simpler cases (one branch, one big file or set of files that must be deleted just once, that sort of thing). What you need to know here is that git rebase -i is essentially git cherry-pick on steroids, as it were: it automates a whole series of cherry-pick operations, then does some simple branch-label manipulation.

Another method is to use git filter-branch. This is likely the more-correct method in this case. The thing to know here is that git filter-branch is kind of like git rebase on steroids, as it were: it automates many copy operations (not specifically cherry-picks), then does complex, multiple-label manipulation (branches and, optionally, tags as well).

Let me have a footnote break and then I'll tell you what you need to know about filter-branch.

¹Deltas sneak back in via "pack files", which give git good compression (better than many other VCS-es), but these happen well below the point at which git stores a tree with each commit. As far as the commits go, each commit is simply an object with some metadata and a (single) "tree" object, and the tree contains a complete, independent snapshot of the files that go with that commit. When you git show a commit and see a delta, that's because git has extracted not only that particular commit, but also its parent commit(s), and then—at git show time—used its diff-generator to show you what happened in that commit, with respect to that parent or those parents.

²Of course, this leaves a lot of wiggle room if you aren't quite sure what you're doing. :-) In particular, no matter what you do here, you'll wind up "renumbering" all the commits "downstream" of any commit that gets modified. If someone else already has a copy of these commits (e.g., a clone of your current repo), they will have to take some action to update their copies, so you'll be making a bunch of work for them. If "they" include "you"—i.e., if you have a couple of copies of the original repo—you'll have to do something about that yourself, but that's probably just "throw away those copies and get new copies", which you can do at your own pace. You won't be annoying yourself, or at least, you'll know it when you are. :-)
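
As an illustration of footnote 1, here is a small Python sketch of the "snapshots in, diffs computed on demand" idea. It is a toy model, not Git's actual storage format: the two dictionaries and the `show` helper are made up for this example, and Python's `difflib` stands in for git's diff generator.

```python
import difflib

# Two commits, each holding a complete snapshot: path -> file contents.
parent_snapshot = {"README": "hello\n", "main.c": "int main(void) { return 0; }\n"}
child_snapshot = {"README": "hello\nworld\n", "main.c": "int main(void) { return 0; }\n"}


def show(parent: dict, child: dict) -> list:
    """Compute a diff on demand from two full snapshots (roughly what git show does)."""
    lines = []
    for path in sorted(set(parent) | set(child)):
        old = parent.get(path, "").splitlines(keepends=True)
        new = child.get(path, "").splitlines(keepends=True)
        if old != new:
            # Nothing about this delta is stored; it is derived right here.
            lines.extend(difflib.unified_diff(old, new, f"a/{path}", f"b/{path}"))
    return lines


print("".join(show(parent_snapshot, child_snapshot)))
```

Only README appears in the output, even though both snapshots also carry a full copy of main.c.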

Back to git filter-branch: what it does is much the same as almost every other git command. It does not, and cannot, change any existing commit. Instead, it copies commits: it extracts each one, applies some filter(s), and makes a new commit from the result.

You should think of the git repository as a big pile of "objects", including commit objects, with each commit looking something like this:

tree 55c0d854767f92185f0399ec0b72062374f9ff12
parent 8413a79e67177d026d2d8e1ac66451b80bb25d62
author Junio C Hamano <redacted> 1436563740 -0700
committer Junio C Hamano <redacted> 1436563740 -0700

The last minute bits of fixes

Signed-off-by: Junio C Hamano <redacted>

Each commit can have an arbitrary number of labels (normally branch and tag names) "pointing to" that commit. A label "points to" a commit the same way that a commit "points to" its parent(s) and tree, by listing the SHA-1 "true name" of that object. (The other object types are "tree", "blob", and "annotated tag". All objects are "well inside" the repo, in .git/objects, while the labels are more "around the edge" of the repo, in .git/refs. A few special labels like HEAD are directly in .git/ itself. The exact location doesn't really matter: the key here is that labels point to commits, and get you, or git, started inside the repo. Then commits point to other commits, as needed.)

The commit shown above is the actual contents of a commit inside the git repo for git itself (modified to take out email addresses so that spammers don't collect them). The SHA-1 for this commit is determined by its contents: the tree and parent values, the author and committer name and time stamps, and the message. The filter-branch command will, at some point, extract this commit, apply your filter(s), and then make a new commit from the result.
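
To make "determined by its contents" concrete, here is a minimal Python sketch of how Git names an object: it hashes a short header (the object type, the body length in bytes, and a NUL byte) followed by the body itself. The helper name `git_object_id` is my own, and because the commit text is the redacted version from above, the resulting hash will not match the real commit in git's repository.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 Git assigns to an object of the given kind and body."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# The redacted commit from the text. The real commit has e-mail addresses
# here, so this is NOT the real commit's hash.
commit_body = (
    b"tree 55c0d854767f92185f0399ec0b72062374f9ff12\n"
    b"parent 8413a79e67177d026d2d8e1ac66451b80bb25d62\n"
    b"author Junio C Hamano <redacted> 1436563740 -0700\n"
    b"committer Junio C Hamano <redacted> 1436563740 -0700\n"
    b"\n"
    b"The last minute bits of fixes\n"
    b"\n"
    b"Signed-off-by: Junio C Hamano <redacted>\n"
)
original_id = git_object_id("commit", commit_body)

# Point the same commit at a different (made-up) parent. Tree, author and
# message are untouched, yet the commit gets a different name.
reparented_body = commit_body.replace(
    b"parent 8413a79e67177d026d2d8e1ac66451b80bb25d62",
    b"parent 0000000000000000000000000000000000000000",
)
reparented_id = git_object_id("commit", reparented_body)

print(original_id)
print(reparented_id)
print(original_id != reparented_id)  # True
```

As a sanity check on the header format, `git_object_id("blob", b"")` returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of the empty blob.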

The git filter-branch command provides lots of filters so that you can change any or all part(s) of each commit, with variants that try to be extra-efficient. The slowest part of copying a modified commit is usually extracting all the old files, and then examining the result and making new files, and sometimes you can make a filter that works entirely within the "index", skipping the extract-and-examine steps. The principle is still the same though: check out the old commit in a temp directory; then modify it with filters; then make a new commit from the result.

Each new commit gets a new SHA-1 "true name".

One caveat: if the new commit is bit-for-bit identical to the old commit, its SHA-1 is the same as the old SHA-1, so the "copy" is really the same commit. For filter-branch's purposes, this doesn't really matter: as it goes along copying commits, it updates a "map" file. The map file keeps pairs of values: old-SHA-1, new-SHA-1. Every time the script goes to copy a commit, it makes sure that the "parent" pointers look up the appropriate mapping, so that the new commits point to the new parents, while the old commits continue pointing to the old parents (as they must).
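
The copy-and-map process can be sketched in Python as a toy model. Everything here is an assumption made for illustration: `commit_id` is a stand-in for Git's real hashing, trees are plain dictionaries, the history is linear, and `drop_test_data` is a hypothetical filter that strips a testdata/ directory. It is not how filter-branch is implemented, only the same idea in miniature.

```python
import hashlib
import json


def commit_id(tree: dict, parents: list, message: str) -> str:
    """Toy stand-in for Git's SHA-1: a hash over everything in the commit."""
    payload = json.dumps([tree, parents, message], sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


def make_history(specs):
    """Build a linear history; returns (objects by id, ids oldest first)."""
    objects, order, parent = {}, [], None
    for tree, message in specs:
        parents = [parent] if parent else []
        cid = commit_id(tree, parents, message)
        objects[cid] = {"tree": tree, "parents": parents, "message": message}
        order.append(cid)
        parent = cid
    return objects, order


def filter_history(objects, order, tree_filter):
    """Copy each commit through tree_filter, recording old id -> new id."""
    id_map, new_objects = {}, {}
    for old_id in order:  # parents are always visited before children
        old = objects[old_id]
        new_tree = tree_filter(dict(old["tree"]))
        new_parents = [id_map[p] for p in old["parents"]]  # consult the map
        new_id = commit_id(new_tree, new_parents, old["message"])
        new_objects[new_id] = {"tree": new_tree, "parents": new_parents,
                               "message": old["message"]}
        id_map[old_id] = new_id
    return id_map, new_objects


def drop_test_data(tree):
    """Hypothetical filter: remove everything under testdata/."""
    return {p: c for p, c in tree.items() if not p.startswith("testdata/")}


objects, order = make_history([
    ({"a.txt": "1"}, "initial"),
    ({"a.txt": "1", "testdata/x": "big"}, "add test files"),
    ({"a.txt": "2", "testdata/x": "big"}, "edit a.txt"),
    ({"a.txt": "2"}, "remove test files"),
])
id_map, new_objects = filter_history(objects, order, drop_test_data)

for old_id in order:
    print(old_id[:8], "->", id_map[old_id][:8], objects[old_id]["message"])
```

The first commit maps to itself because nothing about it changed. The last commit's tree is also untouched by the filter, but it still gets a new ID because its parent was rewritten. Note that this toy keeps "add test files" as an empty commit; the real command has a --prune-empty option for dropping those.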

Eventually (this can take a very long time, which is why there are so many optimization flags), filter-branch will have applied the filter(s) to all the commit(s) you asked it to look at. At this point, the map file needs to be applied to the labels.

Again, the labels are how you, and git itself, get started. If you're looking for commits on branch master, you start by looking up the label master. That contains the SHA-1 true-name of a commit: and by definition, that commit is the tip of branch master. That commit has some parents, those commits have their own parents, and so on; and git will construct the graph of commits dynamically, by reading these commits as needed.
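
Here is a rough Python sketch of that "start at a label, then follow parent pointers" walk. The single-letter IDs, the `refs` table, and the `reachable` helper are all invented for the example; real IDs are full SHA-1 names.

```python
def reachable(objects: dict, tip: str) -> set:
    """Collect every commit id reachable from tip by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(objects[cid]["parents"])
    return seen


# Made-up ids: C is a merge of B and X, which both descend from A.
objects = {
    "A": {"parents": []},
    "B": {"parents": ["A"]},
    "X": {"parents": ["A"]},
    "C": {"parents": ["B", "X"]},
    "D": {"parents": ["C"]},
}
refs = {"master": "D", "old-release": "B"}

print(sorted(reachable(objects, refs["master"])))       # ['A', 'B', 'C', 'D', 'X']
print(sorted(reachable(objects, refs["old-release"])))  # ['A', 'B']
```

A commit that no label can reach this way is effectively invisible, which is why moving the labels is the last step of a rewrite.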

So, the filter-branch command now simply needs to change all the old labels to point to the new commits, instead of pointing to the old commits.
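
In miniature, applying the map to the labels might look like the following Python sketch. The shortened IDs and the `retarget` helper are hypothetical; the ref names simply follow the usual refs/heads/ layout.

```python
# Hypothetical, shortened ids standing in for full SHA-1 names.
id_map = {
    "aaa111": "aaa111",  # an unchanged commit maps to itself
    "bbb222": "ddd444",
    "ccc333": "eee555",
}
refs = {
    "refs/heads/master": "ccc333",
    "refs/heads/feature": "bbb222",
    "refs/heads/ancient": "aaa111",
}


def retarget(refs: dict, id_map: dict) -> dict:
    """Point every label at the copy of the commit it used to name."""
    return {name: id_map.get(old_id, old_id) for name, old_id in refs.items()}


new_refs = retarget(refs, id_map)
for name in sorted(new_refs):
    print(name, refs[name], "->", new_refs[name])
```

The old commits are not deleted by this step; they simply stop being named. The real command also saves the original labels under refs/original/, so the old objects stay in the repository until those saved labels (and any reflog entries) are gone and git garbage-collects.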

The labels that git filter-branch rewrites are the ones you've named on its command line. For this sort of thing, you'd name --all which means all branches. In fact, --all means all references, but git filter-branch strips that down to just branches, unless you add --tag-name-filter. (I'm not entirely sure what use-case the git folks had in mind with this; most people just wind up using --tag-name-filter cat to keep the tag names unchanged while updating them to point to the newly copied commits.)

Search Stack Overflow for more information on using (and speeding up) git filter-branch. I'm not sure if it's applicable for your particular case (I have never used it myself), but consider also using the "BFG Repo-Cleaner", which is a sped-up, stripped-down git filter-branch for the specific case of removing unwanted files. It's a lot less complicated to set up, since it doesn't apply arbitrary filters. It does have all the same caveats, of course, because fundamentally, commits can never be changed; the best you can do is make new copies that are similar-but-different and thus have different SHA-1 "true names".

Thanks for the detailed response torek! I'm going to digest this and give it a try on a copy of the repository. — Richard West, Sep 23 '15 at 13:56

score 0 · Answer 2 · answered Sep 23 '15 at 08:15

Removing these changesets on the Mercurial side may be easier and safer (you can always start over from the original repo rather than from a trimmed clone):

just use histedit to drop the changeset that adds the files, along with the later commit(s) that deal with those files

Lazy Badger
