Git, unlike many other version-control systems, does not store deltas.[1]
Each commit is therefore, in an important sense, totally independent of every other commit, so you are free to "pluck out" any commit(s) without affecting any other commit, as long as you know what you're doing.
There is one important sense in which each commit is totally dependent on every predecessor commit, but that is in terms of its SHA-1 "true name", not the source tree associated with it. In other words, as long as you know what you're doing, this doesn't affect you.[2]
As for how to remove particular commit(s), you have several options. The answer to the question you linked to uses an interactive rebase. This can work, although it only handles the simpler cases (one branch, one big file or set of files that must be deleted just once, that sort of thing). What you need to know here is that git rebase -i is essentially git cherry-pick on steroids, as it were: it automates a whole series of cherry-pick operations, then does some simple branch-label manipulation.
Another method is to use git filter-branch. This is likely the more correct method in this case. The thing to know here is that git filter-branch is kind of like git rebase on steroids: it automates many copy operations (not specifically cherry-picks), then does complex, multiple-label manipulation (branches and, optionally, tags as well).
Let me have a footnote break and then I'll tell you what you need to know about filter-branch.
[1] Deltas sneak back in via "pack files", which give git good compression (better than many other VCS-es), but these happen well below the point at which git stores a tree with each commit. As far as the commits go, each commit is simply an object with some metadata and a (single) "tree" object, and the tree contains a complete, independent snapshot of the files that go with that commit. When you git show a commit and see a delta, that's because git has extracted not only that particular commit but also its parent commit(s), and then, at git show time, used its diff generator to show you what happened in that commit with respect to that parent or those parents.
[2] Of course, this leaves a lot of wiggle room if you aren't quite sure what you're doing. :-) In particular, no matter what you do here, you'll wind up "renumbering" all the commits "downstream" of any commit that gets modified. If someone else already has a copy of these commits (e.g., a clone of your current repo), they will have to take some action to update their copies, so you'll be making a bunch of work for them. If "they" include "you" (i.e., if you have a couple of copies of the original repo), you'll have to do something about that yourself, but that's probably just "throw away those copies and get new copies", which you can do at your own pace. You won't be annoying yourself, or at least, you'll know it when you are. :-)
Back to git filter-branch: what it does is much the same as almost every other git command. It does not, and cannot, change any existing commit. Instead, it copies commits: it extracts each one, applies some filter(s), and makes a new commit from the result.
You should think of the git repository as a big pile of "objects", including commit objects, with each commit looking something like this:
    tree 55c0d854767f92185f0399ec0b72062374f9ff12
    parent 8413a79e67177d026d2d8e1ac66451b80bb25d62
    author Junio C Hamano <redacted> 1436563740 -0700
    committer Junio C Hamano <redacted> 1436563740 -0700

    The last minute bits of fixes

    Signed-off-by: Junio C Hamano <redacted>
Each commit can have an arbitrary number of labels (normally branch and tag names) "pointing to" that commit. A label "points to" a commit the same way that a commit "points to" its parent(s) and tree, by listing the SHA-1 "true name" of that object. (The other object types are "tree", "blob", and "annotated tag". All objects are "well inside" the repo, in .git/objects, while the labels are more "around the edge" of the repo, in .git/refs. A few special labels like HEAD are directly in .git/ itself. The exact location doesn't really matter: the key here is that labels point to commits, and get you, or git, started inside the repo. Then commits point to other commits, as needed.)
The commit shown above is the actual contents of a commit inside the git repo for git (modified to take out email addresses so that spammers don't collect them). The SHA-1 for this commit is determined by its contents: the tree and parent values, the author and committer names and time stamps, and the message. The filter-branch command will, at some point, extract this commit, apply your filter(s), and then make a new commit from the result.
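To make "the SHA-1 is determined by its contents" concrete, here is a minimal Python sketch of how git names an object: it hashes a short header (object type and body size, then a NUL byte) followed by the body. The helper names and the author identity are made up for illustration, and because the real email address is redacted above, the IDs printed here will not match the real commit in git's own repository.

```python
import hashlib


def git_object_id(obj_type, body):
    """SHA-1 over a '<type> <size>' header, a NUL byte, and the object body."""
    header = "{} {}".format(obj_type, len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message):
    """Lay out a commit object's text: header lines, a blank line, the message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    return ("\n".join(lines) + "\n").encode()


TREE = "55c0d854767f92185f0399ec0b72062374f9ff12"
PARENT = "8413a79e67177d026d2d8e1ac66451b80bb25d62"
WHO = "A U Thor <author@example.com> 1436563740 -0700"  # hypothetical identity
MESSAGE = "The last minute bits of fixes"

original = git_object_id("commit", commit_body(TREE, [PARENT], WHO, WHO, MESSAGE))
identical = git_object_id("commit", commit_body(TREE, [PARENT], WHO, WHO, MESSAGE))
# Same tree, same message, but a different (made-up) parent ID.
reparented = git_object_id("commit", commit_body(TREE, ["0" * 40], WHO, WHO, MESSAGE))

print(original == identical)   # True: same bytes, same name
print(original != reparented)  # True: a different parent gives a different name
```

The last line is the reason every commit "downstream" of a modified commit gets a new name, even when its own files are untouched: its parent line has changed.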
The git filter-branch command provides lots of filters so that you can change any or all parts of each commit, with variants that try to be extra efficient. The slowest part of copying a modified commit is usually extracting all the old files, then examining the result and making new files; sometimes you can write a filter that works entirely within the "index", skipping the extract-and-examine steps. The principle is still the same though: check out the old commit in a temp directory; then modify it with filters; then make a new commit from the result.
Each new commit gets its own SHA-1 "true name", computed from its contents.
If the new commit is exactly identical to the old commit, bit for bit, the new SHA-1 is the same as the old SHA-1. For filter-branch's purposes, this doesn't really matter: as it goes along copying commits, it updates a "map" file. The map file keeps pairs of values: old-SHA-1, new-SHA-1. Every time the script goes to copy a commit, it looks up each "parent" pointer in this map, so that the new commits point to the new parents, while the old commits continue pointing to the old parents (as they must).
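The copy-and-map loop can be sketched in a few lines of Python. This is a toy model, not filter-branch's actual code: a "tree" is just a tuple of file names, the "true name" is a SHA-1 over a simplified text form, and all the function names and the file name big.bin are hypothetical.

```python
import hashlib


def object_id(tree, parents, message):
    """Toy 'true name': SHA-1 over the tree, the parent IDs, and the message."""
    text = "tree {}\n".format(",".join(tree))
    text += "".join("parent {}\n".format(p) for p in parents)
    text += "\n" + message + "\n"
    return hashlib.sha1(text.encode()).hexdigest()


def make_history(specs):
    """Build a single line of commits, oldest first, from (tree, message) pairs."""
    commits, order, parent = {}, [], None
    for tree, message in specs:
        parents = [parent] if parent else []
        cid = object_id(tree, parents, message)
        commits[cid] = {"tree": tree, "parents": parents, "message": message}
        order.append(cid)
        parent = cid
    return commits, order


def filter_history(commits, order, tree_filter):
    """Copy each commit, oldest first, recording old ID -> new ID in a map."""
    id_map, copies = {}, {}
    for old_id in order:
        old = commits[old_id]
        new_tree = tree_filter(old["tree"])
        # Parents were copied earlier, so their new IDs are already in the map.
        new_parents = [id_map[p] for p in old["parents"]]
        new_id = object_id(new_tree, new_parents, old["message"])
        copies[new_id] = {"tree": new_tree, "parents": new_parents,
                          "message": old["message"]}
        id_map[old_id] = new_id
    return id_map, copies


commits, order = make_history([
    (("README",), "initial"),
    (("README", "big.bin"), "add big file by mistake"),
    (("README", "main.c"), "drop big file, add code"),
])
id_map, copies = filter_history(
    commits, order, lambda tree: tuple(f for f in tree if f != "big.bin"))

for old_id in order:
    status = "unchanged" if id_map[old_id] == old_id else "renumbered"
    print(old_id[:7], "->", id_map[old_id][:7], status)
```

In this run the first commit keeps its name because nothing about it changed, the second is renumbered because its tree changed, and the third is renumbered even though its own tree is untouched, because its parent pointer now names the new copy. The old commits are never modified; they still point to the old parents.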
Eventually (this can take a very long time, which is why there are so many optimization flags), filter-branch will have applied the filter(s) to all the commit(s) you asked it to look at. At this point, the map file needs to be applied to the labels.
Again, the labels are how you, and git itself, get started. If you're looking for commits on branch master, you start by looking up the label master. That contains the SHA-1 true name of a commit, and by definition, that commit is the tip of branch master. That commit has some parents, those commits have their own parents, and so on; git will construct the graph of commits dynamically, by reading these commits as needed.
So, the filter-branch command now simply needs to change all the old labels to point to the new commits, instead of pointing to the old commits.
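As a rough illustration of that last step, the sketch below applies an old-to-new map to a small table of labels and then walks the history from a label. The two-character IDs, the label names, and both helper functions are invented for the example; real IDs are 40-character SHA-1s and this is not how filter-branch is actually implemented.

```python
def retarget_labels(refs, id_map):
    """Return a copy of refs with each mapped old ID replaced by its new ID."""
    return {name: id_map.get(target, target) for name, target in refs.items()}


def reachable(start, parents_of):
    """List commits found by starting at one commit and following parent links."""
    seen, stack = [], [start]
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.append(cid)
            stack.extend(parents_of[cid])
    return seen


# Old chain: a1 <- b2 <- c3. New copies: a1 (unchanged) <- b9 <- c9.
parents_of = {"a1": [], "b2": ["a1"], "c3": ["b2"], "b9": ["a1"], "c9": ["b9"]}
id_map = {"a1": "a1", "b2": "b9", "c3": "c9"}
refs = {"refs/heads/master": "c3", "refs/heads/topic": "b2", "refs/tags/v1.0": "a1"}

new_refs = retarget_labels(refs, id_map)
print(new_refs["refs/heads/master"])
print(reachable(new_refs["refs/heads/master"], parents_of))
```

Once master names the new tip, a walk that starts from master only ever sees the new copies. The old commits still exist as objects, but no label in new_refs leads to b2 or c3 any more.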
The labels that git filter-branch rewrites are the ones you've named on its command line. For this sort of thing, you'd name --all, which means all branches. In fact, --all means all references, but git filter-branch strips that down to just branches, unless you add --tag-name-filter. (I'm not entirely sure what use case the git folks had in mind with this; most people just wind up using --tag-name-filter cat to keep the tag names unchanged while updating them to point to the newly copied commits.)
Search Stack Overflow for more information on using (and speeding up) git filter-branch. I'm not sure if it's applicable to your particular case (I have never used it myself), but consider also using the "BFG repo cleaner", which is a sped-up, stripped-down git filter-branch for the specific case of removing unwanted files. It's a lot less complicated to set up, since it doesn't apply arbitrary filters. It does have all the same caveats, of course, because fundamentally, commits can never be changed; the best you can do is make new copies that are similar but different and thus have different SHA-1 "true names".