Reformat entire codebase with git rewrite

Question

We have a fairly large codebase comprising about 60,000 commits. We want to reformat all our .java files while preserving the git history, so the approach we took is to use git filter-branch --tree-filter to reformat the entire codebase while keeping the history intact. However, there are a few questions I have been unable to find answers to:

1. When I apply a --tree-filter and pass the command that reformats all the .java files in the root directory, the rewrite happens, but at the very end, I see all the .java files in the staging area. Is a commit needed at every step of the rewrite or does it happen automatically?
2. git filter-branch seems to take a range of commits, which made me wonder if it is possible to save the commit ID before every rewrite and resume in case of a failure. Resumption is important as the whole process might take a few days to complete (even on a powerful compute instance).
3. For the purpose of reformatting the entire codebase, would --index-filter work?

UPDATE: Clarifications

The codebase is about 2.2 million lines of Java code. Without a git rewrite, approximately 10%-12% of the codebase would be attributed to the wrong author. That's about 200K lines of Java code, which is something we want to avoid. A git rewrite makes it look as if each person who made a change had formatted it correctly in the first place.

It's not clear to me what you mean by "at the end ... all the .java files [are] in the staging area". The filter-branch command ends by, in essence, checking out the filtered result, so of course the staging area is non-empty unless the filtered result is empty. — torek, Sep 29 '14 at 18:45
Out of curiosity, what reformatting tool are you planning to use? Jalopy or something else? — Roberto Tyley, Sep 29 '14 at 21:24
The tool that we settled on is Eclipse's code formatter (actually a patched version, as the latest one has bugs that make it unusable for us), invoked from the [command line](http://blogs.operationaldynamics.com/andrew/software/java-gnome/eclipse-code-format-from-command-line). It's a bit slow, but every other tool that we looked at had some problem that made it infeasible for us. — Karthik, Sep 29 '14 at 21:30
@torek by "at the end.." I meant that after the filter-branch command completes (each rewrite step formats all the .java files in the root directory), git status shows all the .java files in the root directory as modified. I expected the staging area to be empty when git filter-branch completes, since each rewrite would have applied the code formatting to all the .java files and **committed** it. This is why I asked whether an explicit commit is needed at each stage. I hope that makes sense. — Karthik, Sep 29 '14 at 21:35
@Karthik: at the end, `filter-branch` does a `git read-tree -u -m HEAD` to do a trivial merge into the work directory. That could leave you with modifications. (It skips this step if you run the filter-branch operation on a bare repository.) — torek, Sep 29 '14 at 23:26
@torek I am still missing something. I have a simple repo with three commits. The first commit introduces a new java file and the next two make trivial modifications to it. Now, I invoke filter-branch --tree-filter 'reformat-java-file', which simply changes tabs to spaces. After 3 rewrites, filter-branch completes but leaves the java file in the modified state, with the diff indicating that tabs were changed to spaces. Why wouldn't the rewrite in each step apply the filter and commit it? If it did that, I'd have a clean working directory at the end. What am I missing? — Karthik, Sep 30 '14 at 16:09
I'd have to see the actual command you're running. After filter-branch has made new commits it "rewrites the positive refs given on the command line", and after *that* it does this `read-tree -u -m HEAD` step as it exits. My off-hand guess is that your reformatter is working *outside* the temporary tree-directory, in your working copy only, so that no changes are being made to any commits so that no references are actually updated, but the work tree remains modified. — torek, Sep 30 '14 at 16:46
@torek that exactly was the problem! I did not realize the temporary directory aspect of it. For some reason, I thought it all happens in place. I should have paid more heed to the man page and specifically your first response here which also specifies that. Now, things are more clear. — Karthik, Sep 30 '14 at 18:08
Rather than rewriting history, could you create a script which runs reformat on each file, then commits just that file under the name of the appropriate author? And for files with multiple authors, where you want to preserve git blame line-based attribution, if your code-formatting tool can be told to just format specific lines of the file, you could run multiple passes on each file, one for each author... — Jesse W at Z - Given up on SE, Oct 09 '14 at 17:35

score 2 · Answer 1 · edited May 23 '17 at 12:25

As the author of the BFG (a faster, simpler alternative to git-filter-branch), I'm disposed to mention it, though it doesn't - out-of-the-box - do Java-source reformatting.

You mention that resume-after-failure for the git-filter-branch operation would be helpful, and that is, of course, because git-filter-branch is so slow. There is no way to resume a git-filter-branch operation, but if it were faster it wouldn't be such a big issue. The BFG can be many hundreds of times faster than git-filter-branch, because it cleans any given version of a file only once, unlike git-filter-branch, which cleans the same file again for every commit that contains it.
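The "clean each file version only once" idea can be illustrated in plain shell, independently of the BFG. The sketch below is not the BFG's implementation. `format_once`, `format_stdin`, and `CACHE_DIR` are hypothetical names, and `expand -t 4` merely stands in for a real Java formatter. It assumes the caller supplies a key that identifies the file content, for example a git blob ID.

```shell
#!/bin/sh
# Assumed cache location; any directory that persists across commits works.
CACHE_DIR="${CACHE_DIR:-${TMPDIR:-/tmp}/java-format-cache}"

# Hypothetical formatter stand-in: reads stdin, writes formatted text to stdout.
format_stdin() {
    expand -t 4
}

# format_once KEY FILE
# Prints the formatted content of FILE. The formatter runs only the first
# time a given KEY is seen; later calls reuse the cached result.
format_once() {
    key=$1
    file=$2
    mkdir -p "$CACHE_DIR" || return 1
    if [ ! -f "$CACHE_DIR/$key" ]; then
        format_stdin < "$file" > "$CACHE_DIR/$key.tmp" &&
            mv "$CACHE_DIR/$key.tmp" "$CACHE_DIR/$key" || return 1
    fi
    cat "$CACHE_DIR/$key"
}
```

Because most files are unchanged from one commit to the next, a cache keyed on content identity would let a slow formatter run roughly once per distinct file version rather than once per file per commit.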

The BFG supports straight text-replacement in files, but as I said, it doesn't do Java-source reformatting. There would be two alternatives for getting that to work:

1. Invoke the BFG as a library, as Christian Hoffmeister recently did. In your case, you would pass in a custom TreeBlobModifier that invokes Jalopy or some other Java source-code formatter.
2. Change the BFG so that it supports shelling out to invoke arbitrary bash commands, a bit like git-filter-branch's --tree-filter or --index-filter, but still, I would expect, rather faster.

Option 2 wouldn't be that hard to implement. However, I wonder if you could elaborate on why you want to take the drastic action of rewriting history. Is there really a substantial benefit to having a perfectly formatted history, relative to the hassle of rewriting commits and getting everyone to adapt to the changed history? Why not just do a one-off reformat of your latest commit?

Internally, we've gone back and forth on whether we need a git rewrite or not. We came up with a lot of numbers: not doing the rewrite would attribute 10%-12% of the codebase to the wrong author, and hence a rewrite is our only option. — Karthik, Sep 29 '14 at 21:09
I've heard a lot about BFG and considered it at one point, but the only thing that was stopping me was that it did not handle tree-filter which is what I think we need here. — Karthik, Sep 29 '14 at 21:10
There's no "out of the box" way to resume a filter-branch, but in theory it could be done (copy the filter-branch script, hack on it quite a bit, etc :-) ). It's probably not been done because rewriting history is painful even if it all works perfectly. — torek, Sep 29 '14 at 23:28

score 1 · Accepted Answer · answered Sep 29 '14 at 18:43

Re 1: The --tree-filter does not require a separate commit: it simply dumps the tree corresponding to some commit into a temporary directory, runs your filter, and then takes the resulting directory as the new tree for the new commit. All alterations, including files created or removed, result in a different "new" commit, and as the manual page notes, .gitignore and all other ignore rules are not used (so if you create a .bak file or whatever, and would normally just .gitignore it, you must remove it manually in your tree-filter).

All of this work is done in a sub-directory of git's base "rewrite" temporary directory, which you can set with -d but defaults to .git-rewrite. (The sub-directory for filters—all of them, including the tree filter—is $tempdir/t, but that's not supposed to be relevant.) It is also all done with a special temporary index (staging area) file ($tempdir/index).

Note that the entire temporary directory is removed by the time git filter-branch exits.
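A minimal sketch of a tree filter along these lines follows. The script path, the `run_rewrite` wrapper, and the use of `expand -t 4` as a placeholder formatter are all assumptions for illustration. The filter uses only relative paths, so it operates on the temporary tree rather than the real working copy, and it removes its own scratch files because ignore rules do not apply.

```shell
#!/bin/sh
# Assumed location for the filter script; an absolute path is used so that
# filter-branch can find it from inside its temporary directory.
filter_script="${TMPDIR:-/tmp}/reformat-java.sh"

cat > "$filter_script" <<'EOF'
#!/bin/sh
# Runs with the temporary tree as the current directory.
find . -type f -name '*.java' | while IFS= read -r f; do
    # Placeholder formatter: expand tabs to four spaces.
    # Writing back with cat keeps the original file mode.
    expand -t 4 "$f" > "$f.tmp" && cat "$f.tmp" > "$f"
    # Always remove the scratch file, or it would end up in the new commit.
    rm -f "$f.tmp"
done
EOF

# Call this from the top level of the repository to rewrite all refs.
run_rewrite() {
    git filter-branch --tree-filter "sh '$filter_script'" -- --all
}
```

No explicit `git commit` appears anywhere: filter-branch builds each new commit from whatever the temporary tree contains once the filter returns.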

Re 2: Yes, it's possible to save the to-be-filtered ID: it's in $GIT_COMMIT (an environment variable) for the duration of all the filter runs. (Since the filters are mostly run through eval, you can even modify the environment to pass additional variables or change some; see the filter-branch script.)
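As a rough sketch of recording that ID, the filter text below appends `$GIT_COMMIT` to a log file for every commit processed. `PROGRESS_LOG`, `TREE_FILTER`, and `run_rewrite` are hypothetical names, and the log location is an assumption. Recording the IDs does not by itself make filter-branch resumable; it only tells you how far a failed run got.

```shell
#!/bin/sh
# Assumed log location. It must live outside the rewrite temporary
# directory, which is removed when filter-branch exits.
PROGRESS_LOG="${PROGRESS_LOG:-${TMPDIR:-/tmp}/filter-branch-progress.log}"
export PROGRESS_LOG

# Filter text. filter-branch evals it once per commit, with GIT_COMMIT set
# to the ID of the original commit being rewritten.
TREE_FILTER='printf "%s\n" "$GIT_COMMIT" >> "$PROGRESS_LOG"'

# Call this from the top level of the repository.
run_rewrite() {
    git filter-branch --tree-filter "$TREE_FILTER" -- --all
}
```

In practice the formatter command would be appended to `TREE_FILTER`. Logging after the formatter instead of before it would make the last entry the last commit whose filter finished, rather than the one that was in progress.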

Re 3: Essentially, the difference between --index-filter and --tree-filter is that --tree-filter extracts the tree into a temporary directory, runs your filter, then rolls up the (potentially modified) tree to make the new tree for the new commit. By contrast, --index-filter loads the tree into the index file; runs your filter, which can modify the index; then uses the resulting index to make the new tree for the new commit.

In other words, the tree filter actually unpacks and repacks the index. This is why the index filter is faster: it skips the unpack/repack step. If you must modify actual files, it's clearly simpler to just unpack all of them, modify all of them, and repack all of them. You could gain some speed if many files won't be modified, by unpacking just the interesting ones, modifying those, and repacking the modified result, but to do that you need a fair bit of gritty low-level git knowledge. (It's easy to git checkout and git add each file as you go, but you must also find which files are to be modified.)
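The selective approach can be sketched with git plumbing commands. This is an illustration under assumptions, not a tested production filter: `reformat_java_in_index` and `format_stdin` are hypothetical names, `expand -t 4` stands in for a real formatter that can read stdin, and paths that git would quote in `ls-files` output (unusual characters) are not handled.

```shell
#!/bin/sh
# Hypothetical formatter stand-in: reads stdin, writes formatted text to stdout.
format_stdin() {
    expand -t 4
}

# Rewrites every .java entry in the current index without checking out files.
# Under --index-filter, GIT_INDEX_FILE already points at the temporary index.
reformat_java_in_index() {
    git ls-files -s -- '*.java' |
    while read -r mode blob stage path; do
        # Format the blob content and store the result as a new blob.
        new_blob=$(git cat-file blob "$blob" | format_stdin |
            git hash-object -w --stdin) || return 1
        # Point the index entry at the new blob, keeping mode and path.
        git update-index --cacheinfo "$mode,$new_blob,$path" || return 1
    done
}
```

Saved as a script that ends by calling `reformat_java_in_index`, this could be passed to `git filter-branch --index-filter` by absolute path. Non-Java files are never unpacked, which is where the speed gain over the tree filter would come from.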
