I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.
We'd also like to convert our source code encoding (lots of German comments and string literals with umlauts) from ISO-8859-1 to UTF-8 (all non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now, since everybody needs to clone again anyway. However, I haven't found a good approach for it.
- I've tried the `git filter-branch --tree-filter ...` approach from this comment on SO. While this seems ideal, the size of the repository (about 200,000 commits, 18,000 code files) means it would take much more time than the weekend I have. I've tried running it in a heavily optimized version, where the list of files is chunked and the sublists are converted in parallel with GNU parallel, straight from a 64 GB tmpfs volume on a Linux VM with 72 cores, and it would still take several days.
- Alternatively, I've tried the simple approach of performing the conversion on each active branch individually and committing the changes. The result is not satisfying, because I then almost always get conflicts when merging or cherry-picking pre-conversion commits.
- Now I'm running approach 1 again, not rewriting the complete history of all branches (`--all` as `<rev-list>`) but only the commits reachable from the currently active branches and not reachable from some past commit that is (hopefully) a predecessor of all current branches (`branch-a branch-b branch-c --not old-tag-before-branch-a-b-c-forked-off` as `<rev-list>`). It's still running, but I fear I can't really trust the results, as this seems like a very bad idea.
- We could just switch the encoding in the master branch with a normal commit, as in approach 2, but again this would make cherry-picking fixes from/to master a disaster. It would also introduce lots of encoding problems, because developers would surely forget to change their IDE settings when switching between master and non-converted branches.
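
For reference, the per-commit conversion step that approach 1 runs on each checked-out tree looks roughly like the sketch below. This is not my exact script (the real one chunks the file list and uses GNU parallel); the function name `transcode_java_files` is made up for illustration, and it assumes that any `.java` file that already decodes cleanly as UTF-8 can be left alone. A Latin-1 file whose bytes happen to form valid UTF-8 would be skipped by that check, which I assume is rare enough for our code.

```python
from pathlib import Path


def transcode_java_files(root: Path) -> list[Path]:
    """Re-encode *.java files under root from ISO-8859-1 to UTF-8 in place.

    Hypothetical helper for illustration. Returns the files that were rewritten.
    """
    changed = []
    for path in sorted(root.rglob("*.java")):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            # Pure ASCII and already-converted files decode cleanly as UTF-8,
            # so they are left untouched (this keeps the step idempotent).
            raw.decode("utf-8")
            continue
        except UnicodeDecodeError:
            pass
        # Every byte sequence is valid ISO-8859-1, so this decode cannot fail.
        path.write_bytes(raw.decode("iso-8859-1").encode("utf-8"))
        changed.append(path)
    return changed
```

Calling `transcode_java_files(Path("."))` from the command passed to `--tree-filter` would convert the tree of the commit currently being rewritten; files with other extensions are never opened.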
So right now, I somehow feel the best solution could be to just stick to ISO-8859-1.
Does anyone have an idea? Someone mentioned that reposurgeon might be able to do basically approach 1 using its `transcode` operation, with much better performance than `git filter-branch --tree-filter ...`, but I have no clue how that works.