
I'm going to convert a large Mercurial project to Git this weekend using fast-export. I've tested that several times, and the results are good.

We'd also like to switch our source code encoding (lots of German comments/string literals with umlauts) from ISO-8859-1 to UTF-8 (all other non-Java files in the repo should stay as-is), and the Git migration gives us a chance to do it now since everybody needs to clone again anyway. However, I can't find a good approach for it.

  1. I've tried the git filter-branch --tree-filter ... approach from this comment on SO. However, while this seems ideal, due to the size of the repository (about 200,000 commits, 18,000 code files) it would take much more time than just the weekend I have. I've tried running it (in a heavily optimized version where the list of files is chunked and the sublists are converted in parallel using GNU parallel) straight from a 64 GB tmpfs volume on a Linux VM with 72 cores, and it would still take several days...
  2. Alternatively, I've tried the simple approach where I perform the conversion simply on any active branch individually and commit the changes. However, the result is not satisfying because then I almost always get conflicts when merging or cherry-picking pre-conversion commits.
  3. Now I'm running approach 1 again, but not trying to rewrite the complete history of all branches (--all as <rev-list>), just all commits reachable from the current active branches and not reachable from some past commit which is (hopefully) a predecessor of all current branches (branch-a branch-b branch-c --not old-tag-before-branch-a-b-c-forked-off as <rev-list>). It's still running, but I fear that I can't really trust the results as this seems like a very bad idea.
  4. We could just switch the encoding in the master branch with a normal commit as in approach 2, but again this would make cherry-picking fixes from/to master a disaster. And it would introduce lots of encoding problems because developers would surely forget to change their IDE settings when switching between master and non-converted branches.

So right now, I somehow feel the best solution could be to just stick to ISO-8859-1.

Does anyone have an idea? Someone mentioned that reposurgeon might be able to do basically approach 1 using its transcode operation, with much better performance than git filter-branch --tree-filter ..., but I have no clue how that works.

Tassilo Horn
  • Note: if you have issues with commit message encoding (in addition to source code file encoding), consider Git 2.23 (Q2 2019): see "[Migrate from CVS to Git without losing history](https://stackoverflow.com/a/56604301/6309)". – VonC Jun 14 '19 at 19:46

3 Answers


A tree filter in git filter-branch is inherently slow. It works by extracting every commit into a full-blown tree in a temporary directory, letting you change every file, and then figuring out what you changed and making the new commit from every file you left behind.

If you're exporting and importing through fast-export / fast-import, that would be the time to convert the data: you have the expanded data of the file in memory, but not in file-system form, before writing it to the export/import pipeline. Moreover, hg-fast-export is driven by a shell script and the actual exporter, hg-fast-export.py, is a Python program, so it's trivial to insert filtering there. The obvious place would be right after the file's data are read from the Mercurial file context (d = ctx.filectx(file).data()): just re-encode d.
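
A minimal sketch of such a filter, distilled from the comment thread below (the helper name reencode and the exact hook point are illustrative rather than part of hg-fast-export itself, and chardet is a third-party encoding guesser):

import chardet  # third-party encoding guesser

def reencode(filename, d):
    # Transcode Java sources to UTF-8; leave every other blob untouched.
    if d is None or not filename.endswith('.java'):
        return d
    enc = chardet.detect(d)['encoding']
    if enc in (None, 'ascii', 'utf-8'):
        return d
    # 'replace' turns undecodable bytes into U+FFFD instead of aborting,
    # which makes files with broken encodings easy to spot afterwards.
    return d.decode(enc, 'replace').encode('utf-8')

# in hg-fast-export.py, right after d = ctx.filectx(file).data():
#     d = reencode(file, d)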

torek
  • Great! At that point I just need to check if it's actually a Java file because only those should be converted. – Tassilo Horn Jun 08 '18 at 14:55
  • My Python foo is very very low. So do you think this might do the trick (using chardet to detect the current encoding): `d=ctx.filectx(file).data() if (d != None) and (rx.match(file)): # rx is re.compile(".*\\.java$") enc=chardet.detect(d)['encoding'] if (enc != "ascii") and (enc != 'utf-8'): d=u''.join(d.decode(enc)).encode('utf8')` – Tassilo Horn Jun 08 '18 at 16:13
  • I've never used `chardet`, but it seems like it would do the trick. You could simplify the file-name test, just check for `.endswith('.java')`. The last line would just be `d = d.decode('enc').encode('utf8')`. – torek Jun 08 '18 at 17:30
  • Awesome! Thanks a lot. I've modified it a bit so that it also gracefully handles files which have a broken encoding where chardet guesses something very unlikely. In those cases I simply convert to UTF-8 with `'replace'` as the error handler, which should insert the Unicode replacement character instead of producing garbage. At least that's easy to spot and fix afterwards. – Tassilo Horn Jun 08 '18 at 17:41
  • Oops, I notice I put `enc` in quotes in my comment above (hazard of typing things into comments like this, I can't fix it now :-) ). – torek Jun 08 '18 at 17:48
  • No problem, I still got it. – Tassilo Horn Jun 08 '18 at 17:51
  • The conversion is still running but I guess it'll finish within the next 10 hours or so. In the meantime, I've also run it on a copy of the same repository after nuking `.hg/` and initializing it anew, adding a handful of commits on top so that the history is not 200000 revisions but just 5. The results of this test were absolutely fantastic, so I'll mark your solution as the accepted answer. Thanks a ton! :+1: (I'll also contribute that back to fast-export, of course. Maybe that could become a standard feature because I guess my task is not that uncommon.) – Tassilo Horn Jun 09 '18 at 08:21
  • Our git and utf-8 migration using this approach was a great success. The results are extremely good. It ran for 56 hours, though, where it used to run for about 3-4 hours without the encoding part. – Tassilo Horn Jun 12 '18 at 20:14
  • Yikes, that's a lot of hours of transcoding :-) – torek Jun 12 '18 at 20:41

You might consider using git filter-branch --index-filter as opposed to the --tree-filter you have been using. The idea is that with --index-filter there is no checkout step (i.e. the worktree is not (re-)populated at all on each iteration).

So you might consider writing a filter for git filter-branch --index-filter which would use git ls-files—something like this:

  1. Call git ls-files --cached --stage and iterate over each entry.

    Consider only those which have the 100644 file mode—that is, are normal files.

  2. For each entry run something like

    sha1=`git show ":0:$filename" \
        | iconv -f iso8859-1 -t utf-8 \
        | git hash-object -t blob -w --stdin`
    git update-index --cacheinfo "100644,$sha1,$filename" --info-only
    
  3. Rinse, repeat.
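
Put together, an untested sketch of the whole rewrite might look like this (it assumes everything is ISO-8859-1 and, matching the question, only touches *.java files; a per-file encoding detection step, as in the last answer below, could replace the hard-coded source encoding):

git filter-branch --index-filter '
    git ls-files --cached --stage | while read mode sha stage filename; do
        # only regular (100644) files, and only Java sources
        case "$mode,$filename" in
            100644,*.java)
                newsha=$(git show ":0:$filename" \
                    | iconv -f iso8859-1 -t utf-8 \
                    | git hash-object -t blob -w --stdin)
                git update-index --cacheinfo "100644,$newsha,$filename"
                ;;
        esac
    done
' -- --all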

An alternative approach would be to attack the problem from a different angle: the format of the streams generated by git fast-export and consumed by git fast-import is plain text¹ (just pipe your exporter's output to less or another pager and see for yourself).

You could write a filter in your favourite PL which would parse the stream and re-encode any data chunks. The stream is organized in such a way that no SHA-1 hashes are used, so you may re-encode as you go. The only apparent problem is that the data chunks carry no information about which file they will represent in the resulting commit (if any). So if you have non-text files in your history, you might need to either resort to guessing based on the contents of each data blob, or make your processor more complicated: have it remember the blobs it has seen and decide which of them to re-encode only after it has seen the commit record which assigns file names to (some of) those blobs.
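
For illustration, a bare-bones version of such a filter (a hypothetical sketch, not a finished tool: it blindly treats every data block, including commit messages, as ISO-8859-1 text, so the per-file bookkeeping described above is still missing):

# reencode_stream.py: git fast-export --all | python3 reencode_stream.py | git fast-import
import sys

inp, out = sys.stdin.buffer, sys.stdout.buffer

for line in iter(inp.readline, b''):
    if line.startswith(b'data '):
        size = int(line[5:])
        blob = inp.read(size)
        # ISO-8859-1 maps every byte value, so this round trip never raises;
        # binary blobs would need the filename-tracking logic mentioned above.
        blob = blob.decode('iso8859-1').encode('utf-8')
        out.write(b'data %d\n' % len(blob))
        out.write(blob)
    else:
        out.write(line)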


¹ Documented in git-fast-import(1)—run git help fast-import.

kostix
  • That's also a good idea, but as time presses, I'm using the approach suggested by torek of modifying `hg-fast-export.py` to do the conversion there, because my python foo is still better than my "shell and rather exotic git commands" foo. ;-) But I'll come back to that next time! WRT the streams idea: my problem is that sadly we don't have all files encoded in ISO-8859-1 but a wild mixture of ISO-8859-1, ISO-8859-15, UTF-8, cp1252, and simply broken. So at the very least I have to do it on a per-file basis where I can first guess and then convert from the guessed encoding. – Tassilo Horn Jun 08 '18 at 17:50

I had the exact same problem, and my solution is based on @kostix's answer of using the --index-filter option of filter-branch as the basis, but with some additional improvements.

  1. Use git diff --name-only --staged to detect the contents of the staging area
  2. Iterate over this list and filter for:
    1. git ls-files $filename returns something, i.e., it isn't a deleted file
    2. the result of git show ":0:$filename" | file - --brief --mime-encoding isn't binary (i.e., it is a text file) and isn't already utf-8
  3. Use the detected mime encoding for each file
  4. Use iconv to convert the files
  5. Detect the file mode with git ls-files $filename --stage | cut -c 1-6

This is what my bash function looks like:

changeencoding() {
    for filename in `git diff --name-only --staged`; do
        # Only if file is present, i.e., filter deletions
        if [ `git ls-files $filename` ]; then
            local encoding=`git show ":0:$filename" | file - --brief --mime-encoding`
            if [ "$encoding" != "binary" -a  "$encoding" != "utf-8" ]; then
                local sha1=`git show ":0:$filename" \
                    | iconv --from-code=$encoding --to-code=utf-8 \
                    | git hash-object -t blob -w --stdin`
                local mode=`git ls-files $filename --stage | cut -c 1-6`
                git update-index --cacheinfo "$mode,$sha1,$filename" --info-only
            fi
        fi
    done
}
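
One way to run this over the whole history (a hypothetical invocation; the script path is an assumption, and the function must be reachable from filter-branch's shell) is to source it inside the index filter:

# assuming the function above is saved in /path/to/changeencoding.sh
git filter-branch --index-filter '. /path/to/changeencoding.sh && changeencoding' -- --all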
euluis