
I'm working on a large legacy Java code base that was migrated from SVN to Git four years ago.

Knowledge of Git's configuration options isn't widespread in our developer team (I count myself in here, lol), and some things were probably done or thought through in a half-assed way.

All developers work on Windows machines; the build runs on a Unix machine.

The assumption was that we would configure Git so that Java source files are checked out as windows-1252 but committed as UTF-8.

On the repo side we only have a .gitattributes file for auto-conversion from CRLF to LF, with two exceptions: the property files, which are read by the Unix build machine and are forced to LF, and the Windows batch scripts, which are forced to CRLF.

When I had to fix another developer's botched merge back in April, it seems that somehow all German umlauts and a lot of special characters in the affected Java source files (in hard-coded strings and comments) were changed to "ue". Several months and four releases later, a user has now noticed this in the GUI in the production environment (could be worse, I guess).
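
For illustration, here is a minimal sketch of how umlauts typically get damaged when a file is written with one charset and read with another. It is not a reconstruction of what happened in the merge; the class and method names (`EncodingMismatchDemo`, `reinterpret`) are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Encode the text with one charset, then decode the same bytes with another.
    // This mimics a tool that reads a file with the wrong encoding.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // The umlaut is written as an escape so this file's own encoding does not matter.
        String original = "M\u00fcller";

        // UTF-8 bytes read as windows-1252: the umlaut turns into two characters.
        String utf8ReadAs1252 = reinterpret(original, StandardCharsets.UTF_8, WINDOWS_1252);

        // windows-1252 bytes read as UTF-8: the umlaut byte is invalid and
        // is replaced by the Unicode replacement character U+FFFD.
        String cp1252ReadAsUtf8 = reinterpret(original, WINDOWS_1252, StandardCharsets.UTF_8);

        System.out.println(utf8ReadAs1252.length());                // one char longer than the original
        System.out.println(cp1252ReadAsUtf8.indexOf('\uFFFD') >= 0); // replacement character present
    }
}
```

Once a file damaged in the second way is saved again, the original byte is gone, so the damage cannot be undone by converting back.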

Right now we're struggling to find the right settings to get the encoding correct on both sides, so we can be sure it won't break again and all source files behave the same way on every machine.

Now what I would like to know is:

  • Does it even make sense to use windows-1252 locally on our developer machines? Wouldn't it be easier to just use UTF-8 throughout the whole chain?
  • Is this something to configure in the .gitattributes file in our repo as well, or is it local Git configuration? Or both? And how?

Edit:

I have now tried fixing it forward by converting all .java source files to UTF-8 using this approach, and what can I say... This whole codebase is so cursed by the 20+ years it has existed.
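
A conversion of this kind can be sketched in plain Java as follows. This is only an illustration under the assumption that every `.java` file is currently valid windows-1252; it is not the approach referenced above, and `SourceTranscoder` and `transcode` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SourceTranscoder {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Rewrite a single file in place: decode its bytes with `from`, encode with `to`.
    static void transcode(Path file, Charset from, Charset to) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        String text = new String(raw, from);
        Files.write(file, text.getBytes(to));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SourceTranscoder <source-root>");
            return;
        }
        Path root = Paths.get(args[0]);
        List<Path> javaFiles;
        // Collect the file list first, then close the directory stream before rewriting.
        try (Stream<Path> paths = Files.walk(root)) {
            javaFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .collect(Collectors.toList());
        }
        for (Path file : javaFiles) {
            // Assumption: the file is windows-1252 right now. Running this on a file
            // that is already UTF-8 would double-encode its non-ASCII characters.
            transcode(file, WINDOWS_1252, StandardCharsets.UTF_8);
        }
    }
}
```

Note that Java identifiers may contain umlauts, so a build that fails after such a conversion may simply mean the compiler is still reading the sources with a different charset; `javac` has an `-encoding` option (for example `-encoding UTF-8`) that controls this.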

The build failed after the conversion because there are German umlauts in method and variable names and enum constants, which javac apparently can no longer process. I then started replacing all of them with the corresponding digraphs "ae", "oe" and "ue", only to notice that renaming the enum constants changes the semantics of the code, so a database migration could become necessary.

It could possibly work some other way, but I already had to change so much code without even being halfway through that the risks probably outweigh the benefits, taking into account that we only had one obvious bug popping up in production due to the codepage chaos in the first place.

  • Without an assessment of the extent of the problem, it's not possible to articulate a recommendation. Are the problematic strings in localization resources, or in source files? If the latter, maybe the problematic strings should be refactored to use pure-ASCII escape codes, and remove the problem. If the former, perhaps it would make more sense to recode the entire code base to UTF-8 in one big commit. Either way, the developers should probably be educated to understand the difference. – tripleee Sep 20 '22 at 10:18
  • The choice of Windows as your developer platform is an unfortunate complication, especially if you need to process source files on the command line. A modern IDE can probably cope with UTF-8 source files just fine; but the native CMD prompt has limited support for this. – tripleee Sep 20 '22 at 10:28
  • Problem only affects hard-coded strings and comments in Java source files. I guess I made a mistake when I tried to fix the mishap merge, but I can't 100% recall what I did to "fix" it. I think I copied the commit-state before the mishap merge over the false state after that merge, except for the files that were actually intended to be merged by my colleague. And +1 about the choice of our developer platform; that's why I asked if we weren't better off using UTF-8 for source files in general, because all our devs use IntelliJ Idea which should handle it just fine. – coding_with_cats Sep 20 '22 at 10:31
  • Git does not exert any control over the character set, but changing all the files with `iconv` and committing them will easily take you to UTF-8 or whichever character set makes sense going forward. Java allegedly has issues with UTF-8, too; I'm not familiar enough with the platform to say whether that's going to be a problem. – tripleee Sep 20 '22 at 10:36
  • Thanks for the recommendation, I might try setting up a branch with all files converted to UTF-8 and see if the build works that way. – coding_with_cats Sep 20 '22 at 10:41
  • [1] Re _"Does it even make sense to use windows-1252 locally..."_, Windows has an optional setting that may be relevant to your problem: **Control Panel** > **Region** > select **Administrative** tab > Click **Change system locale...** > the **Region Settings** window has a checkbox: _Beta: Use Unicode UTF-8 for worldwide language support._ Is that option enabled? [2] Also see [What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?](https://stackoverflow.com/q/56419639/2985643) [3] What is the Windows system locale for your team members? – skomisa Sep 20 '22 at 17:44
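
The "pure-ASCII escape codes" idea from the first comment can be sketched like this: every non-ASCII character in a string is replaced by its Java Unicode escape, so the source file no longer depends on the encoding it is read with. This is a rough sketch for string contents only, and `UnicodeEscaper` and `escapeNonAscii` are names invented for the example.

```java
public class UnicodeEscaper {
    // Replace every char outside the ASCII range with a backslash, the letter u
    // and four hex digits, which is the escape form javac accepts in source files.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is itself written with escapes, so this file is pure ASCII.
        System.out.println(escapeNonAscii("Gr\u00fc\u00dfe"));
    }
}
```

Because the escapes are processed before the compiler tokenizes the source, they work in identifiers as well as in string literals, at the cost of readability.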
