I'm working on a large legacy Java codebase that was migrated from SVN to Git four years ago.
Knowledge of Git's configuration options isn't widespread in our developer team (I count myself in here, lol), and some things were probably done hastily or only half thought through.
All developers work on Windows machines; the build runs on a Unix machine.
The assumption was that we would configure Git so that Java source files are checked out as windows-1252 but committed as UTF-8.
On the repo side we only have a .gitattributes file that auto-converts CRLF to LF, with two exceptions: the property files, which are read by the Unix build machine and are forced to LF, and the Windows batch scripts, which are forced to CRLF.
When I had to fix another developer's botched merge back in April, it seems that somehow all German umlauts and a lot of special characters in the affected Java source files (in hard-coded strings and comments) were changed to "ue". Several months and four releases later, a user has now noticed this in the GUI in the production environment (could be worse, I guess).
Right now we're struggling to find the right settings to get the encoding correct on both sides, so that we can be sure it won't break again and all source files behave the same way on every machine.
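To illustrate what an encoding mismatch between the two sides can do to an umlaut, here is a minimal Java sketch. It is not a reconstruction of what actually happened in the merge; the class and method names are made up for illustration, and the umlaut is written as a Unicode escape so the file compiles regardless of its own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Encodes the text with one charset and decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // the letter u with diaeresis

        // Written as UTF-8 (bytes C3 BC) but read as windows-1252: two unrelated characters.
        String mojibake = reinterpret(umlaut, StandardCharsets.UTF_8, WINDOWS_1252);

        // Written as windows-1252 (byte FC) but read as UTF-8: the byte is invalid
        // in UTF-8, so the decoder substitutes the replacement character U+FFFD.
        String replaced = reinterpret(umlaut, WINDOWS_1252, StandardCharsets.UTF_8);

        System.out.println(mojibake.equals("\u00C3\u00BC")); // true
        System.out.println(replaced.equals("\uFFFD"));       // true
    }
}
```

The second case is the lossy one: once a tool has written U+FFFD back to disk, the original character cannot be recovered from the file alone.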
Now what I would like to know is:
- Does it even make sense to use windows-1252 locally on our developer machines? Wouldn't it be easier to just use UTF-8 throughout the whole chain?
- Is this something to configure in the .gitattributes file in our repo as well, or is it local Git configuration? Or both? And how?
Edit:
I have now tried fixing it forward by converting all .java source files to UTF-8 using this approach, and what can I say... This whole codebase is cursed by the 20+ years it has existed.
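The linked approach isn't reproduced here. As a rough sketch of what such a conversion boils down to, assuming every file really is windows-1252 beforehand (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SourceReencoder {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Re-encodes bytes that are ASSUMED to be windows-1252 text as UTF-8. */
    static byte[] toUtf8(byte[] windows1252Bytes) {
        String text = new String(windows1252Bytes, WINDOWS_1252);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Rewrites a single file in place. */
    static void convertFile(Path file) throws IOException {
        Files.write(file, toUtf8(Files.readAllBytes(file)));
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of one file to convert.
        for (String arg : args) {
            convertFile(Paths.get(arg));
        }
    }
}
```

Note that this is not idempotent: running it on a file that is already UTF-8 would double-encode every non-ASCII character, so the assumption about the input encoding matters.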
The build failed after the conversion because there are German umlauts in method names, variable names and enum constants, which javac now probably can't process anymore. I then started replacing all of them with the corresponding digraphs "ae", "oe" and "ue", only to notice that renaming the enum constants changes the semantics of the code, and a database migration could become necessary.
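For what it's worth, my guess about javac may be wrong: umlauts are letters, and Java accepts letters from any script in identifiers. So the failure might instead come from javac reading the now UTF-8 files with a different encoding (javac has an `-encoding` option for this), which I have not verified for our build. A small sketch that checks the first part, with a hypothetical class name and the names written as Unicode escapes:

```java
final class UmlautIdentifierCheck {
    /** Returns true if the name is a syntactically legal Java identifier (keywords are not checked). */
    static boolean isLegalIdentifier(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.codePointAt(0))) {
            return false;
        }
        // All remaining code points must be valid identifier parts.
        return name.codePoints().skip(1).allMatch(Character::isJavaIdentifierPart);
    }

    public static void main(String[] args) {
        // A lowercase name containing o-umlaut and sharp s.
        System.out.println(isLegalIdentifier("gr\u00F6\u00DFe"));
        // An uppercase, enum-style name starting with U-umlaut.
        System.out.println(isLegalIdentifier("\u00DCBERSCHRIFT"));
        // A name starting with a digit, which is never legal.
        System.out.println(isLegalIdentifier("1abc"));
    }
}
```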
It could possibly work some other way, but I had already changed so much code without even being halfway through that the risks probably outweigh the benefits. After all, the codepage chaos has only caused one obvious bug in production so far.