I have a repository where several files were checked in from Windows and have Unicode characters in their FILENAMES, for example AgêBean.java, GûBean.java, LêgbaBean.java, and XêviosoBean.java. When these files are checked out on a CentOS 7 system, the bytes that make up the filenames are interpreted as ISO-8859-1. This breaks tools such as the Java compiler. For example, Java won’t compile the files above, because the Unicode identifier of the class, e.g. “AgêBean”, does not match the filename as decoded in ISO-8859-1, which the compiler sees as something like “AgÃªBean.java”. The short, ugly workaround is to rename the files, but if the renamed files are checked in, the same problems appear on the Windows side.
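To make the mismatch concrete, here is a minimal Python sketch of what happens when UTF-8 filename bytes are read as ISO-8859-1. The helper names `misread_as_latin1` and `repair` are hypothetical and only for illustration; this is not how Git or javac work internally, just a model of the decoding mismatch described above.

```python
def misread_as_latin1(name):
    """Encode a name as UTF-8, then decode those bytes as ISO-8859-1.

    This models a tool that assumes ISO-8859-1 while the bytes on disk
    are actually UTF-8.
    """
    return name.encode("utf-8").decode("iso-8859-1")


def repair(garbled):
    """Reverse the misreading: recover the bytes, then decode as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")
```

For instance, `misread_as_latin1("AgêBean.java")` returns `"AgÃªBean.java"`, because "ê" is the two bytes 0xC3 0xAA in UTF-8, and each byte is a separate character in ISO-8859-1. Applying `repair` to that result gives back the original name, which suggests the bytes on disk are intact and only their interpretation is wrong.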

So what are some better solutions? I can imagine a few, but I don’t know how to do any of them, and Google is not yet being helpful:

A) Re-configuring my CentOS filesystem so that all filenames are UTF-8 (or UTF-16) encoded.

B) Configuring git on Linux to understand that the filenames in the repository are encoded UTF-8, but the local system is ISO-8859-1, so all filenames need to be converted when checked in or out.

C) Configuring java (and terminals, and editors) on Linux to understand that the filenames under this directory are UTF-8 encoded, so each is decoded correctly.

I’d be happiest with solution “A”, but so far I have not found how to do that. I hope it’s not compiled into the CentOS 7 (or RHEL 8) kernel.

Charlweed
  • BTW, the command `locale charmap` on my linux system returns "ISO-8859-1", I'm looking for a way to change that with settings under /etc, but I'm not confident that's a fruitful approach. – Charlweed Apr 09 '21 at 22:00
  • `to your conclusions as well` What filesystem is that? Why does the "filesystem" care about filename encoding? Filesystems only care about zero-terminated strings; the kernel does not care about Unicode. `When these files are checked out` I would blame the "checking out" part. How do you check out? What about something like `export LC_ALL=C` and then checking out? `I'm looking for a way to change that with settings under /etc,` see `/etc/locale.conf`. – KamilCuk Apr 09 '21 at 22:19
  • See [Should you use international identifiers in Java/C#?](https://stackoverflow.com/q/61615/1256452) (the answer is "no", for the reasons you are discovering). – torek Apr 10 '21 at 01:15
  • Meanwhile: Linux already assumes and uses UTF-8 (at the system call interface). It doesn't try to encode and decode the byte strings though, except when it really has to, e.g., when dealing with an NTFS volume. The issues you're seeing are because the *data inside files* has encodings as well; Git never touches those by default, although Git now has a `working-tree-encoding` attribute (q.v.). – torek Apr 10 '21 at 01:17
  • I edited my bash profile, and so far, adding the exports `export LANG=en_US.UTF-8` and `export LANGUAGE="en_US.UTF-8"` seems to be working. The compilers no longer complain, but I have not tried re-checking out the names with UTF-8 encoding. I suspect (but have not proven) that when LANG=en_US.UTF-8, most apps decode ISO-8859-1 correctly. – Charlweed Apr 10 '21 at 23:31
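One way to check the point made in the comments, that the filesystem stores plain bytes and only the locale decides how they are shown, is to list a directory as raw bytes. The following is a rough sketch, assuming Python 3 on Linux; `describe_names` and `demo` are hypothetical helper names, and the temporary directory stands in for a real checkout.

```python
import os
import tempfile


def describe_names(directory):
    """Return (raw_bytes, utf8_text_or_None, latin1_text) per directory entry."""
    rows = []
    # Passing a bytes path makes os.listdir return the raw on-disk names.
    for raw in sorted(os.listdir(os.fsencode(directory))):
        try:
            as_utf8 = raw.decode("utf-8")
        except UnicodeDecodeError:
            as_utf8 = None  # the name is not valid UTF-8
        # ISO-8859-1 maps every byte to a character, so this never fails.
        rows.append((raw, as_utf8, raw.decode("iso-8859-1")))
    return rows


def demo():
    """Create one file with a UTF-8 encoded name and describe it."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_name = "GûBean.java".encode("utf-8")
        path = os.path.join(os.fsencode(tmp), raw_name)
        with open(path, "wb"):
            pass  # an empty file is enough for the demonstration
        return describe_names(tmp)
```

If the second element of a row is the expected name (here "GûBean.java") while the third looks like "GÃ»Bean.java", the stored bytes are UTF-8 and only the locale's interpretation differs. That would be consistent with the observation that switching `LANG` to `en_US.UTF-8` made the compilers stop complaining without re-checking out the files.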

0 Answers