39

I have a Java EE project in which I use message properties files. The encoding of those files is set to UTF-8. In the files I use German umlauts like ä, ö, ü. The problem is that sometimes those characters are replaced with Unicode escapes like \uFFFD\uFFFD, but not every character is affected. I now have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not at every occurrence of ä and ü.

The Git diff shows me something like this:

 mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
 mail.title=Einladungs-E-Mail
 box.preview=Vorschau
 box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
@@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen
 browser.selectImage=übernehmen
 browser.starImage=merken
 browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
 browser.clear_drop_box=löschen

Also, lines that I have not touched are changed. I don't understand why I get this behavior. What could be the cause of the above problem?

My system:

  • Antergos / Arch Linux

    • System encoding UTF-8

      Python 3.5.0 (default, Sep 20 2015, 11:28:25) 
      [GCC 5.2.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import sys
      >>> sys.getdefaultencoding()
      'utf-8'
      
  • Eclipse Mars 1

    • Text file encoding: UTF-8
    • Properties file encoding: UTF-8
  • Tomcat 8
  • Java JDK 8

If I use another editor like Atom to edit those message properties files, I don't run into this problem.

I also noticed in one case that if I copy the original value browser.searchForSimilarImages=ähnliche from the Git diff and paste it over the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche in Eclipse, then I have the correct umlauts in the message properties file.

BuZZ-dEE
  • some of the Unicode letters in esponal carries one additional padded character, I would recommend you to use special tools to convert all the letters to escaped string before paste inside the properties file. Otherwise use Java Code **new String(value.getBytes("ISO-8859-1"), "UTF-8");** where value is the properties value – Dickens A S Jun 30 '15 at 16:55
  • What special tool do you mean? How should I do `new String(value.getBytes("ISO-8859-1"), "UTF-8");` to have it correct in the properties file? – BuZZ-dEE Jun 30 '15 at 17:02
  • Because of the ISO-8859-1 problem I would recommend not use the default properties loader provided by Java. Replace the loading process so that it directly loads everything from UTF-8 files instead: http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-resource-properties-with-resourcebundle – Robert Jun 30 '15 at 17:13
  • My colleagues do not have this problem. I wonder why, and what the cause is. – BuZZ-dEE Jun 30 '15 at 17:25
  • properties files are defined to use ISO-8859-1 encoding. they shouldn't work at all if you use UTF-8, so I don't see the point of using such files. – eis Nov 18 '15 at 19:18
  • How is your Eclipse workspace encoding set? *Window > Preferences > General > Workspace > Text File Encoding*. It must be UTF-8. Answer of hagrawal definitely makes it worse. Please put back "Java Properties File" encoding to ISO-8859-1 and don't touch it. – BalusC Nov 22 '15 at 19:22
  • @BalusC You haven't provided your reasons for why "you think" that it's not good; just saying so is not at all sufficient. – hagrawal7777 Nov 22 '15 at 20:16
  • @BalusC It is set to [UTF-8](http://i.stack.imgur.com/MY5T3.png). – BuZZ-dEE Nov 24 '15 at 17:31
  • @eis As of Java 9+, properties files should be encoded in UTF-8: https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-974CF488-23E8-4963-A322-82006A7A14C7 – Rune Aamodt Oct 22 '20 at 10:02
  • @RuneAamodt yeah. though 1) this discussion happened 5 years ago, and 2) even Java 9+ still falls back to ISO-8859-1 if reading using UTF-8 does not work – eis Oct 24 '20 at 07:52
  • I wanted to update the information, because I was myself thrown off course (and wasted some time) by this thread before I found out elsewhere that things had changed. – Rune Aamodt Oct 24 '20 at 21:04

7 Answers

50

Root cause:

By default, Eclipse uses the ISO 8859-1 character encoding for properties files (read here), so if the file contains any character outside ISO 8859-1, it will not be processed as expected.
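To make the ISO 8859-1 assumption concrete, here is a minimal sketch on the Java side. It relies on the documented behavior of `Properties.load(InputStream)`, which decodes its input as ISO 8859-1. The class name `Latin1LoadDemo` and the helper `loadAsLatin1` are made up for illustration; the key and value come from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class, not part of any library.
final class Latin1LoadDemo {

    // Encodes one properties line as UTF-8 and loads it through the
    // InputStream overload, which decodes the bytes as ISO 8859-1.
    static String loadAsLatin1(String line, String key) {
        byte[] utf8Bytes = line.getBytes(StandardCharsets.UTF_8);
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(utf8Bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The o-umlaut is written as an escape so the source file encoding does not matter.
        String value = loadAsLatin1("browser.clear_drop_box=l\u00F6schen",
                                    "browser.clear_drop_box");
        // The two UTF-8 bytes of the umlaut come back as two separate Latin-1 characters.
        System.out.println(value);
    }
}
```

The loaded value has two characters (U+00C3 and U+00B6) where the single ö was, which is the classic "UTF-8 read as Latin-1" pattern. Note that this is not the same symptom as the \uFFFD in the question.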

Solution 1

If you use Eclipse, you will notice that it implicitly converts special characters into their \uXXXX equivalents. Try copying

会意字 / 會意字

into a properties file opened in Eclipse.

EDIT: As per comment from OP

Update the encoding in Eclipse as shown below. If you set the encoding to UTF-32, you can even see Chinese characters, which you generally cannot see.

How to change the encoding of properties files in Eclipse: see this Eclipse Bugzilla bug for more details; it discusses several other possibilities and in the end suggests what I have highlighted below.

Chinese characters can be seen in Eclipse after the encoding is set properly.

Solution 2

If the above doesn't work consistently for you (it does work for me and I never see encoding issues), then try an Eclipse plugin that handles the encoding of properties or other files, for example Eclipse ResourceBundle Editor or Extended Resource-Bundle editor.

I would recommend using Eclipse ResourceBundle Editor.

Solution 3

Another way to change the encoding of a file is the Edit --> Set Encoding option. It really matters because it changes the default character set and file encoding. Experiment by changing the encoding with Edit --> Set Encoding and then printing the following from Java: `System.out.println("Default Charset=" + Charset.defaultCharset());` and `System.out.println(System.getProperty("file.encoding"));`
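For convenience, the two print statements can be wrapped into a small runnable class. This is only a sketch; the class name `DefaultCharsetCheck` and the method `describe` are made up, and what it prints depends entirely on how the JVM was started.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for checking what the running JVM reports.
final class DefaultCharsetCheck {

    // Builds one line with the JVM default charset and the file.encoding property.
    static String describe() {
        return "Default Charset=" + Charset.defaultCharset()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it from inside Eclipse and again from a terminal; if the two outputs differ, the launch configuration or eclipse.ini is overriding the system default.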



As an aside: 1

Convert the properties file so that its content is compatible with the ISO 8859-1 character encoding by using native2ascii - Native-to-ASCII Converter

What native2ascii does: it converts every non-ASCII character into its \uXXXX equivalent. This is a good tool because you do not need to look up the \uXXXX equivalent of each special character.

Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
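If the native2ascii tool is not at hand, the same kind of escaping can be sketched in a few lines of Java. This is an illustration of the idea, not the tool's actual implementation; the class name `AsciiEscapeSketch`, the method `escapeNonAscii` and the lowercase hex digits are my own choices.

```java
// Hypothetical sketch of native2ascii-style escaping; not the real tool.
final class AsciiEscapeSketch {

    // Replaces every char outside the 7-bit ASCII range with a
    // backslash, the letter u and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The a-umlaut is written as an escape so the source file encoding does not matter.
        System.out.println(escapeNonAscii("\u00E4hnliche"));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (one per surrogate), which is also how Java properties files represent them.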


As an aside: 2

Every computer program, whether an IDE, application server, web server, browser, etc., understands only bits, so it needs to know how to interpret those bits to make sense of them, because depending on the encoding used, the same bits can represent different characters. That is where "encoding" comes into the picture: it gives each character a unique identifier so that all computer programs, across diverse operating systems, know exactly how to interpret it.

So, if you have written a file using some encoding scheme, let's say UTF-8, and then read it with any editor that is also running with UTF-8 as its encoding scheme, you can expect it to be displayed correctly.
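A small sketch of "same bits, different characters": the bytes that UTF-8 produces for ä are decoded once with UTF-8 and once with ISO 8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing one byte sequence under two decodings.
final class SameBytesDemo {

    // Encode as UTF-8, decode as UTF-8: the text survives unchanged.
    static String decodeAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Encode as UTF-8, decode as ISO 8859-1: every byte becomes one character.
    static String decodeAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String umlaut = "\u00E4"; // a-umlaut, two bytes (0xC3 0xA4) in UTF-8
        System.out.println(decodeAsUtf8(umlaut).length());   // one character
        System.out.println(decodeAsLatin1(umlaut).length()); // two characters
    }
}
```

Writer and reader have to agree on the encoding; the bytes on disk are identical in both cases.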

Please also read this answer of mine for more details, though from a browser-server perspective.

Community
hagrawal7777
  • I do not want to have things like `\uXXXX` in the properties file. I want to have the correct UTF-8 representation in the file. – BuZZ-dEE Jun 30 '15 at 17:06
  • @BuZZ-dEE I have edited my answer to address your concern. Chinese is an ideographic language; if you can see Chinese characters, then you can see almost everything. Please let me know if it doesn't help. – hagrawal7777 Jun 30 '15 at 17:35
  • That is already set to UTF-8, so why should I use UTF-32? My colleagues also use UTF-8 and they do not have this problem, so I think there must be another solution. – BuZZ-dEE Jun 30 '15 at 17:50
  • The characters you have shown fall under the "Latin-1 Supplement" Unicode block, and yes, it is covered by the UTF-8 encoding scheme. I demonstrated, as an example, that if you set UTF-32 then you can even see Chinese characters, which you cannot see if your encoding scheme is UTF-8. Now, for the problem you are facing: I think you may be editing your properties file in some editor other than Eclipse (which is set to UTF-8), probably in some diff software like WinMerge. So it may be getting corrupted there. – hagrawal7777 Jun 30 '15 at 17:59
  • No, I edit those files in Eclipse. Also, the problem does not occur if I use an editor like Gedit or Atom. – BuZZ-dEE Jun 30 '15 at 18:43
  • @BuZZ-dEE Have you got your answer or something you found helpful, if not then please write your answer so that other's can be benefited from it. stackoverflow.com/help/accepted-answer – hagrawal7777 Jul 01 '15 at 20:32
  • No, and also no answer helped to solve the problem, so I can not accept one. – BuZZ-dEE Jul 01 '15 at 20:37
  • OK, please do not forget to post your own answer if you are able to solve it. In my view, the problem may lie with your Eclipse installation, because your colleagues are fine with UTF-8 and I never saw erratic behavior after I set the Eclipse encoding. So maybe you can try downloading a fresh Eclipse installation, while also making sure that you are not editing your properties in any editor that doesn't support UTF-8, including the auto-merging software of SCM tools. – hagrawal7777 Jul 01 '15 at 20:52
  • Another important thing to try: a clean, new workspace as well; workspaces get corrupted quite often. – hagrawal7777 Jul 01 '15 at 20:54
  • Did you find a solution to this? – hagrawal7777 Sep 24 '15 at 22:49
  • The problem also exists in Eclipse Mars 1. – BuZZ-dEE Nov 18 '15 at 16:20
  • I am really not sure that Eclipse would have a problem if you have set the encoding correctly, because I have been using the same setup and I didn't find any problem. Do this little test: download Notepad++ if you don't have it, select ANSI from the encoding option in the menu bar, then enter some French characters and save the file. Do the same for another new file, but this time select UTF-8 as the encoding. Now open both files again, once with a UTF-8 editor and once with ANSI. **So, what matters is with what encoding scheme you are saving the files and with what encoding scheme you are viewing the files.** – hagrawal7777 Nov 18 '15 at 17:12
  • To get the expected result, both should be the same. – hagrawal7777 Nov 18 '15 at 17:12
  • Note that you can set the encoding at the file level as well (via the file's Properties from the Package Explorer or Navigator). Also, in your code be sure to use the load/store methods that take Reader/Writer objects, respectively. That ensures you can specify the encoding when reading the file into your app. – bimsapi Nov 18 '15 at 21:29
  • Changing "Java Properties File" encoding in Eclipse properties is a really bad advice. Don't do that. – BalusC Nov 22 '15 at 19:23
  • You haven't provided your reasons on why "you think" that its not good, just saying so is not at all sufficient and proves nothing. – hagrawal7777 Nov 22 '15 at 23:48
  • That won't change the encoding used to read them via `java.util.Properties` API. – BalusC Nov 23 '15 at 08:06
  • @BalusC My colleagues have set their properties file encoding to `UTF-8` and they don't see that behavior. They also told me that I should use that setting. – BuZZ-dEE Nov 24 '15 at 17:49
  • @BalusC Buddy, that's altogether a different story and not the point here. Here the OP wants to know about the Eclipse display and how Eclipse stores and reads files to display them. If some Java or other API wants to read the file, then it needs its own mechanism to handle it. For example, if you are using `ResourceBundle` to read it, then you may need to create and use a custom `ResourceBundle.Control` class, which can be used with ResourceBundle to read properties in any given encoding scheme. – hagrawal7777 Nov 25 '15 at 14:33
  • This was nowhere covered in the answer and thus misleads the OP and a lot of starters. If you knew that beforehand, you'd not have formulated the answer in its current form nor ignorantly have pushed away the problem to "but colleagues did so". Moreover, you still haven't solved OP's concrete problem. – BalusC Nov 25 '15 at 14:34
  • @BalusC It solves and there many forums and blogs which talks about same. Read here https://www.eclipse.org/forums/index.php/t/24647/ .. In old days there were other solutions like configuring through `eclipse.ini` file etc., but I think with Eclipse 3 or so, this feature was introduced to have fine grained control .. What you are talking is right but contexts are different .. Here we are talking about Eclipse context and not Java or some other context .. – hagrawal7777 Nov 25 '15 at 14:41
  • @BuZZ-dEE Buddy, see if my latest edit to answer helps you. – hagrawal7777 Nov 28 '15 at 21:03
  • @MichaelHegner I am glad that it helped you, thanks for letting me know. – hagrawal7777 Feb 17 '17 at 19:48
  • Note: in Java 9, UTF-8 is now the default for properties files https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-974CF488-23E8-4963-A322-82006A7A14C7 - but you may have to configure Eclipse specifically. – pdem Mar 01 '18 at 14:43
4

Add the following arguments to your eclipse.ini file.

-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8

By default Eclipse uses the encoding format picked up by the Java Virtual Machine (JVM). Also, you can set the file encoding to utf-8.

BuZZ-dEE
user1363516
  • The [JVM uses the system encoding](https://stackoverflow.com/questions/1006276/what-is-the-default-encoding-of-the-jvm) and my system uses `UTF-8` and also my properties encoding is set to `UTF-8`. – BuZZ-dEE Nov 24 '15 at 17:51
  • I have requested a feature from oracle to remove the default 8859 encoding. No response yet. let's see if they will fix it. – user1363516 Dec 07 '15 at 21:45
4

Resolved by making the following changes:

  1. Modify the following properties in eclipse.ini, then close and restart Eclipse: -Dclient.encoding.override=UTF-8 -Dfile.encoding=UTF-8
  2. Set the encoding to UTF-8 [Navigation path: Edit -> Set Encoding]

Dilip K
2

Properties files are expected to be ISO-8859-1 (Latin-1) encoded. Most likely this is what Eclipse was set to by default as well.

You have to make sure that every tool that runs in the build, or anywhere else in the chain, disregards the spec and uses UTF-8 instead.
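On the Java side, "uses UTF-8 instead" can be done with the `Reader`/`Writer` overloads of `Properties.load` and `Properties.store`, which leave the charset to the caller. The following is a minimal sketch under that assumption; the class name `Utf8PropertiesSketch` and its helper methods are hypothetical, and an in-memory buffer stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper class; a byte array stands in for the properties file.
final class Utf8PropertiesSketch {

    // Writes the properties as UTF-8 text through the Writer overload.
    static byte[] storeUtf8(Properties props) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            props.store(writer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads UTF-8 encoded properties through the Reader overload.
    static Properties loadUtf8(byte[] utf8Bytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Stores one entry as UTF-8 and reads it back.
    static String roundTrip(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        return loadUtf8(storeUtf8(props)).getProperty(key);
    }

    public static void main(String[] args) {
        // The a-umlaut is written as an escape so the source file encoding does not matter.
        System.out.println(roundTrip("browser.searchForSimilarImages", "\u00E4hnliche"));
    }
}
```

This only covers code you control. An editor or build plugin that opens the same file with a different charset can still damage it.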

tilois
  • But there are also `ä`, `ü` and `ö` in the file that are not replaced. Why are those not replaced? How should I find the setting that causes this problem? Do I need to search all Eclipse settings, and those of every Eclipse plugin, to find the wrong setting? – BuZZ-dEE Jun 30 '15 at 16:59
  • My guess is that *a tool* (maybe a save action?) updates only lines which are somehow touched. But it will get hard to find the culprit. – tilois Jun 30 '15 at 17:06
  • But there are lines changed, that I have not touched. – BuZZ-dEE Jun 30 '15 at 17:07
  • `\uFFFD` is a Java-escaped character. Regular ISO-8859-1 encoded files don't use such escaping. Therefore it must be the editor you use. Make sure you are not using the "Properties File Editor" in Eclipse or a similar external tool. – Robert Jun 30 '15 at 17:16
  • Latin-1 has some accented characters. – bmargulies Jun 30 '15 at 17:36
  • @bmargulies Maybe, but in the properties file I have messages with `ä` and with `\uFFFD\uFFFD`; some `ä` are replaced by `\uFFFD\uFFFD` and some are not. – BuZZ-dEE Jun 30 '15 at 17:52
  • @Robert Which file editor should I use in Eclipse to edit properties files? – BuZZ-dEE Jun 30 '15 at 17:54
  • @BuZZ-dEE Change the project encoding to UTF-8 and then use the standard "Text Editor" (see the context menu of the file -> "Open With"). Or use an external editor like Notepad++. – Robert Jul 01 '15 at 08:33
  • @Robert Where can I do that: "change the project encoding to UTF-8"? If I look into the properties of the project, then there is already "UTF-8" encoding in the "Resource" menu point. – BuZZ-dEE Jul 01 '15 at 11:16
  • Open context menu of your project, "Properties" -> first page -> "Text file encoding" – Robert Jul 01 '15 at 13:00
  • This has changed: since Java 9 it is expected to be UTF-8 https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-974CF488-23E8-4963-A322-82006A7A14C7 – pdem Mar 01 '18 at 14:44
  • Note that in Spring Boot by default they are also expected to be UTF-8. – herman Feb 24 '20 at 14:22
1

This looks like a mixture of Eclipse and git encoding or rather not-encoding.

Git uses raw bytes and doesn't care about encoding. Using git diff you might get characters like shown here. An example there is R<C3><BC>ckg<C3><A4>ngig # should be "Rückgängig".

As you can see, two bracketed byte codes are shown per umlaut. And in your editor, there are always two \uFFFD for each umlaut in the lines starting with +.

So I assume that your UTF-8 editor tries to interpret the git notation and fails. This in turn leads to the representation \uFFFD, which basically means that this is a character whose value is unknown or unrepresentable (see here).
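As a hedged illustration of how two \uFFFD per umlaut can arise in plain Java: if the two UTF-8 bytes of ä are decoded with a charset that accepts neither of them (US-ASCII here, purely as an assumption), each byte becomes one U+FFFD, and `Properties.store(OutputStream, ...)` then writes those characters as escapes. This is one possible mechanism, not a confirmed diagnosis of what Eclipse does; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical reproduction; the US-ASCII decoding step is an assumption.
final class ReplacementCharDemo {

    // Decodes the UTF-8 bytes of the text as US-ASCII. Every byte above
    // 0x7F is malformed in ASCII and is replaced by U+FFFD.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    // Stores a single entry through the OutputStream overload and returns
    // the first line that is neither empty nor a comment.
    static String storedLine(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            props.store(buffer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        for (String line : text.split("\\R")) {
            if (!line.isEmpty() && !line.startsWith("#")) {
                return line;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // The a-umlaut is written as an escape so the source file encoding does not matter.
        String damaged = decodeUtf8BytesAsAscii("\u00E4hnliche");
        System.out.println(storedLine("browser.searchForSimilarImages", damaged));
    }
}
```

The printed line matches the `+` line from the diff in the question, `browser.searchForSimilarImages=\uFFFD\uFFFDhnliche`. Once U+FFFD has been written, the original character cannot be recovered from the file; it has to be restored from Git.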

As suggested in the first link, you can try setting LESSCHARSET=UTF-8 as an environment variable (Windows). Hmm, on Linux it should probably go in /etc/profile?

Calon
  • I used `set LESSCHARSET UTF-8` in the FISH shell and after that I had also `\uFFFD\uFFFD` instead of correct `€` sign. – BuZZ-dEE Nov 27 '15 at 09:20
0

see: a marker such as FFFD (REPLACEMENT CHARACTER) in http://unicode.org/faq/utf_bom.html

and see native2ascii --help

   -encoding encoding_name
          Specifies the name of the character encoding to be used by the conversion procedure. If this option is not present, then the
          default character encoding (as determined by the java.nio.charset.Charset.defaultCharset method) is used. The encoding_name
          string must be the name of a character encoding that is supported by the JRE. See Supported Encodings at
          http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

An example:

$ file yourfile.properties
yourfile.properties : ISO-8859 text, with very long lines
$ native2ascii -encoding ISO-8859-1 yourfile.properties yourfile.properties 
Bruce Zu
0

You could solve that issue by changing your Region settings if you're using Windows 11. Don't know if this works on earlier versions.

Take a look at this fully detailed answer.