How to check the charset of string in Java?

Question

In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:

Ð¢ÐµÑÑ61 Ð¢ÐµÑÑÐ¾Ð²Ð¸Ñ61

It can also arrive in English or in Russian and display correctly. If the username changes, it is updated in the database. Even if I change the value in the db, it won't solve the problem.

I can fix it before saving by doing this

new String(incorrect.getBytes("ISO-8859-1"), "UTF-8");

However, if I apply it to a string that already contains correctly decoded Russian characters (e.g. "Тест61 Тестович61"), I get something like "????61 ????????61".

Can you please suggest something that can determine the charset of a string?

Oh no, I have never noticed it, but I've been voting up for answers though. Now I will know, thank you for pointing it out for me. — Adilya Taimussova, Jul 16 '12 at 04:08

score 22 · Accepted Answer · edited May 27 '15 at 19:38

Strings in Java, AFAIK, do not retain their original encoding; they are always stored internally in some Unicode form. What you want is to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call comes too late.

Ideally, if you can get hold of the input stream you are reading from, you can run it through something like juniversalchardet: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well.
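To illustrate what checking at the byte level can look like with only the standard library, here is a minimal sketch that tests whether raw bytes are well-formed UTF-8 by decoding them strictly. The class and method names (Utf8Check, isValidUtf8) are hypothetical, and this validates a single candidate encoding only; it is not a general-purpose detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: true if the bytes decode as UTF-8 with no
    // malformed or unmappable sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Cyrillic text encoded as UTF-8.
        byte[] utf8 = "\u0422\u0435\u0441\u0442".getBytes(StandardCharsets.UTF_8);
        // "cafe" with an accented e, encoded as ISO-8859-1 (ends in byte 0xE9).
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

Pure ASCII bytes pass this check as well, so a positive result only says that UTF-8 is a plausible reading, not that it is the only one.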

edited May 27 '15 at 19:38

M. A. Kishawy

answered Jul 16 '12 at 04:54

radai

Thanks a lot for the help! I'm not sure if I can get the input stream, because the user data is taken from the context using UserService. The other way is probably to fix the values in LDAP. – Adilya Taimussova Jul 18 '12 at 04:37
the link is dead – A_P Jul 19 '22 at 17:37

Lluís Turró Cutiller · Answer 2 · 2017-12-13T10:57:13.473

10

I had the same problem. Tika is too large and juniversalchardet does not detect ISO-8859-1, so I wrote my own check, which is now working well in production:

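import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Encodes value with fromEncoding, then decodes the same bytes as toEncoding.
public String convert(String value, String fromEncoding, String toEncoding) throws UnsupportedEncodingException {
  return new String(value.getBytes(fromEncoding), toEncoding);
}

// Returns the first candidate charset for which value survives a round trip through UTF-8.
public String charset(String value, String[] charsets) throws UnsupportedEncodingException {
  String probe = StandardCharsets.UTF_8.name();
  for (String c : charsets) {
    // Charset.forName never returns null; it throws for illegal or unsupported names.
    Charset charset = Charset.forName(c);
    if (value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
      return c;
    }
  }
  // No candidate matched: fall back to UTF-8.
  return StandardCharsets.UTF_8.name();
}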

Full description here: Detect the charset in Java strings.
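As a sketch of how the same round-trip idea could be applied to the case in the question, the example below repairs a string only when it looks like UTF-8 that was decoded as ISO-8859-1, and leaves everything else untouched. The names (MojibakeRepair, repairIfMisdecoded) are hypothetical, and the assumption that the wrong decoding was ISO-8859-1 (rather than, say, windows-1252) comes from the fix shown in the question, not from anything verified against LDAP.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Hypothetical helper: undoes a UTF-8-read-as-ISO-8859-1 mix-up, or
    // returns the input unchanged when it does not look like one.
    static String repairIfMisdecoded(String value) {
        // Characters outside ISO-8859-1 (e.g. Cyrillic) cannot have come from
        // an ISO-8859-1 decode, so the string is left alone. This is the case
        // that produced "????" in the question.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            return value;
        }
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decode: any malformed sequence means the bytes are not UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return value;
        }
    }

    public static void main(String[] args) {
        String correct = "\u0422\u0435\u0441\u044261"; // Cyrillic "Test" followed by 61
        String garbled = new String(correct.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(repairIfMisdecoded(garbled).equals(correct)); // true
        System.out.println(repairIfMisdecoded(correct).equals(correct)); // true
        System.out.println(repairIfMisdecoded("Test61"));                // Test61
    }
}
```

This remains a heuristic: a genuine ISO-8859-1 string whose bytes happen to form valid UTF-8 would be rewritten too, so it is safest when the set of possible inputs is known.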

edited Dec 13 '17 at 10:57

answered Dec 13 '17 at 10:07

Lluís Turró Cutiller

Have you tried this with a large `String` and a large number of charsets in the `for` loop? Is it fast enough? – Thomas Jun 15 '21 at 10:10
Hi @Thomas, I haven't tried. My case was detecting UTF-8 from ISO-8859-1, mainly because of third-party JavaScript libraries. I just thought of adding some more code, hoping it would be more useful. – Lluís Turró Cutiller Jun 16 '21 at 11:23

score 8 · Answer 3 · answered Nov 01 '17 at 05:48

I recommend Apache Tika's CharsetDetector; it is easy to use and robust.

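import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
CharsetDetector detector = new CharsetDetector();
detector.setText(yourStr.getBytes());  // getBytes() with no argument encodes using the platform default charset
CharsetMatch match = detector.detect();  // best guess; match.getName() gives the charset name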

Further, getString detects the charset and decodes the bytes into a String in one step; its second argument is a declared-encoding hint for the detector, taking utf-8 as an example:

detector.getString(yourStr.getBytes(), "utf-8");

Zanecola

This library adds 45 MB to the final binary! – Yuriy Chernyshov Oct 14 '20 at 14:22
Based on the javadoc for String.getBytes() here: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#getBytes(), it seems that calling yourStr.getBytes() returns the bytes of the string encoded in the default charset. Therefore, detector.detect() will always return the default charset as the charset of yourStr. I don't see how this can ever work correctly. – saquib-khan Sep 15 '21 at 18:02

score 3 · Answer 4 · answered Dec 19 '20 at 12:06

I highly appreciate Lluís Turró Cutiller's answer (+1), but want to add a variant based on that.

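import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Encodes value with fromEncoding, then decodes the same bytes as toEncoding.
// The Charset overloads do not throw UnsupportedEncodingException, so no throws clause is needed.
private String convert(String value, Charset fromEncoding, Charset toEncoding) {
    return new String(value.getBytes(fromEncoding), toEncoding);
}

// True if value survives a round trip through UTF-8 when treated as charset.
private boolean probe(String value, Charset charset) {
    Charset probe = StandardCharsets.UTF_8;
    return value.equals(convert(convert(value, charset, probe), probe, charset));
}

// Returns value unchanged if it already probes as charsetWanted;
// otherwise converts from the first charset in charsetsOther that probes successfully.
public String convert(String value, Charset charsetWanted, List<Charset> charsetsOther) {
    if (probe(value, charsetWanted)) {
        return value;
    }
    for (Charset other : charsetsOther) {
        if (probe(value, other)) {
            return convert(value, other, charsetWanted);
        }
    }
    System.err.println("WARNING: Could not convert string: " + value);
    return value;
}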

score 2 · Answer 5 · answered Apr 29 '15 at 16:27

Your LDAP database is set up incorrectly. The application putting data into it should convert to a known character set encoding, in your case, likely UTF_16. Pick a standard. All methods of detecting encoding are guesses.

The application writing the value is the only one that knows definitively which encoding it is using and can properly convert to another encoding such as UTF_16.
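A minimal sketch of what "pick a standard" can look like in Java: the writer and the reader both name the same charset explicitly instead of relying on a default. The class and method names (AgreedEncoding, write, read) are hypothetical, and UTF_16 is used here only because it is the encoding named above; any charset both sides agree on works the same way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AgreedEncoding {
    // Hypothetical: the single encoding the writer and the reader agreed on.
    static final Charset AGREED = StandardCharsets.UTF_16;

    // Writer side: always encode with the agreed charset.
    static byte[] write(String value) {
        return value.getBytes(AGREED);
    }

    // Reader side: always decode with the same charset, so no guessing is needed.
    static String read(byte[] stored) {
        return new String(stored, AGREED);
    }

    public static void main(String[] args) {
        String name = "\u0422\u0435\u0441\u044261"; // Cyrillic "Test" followed by 61
        System.out.println(read(write(name)).equals(name)); // true
    }
}
```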

score 0 · Answer 6 · edited Feb 02 '21 at 09:32

In your web-application, you may declare an encoding-filter that makes sure you receive data in the right encoding.

<filter>
    <description>Explicitly set the encoding of the page to UTF-8</description>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>

This Spring-provided filter makes sure that the controllers/servlets receive parameters in UTF-8.

This only applies to a spring application. Also, forcing the encoding may not work if basic authentication is being used. — Rafael Sisto, Apr 14 '15 at 13:16
