38

In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:

ТеÑÑ61 ТеÑÑовиÑ61

It can also be in English or in Russian and displayed correctly. If the username changes, it is updated in the database. Even if I change the value in the db, it won't solve the problem.

I can fix it before saving by doing this

new String(incorrect.getBytes("ISO-8859-1"), "UTF-8");

However, if I use it on a string that already contains correct Russian characters (e.g., "Тест61 Тестович61"), I get something like "????61 ????????61".

Can you please suggest something that can determine the charset of a string?

Adilya Taimussova

6 Answers

22

Strings in Java, AFAIK, do not retain their original encoding; they are always stored internally in some Unicode form. You want to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call comes too late.
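
To illustrate why the String itself is the wrong place to look, here is a minimal sketch that reproduces the question's symptoms. It assumes the damage came from UTF-8 bytes being decoded as ISO-8859-1; the class and method names (MojibakeDemo, misdecode, redecode) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates the faulty read: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The fix from the question: recover the bytes via ISO-8859-1, then decode them as UTF-8.
    static String redecode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The first word of the question's example, written with Unicode escapes.
        String correct = "\u0422\u0435\u0441\u0442" + "61";
        String garbled = misdecode(correct);

        // The garbled form can be repaired because ISO-8859-1 keeps every byte.
        System.out.println(redecode(garbled).equals(correct)); // true

        // A correct string is destroyed: Cyrillic has no ISO-8859-1 bytes, so each letter becomes '?'.
        System.out.println(redecode(correct)); // ????61
    }
}
```

Both strings are ordinary Unicode once they are in memory, so nothing in the String object says which one was decoded wrongly.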

Ideally if you could get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well

M. A. Kishawy
radai
10

I had the same problem. Tika is too large and juniversalchardet does not detect ISO-8859-1. So I wrote my own, and it is now working well in production:

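import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Re-decodes the bytes that value produces in fromEncoding as toEncoding.
public String convert(String value, String fromEncoding, String toEncoding) throws UnsupportedEncodingException {
  return new String(value.getBytes(fromEncoding), toEncoding);
}

// Returns the first candidate whose round trip through UTF-8 reproduces value.
public String charset(String value, String[] charsets) throws UnsupportedEncodingException {
  String probe = StandardCharsets.UTF_8.name();
  for (String c : charsets) {
    // Charset.forName never returns null; it throws for unknown or illegal names.
    Charset charset = Charset.forName(c);
    if (value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
      return c;
    }
  }
  // No candidate matched, so fall back to UTF-8.
  return StandardCharsets.UTF_8.name();
}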

Full description here: Detect the charset in Java strings.
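
As a rough illustration of how the round-trip idea applies to the question's case, the sketch below repairs a string only when its ISO-8859-1 bytes form valid UTF-8. It assumes the only failure mode is UTF-8 read as ISO-8859-1; the names RoundTripRepair, looksMisdecoded and repair are hypothetical, not part of the answer's code.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripRepair {
    // Reinterprets the string's ISO-8859-1 bytes as UTF-8.
    static String redecode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // True when s survives ISO-8859-1 -> UTF-8 -> ISO-8859-1 unchanged,
    // i.e. its ISO-8859-1 bytes are well-formed UTF-8.
    static boolean looksMisdecoded(String s) {
        String back = new String(redecode(s).getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.ISO_8859_1);
        return s.equals(back);
    }

    // Repairs only strings that pass the round-trip check; others are returned as-is.
    static String repair(String s) {
        return looksMisdecoded(s) ? redecode(s) : s;
    }

    public static void main(String[] args) {
        String correct = "\u0422\u0435\u0441\u0442" + "61"; // Cyrillic word plus digits
        String garbled = new String(correct.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(correct)); // true
        System.out.println(repair(correct).equals(correct)); // true: left untouched
    }
}
```

This remains a heuristic: a genuine ISO-8859-1 string whose bytes happen to be valid UTF-8 would be "repaired" by mistake, and pure ASCII passes the check for almost any charset, which is why the order of candidates matters in the code above.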

  • Have you tried this with a large `String` and a large number of charsets in the `for` loop? Is it fast enough? – Thomas Jun 15 '21 at 10:10
  • Hi @Thomas, I haven't tried. My case was detecting UTF-8 from ISO-8859-1, mainly because of third-party JavaScript libraries. Just thought of adding some more code, hoping it will be more useful. – Lluís Turró Cutiller Jun 16 '21 at 11:23
8

I recommend Apache Tika's CharsetDetector; it is easy to use and robust.

CharsetDetector detector = new CharsetDetector();
detector.setText(yourStr.getBytes());
detector.detect();  // returns the best CharsetMatch; call .getName() on it for the charset name

Further, getString decodes the bytes into a Java String using the detected charset. The second argument is a declared-encoding hint for the detector, not a target encoding; here utf-8 is passed as the hint:

detector.getString(yourStr.getBytes(), "utf-8");
Zanecola
  • 9
    This library adds 45 MB to the final binary! – Yuriy Chernyshov Oct 14 '20 at 14:22
  • Based on the javadoc for String.getBytes() here: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#getBytes(), it seems that calling yourStr.getBytes() returns the bytes of the string encoded in the default charset. Therefore, detector.detect() will always return the default charset as the charset of yourStr. I don't see how this can ever work correctly. – saquib-khan Sep 15 '21 at 18:02
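
Following up on the comment above, detection is only meaningful on the raw bytes, before any String is built. A library-free sketch of that idea: try each candidate charset with a strict decoder and keep the first one that decodes without errors. This is a swapped-in technique using the JDK's CharsetDecoder, not what Tika does internally, and the names StrictDecode, tryDecode and decodeFirstValid are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

final class StrictDecode {
    // Decodes raw with cs, returning empty instead of silently replacing bad bytes.
    static Optional<String> tryDecode(byte[] raw, Charset cs) {
        try {
            return Optional.of(cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Returns the first strict decode that succeeds, else decodes with the fallback charset.
    static String decodeFirstValid(byte[] raw, List<Charset> candidates, Charset fallback) {
        for (Charset cs : candidates) {
            Optional<String> decoded = tryDecode(raw, cs);
            if (decoded.isPresent()) {
                return decoded.get();
            }
        }
        return new String(raw, fallback);
    }

    public static void main(String[] args) {
        byte[] raw = ("\u0422\u0435\u0441\u0442" + "61").getBytes(StandardCharsets.UTF_8);
        String name = decodeFirstValid(raw, List.of(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);
        System.out.println(name.length()); // 6
    }
}
```

Putting UTF-8 first is a deliberate choice: non-ASCII text in a single-byte encoding rarely forms valid UTF-8, whereas ISO-8859-1 accepts every byte sequence and so only works as a last resort.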
3

I highly appreciate Lluís Turró Cutiller's answer (+1), but want to add a variant based on that.

// Re-decodes the bytes that value produces in fromEncoding as toEncoding.
// The Charset overloads of getBytes and new String throw no checked exception.
private String convert(String value, Charset fromEncoding, Charset toEncoding) {
    return new String(value.getBytes(fromEncoding), toEncoding);
}

// True if value survives a round trip from charset through UTF-8 and back.
private boolean probe(String value, Charset charset) {
    Charset probe = StandardCharsets.UTF_8;
    return value.equals(convert(convert(value, charset, probe), probe, charset));
}

// Returns value unchanged if it passes the probe for charsetWanted;
// otherwise converts it from the first other charset that passes.
public String convert(String value, Charset charsetWanted, List<Charset> charsetsOther) {
    if (probe(value, charsetWanted)) {
        return value;
    }
    for (Charset other : charsetsOther) {
        if (probe(value, other)) {
            return convert(value, other, charsetWanted);
        }
    }
    System.err.println("WARNING: Could not convert string: " + value);
    return value;
}
Thomas Schütt
2

Your LDAP database is set up incorrectly. The application putting data into it should convert to a known character encoding, in your case likely UTF-16. Pick a standard. All methods of detecting an encoding are guesses.

The application writing the value is the only one that knows definitively which encoding it is using, so it is the only one that can reliably convert to another encoding such as UTF-16.
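
A minimal sketch of what "converting at the source" can look like, assuming the writing application knows its own encoding (ISO-8859-1 here) and the agreed standard is UTF-8 purely for illustration. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    // Decodes raw with the known source charset, then encodes with the agreed target charset.
    static byte[] transcode(byte[] raw, Charset source, Charset target) {
        return new String(raw, source).getBytes(target);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, as a legacy ISO-8859-1 writer would store it.
        byte[] legacy = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] standard = transcode(legacy, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(legacy.length + " -> " + standard.length); // 4 -> 5
    }
}
```

No guessing is involved because both charsets are stated explicitly; readers then decode with the one agreed charset.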

Evan Langlois
0

In your web-application, you may declare an encoding-filter that makes sure you receive data in the right encoding.

<filter>
    <description>Explicitly set the encoding of the page to UTF-8</description>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>

This Spring-provided filter makes sure that the controllers/servlets receive parameters in UTF-8.

hong4rc
sangupta
  • 2
    This only applies to a spring application. Also, forcing the encoding may not work if basic authentication is being used. – Rafael Sisto Apr 14 '15 at 13:16