
I have a string returned by the Jericho HTML parser that contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this:

import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }
    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }
    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }
    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}

The variable bytes contains the data shown in my debugger; it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.

This doesn't work - readableString contains garbage.

How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the convertString call to Windows-1251, nothing changes. See the screenshot below.

[Screenshot]

Update 2: Another attempt:

[Second screenshot]

Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.

[Expected result]

Update 4 (30.07.2015 14:41): The encoding detector (see the code below) says that the encoding is not Windows-1251 but UTF-8.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
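For example, feeding the repeated triplet from the question's byte array into this method (assuming the juniversalchardet library used above is on the classpath) should reproduce the result described in this update:

final byte[] bytes = {-17, -65, -67, -17, -65, -67}; // a slice of the array from the question
final String encoding = guessEncoding(bytes); // expected to print "Detected encoding: UTF-8"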
Glory to Russia
  • Did you try `new String(bytes, "Windows-1251")`? – Florian Schaetz Jul 30 '15 at 09:43
  • I suspect your String() constructor should specify the encoding in use in your byte array, otherwise you're subject to the JVM encoding for your environment – Brian Agnew Jul 30 '15 at 09:43
  • @FlorianSchaetz Yes, see update 1. – Glory to Russia Jul 30 '15 at 09:48
  • No, sorry, not there, but in `getWindows1251String`. new String() might already try to produce a UTF-8 string there; see http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String%28byte[]%29 – Florian Schaetz Jul 30 '15 at 09:53
  • @FlorianSchaetz I noticed it - see update 2. – Glory to Russia Jul 30 '15 at 09:54
  • Is it possible that those bytes aren't encoded in `Windows-1251` (despite what is stated in the header of the HTML file) ? – Glory to Russia Jul 30 '15 at 09:56
  • What are you expecting to see? If you look up the character values manually, they are correct - look at 0xBD, 0xBF and 0xEF here: https://en.wikipedia.org/wiki/Windows-1251 They are the 3 characters you are seeing and correspond to the decimal values -65 -67 and -17 which appear repeatedly in your byte array (after the initial whitespace) – Rodney Jul 30 '15 at 10:25
  • The updates suggest to me that the system encoding is already Windows-1251, which is why specifying it on line 16 makes no difference. Therefore, on line 12, getBytes() returns another Windows-1251 encoded byte array, the same as what you started with, so it's pointless. Then, when you call new String and specify UTF-8, the decoding fails because the byte array isn't UTF-8. Whichever way you look at it, the convertString method is pointless. – Rodney Jul 30 '15 at 10:45
  • @Rodney These texts represent entries in a drop-down list. See my update 3. I want to retrieve the same texts in my Java program. – Glory to Russia Jul 30 '15 at 11:41

3 Answers


(In the light of updates I deleted my original answer and started again)

The text which appears

пїЅпїЅпїЅпїЅпїЅпїЅ

is an accurate decoding of these byte values

-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67

(Padded at either end with 32, which is space.)

So either

1) The text is garbage or

2) The text is supposed to look like that or

3) The encoding is not Windows-1251 (a quick check follows below)
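As a quick check (this snippet is mine, not part of the original answer): the repeated triplet EF BF BD (-17, -65, -67) happens to be the UTF-8 encoding of U+FFFD, the Unicode replacement character that decoders emit for undecodable input. That fits the detector's UTF-8 verdict in update 4 and points to option 1: the text was already mangled before it reached you.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCheck {
    public static void main(final String[] args) {
        final byte[] bytes = {-17, -65, -67, -17, -65, -67, -17, -65, -67};

        // Decoded as Windows-1251, each EF BF BD triplet renders as "пїЅ":
        System.out.println(new String(bytes, Charset.forName("windows-1251")));

        // Decoded as UTF-8, each triplet is U+FFFD, the replacement character:
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}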

This line is notably wrong

return new String(windows1251String.getBytes(), "UTF-8");

Extracting the bytes out of a string and constructing a new string from them is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 internally (and you don't normally even need to know or care about that). The only time other encodings come into play is when text data is stored OUTSIDE of a String object, i.e. in your initial byte array. Conversion occurs when the String is constructed, and then it is done. There is no conversion from one String type to another; they are all the same.
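To illustrate the point, here is a minimal sketch (the file name and the use of Files.readAllBytes are just stand-ins for whatever produces your raw bytes):

import java.nio.file.Files;
import java.nio.file.Paths;

public class DecodeOnce {
    public static void main(final String[] args) throws Exception {
        // Hypothetical source of raw bytes; substitute your own.
        final byte[] raw = Files.readAllBytes(Paths.get("page.html"));

        // Decoding happens exactly once, right here. From this point on
        // the text is UTF-16 chars; there is no "Windows-1251 String"
        // that needs further conversion.
        final String text = new String(raw, "Windows-1251");

        System.out.println(text);
    }
}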

The fact that this

return new String(bytes);

does the same as this

return new String(bytes, "Windows-1251");

suggests that Windows-1251 is the platform's default encoding (which is further supported by your timezone being MSK).
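You can confirm the platform default directly (a snippet of mine, not from the original answer):

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(final String[] args) {
        // The charset used by new String(byte[]) and String.getBytes()
        // when no encoding is specified:
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}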

Rodney
  • +1; I can confirm that the `byte[]` is displayed correctly. I checked them in the Windows-1251 code page (byte -17 = int 239 = 0xEF = char 'п'): https://en.wikipedia.org/wiki/Windows-1251 – bvdb Jul 30 '15 at 12:56

I fixed this problem by modifying the piece of code that reads the text from the web site.

private String readContent(final String urlAsString) {
    final StringBuilder content = new StringBuilder();
    BufferedReader reader = null;
    InputStream inputStream = null;
    try {
        final URL url = new URL(urlAsString);
        inputStream = url.openStream();
        reader =
            new BufferedReader(new InputStreamReader(inputStream));

        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            content.append(inputLine);
        }
    } catch (final IOException exception) {
        exception.printStackTrace();
    } finally {
        IOUtils.closeQuietly(reader);
        IOUtils.closeQuietly(inputStream);
    }
    return content.toString();
}

I changed the line

new BufferedReader(new InputStreamReader(inputStream));

to

new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));

and then it worked.
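As a side note (my addition, not part of the original fix): passing a java.nio.charset.Charset object instead of a charset name avoids the checked UnsupportedEncodingException:

new BufferedReader(new InputStreamReader(inputStream, Charset.forName("windows-1251")));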

Glory to Russia
  • The thing is that readers already perform a conversion from `byte` to `char` internally; that's the main difference between a reader and a stream, and that's the point where your data got corrupted. Good solution. – bvdb Jul 30 '15 at 13:36

Just to make sure you understand 100% how java deals with char and byte.

byte[] input = new byte[1];

// Byte values > 127 become negative when you put them in a byte array.
input[0] = (byte) 239; // the array now contains the value -17

// All 256 values are preserved, but to read a byte back as an unsigned
// value you must mask it; casting alone isn't enough.
int output = input[0] & 0xFF; // output is 239 again

// You shouldn't cast directly from a single byte to a char, because a
// char is 16-bit: sign extension smears a negative byte value across
// the upper byte and breaks it.
char corrupted1 = (char) input[0];         // char code 65519 (2 bytes used)
char corrupted2 = (char) ((int) input[0]); // same result: char code 65519

// Just casting is OK for values < 0x7F, though: those are always
// positive, even as bytes, AND the first 128 code points of
// ASCII-compatible encodings (e.g. Windows-1251) match Unicode.
byte simple = (byte) 'a';
char chr = (char) simple; // results in 'a' again

// It's still more reliable to mask with & 0xFF: it guarantees the
// result can never exceed char code 255 (a single byte), even when
// the byte is unexpectedly negative.
char chr2 = (char) (simple & 0xFF); // also results in 'a'

// For value 239 (0xEF) even masking fails, though. A Java char is
// 16-bit encoded internally, following the Unicode character set;
// code points 0x00 to 0x7F are identical in most encodings, but 0xEF
// in Windows-1251 ('п') is NOT the character U+00EF. So this is a bad idea:
char stillWrong = (char) (input[0] & 0xFF);

// That's something only encodings can fix, and it's good practice to
// specify one ALWAYS. Here the encoding names what your bytes are
// encoded in NOW; the bytes are decoded into 16-bit chars:
String text = new String(input, "windows-1251");

// If you want to turn that text back into bytes, use an encoding
// again; this time it specifies the TARGET encoding:
byte[] encoded = text.getBytes("windows-1251");

I hope this helps.

As for the displayed values: I can confirm that the byte[] is displayed correctly. I checked them in the Windows-1251 code page. (byte -17 = int 239 = 0xEF = char 'п')

In other words, your byte values are incorrect, or the source encoding is something other than Windows-1251.
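For comparison (my addition, with a sample word of my choosing): correctly encoded Windows-1251 bytes for Russian text look quite different from the array in the question:

import java.nio.charset.Charset;

public class GoodBytes {
    public static void main(final String[] args) {
        final Charset cp1251 = Charset.forName("windows-1251");

        // One byte per Cyrillic letter, no EF BF BD triplets:
        final byte[] good = "привет".getBytes(cp1251);
        // good = {-17, -16, -24, -30, -27, -14}

        System.out.println(new String(good, cp1251)); // prints "привет"
    }
}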

bvdb