Java - Printing unicode from text file doesn't output corresponding UTF-8 character

Question

I have a text file containing numerous Unicode escape sequences, and I am trying to print the corresponding characters to the console, but all it prints is the escape text itself. If I copy any of the values and paste them into a System.out.println call, it works fine, but not when reading them from the text file.

The following is my code for reading the file, which contains lines of values like \u00C0, \u00C1, \u00C2, \u00C3. These are printed to the console verbatim instead of the characters I want.

private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");
    try (Stream<String> stream = Files.lines(encoding)) {

        stream.forEach(v -> { System.out.println(v); });

    } catch (IOException e) {
        e.printStackTrace();
    }
}

This is the method I used to parse the HTML that contained the Unicode values in the first place.

private void parseGermanEncoding() {

    try 
    {
        File encoding = new File("encoding.html");

        Document document = Jsoup.parse(encoding, "UTF-8", "http://example.com/");

        Element table = document.getElementsByClass("codetable").first();

        Path f = Paths.get("unicode.txt");

        try (BufferedWriter wr = new BufferedWriter(new FileWriter(f.toFile()))) 
        {
            for (Element row : table.select("tr"))
            {
                Elements tds = row.select("td");

                String unicode = tds.get(0).text();

                if (unicode.startsWith("U+"))
                {
                    unicode = unicode.substring(2);
                }

                wr.write("\\u" + unicode);
                wr.newLine();   

            }   
            wr.flush();
            wr.close();
        }

    } catch (IOException e) 
    {
        e.printStackTrace();
    }
}

Did you just write `\u00C2` and so on in your File? please show us a part of the textfile — Felix, Jul 08 '17 at 17:32
The text file just looks like the following. '\u00C0 \u00C1 \u00C2 \u00C3 \u00C4 \u00C5 \u00C6 \u00C7 \u00C8 \u00C9 \u00CA \u00CB \u00CC \u00CD \u00CE \u00CF \u00D0 \u00D1 \u00D2 \u00D3 \u00D4' — sean le roy, Jul 08 '17 at 17:33
Sorry that's not printing right. Basically each of these values is on a separate line. — sean le roy, Jul 08 '17 at 17:34

score 0 · Answer 1 · answered Jul 08 '17 at 17:47

You will need to convert the string from a Unicode-escaped string to a UTF-8 encoded string. You could follow these steps: (1) convert the string to a byte array using myString.getBytes("UTF-8"), and (2) get the UTF-8 encoded string using new String(byteArray, "UTF-8"). The code block needs to be surrounded with try/catch for UnsupportedEncodingException.
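To see what these two steps do to a line as it is read from the file, a small check like the following may help. It is only a sketch: the class name RoundTripDemo and the method roundTrip are made up for illustration, and it uses StandardCharsets.UTF_8 instead of the "UTF-8" string so that no UnsupportedEncodingException handling is needed. The input is assumed to be the six characters that a line of the file holds (backslash, u, and four hex digits).

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Encodes the string to UTF-8 bytes and decodes those bytes again.
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Six characters, as a line of the file holds them:
        // a backslash, 'u', '0', '0', 'C', '0'.
        String fromFile = "\\u00C0";
        String result = roundTrip(fromFile);

        System.out.println(result.equals(fromFile)); // prints true
        System.out.println(result.length());         // prints 6
    }
}
```

In this sketch the round trip hands back the same six characters, which matches what the follow-up comments report.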

OTM

Still doesn't work. My method looks like the following now. `Path encoding = Paths.get("unicode.txt"); System.out.println("\u00D9 \u00FC \u00C2 \u00C7 Acme, Inc."); try (Stream stream = Files.lines(encoding)) { stream.forEach(v -> { try { byte[] bytes = v.getBytes("UTF-8"); String str = new String(bytes, "UTF-8"); System.out.println(str); } catch (UnsupportedEncodingException e) { e.printStackTrace(); } })` – sean le roy Jul 08 '17 at 17:59
Code doesn't print well in a comment. I've included another system out in this one with the first one printing the correct characters I want. – sean le roy Jul 08 '17 at 18:03
In your original code in your post can you try with stream.forEach(System.out::println); ? – OTM Jul 08 '17 at 18:12
I actually had it like that originally and was getting the same result. – sean le roy Jul 08 '17 at 18:16
Okay, it looks like the strings read from the file need to be unescaped. Can you try one more approach using Apache Commons? Can you unescape the string using org.apache.commons.lang3.StringEscapeUtils.unescapeJava(myString) and print that? – OTM Jul 08 '17 at 18:32
I actually only realized myself a while ago that it's escaping the escape and therefore not printing what I want. I might try that out, but this is part of a project at work and I know they won't want me using that library. – sean le roy Jul 08 '17 at 18:38
Okay, then you might want to take a look at this post's answer. https://stackoverflow.com/questions/11145681/how-to-convert-a-string-with-unicode-encoding-to-a-string-of-letters – OTM Jul 08 '17 at 18:47
Brilliant, following that post I was able to get it working. I'll post up my working solution. – sean le roy Jul 09 '17 at 19:42

score 0 · Accepted Answer · answered Jul 09 '17 at 19:49

Thanks to OTM's comment above I was able to get a working solution for this. You take the hex digits of the Unicode value, parse them as a base-16 integer using Integer.parseInt(), and finally cast the result to char to get the actual character. This solution is based on the post provided by OTM: How to convert a string with Unicode encoding to a string of letters

private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");

    try (Stream<String> stream = Files.lines(encoding)) {
        stream.forEach(v -> 
        {
            // Drop the leading backslash-u, if present, so only the hex digits remain
            String hex = v.trim();
            if (hex.startsWith("\\u")) {
                hex = hex.substring(2);
            }

            // Parse the hex digits as a base-16 integer
            int parse = Integer.parseInt(hex, 16);

            // Cast the numeric value to char to get the actual character
            String output = String.valueOf((char) parse);

            System.out.println(output);
        });

    } catch (IOException e) {
        e.printStackTrace();
    }
}
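
If a line can hold several escapes, or escapes mixed with ordinary text, a regex-based variant of the same parse-and-cast idea is one option. The following is a minimal sketch and not the code from the linked post: the class name UnicodeEscapeDecoder and the method decode are illustrative, and it assumes every escape is a backslash, a lowercase u, and exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeDecoder {

    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces each escape in the text with the character it names;
    // everything else is copied through unchanged.
    public static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits as base 16 and cast to char.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslashes literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // At runtime this string contains the escape text itself,
        // just as a line read from the file would.
        String line = "\\u00C0 \\u00C1 Acme";
        System.out.println(decode(line));
    }
}
```

Whether the decoded characters show up correctly in the console still depends on the console's own encoding, which this sketch does not address.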
