Reading unicode char codes JAVA

Question

Hi, I'm reading a file (please use the link to see the file) that contains these rows:

U+0000
U+0001
U+0002
U+0003
U+0004
U+0005

using this code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class fgenerator {
    public static void main(String[] args) {
        try (BufferedReader br = new BufferedReader(new FileReader(new File("C:\\UNCDUNCD.txt")))) {
            String line;
            String[] splited;
            while ((line = br.readLine()) != null) {
                splited = line.split(" ");
                System.out.println(splited[0]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

but the output is

U+D01C
U+D01D
U+D01E
U+D01F
U+D020
U+D021

Why does this happen?
How do I get the character for each code?

When I run this I get `U+0000 U+0001 U+0002 U+0003 U+0004 U+0005` — GBlodgett, Jun 08 '18 at 22:10
Your question is confusing. If printing the entire line shows the six characters `U+D01C`, then the line obviously contains those six characters. I’m not clear on whether you believe each line contains six ASCII characters, or a single Unicode codepoint. — VGR, Jun 08 '18 at 22:11
Taking your code and wrapping it up into a full class with a `main` method, I get the output, like GBlodgett says, of exactly what the file contains. You say your input is `U+0000` etc. but your output is `U+D01C` etc. _I do not get that result._ I _believe_ your input is `U+D01C` and your output is `U+D01C`, and I further assume that what you want is to read `U+D01C` and output the _Unicode Character_ at code-point D01C. (by the way, there is no character there "U+D01C is not a valid unicode character" according to fileformat.info) — Stephen P, Jun 08 '18 at 22:48
Possible duplicate of [Creating Unicode character from its number](https://stackoverflow.com/questions/5585919/creating-unicode-character-from-its-number) — Stephen P, Jun 08 '18 at 22:59
What you want to do is take the string you read in, such as `U+25C0`, strip off the "U+" part and turn the rest `25C0` into an int: e.g. `Integer.parseInt("25C0", 16);`. At that point the question becomes "[Creating Unicode character from its number](https://stackoverflow.com/q/5585919/17300)" — Stephen P, Jun 08 '18 at 23:02

score 0 · Answer 1 · answered Jun 08 '18 at 21:36

Change the datatype of line to char; if that doesn't work, then use String.getBytes().

Chris Fodor


`BufferedReader.readLine()` returns a `String` **not** a `char` - so if you changed `String line;` into `char line;` it wouldn't even compile. – Stephen P Jun 08 '18 at 22:36

Stephen P · Answer 2 · 2018-06-15T21:10:01.743

I am assuming that you want to take the Unicode representation that is on each line of the file and output the actual Unicode character which the code represents.

If we start with your loop that reads each line from the file...

while ((line = br.readLine()) != null){             
    System.out.println( line );
}

... then what we want to do is convert the input line to the character, and print that ...

while ((line = br.readLine()) != null){             
    System.out.println( convert(line) ); // I just put in a method call to "convert()"
}

So, how do you convert(line) into a character before printing it?
As my earlier comment suggested, you want to take the numeric string that follows the U+ and convert it to an actual numeric value. That, then, is the character value you want to print.
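The conversion step on its own can be sketched as follows. This is only a minimal illustration, assuming the line is exactly of the form U+nnnn with four hex digits; the class name `CodePointParseSketch` and the helper `toChar` are hypothetical names chosen for this example, not part of the programs in this answer.

```java
final class CodePointParseSketch {
    // Hypothetical helper: turns "U+0041" into 'A'.
    // Assumes the input is exactly "U+" followed by four hex digits.
    static char toChar(String line) {
        String hex = line.trim().substring(2);   // drop the leading "U+"
        int value = Integer.parseInt(hex, 16);   // "0041" -> 65
        return (char) value;                     // 65 -> 'A'
    }

    public static void main(String[] args) {
        System.out.println(toChar("U+0041"));        // prints A
        System.out.println((int) toChar("U+00E9"));  // prints 233
    }
}
```

A cast to `char` only covers values up to FFFF, which is all that four hex digits can express. For larger code points, `Character.toChars(int)` would be the call to look at instead.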

The following is a complete program — essentially like yours but I take the filename as an argument rather than hard-coding it. I've also added skipping blank lines, and rejecting invalid strings -- printing a blank space instead.

Reject the line if it does not match the U+nnnn form of a Unicode representation — match against "(?i)U\\+[0-9A-F]{4}", which means:
(?i) - ignore case
U\\+ - match U+, where the + has to be escaped to be a literal plus
[0-9A-F] - match any character 0-9 or A-F (ignoring case)
{4} - exactly 4 times
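To see how that pattern behaves, here is a small sketch that runs it against a few sample strings. The class name `PatternCheckSketch`, the helper `isUPlusForm`, and the sample inputs are assumptions made up for this illustration.

```java
final class PatternCheckSketch {
    // The pattern described above, written as a Java string literal.
    static final String U_PLUS_FORM = "(?i)U\\+[0-9A-F]{4}";

    // String.matches() succeeds only if the whole string matches the pattern.
    static boolean isUPlusForm(String line) {
        return line.matches(U_PLUS_FORM);
    }

    public static void main(String[] args) {
        String[] samples = {"U+0041", "u+00e9", "bad line", "U+41", "U+1F600"};
        for (String s : samples) {
            System.out.println(s + " -> " + isUPlusForm(s));
        }
    }
}
```

The first two samples are accepted (the second only because of `(?i)`). The other three are rejected: one is not in the U+ form at all, one has too few hex digits, and one has too many.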

With your update that includes a linked sample file, which includes # comments, I have modified my original program (below) so it will now strip comments and then convert the remaining representation.

This is a complete program that can be run as:
javac Reader2.java
java Reader2 inputfile.txt

I tested it with a subset of your file, starting inputfile.txt at line 1 with U+0000 and ending at line 312 with U+0138.

import java.io.*;

public class Reader2
{
    public static void main(String... args)
    {
        final String filename = args[0];
        try (BufferedReader br = new BufferedReader(
                                    new FileReader(new File( filename ))
                                 )
            )
        {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().length() > 0) { // skip blank lines
                    //System.out.println( convert(line) );
                    final char c = convert(line);
                    // Always true for a char value; kept only as a sanity check
                    if (Character.isValidCodePoint(c)) {
                        System.out.print( c );
                    }
                }
            }
            System.out.println();
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }

    private static char convert(final String input)
    {
        //System.out.println("Working on line: " + input);
        // Accept U+nnnn, optionally followed by whitespace and a # comment
        if (! input.matches("(?i)U\\+[0-9A-F]{4}(\\s+#.*)?")) {
            System.err.println("Rejecting line: " + input);
            return ' ';
        }
        // Strip the trailing comment, if any, before parsing the hex digits
        final String stripped = input.replaceFirst("\\s+#.*$", "");
        final Integer cval = Integer.parseInt(stripped.substring(2), 16);
        //System.out.println("cval = " + cval);
        return (char) cval.intValue();
    }
}
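The comment-stripping step in convert() can be tried in isolation. The sketch below uses the same `replaceFirst` call; the class name `CommentStripSketch`, the helper `stripComment`, and the sample line with its comment text are made up for illustration and are not taken from the linked file.

```java
final class CommentStripSketch {
    // Removes a trailing "# comment" together with the whitespace before it.
    // A line without a comment is returned unchanged.
    static String stripComment(String line) {
        return line.replaceFirst("\\s+#.*$", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the real file's comment text may differ.
        System.out.println(stripComment("U+0041 # sample comment"));  // prints U+0041
        System.out.println(stripComment("U+0041"));                   // prints U+0041
    }
}
```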

The original program, which assumed each line consisted only of U+nnnn, is here.

You would run this as:
javac Reader.java
java Reader input.txt

import java.io.*;

public class Reader
{
    public static void main(String... args)
    {
        final String filename = args[0];
        try (BufferedReader br = new BufferedReader(
                                    new FileReader(new File( filename ))
                                 )
            )
        {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().length() > 0) { // skip blank lines
                    //System.out.println( line );
                    // Write all chars on one line rather than one char per line
                    System.out.print  ( convert(line) );
                }
            }
            System.out.println(); // Print a newline after all chars are printed
        }
        catch(Exception e) {      // don't catch plain `Exception` IRL
            e.printStackTrace();  // don't just print a stack trace IRL
        }
    }

    private static char convert(final String input)
    {
        // Reject any line that doesn't match U+nnnn
        if (! input.matches("(?i)U\\+[0-9A-F]{4}")) {
            System.err.println("Rejecting line: " + input);
            return ' ';
        }
        // else convert the line to the character
        final Integer cval = Integer.parseInt(input.substring(2), 16);
        //System.out.println("cval = " + cval);
        return (char) cval.intValue();
    }
}

Try it using this as your input file:

U+0041
bad line
U+2718
U+00E9
u+0073

Redirect standard error when you run it (`java Reader input.txt 2> /dev/null`), or comment out the `System.err.println...` line.
You should get this output: A ✘és

My question has two aspects. 1. I'm reading a file that contains U+0020 U+0021 U+0022 U+0023 ............ but when I print the lines, the console shows U+D01C U+D01D U+D01E U+D01F. Why is that? 2. How do I convert whatever I have to its char representation? — Arno, Jun 15 '18 at 18:03
@Arno — everybody (in the comments) is reporting that, when they run your code, they get the output that matches the input; that is, `U+0000`, `U+0001` produces `U+0000`, `U+0001` -- it does **not** produce `U+D01C`, `U+D01D` as you say. (and your original question says you are reading `U+0000`, **not** that you are reading `U+0020` as you say here). Nobody (including me) has been able to reproduce what you say is happening. What happens if you compile and run the full standalone program that I have provided? Can you provide a full standalone program that reproduces what you say is happening? — Stephen P, Jun 15 '18 at 18:47
