0

I am trying to split some strings separated by ASCII control characters in a text file, and ultimately I want the following output:

Record1
Record2
Record3
Record4

My text file looks like this in Notepad++:

(screenshot of the file in Notepad++, with control characters between the records)

But when I use BufferedReader to read a line from the text file, it does not seem to catch the control characters. My code looks like this:

File file = new File("Records.txt");
FileInputStream fis = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");

BufferedReader bufferedReader = new BufferedReader(isr);

String text = bufferedReader.readLine();

System.out.println(text);

and the result of my sysout looks like this:

Record1Record2Record3Record4

Should I be using ISO-8859-1 instead of UTF-8?

Nomad
  • You could have just tried it with `ISO-8859-1`. What are the characters? – XtremeBaumer Oct 02 '17 at 11:43
  • I actually tried with ISO-8859-1 and still have the same result. If I am not mistaken, these characters are ASCII control codes: https://www.cs.tut.fi/~jkorpela/chars/c0.html. The GS is Ctrl-] in Notepad++ – Nomad Oct 02 '17 at 11:47
  • From my Google search it should be a unit separator (Ctrl ^_), char 31. It's not a visible character. As you want to replace it with a line break, replace it with one – XtremeBaumer Oct 02 '17 at 12:04
  • But the goal is not to modify the file in any way. Is it really impossible to parse those characters in Java? – Nomad Oct 02 '17 at 12:07
  • You do read them, but they are not line breaks and therefore your output will always be one line. Only line feed and carriage return start a new line. Also, you don't modify the file: you read it, modify it in memory and then write it – XtremeBaumer Oct 02 '17 at 12:08
  • @Nomad what is the file's encoding? Notepad++ shows it in the bottom right. – Luciano van der Veekens Oct 02 '17 at 12:09
  • Can you try writing the output to a text file rather than printing it to the console? Set the encoding of the output file to the same as the input file. – nits.kk Oct 02 '17 at 12:10
  • You should find out: 1) what *bytes* are in the file (use a hex editor); 2) whether the problem is in display or parsing - dump the UTF-16 code unit value from each `char` in the string that you're reading. – Jon Skeet Oct 02 '17 at 12:10
  • @JonSkeet the problem is in display, I'd say. If my idea is correct, then my answer would apply – XtremeBaumer Oct 02 '17 at 12:13
  • @XtremeBaumer: I'd prefer not to guess, really. The OP should do the diagnostic work here. – Jon Skeet Oct 02 '17 at 12:13
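
One way to do the diagnostic suggested in the comments above is a rough sketch along these lines (reusing the question's `Records.txt` and its UTF-8 reader): dump the UTF-16 code unit of every `char` in the line, so an otherwise invisible separator shows up as, e.g., `U+001F` or `U+001D`:

    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("Records.txt"), "UTF-8"));
    String text = reader.readLine();
    reader.close();

    // Print each UTF-16 code unit so invisible control characters become visible
    for (int i = 0; i < text.length(); i++) {
        System.out.printf("index %d: U+%04X%n", i, (int) text.charAt(i));
    }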

4 Answers

2

You can read each record separately using the US (unit separator) character as a delimiter:

Scanner scanner = new Scanner(new File("Records.txt")).useDelimiter("\u001F");
while (scanner.hasNext())
    System.out.println(scanner.next());

Output:

Record1
Record2
Record3
Record4
DodgyCodeException
0
// (char) 31 is the ASCII unit separator (US)
String s = "test" + (char) 31 + "test2";
String c = String.valueOf((char) 31);
System.out.println(Arrays.asList(s.split(c))); // Arrays is java.util.Arrays

Assuming 31 is the right character (unit separator), this code will split at every occurrence of it and should therefore satisfy your needs.

XtremeBaumer
0

A side note: `System.out.print`-style output is not necessarily reliable when applied to "raw" data containing control characters, especially when displayed on a terminal (cmd window).
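
For instance, a minimal sketch (with a hypothetical string standing in for the file contents) that contrasts printing the raw data with printing a version where control characters are first replaced by a visible marker:

    // Hypothetical sample with an embedded unit separator (U+001F)
    String raw = "Record1\u001FRecord2";

    // How this renders depends entirely on the terminal
    System.out.println(raw);

    // Replacing control characters with a visible marker first is more predictable
    System.out.println(raw.replaceAll("\\p{Cntrl}", "<CTRL>"));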

0

Using the character value of the unit separator gave me a hint. This is what I did on my end:

    File file = new File("Records.txt");
    FileInputStream fis = new FileInputStream(file);
    InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");

    BufferedReader bufferedReader = new BufferedReader(isr);

    // the whole file comes back as a single line, separators included
    String text = bufferedReader.readLine();

    // '\037' is octal for 31, the ASCII unit separator (US)
    Character delim = '\037';

    String[] records = text.split(delim.toString());

    for (String string : records) {
        System.out.println(string);
    }

and got my expected output:

Record1
Record2
Record3
Record4
Nomad