0

I am trying to split some strings separated by ASCII control characters in a text file, and ultimately I want the following output:

Record1
Record2
Record3
Record4

My text file looks like this in Notepad++:

(screenshot of the file in Notepad++, with control characters between the records)

But when I use BufferedReader to read a line from the text file, it does not seem to catch the control characters. My code looks like this:

File file = new File("Records.txt");
FileInputStream fis = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");

BufferedReader bufferedReader = new BufferedReader(isr);

String text = bufferedReader.readLine();

System.out.println(text);

and the result of my sysout looks like this:

Record1Record2Record3Record4

Should I be using ISO-8859-1 instead of UTF-8?

Nomad
  • You could have just tried it with `ISO-8859-1`. What are the characters? – XtremeBaumer Oct 02 '17 at 11:43
  • I actually tried with ISO-8859-1 and still have the same result. If I am not mistaken, these characters are ASCII control codes: https://www.cs.tut.fi/~jkorpela/chars/c0.html. The GS is Ctrl-] in Notepad++ – Nomad Oct 02 '17 at 11:47
  • From my Google search it should be a unit separator (Ctrl ^_), char 31. It's not a visible character. As you want to replace it with a line break, replace it with one – XtremeBaumer Oct 02 '17 at 12:04
  • But the goal is not to modify the file in any way. Is it really impossible to parse those characters in Java? – Nomad Oct 02 '17 at 12:07
  • You do read them, but they are not line breaks and therefore your output will always be one line. Only line feed and carriage return start a new line. Also, you don't modify the file: you read it, modify it in memory and then write it – XtremeBaumer Oct 02 '17 at 12:08
  • @Nomad what is the file's encoding? Notepad++ shows it in the bottom right. – Luciano van der Veekens Oct 02 '17 at 12:09
  • Can you try writing the output to a text file rather than printing it to the console? Set the encoding of the output file to the same as the input file. – nits.kk Oct 02 '17 at 12:10
  • You should find out: 1) what *bytes* are in the file (use a hex editor); 2) whether the problem is in display or parsing - dump the UTF-16 code unit value from each `char` in the string that you're reading. – Jon Skeet Oct 02 '17 at 12:10
  • @JonSkeet the problem is in display, I'd say. If my idea is correct, then my answer would apply – XtremeBaumer Oct 02 '17 at 12:13
  • @XtremeBaumer: I'd prefer not to guess, really. The OP should do the diagnostic work here. – Jon Skeet Oct 02 '17 at 12:13
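
One way to do the diagnostic suggested in the comments above is a rough sketch along these lines (reusing the question's `Records.txt` and its UTF-8 reader): dump the UTF-16 code unit of every `char` in the line, so an otherwise invisible separator shows up as, e.g., `U+001F` or `U+001D`:

    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("Records.txt"), "UTF-8"));
    String text = reader.readLine();
    reader.close();

    // Print each UTF-16 code unit so invisible control characters become visible
    for (int i = 0; i < text.length(); i++) {
        System.out.printf("index %d: U+%04X%n", i, (int) text.charAt(i));
    }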

4 Answers

2

You can read each record separately using the US (unit separator) character as a delimiter:

Scanner scanner = new Scanner(new File("Records.txt")).useDelimiter("\u001F");
while (scanner.hasNext())
    System.out.println(scanner.next());

Output:

Record1
Record2
Record3
Record4
DodgyCodeException
0
// (char) 31 is the ASCII unit separator (US)
String s = "test" + (char) 31 + "test2";
String c = String.valueOf((char) 31);
System.out.println(Arrays.asList(s.split(c))); // Arrays is java.util.Arrays

Assuming 31 is the right character (unit separator), this code will split at every occurrence of it and should therefore satisfy your needs.

XtremeBaumer
0

A side note: `System.out.print`-style output is not necessarily reliable when applied to "raw" data containing control characters, especially when displayed on a terminal (cmd window).
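
For instance, a minimal sketch (with a hypothetical string standing in for the file contents) that contrasts printing the raw data with printing a version where control characters are first replaced by a visible marker:

    // Hypothetical sample with an embedded unit separator (U+001F)
    String raw = "Record1\u001FRecord2";

    // How this renders depends entirely on the terminal
    System.out.println(raw);

    // Replacing control characters with a visible marker first is more predictable
    System.out.println(raw.replaceAll("\\p{Cntrl}", "<CTRL>"));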

0

Using the character value of the unit separator gave me a hint. This is what I did on my end:

    File file = new File("Records.txt");
    FileInputStream fis = new FileInputStream(file);
    InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");

    BufferedReader bufferedReader = new BufferedReader(isr);

    // the whole file comes back as a single line, separators included
    String text = bufferedReader.readLine();

    // '\037' is octal for 31, the ASCII unit separator (US)
    Character delim = '\037';

    String[] records = text.split(delim.toString());

    for (String string : records) {
        System.out.println(string);
    }

and got my expected output:

Record1
Record2
Record3
Record4
Nomad