3

I have a file in the following format. Records are separated by newlines, but some records contain line breaks themselves, as shown below. I need to read each record and process it separately. The file could be a few MB in size.

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

I have the code:

 FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
 Scanner scanner = new Scanner(fs);
 scanner.useDelimiter(Pattern.compile("<\\?"));
 while (scanner.hasNext()) {
     String line = scanner.next();
     System.out.println(line);
 } 
 scanner.close();

But in the result I get, the leading <? is removed from each record:

aaaaa>
bbbb
   bb>
cccccc>

I know that Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter back to each record manually.

Is there a way to NOT have the delimiter pattern removed?

Bohemian
jlp

3 Answers

5

Break on a newline only when preceded by a ">" char:

scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly

\R matches any line break sequence (\n, \r\n and so on), so it is platform independent; it requires Java 8 or later
(?<=>) is a look-behind that asserts (without consuming) that the previous char is a >

Plus it's cool because <=> looks like Darth Vader's TIE fighter.
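
To try the look-behind delimiter without touching the file system, here is a minimal sketch that runs the same pattern over an in-memory string. The class name `LookbehindDelimiterDemo`, the helper `readRecords` and the sample data are made up for illustration; only the delimiter comes from the answer above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LookbehindDelimiterDemo {

    // Collects every record from the source. A record ends at a line break
    // that directly follows '>'. The look-behind (?<=>) is zero-width, so the
    // '>' stays in the record and only the line break is consumed.
    static List<String> readRecords(Readable source) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in the question's format.
        String sample = "<?aaaaa>\n<?bbbb\n    bb>\n<?cccccc>\n";
        for (String record : readRecords(new StringReader(sample))) {
            System.out.println("[" + record + "]");
        }
    }
}
```

For a real file, passing a `FileReader` (or using the `Scanner(File)` constructor) instead of the `StringReader` should behave the same way, since Scanner reads its input in chunks rather than loading the whole file.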

Bohemian
  • I tested with more records, and this approach put some records on the same line. Can you please help? – jlp Jan 06 '17 at 20:21
  • @jlp do you mean a "missing" newline like in `"\n\n"` between bbb and ccc? – Bohemian Jan 06 '17 at 20:23
  • btw, if that is the case it's easy to handle - just add `?` to the end of the regex – Bohemian Jan 06 '17 at 21:42
  • @Bohemian Hi, sorry, I have another question. If the data is like "" with no space or newline between them, how do I make my regex so that it can break them into 3 lines like then and . Please let me know if I should create another post for this questions. Thanks. – jlp Jan 13 '17 at 03:16
  • @jlp as per previous comment; add `?` to the regex: `scanner.useDelimiter("(?<=>)\\R?");`. This makes the newline *optional*, but will consume it if it's there. – Bohemian Jan 13 '17 at 05:10
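
The optional-newline variant from the comments can be checked with a similar sketch. It assumes that `>` only ever appears at the end of a record, because with `\\R?` every `>` becomes a record boundary. The class name `OptionalNewlineDemo`, the helper `readRecords` and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class OptionalNewlineDemo {

    // Splits after every '>'. A line break that follows the '>' is consumed
    // as part of the delimiter; if there is none, the delimiter is empty.
    // Assumption: '>' never occurs in the middle of a record.
    static List<String> readRecords(String text) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("(?<=>)\\R?");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Records with no separator at all between them.
        System.out.println(readRecords("<?aaaaa><?bbbb bb><?cccccc>"));
        // A mix of separated and unseparated records.
        System.out.println(readRecords("<?aaaaa>\n<?bbbb\n bb><?cccccc>\n"));
    }
}
```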
1

I'm assuming you want to ignore the newline character '\n' everywhere.

I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:

String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
...  //your code

Feel free to ask any further questions you might have!
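
One detail worth hedging here: `replace("\n", "")` leaves a stray `\r` behind in files with Windows line endings. The sketch below is a variation on the idea above, not the exact code from this answer. It strips every kind of line break with `\R` (Java 8+) and then splits after each `>` with a zero-width look-behind so the records stay intact. The class name `StripLineBreaksDemo` and the helper `toRecords` are made up for illustration, and it assumes `>` only appears at the end of a record.

```java
import java.util.Arrays;
import java.util.List;

class StripLineBreaksDemo {

    // Removes all line breaks, then splits after every '>'.
    // The look-behind is zero-width, so no characters are lost in the split.
    static List<String> toRecords(String fileContent) {
        String flattened = fileContent.replaceAll("\\R", "");
        return Arrays.asList(flattened.split("(?<=>)"));
    }

    public static void main(String[] args) {
        // Hypothetical content with Windows line endings.
        String content = "<?aaaaa>\r\n<?bbbb\r\n    bb>\r\n<?cccccc>\r\n";
        for (String record : toRecords(content)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Like the original suggestion, this holds the whole file in memory, so it suits files of a few megabytes rather than very large ones.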

Community
Todd Sewell
  • The file may be a few MB in size; I'm not sure whether storing the entire file in a string will cause any issues. – jlp Jan 06 '17 at 18:42
  • @jlp I wouldn't worry about files being a couple of megabytes in size, but you're right that this approach wouldn't scale very well. – Todd Sewell Jan 06 '17 at 18:47
0

Here is one way of doing it by using a StringBuilder:

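import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public static void main(String[] args) throws FileNotFoundException {
    // try-with-resources closes the Scanner even if an exception is thrown
    try (Scanner in = new Scanner(new File("C:\\test.txt"))) {
        StringBuilder builder = new StringBuilder();

        while (in.hasNextLine()) {
            // nextLine() strips the line terminator, so a record that spans
            // several lines is joined into a single line
            String input = in.nextLine();
            for (int x = 0; x < input.length(); x++) {
                builder.append(input.charAt(x));
                // '>' marks the end of a record: print it and start a new one
                if (input.charAt(x) == '>') {
                    System.out.println(builder.toString());
                    builder.setLength(0);
                }
            }
        }
    }
}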

Input:

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

Output:

 <?aaaaa>
 <?bbbb     bb>
 <?cccccc>
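
As the output shows, reading line by line joins a multi-line record into one line. If the line breaks inside a record need to be preserved, a character-by-character variant is one option. This is only a sketch under the assumption that every record ends with `>`; the class name `CharByCharDemo` and the helper `readRecords` are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CharByCharDemo {

    // Reads records that end with '>'. Line breaks between records are
    // skipped, while line breaks inside a record are kept as they are.
    static List<String> readRecords(Reader reader) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                // No record in progress: ignore separating line breaks.
                if (current.length() == 0 && (c == '\n' || c == '\r')) {
                    continue;
                }
                current.append((char) c);
                if (c == '>') {
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in the question's format.
        String sample = "<?aaaaa>\n<?bbbb\n    bb>\n<?cccccc>\n";
        for (String record : readRecords(new StringReader(sample))) {
            System.out.println("[" + record + "]");
        }
    }
}
```

When reading from a file, wrapping the `FileReader` in a `BufferedReader` avoids a system call per character.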
user2004685