I have a text file that contains some data. All paragraphs start with four spaces. My aim is to split this text into paragraphs.

First, I read the whole text using:

    public String parseToString(String filePath) throws IOException {
        return new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
    }

Then I use this code to split the string:

    private static final String PARAGRAPH_SPLIT_REGEX = "(^\\s{4})";
    public void parseText(String text) {
        String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
        for (int i = 0; i < paragraphs.length; i++) {
            System.out.println("Paragraph: " + paragraphs[i]);
        }
    }

My input file is:

    Hello, World!
    Hello, World!

And the output is:

    Paragraph: 
    Paragraph: Hello, World!!!
        Hello, World!!!

What am I doing wrong?

StasKolodyuk

2 Answers

By default, ^ matches only at the start of the string, not at the start of each line. If you want it to match at the start of each line, you need to add the multiline flag (?m) to your regex.

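Also consider using a look-ahead. It makes the match zero-width, and since Java 8 a zero-width match at the start of the string no longer produces an empty first element in the split array. Because nothing is consumed, the four leading spaces stay in each paragraph.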

So try with this regex:

    private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";
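
To make the difference concrete, here is a minimal sketch, assuming Java 8 or newer. The class name `AnchorDemo`, the helper `pieces`, and the sample text are made up for illustration. It runs the regex from the question, a multiline-only variant, and the regex above against the same input:

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original code.
class AnchorDemo {
    // Three lines, each starting with four spaces (assumed sample input).
    static final String TEXT = "    first\n    second\n    third";

    // Thin wrapper around String.split so the variants are easy to compare.
    static String[] pieces(String regex) {
        return TEXT.split(regex);
    }

    public static void main(String[] args) {
        // Regex from the question: ^ matches only at the start of the input,
        // and the consumed spaces leave an empty first element.
        System.out.println(Arrays.toString(pieces("(^\\s{4})")));

        // Multiline flag only: splits at every line start, but still consumes
        // the indentation and still yields an empty first element.
        System.out.println(Arrays.toString(pieces("(?m)^\\s{4}")));

        // Multiline flag plus look-ahead: the match is zero-width, so nothing
        // is consumed and (on Java 8+) there is no empty first element.
        System.out.println(Arrays.toString(pieces("(?m)(?=^\\s{4})")));
    }
}
```

On Java 8+ the three calls should return 2, 4 and 3 elements respectively; only the last one gives exactly one element per paragraph.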

To get rid of unwanted separators such as spaces or newlines at the start or end of each paragraph, you can simply use the trim method:

    public static void parseText(String text) {
        String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
        for (String paragraph : paragraphs) {
            System.out.println("Paragraph: " + paragraph.trim());
        }
    }

Example:

    String s =
            "    Hello, World!\r\n" +
            "    Hello, World!\r\n" +
            "    Hello, World!";
    parseText(s);

Output:

    Paragraph: Hello, World!
    Paragraph: Hello, World!
    Paragraph: Hello, World!

Pre Java 8 version:

If you need to run this code on older versions of Java, you have to prevent the split at the start of the string yourself (otherwise the first element is empty). To do this, you can put (?!^) before the multiline flag: a ^ placed before (?m) still matches only at the start of the string, not at the start of each line. To be more explicit, you can use \A, which matches the start of the string regardless of the multiline flag.

So a pre-Java 8 version of the regex can look like

    private static final String PARAGRAPH_SPLIT_REGEX = "(?!^)(?m)(?=^\\s{4})";

or

    private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?!\\A)(?=^\\s{4})";
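
As a quick sanity check, the sketch below compares both pre-Java 8 variants with the plain Java 8 regex on the example input. The class name `LegacySplitDemo` and the helper `paragraphs` are hypothetical. The comparison only shows the behaviour on Java 8 or newer, where all three should agree; the behaviour on older runtimes is taken from the explanation above and is not exercised here.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original answer.
class LegacySplitDemo {
    static final String JAVA8 = "(?m)(?=^\\s{4})";
    // (?!^) comes before (?m), so this ^ still means "start of the input".
    static final String LEGACY_CARET = "(?!^)(?m)(?=^\\s{4})";
    // \A always means "start of the input", whatever flags are active.
    static final String LEGACY_INPUT_START = "(?m)(?!\\A)(?=^\\s{4})";

    // Splits the text and trims every paragraph, mirroring parseText.
    static String[] paragraphs(String text, String regex) {
        String[] parts = text.split(regex);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        String s =
                "    Hello, World!\r\n" +
                "    Hello, World!\r\n" +
                "    Hello, World!";
        for (String regex : new String[] {JAVA8, LEGACY_CARET, LEGACY_INPUT_START}) {
            System.out.println(regex + " -> " + Arrays.toString(paragraphs(s, regex)));
        }
    }
}
```

The extra guard is harmless on newer runtimes, so either legacy variant can be kept if the code has to run on both old and new Java versions.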
Pshemo

Your regex should be \\s{4}, without the ^ at the beginning.

Lidan
  • `^` is necessary here; otherwise we would also split on four spaces inside a paragraph (I assume such text is *possible*, though *not very probable*). Also, `\\s` matches line separators too, which means that text like `"foo\r\n____bar"` (the four spaces are shown as `_`) could be split into `"foo"` and `"__bar"` because `\\s{4}` also consumed `\r\n`. Using `^` prevents the regex from doing so. – Pshemo Oct 14 '14 at 18:11
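
The case described in this comment can be reproduced with a short sketch. The class name `WhitespaceSplitDemo` and the helper `split` are hypothetical, and the input is the `foo\r\n____bar` string from the comment with real spaces:

```java
import java.util.Arrays;

// Hypothetical demo class illustrating the comment above.
class WhitespaceSplitDemo {
    // Thin wrapper around String.split for side-by-side comparison.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "foo\r\n    bar";

        // Without ^, \s{4} is free to match "\r\n" plus the first two spaces,
        // so the second piece keeps only two of its four leading spaces.
        System.out.println(Arrays.toString(split(text, "\\s{4}")));

        // Anchored with ^ in multiline mode, the split happens only at the
        // start of the indented line and the line break is left untouched.
        System.out.println(Arrays.toString(split(text, "(?m)(?=^\\s{4})")));
    }
}
```

The first call should yield `foo` and a `bar` with two leading spaces; the second keeps `foo\r\n` and the fully indented `bar` as separate pieces.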