I have a file that contains many paragraphs. I want to read it one paragraph at a time and perform operations on each paragraph.

final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";

String currentLine;

final BufferedReader bf = new BufferedReader(new FileReader("filename"));

currentLine = bf.readLine();

final StringBuilder stringBuilder = new StringBuilder();
while (currentLine != null) {
    stringBuilder.append(currentLine);
    stringBuilder.append(System.lineSeparator());
    currentLine = bf.readLine();
}

String[] paragraph = new String[stringBuilder.length()];

if (stringBuilder != null) {
    final String value = stringBuilder.toString();
    paragraph = value.split(PARAGRAPH_SPLIT_REGEX);
}

for (final String s : paragraph) {
    System.out.println(s);
}

File (every paragraph is indented by 2 spaces, and there is no blank line between paragraphs):

                      Story

  Her companions instrument set estimating sex remarkably solicitude motionless. Property men the why smallest graceful day insisted required. Inquiry justice country old placing sitting any ten age. Looking venture justice in evident in totally he do ability. Be is lose girl long of up give.
  "Trifling wondered unpacked ye at he. In household certainty an on tolerably smallness difficult. Many no each like up be is next neat. Put not enjoyment behaviour her supposing. At he pulled object others."
  Passage its ten led hearted removal cordial. Preference any astonished unreserved mrs. Prosperous understood middletons in conviction an uncommonly do. Supposing so be resolving breakfast am or perfectly. Is drew am hill from mr. Valley by oh twenty direct me so.
  Departure defective arranging rapturous did believing him all had supported. Family months lasted simple set nature vulgar him.   "Picture for attempt joy excited ten carried manners talking how. Suspicion neglected he resolving agreement perceived at an."

However, I am not getting the desired output. The paragraph variable contains only two values:

  1. The title of the file
  2. The rest of the contents of the file.

I guess the regex I am using here is not working. I took it from this question: Splitting text into paragraphs with regex JAVA

I am using Java 8.

  • 1
    My fault in formatting the paragraph. There is no blank line between the paragraphs. If I split on newlines, every line will be treated as a paragraph, and as there is no blank line, I will not be able to tell where a new paragraph starts. – LifeStartsAtHelloWorld Aug 28 '16 at 23:31
  • Just split on 2 or more linebreaks `\r?\n\s*\r?\n` If you want trimming along with that, just use `\s*\r?\n\s*\r?\n` –  Aug 28 '16 at 23:33
  • @user3369719 I see; I tried to edit it to clean up the formatting but gave up... no word-wrapped `pre` text here, I guess, heh. – Jason C Aug 28 '16 at 23:34
  • 1
    You could I suppose use an accumulate paragraph -> process -> reset accumulated lines concept with `currentLine.startsWith(" ")` but at this point it's no longer that much simpler than just using a regex. – Jason C Aug 28 '16 at 23:38
  • Is this what you're looking for? [DEMO](https://regex101.com/r/jF0tF1/1) – Alan Moore Aug 29 '16 at 03:10
  • Yes @AlanMoore. So is this the regex you used \R\h\h ? – LifeStartsAtHelloWorld Aug 29 '16 at 04:01
  • That's right. I edited the question to show the text as close as possible to what you want. I had to use two `&nbsp;`'s instead of actual spaces at the beginning of the paragraphs. For future reference, if you want to force a line break without adding a blank line, put two spaces at the end of the line. – Alan Moore Aug 29 '16 at 05:55
  • @user3369719, Try [`PARAGRAPH_SPLIT_REGEX = "(?m)(?=^[\\p{Zs}\t]{2})";`](https://regex101.com/r/lF1dL4/1) – Wiktor Stribiżew Aug 29 '16 at 07:27
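
For reference, the lookahead split suggested in the last comment can be tried in isolation. The original `\s{4}` requires four whitespace characters at the start of a line, while the paragraphs here are indented by only two. The following is a minimal sketch, assuming the whole file has already been read into a string; the class name `LookaheadSplitDemo`, the method `splitParagraphs`, and the sample text are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LookaheadSplitDemo {
    // Zero-width lookahead: split before every line that starts with two
    // horizontal whitespace characters (space separators or tabs).
    static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^[\\p{Zs}\t]{2})";

    // Hypothetical helper: returns the trimmed, non-empty pieces of the split.
    static List<String> splitParagraphs(String text) {
        return Arrays.stream(text.split(PARAGRAPH_SPLIT_REGEX))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample: indented title, blank line, two indented paragraphs.
        String text = "        Story\n\n  First paragraph.\n  Second paragraph.\n";
        splitParagraphs(text).forEach(p -> System.out.println("# " + p));
    }
}
```

Because the title line is indented too, it comes back as the first element, so a caller that only wants body paragraphs would have to skip it.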

3 Answers

2

You can use a Scanner with a delimiter to iterate over the text. For example:

Scanner scanner = new Scanner(text).useDelimiter("\n  ");
while (scanner.hasNext()) {
    String paragraph = scanner.next();
    System.out.println("# " + paragraph);
}

The output is:

#                       Story

# Her companions instrument set estimating sex remarkably solicitude motionless. Property men the why smallest graceful day insisted required. Inquiry justice country old placing sitting any ten age. Looking venture justice in evident in totally he do ability. Be is lose girl long of up give.
# "Trifling wondered unpacked ye at he. In household certainty an on tolerably smallness difficult. Many no each like up be is next neat. Put not enjoyment behaviour her supposing. At he pulled object others."
# Passage its ten led hearted removal cordial. Preference any astonished unreserved mrs. Prosperous understood middletons in conviction an uncommonly do. Supposing so be resolving breakfast am or perfectly. Is drew am hill from mr. Valley by oh twenty direct me so.
# Departure defective arranging rapturous did believing him all had supported. Family months lasted simple set nature vulgar him.   "Picture for attempt joy excited ten carried manners talking how. Suspicion neglected he resolving agreement perceived at an."
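
The delimiter `"\n  "` only matches a bare `\n` followed by two spaces. A possible variation, sketched here as an assumption rather than as part of the original answer, borrows the `\R\h\h` pattern mentioned in the comments so that `\r\n` line endings are consumed as well. The class name `ScannerParagraphDemo` and the method `readParagraphs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerParagraphDemo {
    // Hypothetical helper: collects the tokens produced by the Scanner.
    static List<String> readParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // \R matches any line break (including \r\n); \h matches one
            // horizontal whitespace character. Both are available in Java 8.
            scanner.useDelimiter("\\R\\h\\h");
            while (scanner.hasNext()) {
                // trim() drops leftover indentation and trailing line breaks.
                paragraphs.add(scanner.next().trim());
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample with Windows line endings.
        String text = "        Story\r\n\r\n  First paragraph.\r\n  Second paragraph.\r\n";
        readParagraphs(text).forEach(p -> System.out.println("# " + p));
    }
}
```

To read from a file instead of a string, the same delimiter should work with a `Scanner` constructed from a `File` or `Path`.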
hahn
1

Following Jason's comment, I tried his approach. I think I have the desired outcome. However, I am not pleased with the approach: time and space complexity have increased. I might improve it later.

currentLine = bf.readLine();

List<List<String>> paragraphs = new LinkedList<>();

// Index of the paragraph currently being filled.
int counter = 0;
while (currentLine != null) {

    // The very first line (the title) always opens the first paragraph.
    if (paragraphs.isEmpty()) {
        List<String> paragraph = new LinkedList<>();
        paragraph.add(currentLine);
        paragraph.add(System.lineSeparator());
        paragraphs.add(paragraph);

        currentLine = bf.readLine();
        continue;
    }

    if (currentLine.startsWith(" ")) {
        // An indented line starts a new paragraph.
        List<String> paragraph = new LinkedList<>();
        paragraph.add(currentLine);
        counter = counter + 1;
        paragraphs.add(paragraph);
    } else {
        // Any other line continues the current paragraph.
        List<String> continuedParagraph = paragraphs.get(counter);
        continuedParagraph.add(currentLine);
    }

    currentLine = bf.readLine();
}

for (final List<String> story : paragraphs) {
    for (final String s : story) {
        System.out.println(s);
    }
}
  • 1
    Hey, nice! But, to be fair: [Your implementation is a bit verbose](http://pastebin.com/nzf65Gtx) (btw you could swap in e.g. a `List` or something instead if you want to make paras be single strings) and, it's not actually increased complexity (or space) requirements over the regex version (you're actually saving a bit of overhead by not reassembling then re-splitting the text in the file). Not that using a regex would be unreasonable but, just wanted to point that out. :) – Jason C Aug 29 '16 at 01:28
  • Thanks @JasonC for your help. Yes, I need to improve it. Thanks for giving more information about it. – LifeStartsAtHelloWorld Aug 29 '16 at 01:41
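
The "accumulate paragraph -> process -> reset" idea from the comments can also be written with a single `StringBuilder`, so that each paragraph ends up as one string. This is only a sketch under assumptions, not the pastebin version linked above: the class `LineAccumulatorDemo`, the method `readParagraphs`, and the choice of a two-space prefix as the paragraph marker are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineAccumulatorDemo {
    // Hypothetical helper: reads lines and groups them into paragraphs.
    static List<String> readParagraphs(Reader source) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: a line starting with two spaces opens a new paragraph.
                if (line.startsWith("  ")) {
                    flush(current, paragraphs);
                }
                if (current.length() > 0) {
                    current.append(System.lineSeparator());
                }
                current.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, paragraphs); // do not lose the last paragraph
        return paragraphs;
    }

    // Stores the accumulated text (if any) and resets the buffer.
    private static void flush(StringBuilder current, List<String> paragraphs) {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            paragraphs.add(text);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Made-up sample; a FileReader could be passed instead.
        String text = "        Story\n\n  First line\ncontinued\n  Second paragraph.\n";
        readParagraphs(new StringReader(text)).forEach(p -> System.out.println("# " + p));
    }
}
```

Each line is visited once and each paragraph is copied once, so this stays linear in the size of the file, in line with the point made in the comment above.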
0

You could just globally find each indented paragraph, then add to a list.

"(?m)^[^\\S\\r\\n]{2,}\\S.*(?:\\r?\\n|$)(?:^\\S.*(?:\\r?\\n|$))*"

Explanation

 (?m)                     # Multi-line mode ( ^ = begin of line )

 ^ [^\S\r\n]{2,}          # Begin of Paragraph, 2 or more horizontal wsp at BOL
 \S .*                    # Rest of line, must be non-wsp as first letter.
 (?: \r? \n | $ )

 (?:                      # Optional, many more lines of this paragraph
      ^ \S .* 
      (?: \r? \n | $ )
 )*
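
To collect the matches into a list, the pattern above can be driven with `Matcher.find()`. The following is a minimal sketch; the class name `ParagraphFinderDemo`, the method `findParagraphs`, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphFinderDemo {
    // The regex from this answer: an indented first line followed by any
    // number of non-indented continuation lines.
    static final Pattern PARAGRAPH = Pattern.compile(
            "(?m)^[^\\S\\r\\n]{2,}\\S.*(?:\\r?\\n|$)(?:^\\S.*(?:\\r?\\n|$))*");

    // Hypothetical helper: returns every match, trimmed of indentation
    // and trailing line breaks.
    static List<String> findParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(text);
        while (matcher.find()) {
            paragraphs.add(matcher.group().trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample; the second paragraph wraps onto a second line.
        String text = "        Story\n\n  First line\ncontinued\n  Second paragraph.";
        findParagraphs(text).forEach(p -> System.out.println("# " + p));
    }
}
```

Since the title line is indented by more than two spaces, it is matched as a paragraph as well.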