Newbie in regex patterns. How to capture multiple lines?

Question

I'm quite new to regex patterns. I'm having difficulty parsing a text file and returning the matches per paragraph. So basically every paragraph is unique.

Here is my example text file:

A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141

I want matches[0] to be: #A quick brown fox jumps over the lazy dog; 1234;

and matches[1] to be: #Here is the second paragraph 123141

I've tried:

import re

regex = re.compile(r"(.*\n)\n", re.MULTILINE)
with open(file_dir, "r") as file:
    matches = regex.findall(file.read())
print(matches)

But the result is ['1234;\n']. It doesn't capture the whole first paragraph, and it doesn't capture the second one at all. What is the most efficient way of doing this?

See https://stackoverflow.com/questions/41620093/whats-the-difference-between-re-dotall-and-re-multiline — Thierry Lathuille, Aug 25 '20 at 12:10
What do you actually want? **So basically every paragraph is unique** Is it the separator? Kindly give more generic input file details, without comments — Sayan Dey, Aug 25 '20 at 12:13

Booboo · Answer 1 · 2020-08-25T12:51:59.307

Try (\S[\s\S]*?)(?:\n\n|$):

1. \S matches a non-whitespace character.
2. [\s\S]*? matches 0 or more characters of any kind (whitespace or non-whitespace, so newlines are included), non-greedily. Items 1 and 2 are in capture group 1.
3. (?:\n\n|$) matches two successive newline characters or $ (which matches either at the end of the string or just before a newline at the end of the string), in a non-capturing group.


The code:

import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

matches = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)
print(matches)

Prints:

['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
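The question asks for each paragraph on a single line (for example `A quick brown fox jumps over the lazy dog; 1234;`), while the matches printed here still contain newlines. A minimal sketch of one way to flatten them, assuming that every run of whitespace inside a paragraph should collapse to one space; the names `text`, `paragraphs` and `flattened` are only for illustration:

```python
import re

text = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

# Same pattern as in the answer: lazily capture up to a blank line or the end.
paragraphs = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', text)

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with a single space puts each paragraph on one line.
flattened = [" ".join(p.split()) for p in paragraphs]
print(flattened)
```

If the original line breaks carry meaning, skip the flattening step and keep `paragraphs` as it is.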

Alternatively, you can use:

\S(?:(?!\n\n)[\s\S])*

This uses a negative lookahead assertion and costs about the same as the previous regex. It first matches a non-whitespace character and then keeps consuming one character at a time for as long as the next two characters are not two successive newlines.
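A short sketch of the lookahead version applied to the same sample text, in case it helps to see it run. The pattern has no capture group, so `re.findall` returns the whole match for each paragraph; the variable names are illustrative only:

```python
import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

# \S starts a paragraph; after that, each character is consumed only if the
# current position is not the start of two consecutive newlines.
pattern = re.compile(r'\S(?:(?!\n\n)[\s\S])*')

matches = pattern.findall(s)
print(matches)
```

For this input the result should be the same two paragraphs as with the first pattern.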


Thanks for sharing this. I think the reason I was having a hard time creating a regex expression was because of using the multi line function. Although your second answer works as well in multiline. — Octane, Aug 25 '20 at 13:34
You might be able to get away with using multiline with the second regex version, but it is irrelevant because there is no `^` or `$` being used in the pattern, which is what re.MULTILINE affects. In the first regex `re.MULTILINE` would definitely be an error. — Booboo, Aug 25 '20 at 14:10
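The point about `re.MULTILINE` can be checked directly. This is a small sketch using the sample text from the question: with the flag set, `$` also matches just before every newline, so the lazy group in the first pattern stops at the end of each line instead of at the end of the paragraph.

```python
import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

pattern = r'(\S[\s\S]*?)(?:\n\n|$)'

# Default flags: $ only matches at the very end of the string
# (or just before a final newline), so paragraphs stay intact.
without_flag = re.findall(pattern, s)

# With re.MULTILINE, $ matches before every newline, so the lazy
# quantifier is satisfied as soon as the first line ends.
with_flag = re.findall(pattern, s, re.MULTILINE)

print(len(without_flag))  # number of paragraphs
print(len(with_flag))     # number of non-empty lines
```

Here `without_flag` holds the two paragraphs, while `with_flag` holds each of the seven non-empty lines as a separate match.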

score -1 · Answer 2 · answered Aug 25 '20 at 12:21

This is a good start:

(?:.+\s)+


Test code:

import re

regex = r"(?:.+\s)+"

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    # Report the span and text of each whole match (one per paragraph).
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this loop body never runs here.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Output:

Match 1 was found at 0-49: A quick brown
fox jumps over
the lazy dog;
1234;

Match 2 was found at 50-79: Here is
the second paragraph

You can see that the last line of the last paragraph is truncated. To avoid this, before matching the regex, add a \n at the end of the string, so the regex can detect the end of the paragraph: test_str += '\n'
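The workaround can be sketched as follows. The pattern requires a whitespace character after every line, so the last line only matches once the string ends with a newline. The `endswith` guard and the `strip()` call are additions for illustration, not part of the original answer:

```python
import re

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

# Make sure the final line is followed by whitespace so .+\s can match it.
if not test_str.endswith("\n"):
    test_str += "\n"

# Each match ends with the newline after its last line; strip() removes it.
paragraphs = [m.group().strip() for m in re.finditer(r"(?:.+\s)+", test_str)]
print(paragraphs)
```

With the trailing newline in place, the second paragraph includes its last line, `123141`.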


No, the code is generated from [regex101.com](https://regex101.com/r/iTSKXu/1), which I used to create the regex examples and python code. — totok, Aug 25 '20 at 12:30
Go to the "generated code" section; there you'll have access to multiple programming languages, which you can test on tio.run — totok, Aug 25 '20 at 12:31
