1

I'm quite new to regex patterns. I'm having difficulty parsing a text file and returning one match per paragraph. Paragraphs are separated by a blank line, and every paragraph is unique.

Here is my example text file:

A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141

I want matches[0] to be: #A quick brown fox jumps over the lazy dog; 1234;

matches[1] to be: #Here is the second paragraph 123141

I've tried

import re

regex = re.compile(r"(.*\n)\n", re.MULTILINE)
with open(file_dir, "r") as file:
    matches = regex.findall(file.read())
print(matches)

But the result is ['1234;\n']. It doesn't capture the whole first paragraph, and it doesn't capture the second one at all. What is the most efficient way of doing this?

Octane
  • See https://stackoverflow.com/questions/41620093/whats-the-difference-between-re-dotall-and-re-multiline – Thierry Lathuille Aug 25 '20 at 12:10
  • What do you actually want? **So basically every paragraph is unique** Is it the separator? Kindly give more generic details of the input file, without comments – Sayan Dey Aug 25 '20 at 12:13

2 Answers

2

Try (\S[\s\S]*?)(?:\n\n|$):

  1. \S Matches a non-whitespace character
  2. [\s\S]*? Matches 0 or more whitespace or non-whitespace characters, i.e. any character including newline, non-greedily. Items 1 and 2 are in capture group 1.
  3. (?:\n\n|$) Matches, in a non-capture group, either two successive newline characters or $ (which matches at the end of the string or just before a newline at the end of the string).

The code:

import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

matches = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)
print(matches)

Prints:

['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
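The question asks for each paragraph on a single line, while the matches above still contain the original newlines. A minimal sketch of one way to flatten them, assuming every run of whitespace inside a paragraph may be collapsed to one space (the name `flattened` is only illustrative):

```python
import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

paragraphs = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)

# str.split() with no argument splits on any run of whitespace,
# so joining the pieces with a space replaces each newline by one space.
flattened = [' '.join(p.split()) for p in paragraphs]
print(flattened)
```

This should print ['A quick brown fox jumps over the lazy dog; 1234;', 'Here is the second paragraph 123141'].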

Alternatively, you can use:

\S(?:(?!\n\n)[\s\S])*

This uses a negative lookahead assertion and has about the same cost as the previous regex. It first looks for a non-whitespace character and then keeps consuming one character at a time for as long as the current position is not immediately followed by two successive newline characters.
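For completeness, here is a short sketch of the lookahead version run against the same sample text. The pattern has no capture group, so re.findall returns the whole match for each paragraph:

```python
import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

# (?!\n\n) fails as soon as the next two characters are both newlines,
# which ends the current paragraph just before the blank line.
matches = re.findall(r'\S(?:(?!\n\n)[\s\S])*', s)
print(matches)
```

On this input it should give the same two strings as the first regex.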

Booboo
  • Thanks for sharing this. I think the reason I was having a hard time creating a regex expression was because of using the multi line function. Although your second answer works as well in multiline. – Octane Aug 25 '20 at 13:34
  • You might be able to get away with using multiline with the second regex version, but it is irrelevant because there is no `^` or `$` being used in the pattern, which is what re.MULTILINE affects. In the first regex `re.MULTILINE` would definitely be an error. – Booboo Aug 25 '20 at 14:10
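To illustrate the point about re.MULTILINE made in the comments above, here is a small sketch on a shortened sample string (the variable names `first` and `second` are only labels for the two patterns from this answer). With the flag set, $ also matches before every newline, so the first pattern stops at the end of each line, while the second pattern contains no ^ or $ and is unaffected:

```python
import re

s = "A quick brown\nfox jumps over\n\nHere is\nthe second paragraph"

first = r'(\S[\s\S]*?)(?:\n\n|$)'
second = r'\S(?:(?!\n\n)[\s\S])*'

without_flag = re.findall(first, s)
with_flag = re.findall(first, s, re.MULTILINE)
second_with_flag = re.findall(second, s, re.MULTILINE)

print(without_flag)      # one string per paragraph
print(with_flag)         # one string per line
print(second_with_flag)  # one string per paragraph again
```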
-1

This is a good start:

(?:.+\s)+

Test code:

import re

regex = r"(?:.+\s)+"

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this loop body never runs here.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Output:

Match 1 was found at 0-49: A quick brown
fox jumps over
the lazy dog;
1234;

Match 2 was found at 50-79: Here is
the second paragraph

You can see that the last line of the last paragraph (123141) is missing: the pattern requires every line to be followed by a whitespace character, and the final line has none. To avoid this, append a \n to the string before matching, so the regex can detect the end of the paragraph: test_str += '\n'
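A short sketch of that fix, comparing the matches with and without the appended newline. The names `truncated` and `complete` are only illustrative, and the final .strip() is an optional extra step that drops the trailing newline each match carries:

```python
import re

regex = r"(?:.+\s)+"

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

# Without a trailing newline the last line is dropped from the second match.
truncated = [m.group() for m in re.finditer(regex, test_str)]

# With the newline appended, the last line is followed by whitespace and is kept.
complete = [m.group().strip() for m in re.finditer(regex, test_str + "\n")]

print(truncated)
print(complete)
```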

totok
  • No, the code is generated from [regex101.com](https://regex101.com/r/iTSKXu/1), which I used to create the regex examples and python code. – totok Aug 25 '20 at 12:30
  • Go to the "generated code" section; you'll have access to multiple programming languages that you can test on tio.run – totok Aug 25 '20 at 12:31