Regex search up to first instance Python

Question

I know there are many similar questions, but I have built on their answers with no success. I've dug through several related questions, and the closest one to what I'm trying to do is in PHP, whereas I'm using Python 3.

My goal is to extract a substring from a body text. The body is formatted:

**Header1**   
thing1  
thing2  
thing3  
thing4 

**Header2**  
dsfgs  
sdgsg  
rrrrrr 

**Hello Dolly**  
abider  
abcder  
ffffff

etc.

Formatting on SO is tough, but in the actual text there are no spaces, just newlines for each line.

I want what's under Header2, so currently I have:

found = re.search("\*\*Header2\*\*\n[^*]+",body)
if found:
    list = found.group(0)
    list = list[11:]
    list = list.split('\n')
    print(list)

But that's returning None. Various other regexes I've tried either haven't worked or have grabbed too much (all of the remaining headers). For what it's worth, I've also tried `\*\*Header2\*\*.+?^\**$`, `\*\*Header2\*\*[^*\s\S]+\*\*`, and about 10 other permutations of those.

`\n` doesn't exist after `**Header**` because there are spaces. — ctwheels, Dec 28 '17 at 17:28
@ctwheels removing the \n fixed my issue! If you'd like to post that as an answer I'll accept it — singmotor, Dec 28 '17 at 17:50

ctwheels · Accepted Answer · 2017-12-28T18:02:17.397

Brief

Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because the **Header2** line has trailing spaces before the newline character. Adding ` *` (a space followed by `*`) before the \n should suffice, but I've added other options below as well.
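
One way to check this diagnosis is to run both patterns against a sample that keeps the trailing spaces. The sketch below assumes a shortened `body` string modeled on the question's text; it is illustrative and not the asker's actual data.

```python
import re

# Hypothetical sample modeled on the question; note the two spaces after **Header2**.
body = "**Header1**   \nthing1  \n\n**Header2**  \ndsfgs  \nsdgsg  \n\n**Hello Dolly**  \nabider"

# Original pattern: requires the newline immediately after the closing asterisks.
original = re.search(r"\*\*Header2\*\*\n[^*]+", body)

# Adjusted pattern: " *" tolerates any number of spaces before the newline.
adjusted = re.search(r"\*\*Header2\*\* *\n([^*]+)", body)

print(original)                 # no match, because of the trailing spaces
print(repr(adjusted.group(1)))  # text under the header, whitespace included
```

Printing with `repr()` makes the otherwise invisible trailing spaces and newlines visible, which is often the quickest way to spot this kind of mismatch.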

Code

\*{2}Header2\*{2} *\n([^*]+)

Alternatively, you can use the following regex. It also lets you capture lines that contain *, so long as they don't match the format of your header (^\*{2}[^*]*\*{2}), and it strips the trailing whitespace from the last element under the header. It needs the s and m flags (re.DOTALL | re.MULTILINE in Python) so that . can match newlines and ^ can match at the start of each line:

^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
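
As a rough sketch of how this longer pattern could be used from Python: it relies on `.` matching newlines and `^` matching at line starts, so it is compiled here with `re.MULTILINE | re.DOTALL`. The `sample` string is a made-up variant of the question's text with a literal `*` inside one of the lines.

```python
import re

pattern = re.compile(
    r"^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)",
    re.MULTILINE | re.DOTALL,  # ^ matches at line starts, . matches newlines
)

# Hypothetical sample: the first line under Header2 contains a lone asterisk.
sample = (
    "**Header1**  \nthing1  \n\n"
    "**Header2**  \nprice * 2  \nsdgsg  \nrrrrrr \n\n"
    "**Hello Dolly**  \nabider"
)

match = pattern.search(sample)
# Trailing spaces inside each line are still present, so strip them per line.
lines = [line.strip() for line in match.group(1).split("\n")]
print(lines)
```

Unlike `[^*]+`, this version does not stop at the asterisk in `price * 2`; it stops only where the next header line (or the end of the text) begins.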

Usage

import re

regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"

test_str = ("**Header1**   \n"
    "thing1  \n"
    "thing2  \n"
    "thing3  \n"
    "thing4 \n\n"
    "**Header2**  \n"
    "dsfgs  \n"
    "sdgsg  \n"
    "rrrrrr \n\n"
    "**Hello Dolly**  \n"
    "abider  \n"
    "abcder  \n"
    "ffffff")

print(re.search(regex, test_str).group(1))

Explanation

The pattern is practically identical to the OP's original pattern. I made minor changes so that it returns the result the OP is expecting.

\*\* changed to \*{2}: Equivalent notation; a stylistic change with no meaningful performance difference
\n changed to ` *\n` (a space followed by *, then \n): Allows for additional spaces at the end of the line before the newline character
([^*]+): Captures the contents the OP is expecting into capture group 1
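
With the content in capture group 1, the fixed-offset slice from the question (`list[11:]`) is no longer needed. A minimal sketch of turning the capture into a clean list of lines follows; the helper name `lines_under` and the shortened `body` are assumptions made for illustration.

```python
import re

def lines_under(header, body):
    """Return the stripped, non-empty lines under **header**, or [] if it is absent."""
    # re.escape guards against regex metacharacters in the header text.
    pattern = r"\*{2}" + re.escape(header) + r"\*{2} *\n([^*]+)"
    found = re.search(pattern, body)
    if found is None:
        return []
    return [line.strip() for line in found.group(1).splitlines() if line.strip()]

# Hypothetical sample modeled on the question's text.
body = (
    "**Header1**   \nthing1  \n\n"
    "**Header2**  \ndsfgs  \nsdgsg  \nrrrrrr \n\n"
    "**Hello Dolly**  \nabider"
)

print(lines_under("Header2", body))
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when the header is missing, and using a name other than `list` avoids shadowing the built-in.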

Jan · Answer 2 · 2017-12-28T17:35:10.613

You could use

^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)

with the multiline and verbose modifiers, see a demo on regex101.com.
Afterwards, just grab what is inside content (e.g. using re.finditer()).

Broken down this says:

^\*\*Header2\*\*.*[\n\r]    # match **Header2** at the start of a line,
                            # the rest of that line, and the newline character
(?P<content>(?:.+[\n\r])+)  # afterwards match as many non-empty lines as possible

In Python:

import re
rx = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>(?:.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

for match in rx.finditer(your_string_here):
    print(match.group('content'))

I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to

^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)

See a demo for the latter on regex101.com as well.
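
For completeness, here is a sketch of that second expression in Python, compiled with the same multiline and verbose flags. The `sample` string is an assumption: a shortened version of the question's text with a blank line inside the Header2 section.

```python
import re

rx = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]   # the header line, including anything trailing it
    (?P<content>[\s\S]+?)      # lazily take everything, blank lines included
    (?=^\*\*)                  # stop right before the next line starting with **
    ''', re.MULTILINE | re.VERBOSE)

# Hypothetical sample with an empty line between two entries under Header2.
sample = (
    "**Header1**  \nthing1  \n\n"
    "**Header2**  \ndsfgs  \n\nsdgsg  \n\n"
    "**Hello Dolly**  \nabider\n"
)

match = rx.search(sample)
content = match.group('content')
entries = [line.strip() for line in content.splitlines() if line.strip()]
print(entries)
```

One caveat worth noting: because the lookahead demands a later line starting with `**`, a Header2 section at the very end of the text would not match. Extending the lookahead to `(?=^\*\*|\Z)` is one possible way to cover that case.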

my code is never entering the for statement on this :/ – singmotor, Dec 28 '17 at 17:45

Ajax1234 · Answer 3 · 2017-12-28T17:48:10.263

You can try this:

import re
s = """
**Header1**   
thing1  
thing2  
thing3  
thing4 

**Header2**  
dsfgs  
sdgsg  
rrrrrr 

**Hello Dolly**  
abider  
abcder  
ffffff
"""
new_contents = re.findall(r'(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)

Output:

['  \ndsfgs  \nsdgsg  \nrrrrrr \n\n']

If you want to remove the whitespace from the output, you can try this (in Python 3, filter returns an iterator, so it is wrapped in list()):

final_data = list(filter(None, re.split(r'\s+', re.sub(r'\n+', '', new_contents[0]))))

Output:

['dsfgs', 'sdgsg', 'rrrrrr']
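
A shorter route to the same list, offered as a sketch: `str.split()` with no arguments splits on any run of whitespace (newlines included) and drops empty strings, so the separate substitution and filtering steps are not needed. The string `s` below is a shortened, assumed version of the sample above.

```python
import re

# Hypothetical shortened sample with the same trailing spaces as the question.
s = (
    "\n**Header1**   \nthing1  \n\n"
    "**Header2**  \ndsfgs  \nsdgsg  \nrrrrrr \n\n"
    "**Hello Dolly**  \nabider\n"
)

new_contents = re.findall(r'(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)

# split() with no arguments handles spaces and newlines in one step.
final_data = new_contents[0].split()
print(final_data)
```

Note that this only works as intended when the entries themselves contain no internal spaces; entries with spaces would be broken into separate words.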


this returns an empty array for 'new_contents' for me :/ – singmotor Dec 28 '17 at 17:43
@Acoustic77 are you reading from a text file? I achieved the results above by creating a multiline string from the input that you posted. – Ajax1234 Dec 28 '17 at 17:46
