
I know there are a bunch of similar questions, but I have built on their answers with no success. I've dug through several of them; the closest to what I'm trying to do is written for PHP, and I'm using Python 3.

My goal is to extract a substring from a body text. The body is formatted:

**Header1**   
thing1  
thing2  
thing3  
thing4 

**Header2**  
dsfgs  
sdgsg  
rrrrrr 

**Hello Dolly**  
abider  
abcder  
ffffff

etc.

Formatting on SO is tough, but in the actual text there are no spaces, just a newline at the end of each line.

I want what's under Header2, so currently I have:

found = re.search(r"\*\*Header2\*\*\n[^*]+", body)
if found:
    lines = found.group(0)
    lines = lines[11:]  # drop the 11-character header line "**Header2**\n"
    lines = lines.split('\n')
    print(lines)

But that's returning None (no match). Various other regexes I've tried also haven't worked, or grabbed too much (all of the remaining headers). For what it's worth, I've also tried \*\*Header2\*\*.+?^\**$ and \*\*Header2\*\*[^*\s\S]+\*\*, plus about 10 other permutations of those.


3 Answers


Brief

Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because the line **Header2** has trailing spaces before the newline character. Adding " *" (a space followed by *) before the \n should suffice, but I've added other options below as well.


Code


\*{2}Header2\*{2} *\n([^*]+)
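As a minimal sketch of the difference, the snippet below runs the question's pattern and the adjusted one against a shortened sample. The `body` string is an assumed stand-in for the real text, with trailing spaces after the header as in the post.

```python
import re

# Assumed sample: the header line ends with two spaces before the newline.
body = (
    "**Header1**   \n"
    "thing1  \n\n"
    "**Header2**  \n"
    "dsfgs  \n"
    "sdgsg  \n"
    "rrrrrr \n\n"
    "**Hello Dolly**  \n"
    "abider\n"
)

# The question's pattern requires \n immediately after the closing **.
original = re.search(r"\*\*Header2\*\*\n[^*]+", body)

# The adjusted pattern tolerates any number of spaces before the newline.
fixed = re.search(r"\*{2}Header2\*{2} *\n([^*]+)", body)

print(original)  # None, because of the trailing spaces

# Strip each captured line and drop the empty ones.
lines = [line.strip() for line in fixed.group(1).splitlines() if line.strip()]
print(lines)  # ['dsfgs', 'sdgsg', 'rrrrrr']
```

If the real text truly has no trailing spaces, both patterns should behave the same, so the adjusted one is safe either way.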

Alternatively, you can use the following regex. It also captures lines that contain *, as long as they don't match the format of your header (^\*{2}[^*]*\*{2}), and it trims the trailing whitespace from the last element under the header. It needs the multiline flag so that ^ matches at each line start, and the dot-all flag so that . can match across newlines:


^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
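A sketch of how this longer pattern might be used from Python follows. The flag choice is an assumption on my part: `re.MULTILINE` lets `^` match at line starts, and `re.DOTALL` lets `.` consume the newlines between content lines. The sample `body` is invented and includes a line containing `*` to show that it survives.

```python
import re

pattern = re.compile(
    r"^\*{2}Header2\*{2} *\n"            # the header line, trailing spaces allowed
    r"((?:(?!^\*{2}[^*]*\*{2}).)*?)"     # content, stopping before any header line
    r"(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)",    # up to the next header or end of input
    re.MULTILINE | re.DOTALL,
)

# Invented sample; one content line contains an asterisk.
body = (
    "**Header2**  \n"
    "dsfgs  \n"
    "2 * 3 = 6  \n"
    "rrrrrr \n\n"
    "**Hello Dolly**  \n"
    "abider\n"
)

match = pattern.search(body)
section = match.group(1)

# Trailing whitespace after the last element is already excluded by the lookahead.
items = [line.strip() for line in section.splitlines()]
print(items)  # ['dsfgs', '2 * 3 = 6', 'rrrrrr']
```

The simpler `[^*]+` version would stop at the asterisk in the second content line, which is the case this variant is meant to handle.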

Usage


import re

regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"

test_str = ("**Header1**   \n"
    "thing1  \n"
    "thing2  \n"
    "thing3  \n"
    "thing4 \n\n"
    "**Header2**  \n"
    "dsfgs  \n"
    "sdgsg  \n"
    "rrrrrr \n\n"
    "**Hello Dolly**  \n"
    "abider  \n"
    "abcder  \n"
    "ffffff")

print(re.search(regex, test_str).group(1))

Explanation

The pattern is practically identical to the OP's original pattern. I made minor changes so that it matches this input and returns the result the OP is expecting.

  1. \*\* changed to \*{2}: An equivalent, more compact spelling; any performance difference is very minor
  2. \n changed to " *\n" (a space, then *, then \n): Takes additional spaces at the end of the header line into account before the newline character
  3. ([^*]+): Captures the contents the OP is expecting into capture group 1
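The changes listed above can be folded into a small helper that works for any header name. This is only a sketch: `section_lines` is a hypothetical function name, not something from the question, and it keeps the `[^*]+` limitation that content containing `*` is cut short.

```python
import re


def section_lines(body, header):
    """Return the stripped, non-empty lines under **header**, or [] if absent."""
    # re.escape protects header names that contain regex metacharacters.
    pattern = r"\*{2}" + re.escape(header) + r"\*{2} *\n([^*]+)"
    match = re.search(pattern, body)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


# Shortened, assumed sample in the same shape as the question's text.
sample = (
    "**Header2**  \n"
    "dsfgs  \n"
    "sdgsg  \n"
    "rrrrrr \n\n"
    "**Hello Dolly**  \n"
    "abider  \n"
    "abcder  \n"
    "ffffff"
)

print(section_lines(sample, "Header2"))      # ['dsfgs', 'sdgsg', 'rrrrrr']
print(section_lines(sample, "Hello Dolly"))  # ['abider', 'abcder', 'ffffff']
print(section_lines(sample, "Missing"))      # []
```

Returning an empty list for a missing header avoids the AttributeError that `re.search(...).group(1)` raises when there is no match.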

You could use

^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)

with the multiline and verbose modifiers (see a demo on regex101.com).
Afterwards, just grab what is inside content (e.g. using re.finditer()).


Broken down this says:
^\*\*Header2\*\*.*[\n\r]    # match **Header2** at the start of the line 
                            # and newline characters
(?P<content>(?:.+[\n\r])+)  # afterwards match as many non-empty lines as possible


In Python:
import re
rx = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>(?:.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

for match in rx.finditer(your_string_here):
    print(match.group('content'))


I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to
^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)

See a demo for the latter on regex101.com as well.
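One thing to watch with the lookahead (?=^\*\*): it requires another header after the section, so the last section of a text never matches. A possible variation, which is my assumption rather than part of the expression above, is to also accept the end of the input with \Z. The names `strict` and `relaxed` and the sample `body` are made up for this sketch.

```python
import re

# The expression as given: the section must be followed by another header.
strict = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>[\s\S]+?)
    (?=^\*\*)
    ''', re.MULTILINE | re.VERBOSE)

# Assumed variation: also stop at the end of the input.
relaxed = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]   # header line
    (?P<content>[\s\S]+?)      # everything after it, lazily
    (?=^\*\*|\Z)               # next header or end of input
    ''', re.MULTILINE | re.VERBOSE)

# Invented sample in which Header2 is the final section.
body = "**Header1**\nthing1\n\n**Header2**\ndsfgs\nsdgsg\nrrrrrr\n"

print(strict.search(body))  # None, no header follows the section

match = relaxed.search(body)
lines = [line.strip() for line in match.group('content').splitlines() if line.strip()]
print(lines)  # ['dsfgs', 'sdgsg', 'rrrrrr']
```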


You can try this:

import re
s = """
**Header1**   
thing1  
thing2  
thing3  
thing4 

**Header2**  
dsfgs  
sdgsg  
rrrrrr 

**Hello Dolly**  
abider  
abcder  
ffffff
"""
new_contents = re.findall(r'(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)

Output:

['  \ndsfgs  \nsdgsg  \nrrrrrr \n\n'] 

If you want to strip the whitespace and get the individual lines as a list, you can try this:

final_data = list(filter(None, re.split(r'\s+', new_contents[0])))

Output:

['dsfgs', 'sdgsg', 'rrrrrr']