How to match text between headers (formatted as number and title)?

Question

I am trying to extract the paragraphs between the Results and Conclusion headers of a research paper using regular expressions. For the following sample, the paragraphs between "6. Results" and "7. Conclusion" should be matched.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec fermentum orci nec felis. Sed sollicitudin diam id sapien.

6. Results

Ut libero. Vestibulum quam libero, malesuada et, ornare id, aliquet id, tellus.

Nullam dapibus viverra quam. Vestibulum sit amet nunc vel justo dictum pharetra.

7. Conclusion.

Duis imperdiet venenatis purus.

I tried this, but the output is None:

x = (re.match(r'^[0-9]\s(Result)\.(.*?)^[0-9]\s(Conclusion)', text))

How could the Python re module be used to extract the paragraphs? This assumes regexes are the most appropriate tool, but they're not required for answers.

Can you not just select the text with the mouse and copy it? — mkrieger1, Oct 04 '22 at 18:46
Do you know the difference between `re.match` and `re.search`? (If not, look it up. You might need to use `re.search`) — mkrieger1, Oct 04 '22 at 18:49
Does this answer your question? "[How do I match any character across multiple lines in a regular expression?](/q/159118/90527)", "[What is the difference between re.search and re.match?](/q/180986/90527)", "[Python and "re"](/q/72393/90527)" — outis, Oct 04 '22 at 19:12
I can’t just select the text because the input as a research paper could be any paper. I just want to extract the results and present it as the output. I don’t know much about regex, I tried whatever I could find here from other answers but I couldn’t find anything suitable for me — everythingispeachy, Oct 06 '22 at 05:13
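Following the hints in the comments above, here is a minimal sketch of the regex route. The attempt in the question probably returns None for several reasons: `re.match` only tries the very start of the string, the pattern has no `\.` after the section number, it expects "Result." where the text has "Results", and without flags `^` does not match at line starts and `.` does not cross newlines. The sample `text` and the exact header spellings below are assumptions modelled on the question, not a general rule for every paper.

```python
import re

# Sample text modelled on the question; real input would be the paper's full text.
text = """Lorem ipsum dolor sit amet.

6. Results

Ut libero. Vestibulum quam libero.

Nullam dapibus viverra quam.

7. Conclusion.

Duis imperdiet venenatis purus.
"""

# re.MULTILINE lets ^ match at the start of every line;
# re.DOTALL lets . match newlines, so (.*?) can span several paragraphs.
pattern = re.compile(
    r'^\d+\.\s+Results?\.?[ \t]*\n(.*?)^\d+\.\s+Conclusions?\b',
    re.MULTILINE | re.DOTALL,
)

# re.search scans the whole string; re.match would only try position 0.
match = pattern.search(text)
results = match.group(1).strip() if match else None
print(results)
```

If a paper names its sections differently (for example "Findings"), the header words in the pattern would need adjusting.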

score 0 · Answer 1 · answered Oct 04 '22 at 21:39

Instead of matching the whole section with a single regex, you could read the document line by line and accumulate lines into a block until you hit a section header, then start a new block. Maybe something like this:

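import re

blocks = []

with open('researchpaper.txt', 'r') as f:
  block = ''
  for line in f:
    # A line such as "6. Results" starts a new section
    if re.match(r'\d+\.\s', line):
      blocks.append(block)
      block = ''
    else:
      block += line
  # Keep the text that follows the last header
  blocks.append(block)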
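To pull out one specific section, such as Results, the same loop can remember which header each block belongs to. The following is only a sketch: the `split_sections` helper, the `HEADER` pattern, and the inline sample are hypothetical names and data added for illustration, and it assumes every header sits on its own line in the form "number, dot, title".

```python
import re

# Matches a header line such as "6. Results" and captures the title.
HEADER = re.compile(r'^(\d+)\.\s+(.*\S)\s*$')

def split_sections(lines):
    """Group lines under the most recent numbered header.

    Returns a dict mapping header title -> body text. Lines before the
    first header are stored under the empty-string key.
    """
    sections = {}
    title = ''
    body = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            # Close the previous section and start a new one.
            sections[title] = ''.join(body).strip()
            title = m.group(2).rstrip('.')
            body = []
        else:
            body.append(line)
    sections[title] = ''.join(body).strip()  # flush the last section
    return sections

sample = """Lorem ipsum dolor sit amet.

6. Results

Ut libero.

Nullam dapibus viverra quam.

7. Conclusion.

Duis imperdiet venenatis purus.
""".splitlines(keepends=True)

sections = split_sections(sample)
print(sections['Results'])
```

One caveat: any body line that happens to start with a number and a dot (a numbered list item, say) would be treated as a header, so real papers may need a stricter pattern.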

Andrew Lien

I will try this. Will this help me extract the paragraphs after Result header? – everythingispeachy Oct 06 '22 at 05:14
yea thats the intent – Andrew Lien Oct 12 '22 at 17:17
