
I am trying to extract the paragraphs between the Results and Conclusion sections of a research paper using regular expressions. For the following sample, the two paragraphs between "6. Results" and "7. Conclusion" should be matched.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec fermentum orci nec felis. Sed sollicitudin diam id sapien.

6. Results

Ut libero. Vestibulum quam libero, malesuada et, ornare id, aliquet id, tellus.

Nullam dapibus viverra quam. Vestibulum sit amet nunc vel justo dictum pharetra.

7. Conclusion.

Duis imperdiet venenatis purus.

I tried this, but the output is None:

x = (re.match(r'^[0-9]\s(Result)\.(.*?)^[0-9]\s(Conclusion)', text))

How could the Python re module be used to extract the paragraphs? This assumes regexes are the most appropriate tool, but they're not required for answers.

outis
  • Can you not just select the text with the mouse and copy it? – mkrieger1 Oct 04 '22 at 18:46
  • Why did you include a second `^`, or any `^` at all? – mkrieger1 Oct 04 '22 at 18:48
  • 1
    Do you know the difference between `re.match` and `re.search`? (If not, look it up. You might need to use `re.search`) – mkrieger1 Oct 04 '22 at 18:49
  • 1
    Does this answer your question? "[How do I match any character across multiple lines in a regular expression?](/q/159118/90527)", "[What is the difference between re.search and re.match?](/q/180986/90527)", "[Python and "re"](/q/72393/90527)" – outis Oct 04 '22 at 19:12
  • I can’t just select the text because the input as a research paper could be any paper. I just want to extract the results and present it as the output. I don’t know much about regex, I tried whatever I could find here from other answers but I couldn’t find anything suitable for me – everythingispeachy Oct 06 '22 at 05:13
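
Pulling the comments above together, here is a minimal sketch of how the original pattern could be adjusted. It assumes section headers sit on their own lines and look like "6. Results" (digits, a period, whitespace, then the title); the function name `extract_results` and the shortened sample text are made up for illustration. Three things change: `re.search` is used because `re.match` only tries the very start of the string, `re.MULTILINE` lets `^` match at the start of each line, and `re.DOTALL` lets `.` cross newlines.

```python
import re

# Hypothetical helper: returns the text between the "Results" and
# "Conclusion" headers, or None when either header is missing.
RESULTS_PATTERN = re.compile(
    r'^\d+\.\s+Results?\b[^\n]*\n'   # header line such as "6. Results"
    r'(.*?)'                         # the paragraphs we want (lazy)
    r'^\d+\.\s+Conclusions?\b',      # next header such as "7. Conclusion."
    re.MULTILINE | re.DOTALL,
)

def extract_results(text):
    match = RESULTS_PATTERN.search(text)
    return match.group(1).strip() if match else None

# Shortened version of the sample from the question.
sample = (
    "Lorem ipsum dolor sit amet.\n\n"
    "6. Results\n\n"
    "Ut libero. Vestibulum quam libero.\n\n"
    "Nullam dapibus viverra quam.\n\n"
    "7. Conclusion.\n\n"
    "Duis imperdiet venenatis purus.\n"
)

print(extract_results(sample))
```

Papers that title these sections differently (for example "Findings" or "Discussion") would need a different pattern, so treat this as a starting point rather than a general solution.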

1 Answer


Instead of one big regex, you could split the document into lines, then keep appending lines to the current block until you hit a section header, at which point you start a new block. Maybe something like this:

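import re

blocks = []

with open('researchpaper.txt', 'r') as f:
  block = ''
  for line in f:
    # A section header starts with digits, a period, and whitespace, e.g. "6. Results".
    if re.match(r'\d+\.\s', line):
      # Header found: store the block collected so far and start a new one.
      blocks.append(block)
      block = ''
    else:
      block += line
  # Store the last block, which has no header after it.
  blocks.append(block)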
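
To pick out one particular section, it may help to remember which header each block came from. The following is only a sketch along the same lines, not a tested solution for real papers: `split_sections` and `HEADER` are hypothetical names, and it assumes headers have the form "6. Results", optionally ending in a period. Any body line that happens to start with a number and a period would be mistaken for a header.

```python
import re

# Captures the title of a numbered header, dropping an optional trailing period.
HEADER = re.compile(r'^\d+\.\s+(.+?)\.?\s*$')

def split_sections(text):
    """Map each header title to the text that follows it."""
    sections = {}
    title = None  # lines before the first header are skipped
    for line in text.splitlines(keepends=True):
        found = HEADER.match(line)
        if found:
            title = found.group(1)
            sections[title] = ''
        elif title is not None:
            sections[title] += line
    # Trim the blank lines surrounding each section body.
    return {name: body.strip() for name, body in sections.items()}

# Shortened version of the sample from the question.
paper = (
    "Lorem ipsum dolor sit amet.\n\n"
    "6. Results\n\n"
    "Ut libero. Vestibulum quam libero.\n\n"
    "Nullam dapibus viverra quam.\n\n"
    "7. Conclusion.\n\n"
    "Duis imperdiet venenatis purus.\n"
)

sections = split_sections(paper)
print(sections.get('Results'))
```

Using `sections.get('Results')` returns None instead of raising an error when a paper has no section with that exact title.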

Andrew Lien