I have .docx files in a directory and I want to get all the text between two paragraphs.

Example:

Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :

I want to get :

The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life. 

I wrote this code :

import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)


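# Assumption: the .docx files are in the current working directory.
docxfiles = glob.glob('*.docx')

for f in docxfiles: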
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        
        print(testf)
    except IOError:
        print('Error opening',f)

It returns None.

What am I doing wrong?

ricardo

1 Answer

If you loop over all paragraphs and print each paragraph's text, you get the document text as is, but a single p.text in your loop does not contain the full document's text.

It works with a string:

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""
      
import re

fread = re.search(r'Foo\s*:(.*)\s*Bar', t)

print(fread)  # None, because '.' does not match \n without re.DOTALL

fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)

print(fread)
print(fread[1])

Output:

<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

If you use

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

you will see why your regex won't match a single paragraph. Your regex would work on the whole document's text.

See "How to extract text from an existing docx file using python-docx" for how to get the whole document's text.

Alternatively, you could look for a paragraph that matches r'Foo\s*:', then put the text of every following paragraph into a list until you hit a paragraph that matches r'\s*Bar'.
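The paragraph-by-paragraph approach could be sketched roughly as follows. This is only an illustration: the function name text_between and the sample list are made up for this example, and the function takes plain strings so it can be tried without a .docx file. It also assumes empty paragraphs should be skipped, which the question does not state.

```python
import re


def text_between(paragraph_texts, start=r'Foo\s*:', end=r'\s*Bar'):
    """Collect paragraph texts after the start marker, up to the end marker.

    text_between is a hypothetical helper, not part of python-docx.
    If the end marker never appears, everything after the start is returned.
    """
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still searching for the paragraph that opens the section.
            if re.match(start, text):
                inside = True
            continue
        if re.match(end, text):
            # Reached the closing paragraph: stop collecting.
            break
        if text.strip():
            # Assumption: empty paragraphs are not wanted in the result.
            collected.append(text)
    return collected


# Stand-in for [p.text for p in reader.paragraphs]
sample = [
    "Foo :",
    "",
    "The foo is not easy, but we have to do it. "
    "We are looking for new things in our ad libitum way of life.",
    "",
    "Bar :",
]
print(text_between(sample))
```

With a real file, the same function should accept the paragraph texts directly, for example text_between(p.text for p in docx.Document(f).paragraphs).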

Patrick Artner