I want to find all occurrences of a specific term (and its variations) in a Word document. So far I have:
- Extracted the text from the Word document
- Tried to find the pattern with a regex
The pattern consists of words that start with DOC- followed by exactly 9 digits.
I have tried the following without success.
The document variable holds the extracted text, which is produced by the following function:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Then I search the text like this:

import re

pattern = re.compile('^DOC.\d{9}$')
pattern.findall(document)
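
For reference, here is a minimal sketch of the search described above, under the assumption that the term is always the literal text DOC- followed by exactly nine digits and may appear anywhere in a line. Without re.MULTILINE, ^ and $ only match at the start and end of the whole string (apart from a trailing newline), which may be why the attempt above returns nothing for a multi-paragraph document. The names DOC_PATTERN and find_doc_ids and the sample text are made up for illustration.

```python
import re

# Assumption: the term is "DOC-" followed by exactly nine digits.
# \b word boundaries are used instead of ^ and $ so the term can
# appear anywhere in the text, not only as the entire string.
DOC_PATTERN = re.compile(r'\bDOC-\d{9}\b')

def find_doc_ids(text):
    """Return every DOC-######### term found in text, in order."""
    return DOC_PATTERN.findall(text)

# Hypothetical sample standing in for the text returned by getText()
sample = "See DOC-123456789 and\nDOC-987654321, but not DOC-12345."
print(find_doc_ids(sample))
```

If the variations include other separators or lowercase letters, the pattern would need adjusting (for example with re.IGNORECASE); that depends on what the variations actually look like.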
Can someone help me?
Thanks in advance