I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.
I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format.
If I parse the document I could look for <w:br w:type="page"/>
occurrences in the xml tree but unfortunately these do not show non-forced page breaks.
I even considered converting files to PDF first and use something like PDFminer to parse the document page-wise.
Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like
[('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]