Finding word on page(s) in document

Question

I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.

I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format. If I parse the document I could look for <w:br w:type="page"/> occurrences in the xml tree but unfortunately these do not show non-forced page breaks.

I even considered converting files to PDF first and use something like PDFminer to parse the document page-wise.

Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like

[('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]

I think this is what you're looking for: [link](http://stackoverflow.com/questions/12571905/finding-on-which-page-a-search-string-is-located-in-a-pdf-document-using-python) — Roxy, Dec 30 '15 at 16:45
@mabe02 I haven't found a working solution yet no:/ but would be interested — birgit, Jan 22 '16 at 21:24

score 3 · Answer 1 · edited May 23 '17 at 12:24

Parse the xml files composing the docx

It seems that the biggest challenge in your question is how to be able to parse a document page by page. This answer of a word document is not always the same and it depends on the margins, the paper sheet settings, the application you use to open it etc. A good reasoning on the accuracy of any script for this purpose can be found at google group.

However, if you can be satisfied with a almost 100% accurate, you start to find a solution as suggested in this google group:

I found that I can unzip the .docx file and extract docProps/app.xml, then parse the XML with ElementTree to get the <Pages></Pages> element. I found that most of the time that number is accurate, but I've seen a few instances where the number in that element is not correct.

Use Win32com.Client

Another approach could be to use win32com.client to open the file, paginate it, make your search and then return the results in the format you want it.

You can find an example of the syntax in this answer:

from win32com.client import Dispatch
#open Word
word = Dispatch('Word.Application')
word.Visible = False
word = word.Documents.Open(doc_path)

#get number of sheets
word.Repaginate()
num_of_sheets = word.ComputeStatistics(2)

You can also have a look to this answer regarding find and replace in a word document using win32com.client.

@birgit so does it answer to your question? was it useful? – mabe02 Jan 26 '16 at 08:53 — mabe02, Jan 26 '16 at 08:53

Finding word on page(s) in document

1 Answers1

Parse the xml files composing the docx

Use Win32com.Client