I'm looking for a Python script that can extract the text from the first page of a Word document. I found functions that extract paragraphs, but none that work on pages, which is what I need.
1 Answer
The problem is that pages in the docx format are purely virtual. MS Word decides on its own where to put page breaks, based on the text size and other parameters.
It's a little easier when the user has set page breaks explicitly, since those are stored in the document and can be found, as described there, for example.
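To illustrate the explicit page break case, here is a minimal sketch that uses only the standard library. It relies on the fact that a .docx file is a zip archive containing `word/document.xml`, where an explicit page break is stored as `<w:br w:type="page"/>`. It also stops at `<w:lastRenderedPageBreak/>`, a hint Word may write on save to mark where it last broke a page; that hint is not guaranteed to be present or up to date. The function name `first_page_text` is my own, and the sketch ignores headers, footers, `pageBreakBefore` paragraph settings, and nested content such as text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def first_page_text(docx_file):
    """Return the text that precedes the first page break marker.

    `docx_file` is a path or a binary file-like object. If the document
    contains no break marker, all body text is returned.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            is_explicit_break = (
                node.tag == W + "br" and node.get(W + "type") == "page"
            )
            is_rendered_break = node.tag == W + "lastRenderedPageBreak"
            if is_explicit_break or is_rendered_break:
                # Keep what came before the break in this paragraph, then stop.
                paragraphs.append("".join(parts))
                return "\n".join(paragraphs).rstrip("\n")
            if node.tag == W + "t":
                parts.append(node.text or "")
        paragraphs.append("".join(parts))

    # No break marker found: the whole body is returned.
    return "\n".join(paragraphs).rstrip("\n")
```

Calling `first_page_text("report.docx")` (a hypothetical file name) would return the text before the first marker, one paragraph per line.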
As a workaround, you can estimate the number of lines per page and trim the text yourself, but as far as I know, there's no "easy" method that does everything in one line of code.
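The line-counting workaround could look roughly like the sketch below. It takes the paragraph texts (however you extracted them), wraps each one to an assumed line width, and keeps text until an assumed number of lines per page is used up. The function name and both default values (`chars_per_line=90`, `lines_per_page=45`) are guesses of mine, not properties of any real document; fonts, margins, images, and tables will all throw the estimate off, so treat the result as an approximation.

```python
import textwrap


def approximate_first_page(paragraphs, chars_per_line=90, lines_per_page=45):
    """Keep paragraph text until an estimated page worth of lines is used.

    Both limits are rough assumptions; tune them for your documents.
    """
    kept = []
    used = 0
    for para in paragraphs:
        # An empty paragraph still occupies one line on the page.
        lines = textwrap.wrap(para, width=chars_per_line) or [""]
        room = lines_per_page - used
        if room <= 0:
            break
        # Trim the paragraph if it would run past the estimated page end.
        kept.append(" ".join(lines[:room]))
        used += len(lines)
    return kept
```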

Дмитрий Клименко
- I see. I don't think there are any page breaks. Can I convert the Word document to PDF and then read the first page using the pdftotext function? – L Zh Sep 25 '18 at 14:25
- I did that (converted the .doc to .pdf and read the first page) and it worked! – L Zh Sep 25 '18 at 16:01
- Great to know! I wish you the best. – Дмитрий Клименко Sep 25 '18 at 16:05
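For anyone landing here later, the conversion route from the comments can be sketched as follows. The comments don't say which converter was used, so this assumes LibreOffice's `soffice` command line for the conversion and the `pdftotext` command line tool (from Poppler/Xpdf) for the extraction; both must be installed and on your PATH. The function names are mine. Page numbers come from LibreOffice's layout, which may differ slightly from what MS Word would show.

```python
import subprocess
from pathlib import Path


def build_commands(doc_path, out_dir):
    """Build the two command lines: convert to PDF, then extract page 1."""
    pdf_path = Path(out_dir) / (Path(doc_path).stem + ".pdf")
    convert = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(doc_path),
    ]
    # -f / -l select the first and last page; "-" writes the text to stdout.
    extract = ["pdftotext", "-f", "1", "-l", "1", str(pdf_path), "-"]
    return convert, extract


def first_page_via_pdf(doc_path, out_dir="."):
    """Convert the document to PDF and return the text of its first page."""
    convert, extract = build_commands(doc_path, out_dir)
    subprocess.run(convert, check=True)
    result = subprocess.run(extract, check=True, capture_output=True, text=True)
    return result.stdout
```

Usage would be something like `print(first_page_via_pdf("report.doc"))`, where `report.doc` is a placeholder file name.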