Read a Word document by pages using the docx2python package

Asked Dec 15 '21 at 21:24

Active Jan 06 '22 at 08:51

Viewed 1,618 times

How can I read a Word document page by page with docx2python? I want to build a dictionary whose keys are the page numbers and whose values are the text of the corresponding pages: {"1": "content 1", "2": "content 2", ...}. If this is not possible with docx2python, which package could I use instead?

This is my code so far; it returns the whole Word document as a single string. Thank you.

!pip install docx2python

from docx2python import docx2python

def read_word(file_path):
    """
    Function that reads a Word file and returns a string
    """

    # Extract docx content, ignore images
    doc = docx2python(file_path, extract_image=False)

    # Get all text in a single string
    output = doc.text

    return output

edited Dec 15 '21 at 22:48

asked Dec 15 '21 at 21:24

user140259

Page numbers are added by the rendering engine (Word). They are not a permanent part of the file, so no library is likely to extract them. – Shay Dec 22 '21 at 09:04

Thanks, the approach I took was to convert the documents to PDF. – user140259 Dec 22 '21 at 13:15
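
As the comments say, page boundaries are computed when the document is rendered, so converting to PDF is the dependable route. For a rough approximation without conversion, a .docx file does store two hints in word/document.xml: explicit page breaks (`<w:br w:type="page"/>`) and the `<w:lastRenderedPageBreak/>` markers that Word writes when it saves. The sketch below splits on those markers using only the standard library. It does not use docx2python, the function name `read_word_by_pages` is made up for illustration, and the result is only as accurate as the stored markers, which may be missing or stale if the file was not last saved by Word.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}tag" form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_word_by_pages(file_path):
    """Return {"1": text, "2": text, ...} split on page-break hints in the file.

    Assumption: only the main body (word/document.xml) is read; headers,
    footers and footnotes live in other parts and are ignored.
    """
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    pages = [[]]

    def start_new_page():
        # Word usually stores a lastRenderedPageBreak right after an explicit
        # page break, so a break is ignored while the current page is still
        # empty. Side effect: intentionally blank pages are collapsed.
        if "".join(pages[-1]).strip():
            pages.append([])

    # iter() walks the tree in document order
    for element in root.iter():
        if element.tag == W + "lastRenderedPageBreak":
            start_new_page()
        elif element.tag == W + "br" and element.get(W + "type") == "page":
            start_new_page()
        elif element.tag == W + "p":
            # Separate paragraphs with a newline
            pages[-1].append("\n")
        elif element.tag == W + "t" and element.text:
            pages[-1].append(element.text)

    return {
        str(number): "".join(parts).strip()
        for number, parts in enumerate(pages, start=1)
    }
```

Calling `read_word_by_pages("report.docx")` (the file name is a placeholder) returns a dictionary in the shape asked for in the question. If the page split has to match what Word shows exactly, converting to PDF and reading that page by page remains the safer choice.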

0 Answers