How can I read a Word document page by page with docx2python? I want to build a dictionary whose keys are the page numbers and whose values are the text of each page: {"1": "content 1", "2": "content 2", ...}. If this is not possible with docx2python, which package could I use instead?

This is my code so far. It returns the whole Word document as a single string. Thank you.

!pip install docx2python

from docx2python import docx2python

def read_word(file_path):
    """
    Read a Word file and return its full text as a single string.
    """
    # Extract docx content, ignore images
    doc = docx2python(file_path, extract_image=False)

    # Get all text in a single string
    output = doc.text

    return output
user140259
    Page numbers are added by the rendering engine (Word). They are not a permanent part of the file, so no library is likely to extract them. – Shay Dec 22 '21 at 09:04
    Thanks. The approach I took was to convert the documents to PDF. – user140259 Dec 22 '21 at 13:15
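
For reference, a rough way to approximate pages without converting to PDF is to split the document XML on the page-break markers that are stored in the file. This is only a sketch under some assumptions: it is not part of docx2python, the function name split_docx_by_page_markers is made up for illustration, and it only sees explicit page breaks (w:br with w:type="page") and the w:lastRenderedPageBreak hints that Word saved the last time it laid out the document. Those hints can be missing or stale, so the result may not match what Word displays, and a genuinely blank page is merged away.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_by_page_markers(file_path):
    """Return {"1": text, "2": text, ...} split on stored page-break markers.

    Hypothetical helper: approximates pages, does not render the document.
    """
    # A .docx file is a zip archive; the main text lives in word/document.xml
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(W + "body")
    pages = [[]]         # one list of text fragments per page
    just_broke = False   # True right after a break, until new text arrives

    for node in body.iter():
        is_page_break = (
            node.tag == W + "lastRenderedPageBreak"
            or (node.tag == W + "br" and node.get(W + "type") == "page")
        )
        if is_page_break:
            # An explicit break is often followed by a rendered-break hint
            # for the same page boundary, so count consecutive markers once.
            if not just_broke:
                pages.append([])
                just_broke = True
        elif node.tag == W + "t" and node.text:
            pages[-1].append(node.text)
            just_broke = False
        elif node.tag == W + "p" and pages[-1]:
            # iter() yields a paragraph before its runs, so this newline
            # separates it from the text already collected on this page.
            pages[-1].append("\n")

    return {str(i): "".join(parts).strip() for i, parts in enumerate(pages, start=1)}
```

Calling split_docx_by_page_markers("my_file.docx") (the file name is a placeholder) returns a dictionary in the shape asked for in the question. If exact pagination matters, converting to PDF and reading it page by page, as mentioned in the comment above, is the more reliable route.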
