I currently have a large amount of text data spread across hundreds of .pdf and .docx files. I would like to extract the text per page, as page numbers become relevant later in the analysis.
For the pdf files, I'm using the pdftools package, which works quite well and returns a character vector where each element is the text of one page of the document. Sentences or words that span two pages might be cut off, but that is less of a problem for now.
pdftools::pdf_text("Test.pdf") # delivers a string for each page
For the Word documents, I would like to have the same output. I'm currently trying the officer package for this. However, this package reads the text per paragraph instead of per page.
# load the file
doc <- officer::read_docx(path = "Test.docx")
# extract the text
doc_text <- officer::docx_summary(doc)$text # delivers a string for each paragraph
Is there any way to change the output from paragraphs to pages, if necessary by tweaking the underlying read_docx or docx_summary functions so that the text is split at each page break instead of at each paragraph?
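To illustrate the kind of split I have in mind: a .docx file is a zip archive, and Word records soft page breaks as <w:lastRenderedPageBreak/> elements (and explicit breaks as <w:br w:type="page"/>) inside word/document.xml. Below is a rough, untested sketch of that idea using the xml2 package; the helper name docx_pages is just illustrative, and it relies on the file having last been saved by Word, since other generators may not write those markers.

# Sketch: split a .docx into page-sized chunks of text by scanning
# word/document.xml for page-break markers (assumption: the markers are present)
docx_pages <- function(path) {
  # a .docx is a zip archive; extract only the main document part
  tmp <- tempfile()
  unzip(path, files = "word/document.xml", exdir = tmp)
  doc <- xml2::read_xml(file.path(tmp, "word", "document.xml"))

  # walk over the paragraphs (w:p) and start a new page whenever a paragraph
  # contains a rendered or explicit page break
  paras <- xml2::xml_find_all(doc, "//w:p")
  pages <- character(0)
  current <- character(0)
  for (p in paras) {
    brk <- xml2::xml_find_all(
      p, ".//w:lastRenderedPageBreak | .//w:br[@w:type='page']")
    if (length(brk) > 0 && length(current) > 0) {
      pages <- c(pages, paste(current, collapse = "\n"))
      current <- character(0)
    }
    current <- c(current, xml2::xml_text(p))
  }
  c(pages, paste(current, collapse = "\n"))
}

docx_pages("Test.docx") # intended to give one string per page, like pdf_text()

One limitation of this sketch is that a page break in the middle of a paragraph would push the whole paragraph onto the new page, so the split would only be exact at paragraph boundaries.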
Recommendations for other packages or methods to achieve this output are also welcome. If possible, though, I would like to avoid having to convert the Word documents to PDF first.
A simple test document can be generated with a Lorem Ipsum generator: https://www.lipsum.com/feed/html