I currently have a large amount of text data in hundreds of .pdf and .docx files. I would like to extract the text per page, because page numbers become relevant later in the analysis.

For the pdf files, I'm using the pdftools package, which works quite well and returns a vector with character strings where each element is the text of one page of the document. Sentences or words that span across two pages might be cut off, but that is less of a problem for now.

pdftools::pdf_text("Test.pdf") # returns one string for each page
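Since the page numbers matter later on, it may help to store them next to the text right away. The following is only a minimal base R sketch: `pages_to_df` and the `doc_id` column are hypothetical names, and the two sample strings stand in for the real output of `pdf_text`.

```r
# Hypothetical helper: pair each page's text with its page number.
pages_to_df <- function(pages, doc_id) {
  data.frame(
    doc_id = rep(doc_id, length(pages)),  # same document name on every row
    page = seq_along(pages),              # page number = position in the vector
    text = pages,
    stringsAsFactors = FALSE
  )
}

# Stand-in for pdftools::pdf_text("Test.pdf"), which returns one string per page
pages <- c("Text of page one.", "Text of page two.")
page_df <- pages_to_df(pages, doc_id = "Test.pdf")
```

One such data frame per file can then be combined with `rbind`, so that every row stays traceable to a document and a page.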

For the word documents, I would like to have the same output. I'm currently trying the officer package for this. However, this package reads the text per paragraph instead of per page.

# load the file
doc <- officer::read_docx(path = "Test.docx")
# extract the text
doc_text <- officer::docx_summary(doc)$text # delivers a string for each paragraph

Is there any way to change the output from returning paragraphs to returning pages, if necessary by tweaking the underlying read_docx or docx_summary functions so that the text is split at each page break instead of at each paragraph?

Recommendations for other packages or methods to achieve this output are also welcome. If possible, however, I would like to avoid converting the Word documents to PDF.

A simple test document can be generated with a Lorem Ipsum generator: https://www.lipsum.com/feed/html

Rasul89
    A workaround could be to convert your .docx files to .pdf (check this https://stackoverflow.com/q/49113503/21243518) and use `pdf_text` on all documents. – L-- Mar 03 '23 at 11:03
    @L-- is right: convert your docx to pdf and then it will be easier. A solution not listed in the linked post is `doconv::docx2pdf()`, which can be used when Word is installed (I am the author). – David Gohel Mar 03 '23 at 12:03

1 Answer

I have been able to extract the text of a specific page with the following code (it requires Windows with Word installed):

library(RDCOMClient)

# Start Word through COM and open the document
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
path_To_Word_File <- "D:\\Word_File.docx"
doc <- wordApp[["Documents"]]$Open(normalizePath(path_To_Word_File), ConfirmConversions = FALSE)
doc_Selection <- wordApp$Selection()

list_Text <- list()

# Try up to 40 rectangles on page 1; stop at the first index that does not exist
for(i in 1 : 40)
{
  print(i)
  # TRUE if rectangle i could be selected, FALSE if the COM call failed
  selected <- tryCatch({
    wordApp[["ActiveDocument"]]$ActiveWindow()$Panes(1)$Pages(1)$Rectangles(i)$Range()$Select()
    TRUE
  }, error = function(e) FALSE)

  # Check before reading, so a failed selection does not store the previous text again
  if(!selected)
  {
    break
  }

  list_Text[[i]] <- tryCatch(doc_Selection$Range()$Text(), error = function(e) NA)
}

list_Text

The idea is to loop over all the rectangles of a page (here the first page, Pages(1)) and extract the text of each rectangle. To read another page, change the index in Pages().
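If Word is not available, a rough alternative is to split the document XML at the page markers stored in the file. The sketch below is an approximation under a loud assumption: it relies on the `w:lastRenderedPageBreak` elements that Word writes when it saves a document, so files produced by other tools, or repaginated since the last save, may give wrong or missing page boundaries. The function names `split_docx_xml_by_page` and `read_docx_pages` are hypothetical, and the xml2 package is used in place of officer.

```r
library(xml2)

# Hypothetical helper: split parsed word/document.xml into one string per page.
# Assumption: page starts are marked by <w:lastRenderedPageBreak/>.
split_docx_xml_by_page <- function(doc_xml) {
  ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")
  # Paragraphs, text runs and page markers, in document order
  nodes <- xml_find_all(
    doc_xml,
    "//w:body//*[self::w:p or self::w:t or self::w:lastRenderedPageBreak]",
    ns
  )
  kinds <- xml_name(nodes)  # local names: "p", "t", "lastRenderedPageBreak"
  # Every marker starts a new page
  page <- cumsum(kinds == "lastRenderedPageBreak") + 1L
  # Text nodes contribute their text, paragraph starts a newline, markers nothing
  piece <- ifelse(kinds == "t", xml_text(nodes), ifelse(kinds == "p", "\n", ""))
  pages <- vapply(split(piece, page), paste, character(1), collapse = "")
  unname(trimws(pages))
}

# Hypothetical wrapper: a .docx file is a zip archive containing word/document.xml
read_docx_pages <- function(path) {
  xml_file <- utils::unzip(path, files = "word/document.xml", exdir = tempfile())
  split_docx_xml_by_page(read_xml(xml_file))
}

# Small in-memory example with one marker, so no Word file is needed
sample_xml <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>',
  '<w:p><w:r><w:t>First page.</w:t></w:r></w:p>',
  '<w:p><w:r><w:lastRenderedPageBreak/><w:t>Second page.</w:t></w:r></w:p>',
  '</w:body></w:document>'
))
docx_pages <- split_docx_xml_by_page(sample_xml)
```

On a real file, `read_docx_pages("Test.docx")` would return a character vector shaped like the output of `pdf_text`. It is worth comparing the number of pages against Word's own count for a few documents before trusting it.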

Emmanuel Hamel