
How can I read a Microsoft .docx file in R and get the text as one field and page number as another?

With the readtext package I can read the text, but is there a way to get the page number as well?

install.packages("readtext")

library(readtext)

doc <- readtext(system.file("examples/realworld.docx", package="docxtractr"))

The desired output is:

text                page_number
text from page 1     1
text from page 2     2

Please advise.

Geet
  • From looking into it, I'm not sure Word actually stores page numbers; it just flows text onto a new page when the current one is full. There are page-break tags in the `xml`, but I think those only mark breaks that were inserted manually. I'd be interested to know whether this is possible. https://stackoverflow.com/questions/23980268/find-a-new-page-in-a-word-document – Anonymous coward Jul 30 '18 at 20:39
  • I found that the read_pdf function in the textreadr package does return the page number and line number, but how would I then convert a .docx file to PDF from R? – Geet Jul 30 '18 at 20:48
  • You can see if `pandoc` works. https://stackoverflow.com/questions/49113503/how-to-convert-docx-to-pdf-in-r – Anonymous coward Jul 30 '18 at 21:08
  • You can convert a docx document to a PDF document with the RDCOMClient package. See https://stackoverflow.com/questions/49113503/how-to-convert-docx-to-pdf – Emmanuel Hamel Apr 18 '23 at 03:55
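
On the point about page-break tags in the `xml`: a rough sketch of what counting them could look like, assuming the xml2 package is installed. The XML string is a made-up miniature of the `word/document.xml` part inside a .docx archive, and `break_xpath`, `n_breaks` and `pages` are illustrative names. It counts explicit breaks (`<w:br w:type="page"/>`) and the `<w:lastRenderedPageBreak/>` hints that Word may leave behind when it saves; those hints can be missing or stale, so the page numbers are only an approximation.

```r
library(xml2)

# Made-up stand-in for word/document.xml. For a real file, the part could be
# extracted first, e.g. unzip("file.docx", files = "word/document.xml", exdir = tempdir()).
xml_string <- '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>text from page 1</w:t></w:r></w:p>
    <w:p><w:r><w:br w:type="page"/></w:r></w:p>
    <w:p><w:r><w:t>text from page 2</w:t></w:r></w:p>
    <w:p><w:r><w:lastRenderedPageBreak/><w:t>text from page 3</w:t></w:r></w:p>
  </w:body>
</w:document>'

doc <- read_xml(xml_string)
ns <- xml_ns(doc)
paras <- xml_find_all(doc, "//w:p", ns)

# Explicit page breaks plus the rendered-break hints saved by Word.
break_xpath <- ".//w:br[@w:type='page'] | .//w:lastRenderedPageBreak"
n_breaks <- vapply(
  seq_along(paras),
  function(i) length(xml_find_all(paras[[i]], break_xpath, ns)),
  integer(1)
)

# Simplification: a paragraph is assigned to the page it ends on.
pages <- data.frame(
  text = xml_text(paras),
  page_number = 1L + cumsum(n_breaks),
  stringsAsFactors = FALSE
)
pages <- pages[nzchar(pages$text), ]  # drop paragraphs that hold only a break
pages
```

Text that Word reflows onto a new page without either marker will not be detected this way, which is why rendering through Word itself or through a PDF tends to be more reliable.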

1 Answer


I was able to get the page number for each piece of text with the following approach, which drives Word through RDCOMClient:

library(RDCOMClient)  # Windows only; drives an installed copy of Microsoft Word

wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
path_To_Word_File <- "D:\\text.docx"
doc <- wordApp[["Documents"]]$Open(normalizePath(path_To_Word_File), ConfirmConversions = FALSE)
doc_Selection <- wordApp$Selection()

list_Text <- list()
list_Page_Text <- list()
counter <- 0

# The page count (2) and the cap of 40 rectangles per page are hard-coded for this document.
for(l in 1 : 2)
{
  for(i in 1 : 40)
  {
    print(i)  # progress indicator

    # Try to select rectangle i on page l. An error is taken to mean that
    # the page has no more rectangles.
    select_Failed <- tryCatch({
      wordApp[["ActiveDocument"]]$ActiveWindow()$Panes(1)$Pages(l)$Rectangles(i)$Range()$Select()
      FALSE
    }, error = function(e) TRUE)

    if(select_Failed)
    {
      break  # move on to the next page
    }

    # Store the selected text together with the page it was found on.
    counter <- counter + 1
    list_Text[[counter]] <- tryCatch(doc_Selection$Range()$Text(), error = function(e) NA)
    list_Page_Text[[counter]] <- l
  }
}

list_Text
[[1]]
[1] "hi\r"

[[2]]
[1] "\r"

[[3]]
[1] "this is a good text\r"

[[4]]
[1] "\r"

[[5]]
[1] "\r"

[[6]]
[1] "\r"

[[7]]
[1] "here is a word document\r"

[[8]]
[1] "\r"

[[9]]
[1] "\r"

[[10]]
[1] "\r"

[[11]]
[1] "\r"

[[12]]
[1] "\r"

[[13]]
[1] "\r"

[[14]]
[1] "\r"

[[15]]
[1] "\r"

[[16]]
[1] "\r"

[[17]]
[1] "\r"

[[18]]
[1] "\r"

[[19]]
[1] "\r"

[[20]]
[1] "\r"

[[21]]
[1] "\r"

[[22]]
[1] "\r"

[[23]]
[1] "\r"

[[24]]
[1] "\r"

[[25]]
[1] "\r"

[[26]]
[1] "\r"

[[27]]
[1] "\r"

[[28]]
[1] "\r"

[[29]]
[1] "\r"

[[30]]
[1] "\r"

[[31]]
[1] "\r"

[[32]]
[1] "My cat love me\r"

[[33]]
[1] "\r"

[[34]]
[1] "hahah\r"

[[35]]
[1] "\r"

[[36]]
[1] "\r"

[[37]]
[1] "\r"

[[38]]
[1] "\r"

[[39]]
[1] "\r"

list_Page_Text
[[1]]
[1] 1

[[2]]
[1] 1

[[3]]
[1] 1

[[4]]
[1] 1

[[5]]
[1] 1

[[6]]
[1] 1

[[7]]
[1] 1

[[8]]
[1] 1

[[9]]
[1] 1

[[10]]
[1] 1

[[11]]
[1] 1

[[12]]
[1] 1

[[13]]
[1] 1

[[14]]
[1] 1

[[15]]
[1] 1

[[16]]
[1] 1

[[17]]
[1] 1

[[18]]
[1] 1

[[19]]
[1] 1

[[20]]
[1] 1

[[21]]
[1] 1

[[22]]
[1] 1

[[23]]
[1] 1

[[24]]
[1] 1

[[25]]
[1] 1

[[26]]
[1] 1

[[27]]
[1] 1

[[28]]
[1] 1

[[29]]
[1] 1

[[30]]
[1] 2

[[31]]
[1] 2

[[32]]
[1] 2

[[33]]
[1] 2

[[34]]
[1] 2

[[35]]
[1] 2

[[36]]
[1] 2

[[37]]
[1] 2

[[38]]
[1] 2

[[39]]
[1] 2
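
To get from these two lists to the table asked for in the question, one option is to collapse the selected pieces by page. A minimal base-R sketch, assuming `list_Text` and `list_Page_Text` have the shape shown above; the short lists below are stand-ins for illustration, and `pages_df` is a hypothetical name:

```r
# Stand-in data with the same shape as list_Text / list_Page_Text above
# (illustrative values only, not the full output).
list_Text <- list("hi\r", "\r", "this is a good text\r",
                  "My cat love me\r", "\r", "hahah\r")
list_Page_Text <- list(1, 1, 1, 2, 2, 2)

# Flatten both lists and strip the trailing carriage returns.
text <- gsub("\r", "", unlist(list_Text), fixed = TRUE)
page <- unlist(list_Page_Text)

# Drop empty pieces, then paste the remaining text together page by page.
keep <- nzchar(text)
by_page <- tapply(text[keep], page[keep], paste, collapse = " ")

pages_df <- data.frame(
  text = as.vector(by_page),
  page_number = as.integer(names(by_page)),
  stringsAsFactors = FALSE
)
pages_df
```

A page that contains only empty pieces would not appear in `pages_df` with this approach.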


Emmanuel Hamel