So for a bit of weekend fun, I decided I was going to try and read a Microsoft Word .doc file into R. Specifically I have a .doc file version of the PDF below:

http://www.queensu.ca/rarc/services/ASDAssessmentTemplate/AAA/AQ_Scoring_Key.pdf

What I would like to do is extract the table into something like a data frame in R. My initial investigation leads me to believe that the "tm" package could be handy for this, but I can't seem to get it to work.

As usual, any help would be gratefully received.

Edit: This question asks for the specific steps (i.e. code) for reading in a .doc file and thus is not a duplicate of the question that has been linked as a duplicate.

googleplex101
  • http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?s=1|9.3977 – rawr Feb 21 '15 at 17:50

1 Answer

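Here is an example of how one could extract a simple table from a .docx file. A legacy binary .doc file is not a zip archive, so it would first have to be converted to .docx (for example by saving it as .docx in Word):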

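library(XML)

# A .docx file is a zip archive: download it, then unpack it into the temp directory
docx_path <- file.path(tempdir(), "test.docx")
download.file(url = "https://www.dropbox.com/s/36ydzz98beluhj8/test.docx?dl=1",
              destfile = docx_path,
              mode = "wb")
unzip(docx_path, exdir = tempdir())

# The document body, including any tables, is stored in word/document.xml
doc <- xmlParse(file.path(tempdir(), "word", "document.xml"))

# Read every cell (w:tc) row by row and reshape the values into a data frame.
# This assumes a single table with the same number of cells in every row.
df <- as.data.frame(
  matrix(
    xpathSApply(doc, "//w:tbl/w:tr/w:tc", xmlValue),
    ncol = length(getNodeSet(doc, "//w:tbl/w:tr[1]/w:tc")),
    nrow = length(getNodeSet(doc, "//w:tbl/w:tr")),
    byrow = TRUE
  )
)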

df
#   V1 V2 V3
# 1     2  3
# 2  4  5  6
# 3  7     9

Tweak it according to your needs.
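
As one possible tweak, the same XPath idea can be applied to each table separately, so that a document with several tables yields a list of data frames. This is only a minimal sketch: the helper names (`make_cell`, `make_row`, `make_tbl`, `extract_tables`) and the inline XML string are made up for illustration, with the string standing in for an unzipped `word/document.xml`. It also assumes that every row of a table has the same number of cells (no merged cells).

```r
library(XML)

# WordprocessingML namespace, passed explicitly so that XPath queries
# on sub-nodes can resolve the "w:" prefix
ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

# Hypothetical helpers that build a tiny stand-in for word/document.xml
make_cell <- function(x) sprintf("<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>", x)
make_row  <- function(...) paste0("<w:tr>", paste0(make_cell(c(...)), collapse = ""), "</w:tr>")
make_tbl  <- function(...) paste0("<w:tbl>", ..., "</w:tbl>")

xml_text <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>',
  make_tbl(make_row("a", "b"), make_row("c", "d")),
  make_tbl(make_row("1", "2", "3")),
  "</w:body></w:document>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Return one data frame per w:tbl node (illustrative helper, not part of XML)
extract_tables <- function(doc) {
  lapply(getNodeSet(doc, "//w:tbl", namespaces = ns), function(tbl) {
    rows  <- getNodeSet(tbl, "./w:tr", namespaces = ns)
    cells <- lapply(rows, function(tr) {
      xpathSApply(tr, "./w:tc", xmlValue, namespaces = ns)
    })
    # Stack the per-row character vectors into a matrix, then convert
    as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
  })
}

tables <- extract_tables(doc)
```

With a real file, `doc` would instead come from `xmlParse()` on the unzipped `word/document.xml`, as in the code above.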

lukeA
  • Thanks for your input. I have used XML before to look at HTML but I can't seem to make your code work for my .doc file. Can you explain the unzip function and the file.path(tempdir()) functions a bit more please? – googleplex101 Feb 22 '15 at 13:04
  • test.docx is an archive, and unzip extracts the files from it to the standard temporary directory. See `?tempdir`, `?unzip` and `?file.path`. What problems have you got? – lukeA Feb 22 '15 at 17:20
  • NOTE: this solution works properly only when you have one table. Anyone interested in better extraction of more than one table should use library(docxtractr). Here's a nice example: https://rdrr.io/cran/docxtractr/man/docx_extract_all.html and it goes like this: `real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))` followed by `docx_tbl_count(real_world)` – kwadratens Dec 06 '22 at 20:39
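
The docxtractr approach mentioned in the last comment could look roughly like the sketch below. It assumes the docxtractr package is installed and uses the example file named in that comment. The variable names are illustrative, and the shape of the resulting tables depends on the document, so no output is shown.

```r
library(docxtractr)

# Example document shipped with docxtractr (path taken from the comment above)
docx_file  <- system.file("examples/realworld.docx", package = "docxtractr")
real_world <- read_docx(docx_file)

# How many tables does the document contain?
n_tables <- docx_tbl_count(real_world)

# Extract a single table by position, or all tables as a list of data frames
first_tbl <- docx_extract_tbl(real_world, tbl_number = 1)
all_tbls  <- docx_extract_all_tbls(real_world)
```

For your own document, replace `docx_file` with the path to a .docx file.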