12

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.

I am using the line:

my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')

to try to read an MSWord file containing the following text:

A   20  1000    AA
B   30  1001    BB
C   10  1500    CC

I get a warning message that says:

Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
  incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'

and my.data appears to be gibberish:

# [1] "PK\003\004\024" "¤l"             "ÈFÃË‹Átí"

I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and later scanned into pdf documents. The age of the original paper documents, and perhaps imperfections in the original paper, typing and/or scanning process, have left some letters and numbers unclear. So far, converting the pdf files to MSWord has been the most successful way of correctly translating the tables. Converting the MSWord files to Excel, rich text, etc. has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought that reading the MSWord files into R might be the most efficient way to edit and correct them.

I am aware of 'package tm' that I guess can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.

Thank you for any suggestions.

Mark Miller
  • As far as I know, reading MS Word files is going to require installing some package from CRAN. Why are you concerned about installing third-party software? – Jason Morgan Jun 20 '12 at 00:31
  • The tm package provides the function readDOC(). This requires installation of an external (non-R) tool named antiword. However, I believe that the package/tool only reads Word files up to version 2003 and will not handle .docx files. readLines() is not the correct solution either; it expects a plain-text file as input. – neilfws Jun 20 '12 at 00:42
  • 2
    What if you were to save the word document as `html` and then use a web scraping package (eg `XML` or `RCurl`) to extract the text? – mnel Jun 20 '12 at 00:49
  • Thank you for the suggestions. I have never done web scraping, although it is on my list of things to learn. Perhaps this is the motivating factor for me to learn it. – Mark Miller Jun 20 '12 at 00:58
  • 2
    Did you try OCR of the original pdfs with Google Docs? There are other free online OCR services that wouldn't require software installation. – tim riffe Jun 20 '12 at 01:13
  • For passersby who also use Python: `pip install python-docx`, then `import docx`. – Brandon Bertelsen Jun 26 '19 at 22:23

4 Answers

7

First, readLines() is not the correct solution, since a Word file is not a plain-text file. A .docx file is a zip archive, which is why the output in the question begins with "PK".
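The "PK" at the start of the output in the question is the signature of a zip archive, which is the container format a .docx file uses. A minimal sketch of checking this from R, assuming only base R; `looks_like_zip()` is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: TRUE if the file starts with the zip signature "PK\003\004".
looks_like_zip <- function(path) {
  # Read the first four bytes of the file as raw values.
  first_bytes <- readBin(path, what = "raw", n = 4L)
  identical(first_bytes, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Usage (the path is the one from the question):
# looks_like_zip("c:/users/mark w miller/simple R programs/test_for_r.docx")
```

If this returns TRUE, readLines() is being pointed at compressed binary data, not text.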

The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.
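For .docx files specifically, one possible route that needs no extra software is to treat the file as what it is: a zip archive whose main text lives in `word/document.xml`. The following is only a rough sketch using base R and regular expressions, not a full parser. The helper names `docx_xml_to_text()` and `read_docx_text()` are hypothetical, and nested structures such as text boxes are not handled.

```r
# Hypothetical helper: turn the XML from word/document.xml into one string per paragraph.
docx_xml_to_text <- function(xml) {
  # Each <w:p ...> ... </w:p> element is one paragraph (table cells contain paragraphs too).
  paras <- regmatches(xml, gregexpr("(?s)<w:p[ >].*?</w:p>", xml, perl = TRUE))[[1]]
  vapply(paras, function(p) {
    # Tab characters are stored as empty <w:tab/> elements; turn them into text runs.
    p <- gsub("<w:tab/>", "<w:t>\t</w:t>", p, fixed = TRUE)
    # Visible text sits inside <w:t> elements.
    runs <- regmatches(p, gregexpr("<w:t(?: [^>]*)?>[^<]*</w:t>", p, perl = TRUE))[[1]]
    txt <- paste(gsub("<[^>]+>", "", runs), collapse = "")
    # Decode the basic XML entities, with &amp; last.
    txt <- gsub("&lt;", "<", txt, fixed = TRUE)
    txt <- gsub("&gt;", ">", txt, fixed = TRUE)
    txt <- gsub("&quot;", "\"", txt, fixed = TRUE)
    txt <- gsub("&apos;", "'", txt, fixed = TRUE)
    gsub("&amp;", "&", txt, fixed = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: read word/document.xml straight out of the .docx archive.
read_docx_text <- function(path) {
  con <- unz(path, "word/document.xml")
  on.exit(close(con))
  xml <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
  docx_xml_to_text(xml)
}

# Usage (the path is the one from the question):
# read_docx_text("c:/users/mark w miller/simple R programs/test_for_r.docx")
```

A proper XML parser would be more robust than regular expressions; the sketch only shows that the text is reachable without Word or Antiword.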

The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.

Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.

neilfws
  • 3
    I think this is a reasonable answer, though I think the final sentence is important enough that I'd have put it first. – Glen_b Jun 20 '12 at 07:11
  • 3
    I would rephrase the last sentence to: "Word and PDF are _not_ appropriate formats for storing anything. Ever." Microsoft is infamous for releasing versions of Office that can't read older file formats (Excel4.0, anyone?), and PDF is butt-tugly. ASCII and epub (which is just zipped XML) are far better choices. – Carl Witthoft Jun 20 '12 at 11:32
  • Since this very old answer received recent attention, I'd point out that...it's very old and there may be alternatives now. For example, Word `.docx` is basically a zipped folder of XML files, so could probably be processed using tools for XML. – neilfws Jun 26 '19 at 22:24
7

In case it helps anyone else: there now appears to be a package, readtext, dedicated specifically to reading text data, including Word files (also the newer .docx format). See the vignette at https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.
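A minimal sketch of how this might look for the file in the question, assuming the readtext package is installed and that the extracted text keeps one table row per line (not verified for every document). `text_to_table()` is a hypothetical helper; `readtext::readtext()` returns a data frame with `doc_id` and `text` columns.

```r
# Hypothetical helper: parse whitespace-separated lines of text into a data frame.
text_to_table <- function(txt) {
  read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)
}

# Assumes the readtext package is installed and that the path below
# (taken from the question) exists; the block is skipped otherwise.
docx_path <- "c:/users/mark w miller/simple R programs/test_for_r.docx"
if (requireNamespace("readtext", quietly = TRUE) && file.exists(docx_path)) {
  doc <- readtext::readtext(docx_path)  # data frame with doc_id and text columns
  print(text_to_table(doc$text))
}
```

For the scanned tables described in the question, the parsing step would likely need manual clean-up first, since read.table() stops on rows with an unexpected number of fields.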

Amit Kohli
1

I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.

  1. I converted a pdf to MSWord with Acrobat X Pro

  2. The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from an MSWord file before creating a text file.

  3. I converted the MSWord file to a text file after deleting the vertical lines in Step 2.

  4. Resulting text files still require extensive editing, but at least the data are largely present in a format R can read and I will not have to re-enter all data in the pdfs by hand, saving many hours of work.
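Once the tables are in text files, R can help locate the rows that still need editing. The sketch below is one possible approach, assuming whitespace-separated columns and a known number of columns per table; `find_bad_rows()` is a hypothetical helper built on `count.fields()` from base R.

```r
# Hypothetical helper: return the line numbers whose field count differs
# from the number of columns the table is expected to have.
find_bad_rows <- function(path, expected_fields) {
  n <- count.fields(path, sep = "", quote = "", comment.char = "",
                    blank.lines.skip = FALSE)
  which(is.na(n) | n != expected_fields)
}

# Usage with a small made-up file in which the second row has lost a column:
tmp <- tempfile(fileext = ".txt")
writeLines(c("A 20 1000 AA", "B 30 1001", "C 10 1500 CC"), tmp)
bad <- find_bad_rows(tmp, expected_fields = 4)
print(bad)
```

The flagged lines can then be corrected by hand before calling read.table() on the whole file.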

Mark Miller
  • 2
    An even better alternative: make one of your grad students do it for you. Of course, this only works if you're the professor and not the student :-) – Carl Witthoft Jun 20 '12 at 11:34
1

You can do this very easily with RDCOMClient. That said, some characters may not be read in correctly.

library(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE

# Define the file we want to open; normalizePath() turns it into a native
# Windows path, as suggested in the comments below
wordFileName <- normalizePath("c:/path/to/word/doc.docx")
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())

Khaynes
  • When I try this code I receive an error `Exception occurred` and `object doc not found`. `setwd('C:/Users/markm/simple R programs'); require(RDCOMClient); wordApp <- COMCreate("Word.Application"); wordApp[["Visible"]] <- TRUE; wordFileName <- "C:/Users/markm/simple R programs/My_test_MSWord_file.docx"; doc <- wordApp[["Documents"]]$Open(wordFileName); print(doc$range()$text());` – Mark Miller Nov 25 '16 at 09:13
  • Mark, you sure you got the file location right? I can only emulate the issue by defining an invalid file location. – Khaynes Dec 19 '16 at 23:44
  • try enclosing the path in normalizePath e.g. `wordFileName <- normalizePath("C:/Users/markm/simple R programs/My_test_MSWord_file.docx");` – user3357059 Feb 10 '20 at 19:15