
I have a large set of .doc files which give the variables available in a set of corresponding datasets. I would like to scan through these in R and see which datasets contain a variable of interest. I have done this before on plain text files using readLines but this does not work on .doc files.

I have downloaded the tm package which should be able to read .doc files using the readDOC command, but the instructions are quite limited and I can't get it to work. Does anyone know how to use the readDOC command or have another suggestion for how to do this in R? Thanks!

Thank you very much everyone for the replies and suggestions. I thought R might be set up to read in .doc files quite easily, but from what you say I think the easiest thing is to convert all the Word files to another format first. I've just downloaded some free software called 'Convert Doc': I stored all the Word documents in one folder and it converted them all to .txt files very quickly. Now I can automate the searching. I have around 100 data files with accompanying Word documents that specify the variable coding, which is not always the same in each data file (e.g. for yes/no, some use 0/1, others use 1/2), so this lets me find the right variable and store its coding using readLines, grep and a bit more text processing. Thanks!
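
For anyone following the same route, the search over the converted files might look roughly like the sketch below. It uses only base R; the helper name `find_variable` and the folder layout (one folder of .txt files) are assumptions for illustration, not the exact code used above.

```r
# Hypothetical helper: for every .txt file in `dir`, return the lines that
# mention `variable`. Files without a match are dropped from the result.
find_variable <- function(dir, variable) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    # fixed = TRUE treats `variable` as a literal string, not a regex
    grep(variable, lines, fixed = TRUE, value = TRUE)
  })
  names(hits) <- basename(files)
  hits[lengths(hits) > 0]
}
```

Calling `find_variable("codebooks", "smoker")` would return a named list with one entry per file that mentions "smoker", each entry holding the matching lines, which is where the coding (0/1 or 1/2) can then be picked out.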

Lucy Okell
  • Do you have antiword installed? Also, show us the code you're using. – Thomas Oct 21 '13 at 10:48
  • Why do you want to use `R` when a simple Word macro could do this just fine? But in any case, you need to give us an example of what constitutes "a variable of interest" so we can suggest ways to detect them. If it's just a word (i.e. character string) there are dozens of simple ways to do so, such as running FileLocatorLite http://www.mythicsoft.com/filelocatorlite on your directory full of Word files. – Carl Witthoft Oct 21 '13 at 11:30
  • Just making sure, .doc or .docx? – Tyler Rinker Oct 21 '13 at 12:56

2 Answers

Your strategy depends upon what you want to do with the documents, and how important the structure of the document is.

If structure is important, then you could convert the Word documents to HTML and then extract the relevant portions using the XML package. If structure isn't important, then converting them to plain text and importing them with readLines (as you have done previously) is probably the better option.
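
As a rough sketch of the HTML route, assuming the documents have already been saved as HTML and that the variable descriptions sit in tables (both assumptions on my part), something like this pulls out the cell text with the XML package. The helper name `table_cells` is made up for illustration.

```r
# Assumes the XML package is installed: install.packages("XML")
# Hypothetical helper: return the text of every table cell in an HTML document.
# `html` is a file path, or the HTML itself when as_text = TRUE.
table_cells <- function(html, as_text = FALSE) {
  doc <- XML::htmlParse(html, asText = as_text)
  # The XPath expression selects every <td> nested inside a <table>
  XML::xpathSApply(doc, "//table//td", XML::xmlValue)
}
```

The XPath expression is the part to adapt: HTML saved from Word tends to be messy, so you may need to inspect one file first to see where the variable names actually end up.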

That first conversion step is going to be the tricky part. You can do this manually by opening each file in Word and choosing "Save as", which is the easiest technique for a small number of files.

In R, you'll probably have to do something involving a COM connection via the RDCOMClient package. This is often fiddly.
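
A minimal sketch of what that might look like, assuming Windows, an installed copy of Microsoft Word, and the RDCOMClient package. The function name `doc_to_txt` is hypothetical, and I have not tested this against every Word version, so treat it as a starting point only.

```r
# Sketch only: needs Windows, Microsoft Word, and the RDCOMClient package.
# Hypothetical helper: open a .doc file in Word and resave it as plain text.
doc_to_txt <- function(doc_path, txt_path) {
  if (!requireNamespace("RDCOMClient", quietly = TRUE)) {
    stop("RDCOMClient is not installed")
  }
  word <- RDCOMClient::COMCreate("Word.Application")
  on.exit(word$Quit(), add = TRUE)  # always shut Word down again
  # Word resolves relative paths against its own working directory,
  # so pass absolute paths
  doc <- word[["Documents"]]$Open(normalizePath(doc_path))
  out <- normalizePath(txt_path, mustWork = FALSE)
  doc$SaveAs(out, FileFormat = 2)  # 2 is Word's wdFormatText constant
  doc$Close()
  invisible(out)
}
```

Looping this over `list.files(pattern = "\\.doc$")` would convert a whole folder, but starting and stopping Word for each file is slow, which is part of why this approach is fiddly.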

As much as I hate to suggest using VBScript for anything, it's probably a lot better for this task than R, so consider doing the resaving in that language.

Richie Cotton
  • A few years on, it's better to use the docxtractr and readtext packages. Something like `raw_text <- readtext(input_file_location)$text` for the text and, for tables, `word_extracted <- read_docx(input_file_location)` followed by `tables <- docx_extract_all(word_extracted, guess_header = TRUE, preserve = FALSE, trim = TRUE)`. – kwadratens Dec 09 '22 at 18:22

Try the read_docx function from the qdapTools package.
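
Something along these lines, assuming qdapTools is installed and the file is a .docx (the function reads the newer XML-based format, not the old binary .doc). The wrapper name `docx_lines_matching` is just for illustration.

```r
# Assumes the qdapTools package is installed: install.packages("qdapTools")
# Hypothetical wrapper: read a .docx file and keep the paragraphs that
# mention `variable`.
docx_lines_matching <- function(path, variable) {
  # read_docx expects the XML-based .docx format, so reject anything else
  stopifnot(grepl("\\.docx$", path, ignore.case = TRUE))
  text <- qdapTools::read_docx(path)
  grep(variable, text, fixed = TRUE, value = TRUE)
}
```

Files still in the older .doc format would need to be converted before this can read them.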

Miha
user3357059
  • This does not work for .doc files: Error: XML content does not seem to be XML: 'C:\Users\uzytkownik\AppData\Local\Temp\RtmpURvLDA\file27d454013cc8/word/document.xml' – Mikołaj Apr 20 '19 at 14:26
  • Try converting the .doc file to PDF (see https://stackoverflow.com/questions/49113503/how-to-convert-docx-to-pdf-in-r), then read the resulting PDF file. – user3357059 Apr 30 '19 at 15:29