
I'm trying to convert DOC or DOCX files to plain text (TXT), ensuring that all formats and styles are ignored and encodings render properly, and that no manual pre-processing needs to be done by the user.

The officer package gets me most of the way. The following code yields a TXT file free from junk characters, and without any text indicating header styles etc:

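doc <- officer::read_docx("my_doc.docx")
content <- officer::docx_summary(doc)
# Open the connection explicitly so it can be closed after writing
con <- file("textout.txt", encoding = "UTF-8")
writeLines(content$text, con)
close(con)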

However, this output shows complete field codes. For example, a date in the input file is rendering as:

"DATE \@ "d MMMM yyyy" 17 July 2019"

And the Table of Contents object is omitted entirely.

Again, I cannot do any manual pre-processing unless it's automatable with code! I'm aware that I can unlink all the field codes, but unless there's an automated way of doing this at the command line or in R only, that's not an option.

As an alternative, using pandoc leads to text that fixes the field code problem:

rmarkdown::pandoc_convert(doc_file, to="plain", from="docx")

But the encodings aren't right. Examples:

"those with an affinityÂ"
"Station’s business model?Â"

Can somebody help me sort out a solution here? Personally, I'm happy to incorporate other tools, but an R only approach would be excellent.

mb21
tef2128
  • Perhaps print to PDF [https://superuser.com/questions/393118/how-to-convert-word-doc-to-pdf-from-windows-command-line] and then scrape that with Tabulizer? – Jon Spring Jul 18 '19 at 00:35
  • https://stackoverflow.com/a/33149947? https://word2md.com? https://gist.github.com/vzvenyach/7278543? – r2evans Jul 18 '19 at 05:48
  • are you sure you're viewing the output file as utf-8? it should "just work" see https://pandoc.org/MANUAL.html#character-encoding – mb21 Jul 18 '19 at 10:31
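
Following up on the last comment: pandoc writes UTF-8, so characters like "Â" and "â€™" usually mean the UTF-8 bytes were decoded with a different encoding (for example by a console or a reader using the system default). A minimal sketch of one way to rule that out, assuming pandoc is available to rmarkdown: write the result to a file and read it back with the encoding stated explicitly. The helper name `read_utf8_lines` and the file names are placeholders, not part of any package.

```r
# Hypothetical helper: read a text file and mark its lines as UTF-8.
read_utf8_lines <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

doc_file <- "my_doc.docx"                          # placeholder input path
out_file <- file.path(tempdir(), "textout.txt")    # absolute output path

# Only attempt the conversion when the input and pandoc are both present.
if (file.exists(doc_file) && rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(
    normalizePath(doc_file),
    from = "docx", to = "plain",
    output = out_file
  )
  txt <- read_utf8_lines(out_file)
  print(head(txt))
}
```

If the text looks right when read this way, the conversion itself is fine and the problem is in how the output was being viewed.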

1 Answer

You can consider the following approach, which drives Microsoft Word through COM (Windows only, with Word installed):

library(RDCOMClient)
# Start a Word instance through COM
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
path_To_Word_File <- "D:\\text.docx"
doc <- wordApp[["Documents"]]$Open(normalizePath(path_To_Word_File), ConfirmConversions = FALSE)
# FileFormat = 2 is wdFormatText (plain text)
doc$SaveAs2(FileName = "D:\\text.txt", FileFormat = 2)
# Close the document without saving again, then quit Word
doc$Close(FALSE)
wordApp$Quit()
Emmanuel Hamel