I'm trying to convert DOC or DOCX files to plain text (TXT), ensuring that all formatting and styles are ignored, that encodings render properly, and that no manual pre-processing is required from the user.
The officer package gets me most of the way. The following code yields a TXT file free of junk characters and without any text indicating heading styles, etc.:
doc <- officer::read_docx("my_doc.docx")
content <- officer::docx_summary(doc)
# open the connection explicitly so it gets closed again
con <- file("textout.txt", encoding = "UTF-8")
writeLines(content$text, con)
close(con)
However, this output retains the raw field codes. For example, a date in the input file renders as:
"DATE \@ "d MMMM yyyy" 17 July 2019"
And the Table of Contents object is omitted entirely.
Again, I cannot do any manual pre-processing unless it's automatable with code! I'm aware that I can unlink all the field codes in Word, but unless there's an automated way of doing this at the command line or in R only, that's not an option.
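To illustrate the kind of automated unlinking I'd accept: since a .docx file is just a zip archive, the field instructions could in principle be stripped from word/document.xml before the file is handed to officer. The sketch below is only an idea, not a tested solution; it assumes the xml2 and zip packages, and it only handles the w:instrText runs of complex fields (the cached field result is left in place), not every way OOXML can encode a field:

```
library(xml2)

# Unpack the .docx (a zip archive) into a temp directory
tmp <- tempfile()
unzip("my_doc.docx", exdir = tmp)

doc_xml <- file.path(tmp, "word", "document.xml")
x <- read_xml(doc_xml)

# Drop the field instruction text (e.g. 'DATE \@ "d MMMM yyyy"'),
# keeping the cached field result that follows it
xml_remove(xml_find_all(x, "//w:instrText"))

write_xml(x, doc_xml)

# Re-zip into a cleaned copy and run the officer pipeline on that
zip::zipr("my_doc_clean.docx", files = list.files(tmp, full.names = TRUE))
```

If something along these lines is robust enough in practice, it would satisfy my "automatable with code" requirement.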
As an alternative, using pandoc produces text with the field codes resolved:
rmarkdown::pandoc_convert("my_doc.docx", to = "plain", from = "docx", output = "textout.txt")
But the encodings aren't right. Examples:
"those with an affinityÂ"
"Station’s business model?Â"
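My guess (unconfirmed) is that these artefacts are mojibake: pandoc writes UTF-8, and a non-breaking space (UTF-8 bytes C2 A0) displayed in a Latin-1/Windows-1252 locale shows up as "Â ". If that diagnosis is right, something like the following sketch might clean the output, assuming the pandoc output landed in textout.txt:

```
# Read pandoc's output back with its encoding declared, so the
# UTF-8 bytes aren't re-interpreted in the native locale
txt <- readLines("textout.txt", encoding = "UTF-8")

# Optionally normalise non-breaking spaces to plain spaces
txt <- gsub("\u00a0", " ", txt)

con <- file("textout_clean.txt", encoding = "UTF-8")
writeLines(txt, con)
close(con)
```

But I may be misreading the symptom, so corrections welcome.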
Can somebody help me sort out a solution here? I'm happy to incorporate other tools, but an R-only approach would be ideal.