RTF / doc / docx text extraction in program written in C++/Qt

Question

I am writing a program in Qt/C++, and I need to read text from Microsoft Word (.doc), RTF, and .docx files.

I am looking for a command-line program that can do this extraction. A combination of several programs would also be fine.

The closest thing I found is DocToText, but it has several bugs, so I can't use it. I also have Microsoft Word installed on the PC. Maybe there is some way to read the text through it (I have no idea how to use COM)?

score 12 · Answer 1 · answered Aug 11 '09 at 05:35

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': Remove the tags themselves, keeping the text between them

grep -v '^[[:space:]]*$': Remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)
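
If you would rather not depend on grep and sed, the tag-stripping half of the pipeline can be done in the program itself. Below is a minimal sketch in plain C++, assuming the document XML has already been read into a string (for example from the output of unzip -p). The function name extractWordText is made up for illustration. It is a naive scanner, not a real XML parser: it does not decode entities such as &amp;, and it assumes tags never contain a '>' inside an attribute value.

```cpp
#include <string>

// Hypothetical helper: collects the text inside <w:t ...>...</w:t> elements
// and emits a newline at each </w:p> (end of a paragraph).
// Simplification: XML entities such as &amp; are copied through undecoded.
std::string extractWordText(const std::string& xml) {
    std::string out;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        std::size_t close = xml.find('>', pos);
        if (close == std::string::npos) break;  // malformed input: stop scanning
        // Tag content without the angle brackets, e.g. "w:t" or "/w:p".
        std::string tag = xml.substr(pos + 1, close - pos - 1);
        pos = close + 1;
        if (tag == "/w:p") {
            out += '\n';
        } else if (tag == "w:t" || tag.compare(0, 4, "w:t ") == 0) {
            // Exact name match, so <w:tab/> or <w:tbl> are not mistaken for text.
            std::size_t end = xml.find("</w:t>", pos);
            if (end == std::string::npos) break;
            out.append(xml, pos, end - pos);
            pos = end + 6;  // skip past "</w:t>"
        }
    }
    return out;
}
```

In a Qt program the input string could come from a QProcess running unzip -p file.docx word/document.xml (word/document.xml is the part that holds the main body text), with the result of readAllStandardOutput() converted to a std::string. Unlike the grep step, this matches the element name exactly and keeps paragraph breaks.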

This would work for docx files... depends on how well you want to know openxml sdk though... if you just want text without it being too complicated... this would work — jle, Nov 06 '09 at 03:56

score 4 · Answer 2 · answered Jul 26 '09 at 15:01

Try Apache Tika

raven


Exactly what I was looking for, can convert .doc and .docx to plain text. – Anssssss Feb 11 '15 at 18:02

vog · Answer 3 · 2009-07-26T14:53:13.623

0

I recommend against using COM, as it would defeat the purpose of using a portable library like Qt in the first place.

You might want to use the classic catdoc or a similar tool such as wvWare.

Note that although the catdoc author claims that catdoc doesn't work under Windows, there is a posting from 2001 which states the opposite.

edited Jul 26 '09 at 14:53

answered Jul 26 '09 at 14:48


score 0 · Answer 4 · answered Jul 26 '09 at 15:12

To read .doc files you can use the structured storage API. A .doc is basically a structured storage repository with various streams corresponding to the various parts of the document.
Be warned that it is quite a hairy API, and that even with it, a .doc file can be quite messy to look at.
Of course this is still Windows-only, but at least it's not COM, just a plain old C API.
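
Whichever extractor ends up handling each format, the program first has to decide which one to call. Structured storage files (also known as OLE compound files) begin with a fixed 8-byte signature, .docx files are ZIP archives, and RTF files start with "{\rtf", so the container type can be recognised from the first few bytes without any Windows API. The sketch below is one hedged way to do that in portable C++. The names DocFormat and detectFormat are invented for this example, and the check identifies only the container: other Office formats share the same signatures.

```cpp
#include <string>

// Hypothetical classification of the container format, not of the document type.
enum class DocFormat { Unknown, OleCompound, ZipContainer, Rtf };

// Classify a file from its leading bytes (pass at least the first 8 bytes).
// Caveat: OleCompound also covers .xls/.ppt, and ZipContainer covers any ZIP
// archive, so a positive match is a hint rather than proof of a Word file.
DocFormat detectFormat(const std::string& head) {
    static const std::string ole("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8);
    static const std::string zip("PK\x03\x04", 4);
    static const std::string rtf("{\\rtf", 5);

    // True when head begins with the given signature bytes.
    auto startsWith = [&head](const std::string& magic) {
        return head.compare(0, magic.size(), magic) == 0;
    };

    if (startsWith(ole)) return DocFormat::OleCompound;   // legacy .doc
    if (startsWith(zip)) return DocFormat::ZipContainer;  // .docx and other ZIPs
    if (startsWith(rtf)) return DocFormat::Rtf;
    return DocFormat::Unknown;
}
```

The result can then select the external tool to spawn, for instance a .doc converter for OleCompound and an unzip-based path for ZipContainer. Checking the content rather than the file extension also copes with files that have been renamed.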

shoosh


I am trying to do this in a platform-independent way. I think there are several programs out there that do these things, but they still need to be found. Thanks anyway. – Night Walker Jul 26 '09 at 15:36

Beached · Answer 5 · 2009-11-06T04:01:21.913

0

This might help. It is cross-platform and has an API: http://www.winfield.demon.nl/

Otherwise, if this is Windows-only, the IFilter approach is the way to go. It lets you parse any format that has an IFilter installed on your system. Here are some examples: http://the-lazy-programmer.com/blog/?p=8 . I have used IFilter from the C# side of things quite a bit.

edited Nov 06 '09 at 04:01

answered Nov 06 '09 at 03:52


Also, you can try http://wvware.sourceforge.net for wvLib. It is used by AbiWord. – Beached Nov 06 '09 at 04:04
