7

Is there any pure C++ library to extract plain text from a .doc file?

I'm developing a C++ program to read .doc and .pdf files. I have to extract plain text from the file and write it into a .txt file.

Evorlor
ilango j

4 Answers

3

You could have a look at wv, the open-source C library used by AbiWord.

You can also call out to a batch conversion tool.
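
As a rough sketch of the "call out to a converter" route, the snippet below builds a shell command and runs it with `std::system()`. It assumes a POSIX shell and that a command-line converter such as `antiword` (which prints the text of a .doc file to standard output) is installed and on the PATH. The helper names `shell_quote`, `build_convert_command`, and `convert_doc_to_text` are made up for illustration.

```cpp
#include <cstdlib>
#include <string>

// Quote a path for a POSIX shell: wrap it in single quotes and
// escape any embedded single quote as '\''.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') {
            out += "'\\''";
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build the command line. "antiword" is an assumed converter, not
// something the answer names; swap in whichever tool you have.
std::string build_convert_command(const std::string& doc_path,
                                  const std::string& txt_path) {
    return "antiword " + shell_quote(doc_path) + " > " + shell_quote(txt_path);
}

// Run the converter and treat a zero status from the shell as success.
bool convert_doc_to_text(const std::string& doc_path,
                         const std::string& txt_path) {
    const std::string cmd = build_convert_command(doc_path, txt_path);
    return std::system(cmd.c_str()) == 0;
}
```

Calling `convert_doc_to_text("report.doc", "report.txt")` would then write the extracted text to `report.txt`, provided the converter is available. The quoting step matters because file names containing spaces or quotes would otherwise break the command.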

gnud
  • I'm not using VC++. I have to implement this in pure C++ (e.g. in Eclipse CDT). – ilango j Nov 24 '11 at 08:26
  • The three batch tools could be used by calling an external program. That doesn't depend on any specific C compiler. The C library `wv` compiles just fine with GCC, and is used in the cross-platform AbiWord. I don't really understand why you think the compiler you use matters. – gnud Nov 24 '11 at 08:31
  • Can you explain how to implement these batch tools in C++? – ilango j Nov 24 '11 at 08:41
  • You don't implement them, you call them. You invoke the program from your program. The simplest way is to use `system()`. – gnud Nov 24 '11 at 08:51
  • Thank you. But I don't want to run other applications in my program. I need code or a library to extract a plain text string from a .doc file. – ilango j Nov 24 '11 at 09:25
  • Then you should look at the C library I mentioned first, wv. – gnud Nov 24 '11 at 11:09
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/5306/discussion-between-ilango-j-and-gnud) – ilango j Nov 24 '11 at 11:17
1

If you want to manipulate/read .doc files, you can take the time to learn the format and work with the .doc file directly. The format specification (a PDF file) is linked from the MSDN page.
I admit, it's quite a bit of reading to do, but if you're looking to create software to manipulate/read files, you should have the relevant underlying knowledge to back it all up.

The same goes for the PDF format (which is an open format, so its specification should be easy to find).
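
As a first step when parsing either format by hand, here is a minimal sketch that identifies a file by its leading bytes. It relies on two facts from the specifications: a binary .doc file is stored in an OLE2 Compound File container, which starts with the 8-byte signature `D0 CF 11 E0 A1 B1 1A E1`, and a PDF file starts with `%PDF-`. The names `FileKind`, `read_header`, and `sniff_kind` are invented for this example. Note that the compound-file signature identifies the container only, so other legacy Office files (.xls, .ppt) match it too.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

enum class FileKind { CompoundFile, Pdf, Unknown };

// Read up to n bytes from the start of a file; returns fewer bytes
// (possibly none) if the file is short or cannot be opened.
std::string read_header(const std::string& path, std::size_t n = 8) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(n, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Classify a file from its first bytes.
FileKind sniff_kind(const std::string& header) {
    // Signature of the OLE2 Compound File container used by .doc.
    const std::string cfb_magic("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8);
    const std::string pdf_magic("%PDF-");

    if (header.compare(0, cfb_magic.size(), cfb_magic) == 0) {
        return FileKind::CompoundFile;
    }
    if (header.compare(0, pdf_magic.size(), pdf_magic) == 0) {
        return FileKind::Pdf;
    }
    return FileKind::Unknown;
}
```

`sniff_kind(read_header("input.doc"))` would tell you which parser to dispatch to. The real work, walking the compound file's streams or decoding PDF content streams, starts after this check.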

Neowizard
  • I tried to do this for PDF. I wrote my own simple PDF parser to extract text, attachments and images. Writing the initial parser was easy, but the PDF streams that make up the file can be encoded in a long list of encodings with lots of parameters. This was much more work than is sane, and I stopped at this point. – One Man Monkey Squad Jun 29 '20 at 10:44
1

For doc - Use the Word object model to get to the document and extract the text. This example uses OLE Automation and C. Another link for DOCX that might help you.

For PDF - Use Haru.

Sujay Ghosh
  • @jmsu - are you sure about that? The documentation says "Document Handle (HPDF_Doc) - The document handle is a handle to operate on a document object." Though I have not worked with Haru, common sense makes me think that once you get the document handle, you can read the document. – Sujay Ghosh Dec 01 '11 at 09:39
1

You could always use OIVT (Outside In Viewer Technology, I think), now owned by Oracle.

I'll be honest, it's not a cheap solution. This product is meant to let you view, print, etc., but if I remember correctly, they offer an option to extract the content to text, or they have another product that does that. It can do this for pretty much any document type, including doc, docx and pdf (just to name a few), without needing the "original" application installed, as they have their own set of filters.

Here's a link to get you started

Outside In Viewer Technology

Good luck

Thierry
  • Outside In Technology also provides a few other ways to extract text from documents: Content Access, and Search Export's SearchText mode. – Dave Newman Mar 05 '16 at 22:46