Is there any pure C++ library to extract plain text from a .doc file?
I'm developing a C++ program to read .doc and .pdf files. I have to extract plain text from the file and write it into a .txt file.
You could have a look at the open source C library used by Abiword, wv.
You can also call out to a batch conversion tool.
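As a rough sketch of what calling out to a converter could look like: the answer does not name a tool, so this assumes the command-line programs antiword (for .doc) and pdftotext (for .pdf) are installed and on the PATH, and that the program runs under a POSIX shell. The helper names `shell_quote`, `build_command` and `convert_to_text` are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Wrap a path in single quotes for a POSIX shell, escaping embedded quotes.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Build the converter command line.
// Assumption: antiword prints the text to stdout, pdftotext takes an output file.
std::string build_command(const std::string& input, const std::string& output) {
    assert(!input.empty() && !output.empty());
    const bool is_pdf = input.size() >= 4 &&
                        input.compare(input.size() - 4, 4, ".pdf") == 0;
    if (is_pdf) {
        return "pdftotext " + shell_quote(input) + " " + shell_quote(output);
    }
    return "antiword " + shell_quote(input) + " > " + shell_quote(output);
}

// Run the converter. On POSIX systems a return value of 0 from std::system
// means the command exited successfully.
bool convert_to_text(const std::string& input, const std::string& output) {
    return std::system(build_command(input, output).c_str()) == 0;
}
```

Calling `convert_to_text("report.doc", "report.txt")` would then write the extracted text next to the original, provided the tool is actually available. The extension check is case-sensitive and deliberately simple.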
If you want to manipulate/read .doc files, you can take the time to learn the format and parse the .doc file yourself. The format specification is available from MSDN (as a PDF file).
I admit, it's quite a bit of reading to do, but if you're looking to create software to manipulate/read files, you should have the relevant underlying knowledge to back it all up.
The same goes for the PDF format (which is an open format, so its specification should be easy to find).
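If you do go the route of reading the formats yourself, a first step might be telling the two file types apart by their leading bytes rather than by extension. This is only a minimal sketch: legacy .doc files are stored in an OLE compound file container, whose header starts with the bytes D0 CF 11 E0 A1 B1 1A E1, and PDF files start with "%PDF-". The names `FileKind`, `sniff` and `read_head` are hypothetical. Note that the compound file signature is shared with other legacy Office formats (.xls, .ppt), so a match only says "possibly a .doc".

```cpp
#include <cassert>
#include <fstream>
#include <string>

enum class FileKind { OleCompound, Pdf, Unknown };

// Classify a file from its first few bytes.
FileKind sniff(const std::string& head) {
    // OLE compound file signature (container used by legacy .doc files).
    static const std::string cfb_magic("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8);
    if (head.compare(0, cfb_magic.size(), cfb_magic) == 0) {
        return FileKind::OleCompound;
    }
    if (head.compare(0, 5, "%PDF-") == 0) {
        return FileKind::Pdf;
    }
    return FileKind::Unknown;
}

// Read up to the first 8 bytes of a file; returns an empty string on failure.
std::string read_head(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string head(8, '\0');
    in.read(&head[0], static_cast<std::streamsize>(head.size()));
    head.resize(static_cast<std::size_t>(in.gcount()));
    return head;
}
```

`sniff(read_head(path))` gives a quick answer before any real parsing starts. Extracting the text itself still requires walking the structures described in the specifications.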
For doc - Use the Word object model to get to the document and extract the text. This example uses OLE Automation and C. Another link for DOCX that might help you.
For PDF - Use Haru.
You could always use OIVT (Outside In Viewer Technology, I think), now owned by Oracle.
I'll be honest, it's not a cheap solution, and this product is mainly meant to let you view, print, etc. But if I remember correctly, they do offer an option to extract the content to text, or they have another product that does that. It can do this for pretty much any document type, including doc, docx and pdf (just to name a few), without needing the "original" application installed, as they have their own set of filters.
Here's a link to get you started
Good luck