Reading char by char from MS Word

Question

In my program I need to read a PDF file character by character and put every word into a database. I wasn't sure whether that was possible, so I decided to convert the PDF file to an MS Word file with a converter and read from that file instead.

Now I still don't know how to read character by character from an MS Word file. I'm using C++/MFC in my program.

If you could give me some sample code it would help me a lot, and I'd be very thankful.

Word uses a proprietary format, unlike `.txt` or similar. Can't you automate a conversion from `.docx` (or whatever) to `.txt` and read that? I think that'd be the easiest solution. — Seb Holzapfel, Sep 10 '11 at 09:06
It's trivial to read out of Word with automation, but it should also be easy enough with PDF — David Heffernan, Sep 10 '11 at 21:17

score 0 · Answer 1 · answered Sep 10 '11 at 09:04

Check out IFilter. http://msdn.microsoft.com/en-us/library/ms691105%28v=vs.85%29.aspx

It's a COM interface for extracting text from files (each file extension has its own DLL, which COM returns according to what you need).

An example in C#: http://www.codeproject.com/KB/cs/IFilter.aspx, or http://www.codeproject.com/KB/string/pdf2text.aspx (I've used it in native C++, but I don't have a code example...).

Note that for PDF you might need to download the PDF IFilter: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611

Good Luck!

thanks... I'm still researching and reading about the clues that you gave to me. I think this is the way... — Mohsen, Sep 10 '11 at 20:49

score 0 · Answer 2 · edited May 23 '17 at 12:12

If you can convert the source file and you only need the characters, then make it a plain text file and read it using std::ifstream.

To get more sophisticated information from an MS Word file, you should use Office Automation. There are good links in the answers to the following question:

Creating, opening and printing a word file from C++

answered Sep 10 '11 at 09:19 · Don Reba

Thank you... Yes, maybe this will solve the problem in the end, but my program must record the exact page number where each word occurs, plus other information. If I convert to plain text this information will be lost, but maybe in the end I'll do it the way you said. – Mohsen Sep 10 '11 at 20:47
If you export to text directly from PDF, it should keep page numbers. Recognizing them might be a bit error-prone, and if you need more information, you might be better off using Office Automation. – Don Reba Sep 10 '11 at 21:04
