
In my program I need to read a PDF file character by character and store every word in a database. I wasn't sure whether that was possible, so I decided to convert the PDF to an MS Word file with a converter and read from that file instead.

I still don't know how to read an MS Word file character by character. I'm using C++/MFC in my program.

Sample code would help me a lot, and I'd be very grateful.

Mohsen
  • Word uses a proprietary format, unlike `.txt` or similar. Can't you automate a conversion from `.docx` (or whatever) to `.txt` and read that? I think that'd be the easiest solution. – Seb Holzapfel Sep 10 '11 at 09:06
  • It's trivial to read text out of Word with automation, but it should also be easy enough with PDF. – David Heffernan Sep 10 '11 at 21:17

2 Answers

Check out IFilter. http://msdn.microsoft.com/en-us/library/ms691105%28v=vs.85%29.aspx

It's a COM interface for extracting text from files. Each file extension has its own DLL, which COM returns according to the file type you need.

An example in C#: http://www.codeproject.com/KB/cs/IFilter.aspx, or http://www.codeproject.com/KB/string/pdf2text.aspx (I've used it in native C++, but I don't have a code example to share).

Note that for PDF you might need to download the PDF IFilter: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611

Good Luck!

TCS
  • Thanks... I'm still researching and reading up on the leads you gave me. I think this is the way... – Mohsen Sep 10 '11 at 20:49

If you can convert the source file and you only need the characters, then convert it to a plain text file and read it with `std::ifstream`.

To get more sophisticated information from an MS Word file, you should use Office Automation. There are good links in the answers to the following question:

Creating, opening and printing a word file from C++

Don Reba
  • Thank you... Yes, maybe this will solve the problem in the end, but my program must record the exact page number where each word occurs, along with other information. If I convert to plain text, this information will be lost. Still, I may end up doing it the way you suggest. – Mohsen Sep 10 '11 at 20:47
  • If you export to text directly from PDF, it should keep page numbers. Recognizing them might be a bit error-prone, and if you need more information, you might be better off using Office Automation. – Don Reba Sep 10 '11 at 21:04