Converting pdf to text

Question

I need to create a C# or C++ (MFC) application that converts PDF files to plain text. Besides converting, it has to remove headers, footers, some garbage characters on the left margin, etc. So the application should let the user set page margins to cut off whatever is not needed. I have actually already created such an application using xpdf, but it gives me problems when I try to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?

Thanks.

there are plenty of libraries out there that do what you describe BUT the ones I tried and succeeded with were all commercial... if you want I could post some links... usually they come with source code samples... is that an option ? — Yahia, Sep 14 '11 at 19:07
I am not sure, I would have to discuss it with my manager... And the price matters, too. Of course, I would prefer free stuff :) — dpreznik, Sep 14 '11 at 19:16
Well good luck with that task. I tried to build an application to extract certain strings from pdfs. I was also more or less successful, until I stumbled upon pdfs generated by Adobe Acrobat - these do not really keep to the Specification published by Adobe at that time. — arne, Sep 15 '11 at 05:29
If anyone is still interested in this question, please refer to this post: [C# Pdf to Text](https://stackoverflow.com/a/64204097/4273717) — Rajitha Kithuldeniya, Oct 05 '20 at 07:30

score 1 · Answer 1 · answered Sep 14 '11 at 18:43

There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.

A public version of the PDF specification can be found here: Adobe PDF Specification

Shareware PDF readers can be found here: PDF Reader source code @ SourceForge

Thomas Matthews


Thank you for your answer. But I would need something more specific. I don't see how to fetch the code, plus also I don't see anything written in C# or C++ that does what I need. – dpreznik Sep 14 '11 at 18:55

score 0 · Answer 2 · answered Feb 13 '15 at 21:46

Please look at PoDoFo. It's an LGPL-licensed library with many powerful editing features. One of its example tools, podofotxtextract IIRC (txt2pdf goes the other way, from text to PDF), is a good start: it shows basic text extraction. From there you can check whether pre-filtering (in the PDF engine) or post-filtering (on the extracted text) suffices for your goals. I haven't used PDF Hummus, but it's supposed to have these capabilities too, although it's less straightforward.
