I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".

Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.

Are there examples of how to do it?

Thanks!

I Z

3 Answers

Extracting text from a PDF file with PDFsharp is not a simple task.

It was discussed recently in this thread: https://stackoverflow.com/a/9161732/162529

Community

Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:

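using PdfSharp.Pdf.IO; // PdfReader lives in this namespace
var doc = PdfReader.Open(docPath);
// Raw content stream of the first content element on the first page
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();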

doc.Pages.Count gives you the total number of pages, and you access each page by index through doc.Pages. I don't recommend using foreach and LINQ here, as the interfaces aren't implemented well. The index passed into GetDictionary selects which content element of the page to read; this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.

The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
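
As a rough illustration of the string searching mentioned above, here is a minimal sketch that pulls literal strings out of a raw content stream by looking for the `(...) Tj` pattern. The names ContentStreamText and ExtractLiteralStrings are made up for this example and are not part of PDFsharp. It assumes the text is stored as literal strings in parentheses; hex-encoded strings (`<...> Tj`), `TJ` arrays, and unescaped nested parentheses are not handled, so it will miss text in many real files.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
static class ContentStreamText
{
    // Matches a literal string operand followed by the Tj (show text) operator,
    // e.g. "(Hello, World!) Tj". Backslash escapes inside the parentheses are allowed.
    private static readonly Regex LiteralTj =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Singleline);

    public static List<string> ExtractLiteralStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in LiteralTj.Matches(contentStream))
        {
            // Undo only the simplest escapes: \( \) and \\.
            // Octal and other escape sequences are left untouched.
            string raw = m.Groups[1].Value;
            result.Add(Regex.Replace(raw, @"\\([()\\])", "$1"));
        }
        return result;
    }
}
```

You would call it on the string returned by Stream.ToString(), for example `ContentStreamText.ExtractLiteralStrings(pageText)`, and then search the returned list for the strings you care about.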

Mason
  • "The text that this produces will be full of various PDF formatting codes." Or with other words: it is easy to get something that is not easy to decipher to get the real text on the page. I have seen PDF2DOC converters that work fine with some PDF files, but fail miserably with others. – I liked the old Stack Overflow Feb 23 '16 at 23:16
  • Yes, in case it wasn't clear enough - it's very easy to extract bits of text for various types of analysis in this way. It's not at all easy to make sense of the overall formatting of the page and display it on-screen or change the layout. – Mason Feb 25 '16 at 04:52
  • Drawing the text "Hello, World!" can look like `240.2734 427.6833 Td (Hello, World!) Tj` or it can look like `240.2734 427.6833 Td <002B0048004F004F0052000F0003003A00520055004F00470004> Tj` or a bit different. Easy implementations will work with some files, but will fail with other files. – I liked the old Stack Overflow Feb 25 '16 at 19:31
  • I am using the PDFsharp library. It says the PdfReader class is not found. What could be the problem? Here is the link to my file – Sudarshan Taparia Aug 31 '16 at 13:34
  • @SudarshanTaparia You should ask that as a new question, you can paste your code in better there. – Mason Aug 31 '16 at 14:02
  • @SudarshanTaparia - PdfSharp.Pdf.IO.PdfReader – Rusty Nail Jan 13 '17 at 20:03
Example of PDFSharp libraries extracting images from .pdf file:

link

library

EDIT:

Then, if you want to extract text from an image, you have to use an OCR library.

There are two good OCR options: tessnet and MODI.
Link to thread on stack
But I can fully recommend MODI, which I am using now. Some sample @ codeproject.

EDIT 2 :

If you don't want to read text from the extracted images, you should write a new PDF document and put all of the images into it. For writing PDFs I use MigraDoc. That library is not difficult to use.

Community
Mariusz
  • I have looked at that example, but I am not sure if it has all the pieces that I need. It looks for "pictures" in the document. I also need to preserve rendering of the text in the image form, I just don't want to have the text behind the image. In other words, I want the output to look exactly like the input but I want to disable the ability to save the text from the output. – I Z Mar 06 '12 at 21:25
  • So as I understand now, you want to read text from images and plain text from pdf? And put them together as what? – Mariusz Mar 06 '12 at 21:29
  • Input PDF can be image-only or image + text behind the image. So I need to take the input and make an image-only PDF out of it. In other words, I want to export all the non-text components of the input PDF into the output PDF and not export the text components. – I Z Mar 06 '12 at 21:49
  • So if you want to export all images from the PDF, you have to use PDFSharp (with the example from my answer). Then you can put them into a new PDF with [MigraDoc](http://www.pdfsharp.net/wiki/Images-sample.ashx), for example. Would that answer your question? – Mariusz Mar 06 '12 at 21:52
  • The image extraction example only extracts "picture" images; it does not save any sort of pictorial representation of the text. This is why I said that it did not seem to have all the pieces that I need. It seems that what I need to do (but I may be wrong, since I have limited knowledge of the PDF format, which is quite complex) is create a Document object from the original PDF and then somehow remove, or replace with empty text, all the text objects in the Document. However, I need to do it in such a way that I preserve the image representation of that text. Makes sense? – I Z Mar 06 '12 at 23:22