3

I am working on a project that needs some functionality involving PDF files.

I want to read the text of a PDF file in my C#/.NET project.

Does anyone know how to do this?

user990423
amit patel

5 Answers

3

Have a look at the following links:

How to read pdf files using C# .NET

and

Reading PDF in C#

Hopefully they can point you in the right direction.

Community
1

Perhaps PDFlib can be used.

From the PDFlib homepage:

PDFlib TET PDF IFilter (Enterprise PDF Search on Windows) extracts text and metadata from PDF documents and makes it available to search and retrieval software on Windows.

Niels
1

Try this library; it is very easy to use and does exactly what you need:

http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET

Alex
1

I would suggest using the getText() method of PDFTextStripper (from PDFBox). To implement this, have a look at the following URLs:

http://naspinski.net/post/ParsingReading-a-PDF-file-with-C-and-AspNet-to-text.aspx

http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

0

Short answer: no, unless you are generating the PDF yourself and doing it correctly.

PDF files are generated in a manner similar to what is sent to a printer. Not all text in them is readable, and the information about the text can be stored arbitrarily. Some programs may also save the text as vector outlines or as bitmaps.
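To illustrate why the text can be hard to recover, here is a minimal sketch of what a page's content stream can look like and a naive way to pull strings out of it. The sample stream, the class name `ContentStreamDemo` and the helper `ExtractShownStrings` are made up for this example. It assumes an uncompressed stream that only uses the simple `(...) Tj` show-text operator with plain ASCII strings; real files are usually compressed and use font-specific encodings, so this is not a working extractor.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamDemo
{
    // Hypothetical helper: collects the literal strings drawn by "Tj" operators.
    // Escapes, nested parentheses, TJ arrays and hex strings are NOT handled.
    public static List<string> ExtractShownStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(contentStream, @"\(([^()\\]*)\)\s*Tj"))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Made-up stream: one visible word "Hello" is drawn as two fragments,
        // each positioned separately with Td, followed by a second line.
        string stream =
            "BT /F1 12 Tf 72 700 Td (Hel) Tj 20 0 Td (lo) Tj 0 -14 Td (World) Tj ET";

        foreach (string fragment in ExtractShownStrings(stream))
        {
            Console.WriteLine(fragment);
        }
    }
}
```

The fragments come out in drawing order ("Hel", "lo", "World"), with no information about words, spaces or lines other than the coordinates. A real extraction library has to rebuild the reading order from those positions, and if the page holds outlines or a scanned image instead of `Tj` strings there is nothing to find at all.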

linkerro
  • The links posted are definitely useful, but yes, as you correctly said, not all text can be read. I have a few PDFs which have 'vector text' in them; is there any library which reads those? – Sujit Singh Apr 22 '18 at 06:06
  • You would need to rasterize the PDF (turn it into images) and then use some OCR software to read the text from the images. This will not be very reliable and will probably not scale. In short, not really. – linkerro May 02 '18 at 12:37