I see many questions and answers about using C# to generate PDF files.
I have a related, but different task.

I have a large number of existing PDF files, and I would like to validate certain parts of their content with regular expressions (regexes). I want to open the PDFs in C# and read out the text in something approaching a linear fashion.

If headers, footers, sidebars, etc. get skipped or read out of order, it doesn't matter. I'm just after as much of the main-body text as I can retrieve.

Can you point me towards tools, libraries, APIs, etc. that will enable me to programmatically read text in PDF files?

abelenky
  • Thanks for all the wonderful answers. I will be attempting these packages soon, and hopefully accept a "best answer" shortly after that. – abelenky Mar 11 '10 at 20:36
  • Labeled as Not Constructive - but it sure helped me understand what is available! If it's not a good fit for Q&A format - where should this type of question be posted? – codeputer Oct 29 '13 at 00:39
  • I recommend that this be migrated to Software Recommendations. This is exactly the case for that site. This is a wonderful question that is and has been very helpful to lots of people, but it doesn't quite fit the format of SO. – demongolem Mar 11 '14 at 20:44
  • When this question was asked, 4 years ago, I don't think Software Recommendations even existed. – abelenky Mar 11 '14 at 20:55

5 Answers

I used PDFsharp as recently as last autumn and found it very easy to use compared with the alternatives. See the PDFsharp home page.

Will Marcouiller

I have successfully used two different libraries for this purpose. One is PDFBox (an Apache project), and the other is from Snowtide Informatics.

Both are Java libraries, but you can use them with .NET in combination with IKVM.

Nick
  • PDFxStream (née PDFTextStream) is also distributed as a .NET assembly (courtesy of IKVM as Nick mentions, though the distribution is precompiled to .DLLs, avoiding the runtime interpretation->compilation step when IKVM is used to consume Java libraries as-is). – cemerick Nov 06 '14 at 18:51

There is a library for .NET called PDF Clown.

There is also a nice article on CodeProject that details a few other libraries and approaches for reading PDF documents.

demongolem
Development 4.0

Here is another resource, a list of open-source PDF libraries for C#:

http://csharp-source.net/open-source/pdf-libraries

Joe Pitz

It looks like iTextSharp was a popular answer to "Reading PDF documents in .NET".
Also check out "Reading/Writing PDF files in Visual C# Windows Forms".
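
For what it's worth, here is a minimal sketch of what the iTextSharp route might look like for the task in the question. It assumes the iTextSharp 5.x package is referenced (the `iTextSharp.text.pdf.parser` namespace); the class name `PdfTextCheck` and the `INV-` pattern are made up for illustration, so substitute your own regex. Text order depends on how each PDF was produced, so treat the output as roughly linear at best.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;          // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;

class PdfTextCheck
{
    // Concatenates the text of every page, in page order, using
    // iTextSharp's default text extraction strategy.
    static string ExtractText(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            var text = new StringBuilder();
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical rule: the body must contain an ID such as "INV-2010-0042".
        var pattern = new Regex(@"INV-\d{4}-\d{4}");

        // Each command-line argument is treated as the path of a PDF to check.
        foreach (string path in args)
        {
            string text = ExtractText(path);
            Console.WriteLine("{0}: {1}", path,
                pattern.IsMatch(text) ? "match" : "no match");
        }
    }
}
```

Pass the PDF paths as command-line arguments. Scanned PDFs that contain only images will come back empty, since this reads embedded text and does no OCR.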

Community
SwDevMan81