4

Does anyone know of a PDF file parser that I could use to pull out sections of text from a plain-text PDF file? Specifically, I want a way to reliably pull out the text that belongs to annotations.

Delphi, C# or regex, I don't mind.

Toby Allen

6 Answers

5

The PDF File Parser article on xactpro seems to be exactly what you need. It explains the format of the PDF and comes with full source code for a parser (and another project for visualisation of the model).

The parser uses format-specific terms, but you could easily use the visualiser to learn what to look for.
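As a rough idea of "what to look for" when the file really is uncompressed plain text: an annotation is normally an indirect object whose dictionary contains `/Type /Annot`, with the visible text in a `/Contents (...)` literal string. Below is a minimal regex sketch in C# under those assumptions. `AnnotationScanner` and `ExtractContents` are hypothetical names made up for this illustration, not part of any library mentioned here. It will miss annotations that omit the optional `/Type` key, that store `/Contents` as a hex string, that use nested unescaped parentheses, or that live inside compressed object streams.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not taken from any PDF library.
public static class AnnotationScanner
{
    // One indirect object: "12 0 obj ... endobj" (body captured lazily).
    private static readonly Regex ObjectPattern = new Regex(
        @"\d+\s+\d+\s+obj\b(?<body>.*?)\bendobj",
        RegexOptions.Singleline);

    // "/Type /Annot" with optional whitespace; \b keeps "/Annots" from matching.
    private static readonly Regex AnnotPattern = new Regex(@"/Type\s*/Annot\b");

    // "/Contents (text)": backslash escapes allowed, nested parentheses not handled.
    private static readonly Regex ContentsPattern = new Regex(
        @"/Contents\s*\((?<text>(?:\\.|[^\\()])*)\)",
        RegexOptions.Singleline);

    // Returns the raw (still escaped) /Contents string of every annotation object found.
    public static List<string> ExtractContents(string pdfText)
    {
        var results = new List<string>();
        foreach (Match obj in ObjectPattern.Matches(pdfText))
        {
            string body = obj.Groups["body"].Value;
            if (!AnnotPattern.IsMatch(body))
            {
                continue; // not an annotation dictionary
            }

            Match contents = ContentsPattern.Match(body);
            if (contents.Success)
            {
                results.Add(contents.Groups["text"].Value);
            }
        }
        return results;
    }
}
```

You would call it with the file contents read as text, for example `AnnotationScanner.ExtractContents(File.ReadAllText(path))`. For anything beyond simple uncompressed files, a real parser is the safer route.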

Richard Szalay
2

You can also take a look at Xpdf (http://www.foolabs.com/xpdf/download.html)

Mihai Nita
1

Check out PDFBox.

Abhijith
1

Not sure if it supports the functionality you need, but we've been using ABCpdf with some success.

Jeremy
1

ABCpdf does let you extract annotations, and the help has a very good section on it. The code to handle them generally looks like this:

    for (int objectIndex = 0; objectIndex < theDoc.ObjectSoup.Count; objectIndex++)
    {
        try
        {
            IndirectObject element = theDoc.ObjectSoup.ElementAt(objectIndex);

            string elementType = element.GetType().ToString();
            switch (elementType)
            {
                case "WebSupergoo.ABCpdf8.Objects.Annotation":
                    // Process the annotation, which could be all kinds of stuff.
                    WebSupergoo.ABCpdf8.Objects.Annotation annotation = (WebSupergoo.ABCpdf8.Objects.Annotation)element;

                    ProcessAnnotation(annotation);

...

Mike Edgar
0

I don't know all the features of these PDF parsers, but Aspose has a pretty comprehensive one. We did, unfortunately, come across two bugs, and I've been waiting a long time for them to be fixed.

iTextSharp seems to be the most common open-source PDF parser for .NET.

Stephen Oberauer