Looking for a PDF file parser

Question

Does anyone know of a PDF file parser that I could use to pull sections of text out of a plaintext PDF file? Specifically, I want a way to reliably pull out the text that belongs to annotations.

Delphi, C#, or regex; I don't mind which.

score 5 · Accepted Answer · answered Feb 09 '09 at 21:48

The PDF File Parser article on xactpro seems to be exactly what you need. It explains the format of the PDF and comes with full source code for a parser (and another project for visualisation of the model).

The parser uses format-specific terms, but you could easily use the visualiser to learn what to look for.

Richard Szalay

Link seems to be broken. – automatic Dec 15 '10 at 18:00
@automatic - It looks like the entire site is down – Richard Szalay Dec 16 '10 at 07:56

score 2 · Answer 2 · answered Feb 10 '09 at 07:29

You can also take a look at Xpdf (http://www.foolabs.com/xpdf/download.html)

Mihai Nita

score 1 · Answer 3 · answered Dec 01 '09 at 07:33

Check out PDFBox.

Abhijith

score 1 · Answer 4 · answered Feb 09 '09 at 21:34

Not sure if it supports the functionality you need, but we've been using abcPDF with some success.

Jeremy

I don't think abcPDF supports parsing. – Richard Szalay Feb 09 '09 at 21:41
@Richard Szalay, I wasn't sure. The feature matrix says it supports reading PDFs, but whether it gives you an object model in the API to access parts of the PDF is something I can't say for certain. – Jeremy Feb 09 '09 at 21:54
I wouldn't go so far as to reject its advertised feature set :) It didn't support it when I last used it, but its writing capabilities certainly did the job well. – Richard Szalay Feb 09 '09 at 22:32
ABCpdf does expose an object model, it's what they call Atoms. – Mark S. Rasmussen Feb 10 '09 at 07:58

Mike Edgar · Answer 5 · 2011-12-01T17:28:01.723

abcPDF does let you extract annotations. They have a very good section in the help for it, but the code to handle it generally looks like this:

    // Walk every indirect object in the document and pick out the annotations.
    for (int objectIndex = 0; objectIndex < theDoc.ObjectSoup.Count; objectIndex++)
    {
        try
        {
            IndirectObject element = theDoc.ObjectSoup.ElementAt(objectIndex);

            // Dispatch on the runtime type name of the object.
            string elementType = element.GetType().ToString();
            switch (elementType)
            {
                case "WebSupergoo.ABCpdf8.Objects.Annotation":
                    // Process the annotation, which could be all kinds of stuff.
                    WebSupergoo.ABCpdf8.Objects.Annotation annotation = (WebSupergoo.ABCpdf8.Objects.Annotation)element;

                    ProcessAnnotation(annotation);

...

score 0 · Answer 6 · answered Apr 08 '11 at 10:06

I don't know all the features of these PDF parsers, but Aspose has a pretty comprehensive one. We did, unfortunately, come across two bugs, and I've been waiting a long time for them to be fixed.

iTextSharp seems to be the most common open source PDF parser for .NET.
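
For the annotation case in the question, something along these lines may work with iTextSharp. This is only a minimal sketch, assuming the iTextSharp 5.x API (the `iTextSharp.text.pdf` namespace); the class name `AnnotationDump` and the default path `input.pdf` are placeholders made up for illustration. It prints each annotation's subtype and its `/Contents` text, page by page.

```csharp
using System;
using iTextSharp.text.pdf;

class AnnotationDump
{
    static void Main(string[] args)
    {
        // Placeholder path (an assumption); pass a real file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
            {
                PdfDictionary page = reader.GetPageN(pageNumber);

                // /Annots is optional, so this is null for pages without annotations.
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue;

                for (int i = 0; i < annots.Size; i++)
                {
                    // GetAsDict follows indirect references; null if the entry is not a dictionary.
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null) continue;

                    PdfName subtype = annot.GetAsName(PdfName.SUBTYPE);
                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);

                    Console.WriteLine("Page {0}: {1} {2}",
                        pageNumber,
                        subtype,
                        contents == null ? "(no /Contents)" : contents.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Not every annotation carries a `/Contents` entry (link annotations often don't), which is why the sketch checks for null before reading the text.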
