I'd like to extract the text from a PDF that contains both binary and clear-text data. When I do this with PdfReaderContentParser, the GetResultantText method returns the right text for the binary content but only whitespace for the clear-text content. Here is the code I use:

        // Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
        byte[] binaryPdf = File.ReadAllBytes(this.fileName);
        reader = new PdfReader(binaryPdf);

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);

        // iText page numbers are 1-based
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
            string contentText = simpleStrategy.GetResultantText();

            // Do something with the contentText
            // ...
        }

Any idea how to get all content?

seeb
  • Show us the PDF. Also: it is unclear what you mean when you say "ASCII text content". Be very specific when talking about the encoding that was used for the font of the text you refer to (ASCII = 7-bit; characters in a PDF content stream are 8-bit, hence you probably didn't mean to write ASCII). Note: based on your code, I am assuming that you are parsing content streams. Usually, these content streams are compressed (which implies that they are binary). All in all, your question has too many flaws for anyone to answer it. – Bruno Lowagie Nov 30 '15 at 09:00
  • I meant that when I open the PDF in a text editor, I can see binary and clear-text content. You are right, I read the file into a byte array. Therefore the stream is binary. Could this cause the problem? – seeb Nov 30 '15 at 09:34
  • Aren't you confusing the file structure of a PDF (the PDF objects) that are constructed using ASCII with the content streams (which are compressed in stream objects). What exactly do you want to extract? Currently, you are using code to parse the syntax of the content streams. This code won't reveal whatever other data (e.g. metadata) is stored in PDF objects. Your question is still unclear: tell us what you want to read from the PDF. – Bruno Lowagie Nov 30 '15 at 09:37
  • Sorry for my confusing wording. I would just like to get all the text in the PDF. I know I could also use PdfTextExtractor.GetTextFromPage() for this, but I also want the positions of the text on the page. I actually use a subclass of LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy, but I wanted to keep the code simple, and SimpleTextExtractionStrategy returns the same result. – seeb Nov 30 '15 at 10:01
  • The position of each text snippet is stored in the `TextRenderInfo` object. You are using `getResultantText()` which means you get the already processed `TextRenderInfo`. You need to take a step back and use the code at a lower-level. – Bruno Lowagie Nov 30 '15 at 10:04
  • Your question is a duplicate of [Retrieve the respective coordinates of all words on the page](http://stackoverflow.com/questions/13714605/retrieve-the-respective-coordinates-of-all-words-on-the-page-with-itextsharp). – Bruno Lowagie Nov 30 '15 at 10:07
  • Yes, I know that. I use the TextRenderInfo in the extension class to get the position (see http://stackoverflow.com/a/8825884/4503749). I just used `GetResultantText()` here for demonstration, as it returns the same result as `TextRenderInfo` as far as the text is concerned: whitespace characters instead of the original text. – seeb Nov 30 '15 at 10:29
  • Which version of iText are you using? Also: you still expect us to *guess*. As long as you don't share the PDF, you shouldn't expect an answer. I have just noted down a 10 digit number on a block note. If you can guess that number in only 3 attempts, I'll answer your question. If you can't guess the number, you'll have to accept that your question is as hard to answer as mine. – Bruno Lowagie Nov 30 '15 at 10:43
  • I am using version 5.5.7.0 of itextsharp. I added a hyperlink to the pdf. I first had to clarify if I am allowed to publish this pdf with my boss. – seeb Nov 30 '15 at 11:51
  • I did a quick check of the PDF. When I take the stream out of the PDF and uncompress it, the characters are gone. I don't know what's causing this. This needs further investigation and I currently don't have the time. (Note: it's not an ASCII problem: it's a problem of the decompressed binary stream.) – Bruno Lowagie Nov 30 '15 at 13:06
  • OK, thanks for your help anyway. – seeb Nov 30 '15 at 14:14
  • I'll make it a ticket in the paid issue tracker. – Bruno Lowagie Nov 30 '15 at 14:26
  • @seeb I'm currently looking into your PDF. In contrast to Bruno, though, I have not yet understood which contents you mean that vanish. Can you quote some sample content which you miss? – mkl Dec 01 '15 at 17:29
  • @mkl Basically for all descriptions on the left-hand side (e.g. Lifting moment) I get whitespaces instead of the actual text. – seeb Dec 02 '15 at 07:09

1 Answer

Overview

In a comment, the OP clarified which text was missing from the extraction result:

Basically for all descriptions on the left-hand side (e.g. Lifting moment) I get whitespaces instead of the actual text.

The reason for this is fairly simple: In the page content there are only spaces (if anything at all) on most of the left side. The labels you see actually are read-only form fields.

For example, "Lifting moment" is the value of the form field 13B141032.

If you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document.
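
To check whether a missing label lives in a form field rather than in the page content, one could dump the AcroForm field names and values. The following is only a minimal sketch under the assumption that iTextSharp 5.x is used, as in the question; the class name `FormFieldDump` and the method name `PrintFields` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class FormFieldDump
{
    // Prints the name and current value of every AcroForm field in the PDF.
    // "path" is a placeholder for the location of the PDF to inspect.
    public static void PrintFields(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            AcroFields form = reader.AcroFields;

            // Fields maps each fully qualified field name to its field item
            foreach (KeyValuePair<string, AcroFields.Item> entry in form.Fields)
            {
                Console.WriteLine("{0} = {1}", entry.Key, form.GetField(entry.Key));
            }
        }
    }
}
```

If a label that is missing from the extracted text shows up here as a field value, it is stored in a form field and not in the page content stream.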

Document analysis

It looks like the major part of the internationalization of the specification labels has been done using form fields.

For an overview I separated the original document

[image: original document]

into its regular page content

[image: page content]

and the form fields

[image: page fields]

There indeed are several strings of spaces in the page content under the form fields.

I would assume that there once was an earlier version of that document (or a template for it) which contained those labels (maybe in only one language or probably two) as page content.

Then there was a task of more dynamic internationalization, so someone replaced the existing labels in the page content with spaces and added new internationalized labels as read-only form fields, probably because form fields are easier to manipulate.

Considering that the original labels seem to have been replaced by an equal number of spaces, though, one might speculate that there is even another program manipulating the page stream of this and similar documents at hard-coded offsets, and that, so as not to break this program in the course of internationalization, the actual labels had to be created outside the page content. Stranger things have happened...

Flatten and extract

As mentioned above, if you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document. This can be done like this:

// Requires: using System; using System.IO; using NUnit.Framework;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
[Test]
public void ExtractFlattenedTextTestSeeb()
{
    FileInfo file = new FileInfo(@"PATH_TO_FILE\41851208.pdf");
    Console.Out.Write("41851208.pdf, flattened before extraction\n\n");

    using (MemoryStream memStream = new MemoryStream())
    {
        // Step 1: flatten the form fields into the regular page content
        using (PdfReader readerOrig = new PdfReader(file.FullName))
        using (PdfStamper stamper = new PdfStamper(readerOrig, memStream))
        {
            // Keep the memory stream open so it can be read again below
            stamper.Writer.CloseStream = false;
            stamper.FormFlattening = true;
        }

        // Step 2: extract text from the flattened document
        memStream.Position = 0;
        using (PdfReader readerFlat = new PdfReader(memStream))
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(readerFlat);

            for (int i = 1; i <= readerFlat.NumberOfPages; i++)
            {
                SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                string contentText = simpleStrategy.GetResultantText();

                Console.Write("Page {0}:\n\n{1}\n\n", i, contentText);
            }
        }
    }
}

The result on standard output:

41851208.pdf, flattened before extraction

Page 1:

90–120 l/min 
(23.8–31.7 US gal./min) 
60 kg 
(132 lbs) 
115 kg 
(254 lbs) 
350 l 
(92.5 US gal.) 
100 kg 105 kg 
(220 lbs) (231 kg) 
100 kg 
(220 lbs) 
250 l 300 l 
(66.0 US gal.) (79.3 US gal.) 
90 kg 
(198 lbs) 
180 l 
(47.6 US gal.) 
5305kg 
(11695 lbs) 
5265kg 
(11607 lbs) 
5395kg 
(11894 lbs) 
5205kg 
(11475 lbs) 
5010kg 
(11045 lbs) 
4780kg 
(10538 lbs) 
4470kg 
(9854 lbs) 
4190kg 
(9237 lbs) 
3930kg 
(8664 lbs) 
5215kg 
(11497 lbs) 
5045kg 
(11122 lbs) 
4860kg 
(10714 lbs) 
4650kg 
(10251 lbs) 
4350kg 
(9590 lbs) 
4100kg 
(9039 lbs) 
3850kg 
(8488 lbs) 
25.2 m 
(82’ 8") 
23.2 m 
(76’ 1") 
21.0 m 
(68’ 11") 
18.7 m 
(61’ 4") 
16.4 m 
(53’ 10") 
14.1 m 
(46’ 3") 
11.8 m 
(38’ 9") 
9.7 m 
(31’ 10") 
7.7 m 
(25’ 3") 
36.5 MPa (365 bar) 
(5293 psi) 
endlos 
endless 
sans finite 
25.2 m 
31.2 m 
(82’ 8") 
(102’ 4") 
21.0 m 
(68’ 11") 
14900kg 
(32848 lbs) 
403.2 kNm (41.1 mt) 
(297270 ft.lbs) 
49.1 kNm (5.0 mt) 
PK 42002–SH A–G 
(36210 ft.lbs) 
37.3 kNm (3.8 mt) 
PK 42002–SH A–C 
(27510 ft.lbs) 

1GETR 2GETR
PK 42002–SH A – C
KT250 KT300 KT350 KT180



2GETR STZY



+V1
+V2
+2/4
7(F) 8(G) 6(E) 5(D) 4(C) 3(B) 2(A)



+V1
+V2







































(S410–SK–D)
DTS410SHC/03
0100
11/2010



PK 42002–SH
Type Model Modell
Page Page Seite
Chapitre Chapter Kapitel
Edition Edition Ausgabe



Öltank
Mehrgewicht: 
Alle Gewichtsangaben ohne Aufbauzubehör,Zusatzgeräte und Öl. 
Hydr. Ausschübe:
Max. Reichweite + Fly-Jib:
Max. Reichweite: 
Fördermenge der Pumpe: 
Betriebsdruck: 
Schwenkmoment: 
Schwenkbereich: 
Max. Reichweite: 
Max. hydraulische Reichweite: 
Max. Hubkraft: 
Max. Hubmoment:
Gewicht +V ohne 2/4
Krangewicht (R3X,STZS): 
Technische Daten 
Konstruktionsänderungen vorbehalten, fertigungstechn. Toleranzen müssen berücksichtigt werden. 
Oil tank
Excess weight: 
All weights given without assembly accessory,additional devices and oil. 
Hydr. boom extensions:
Max. outreach + Fly-Jib: 
Max. outreach: 
Pump capacity: 
Operating pressure:
Slewing torque: 
Slewing angle: 
Max. outreach: 
Max. hydraulic outreach: 
Max. lifting capacity: 
Lifting moment:
Weight +V without 2/4
Crane weight (R3X,STZS): 
Specifications 
Subject to change, production tolerances have to be taken into account. 
Réservoir
Excessif poids: 
Tous les poids sans huile ni accessoire de montage ni appareils accessoires 
Extensions hydrauliques:
Portee maximale + Fly-Jib: 
Max. portee: 
Debit de pompe: 
Pression d' utilisation:
Couple de rotation: 
Angle de rotation: 
Max. portee: 
Portee hydraulique maximale: 
Capacite maxi de levage:
Couple de levage:
Poids +V sans 2/4
Poids grue (R3X,STZS): 
Données Techniques 
Sous reserve de modifications de conception. Les tolerances relatives a la technique de production doivent etre prises en consideration.

As you see, "Lifting moment" and all the other missing labels are there now.
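
Since the question reads the file into a byte array and mentions a custom subclass of LocationTextExtractionStrategy, the flattening step could also be factored into a small helper that works on byte arrays. This is merely a sketch assuming iTextSharp 5.x; the names `PdfFlattener` and `Flatten` are hypothetical and not part of the library.

```csharp
using System.IO;
using iTextSharp.text.pdf;

public static class PdfFlattener
{
    // Returns a copy of the given PDF in which all form field appearances
    // have been moved into the regular page content.
    public static byte[] Flatten(byte[] original)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (PdfReader reader = new PdfReader(original))
            using (PdfStamper stamper = new PdfStamper(reader, output))
            {
                stamper.FormFlattening = true;
            }

            // ToArray still works after the stamper has closed the stream
            return output.ToArray();
        }
    }
}
```

The resulting byte array can be passed to `new PdfReader(...)` in place of the original file contents, so the rest of the extraction code, including a position-aware strategy, stays unchanged.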

mkl