I'm using iTextSharp to read a PDF file. I'm trying to read the full text of the first page with this simple code:

var pdfReader = new PdfReader("<fileName>");
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new SimpleTextExtractionStrategy());

It returns a string like this:

"\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 \0 \0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 (\n\0 \0 \0 ) \0 \0 * \0 + , \0 , \0 \0 & , \0 - \0 . # \0 \0 \0 & $ \0 , \0 /\n+ \0 & & \0 * 0 \0 1 .\n2 \0 3\n4 - \0 5 \0 \0 $ \0 \0 # \0 \0 \0 & $ \0 , \0 * & \0 \0 ’ \0 .\n6\n\0 \0 \0 - \0 \0 \0 \0 & \0 \0 \0 \0 \0 \0 \0 , \0 # \0 \0 \0 & $ \0 , \0 \0 \0 & \0 # \0 \0 & $ ’ ) & \0 \0 \0 \0 # \0 ’ ’ \0 7 - \0 $ \0 \0 7 \0 ’ \0 , \0 8\n9 5 \0 \0 , \0 \0 $ $ \0 \0 \0 \0 \0 ’ \0 \0 3\n\0 \0 \0 ) \0 \0 \0 \0 4 - \0 5 \0 \0 $ \0 \0 * & \0 \0 ’ \0 .\n\0 \0 \0 \0 # \0 $ \0 $ \0 \0 ) \0 \0 \0 : 0 ; \0 ; < ; : 1 ; + \0 = < 9 = < < > \0 ? \0 ? \0 3 \0 (\n@\n\0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 ! 3\n\0 ......"

I can read the original PDF with Acrobat Reader and browsers. The file seems to be a PDF/A.

The code I use works with other PDFs.

Does iText have a problem with this standard?

Can someone point me in the right direction?

Update

Copy/paste from Acrobat gives me broken text. I don't think it's an iTextSharp (5.5.10) problem.

Update

You can try with this file: PDF Example

danyolgiax
  • AFAIK iTextSharp works fine with PDF/A. Does this method work fine when you feed it any other PDF or PDF/A? – Equalsk Mar 01 '17 at 15:38
  • Try reading the `byte[]` content of the file yourself and pass it to the `PdfReader` constructor instead. It must be something related to the encoding. – hyankov Mar 01 '17 at 15:39
  • Can you extract the text with Acrobat? – Paulo Soares Mar 01 '17 at 16:30
  • You don't mention which version of iTextSharp you are using. Older versions didn't read the ToUnicode map. iTextSharp doesn't have any problem with the standard, but some PDFs that claim to be PDF/A (blue ribbon above the pages) aren't real PDF/A files. Did you verify them in Acrobat? – Bruno Lowagie Mar 01 '17 at 17:00
  • @Paulo - Copy/paste from Acrobat gives me broken text. – danyolgiax Mar 01 '17 at 17:16
  • *"Copy/paste from Acrobat gives me broken text"* - In that case the PDF is broken: its information on which glyph represents which Unicode character is missing or simply wrong. In such a case you may want to look into OCR solutions. – mkl Mar 01 '17 at 21:28
  • @mkl - OK, but I need one clarification: how can browsers and Acrobat recognize the text and display it? Why can't a library like iText do the same? – danyolgiax Mar 02 '17 at 08:17
  • Acrobat does not recognise the text! You said yourself that it returns broken text when extracting (via copy&paste). – mkl Mar 02 '17 at 08:26
  • I added a PDF example file. – danyolgiax Mar 02 '17 at 18:29
  • Dead link as of August 22, 2019 – Jack Griffin Aug 22 '19 at 08:20

1 Answer

The file does not contain information required for text extraction. Furthermore, the file is invalid as a PDF/A file.

Information for text extraction

The sample file contains a background (located in a form XObject resource) showing the empty form and a foreground (immediately in the page content stream) of filled-in values.

The text in the form XObject is drawn using a Type 3 font without a standard encoding or standard names in its encoding. There also is no ToUnicode map in it.

This means that the text drawing instructions in that form XObject take sequences of bytes as arguments. For each byte value, the Type 3 font object provides a stream of simple drawing instructions (path definitions using lines and curves, path filling instructions), but nothing says which Unicode value corresponds to that byte value or to that set of drawing instructions.

Thus, PDF viewers can draw the page but they cannot correctly put a Unicode string of characters into the clipboard which we as humans would read from that drawing, and neither can iTextSharp.

Short of OCR there is no reasonable way to extract text from the form.
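
To check this for a given file, one option is to walk the page resources and print, for each font, its subtype, its encoding, and whether it has a **ToUnicode** entry. What follows is only a minimal sketch against iTextSharp 5.x: the names `FontInspector` and `Dump` are made up for illustration, and it assumes form XObjects do not reference each other cyclically.

```csharp
using System;
using iTextSharp.text.pdf;

static class FontInspector
{
    // Prints subtype, encoding and ToUnicode presence for every font in a
    // resources dictionary, then recurses into form XObjects, which carry
    // their own resources (assumption: no cyclic XObject references).
    static void Dump(PdfDictionary resources, string indent)
    {
        if (resources == null) return;

        PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
        if (fonts != null)
        {
            foreach (PdfName name in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(name);
                if (font == null) continue;

                // The encoding is either a name (e.g. /WinAnsiEncoding),
                // a dictionary with custom differences, or absent.
                PdfObject encoding = font.GetDirectObject(PdfName.ENCODING);
                string encodingText = encoding == null
                    ? "none"
                    : (encoding.IsName() ? encoding.ToString() : "custom dictionary");

                Console.WriteLine("{0}{1}: subtype={2}, encoding={3}, ToUnicode={4}",
                    indent, name, font.GetAsName(PdfName.SUBTYPE),
                    encodingText, font.Contains(PdfName.TOUNICODE));
            }
        }

        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects != null)
        {
            foreach (PdfName name in xobjects.Keys)
            {
                PdfStream xobject = xobjects.GetAsStream(name);
                if (xobject != null && PdfName.FORM.Equals(xobject.GetAsName(PdfName.SUBTYPE)))
                {
                    Console.WriteLine("{0}form XObject {1}", indent, name);
                    Dump(xobject.GetAsDict(PdfName.RESOURCES), indent + "  ");
                }
            }
        }
    }

    static void Main(string[] args)
    {
        var reader = new PdfReader(args[0]);
        try
        {
            Dump(reader.GetPageN(1).GetAsDict(PdfName.RESOURCES), "");
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A `/Type3` font that reports a custom encoding dictionary and `ToUnicode=False` is the kind of font described above: its text can be drawn but not reliably extracted.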


The text immediately in the foreground, on the other hand, is drawn using a font with a standard encoding (WinAnsiEncoding) and, therefore, can be extracted. Thus, at the end of the output of the OP's code you'll find

\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000

 ...

\u0000 \u0000 \u0000 x s \u0000 l t n q o x m l \u0000 z \u0000 ~ { \u0000 } } \u0000 l w x
2016
14874587948 DITTA PROVA SRL
CREMA CR 26013 VIA DANTE 17
011110
LPRGCM82T26D150H LEOPARDI GIACOMO
M 26 12 1982 CREMONA CR
MILANO MI F205
28 02 2017
DITTAP0101 / LEOGIA01001

i.e. the filled-in values of the form.
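
If only those filled-in values are of interest, a crude post-filter on the extracted string may be enough. The sketch below is an assumption based purely on the output shown above, where the unusable lines contain U+0000 characters and the value lines do not; `ExtractedTextFilter` and `DropUnmappedLines` are hypothetical names. It is a heuristic, not a proper separation of the two fonts: a garbled line that happens to contain no U+0000 (the question's output has a few one-character lines like that) will slip through.

```csharp
using System;
using System.Linq;

static class ExtractedTextFilter
{
    // Keeps only non-empty lines that contain no U+0000 character.
    // Heuristic only: it looks at the extracted string, not at the fonts
    // that produced it.
    public static string DropUnmappedLines(string pageText)
    {
        var kept = pageText
            .Split('\n')
            .Where(line => line.IndexOf('\0') < 0 && line.Trim().Length > 0);
        return string.Join("\n", kept);
    }
}
```

With the question's code this would be called as `ExtractedTextFilter.DropUnmappedLines(pageText)`.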

PDF/A conformance

The file indeed claims to be PDF/A-1a but inspecting it one quickly sees that this is a blatant lie. E.g. Adobe Acrobat Preflight says:

Preflight report screenshot

These entries indicate that the document does not even try to conform to PDF/A-1a; it merely claims to.

mkl
  • Thanks for the exhaustive explanation. Is it possible to recreate the ToUnicode map with iTextSharp? Doing something like this: `font.FontDictionary.Put(PdfName.TOUNICODE,);` Can this fix the problem? – danyolgiax Mar 03 '17 at 11:25
  • If you somehow can conjure up a mapping for the font from glyph code to Unicode, you can do something akin to that dictionary put. But where do you want to get that mapping from? – mkl Mar 03 '17 at 13:06
  • Are you telling me it is almost impossible to recreate even if I know the font used? – danyolgiax Mar 03 '17 at 13:56
  • If you dive deep into the font files you know as sources and compare the glyph definitions therein with those PDF Type 3 font character drawing instructions, you may well re-create the mapping. But this is non-trivial, in particular as the instructions need not fit 100%, there may be small deviations to prevent easy recognition. Thus, while this task might be fun, a generic solution will take quite some time to implement. – mkl Mar 03 '17 at 14:05
  • I can try. But I need to understand how to write a ToUnicode table with iTextSharp. I cannot find any example like this: `font.FontDictionary.Put(PdfName.TOUNICODE,);` can you point me to a snippet? – danyolgiax Mar 03 '17 at 14:13
  • I have no such code at hand right now. But iText does generate its own **ToUnicode** tables when embedding font subsets. I'll try and find that code later. – mkl Mar 03 '17 at 14:54
  • iText creates its own **ToUnicode** tables e.g. in [the TrueTypeFontUnicode.cs method `GetToUnicode`](https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/TrueTypeFontUnicode.cs#L165). – mkl Mar 03 '17 at 18:37
  • As you suggested, I used `GetToUnicode` method from iTextSharp source to create a Stream with a custom UnicodeMap that contains something like this: `var longTag = new Dictionary { {1, new[] {3,3,32}}, {2, new[] {4,65,65}}, {3, new[] {271,98,98}}, {4, new[] {272,272,99}}, ... ... font.FontDictionary.Put(PdfName.TOUNICODE, streamUnicodeMap);` The resultant PDF is broken and it cannot be opened. – danyolgiax Mar 08 '17 at 10:08
  • `GetToUnicode` returns a stream object. PDF streams must be indirect objects. Thus, you must add the stream to the writer (`obj = writer.AddToBody(STREAMOBJECT)`) and retrieve the indirect reference (`toUnicodeRef = obj.IndirectReference`). You then can add this indirect reference as **ToUnicode** value to the font dictionary. If this does not work, please make that a stack overflow question in its own right. – mkl Mar 08 '17 at 11:52
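
The steps in the last comment could look roughly like the sketch below. It is not code from iText itself: `ToUnicodePatcher`, `BuildToUnicode` and `Attach` are hypothetical names, the CMap is written by hand for a simple single-byte font such as a Type 3 font, and the code-to-Unicode `map` is assumed to exist already, which, as discussed above, is the hard part. It also assumes the `PdfStamper` is not in append mode and that `font` is the font dictionary obtained from the stamper's own `PdfReader`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;

static class ToUnicodePatcher
{
    // Builds a ToUnicode CMap stream for a simple (one byte per code) font
    // from a code -> Unicode character mapping supplied by the caller.
    static PdfStream BuildToUnicode(IDictionary<byte, char> map)
    {
        var sb = new StringBuilder();
        sb.Append("/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n");
        sb.Append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n");
        sb.Append("/CMapName /Adobe-Identity-UCS def\n/CMapType 2 def\n");
        sb.Append("1 begincodespacerange\n<00> <FF>\nendcodespacerange\n");

        // A bfchar section may hold at most 100 entries, so emit in chunks.
        var entries = new List<KeyValuePair<byte, char>>(map);
        for (int i = 0; i < entries.Count; i += 100)
        {
            int count = Math.Min(100, entries.Count - i);
            sb.Append(count).Append(" beginbfchar\n");
            for (int j = i; j < i + count; j++)
            {
                sb.AppendFormat("<{0:X2}> <{1:X4}>\n", entries[j].Key, (int)entries[j].Value);
            }
            sb.Append("endbfchar\n");
        }

        sb.Append("endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n");

        var stream = new PdfStream(Encoding.ASCII.GetBytes(sb.ToString()));
        stream.FlateCompress();
        return stream;
    }

    // Adds the CMap stream to the writer as an indirect object and stores
    // the resulting reference under /ToUnicode in the font dictionary.
    public static void Attach(PdfStamper stamper, PdfDictionary font, IDictionary<byte, char> map)
    {
        PdfIndirectObject obj = stamper.Writer.AddToBody(BuildToUnicode(map));
        font.Put(PdfName.TOUNICODE, obj.IndirectReference);
    }
}
```

The key point is the last two statements: the stream itself is never put into the font dictionary, only the indirect reference returned for it.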