I am using the following code to extract text from the first page of PDF files with iTextSharp:

public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

        text = PdfTextExtractor.GetTextFromPage(pdfReader,1,strategy);

        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));

    }
    return text;
}

It works well for many PDFs, but not for others.

Working PDF: http://data.hexagosoft.com/LFBO.pdf

Non-working PDF: http://data.hexagosoft.com/LFBP.pdf

These two PDFs seem quite similar, but text extraction works for one and not for the other. I suspect the fact that their producer tags differ is a clue. Another clue is that this function works for any other page of the PDF that has no chart.

I also tried Ghostscript, without success.

The Encoding line seems to be useless as well.

How can I extract the text of the first page of the non-working PDF using iTextSharp?

Thanks

Sebastien
  • Both links return a 503 error... – Jan Slabon Jan 28 '16 at 18:53
  • Sorry, it seems filebin.ca is not reliable ... I hosted the files elsewhere and edited my message – Sebastien Jan 28 '16 at 19:01
  • Not directly related to your problem, but completely remove the line `text = Encoding.UTF8.GetString...` because it isn't doing what you think it might be doing. [See this for more.](http://stackoverflow.com/a/10191879/231316) – Chris Haas Jan 28 '16 at 19:11

2 Answers

Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither uses a ToUnicode map. The glyph naming seems fairly straightforward: the number following the MT prefix is the ASCII code of the glyph.

The first document works because the mapping is not changed at all: each character code maps to the glyph name with the same number, so iText's fallback to the default encoding (I guess) gives the right result:

/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]

The other document really changes the mapping:

/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]

This means, for example, that character code 2 maps to the glyph named MT76, an unofficial/private glyph name that iText doesn't know. iText therefore has no information beyond the character code 2 and uses that code in the final result (I guess).

Without implementing logic for the MT-prefixed glyph names, it is impossible to get the correct text out of this document. Nowhere is it defined that a glyph name beginning with MT followed by an integer maps to an ASCII value. That is either a coincidence or a convention of the font designer or creation tool, wherever it came from.
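To illustrate what such logic could look like, here is a minimal sketch that parses a /Differences array and builds a character-code-to-character table under the MT{ASCII} assumption described above. The class and method names (`MtGlyphNameDecoder`, `BuildMap`, `Decode`) are hypothetical, the sketch does not use iTextSharp at all, and obtaining the raw character codes from the page content and wiring the table into a text extraction strategy is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: relies on the undocumented MT<decimal ASCII code> naming.
public static class MtGlyphNameDecoder
{
    // Matches either an integer (a new starting character code) or a /Name token.
    private static readonly Regex Token =
        new Regex(@"(?<code>\d+)|/(?<name>[^\s/\[\]]+)");

    // Matches glyph names of the assumed form MT<decimal ASCII code>.
    private static readonly Regex MtName = new Regex(@"^MT(?<ascii>\d+)$");

    // Builds a map from character code to character for a /Differences array.
    public static Dictionary<int, char> BuildMap(string differences)
    {
        var map = new Dictionary<int, char>();
        int current = 0;
        foreach (Match m in Token.Matches(differences))
        {
            if (m.Groups["code"].Success)
            {
                // A number resets the character code for the names that follow.
                current = int.Parse(m.Groups["code"].Value);
                continue;
            }

            Match mt = MtName.Match(m.Groups["name"].Value);
            if (mt.Success)
            {
                map[current] = (char)int.Parse(mt.Groups["ascii"].Value);
            }

            // Each name occupies one code, whether or not we could decode it.
            current++;
        }
        return map;
    }

    // Translates raw character codes; unknown codes become U+FFFD.
    public static string Decode(IEnumerable<int> codes, IDictionary<int, char> map)
    {
        var sb = new StringBuilder();
        foreach (int code in codes)
        {
            char c;
            sb.Append(map.TryGetValue(code, out c) ? c : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

With the second document's array, character codes 2, 3, 4, 5, 6 map to MT76, MT105, MT103, MT104, MT116, which this sketch decodes as "Light". Since the naming convention is unspecified, any glyph name that does not match the pattern is left unmapped rather than guessed.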

Jan Slabon
  • Thanks for your answer. Does it mean there is no way to extract text from this file? And why does text extraction work for the other pages of the same document? – Sebastien Jan 29 '16 at 17:42
  • Only by explicitly following the undocumented and unspecified logic: MT{ASCII}. The first document works because, e.g., 65 maps to MT65, 66 to MT66, and so on: the character code is the same as the mapped ASCII value. That is not the case for the 2nd document: 2 maps to MT76, 3 to MT105, ... – Jan Slabon Jan 29 '16 at 20:36
  • So, is there a way to manually create the mapping? How can I do that? I assume that all other files I'll need to extract text from will have the same custom mapping. – Sebastien Feb 04 '16 at 16:59
  • I must admit that I'm not familiar with iText at all. So I cannot answer if and how it would be possible to use custom glyph names, sorry. – Jan Slabon Feb 04 '16 at 19:08

The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not encoded correctly for some reason when this PDF was generated. If you have a lot of files like this, a working approach could be:

  • detect broken pages while extracting text by searching for a phrase that should appear on every page, such as "service"
  • process these pages separately using OCR, with a tool like Tesseract and a .NET wrapper
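
The detection step could be sketched roughly as follows. This assumes the per-page text has already been extracted (for example with `PdfTextExtractor.GetTextFromPage`, as in the question) and that some phrase is known to appear on every correctly decoded page. The names `BrokenPageDetector` and `FindPagesNeedingOcr` are hypothetical, and the OCR step itself is not shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: flags pages whose extracted text lacks an expected phrase.
public static class BrokenPageDetector
{
    // Returns the 1-based numbers of pages that should be sent to OCR.
    public static List<int> FindPagesNeedingOcr(IList<string> pageTexts, string expectedPhrase)
    {
        var broken = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            string text = pageTexts[i] ?? string.Empty;

            // Case-insensitive search; a missing phrase suggests a broken glyph mapping.
            if (text.IndexOf(expectedPhrase, StringComparison.OrdinalIgnoreCase) < 0)
            {
                broken.Add(i + 1);
            }
        }
        return broken;
    }
}
```

A phrase check like this is only a heuristic: a page that legitimately lacks the phrase would also be flagged, so the phrase has to be chosen per document family.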
Eugene
  • I've tried this LFBP.pdf with the ByteScout PDF Multitool utility (OCR mode in *Repair Broken Fonts*, with French language selected). It works OK for non-rotated text, but OCR does not work well for rotated text. Note: I am affiliated with ByteScout – Eugene Feb 02 '16 at 21:00