Opening a PDF fails with the error "Cannot find image data or EI" (itext7 library)

Question

I am trying to extract the text from a PDF (sample linked below) using the itext7 library, with the code below:

    public static PageDescribe GetTextFromPage(PdfDocument fullDoc, int pageNum)
    {
        // Page numbers are 1-based.
        if (pageNum < 1)
            return null;

        PdfPage page = fullDoc.GetPage(pageNum);
        if (page == null)
            return null;

        // LocatedTextStrategy is an ITextExtractionStrategy (its code is not shown here).
        LocatedTextStrategy lStrat = new LocatedTextStrategy();
        string s = PdfTextExtractor.GetTextFromPage(page, lStrat);

        DateTime _startPoint = DateTime.Now;
        lStrat.Points.Defragmentation();

        return new PageDescribe(pageNum, lStrat.Points);
    }

But I get the error "Cannot find image data or EI":

See Image Error

If I manually remove the logo at the start of the PDF, this error does not occur. However, I cannot change the source system that provides these files.

Sample of pdf here

Does anyone have any suggestions?

Could you show the relevant part of the code of `LocatedTextStrategy`? – Keyur PATEL, Jul 06 '17 at 02:43
Also, it seems `PdfTextExtractor.GetTextFromPage();` takes 3 arguments, [such as here](https://stackoverflow.com/a/5003230/6741868). – Keyur PATEL, Jul 06 '17 at 02:49
Hi @Keyur, I am using itext7, where there are 3 overloads. The first overload takes only a PdfPage; the second takes a PdfPage and an ITextExtractionStrategy (this is the one my code uses); the third takes a PdfPage, an ITextExtractionStrategy and an IDictionary. The PdfReader used in your link is not necessary. – Deivid Cristian Nascimento, Jul 06 '17 at 03:04
@KeyurPATEL I tried SimpleTextExtractionStrategy too, but it doesn't work either. – Deivid Cristian Nascimento, Jul 06 '17 at 03:11

score 0 · Answer 1 · answered Jul 06 '17 at 03:37

I downloaded your PDF file and tried the following code; it works for me (I tested page 1):

public string GetTextFromPage(string path, int pagenum)
{
    PdfReader reader = new PdfReader(path);
    string text = PdfTextExtractor.GetTextFromPage(reader, pagenum, new LocationTextExtractionStrategy());
    reader.Close();
    return text;
}

You can modify the method above to return your PageDescribe class.

answered Jul 06 '17 at 03:37

Keyur PATEL


Thanks @KeyurPATEL. Which version of iText did you use? I am using itext7 (from NuGet), and there is no overload of PdfTextExtractor.GetTextFromPage that takes a PdfReader. – Deivid Cristian Nascimento Jul 06 '17 at 20:27
Remember to add `using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;`, since I am using iTextSharp 7 as well, downloaded using Nuget specially to test the code :) – Keyur PATEL Jul 07 '17 at 01:20
iTextSharp is a different library; I'm using itext7. The namespaces are `using iText.Kernel.Pdf; using iText.Kernel.Pdf.Canvas.Parser; using iText.Kernel.Pdf.Canvas.Parser.Listener;`. You are right @Keyur PATEL: it works with iTextSharp, but not with itext7. Crazy, no? lol – Deivid Cristian Nascimento Jul 07 '17 at 01:33

I spent an hour searching and trying (downloaded itext7 using NuGet and ran into the same error as you). Like you, I concluded it's a problem with the inline image (weird that extracting text causes that error), and that the support for iText is not good. I would suggest moving over to iTextSharp. Sorry I couldn't help :( – Keyur PATEL Jul 07 '17 at 03:13
Thanks, I will change to iTextSharp. I tried debugging the itext7 library, but the problem is too complex to solve there. – Deivid Cristian Nascimento Jul 08 '17 at 14:46
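
For readers comparing the two APIs discussed in the comments: the answer's method uses the iTextSharp call that takes a `PdfReader`, while itext7 works on a `PdfPage` obtained from a `PdfDocument`. A minimal sketch of the itext7 counterpart, using the two-argument overload described in the question's comments, might look like the following. The class name `Itext7TextSketch` is hypothetical and not from the thread, and this sketch only illustrates the API difference; per the comments above, itext7 still raised "Cannot find image data or EI" on the sample PDF, so it is not a fix.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper class (name is an assumption, not from the thread).
public static class Itext7TextSketch
{
    // Returns the text of one page, or null if the page number is out of range.
    public static string GetTextFromPage(string path, int pageNum)
    {
        // In itext7 the reader is wrapped in a PdfDocument.
        PdfDocument doc = new PdfDocument(new PdfReader(path));
        try
        {
            // Page numbers are 1-based.
            if (pageNum < 1 || pageNum > doc.GetNumberOfPages())
                return null;

            PdfPage page = doc.GetPage(pageNum);

            // Second overload: PdfPage plus an ITextExtractionStrategy.
            return PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
        }
        finally
        {
            // Always release the underlying file handle.
            doc.Close();
        }
    }
}
```

It would be called as `Itext7TextSketch.GetTextFromPage(path, 1)`, mirroring the answer's `GetTextFromPage(path, pagenum)`.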
