I am trying to extract the text from a PDF (sample linked below) using the iText 7 library and the code below:

    // Requires: using iText.Kernel.Pdf; using iText.Kernel.Pdf.Canvas.Parser;
    public static PageDescribe GetTextFromPage(PdfDocument fullDoc, int pageNum)
    {
        if (pageNum < 1)
            return null;

        PdfPage page = fullDoc.GetPage(pageNum);
        if (page == null)
            return null;

        // LocatedTextStrategy and PageDescribe are custom classes (not shown here).
        LocatedTextStrategy lStrat = new LocatedTextStrategy();

        // The returned string is unused; the call is made to populate lStrat.Points.
        string s = PdfTextExtractor.GetTextFromPage(page, lStrat);

        lStrat.Points.Defragmentation();

        PageDescribe _res = new PageDescribe(pageNum, lStrat.Points);
        return _res;
    }

But I get the error "Cannot find image data or EI":

See Image Error

If I manually remove the logo at the top of the PDF, this error does not occur. But I cannot change the source system that provides these files.

Sample of pdf here

Does anyone have any suggestions?

1 Answer

I downloaded your PDF file and tried the following code; it works for me (tested on page 1):

public string GetTextFromPage(string path, int pagenum)
{
    PdfReader reader = new PdfReader(path);
    string text = PdfTextExtractor.GetTextFromPage(reader, pagenum, new LocationTextExtractionStrategy());
    reader.Close();
    return text;
}

You can modify the method above to return your PageDescribe class.
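
For comparison, here is a minimal sketch of the same method written against the iText 7 namespaces rather than iTextSharp. It assumes the iText 7 NuGet package and the built-in `LocationTextExtractionStrategy`; the class name `IText7TextSketch` is hypothetical. Whether this variant gets past the inline-image error on this particular PDF is not something the sketch establishes.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class IText7TextSketch // hypothetical helper class, not part of iText
{
    public static string GetTextFromPage(string path, int pageNum)
    {
        // In iText 7 the reader is wrapped in a PdfDocument,
        // and the extractor takes a PdfPage instead of a PdfReader.
        PdfDocument doc = new PdfDocument(new PdfReader(path));
        try
        {
            PdfPage page = doc.GetPage(pageNum);
            return PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
        }
        finally
        {
            doc.Close(); // by default this also closes the underlying reader
        }
    }
}
```

The main differences from the method above are the `iText.Kernel.*` namespaces and the `PdfPage` argument to `PdfTextExtractor.GetTextFromPage`.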

Keyur PATEL
  • Thanks @KeyurPATEL. Which version of iText did you use? I am using iText 7 (from NuGet), and there is no overload of PdfTextExtractor.GetTextFromPage that takes a PdfReader parameter. – Deivid Cristian Nascimento Jul 06 '17 at 20:27
  • Remember to add `using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;`, since I am using iTextSharp 7 as well, downloaded using NuGet specifically to test the code :) – Keyur PATEL Jul 07 '17 at 01:20
  • iTextSharp is a different library; I'm using iText 7. Its namespaces are `using iText.Kernel.Pdf; using iText.Kernel.Pdf.Canvas.Parser; using iText.Kernel.Pdf.Canvas.Parser.Listener;`. You are right, @Keyur PATEL: it works with iTextSharp, but not with iText 7. Crazy, no? lol – Deivid Cristian Nascimento Jul 07 '17 at 01:33
  • Spent an hour searching and trying (downloaded iText 7 using NuGet and ran into the same error as you). Like you, I concluded it's a problem with the inline image (weird that extracting text causes that error), and that the support for iText is not good. I would suggest moving over to iTextSharp. Sorry I couldn't help :( – Keyur PATEL Jul 07 '17 at 03:13
  • Thanks, I will switch to iTextSharp. I tried debugging the iText 7 library, but it is too complex to fix. – Deivid Cristian Nascimento Jul 08 '17 at 14:46