I have two PDF libraries; I read every document in each and parse specific information from it. One library processes without issues. The other library returns only the footer of each page, as follows: `Page 1 of 6Page 2 of 6Page 3 of 6Page4 of 6.....` The library that works has one document with multiple pages.

The following is the PDF reader code I am using. Has anyone seen this behavior before? What is different between the documents, and how should I handle the case where only the footer is returned?

     static string ReadPdfFile(string fileName)
     {
         string curFile = @fileName;
         // Console.WriteLine(curFile);
         // Console.WriteLine(File.Exists(curFile) ? "File exists." : "File does not exist.");

         StringBuilder text = new StringBuilder();

         if (File.Exists(curFile))
         {
             Console.Error.WriteLine("in: " + fileName);
             PdfReader pdfReader = new PdfReader(fileName);

             for (int page = 1; page <= pdfReader.NumberOfPages; page++)
             {
                 ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                 string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                 currentText =
                     Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8,
                         Encoding.Default.GetBytes(currentText)));
                 text.Append(currentText);
             }
             pdfReader.Close();
         }
         return text.ToString();
     }
gotfocus
  • Not related to your question but see this post for why you should not be doing the multiple encoding thing. It is the string equivalent of `int.Parse(float.Parse(int.Parse("64").ToString()).ToString())`. http://stackoverflow.com/a/10191879/231316 – Chris Haas May 22 '14 at 20:38
  • We would need to see the PDFs in order to tell you what's different between the two. Just a guess, but maybe the first has text as images? – Chris Haas May 22 '14 at 20:39
  • Thanks Chris. The text and fonts are similar between the 2. The problem library does appear to have an image on the first page of each document. What is recommended to avoid the resulting issue? – gotfocus May 22 '14 at 20:48
  • Maybe the unnamed library performs OCR whereas iTextSharp doesn't. – Bruno Lowagie May 23 '14 at 05:34
  • @user3666656, when you say "library" are you talking about a collection of PDFs or "Dynamic Link Library"? – Chris Haas May 23 '14 at 13:13
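
One way to check the "text as images" guess from the comments is to look at what each page's resource dictionary lists. The following is only a rough sketch, assuming iTextSharp 5.x (the same `PdfReader` used in the question); `PageResourceReport` and `Print` are hypothetical names made up for this illustration. A page that reports an image but only the font or two needed for the footer would be consistent with a scanned body, although this is a hint rather than proof, since text or images nested inside form XObjects are not counted here.

```csharp
using System;
using iTextSharp.text.pdf;

static class PageResourceReport
{
    // Prints, for each page, how many fonts and image XObjects the page's
    // resource dictionary lists. Hypothetical helper, not part of iTextSharp.
    public static void Print(string fileName)
    {
        PdfReader reader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
                int fonts = 0, images = 0;

                if (resources != null)
                {
                    // Fonts available to the page's content stream.
                    PdfDictionary fontDict = resources.GetAsDict(PdfName.FONT);
                    if (fontDict != null) fonts = fontDict.Keys.Count;

                    // XObjects can be images or forms; count only the images.
                    PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
                    if (xobjects != null)
                    {
                        foreach (PdfName name in xobjects.Keys)
                        {
                            PdfStream stream = xobjects.GetAsStream(name);
                            if (stream != null &&
                                PdfName.IMAGE.Equals(stream.GetAsName(PdfName.SUBTYPE)))
                            {
                                images++;
                            }
                        }
                    }
                }

                Console.WriteLine("page {0}: fonts={1}, images={2}", page, fonts, images);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Running `PageResourceReport.Print(fileName)` on one document from each library and comparing the output should show whether the problem documents carry their body text as images. If they do, a text extraction strategy alone will not recover that text and an OCR step would be needed.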

0 Answers