I currently use iTextSharp to read PDF files and then parse the extracted text. With some PDF files I have run into strange behavior: when I extract the text of, for example, a 4-page PDF, the resulting string contains the pages in the following order:

1 2 1 3 1 4

My code for reading the files is as follows:

using (PdfReader reader = new PdfReader(fileStream))
{
     StringBuilder sb = new StringBuilder();

     ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
     for (int page = 0; page < reader.NumberOfPages; page++)
     {
         string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
         if (!string.IsNullOrWhiteSpace(text))
             sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
     }

     Debug.WriteLine(sb.ToString());
}

Here is a link to a file with which this behaviour occurs:

https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf

Hope you guys can help me out!

Y C
  • I don't have an answer for you, but I can tell you that the line that does all your transcoding is actually incorrect (although most examples on the internet still seem to use it) and will break very easily. See [this post](http://stackoverflow.com/a/10191879/231316) for more details, but basically `text` is a 100% perfect .NET string at the moment you get it out of a PDF, guaranteed. At best, the transcoding will do nothing, and at worst you'll turn text into gibberish. – Chris Haas May 12 '15 at 14:10
  • When I was creating the sample I looked at that line and thought as much, but because it is part of so many samples I left it in for reference's sake. I have since changed it, and it didn't make a difference. – Y C May 12 '15 at 14:28
  • Another thing is that text extraction strategies generally aren't intended to be reused (unless you have a specific reason to). The reason for this is that they don't have an explicit "new page" command that wipes their internal state. By reusing that object you are actually working in "append mode". Instead of instantiating it outside of the loop, try creating a new one each time: `var text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());` – Chris Haas May 12 '15 at 15:02
  • As @ChrisHaas says, by reusing the strategy you add the content of all pages to the strategy. Thus, your `StringBuilder` eventually should not merely contain *1 2 1 3 1 4* but instead *1 1 2 1 2 3 1 2 3 4*. – mkl May 12 '15 at 15:30

1 Answer

Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.

The SimpleTextExtractionStrategy needs to be instantiated for every page you read. A reused strategy keeps the text of every page it has already processed, so each call returns the earlier pages again and they end up repeated in the resulting string.
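
To make the accumulation visible without a PDF at hand, here is a minimal sketch that simulates it. `AccumulatingStrategy` and `Demo.GetTextFromPage` are hypothetical stand-ins written for this illustration, not iTextSharp classes; they only assume what the comments above describe, namely that a strategy appends to an internal buffer and is never reset between pages.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a text extraction strategy (NOT an iTextSharp type).
// Assumption: like SimpleTextExtractionStrategy, it appends everything it is
// given to an internal buffer and has no "new page" reset.
public class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public void RenderText(string chunk) => buffer.Append(chunk);

    public string GetResultantText() => buffer.ToString();
}

public static class Demo
{
    // Hypothetical stand-in for PdfTextExtractor.GetTextFromPage: feeds one
    // page to the strategy and returns everything the strategy currently holds.
    public static string GetTextFromPage(string[] pages, int pageNumber, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pages[pageNumber - 1]);
        return strategy.GetResultantText();
    }

    // Extracts all pages, either with one shared strategy or a fresh one per page.
    public static string Extract(string[] pages, bool reuseStrategy)
    {
        var shared = new AccumulatingStrategy();
        var sb = new StringBuilder();

        for (int page = 1; page <= pages.Length; page++)
        {
            var strategy = reuseStrategy ? shared : new AccumulatingStrategy();
            sb.Append(GetTextFromPage(pages, page, strategy));
        }

        return sb.ToString().Trim();
    }

    public static void Main()
    {
        // Each "page" is just its page number followed by a space.
        string[] pages = { "1 ", "2 ", "3 ", "4 " };

        Console.WriteLine(Extract(pages, reuseStrategy: true));  // 1 1 2 1 2 3 1 2 3 4
        Console.WriteLine(Extract(pages, reuseStrategy: false)); // 1 2 3 4
    }
}
```

With the shared instance, the simulated output matches the pattern mkl predicted in the comments; with a fresh instance per page, every page appears exactly once.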

Also, the line that appends to the StringBuilder does not need any transcoding, because the extracted text is already a valid .NET string. It can be changed from:

sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));

to

sb.Append(text);

Thus the following code gives the correct result:

using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();

    // iTextSharp page numbers are 1-based, hence page + 1 below.
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        // Create a new strategy for every page so no text carries over from earlier pages.
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(text);
    }

    Debug.WriteLine(sb.ToString());
}
Y C