PDF to Text: iTextSharp: Duplicate Pages in Extraction Results

Question

Thanks in advance.

The Background:

I'm working on a console application that extracts data from specific sections of PDF documents. To do this I first need to convert the PDF into a string to work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using SimpleTextExtractionStrategy (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy but found it ineffective for the page layout).

Description of content being converted to text:

The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.

Image of page layout: http://postimg.org/image/b7i25v0g1/

The Problem:

It seems that when the extractor finishes reading the columns on a page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and then starts again from the top of the page where the first header was encountered.

I'd end up with text that looks like:

Page 1 Content

First Header

Second Header

Page 1 Content

Page 2 Content

etc.

Here is the pdf: http://www.filedropper.com/dd35-completeadventurer

I'm not married to iTextSharp; I just need a reliable way to convert documents with this format to text. A workaround or alternate method would be appreciated.

    static public string ToTxt(string @filePath)
    {
        string strText = string.Empty;
        try
        {
            PdfReader reader = new PdfReader(filePath);

            for (int page = 1; page <= reader.NumberOfPages; page++)
            {

                Widgets.ProgressBar(page);

                //Convert PDF to Text
                ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); //iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                strText = strText + s;
            }
            reader.Close();
            Console.WriteLine("File Extracted");
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception: " + e.Message);
        }
        finally
        {
            Console.Clear();
        }
        return strText;
    }

Although not a fix for your problem, [see this](http://stackoverflow.com/a/10191879/231316) for why you want to get rid of the entire re-encoding line for your `s` variable — Chris Haas, Jul 27 '15 at 20:45
Otherwise your code appears correct as far as I can tell. At first it sounded like [this problem](http://stackoverflow.com/a/30217081/231316) but you are correctly creating a new text extraction strategy each time. Just to be clear, your output makes it appear to parse part of a page, then jump to another page and then jump back to the first page, right? Can you post the actual PDF somewhere? — Chris Haas, Jul 27 '15 at 20:51
Most likely the reason for that content repetition is that *the content indeed is there twice*, the second time probably outside the page boundaries or at the same position as the first time. Can you share the PDF to verify this? — mkl, Jul 27 '15 at 21:27

score 1 · Accepted Answer · answered Jul 28 '15 at 08:55

As already conjectured in a comment, the duplicate text is already present in the PDF content!

Details

In your document, the page contents of two facing pages are often identical: each of the two pages contains the contents of the whole spread, and the individual page merely displays the left or the right half respectively.

E.g. consider the two pages 6 and 7. Their contents are identical, namely the whole spread of both pages:

[Image: spread of pages 6 and 7]

This content fills the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, only the expected content is shown for page 6 and for page 7.

Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restricts extraction to these boxes; they extract all text drawn anywhere in the content, including text outside the visible area. Thus, the duplicate text.
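
To check whether a given document follows this scheme, one can compare the MediaBox and the CropBox of each page. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name `PageBoxInspector`, its method names, and the tolerance of one point are made up for illustration.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;

static class PageBoxInspector
{
    // True if the crop box is noticeably smaller than the media box,
    // i.e. the page displays only part of the area its content may draw on.
    // The tolerance (in points) is an arbitrary illustrative value.
    public static bool IsCropped(Rectangle mediaBox, Rectangle cropBox, float tolerance = 1f)
    {
        return mediaBox.Width - cropBox.Width > tolerance
            || mediaBox.Height - cropBox.Height > tolerance;
    }

    // Prints the media box and crop box of every page of the given file.
    public static void PrintBoxes(string filePath)
    {
        PdfReader reader = new PdfReader(filePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                Rectangle mediaBox = reader.GetPageSize(page); // MediaBox
                Rectangle cropBox = reader.GetCropBox(page);   // CropBox
                Console.WriteLine(
                    "Page {0}: MediaBox {1} x {2}, CropBox {3} x {4} starting at x = {5}, cropped: {6}",
                    page, mediaBox.Width, mediaBox.Height,
                    cropBox.Width, cropBox.Height, cropBox.Left,
                    IsCropped(mediaBox, cropBox));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

For pages built as described above, two facing pages should report the same MediaBox but CropBoxes starting at different x positions.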

Preventing duplicate text in the extraction result

Knowing the cause for the text duplication, there are multiple ways to prevent it:

1. You can try to extract the content of only every other PDF page. Unfortunately, the scheme described above does not hold for all pages: at least the initial pages (title page, contents, ...) are not created this way, and further into the book there are some artwork pages that do not follow the scheme either. Thus, this option would require quite some management of exceptional pages.
2. You can extract the contents of each page but keep the contents of the previously processed page in some variable. Then add the newly extracted content to the result only if it does not equal the content of the prior page.
3. You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
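
The filter-based option might look roughly like the following. This is only a sketch, assuming iTextSharp 5.x with its `RegionTextRenderFilter` and `FilteredTextRenderListener` classes and a `RegionTextRenderFilter` constructor that accepts an `iTextSharp.text.Rectangle`; the names `CropBoxTextExtractor` and `ExtractVisibleText` are hypothetical. The filter decides per text chunk, so a chunk whose baseline crosses the edge of the crop box is kept as a whole, and rotated pages are not considered here.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class CropBoxTextExtractor
{
    // Extracts the text of all pages, keeping only text chunks
    // that intersect the crop box of the page they are drawn on.
    public static string ExtractVisibleText(string filePath)
    {
        StringBuilder text = new StringBuilder();
        PdfReader reader = new PdfReader(filePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Visible area of this page
                Rectangle cropBox = reader.GetCropBox(page);
                RenderFilter[] filters = { new RegionTextRenderFilter(cropBox) };

                // Wrap a fresh strategy per page so that only text
                // accepted by the filter reaches it
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new SimpleTextExtractionStrategy(), filters);

                text.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

Unlike comparing consecutive pages, this approach does not depend on the duplicated pages being adjacent, but it does rely on the crop boxes being set sensibly in the document.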

Awesome, I implemented the page check and it works like a charm. I like this because it's a) easily implemented, b) it will work across different formats (unlike the every other page or constant area filter). I'll post the code below in another comment. Thanks so much mkl! — CodeHead, Jul 28 '15 at 17:45

score 1 · Answer 2 · answered Jul 28 '15 at 17:42

Using mkl's second method (checking each page for repeat) I came up with the following and it works brilliantly; an easy fix:

    // Requires: using System; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    // Method signature as in the question.
    static public string ToTxt(string @filePath)
    {
        string strText = string.Empty;
        try
        {
            PdfReader reader = new PdfReader(filePath);
            string prevPage = "";
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                Widgets.ProgressBar(page);
                // Convert PDF to text, using a fresh strategy for each page
                ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                // Append only if the text differs from that of the previous page
                if (prevPage != s)
                    strText += s;
                prevPage = s;
            }
            reader.Close();
            Console.WriteLine("File Extracted");
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception: " + e.Message);
        }
        finally
        {
            Console.Clear();
        }
        return strText;
    }
