
Using iTextSharp, I am trying to extract the text from the following pdf file:

https://www.treasury.gov/ofac/downloads/sdnlist.pdf

This is the code:

var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy());
if (currentText.Length > 0)
{
    var capture = new Capture();
    capture.Text = currentText;

    // write the results to the DB, if any data was found
    _dataService.AddCapture(capture);
}

Using the SimpleTextExtractionStrategy, the results are written to the database with a myriad of unwanted spaces within words. The first several lines of page 2 are written as:

OFFICE OF FOREIGN ASSETS CONTROL SPECIALLY DESIGNATED NATIONALS & BLOCKED PERSONS February 3, 2017 - 2 - A.A. RASPLET IN; a .k. a. AL MAZ -AN TEY MSDB; a .k.a . AL MAZ -ANTEY PV O 'AI R DEFENSE' CO NCERN LEAD SYSTE M S DESIGN BUREAU OAO ' OPEN JO INT -STOCK COMPANY' IMENI ACADEMIC IAN A.A . RASPLETIN; a.k .a. GO LOVNOYE SISTEMN OYE KONS TRUKT ORSKOY E BYURO OPEN J OIN T-S TOCK C OMP ANY OF ALMAZ -AN TEY PVO C ONCERN I MEN I ACADEMICIAN A .A. RASPLE TIN; a.k. a. JO INT STOCK C OMPANY A LMA Z-AN TEY AI R DEFENSE CON CERN MA IN SYSTE M DESIGN BUREAU NAMED BY ACADE MICIAN A.A.

See for example the word "JO INT" in the 4th & 6th lines, and the word "CON CERN" in the 2nd to last line. These types of spaces occur throughout the entire results. This will make querying the text impossible, unfortunately.

Does anyone have any idea why this does this and how to resolve this?

Stpete111

1 Answer


why this does this

The cause is actually a feature of the text extraction strategy that, in your case, does not work as desired.

A bit of background: What you perceive as a space between words in a PDF file is not necessarily produced by an instruction drawing a space character; it can also be the result of an instruction shifting the text insertion position a little to the right. Text extraction strategies therefore usually add a space character when they find a sufficiently large right-shift like that. For some more on this (in particular the "sufficiently large" part), see e.g. this answer.

In the case of your document, though, the font used for the body text declares character widths that are too small (if used as is, the characters would appear glued together with no space in between whatsoever). Thus, there are small right shifts between each pair of consecutive characters, and some of these shifts are wide enough to be falsely identified as word separations by the mechanism explained above.
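
To make the mechanism concrete, here is a minimal, library-free sketch of such a gap check. The Chunk class, the coordinates and the space width of 3 units are made up for illustration and are not iTextSharp types or values from your PDF; the threshold of half a space width is the one visible in the RenderText code of SimpleTextExtractionStrategy. The real method additionally skips the check when a space character is already present on either side, which this sketch leaves out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned run of text; not an iTextSharp type.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class ImpliedSpaceSketch
{
    // Joins chunks from left to right and adds a space wherever the gap
    // between two chunks is wider than half the width of a space character.
    public static string Join(IEnumerable<Chunk> chunks, float singleSpaceWidth)
    {
        var result = new StringBuilder();
        Chunk previous = null;
        foreach (var chunk in chunks)
        {
            if (previous != null && chunk.StartX - previous.EndX > singleSpaceWidth / 2f)
            {
                result.Append(' ');
            }
            result.Append(chunk.Text);
            previous = chunk;
        }
        return result.ToString();
    }

    public static void Main()
    {
        const float spaceWidth = 3f; // the threshold is therefore 1.5 units

        // A word gap created only by a right-shift of 3 units: the space is wanted.
        var shifted = new[] { new Chunk("JOINT", 0f, 30f), new Chunk("STOCK", 33f, 63f) };
        Console.WriteLine(Join(shifted, spaceWidth)); // JOINT STOCK

        // Declared widths too small, so a corrective shift of 2 units sits
        // inside a word: the same check now inserts a bogus space.
        var underDeclared = new[] { new Chunk("JO", 0f, 8f), new Chunk("INT", 10f, 22f) };
        Console.WriteLine(Join(underDeclared, spaceWidth)); // JO INT
    }
}
```

The second case is what happens throughout your document: the shift is only there to compensate for the too-small widths, but the check cannot tell it apart from a genuine word gap.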

how to resolve this

As word separations in your PDF are created by instructions drawing a space character, you do not need the feature explained above. Thus, the easiest way to resolve the issue is to use a text extraction strategy without that feature.

You can create such a strategy by copying the source code of the SimpleTextExtractionStrategy (e.g. from here) and commenting out some lines in the method RenderText as below:

public virtual void RenderText(TextRenderInfo renderInfo)
{
    [...]

    if (hardReturn)
    {
        //System.out.Println("<< Hard Return >>");
        AppendTextChunk('\n');
    }
    else if (!firstRender)
    {
//        if (result[result.Length - 1] != ' ' && renderInfo.GetText().Length > 0 && renderInfo.GetText()[0] != ' ')
//        { // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
//            float spacing = lastEnd.Subtract(start).Length;
//            if (spacing > renderInfo.GetSingleSpaceWidth() / 2f)
//            {
//                AppendTextChunk(' ');
//                //System.out.Println("Inserting implied space before '" + renderInfo.GetText() + "'");
//            }
//        }
    }
    else
    {
        //System.out.Println("Displaying first string of content '" + text + "' :: x1 = " + x1);
    }

    [...]
}

Using this simplified extraction strategy, your text is properly extracted.
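
If you would rather not copy the whole class, the same idea can be sketched as a small strategy of its own. Treat the following as an illustration under assumptions rather than a tested drop-in: the class name NoImpliedSpaceTextExtractionStrategy is made up, it targets the iTextSharp 5.x parser API, and the line-break test (squared distance from the previous baseline, 1 unit threshold) is my own choice for this sketch.

```csharp
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical name; a stripped-down strategy that never adds implied spaces.
public class NoImpliedSpaceTextExtractionStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder result = new StringBuilder();
    private Vector lastStart;
    private Vector lastEnd;

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public string GetResultantText()
    {
        return result.ToString();
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();

        if (result.Length > 0)
        {
            // Squared distance of the new start point from the line through
            // the previous baseline; a large value means a new text line.
            Vector previous = lastEnd.Subtract(lastStart);
            float dist = previous.Cross(lastStart.Subtract(start)).LengthSquared
                         / previous.LengthSquared;
            if (dist > 1f) // assumed threshold
            {
                result.Append('\n');
            }
        }

        // Append the text as drawn; spaces only come from actual space characters.
        result.Append(renderInfo.GetText());
        lastStart = start;
        lastEnd = end;
    }
}
```

You would then pass an instance of it to PdfTextExtractor.GetTextFromPage in place of the SimpleTextExtractionStrategy in your code. Keep in mind that this only suits documents like yours, where word gaps are drawn as real space characters; in a PDF that separates words by shifts alone, words would run together.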

mkl
  • I have seen you around these parts and you are certainly the King of this subject matter. Thank you for your great wisdom and assistance! – Stpete111 Feb 07 '17 at 15:06