Using iTextSharp, I am trying to extract the text from the following PDF file:
https://www.treasury.gov/ofac/downloads/sdnlist.pdf
This is the code:
// pdfReader is an iTextSharp PdfReader opened on the file above
var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy());
if (currentText.Length > 0)
{
    var capture = new Capture();
    capture.Text = currentText;
    // write the results to the DB, if any data was found
    _dataService.AddCapture(capture);
}
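To reproduce the extraction step on its own, without the database layer, a minimal sketch might look like the following. It assumes iTextSharp 5.x and a local copy of the PDF. The class name `PdfTextDump`, the method name `ExtractPage`, and the file path are placeholders for illustration only.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: isolates the text extraction from the DB code.
static class PdfTextDump
{
    // Returns the text of one page (1-based page number), extracted
    // with the same strategy as in the snippet above.
    public static string ExtractPage(string path, int pageNumber)
    {
        var pdfReader = new PdfReader(path);
        try
        {
            return PdfTextExtractor.GetTextFromPage(
                pdfReader, pageNumber, new SimpleTextExtractionStrategy());
        }
        finally
        {
            // Release the file handle even if extraction throws.
            pdfReader.Close();
        }
    }
}
```

Calling `PdfTextDump.ExtractPage("sdnlist.pdf", 2)` on a downloaded copy of the file corresponds to the `GetTextFromPage` call in the snippet above.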
Using SimpleTextExtractionStrategy, the text is written to the database with many unwanted spaces inside words. The first several lines of page 2 are written as:
OFFICE OF FOREIGN ASSETS CONTROL SPECIALLY DESIGNATED NATIONALS & BLOCKED PERSONS February 3, 2017 - 2 - A.A. RASPLET IN; a .k. a. AL MAZ -AN TEY MSDB; a .k.a . AL MAZ -ANTEY PV O 'AI R DEFENSE' CO NCERN LEAD SYSTE M S DESIGN BUREAU OAO ' OPEN JO INT -STOCK COMPANY' IMENI ACADEMIC IAN A.A . RASPLETIN; a.k .a. GO LOVNOYE SISTEMN OYE KONS TRUKT ORSKOY E BYURO OPEN J OIN T-S TOCK C OMP ANY OF ALMAZ -AN TEY PVO C ONCERN I MEN I ACADEMICIAN A .A. RASPLE TIN; a.k. a. JO INT STOCK C OMPANY A LMA Z-AN TEY AI R DEFENSE CON CERN MA IN SYSTE M DESIGN BUREAU NAMED BY ACADE MICIAN A.A.
See, for example, the word "JO INT" (which should be "JOINT") in the 4th and 6th lines, and "CON CERN" (which should be "CONCERN") in the second-to-last line. Spaces like these occur throughout the results. This makes querying the text effectively impossible, since a search for "JOINT" will not match "JO INT".
Does anyone know why this happens and how to resolve it?