I'm using iText 5.5.8 for Java. Following the default, straightforward text extraction procedures, i.e.
PdfTextExtractor.getTextFromPage(reader, pageNumber)
I was surprised to find several mistakes in the output; specifically, every letter "d" comes out as an "o".
So how does text extraction in iText really work? Is it some kind of OCR?
I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides the text by invoking some decode method on a GraphicsState's font field, and that's as far as I could go without getting a major migraine.
So who's my man? Which class should I override, or which parameter should I tweak, to be able to tell iText "hey, you're reading all the d's wrong!"?
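For what it's worth, the place where the decoded strings surface is RenderListener.renderText(TextRenderInfo), so a custom TextExtractionStrategy at least lets you see what iText decoded and which font each chunk came from. This is only a diagnostic sketch (the class name FontLoggingStrategy is mine, and it does not fix the underlying encoding problem):

import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Diagnostic strategy: log which font produced each decoded chunk,
// then let SimpleTextExtractionStrategy handle the text as usual.
public class FontLoggingStrategy extends SimpleTextExtractionStrategy
{
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        // getText() is the string iText decoded from the content stream
        // using the font's encoding / ToUnicode information
        System.out.println(renderInfo.getFont().getPostscriptFontName() + " -> " + renderInfo.getText());
        super.renderText(renderInfo);
    }
}

It can be passed to the extractor as PdfTextExtractor.getTextFromPage(reader, i, new FontLoggingStrategy()). If the wrong characters always come from the same (probably subsetted) font, the problem lies in that font's encoding/ToUnicode tables rather than in the extraction strategy itself.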
Edit:
A sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ (the file name is 0116_LR.pdf). Sorry, I can't share a direct link. This is some basic code for text extraction:
import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            // Extract text page by page with the default strategy
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
Edit after @blagae's and @mkl's answers:
Before starting to fiddle with iText, I tried text extraction with Apache PDFBox (a project similar to iText that I had just discovered), but it has the same issue.
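For reference, the PDFBox attempt was roughly along these lines (a minimal sketch assuming PDFBox 2.x, where PDFTextStripper is the standard text extraction entry point; the class name PdfBoxImport is mine):

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxImport
{
    public static void main(String[] args) throws IOException
    {
        // Load the same sample file and dump its text with PDFBox's default stripper
        try (PDDocument document = PDDocument.load(new File("0116_LR.pdf")))
        {
            PDFTextStripper stripper = new PDFTextStripper();
            System.out.println(stripper.getText(document));
        }
    }
}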
Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers:
import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    // Matches the literal string operands of text-showing operators, e.g. (Hello) Tj
    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        Matcher matcher;
        String line, extractedText;
        boolean anyMatchFound;
        try
        {
            // Iterate over the first 16 pages of the sample file
            for (int i = 1; i <= 16; i++)
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(
                        new RandomAccessSourceFactory().createSource(contentBytes));

                // Skip everything up to the first BT (begin text) operator
                while ((line = raf.readLine()) != null && !line.equals("BT"));

                extractedText = "";
                // Collect parenthesized strings until the ET (end text) operator
                while ((line = raf.readLine()) != null && !line.equals("ET"))
                {
                    anyMatchFound = false;
                    matcher = actualWordPattern.matcher(line);
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText += matcher.group(1);
                    }
                    if (anyMatchFound)
                        extractedText += "\n";
                }

                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");

                // For comparison: iText's own extraction of the same page
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
It appears, at least in my case, that the characters are correct. However, the order of words, or even of single letters, is messy (super messy, in fact), so this approach is unusable as well.
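The messy order is not surprising by itself: a content stream stores text in drawing order, not in reading order, which is exactly why iText's extraction sorts glyphs by position. The position-sorting strategy can also be passed by hand; as far as I can tell the no-strategy overload of getTextFromPage already uses it, so this only shows where the sorting happens (the class name SortedExtraction is mine):

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class SortedExtraction
{
    // Same extraction as before, but with the position-sorting strategy spelled out
    public static String extractPage(PdfReader reader, int pageNumber) throws IOException
    {
        return PdfTextExtractor.getTextFromPage(reader, pageNumber, new LocationTextExtractionStrategy());
    }
}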
What really surprises me is that every method I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screws something up.
I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR. I am now trying to:
1) transform the PDF into an image (PDFBox is great at doing that; do not even bother to try pdf-renderer)
2) OCR that image
I will post my results in a few days.
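For step 1, the rendering part looks roughly like this (a sketch assuming PDFBox 2.x; the class name, output file names and the 300 DPI value are my own choices):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImages
{
    public static void main(String[] args) throws IOException
    {
        // Render every page of the sample file to a 300 DPI PNG for the OCR step
        try (PDDocument document = PDDocument.load(new File("0116_LR.pdf")))
        {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++)
            {
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
    }
}

The PNGs can then be fed to whichever OCR engine turns out to handle this layout best.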