
I'm using iText 5.5.8 for Java. Following the default, straightforward text extraction procedures, i.e.

PdfTextExtractor.getTextFromPage(reader, pageNumber)

I was surprised to find several mistakes in the output; specifically, all letter "d"s come out as "o"s.

So how does text extraction in iText really work? Is it some kind of OCR?

I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides the text by invoking some decode method on a GraphicsState's font field, and that's as far as I could go without getting a major migraine.

So who's my man? Which class should I override or which parameter should I tweak to be able to tell iText "hey, you're reading all ds wrong!"
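
Here is roughly as far as I got: a bare-bones TextExtractionStrategy (a sketch of mine, not anything from the iText samples) that just collects whatever TextRenderInfo.getText() hands back. The wrong letters already show up at this point.

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class DebugExtractionStrategy implements TextExtractionStrategy
{
    private final StringBuilder result = new StringBuilder();

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        // getText() returns the chunk after the font's decoding has been applied,
        // so any mapping problem is already visible here
        result.append(renderInfo.getText());
    }

    @Override
    public void beginTextBlock()
    {
        // nothing to do
    }

    @Override
    public void endTextBlock()
    {
        // separate text blocks with a newline
        result.append('\n');
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo)
    {
        // images are irrelevant here
    }

    @Override
    public String getResultantText()
    {
        return result.toString();
    }
}

It plugs into the overload PdfTextExtractor.getTextFromPage(reader, i, new DebugExtractionStrategy()).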

edit:

A sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ (the file name is 0116_LR.pdf). Sorry, I can't share a direct link. This is some basic code for text extraction:

import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());

        try
        {

            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }

        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

edit after @blagae and @mkl answers

Before starting to fiddle with iText I had tried text extraction with Apache PDFBox (a project similar to iText that I had just discovered), but it has the same issue.

Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers.

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{

    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());

        Matcher matcher;
        String line, extractedText;
        boolean anyMatchFound;
        try
        {
            for (int i = 1; i <= 16; i++) // the sample PDF has 16 pages
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytes));
                // skip everything up to the first BT (begin text) operator
                while ((line = raf.readLine()) != null && !line.equals("BT"));

                extractedText = "";
                // read until the closing ET (end text) operator, collecting every
                // literal string, i.e. whatever is enclosed in parentheses
                while ((line = raf.readLine()) != null && !line.equals("ET"))
                {
                    anyMatchFound = false;
                    matcher = actualWordPattern.matcher(line);
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText += matcher.group(1);
                    }
                    if (anyMatchFound)
                        extractedText += "\n";
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

It appears, at least in my case, that the characters are correct. However, the order of words or even letters is messy, super messy in fact, so this approach is unusable as well.

What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up.

I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR. I am now trying to:

  1. transform the PDF into an image (PDFBox is great at doing that; do not even bother to try pdf-renderer)
  2. OCR that image

I will post my results in a few days.
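
For step 1, this is the kind of thing I mean (a minimal sketch assuming the PDFBox 2.x API with its PDFRenderer class; the OCR step itself is left to whatever engine you prefer):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImages
{
    public static void main(String[] args) throws IOException
    {
        PDDocument document = PDDocument.load(new File("0116_LR.pdf"));
        try
        {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++)
            {
                // page indices are 0-based here; 300 dpi is usually plenty for OCR
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
        finally
        {
            document.close();
        }
    }
}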

  • Please share the PDF in question. Most likely the mistakes are already in it, albeit hidden. – mkl Jan 03 '16 at 20:33
  • When I click on the link to your PDF, I get a 403 status code. – Brian Snow Jan 04 '16 at 13:37
  • thanks mkl, added link for PDF (sorry, it's in Italian) – Henry Chinaski Jan 04 '16 at 13:37
  • @brian sorry brian, you should append the name of the file 0116_LR.pdf (don't want the file to be seen by search engines) – Henry Chinaski Jan 04 '16 at 13:39
  • @HenryChinaski *What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up* - The reason is that your PDF **intentionally** tries to mislead text extractors. As a result, following best practices will result in errors. – mkl Jan 08 '16 at 20:33
  • @mkl I'm not quite convinced about intentionality in word scrambling. The pdfs I'm working on have been generated by Adobe Indesign (I know because it's written inside the pdf), which means they were manually compiled by some graphic designer. The randomness in the positioning of words may reflect the order in which said person added text layers to the source and the strange character mapping may be the result of some inscrutable software decision. Text accessibility is an issue that any software of a certain regard cannot intentionally overlook (e.g. text-to-speech for visually impaired). – Henry Chinaski Jan 09 '16 at 14:38
  • *intentionality* - the issue identified by @blagae is intentional, I'm not talking about the order here. *Adobe Indesign (I know because it's written inside the pdf)* - that does not have to be true. In particular the software named in the file need not be the only software used on the file. – mkl Jan 09 '16 at 15:15

2 Answers


Your input document has been created in a strange (but 'legal') way. There is a Unicode mapping in the resources that maps arbitrary glyphs to Unicode points. In particular, character number 0x64, d in ASCII, is mapped in this font to the glyph with Unicode point 0x6f, which is o. This is not a problem per se - any PDF viewer can handle it - but it is strange, because all the other glyphs that are used are not "cross-mapped": e.g. character 0x63 is mapped to Unicode point 0x63 (which is c), etc.

[Image: the faulty Unicode entry in the font's ToUnicode mapping]
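
If you want to look at that mapping yourself, a quick-and-dirty way (just a sketch, for the first page only) is to dump the ToUnicode CMap of every font in the page resources with iText's low-level objects and look for the bfchar/bfrange entries:

import java.io.IOException;

import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class DumpToUnicode
{
    public static void main(String[] args) throws IOException
    {
        PdfReader reader = new PdfReader("0116_LR.pdf");
        try
        {
            PdfDictionary resources = reader.getPageN(1).getAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.getAsDict(PdfName.FONT);
            if (fonts != null)
            {
                for (PdfName name : fonts.getKeys())
                {
                    PdfDictionary font = fonts.getAsDict(name);
                    PRStream toUnicode = (PRStream) PdfReader.getPdfObject(font.get(PdfName.TOUNICODE));
                    if (toUnicode != null)
                    {
                        // the CMap source is plain text; its bfchar/bfrange entries
                        // hold the character code -> Unicode mapping discussed above
                        System.out.println(name + ":");
                        System.out.println(new String(PdfReader.getStreamBytes(toUnicode)));
                    }
                }
            }
        }
        finally
        {
            reader.close();
        }
    }
}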

Now for the reason that Acrobat does the text extraction correctly (except for the space) while the others go wrong. We'll have to delve into the PDF syntax for this:

[p, -17.9, e, -15.1, l, 1.4, l, 8.4, i, -20,  m, 5.8, i, 14, st, -17.5, e, 31.2, ,, -20.1,  a] TJ
<</ActualText <fffffffeffffffff00640064> >> BDC
5.102 0 Td
[d, -14.2, d] TJ
EMC

That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64,0x64).
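
To make that concrete, here is a tiny snippet of my own (not something taken from the file or from iText) that decodes the trailing bytes of that hex string; assuming they are UTF-16BE, 0x0064 0x0064 is indeed "dd":

import java.nio.charset.StandardCharsets;

public class ActualTextDecode
{
    public static void main(String[] args)
    {
        // the last four bytes of <fffffffeffffffff00640064>, read as UTF-16BE
        byte[] payload = { 0x00, 0x64, 0x00, 0x64 };
        System.out.println(new String(payload, StandardCharsets.UTF_16BE)); // prints "dd"
    }
}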

So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.

And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.

blagae
  • thank you so much! I know nothing about the internal structure of PDFs and I would've never figured that out by myself. Now I'm trying to think about a possible workaround... not necessarily a clean, powerful solution, but I ought to assume that this weird mapping may be different -or may not be there at all- in the next pdf (I have absolutely no idea on how these pdfs are generated). Any suggestion? For example is there a way to get that mapping through iText? – Henry Chinaski Jan 04 '16 at 22:02
  • Ah, so this issue essentially is a duplicate of [this one](http://stackoverflow.com/a/22688775/1729265). – mkl Jan 05 '16 at 09:06
  • You can get to the ToUnicode mapping, but it's impossible for a computer to guess that a certain mapping is 'off', because in a lot of situations all mappings are non-trivial and essential for text extraction. Your best bet is to look into writing your own iText TextExtractionStrategy, as shown by the link in @mkl 's answer – blagae Jan 05 '16 at 09:22

This probably has to do with how the PDF was OCR'd in the first place, rather than with how iText is parsing the PDF's contents. Try copy/pasting the text from the PDF into Notepad, and see if the "ds -> os" transformation still occurs. If this is the case, you're going to have to do the following when parsing text from this particular PDF:

  1. Identify all occurrences of the string "os".
  2. Decide whether or not the word containing the given "os" instance is a valid English/German/Spanish word.
  3. If it IS a valid word, do nothing.
  4. If it is NOT a valid word, perform the reverse "os -> ds" transformation, and check against the dictionary in the language of your choice again (a rough sketch follows this list).
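
A rough sketch of steps 2-4, reading the problem as individual letters ('d' printed as 'o'). The Set<String> dictionary is just a placeholder for whatever word list you trust, not a real spell-checking API, and the candidate generation is brute force (fine for single words):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OToDFixer
{
    public static String fixWord(String word, Set<String> dictionary)
    {
        if (dictionary.contains(word.toLowerCase()))
            return word;                              // step 3: looks valid, do nothing
        // step 4: try turning some of the 'o's back into 'd's and re-check
        for (String candidate : candidates(word))
            if (dictionary.contains(candidate.toLowerCase()))
                return candidate;
        return word;                                  // no known word found, keep the original
    }

    // every variant of the word with any subset of its 'o's replaced by 'd'
    private static List<String> candidates(String word)
    {
        List<String> result = new ArrayList<String>();
        result.add("");
        for (char c : word.toCharArray())
        {
            List<String> next = new ArrayList<String>();
            for (String prefix : result)
            {
                next.add(prefix + c);
                if (c == 'o')
                    next.add(prefix + 'd');
            }
            result = next;
        }
        return result;
    }
}

Note that a blanket "replace every o with d" would break words that also contain a genuine 'o' (as "aooolcenti" does), which is why the sketch tries every combination and checks each one against the dictionary.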
Brian Snow
  • "Try copy/pasting the text from the PDF into Notepad" did that and, surprise surprise, all letters are correct... (sorry have to get back to my real job -sigh- see you again in a few hours!) – Henry Chinaski Jan 04 '16 at 13:44
  • When I do this, all letters are not correct -- there are many instances of OCR mistakes. For example, on page 16, you have the line "pelli miste, addolcenti," which pastes into Notepad as "pelli miste, aooolcenti," – Brian Snow Jan 04 '16 at 13:49
  • I get "pelli miste, add olcenti" (from Adobe Acrobat Pro XI), and "pelli miste, aooolcenti" on both Sumatra PDF and on Foxit Reader – blagae Jan 04 '16 at 14:30
  • @blagae I see. Having read your answer, my answer seems to be very wrong. Do you think I should delete it? Or should I leave it here for posterity? – Brian Snow Jan 04 '16 at 15:06
  • @BrianSnow I have no problem with leaving it here, because your answer was equally likely to be correct for the general case. Anyone googling with the right search terms in the future might want to know that an imperfect OCR job is also a very probable root cause. – blagae Jan 04 '16 at 15:11
  • thank you Brian, unfortunately dictionary lookup would be problematic because some of the words are brand names and, in all likelihood, they would not be recognized. I will consider this path as my last resort. – Henry Chinaski Jan 04 '16 at 22:14