How to extract text from a PDF and decode characters?

Question

I am using iTextSharp to extract text from a PDF document with this code:

public static bool does_document_text_have_keyword(string keyword, 
                       string pdf_src, Report report_object)  // TEST
{
    try
    {
        PdfReader pdfReader = new PdfReader(pdf_src);
        string currentText;
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)
        {
           ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
           currentText = PdfTextExtractor.GetTextFromPage
                           (pdfReader, page, strategy);
           currentText = Encoding.UTF8.GetString
                           (ASCIIEncoding.Convert
                             (Encoding.Default,                                 
                              Encoding.UTF8, 
                              Encoding.Default.GetBytes(currentText)));

           report_object.log(currentText);  // TEST

           if (currentText.IndexOf
                (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
        }
        pdfReader.Close();
        return false;
    }
    catch
    {
        return false;
    }
}

But the problem is that the extracted text has no white space; it is as if every space had been replaced with an empty string. Yet the PDF document itself clearly contains spaces. Does anyone know what's happening here?

What do you have in `currentText` right after calling `PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)`? — cheesemacfly, Dec 20 '12 at 16:21
@Ray Cheng, I need spaces because I extract text to search for words — omega, Dec 20 '12 at 16:23
What about this solution: http://stackoverflow.com/a/8448889/1443490 — cheesemacfly, Dec 20 '12 at 16:26
@omega, suppose you extracted "ABCDEFG" and your search words are "BC EFG": you could tokenize the search words by space and search for "BC" and then "EFG". That way there's no need to have spaces anymore. — Ray Cheng, Dec 20 '12 at 16:40
I figured it out, but I have a new problem which I asked here http://stackoverflow.com/questions/13977738/c-sharp-itextsharp-which-is-the-right-method-to-text-extraction-stratedy — omega, Dec 20 '12 at 17:51
You might want to have a look at this [answer](http://stackoverflow.com/a/13645183/1729265) to a similar question: *The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word. Unfortunately the same mechanism also is used to enhance the appearance of adjacent glyphs: In some letter combinations, for a good...* — mkl, Dec 20 '12 at 23:30
Furthermore, for definitive advice, please supply sample documents demonstrating the issues. — mkl, Dec 21 '12 at 00:20
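
To make mkl's explanation concrete, here is a minimal sketch of how an extractor might guess spaces from horizontal gaps between text runs on one line. The `TextChunk` type, the `GapSpacing.JoinLine` helper, and the fixed `gapThreshold` are all hypothetical names and values chosen for illustration; this is not iTextSharp's actual implementation, which derives its threshold from font metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical type for illustration: one run of text drawn on a line,
// with the horizontal positions where it starts and ends.
public sealed class TextChunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }

    public TextChunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapSpacing
{
    // Joins the chunks of one line (assumed sorted left to right), inserting
    // a space wherever the gap between consecutive chunks exceeds gapThreshold.
    // The threshold is an assumption; too large a value swallows real word
    // gaps, too small a value splits kerned letter pairs.
    public static string JoinLine(IEnumerable<TextChunk> chunks, float gapThreshold)
    {
        var sb = new StringBuilder();
        TextChunk previous = null;
        foreach (var chunk in chunks)
        {
            if (previous != null && chunk.StartX - previous.EndX > gapThreshold)
            {
                sb.Append(' ');
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}
```

With a threshold of 1.5, the chunks ("Hello", 0 to 25) and ("World", 28 to 53) join as "Hello World", while a kerned pair such as ("W", 0 to 9) and ("ave", 8.6 to 25) joins as "Wave". If the PDF positions its words with gaps below whatever threshold the extractor uses, the spaces disappear from the output, which matches the symptom in the question.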

score 2 · Answer 1 · answered Dec 20 '12 at 16:35

I believe your issue is the SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.

Try using the LocationTextExtractionStrategy. Its documentation states:

A text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.

While the `LocationTextExtractionStrategy` generally is less dependent on the order in which text fragments appear in the page content, the `SimpleTextExtractionStrategy` is unlikely to be at fault here. For words in the same line they use very similar mechanisms to determine whether there are spaces in between or not. — mkl, Dec 20 '12 at 23:13
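
For reference, the strategy swap suggested in the answer could look like the sketch below, assuming the iTextSharp 5.x namespaces `iTextSharp.text.pdf` and `iTextSharp.text.pdf.parser`. The class and method names (`PdfKeywordSearch`, `DocumentContainsKeyword`) are hypothetical. As mkl's comment notes, this may not restore the missing spaces, since both strategies detect spaces within a line in a similar way.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Returns true if any page of the PDF contains the keyword (case-insensitive).
    public static bool DocumentContainsKeyword(string pdfPath, string keyword)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Runs on every exit path, including the early return above.
            reader.Close();
        }
    }
}
```

This sketch also leaves out the `Encoding.Default` to `Encoding.UTF8` round trip from the question: the extractor already returns a .NET `string`, and pushing it through a non-Unicode default code page can replace characters that the code page cannot represent. The `finally` block closes the reader even when the keyword is found early, which the original code skips on its `return true` path.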
