I am using iTextSharp to extract text from a PDF document with this code:

public static bool does_document_text_have_keyword(string keyword, 
                       string pdf_src, Report report_object)  // TEST
{
    PdfReader pdfReader = null;
    try
    {
        pdfReader = new PdfReader(pdf_src);
        string currentText;
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)
        {
           // Fresh strategy per page so text from earlier pages is not carried over.
           ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
           currentText = PdfTextExtractor.GetTextFromPage
                           (pdfReader, page, strategy);
           currentText = Encoding.UTF8.GetString
                           (ASCIIEncoding.Convert
                             (Encoding.Default,                                 
                              Encoding.UTF8, 
                              Encoding.Default.GetBytes(currentText)));

           report_object.log(currentText);  // TEST

           if (currentText.IndexOf
                (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
        }
        return false;
    }
    catch
    {
        // Any read or parse failure is treated as "keyword not found".
        return false;
    }
    finally
    {
        // Close the reader on every path, including the early return on a match.
        if (pdfReader != null) pdfReader.Close();
    }
}

The problem is that the extracted text has no whitespace; it is as if every space had been replaced with an empty string. The PDF document itself shows spaces between the words. Does anyone know what is happening here?

SteveC
omega
  • What do you have in `currentText` right after calling `PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)`? – cheesemacfly Dec 20 '12 at 16:21
  • @cheesemacfly, it's the same result – omega Dec 20 '12 at 16:23
  • @Ray Cheng, I need spaces because I extract text to search for words – omega Dec 20 '12 at 16:23
  • What about this solution: http://stackoverflow.com/a/8448889/1443490 – cheesemacfly Dec 20 '12 at 16:26
  • @omega, what if you extracted "ABCDEFG" and your search words are "BC EFG"? You could tokenize your search words by space and search for "BC" and then "EFG". That way, there's no need for spaces anymore. – Ray Cheng Dec 20 '12 at 16:40
  • I figured it out, but I have a new problem which I asked here http://stackoverflow.com/questions/13977738/c-sharp-itextsharp-which-is-the-right-method-to-text-extraction-stratedy – omega Dec 20 '12 at 17:51
  • You might want to have a look at this [answer](http://stackoverflow.com/a/13645183/1729265) to a similar question: *The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word. Unfortunately the same mechanism also is used to enhance the appearance of adjacent glyphs: In some letter combinations, for a good...* – mkl Dec 20 '12 at 23:30
  • Furthermore, for definitive advice, please supply sample documents demonstrating the issues. – mkl Dec 21 '12 at 00:20
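
The mechanism mkl describes, where a word break exists only as a horizontal gap between drawn text chunks, can be illustrated with a small sketch. This is not iTextSharp's code: `Chunk` and `GapJoiner` are hypothetical names, the coordinates are assumed to be already extracted and sorted left to right on a single line, and the threshold of half a space width is an assumption chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class Chunk
{
    public readonly string Text;
    public readonly float StartX;      // left edge of the chunk
    public readonly float EndX;        // right edge of the chunk
    public readonly float SpaceWidth;  // width of a space in the chunk's font and size

    public Chunk(string text, float startX, float endX, float spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

public static class GapJoiner
{
    // Joins chunks from one line, inserting a space wherever the gap between
    // two chunks is wider than threshold * SpaceWidth (assumed default: 0.5).
    public static string Join(IEnumerable<Chunk> chunks, float threshold = 0.5f)
    {
        var sb = new StringBuilder();
        Chunk previous = null;
        foreach (Chunk chunk in chunks)
        {
            if (previous != null)
            {
                float gap = chunk.StartX - previous.EndX;
                bool alreadySpaced =
                    previous.Text.EndsWith(" ", StringComparison.Ordinal) ||
                    chunk.Text.StartsWith(" ", StringComparison.Ordinal);
                // Small gaps are treated as kerning, larger ones as word breaks.
                if (!alreadySpaced && gap > previous.SpaceWidth * threshold)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}
```

With chunks "Hello" (0 to 30), "Wo" (33 to 45) and "rld" (45.2 to 63) and a space width of 3.3, the 3-unit gap becomes a space and the 0.2-unit kerning gap does not, so the result is "Hello World". A PDF whose gaps fall below whatever threshold the extractor uses would come out with the words run together, which matches the symptom in the question.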

1 Answer

I believe your issue is the SimpleTextExtractionStrategy. The API documentation (http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html) says:

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.

Try using the LocationTextExtractionStrategy. Its documentation states:

A text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
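
As a minimal sketch of that swap, assuming iTextSharp 5.x and its `iTextSharp.text.pdf.parser` namespace (the helper name `PageTextReader.GetPageText` is made up for illustration), only the strategy type changes. Whether this restores the missing spaces depends on how the particular PDF positions its text, so treat it as something to try rather than a guaranteed fix.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PageTextReader
{
    // Hypothetical helper: extracts one page using the location-based strategy.
    public static string GetPageText(PdfReader reader, int page)
    {
        // Create a new strategy for each page so chunks from earlier pages
        // are not accumulated into later results.
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}
```

In the question's loop, this would replace the two lines that create the `SimpleTextExtractionStrategy` and call `GetTextFromPage`.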

Sean Kornish
  • While the `LocationTextExtractionStrategy` generally is less dependent on the order in which text fragments appear in the page content, the `SimpleTextExtractionStrategy` is unlikely to be at fault here. For words in the same line they use very similar mechanisms to determine whether there are spaces in between or not. – mkl Dec 20 '12 at 23:13