Highlighting words are not displayed correctly in OCR PDF

Question

I highlighted the text "F O R M - 2" and "Title of the Invention :". The first string is highlighted correctly, but for the second string only "itle of the Invention :" is highlighted. I used the code below to highlight the words.

private void highlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string[] splitText)
{
    try
    {
        PdfReader reader = new PdfReader(outputFile);

        using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            using (PdfStamper stamper = new PdfStamper(reader, fs))
            {
                myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();

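                // Running the extraction fills the strategy with the text chunk positions;
                // the returned text itself is not used below.
                string currentText = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);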
                for (int i = 0; i < splitText.Length; i++)
                {
                    List<iTextSharp.text.Rectangle> MatchesFound = strategy.GetTextLocations(splitText[i].Trim(), StringComparison.CurrentCultureIgnoreCase);
                    foreach (Rectangle rect in MatchesFound)
                    {

                        float[] quad = { rect.Left , rect.Bottom, rect.Right, rect.Bottom, rect.Left , rect.Top , rect.Right, rect.Top  };
                        //Create our highlight
                        PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quad);
                        //Set the color
                        highlight.Color = BaseColor.YELLOW;

                        PdfAppearance appearance = PdfAppearance.CreateAppearance(stamper.Writer, rect.Width, rect.Height);
                        PdfGState state = new PdfGState();
                        state.BlendMode = new PdfName("Multiply");
                        appearance.SetGState(state);
                        appearance.Rectangle(0, 0, rect.Width, rect.Height);
                        appearance.SetColorFill(BaseColor.YELLOW);
                        appearance.Fill();

                        highlight.SetAppearance(PdfAnnotation.APPEARANCE_NORMAL, appearance);

                        //Add the annotation
                        stamper.AddAnnotation(highlight, pageno);
                    }
                }
            }
        }
        reader.Close();
        File.Copy(highLightFile, outputFile,true);
        File.Delete(highLightFile);
    }
    catch (Exception ex)
    {
        throw;
    }

}

Show your code, it'll be easier for others to answer. And have a look at [mcve]. – Arghya C Dec 01 '15 at 12:30
No, I didn't recommend anything. I asked you to add your code in the question, so that others can look at that and try to find problems. – Arghya C Dec 01 '15 at 13:12
Please share the PDF. At first glance I assume that the `myLocationTextExtractionStrategy` (still the one from [Jcis' answer](http://stackoverflow.com/a/11076968/1729265) I presume) is not perfect and has an issue with your document. Ah, and name the PDF viewer you use... – mkl Dec 01 '15 at 13:49
Yes, I took `myLocationTextExtractionStrategy` from Jcis' answer. PDF link: http://nsktex.com/pdf.zip. When I open the PDF in Adobe Reader, the highlight is also not displayed correctly. The PDF viewer I use is pdf.js (Firefox). It's not displaying correctly because of OCR PDF? – Karthik Dec 01 '15 at 13:55

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

As you already guessed,

"It's not displaying correctly because of OCR PDF"

or, more precisely, because the letters drawn below the image during OCR are positioned incorrectly relative to the image, while your code inspects those very letters to position the marker.

In more detail

Comparing a stripe around "Title of the Invention" in the scanned image with the corresponding stripe in the underlying OCR'ed information, one immediately recognizes that "Title of the Invention" appears a bit off to the right in the latter.

@BrunoLowagie made the difference even clearer:

"I've brought the text to the foreground and made it red so that you see how much difference there is between the image and the OCR:"

Because you retrieve the position by text extraction, the position you retrieve is also a bit off to the right.
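To see the offset in numbers rather than in screenshots, one could dump the baseline coordinates of every text chunk on the page and compare them with where the words appear in the scan. The following is only a minimal sketch, assuming iTextSharp 5.x as used in the question; the class `ChunkPositionDumper` and its methods `Describe` and `Dump` are hypothetical names introduced for illustration and are not part of iTextSharp or of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical diagnostic helper: records each text chunk of a page together
// with the x-range and y-position of its baseline (PDF user space units).
public class ChunkPositionDumper : IRenderListener
{
    public readonly List<string> Lines = new List<string>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Called by the content parser for every text-showing operation.
    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();
        Lines.Add(Describe(renderInfo.GetText(),
                           start[Vector.I1], end[Vector.I1], start[Vector.I2]));
    }

    // Formats one chunk; invariant culture keeps the decimal separator stable.
    public static string Describe(string text, float xStart, float xEnd, float y)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "\"{0}\" x={1:F1}..{2:F1} y={3:F1}", text, xStart, xEnd, y);
    }

    // Parses one page and prints every chunk with its position.
    public static void Dump(string pdfPath, int pageno)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            ChunkPositionDumper dumper = new ChunkPositionDumper();
            new PdfReaderContentParser(reader).ProcessContent(pageno, dumper);
            foreach (string line in dumper.Lines)
            {
                Console.WriteLine(line);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `ChunkPositionDumper.Dump(outputFile, pageno)` would list the chunks of the OCR text layer. If the x-range reported for "Title of the Invention" starts further right than the word does in the scanned image, the highlight rectangle built from those coordinates will be shifted in the same way.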

A quicker check

If you simply search for "Title of the Invention" in Adobe Reader, you can also recognize the issue:

The whole page

Looking at the OCR'ed information of the whole page, one recognizes that its quality is poor overall. You should therefore expect many similar issues when processing this document.

The whole scanned page

The OCR'ed information of the whole page

answered Dec 01 '15 at 14:40

mkl

You've beaten me in speed @mkl ;-) I had just made a screen shot. I've added it to your answer. – Bruno Lowagie Dec 01 '15 at 14:50
Thank you for your valuable information. You said myLocationTextExtractionStrategy is not correct. Do you have any code for extracting text location? – Karthik Dec 01 '15 at 14:50
@Karthik *You said myLocationTextExtractionStrategy is not correct.* - I assumed that **before I inspected the PDF**. Now I stand corrected: `myLocationTextExtractionStrategy` is correct here, and the OCR'ed information in the PDF is highly inaccurate. – mkl Dec 01 '15 at 14:53
@Bruno ;) thanx, your image illustrates the difference even better. – mkl Dec 01 '15 at 14:55
Thank you for all your help. – Karthik Dec 01 '15 at 14:57
