I have a small PDF document which is a cropped version from a larger PDF document. I was surprised to see that itext5 (using custom location strategy) was still re-producing all the text that had been left out after cropping. None of this text is visible through Acrobat reader.

How could I make itext 5 detect and ignore such hidden text? Link to PDF with hidden text

EDIT 1 - wrong document was hyperlinked

EDIT 2 - Code snippet attached

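public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
  // Override the base method so that iText calls this version.
  public override void RenderText(TextRenderInfo renderInfo)
  {
    base.RenderText(renderInfo); // let the base strategy collect the chunk
    string text = renderInfo.GetText();
  }
}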

thanks, Saurabh

Sau001
  • what do you mean by "re-produce" ? – blagae May 14 '18 at 12:30
  • The text is still there, even when you open the document in Adobe Acrobat (there is no such thing as Acrobat Reader anymore). Just ask Acrobat to show hidden text, and the text will reappear. Note: this functionality isn't available in Adobe Reader. Since you talk about Acrobat Reader, it's hard to know if you're talking about Adobe Acrobat or Adobe Reader. – Bruno Lowagie May 14 '18 at 12:32
  • If you want to extract the text within a certain rectangle (such as the crop box), you need to limit the location that is examined by iText to that rectangle. It depends on the version of iText 5 you are using whether or not that's possible. I'm pretty sure iText 7 supports it. In any case: you're not showing any code, so we can't check which `Strategy` you are using to extract the text. – Bruno Lowagie May 14 '18 at 12:33
  • Thanks for the quick response. I just realized that I had hyperlinked with the document which was the original and not the cropped. Corrected now. – Sau001 May 14 '18 at 12:44
  • public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy { public void RenderText(TextRenderInfo renderInfo) { string text = renderInfo.GetText(); } } – Sau001 May 14 '18 at 12:47
  • The coding approach is nearly identical to what has been demonstrated in this SO question: https://stackoverflow.com/questions/23909893/getting-coordinates-of-string-using-itextextractionstrategy-and-locationtextextr – Sau001 May 14 '18 at 12:52
  • Hi @blagae, By "re-produce" , I meant that the API of itextsharp was emitting text which is not visible at all through Acrobat. Sample document hyperlinked. – Sau001 May 14 '18 at 12:59
  • Hi @BrunoLowagie , I am using **Adobe Acrobat Reader DC** , Version 18. Thank you. – Sau001 May 14 '18 at 13:01
  • Which version of iText are you using (you are using iText 5.x.y, but what are the values of x and y)? Also: in the code snippet you share in the comments, I don't see you limiting the text extraction strategy to the crop box. That means that you want *all* the content stored in the document, not just the cropped content. – Bruno Lowagie May 14 '18 at 13:12
  • Hi @BrunoLowagie, You are right. I am using itext 5. You are also right in my understanding that I want all the visible text in the document. When I say visible , I mean text which is visible through **Adobe Acrobat Reader DC**. I do not want to limit text to any specific area. Just all the visible text. Thank you. – Sau001 May 14 '18 at 13:17
  • I think you don't understand what I'm saying. Part of the content is cropped so that it isn't visible in Adobe Reader. You *need* to limit the text extraction to the area defined by the crop box. Of course, that's impossible when you say you do not want to limit text to any specific area. In other words: what you need is technically possible, but you make it impossible because you do not want what you need. That's a pity. It means that no one can help you. – Bruno Lowagie May 14 '18 at 14:27
  • *"I want all the visible text in the document. When I say visible , I mean text which is visible through Adobe Acrobat Reader DC. I do not want to limit text to any specific area. Just all the visible text."* - Then text extraction probably is the wrong approach. Text extraction is more like Ctrl-A Ctrl-C from Adobe Reader and less like reading what there is to see. Text can be invisible for a number of reasons, e.g. drawing white on white, invisible glyphs in a font, text covered by some image, ... Text extraction will extract all those "invisible" text pieces... – mkl May 14 '18 at 21:53
  • You never answered the question which version of iText 5 you are using. – Amedee Van Gasse May 14 '18 at 21:54
  • Hi @AmedeeVanGasse , iText version is 5.5.12.0. Thanks. – Sau001 May 14 '18 at 22:07
  • Hi @BrunoLowagie, I am happy to use your approach. However, considering the document that I have hyperlinked, how would I know what the coordinates are of the rectangular region which is actually visible to the user? The **GetPageSize** method of the **iTextSharp.text.pdf.PdfReader** class gives me the full page width and height. These are the dimensions of the original document from which I created the cropped version. Obviously, this would once again end up grabbing all the text, visible and invisible. Thanks. – Sau001 May 14 '18 at 22:13
  • Hi @mkl, Thanks for responding. If you refer to the sample document that I have hyperlinked, you will observe that doing a **CTRL+A and CTRL+C** ends up grabbing only the text that is visible to the human eye. However, iText will extract many more blocks of text. – Sau001 May 14 '18 at 22:18
  • Indeed, the `GetPageSize()` method returns the `/MediaBox`, but I didn't tell you to look at the `/MediaBox`. I told you to look at the `/CropBox`. – Bruno Lowagie May 15 '18 at 06:35
  • @BrunoLowagie *"there is no such thing as Acrobat Reader anymore"* - unfortunately Adobe swerves back and forth concerning its naming policy. The very early Reader versions were called "Adobe Acrobat Reader", then for quite a number of versions it was "Adobe Reader", and now it's "Adobe Acrobat Reader" again. – mkl May 15 '18 at 07:38

1 Answer

From the combination of

I was surprised to see that itext5 (using custom location strategy) was still re-producing all the text that had been left out after cropping.

and of your code snippet

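public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
  // Override the base method so that iText calls this version.
  public override void RenderText(TextRenderInfo renderInfo)
  {
    base.RenderText(renderInfo); // let the base strategy collect the chunk
    string text = renderInfo.GetText();
  }
}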

I assume that you actually are surprised that in the RenderText method of your MyLocationTextExtractionStrategy you retrieve TextRenderInfo objects for text beyond the crop box.

But that is exactly the expected behavior! Your RenderText method implements the corresponding method of the IRenderListener interface, and iText calls the methods of this interface for each matching drawing instruction in the page content, whether or not the result will eventually be visible.

How could I make itext 5 detect and ignore such hidden text?

You can detect and ignore text outside the crop box fairly easily by checking the text coordinates against the coordinates of the crop box of the current document page.
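As a rough illustration of such a coordinate check, here is a minimal sketch in plain C# that does not use iText at all. The `Box` type and the `TouchesCropBox` helper are hypothetical names introduced only for this example; in real code the crop box would come from the reader, and the start and end points would come from the baseline of each text chunk. Note that comparing bounding boxes is only an approximation for rotated text.

```csharp
using System;

// Hypothetical stand-in for a PDF rectangle, given by two opposite corners.
public readonly struct Box
{
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public Box(float x1, float y1, float x2, float y2)
    {
        // Normalize so that Left <= Right and Bottom <= Top.
        Left = Math.Min(x1, x2);
        Bottom = Math.Min(y1, y2);
        Right = Math.Max(x1, x2);
        Top = Math.Max(y1, y2);
    }

    // True if the two boxes share at least one point.
    public bool Overlaps(Box other) =>
        Left <= other.Right && other.Left <= Right &&
        Bottom <= other.Top && other.Bottom <= Top;
}

public static class CropCheck
{
    // Keep a text chunk if the bounding box of its baseline
    // (start point to end point) overlaps the crop box.
    public static bool TouchesCropBox(Box cropBox,
        float startX, float startY, float endX, float endY) =>
        cropBox.Overlaps(new Box(startX, startY, endX, endY));
}
```

A chunk whose baseline lies entirely to the right of the crop box, for example, is rejected, while one that starts inside the box is kept.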

iText actually contains a filter architecture that keeps text chunks which do not fulfill some criterion from reaching the strategy.

If, for example, you currently use your strategy like this:

MyLocationTextExtractionStrategy strategy = new MyLocationTextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

you can apply a crop box region filter like this:

MyLocationTextExtractionStrategy strategy = new MyLocationTextExtractionStrategy();
FilteredTextRenderListener strategyWithFilter = new FilteredTextRenderListener(strategy,
        new RegionTextRenderFilter(pdfReader.GetCropBox(1)));
PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategyWithFilter);
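To cover a whole document, the same filter can be set up once per page, since each page can have its own crop box. The following is only a sketch, assuming iTextSharp 5.5.x. The class name `CropBoxTextExtractor`, the method `ExtractVisibleText`, and the `path` parameter are made up for illustration, and the plain `LocationTextExtractionStrategy` stands in for your own strategy.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iText.
public static class CropBoxTextExtractor
{
    public static string ExtractVisibleText(string path)
    {
        StringBuilder result = new StringBuilder();
        PdfReader pdfReader = new PdfReader(path);
        try
        {
            // iText page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page; replace it with your own strategy class.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

                // The same page number goes to GetCropBox and to GetTextFromPage.
                ITextExtractionStrategy strategyWithFilter = new FilteredTextRenderListener(
                    strategy, new RegionTextRenderFilter(pdfReader.GetCropBox(page)));

                result.AppendLine(
                    PdfTextExtractor.GetTextFromPage(pdfReader, page, strategyWithFilter));
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return result.ToString();
    }
}
```

Creating the filter inside the loop keeps the crop box and the extracted page in sync, which matters as soon as pages differ in their crop boxes.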

As an aside:

I want all the visible text in the document. When I say visible , I mean text which is visible through Adobe Acrobat Reader DC. I do not want to limit text to any specific area. Just all the visible text.

  • Text can be invisible for a number of reasons other than being beyond the crop box borders, e.g.

    • it may be drawn in the same color as the background, e.g. white on white,
    • some setting or operation may transform text color and background color to the same color even though they may differ originally,
    • a text rendering mode may be used that doesn't draw anything to start with,
    • the glyphs of the font used for the text may be invisible,
    • the text may be covered by some image,
    • ...

    Text extraction will extract all those "invisible" text pieces.

    (Up to a point, you can extend your text extraction framework to recognize this; you can find many questions and answers on such extensions here on Stack Overflow. But there will always be some case you did not cover.)

  • Also, text may be only partially covered. For example, consider the letter 'R' with a white rectangle covering its right leg, making it look like a 'P'.

    Text extraction will return 'R' even though Adobe Reader displays something you recognize as 'P'.

  • Fonts may have incomplete or outright wrong information on which Unicode character one of its glyphs corresponds to.

    For text using such a font, text extraction will return wrong output or possibly no output at all.

  • Text in a PDF may not be drawn using text drawing instructions but instead using vector graphics, like any other arbitrary shape.

    Text extraction won't extract such "text" at all.

  • ...

If these issues are a show stopper for you, text extraction is the wrong technology for you and you should use OCR instead.
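One of the cases listed above, a text rendering mode that does not draw anything, can probably be caught with a small custom filter. This is a sketch under the assumption that your iTextSharp version exposes `TextRenderInfo.GetTextRenderMode()`; the class name `InvisibleRenderModeFilter` is hypothetical. In PDF, render mode 3 means neither fill nor stroke, and mode 7 only adds the text to the clipping path.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter, not part of iText: rejects text whose rendering mode paints nothing.
public class InvisibleRenderModeFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        int mode = renderInfo.GetTextRenderMode();
        // 3 = invisible (neither filled nor stroked), 7 = add to clipping path only.
        return mode != 3 && mode != 7;
    }
}
```

Because `FilteredTextRenderListener` accepts several filters, an instance of this class could be passed alongside the `RegionTextRenderFilter`. It does not address the other cases, such as white-on-white text or text covered by an image.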

mkl
  • Thank you. I have accepted your answer. Just one last question. In your code snippet, `pdfReader.GetCropBox(1)`, how did you arrive at the argument 1? Could there be more than one crop box? – Sau001 May 15 '18 at 14:17
  • @Sau Each page can have its own crop box, the `1` you ask about is the page number. It had better match the second parameter of `PdfTextExtractor.GetTextFromPage` which also denotes the page number. – mkl May 15 '18 at 14:53