From the combination of
I was surprised to see that itext5 (using custom location strategy) was still re-producing all the text that had been left out after cropping.
and your code snippet
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    // Without 'override' this method merely hides the base implementation and
    // is never called through the ITextExtractionStrategy interface; the base
    // call keeps the location strategy's normal chunk collection working.
    public override void RenderText(TextRenderInfo renderInfo)
    {
        string text = renderInfo.GetText();
        base.RenderText(renderInfo);
    }
}
I assume that you actually are surprised that in the RenderText
method of your MyLocationTextExtractionStrategy
you retrieve TextRenderInfo
objects for text beyond the crop box.
But exactly that is to be expected: your RenderText
method implements the corresponding method of the IRenderListener
interface, and the methods of this interface are called for every matching drawing instruction in the page content, regardless of whether its result will eventually be visible or not.
How could I make itext 5 detect and ignore such hidden text?
You can detect and ignore text outside the crop box fairly easily by checking the text coordinates against the crop box of the current document page.
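Doing that check by hand inside the strategy could look like the following sketch (the class name is made up for illustration; it forwards only those chunks whose baseline starts inside the crop box to the base strategy):

```csharp
public class CropBoxAwareStrategy : LocationTextExtractionStrategy
{
    private readonly iTextSharp.text.Rectangle cropBox;

    public CropBoxAwareStrategy(iTextSharp.text.Rectangle cropBox)
    {
        this.cropBox = cropBox;
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Start point of the text chunk's baseline in user space.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        float x = start[Vector.I1];
        float y = start[Vector.I2];

        // Only forward chunks starting inside the crop box.
        if (x >= cropBox.Left && x <= cropBox.Right &&
            y >= cropBox.Bottom && y <= cropBox.Top)
        {
            base.RenderText(renderInfo);
        }
    }
}
```

That said, you don't have to do this yourself, because iText's filter architecture described next achieves the same more cleanly.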
iText actually contains a filter architecture which allows you to prevent text chunks that don't fulfill some criterion from reaching the strategy.
If you e.g. currently use your strategy like this:
MyLocationTextExtractionStrategy strategy = new MyLocationTextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
you can apply a crop box region filter like this:
MyLocationTextExtractionStrategy strategy = new MyLocationTextExtractionStrategy();
FilteredTextRenderListener strategyWithFilter = new FilteredTextRenderListener(strategy,
new RegionTextRenderFilter(pdfReader.GetCropBox(1)));
PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategyWithFilter);
As an aside:
I want all the visible text in the document. When I say visible, I mean text which is visible through Adobe Acrobat Reader DC. I do not want to limit text to any specific area. Just all the visible text.
Text can be invisible for a number of reasons other than being beyond the crop box borders, e.g.
- it may be drawn in the same color as the background, e.g. white on white,
- some setting or operation may transform text color and background color to
the same color even though they may differ originally,
- a text rendering mode may be used that doesn't draw anything to start with,
- the glyphs of the font used for the text may be invisible,
- the text may be covered by some image,
- ...
Text extraction will extract all those "invisible" text pieces.
(Up to a certain point you can extend your text extraction framework to recognize this; you can find many questions and answers on such extensions here on Stack Overflow, but there will always be some case you did not cover.)
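As an example of such an extension, one of the easier cases to catch is text drawn with text rendering mode 3 ("neither fill nor stroke", i.e. invisible, as OCR software often produces for its text layer). A sketch, assuming a custom strategy class:

```csharp
public class VisibleTextOnlyStrategy : LocationTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        // GetTextRenderMode() returns the Tr value of the current
        // graphics state; 3 means the glyphs are not painted at all.
        if (renderInfo.GetTextRenderMode() != 3)
        {
            base.RenderText(renderInfo);
        }
    }
}
```

This catches only that one cause of invisibility; the other cases in the list above (matching colors, covering images, ...) each need their own, considerably more involved, checks.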
Also, text may be only partially covered. Consider, for example, the letter 'R' with a white rectangle covering its right leg, making it look like a 'P'.
Text extraction will return 'R' even though Adobe Reader displays something you recognize as a 'P'.
Fonts may have incomplete or outright wrong information about which Unicode character a given glyph corresponds to.
Text extraction will return wrong output for text using such a font, or possibly no output at all.
Text in a PDF may not be drawn using text drawing instructions at all but instead as vector graphics, i.e. arbitrary paths.
Text extraction won't extract such "text" at all.
...
If these issues are show stoppers for you, text extraction is the wrong technology for your task and you should use OCR instead.