The cause
Interestingly enough, the entire text of the TJ instruction containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
This actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces. Thus, that whole sentence is accepted by the filter.
What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:
    /**
     * Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
     * @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The next effect is if there was a separate Tj opertion for each character in the rendered string
     * @since 5.3.3
     */
    public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
    virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net
Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (your filtered listener in your case), with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the splinters one by one.
A Java sample
As the OP asked for more details, here is some more code. As I'm predominantly working with Java, I'm providing it in Java for iText, but it is easy to port to C# for iTextSharp.
As mentioned above, a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class TextRenderInfoSplitter:
    package stackoverflow.itext.extraction;

    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;

    public class TextRenderInfoSplitter implements TextExtractionStrategy
    {
        public TextRenderInfoSplitter(TextExtractionStrategy strategy)
        {
            this.strategy = strategy;
        }

        public void renderText(TextRenderInfo renderInfo)
        {
            for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            {
                strategy.renderText(info);
            }
        }

        public void beginTextBlock()
        {
            strategy.beginTextBlock();
        }

        public void endTextBlock()
        {
            strategy.endTextBlock();
        }

        public void renderImage(ImageRenderInfo renderInfo)
        {
            strategy.renderImage(renderInfo);
        }

        public String getResultantText()
        {
            return strategy.getResultantText();
        }

        final TextExtractionStrategy strategy;
    }
If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you can now feed it with single-character TextRenderInfo instances like this:
String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));
I tested it with the PDF created in this answer for the area
Rectangle rect = new Rectangle(200, 600, 200, 135);
For reference I marked the area in the PDF:

Text extraction filtered by area without the TextRenderInfoSplitter results in:
I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox
Text extraction filtered by area with the TextRenderInfoSplitter results in:
to create a PDF f
ntents in the docu
n g P D F
BTW, here you see a disadvantage of splitting the text into individual characters early: the final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies can still easily see that the line consists of the two words "using" and "PDFBox". As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
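The effect can be reproduced with a deliberately simplified model. The sketch below is not iText code: the Chunk class, the gap threshold, and the glyph and spacing widths are all hypothetical, and the only assumption is that an extraction strategy inserts a space whenever the horizontal gap between two consecutive chunks exceeds some threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SpacingHeuristicDemo {
    /** Hypothetical stand-in for a text chunk: its text and its x extent on the baseline. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Joins chunks left to right, inserting a space when the gap to the previous chunk exceeds the threshold. */
    static String join(List<Chunk> chunks, float gapThreshold) {
        StringBuilder result = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : chunks) {
            if (previous != null && chunk.startX - previous.endX > gapThreshold) {
                result.append(' ');
            }
            result.append(chunk.text);
            previous = chunk;
        }
        return result.toString();
    }

    /** Splits a chunk into one chunk per character; glyphs are glyphWidth wide and separated by charSpacing. */
    static List<Chunk> split(Chunk chunk, float glyphWidth, float charSpacing) {
        List<Chunk> result = new ArrayList<>();
        float x = chunk.startX;
        for (char c : chunk.text.toCharArray()) {
            result.add(new Chunk(String.valueOf(c), x, x + glyphWidth));
            x += glyphWidth + charSpacing;
        }
        return result;
    }

    public static void main(String[] args) {
        // A widely spaced word: 5 unit glyphs with 10 units of character spacing (made-up numbers).
        Chunk word = new Chunk("PDF", 0f, 35f);
        // Kept whole, the word has no internal gaps the heuristic could see.
        System.out.println(join(Collections.singletonList(word), 2.5f));
        // Split into characters, every gap exceeds the threshold.
        System.out.println(join(split(word, 5f, 10f), 2.5f));
    }
}
```

In this model the whole chunk comes out as one word, while the split chunk comes out as single letters separated by spaces, which matches the last line of the extraction result above.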
An improvement
The highlighted word "eliminate" is for instance extracted as "o eliminate t". This has been highlighted by double clicking the word and highlighted in Adobe Acrobat Reader.
Something similar happens in my sample above: letters barely touching the area of interest make it into the result.
This is due to the RegionTextRenderFilter implementation of allowText, which lets all text pass whose baseline intersects the rectangle in question, even if the intersection consists of merely a single dot:
    public boolean allowText(TextRenderInfo renderInfo){
        LineSegment segment = renderInfo.getBaseline();
        Vector startPoint = segment.getStartPoint();
        Vector endPoint = segment.getEndPoint();
        float x1 = startPoint.get(Vector.I1);
        float y1 = startPoint.get(Vector.I2);
        float x2 = endPoint.get(Vector.I1);
        float y2 = endPoint.get(Vector.I2);
        return filterRect.intersectsLine(x1, y1, x2, y2);
    }
Given that you first split the text into characters, you might want to check whether each character's baseline is completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and then replacing the line
return filterRect.intersectsLine(x1, y1, x2, y2);
by
return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);
Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
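To illustrate how the two checks differ, here is a small stand-alone sketch. It assumes the filter rectangle behaves like a java.awt.geom.Rectangle2D (the intersectsLine call in the code above suggests as much); the class name and the baseline coordinates are made up for the example, and no iText classes are involved.

```java
import java.awt.geom.Rectangle2D;

final class BaselineChecks {
    /** Lenient check, as in the allowText code above: any overlap of baseline and rectangle suffices. */
    static boolean intersects(Rectangle2D rect, float x1, float y1, float x2, float y2) {
        return rect.intersectsLine(x1, y1, x2, y2);
    }

    /** Stricter check: both end points of the baseline must lie inside the rectangle. */
    static boolean containsBaseline(Rectangle2D rect, float x1, float y1, float x2, float y2) {
        return rect.contains(x1, y1) && rect.contains(x2, y2);
    }

    public static void main(String[] args) {
        // Same numbers as the test area above, read as x, y, width, height.
        Rectangle2D rect = new Rectangle2D.Float(200, 600, 200, 135);

        // Hypothetical glyph baseline straddling the left edge at x = 200.
        System.out.println(intersects(rect, 195, 650, 203, 650));       // true
        System.out.println(containsBaseline(rect, 195, 650, 203, 650)); // false

        // Hypothetical glyph baseline fully inside the rectangle.
        System.out.println(intersects(rect, 210, 650, 218, 650));       // true
        System.out.println(containsBaseline(rect, 210, 650, 218, 650)); // true
    }
}
```

Note that Rectangle2D.contains treats points on the right and top edges as outside, so a glyph ending exactly on such an edge is rejected by the stricter check.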