I want to highlight several keywords in a set of PDF files. Firstly, we have to identify the single words and match them with my keywords. I found an example:
class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
List<string> topicTerms;
public MyLocationTextExtractionStrategy(List<string> topicTerms)
{
this.topicTerms = topicTerms;
}
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
//filter the meaingless words
string text = renderInfo.GetText();
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
However, I found so many words are broken. For example, "stop" will be "st" and "op". Are there any other method to identify a single word and its position?