I'm working with PDFs using iTextSharp. So far, I can already extract data from a specific area. However, I would like to make this more flexible by finding the coordinates of the first letter (or number) of a desired word and then building a rectangle from those coordinates to crop around that word. It would be great if anyone could give me a short example. Thank you.
-
Which iText version do you use, a 5.5.x or a 7.0.x? – mkl Nov 14 '17 at 16:35
-
@mkl I'm using 5.5.x, sir – tumsd923 Nov 15 '17 at 01:37
-
Ah. Joris' answer uses iText 7. – mkl Nov 15 '17 at 05:28
-
@mkl is it different ? – tumsd923 Nov 15 '17 at 06:21
-
The API of iText 7 is completely redesigned. You can use the ideas of the iText 7 code but the implementation looks decidedly different in iText 5.5.x. – mkl Nov 15 '17 at 07:51
-
@mkl Is there no HorizontalTextExtractionStrategy in iTextSharp 5.5.10? I couldn't use it. I'm also facing an issue with text alignment. – tumsd923 Nov 16 '17 at 03:32
-
The `HorizontalTextExtractionStrategy` originally presented in [this answer](https://stackoverflow.com/a/33697745/1729265) for iText(Sharp) up to version 5.5.8 has already been ported there to versions 5.5.9 and up for Java as `HorizontalTextExtractionStrategy2`. It should not be too difficult to do the same for the .Net version. If you indeed mean that strategy, I can look into that port. – mkl Nov 16 '17 at 08:42
-
@mkl I looked from your answer in [this topic](https://stackoverflow.com/questions/35344982/itext-extracted-text-from-pdf-file-using-locationtextextractionstrategy-is-in-w). Are there anyway to use it C# ? Thanks. – tumsd923 Nov 17 '17 at 01:16
-
*"Are there anyway to use it C# ?"* - as I already said in my previous comment: It should not be too difficult to do the same for the .Net version. If you indeed mean that strategy, I can look into that port. – mkl Nov 17 '17 at 21:57
1 Answer
The basic idea here is to use IEventListener to get notified of TextRenderInfo events. Then split these into CharacterRenderInfo, and then ask for the bounding box of each of those.
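```java
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class CharacterRenderInfoGetter implements IEventListener {

    private List<CharacterRenderInfo> characterRenderInfoList = new ArrayList<>();

    @Override
    public void eventOccurred(IEventData iEventData, EventType eventType) {
        if (eventType == EventType.RENDER_TEXT) {
            TextRenderInfo tri = (TextRenderInfo) iEventData;
            // split each text chunk into single-character render infos
            for (TextRenderInfo subTri : tri.getCharacterRenderInfos()) {
                characterRenderInfoList.add(new CharacterRenderInfo(subTri));
            }
        }
    }

    public List<CharacterRenderInfo> getCharacterRenderInfoList() {
        // sorting relies on CharacterRenderInfo being Comparable in your iText 7 version
        Collections.sort(characterRenderInfoList);
        return characterRenderInfoList;
    }

    @Override
    public Set<EventType> getSupportedEvents() {
        // null means the listener is notified of all event types
        return null;
    }
}
```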
You can then use this class like so:
```java
// needs java.io.File, com.itextpdf.kernel.pdf.PdfDocument, com.itextpdf.kernel.pdf.PdfReader
// and com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor
File inputFile = getInputFiles()[0]; // provide your own implementation of course
// create an iText PdfDocument out of the File
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
// construct the IEventListener that will collect the per-character render infos
CharacterRenderInfoGetter characterRenderInfoGetter = new CharacterRenderInfoGetter();
PdfCanvasProcessor processor = new PdfCanvasProcessor(characterRenderInfoGetter);
/* Here we explicitly tell the processor to handle page 1 (the first page of the document);
 * you can loop over all pages if you want to repeat this.
 */
processor.processPageContent(pdfDocument.getPage(1));
pdfDocument.close();
```
I know this code is written in Java. But the .NET equivalent should be very similar. At the very least it's good pseudo-code.
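Once the per-character bounding boxes are available, locating a word and building the crop rectangle around it is plain geometry. The following is only a minimal sketch of that step, independent of iText: `CharBox` and `WordLocator` are hypothetical helper classes made up for this illustration, and the sketch assumes the characters are already in reading order and that each one maps to exactly one Java `char`.

```java
import java.util.List;

/** Hypothetical stand-in for one extracted character and its bounding box. */
final class CharBox {
    final char ch;
    final float left, bottom, right, top;

    CharBox(char ch, float left, float bottom, float right, float top) {
        this.ch = ch;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

/** Hypothetical helper: finds a word in a character list and merges its boxes. */
final class WordLocator {

    /**
     * Returns {left, bottom, right, top} enclosing the first occurrence of
     * word, or null if the word does not occur or is empty.
     */
    static float[] findWordBox(List<CharBox> chars, String word) {
        if (word.isEmpty()) {
            return null;
        }
        // Rebuild the text so that string index i corresponds to chars.get(i).
        StringBuilder text = new StringBuilder();
        for (CharBox c : chars) {
            text.append(c.ch);
        }
        int start = text.indexOf(word);
        if (start < 0) {
            return null;
        }
        // Union of the boxes of the matched characters.
        float left = Float.MAX_VALUE, bottom = Float.MAX_VALUE;
        float right = -Float.MAX_VALUE, top = -Float.MAX_VALUE;
        for (CharBox c : chars.subList(start, start + word.length())) {
            left = Math.min(left, c.left);
            bottom = Math.min(bottom, c.bottom);
            right = Math.max(right, c.right);
            top = Math.max(top, c.top);
        }
        return new float[] {left, bottom, right, top};
    }
}
```

With iText, the `CharBox` values would be filled from the bounding box of each collected character render info, and the resulting rectangle could then serve as the region for the area-based extraction the question already has working.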

Joris Schellekens