Extract PDF text by coordinates

Question

I'd like to know if there's a PDF library for Microsoft .NET capable of extracting the text inside a region specified by coordinates.

For example (in pseudo-code):

PdfReader reader = new PdfReader();
reader.Load("file.pdf");

// Top, bottom, left, right in pixels or any other unit
string wholeText = reader.GetText(100, 150, 20, 50);

I've tried to do so using PDFBox for .NET (the one that runs on top of IKVM) with no luck, and it seems to be very outdated and undocumented.

Perhaps someone has a good sample of doing this with PDFBox, iTextSharp or any other open-source library and can give me a hint.

Thank you in advance.

Don't you think that zooming a view would change what text is at the designated coordinates? Pulling data based on their position in the representation, especially when it might change, seems to me like a functionality that the lib developers wouldn't just bother to realize in their application. — Maxim V. Pavlov, Sep 13 '11 at 16:36
I don't know of any open-source library capable of this. If a commercial library is an option, I could provide one or two links. – Yahia, Sep 13 '11 at 16:37
@Maxim You're right, but my project will have a fixed-size PDF viewer, so I believe this isn't the situation you're talking about. For example, in Adobe Reader, when you select something like an image and you zoom-in, and zoom-out, the selection gets resized too. Maybe this can be achieved someway with some library. In fact, Apache PDFBox has something like selecting regions providing a rectangle, meaning that I'm not as crazy as you thought :D — Matías Fidemraizer, Sep 13 '11 at 16:40
@Yahia, it'll depend on pricing, but give me these hints in comments and I'll take a look. — Matías Fidemraizer, Sep 13 '11 at 16:41
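
On the zoom concern raised in the comments above: PDF page coordinates are expressed in points (1/72 inch by default) and do not change with the viewer's zoom level, so a rectangle selected on screen can be mapped back to page units before it is handed to a library. Below is a minimal sketch of that mapping, assuming the viewer renders at a known DPI and zoom factor. The names ViewerCoordinates and PixelsToPoints are hypothetical and not part of any PDF library.

```csharp
using System;

// Hypothetical helper for illustration; not part of PDFBox or iTextSharp.
public static class ViewerCoordinates
{
    // PDF user space uses units of 1/72 inch by default.
    public const double PointsPerInch = 72.0;

    // Converts a length measured in screen pixels to PDF points.
    // dpi:  resolution the viewer renders at when zoom is 1.0 (assumed known).
    // zoom: current zoom factor, e.g. 2.0 for 200%.
    public static double PixelsToPoints(double pixels, double dpi, double zoom)
    {
        if (dpi <= 0 || zoom <= 0)
            throw new ArgumentException("dpi and zoom must be positive.");
        return pixels * PointsPerInch / (dpi * zoom);
    }
}
```

For instance, ViewerCoordinates.PixelsToPoints(192, 96, 2.0) gives 72 points: 192 pixels at 200% zoom on a 96 DPI rendering correspond to one inch on the page.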

score 8 · Accepted Answer · answered Sep 13 '11 at 17:00

Well, thank you all for your effort.

I got it working using Apache PDFBox compiled for .NET with IKVM, and this is the final code:

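PDDocument doc = PDDocument.load(@"c:\invoice.pdf");

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// java.awt.Rectangle takes (x, y, width, height).
stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
// get(0) is the first page of the document.
stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));

string text = stripper.getTextForRegion("testRegion");

// Release the document once the text has been read.
doc.close();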

And it works like a charm.

Thank you anyway, and I hope my own answer will help others. If you need further details, just leave a comment here and I'll update this answer.

Matías Fidemraizer

I want the same thing in PDFsharp or something else in C#. Please help me. – Vivek Parikh Apr 17 '12 at 10:13
To use the code in the answer above you must set up IKVM. It is simple, though it took me some time to figure out. You need to reference the following libraries: IKVM.OpenJDK.Core.dll, IKVM.OpenJDK.SwingAWT.dll, pdfbox-1.8.2.dll (generate it with the command ikvmc -target:library pdfbox-1.8.2.jar), IKVM.OpenJDK.Util.dll, IKVM.Runtime.dll – Alexander Smirnov Nov 08 '13 at 07:41
It worked like a charm for me in Java as well. The only thing is that I had to find the coordinates of the text I wanted to pull by trial and error. Any thoughts on how I can get those from a PDF? I tried a few options from SO, but they didn't work well (https://stackoverflow.com/questions/8971243/free-tool-for-watching-coordinates-in-pdf). – Gaurav Parek Feb 28 '20 at 05:22

score 3 · Answer 2 · answered Aug 03 '12 at 08:17

This should work:

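// Rectangle corners are in PDF points: lower-left (llx, lly) and upper-right (urx, ury).
RenderFilter[] filters = new RenderFilter[1];
filters[0] = new RegionTextRenderFilter(new Rectangle(llx, lly, urx, ury));
// The location-based strategy does the extraction; the filter limits it to the region.
LocationTextExtractionStrategy baseStrategy = new LocationTextExtractionStrategy();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(baseStrategy, filters);

// i is the 1-based page number.
String result = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
Console.WriteLine(result);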
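
The snippet above leaves llx, lly, urx and ury undefined. iTextSharp's Rectangle is built from the lower-left and upper-right corners, measured from the bottom-left of the page, whereas a region given as x, y, width, height (as in the PDFBox answer) is, as far as I know, measured from the top-left. Here is a minimal sketch of the conversion, assuming an unrotated page whose origin is at (0, 0) and whose height in points is known. The names RegionConverter and ToLowerLeftUpperRight are hypothetical and not part of iTextSharp.

```csharp
using System;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class RegionConverter
{
    // Converts a rectangle given as (x, y, width, height) with a top-left origin
    // into { llx, lly, urx, ury } with a bottom-left origin.
    // All values are in PDF points; pageHeight is the height of the page.
    // Assumes an unrotated page whose lower-left corner is at (0, 0).
    public static float[] ToLowerLeftUpperRight(
        float x, float y, float width, float height, float pageHeight)
    {
        float llx = x;
        float lly = pageHeight - (y + height); // bottom edge, measured from the page bottom
        float urx = x + width;
        float ury = pageHeight - y;            // top edge, measured from the page bottom
        return new[] { llx, lly, urx, ury };
    }
}
```

For example, the region (0, 10, 100, 100) on a page 842 points tall (roughly A4) becomes llx = 0, lly = 732, urx = 100, ury = 832.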

score 3 · Answer 3 · answered Sep 13 '11 at 16:44

It's not open source, but hopefully this helps you (and potentially anyone else using ABCPDF!)

I did this earlier today by looping over the available fields in the PDF. This means that the PDF you are using needs to be created properly and you need to know the field name that you want to get the text for (you could work this out by adding a breakpoint and looping through the available fields).

WebSupergoo.ABCpdf6.Doc newPDF = new WebSupergoo.ABCpdf6.Doc();
newPDF.Read("existing_file.pdf");

foreach ( WebSupergoo.ABCpdf6.Objects.Field field in newPDF.Form.Fields )
{
    if ( field.Name == "Text1" )
    {
        // update "Text1"
        field.Value = "new value for Text1";
    }
}

newPDF.Save("new_file.pdf");

newPDF.Clear();

In the example, "Text1" is the name of the field that is being updated. Note I am also providing an example for saving out updated field(s).

Hopefully that at least gives you an idea of how to approach this problem.

Uhm, iterating over fields isn't the goal of my question. I need to give coordinates and get what's within a rectangle. Sorry, but thank you anyway. – Matías Fidemraizer, Sep 13 '11 at 16:46
The field object exposes a property "Rect". If you know the position they clicked, you could return the field that matches the co-ordinates by looking at top/bottom/right/left of the Rect. There might be a better way of doing it with different libraries, but this might work if you get stuck. — Ben Pearson, Sep 13 '11 at 16:53
Well, it's good to know and another resource. I was looking for some library for doing such selection in a more arbitrary way, but it's ok. — Matías Fidemraizer, Sep 14 '11 at 06:45

score 2 · Answer 4 · answered Sep 13 '11 at 18:37

iText's RegionTextRenderFilter is precisely what you're looking for.

So you want something like this (forgive my Java, but it should be trivial to translate):

PdfReader reader = new PdfReader(path);

// FilteredTextRenderListener wraps an extraction strategy and applies the region filter to it.
FilteredTextRenderListener regionFilter =
  new FilteredTextRenderListener( new SimpleTextExtractionStrategy(),
                                  new RegionTextRenderFilter( someRect ) );
// Page numbers are 1-based in iText.
String regionText = PdfTextExtractor.getTextFromPage(reader, 1, regionFilter );

Mark Storer

Well, good to know there's another option to solve the same problem, so if using PDFBox for .NET has some problem, I'll take a look. Anyway, +1 for your contribution. – Matías Fidemraizer Sep 14 '11 at 06:32
Hey, I was trying out your solution but it seems that iTextSharp (.NET version) doesn't have such strategy... – Matías Fidemraizer Sep 14 '11 at 08:58
Same problem as Matías -- how can this be done in .NET? – Richard West Jan 06 '14 at 20:36

score 1 · Answer 5 · edited Mar 08 '19 at 11:13

This code works with iText 7:

PdfReader reader = new PdfReader("D:/Sample2.pdf");
PdfDocument pdfDoc = new PdfDocument(reader);
// The Rectangle constructor takes (x, y, width, height);
// the region from (208, 508) to (235, 519) is 27 x 11 points.
Rectangle rect = new Rectangle(208, 508, 27, 11);
TextRegionEventFilter regionFilter = new TextRegionEventFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy(), regionFilter);
// Page numbers are 1-based.
new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetPage(1));
String text = extractionStrategy.GetResultantText();
pdfDoc.Close();

score 0 · Answer 6 · answered Feb 12 '20 at 02:33

You may want to look at this sample, which uses iTextSharp:

var pdfFilename = @"PathToYourPDF\random.pdf";
var textToFind = "Lombok";
var pageNumber = 1;
var point = PdfTools.GetTextCoordinate(textToFind, pdfFilename , pageNumber);
Console.WriteLine($"{point.X},{point.Y}");
