
I have code that is meant to extract text from a user-created rectangle on a PDF.

I am using iTextSharp for this.

The user enters the coordinates of the rectangle. They can then either 'preview' it, which draws a red rectangle over their PDF, or 'generate' a new PDF, which is meant to capture the text within that rectangle and add an extra page to the PDF containing just this text.

My issue is that the text is being captured from an area completely separate from the preview rectangle. Both rectangles are created in the same way:

  //Preview rectangle code
  var xfer = ConvertToPoint(Convert.ToDouble(ULTB.Text));
  var yfer = ConvertToPoint(Convert.ToDouble(LLTB.Text));
  var uxfer = ConvertToPoint(Convert.ToDouble(URTB.Text));
  var uyfer = ConvertToPoint(Convert.ToDouble(LRTB.Text));

  iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle((float)xfer, (float)yfer, (float)uxfer, (float)uyfer);

This rectangle is then drawn onto the user's document.

(ConvertToPoint just converts the user input from mm to points.)
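The body of ConvertToPoint is not shown in the question. A minimal sketch of what such a helper might look like, assuming the default PDF user unit of 1/72 inch (the class name and constants here are hypothetical, not the asker's actual code):

```csharp
using System;

public static class UnitConversion
{
    // Assumption: default PDF user space, 72 units ("points") per inch.
    private const double PointsPerInch = 72.0;
    private const double MillimetresPerInch = 25.4;

    // Hypothetical stand-in for the ConvertToPoint helper used in the question.
    public static double ConvertToPoint(double millimetres)
    {
        return millimetres * PointsPerInch / MillimetresPerInch;
    }
}
```

Under this definition, an input of 27 mm comes out at roughly 76.5 points.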

Using the exact same user input, the rectangle created by the following code is in a different location:

// Text extraction code: same inputs and same conversion as the preview
var xfer = ConvertToPoint(Convert.ToDouble(ULTB.Text));
var yfer = ConvertToPoint(Convert.ToDouble(LLTB.Text));
var uxfer = ConvertToPoint(Convert.ToDouble(URTB.Text));
var uyfer = ConvertToPoint(Convert.ToDouble(LRTB.Text));
RenderFilter[] filters = new RenderFilter[1];
// Despite its name, regionFilter is the extraction strategy that the filter wraps
LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
// Only text inside this rectangle should be passed on to the strategy
filters[0] = new RegionTextRenderFilter(new iTextSharp.text.Rectangle((float)xfer, (float)yfer, (float)uxfer, (float)uyfer));

FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);
// x is the page number
String result = PdfTextExtractor.GetTextFromPage(reader, x, strategy);

The above code should get the text at the position given by the user's coordinates, but it does not. Any ideas?

I've shared the PDF on Google Drive, with a lot of the text redacted.

The red rectangle is what I get from the preview code, and the text towards the bottom of the document is what's being picked up by the text capture.


Ben Bodie
  • I have never used iTextSharp, so my comment may be meaningless, but most PDF readers allow the user to zoom in and out. Could it be that the coordinates as displayed on the screen need to be translated to zoomed (and possibly panned) coordinates on the actual PDF? – B.O.B. Apr 25 '22 at 14:12
  • Hi B.O.B, the zoom is the same on both, I just change what button is pressed :) – Ben Bodie Apr 25 '22 at 14:58
  • Please share a PDF and coordinate values to allow reproducing the issue. – mkl Apr 25 '22 at 15:12
  • I've shared the file. The coordinates I'm using are 27 mm lower left, 50 mm upper right, 58 mm lower left y, 93 upper right y. – Ben Bodie Apr 26 '22 at 07:46

1 Answer

The problem is that the page rotation property is not 0 here.

By default, iTextSharp has the "feature" of transforming the coordinates of any content you add so that they align with the page rotation. It does not apply the same transformation during text extraction.

Fortunately, iTextSharp allows you to switch off that transformation. If you have a PdfStamper pdfStamper, simply set

pdfStamper.RotateContents = false;

right after initializing the stamper.

Of course this means that you have to take the page rotation into account in your code. But it also means that you can do so consistently.
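As a rough illustration of taking the rotation into account, the sketch below maps a rectangle given in the coordinates of the page as displayed (rotated) to the coordinates of the unrotated page, which are the ones text extraction works in. It assumes the page box has its origin at (0, 0) and that the rotation is a multiple of 90 degrees. The class and method names are hypothetical and not part of iTextSharp, and the mapping follows my understanding of the transformation the stamper applies by default, so verify it against your own documents.

```csharp
using System;

public static class PageRotation
{
    // Maps a rectangle from displayed (rotated) page coordinates to unrotated
    // page coordinates. rotatedWidth and rotatedHeight are the page dimensions
    // as displayed, in points. Assumes the page box origin is (0, 0).
    public static (double Llx, double Lly, double Urx, double Ury) ToUnrotated(
        double llx, double lly, double urx, double ury,
        int rotation, double rotatedWidth, double rotatedHeight)
    {
        // Normalise the rotation into the range 0..359.
        int r = ((rotation % 360) + 360) % 360;
        var (x1, y1) = MapPoint(llx, lly, r, rotatedWidth, rotatedHeight);
        var (x2, y2) = MapPoint(urx, ury, r, rotatedWidth, rotatedHeight);
        // The corners can swap roles after rotation, so re-order them.
        return (Math.Min(x1, x2), Math.Min(y1, y2), Math.Max(x1, x2), Math.Max(y1, y2));
    }

    private static (double X, double Y) MapPoint(double x, double y, int r, double w, double h)
    {
        switch (r)
        {
            case 0: return (x, y);
            case 90: return (h - y, x);
            case 180: return (w - x, h - y);
            case 270: return (y, w - x);
            default:
                throw new ArgumentException("Rotation must be a multiple of 90 degrees.", nameof(r));
        }
    }
}
```

The resulting values could then be passed to the rectangle used for the region filter, while the preview rectangle is drawn with the original values (when RotateContents is left at its default) or with the mapped values (when it is set to false).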


mkl
  • Thank you. It's my code that does the rotation of the PDF, so to work around this I've let the user create both rectangles on the original PDF, before it's rotated, and then rotate everything as a last step. Thanks a lot :) – Ben Bodie Apr 26 '22 at 13:12