I am using PDFTextStripper to extract text from a PDF. I want to get the width and height, in millimeters, of each TextPosition. For a given TextPosition tp, these are available from tp.getWidth() and tp.getHeight(). My problem is that the values returned are in display units. I looked around for the right conversion factor but got confused. I know that PDFs use different coordinate systems, as explained in the PDF documentation.

I also found this post, but it may be outdated since I am using PDFBox 2.0.12. The variables described in that post no longer exist in the PDPage class, but I found these constants in the PDRectangle class:

/** user space units per inch */
private static final float POINTS_PER_INCH = 72;

/** user space units per millimeter */
private static final float POINTS_PER_MM = 1 / (10 * 2.54f) * POINTS_PER_INCH;

My question is: in which space is a display unit defined, and how can I convert it to millimeters?

Many thanks,

Tilman Hausherr
Mr. D
  • 1 unit = 1/72 inch. – Tilman Hausherr Jan 03 '19 at 11:01
  • @TilmanHausherr So a display unit is the user space unit? If so, what is the unit of the device space? – Mr. D Jan 03 '19 at 16:42
  • The display depends on what device you are using. I think the javadoc is confusing but I don't have a better idea. – Tilman Hausherr Jan 03 '19 at 17:00
  • Strictly speaking it can differ from 1/72": if the page in question contains a **UserUnit** entry with value *x*, the default user space unit is *x × 1/72"*. – mkl Jan 03 '19 at 20:28
  • @mkl How can I tell whether a page contains a UserUnit? – Mr. D Jan 04 '19 at 16:25
  • `page.getCOSObject().getFloat(COSName.getPDFName("UserUnit"));`. I've never had a PDF with it. Your problems likely have a different cause. See the `DrawPrintTextLocations.java` example from the source code. – Tilman Hausherr Jan 06 '19 at 08:45
  • @Tilman is right, the **UserUnit** entry is seldom used. So indeed, your issues most likely have a different cause. I considered writing an answer pointing out a number of details to consider; but then I refrained from doing so as a generic list would be a lot of work. If you shared some code and described your issue using an example PDF and expected and observed results, we could more easily give specific help. And indeed, you should have a look at the `DrawPrintTextLocations` example! – mkl Jan 07 '19 at 11:00
  • Related, on text extraction coordinates: [Why PDFBox text extraction coordinates are as they are](https://stackoverflow.com/a/28114320/1729265), [Text coordinates when stripping from PDFBox](https://stackoverflow.com/a/46507350/1729265), etc. – mkl Jan 07 '19 at 11:19
  • Thanks a lot for your answers. Indeed, DrawPrintTextLocations.java has everything I need. – Mr. D Jan 08 '19 at 14:44
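
To summarize the comments as a rough sketch: assuming tp.getWidth() and tp.getHeight() return lengths in default user space units (1/72 inch each, scaled by the page's UserUnit when one is present), the conversion to millimeters could look like the following. The class name UserSpaceUnits and its methods are hypothetical and not part of PDFBox, and the sample values in main only stand in for real TextPosition results.

```java
public final class UserSpaceUnits {
    /** Default user space units per inch (1 unit = 1/72 inch). */
    private static final float POINTS_PER_INCH = 72f;

    /** Millimeters per inch. */
    private static final float MM_PER_INCH = 25.4f;

    private UserSpaceUnits() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Converts a length in user space units to millimeters.
     *
     * @param units    length in user space units, e.g. a text width or height
     * @param userUnit the page's UserUnit value; non-positive values are
     *                 treated as 1 (assumption: the entry is missing)
     */
    public static float toMillimeters(float units, float userUnit) {
        float scale = userUnit > 0f ? userUnit : 1f;
        return units * scale / POINTS_PER_INCH * MM_PER_INCH;
    }

    /** Converts a length assuming the page has no UserUnit entry. */
    public static float toMillimeters(float units) {
        return toMillimeters(units, 1f);
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for tp.getWidth() and tp.getHeight().
        float width = 5.5f;
        float height = 12f;
        System.out.printf("width: %.3f mm, height: %.3f mm%n",
                toMillimeters(width), toMillimeters(height));
    }
}
```

With PDFBox, the userUnit argument could come from the getFloat call quoted in the comments. As far as I know, that call returns a negative value when the entry is absent, which is why the sketch falls back to 1 for non-positive values. This only covers the unit conversion; the linked answers on text extraction coordinates describe other details that may affect the values.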
