I am new to PDFBox (and PDF generation) and I am having difficulties generating my own PDF.

I have text with certain coordinates in inches/centimeters and I need to convert them to the units PDFBox uses. Are there any suggestions or utilities that can do this automatically?

PDPageContentStream.moveTextPositionByAmount(x,y) is making no sense to me.

Robby F

1 Answer

In general PDFBox uses the PDF user space coordinates when creating a PDF. This means:

  1. The coordinates of a page are delimited by its CropBox defaulting to its MediaBox, the values increasing left to right and bottom to top. Thus, if you create a page using new PDPage() or new PDPage(PDPage.PAGE_SIZE_*) the origin of the coordinate system starts in the lower left corner of the page.

  2. The unit in user space starts as the default user space unit, which is defined by the UserUnit entry of the page. Most often (e.g. if you create a page using any of the PDPage constructors and don't explicitly change that value) it is not set explicitly, so its default applies: a UserUnit of 1.0, i.e. one unit equals 1/72 inch.

  3. The user space coordinate system can be changed pretty arbitrarily by concatenating some matrix to the current transformation matrix. The current transformation matrix starts as the identity matrix.

    In PDFBox you do this using one of the PDPageContentStream.concatenate2CTM() overloads.

  4. As soon as you switch to text mode using PDPageContentStream.beginText(), the coordinate system used is furthermore influenced by the transformation introduced by the text matrix.

    In PDFBox you set the text matrix using one of the PDPageContentStream.setTextMatrix() overloads.
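
Item 1 matters as soon as your source positions are measured from the top-left corner of the sheet, as in most layout tools: the vertical coordinate then has to be mirrored. The following is only a minimal sketch, assuming an untransformed page with default UserUnit whose crop box has its lower left corner at (0, 0) and a US Letter height of 792 units. The class and method names are made up for illustration and are not PDFBox API.

```java
/** Hypothetical helper (not PDFBox API): maps top-left based positions to default user space. */
final class TopLeftToUserSpace {

    /** Horizontal positions need no change: both systems grow left to right. */
    static float userX(float xFromLeft) {
        return xFromLeft;
    }

    /** Vertical positions are mirrored: user space grows bottom to top. */
    static float userY(float yFromTop, float pageHeight) {
        return pageHeight - yFromTop;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 11 inches * 72 units per inch.
        float letterHeight = 792f;
        // A point 1 inch (72 units) from the left edge and 1 inch below the top edge.
        float x = userX(72f);
        float y = userY(72f, letterHeight);
        System.out.println("user space position: (" + x + ", " + y + ")");
    }
}
```

Directly after beginText(), values like these are what you would hand to moveTextPositionByAmount(x, y) in 1.8.x (newLineAtOffset in 2.x). If the crop box does not start at (0, 0), its lower left coordinates would have to be added as well.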

As you are new to PDFBox (as you say) and new to PDF in general (as I presume, because otherwise you would likely have recognized the coordinates), I would advise you to initially refrain from using transformations wherever possible and, therefore, remain in a state where the coordinate system starts in the lower left, is neither rotated nor skewed, and has a unit length of 1/72 inch.

For this context you actually can use constants provided by PDFBox for conversion:

  • Multiply coordinates in inches by PDPage.DEFAULT_USER_SPACE_UNIT_DPI to get default user space coordinates.
  • Multiply coordinates in millimeters by PDPage.MM_TO_UNITS to get default user space coordinates.

If you want to have fun with coordinates, though, look at the PDF specification ISO 32000-1 and study sections 8.3 Coordinate Systems and 9.4.4 Text Space Details.


The PDPage constants pointed to above used to be accessible in early PDFBox 1.8.x versions but then got hidden (private), and eventually were removed in the transition to PDFBox 2.x.

For reference, the constants were defined as

private static final int DEFAULT_USER_SPACE_UNIT_DPI = 72;

private static final float MM_TO_UNITS = 1/(10*2.54f)*DEFAULT_USER_SPACE_UNIT_DPI;
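
As those constants are no longer accessible, equivalent factors can simply be defined in your own code. This is a sketch under the assumption that the page uses the default user space unit (1/72 inch) and no transformation is active. The class, constant, and method names are hypothetical and not part of PDFBox.

```java
/** Hypothetical stand-in for the removed PDPage constants (names invented for this sketch). */
final class UserSpaceUnits {

    /** Default user space units per inch, same value as the old DEFAULT_USER_SPACE_UNIT_DPI. */
    static final float UNITS_PER_INCH = 72f;

    /** Default user space units per millimeter, same value as the old MM_TO_UNITS. */
    static final float UNITS_PER_MM = UNITS_PER_INCH / 25.4f;

    static float fromInches(float inches) {
        return inches * UNITS_PER_INCH;
    }

    static float fromMillimeters(float millimeters) {
        return millimeters * UNITS_PER_MM;
    }

    static float fromCentimeters(float centimeters) {
        return fromMillimeters(centimeters * 10f);
    }

    public static void main(String[] args) {
        // Width of an A4 sheet (210 mm) and a 1 inch margin, both in user space units.
        System.out.println("A4 width in units: " + fromMillimeters(210f));
        System.out.println("1 inch margin in units: " + fromInches(1f));
    }
}
```

The converted values can then be used wherever PDFBox expects coordinates or lengths, as long as the unit length has not been changed by a UserUnit entry or a transformation.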
mkl
  • There does not appear to be MM_TO_UNITS or DEFAULT_USER_SPACE_UNIT_DPI defined on PDPage. – Thayne May 30 '14 at 17:15
  • @Thayne Which version of PDFBox are you using? I see the constants both in 1.8.4 and in the current 2.0.0. – mkl Jun 02 '14 at 10:16
  • I'm using 1.8.5. I didn't know there was a 2.0.0 – Thayne Jun 02 '14 at 11:43
  • Looking at the source code those constants are defined but are private – Thayne Jun 02 '14 at 11:54
  • Ah, right, visibilities changed. They used to be public – mkl Jun 02 '14 at 14:57
  • Assuming units are _points_, which they usually are: `mm = pt * 0.352778` and `pt = mm / 0.352778` – LateralFractal Jan 04 '16 at 06:06
  • @mlk I am trying to create a PDF file using pdfbox. When I start adding text and images, like you say, it start adding them at the bottom of the document (lower left). How can I start to add everything on the top? – Erick May 24 '16 at 22:05
  • @Erick this actually is an altogether new question which should not be treated in a comment. That being said, you have to move the text position first using the crop box or media box coordinates and sizes as guidelines. – mkl May 25 '16 at 04:48
  • @Erick *When I start adding text and images, like you say, it start adding them at the bottom of the document (lower left). How can I start to add everything on the top?* - You might want to look at [this answer](http://stackoverflow.com/a/19683618/1729265) which illustrates how to draw some text across multiple lines on a given page. (In that answer I, too, used the media box; you might want to replace it by the crop box.) – mkl May 25 '16 at 06:38
  • @mkl I've recently started using PDFBox to extract text from regions via `org.apache.pdfbox.text.PDFTextStripperByArea`. In this situation I found that the `addRegion( String, Rectangle)` required the rectangles co-ordinates to be defined with co-ordinates starting at the top left corner of the page. Not sure whether this is a special case, but perhaps worth mentioning. – beldaz Oct 11 '17 at 19:43
  • @beldaz yes, the pdfbox text extraction code uses its own coordinate system. This can make life difficult for you if your task is not pure text extraction but instead you need the retrieved coordinates for a different task like highlighting some found words in the pdf. For all other tasks you need the original PDF user space coordinate system. This often has its origin in the lower left. – mkl Oct 12 '17 at 04:19
  • As of 2.0.4 (or earlier, that's what I'm using) DEFAULT_USER_SPACE_UNIT_DPI and MM_TO_UNITS are not found in PDPage. DEFAULT_USER_SPACE_UNIT_DPI value was 72 (pixels per inch). – Michael Ressler Nov 25 '17 at 14:11