I'm having trouble understanding the behavior of PDFBox when attempting to append text to a page's content stream. I am using a sample scanned PDF, which is just a raster image overlaid on the page. My working knowledge of PDF internals is somewhat basic, so I may be on the wrong track.
http://solutions.weblite.ca/pdfocrx/scansmpl.pdf
I am using PDFBox 2.0.11 with sbt: "org.apache.pdfbox" % "pdfbox" % "2.0.11"
My first step is to create a content stream and write "hello world" on the PDF, which I accomplished with the following:
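import org.apache.pdfbox.pdmodel.{PDDocument, PDPageContentStream}
import org.apache.pdfbox.pdmodel.font.PDType1Font

// val pdf: PDDocument
val page = pdf.getPage(0) // first page (zero-based index)
// Arguments: document, page, appendContent = false (overwrite), compress = true
val contentStream = new PDPageContentStream(pdf, page, false, true)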
contentStream.beginText()
contentStream.newLineAtOffset(0, 0)
contentStream.setFont(PDType1Font.COURIER, 12)
contentStream.showText("Hello, world!")
contentStream.endText()
contentStream.close()
This works, and the text shows up in the bottom left, which is where I expected it to be. But of course it overwrites the raster image, which is not what I want. So I changed the PDPageContentStream constructor arguments to (pdf, page, true, true) to make it append to the content stream.
Now I get bizarre behavior that I don't understand. The text shows up huge, so big that I can only see the bottom corner of the H, because it is at least 10x larger than the page itself. I guess this means some matrix transformation from the existing content stream is still in effect? I'm not sure that I fully understand how the transformation operations work within a PDF. PDFBox seems to imply that calling setTextMatrix replaces the existing text matrix with the new one, rather than being applied relative to it. I can get the text to be visible (and close to normal size) with this:
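import java.awt.geom.AffineTransform
import org.apache.pdfbox.util.Matrix

val affine = new AffineTransform()
affine.setToIdentity() // redundant for a freshly created transform
affine.scale(0.002, 0.002) // scale factor found by trial and error
// code
contentStream.setTextMatrix(new Matrix(affine))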
I only discovered this through trial and error. I don't see any way to get the current transformation matrix state other than the page-wide .getMatrix(), but that appears to return the identity regardless of whether I'm appending or overwriting, so I don't think it is that. Additionally, if I apply another text matrix with the exact same setTextMatrix call as in the block above, it appears to be scaled relative to the previous scale, so I end up with a second text block that is too small to see.
How can I get the current transformation matrix so that I can invert it to reach the actual desired scaling?
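To be concrete about what I mean by inverting, here is a minimal sketch using only java.awt.geom.AffineTransform. The leftover transform in it is purely hypothetical: I am assuming a plain scale of 612 x 792 (a US Letter page in points), which is not a value I have read from the sample PDF. The object name InvertLeftoverScale is also just made up for this illustration.

```scala
import java.awt.geom.AffineTransform

object InvertLeftoverScale {
  // Hypothetical leftover transform: a pure 612 x 792 scale.
  // This is an assumption for illustration, not a value taken from the PDF.
  val leftover: AffineTransform = AffineTransform.getScaleInstance(612.0, 792.0)

  // The inverse undoes the leftover scale.
  // createInverse throws NoninvertibleTransformException for a singular matrix.
  val inverse: AffineTransform = leftover.createInverse()

  // Applying the inverse on top of the leftover transform should give
  // (approximately, up to floating-point rounding) the identity.
  val combined: AffineTransform = {
    val t = new AffineTransform(leftover)
    t.concatenate(inverse)
    t
  }

  def main(args: Array[String]): Unit = {
    println(f"inverse scale: ${inverse.getScaleX}%.5f x ${inverse.getScaleY}%.5f")
    println(f"combined scale: ${combined.getScaleX}%.5f x ${combined.getScaleY}%.5f")
  }
}
```

If the real leftover matrix were something like this, the inverse would be what I pass to new Matrix(...) instead of my guessed 0.002 scale. What I am missing is a way to read that matrix in the first place.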
Thanks!