I'm having trouble understanding the behavior of PDFBox when attempting to append text to a page's content stream. I am using a sample scanned PDF which is just a raster image overlaid on the page. My working knowledge of PDF internals is somewhat basic, so I may be on the wrong track.

http://solutions.weblite.ca/pdfocrx/scansmpl.pdf

I am using PDFBox 2.0.11 with sbt: "org.apache.pdfbox" % "pdfbox" % "2.0.11"

My first step is to create a content stream and write "hello world" on the PDF, which I accomplished with the following:

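import org.apache.pdfbox.pdmodel.PDPageContentStream
import org.apache.pdfbox.pdmodel.font.PDType1Font

// val pdf: PDDocument
val page = pdf.getPage(0) // first page (zero-based index)
// appendContent = false (overwrite existing content), compress = true
val contentStream = new PDPageContentStream(pdf, page, false, true)
contentStream.beginText()
contentStream.newLineAtOffset(0, 0) // page origin, bottom left
contentStream.setFont(PDType1Font.COURIER, 12)
contentStream.showText("Hello, world!")
contentStream.endText()
contentStream.close()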

This works, and the text shows up in the bottom left, which is where I expected it to be. But of course it overwrites the raster image, which is not what I want. So I changed the PDPageContentStream constructor arguments to (pdf, page, true, true) to make it append to the content stream.

Now I get bizarre behavior that I don't understand. The text shows up huge: so big that I can only see the bottom corner of the H, because it is at least 10x larger than the page itself. I guess this means some transformation matrix left over from the existing content is still in effect? I'm not sure that I fully understand how the transformation operations work within a PDF. PDFBox seems to imply that calling setTextMatrix replaces the existing text matrix with the new one, rather than being applied relative to it. I can get the text to be visible (and close to normal size) with this:

import java.awt.geom.AffineTransform
import org.apache.pdfbox.util.Matrix

val affine = new AffineTransform()
affine.setToIdentity()
affine.scale(0.002, 0.002)
// code
contentStream.setTextMatrix(new Matrix(affine))

I only discovered this through trial and error. I don't see any way to get the current transformation matrix state other than the page-wide .getMatrix(), but that appears to return the identity regardless of whether I'm appending or overwriting, so I don't think that is it. Additionally, if I apply another text matrix with the exact same call as the last line of the previous block, it appears to scale relative to the previous scale, so I end up with a second text block that is scaled too small to see.
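
For illustration, the compensation found by trial and error can be expressed as a matrix inverse. This is only a sketch using the JDK's AffineTransform: the leftover scale of 500 is an assumption (it is simply the reciprocal of the 0.002 above, not a value read from the sample PDF), and `compensatingMatrix` and `effectiveMatrix` are hypothetical helper names.

```scala
import java.awt.geom.AffineTransform

// Hypothetical helper: the text matrix that would cancel a leftover
// transformation is that transformation's inverse.
// createInverse throws NoninvertibleTransformException for singular matrices.
def compensatingMatrix(leftover: AffineTransform): AffineTransform =
  leftover.createInverse()

// Hypothetical helper: combine a leftover matrix with a text matrix.
// The text matrix is applied first, then the leftover matrix.
def effectiveMatrix(leftover: AffineTransform, text: AffineTransform): AffineTransform = {
  val result = new AffineTransform(leftover) // copy, so the input is not mutated
  result.concatenate(text)
  result
}

// ASSUMPTION: a uniform scale of 500 is still active after the existing content.
val leftover = AffineTransform.getScaleInstance(500.0, 500.0)
val compensation = compensatingMatrix(leftover)

// Scale factors of the combined matrix; close to 1.0, up to floating-point rounding.
val combined = effectiveMatrix(leftover, compensation)
println(s"${combined.getScaleX} ${combined.getScaleY}")
```

If the real leftover matrix were known, the compensating matrix could be passed to setTextMatrix via new Matrix(...), as in the snippet above.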

How can I get the current transformation matrix so that I can invert it to reach the actual desired scaling?

Thanks!

Brian

1 Answer

It appears that the problem described in the question "PDFBox : PDPageContentStream's append mode misbehaving" was the issue. I had not seen the constructor with the 5th argument, resetContext, before. I'm still unsure how you would get the current context if for some reason you needed to do something relative to it, though. In my case, adding the 5th argument solves the problem.
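
As a minimal sketch of the fix described above, the 5-argument constructor can be used as follows. It assumes PDFBox 2.0.x on the classpath; `appendText` is a hypothetical helper name, not part of PDFBox.

```scala
import org.apache.pdfbox.pdmodel.{PDDocument, PDPage, PDPageContentStream}
import org.apache.pdfbox.pdmodel.font.PDType1Font

// Hypothetical helper: appends one line of text at the page origin.
def appendText(pdf: PDDocument, page: PDPage, text: String): Unit = {
  // Arguments: appendContent = true, compress = true, resetContext = true.
  // resetContext asks PDFBox to reset the graphics context, so the appended
  // stream does not inherit scaling or rotation left by the existing content.
  val contentStream = new PDPageContentStream(pdf, page, true, true, true)
  try {
    contentStream.beginText()
    contentStream.setFont(PDType1Font.COURIER, 12)
    contentStream.newLineAtOffset(0, 0)
    contentStream.showText(text)
    contentStream.endText()
  } finally {
    contentStream.close() // always close so the stream is written to the page
  }
}
```

With this in place, a call such as appendText(pdf, pdf.getPage(0), "Hello, world!") should draw the text at normal size over the scanned image, without any manual text matrix.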

Brian