How to extract rotation/transformation information for images extracted from a PDF (i.e. how do viewers know to rotate by 180°?)

Question

I am using a ScanSnap scanner that generates PDF-1.3 files in which the orientation of scanned documents is auto-corrected (rotated by 0 or 180 degrees) when the PDF is viewed in Adobe Reader. OCR is done by the scanning software, and I assume the orientation is determined at that point and encoded into the PDF.

Note that I know I can use Tesseract or other OCR tools to determine whether rotation is needed, but I do not want to, since the scanner software seems to have already determined the orientation and is telling PDF viewers whether rotation is needed.

When I use image extraction tools (such as xpdf's pdfimages or Python libraries), the extracted JPEG images are not rotated by 180 degrees when they need to be.

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.

I have scanned a document twice, once in each orientation (0 degrees and 180 degrees). I cannot work out what tells Adobe Reader or Foxit to rotate (or not rotate) the image when viewing. I have looked at the PDF-1.3 specification and compared the PDF binary data of the orientation-corrected and uncorrected files, but I cannot determine what is correcting the orientation.

No /Page/Rotate (defaults to 0) in PDF
No EXIF orientation in JPEG
I do not see any transformation matrix (cm operator) in PDF

In both cases the PDF binary looks like the following (stopped at the JPEG streamed data)

UPDATED: links to PDF files rotated-180 rotated-0

%PDF-1.3
%âãÏÓ
1 0 obj
<</Metadata 20 0 R/Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</MediaBox[0.0 0.0 606.6 794.88]/Count 1/Type/Pages/Kids[4 0 R]>>
endobj
4 0 obj
<</Parent 2 0 R/Contents 18 0 R/PieceInfo<</PSL<</Private<</V(3.2.9)>>/LastModified(D:20190201125524-00'00')>>>>/MediaBox[0.0 0.0 606.6 794.88]/Resources<</XObject<</Im0 5 0 R>>/Font<</C0_0 11 0 R/T1_0 16 0 R>>/ProcSet[/PDF/Text/ImageC]>>/Type/Page/LastModified(D:20190201085524-04'00')>>
endobj
5 0 obj
<</Subtype/Image/Length 433576/Filter/DCTDecode/Name/X/BitsPerComponent 8/ColorSpace/DeviceRGB/Width 1685/Height 2208/Type/XObject>>stream

Does anyone know how PDF viewers know whether to rotate an image by 180 degrees? Is it metadata within the PDF or the JPEG image that can be extracted? Do Adobe Reader and other viewers do something dynamically when opening a document to determine whether orientation correction is needed?

I'm no expert on the PDF specification, but I was hoping someone may have already found a solution to this problem.

Please post a link to a rotated PDF file so we can take a look at it. — Mihai Iancu, Feb 02 '19 at 06:16
Indeed, the data you posted do not make a pdf viewer rotate anything. — mkl, Feb 02 '19 at 06:35
Thanks for the comments - I updated the Question to include links to the "rotated-180" PDF. Also added in "rotated-0" PDF as well. — user297500, Feb 04 '19 at 14:44

score 3 · Accepted Answer · answered Feb 05 '19 at 16:30

The image Im0 in the resources of the page in "internetfile-180.pdf" is not rotated.

But the image Im0 in the resources of the page in "internetfile.pdf" is rotated.

In the viewer both look upright, so in "internetfile.pdf" a technique must be used that rotates the image.

There are two major techniques for this:

Setting the Rotate property of the page accordingly, i.e. here to 180.
Applying a rotation transformation to the current transformation matrix in the content stream of the page.

Let's look at the page dictionary first, a bit pretty-printed:

4 0 obj
<<
  /Parent 2 0 R
  /Contents 13 0 R
  /PieceInfo
  <<
    /PSL
    <<
      /Private <</V (3.2.9)>>
      /LastModified (D:20190204142537-00'00')
    >>
  >>
  /MediaBox [0.0 0.0 608.64 792.24]
  /Resources
  <<
    /XObject <</Im0 5 0 R>>
    /Font <</T1_0 11 0 R>>
    /ProcSet [/PDF /Text /ImageC]
  >>
  /Type /Page
  /LastModified (D:20190204102537-04'00')
>>

As we see, there is no Rotate entry present. Thus, we'll have to look at the page content stream. According to the page dictionary it's in object 13, generation 0.

That object is a stream object with deflated stream data:

13 0 obj
<<
  /Length 4014
  /Filter /FlateDecode
>>
stream
H‰”WÛŽÛF}Ÿ¯Ð[lÀÓÓ÷Ë¾e½
[...]
ÿüòÛÿ ´ß
endstream
endobj

After inflating the stream data, they start like this:

q
-608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm
/Im0 Do
Q
[...]

This is indeed an application of the second technique: the cm instruction applies the rotation, and the Do instruction paints the image while that rotation is active.
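For readers who want to reproduce the inflating step, here is a minimal sketch using only Python's standard zlib module. It assumes the stream uses plain /FlateDecode with no extra decode parameters. The function name inflate_stream is hypothetical, and the sample bytes are a stand-in created here by deflating the visible start of the content stream, not the real bytes of object 13.

```python
import zlib


def inflate_stream(raw: bytes) -> bytes:
    """Inflate the raw bytes found between 'stream' and 'endstream'
    of a /FlateDecode stream (zlib format)."""
    return zlib.decompress(raw)


# Stand-in for the real stream bytes: deflate the known start of the
# content stream ourselves, then inflate it again.
sample_content = (
    b"q\n"
    b"-608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm\n"
    b"/Im0 Do\n"
    b"Q\n"
)
deflated = zlib.compress(sample_content)
inflated = inflate_stream(deflated)

# latin-1 maps every byte to a character, so decoding cannot fail
print(inflated.decode("latin-1"))
```

With a real file, the bytes passed to inflate_stream would be the /Length bytes that follow the stream keyword of the content stream object.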

In detail, the cm instruction applies the affine transformation represented by the matrix

-608.3999939    0            0
   0         -792.9600067    0
 608.3999939  792.9600067    1

In other words:

x' = -608.3999939 * x + 608.3999939
y' = -792.9600067 * y + 792.9600067

This transformation actually is a combination of a rotation by 180°, a horizontal scaling by 608.3999939 and a vertical scaling by 792.9600067, and a translation by 608.3999939 horizontally and 792.9600067 vertically.

The Do instruction now paints the image. Here one needs to know that this instruction first scales the image to fit into the unit 1×1 square at the origin and then applies the current transformation matrix.

Thus, the image is drawn rotated by 180°, effectively filling the whole 608.64×792.24 MediaBox of the page.
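The effect of the matrix can be checked numerically. The sketch below applies the six cm operands to the corners of the unit square, using the PDF convention x' = a*x + c*y + e and y' = b*x + d*y + f. The helper name apply_cm and the corner labels are illustrative only.

```python
def apply_cm(matrix, point):
    """Apply a PDF transformation matrix given as (a, b, c, d, e, f)
    to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# The six operands of the cm instruction from the content stream
CM = (-608.3999939, 0.0, 0.0, -792.9600067, 608.3999939, 792.9600067)

# Corners of the unit square into which Do places the image
corners = {
    "bottom-left": (0, 0),
    "bottom-right": (1, 0),
    "top-left": (0, 1),
    "top-right": (1, 1),
}

mapped = {name: apply_cm(CM, p) for name, p in corners.items()}
for name, p in mapped.items():
    print(name, "->", p)
```

The bottom-left corner of the image lands at the top-right of the page area and the top-right corner lands at the origin, which is what a 180° rotation about the centre of the page does.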

score 0 · Answer 2 · edited May 01 '20 at 14:25

mkl answered the question correctly, doing all the hard work of decoding the PDF for me.

I thought I would add my Python (PyPDF2) code that searches for the rotation condition he found, in case it helps someone else.

import re
import PyPDF2  # written against the older PyPDF2 API (PdfFileReader, getPage, ...)

# A signed decimal or integer operand, e.g. -608.3999939 or 0
FLOAT_REG = r'([-+]?(?:\d*\.\d+|\d+))'
# Six numeric operands followed by the cm operator
CM_REG = re.compile(r'\s+'.join([FLOAT_REG] * 6) + r'\s+cm\b')

# filepath: path to the scanned PDF, defined by the caller
input1 = PyPDF2.PdfFileReader(open(filepath, "rb"))
totalPages = input1.getNumPages()
for pgNum in range(totalPages):
    page0 = input1.getPage(pgNum)

    # Let's look to see if the page contains a transformation matrix to rotate it 180 degrees
    # (the ScanSnap iX500 encoded the PDF with a cm transformation matrix to rotate 180 degrees in PDF viewers)
    # @see https://stackoverflow.com/questions/54483013/how-to-extract-rotation-transformation-information-for-pdf-extracted-images-i-e
    # @see 'PDF 1.3 Reference Manual March 11, 1999' Section 3.10 Transformation matrices which is applied to the scanned image
    #                                          [[a b 0]
    #                                           [c d 0]
    #                                           [e f 1]]
    isPageRotated180 = False
    # latin-1 maps every byte to a character, so binary data in the stream cannot raise a decode error
    pgContent = page0['/Contents'].getData().decode('latin-1')
    m = CM_REG.search(pgContent)
    if m:
        (a, b, c, d, e, f) = map(float, m.groups())
        # 180 degrees: negative scaling on both axes, no shear, translated back onto the page
        isPageRotated180 = (a < 0 and d < 0 and b == 0 and c == 0 and a == -e and d == -f)
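The same idea can be generalized beyond the 180° case. The following is a rough sketch, not part of the original answer, that uses only the standard library and works on an already-decoded content stream string. It assumes the first cm operator in the stream is the one that places the scanned image, that the matrix has no skew, and it ignores nested q/Q states and concatenated cm operators. The function name rotation_from_content is hypothetical.

```python
import math
import re

# A signed decimal or integer operand, e.g. -608.3999939 or 0
NUM = r"([-+]?(?:\d*\.\d+|\d+))"
# Six numeric operands followed by the cm operator
CM_RE = re.compile(r"\s+".join([NUM] * 6) + r"\s+cm\b")


def rotation_from_content(content: str):
    """Return the counterclockwise rotation (whole degrees, 0..359) of the
    first cm operator, or None if there is no cm operator or the matrix
    mirrors the image instead of rotating it."""
    m = CM_RE.search(content)
    if m is None:
        return None
    a, b, c, d, e, f = map(float, m.groups())
    # A non-positive determinant means a mirror or degenerate matrix
    if a * d - b * c <= 0:
        return None
    # For a rotation plus scaling, (a, b) is the image of the x axis
    return round(math.degrees(math.atan2(b, a))) % 360


rotated = "q\n-608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm\n/Im0 Do\nQ\n"
upright = "q\n606.6 0 0 794.88 0 0 cm\n/Im0 Do\nQ\n"

print(rotation_from_content(rotated))
print(rotation_from_content(upright))
```

The upright sample string is made up for illustration from the MediaBox in the question. The actual content stream of the non-rotated file was not shown in this thread.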
