I am using a ScanSnap scanner that generates PDF-1.3 files. When such a PDF is viewed in Adobe Reader, the orientation of the scanned document is auto-corrected (rotated 0 or 180 degrees). OCR is done by the scanning software, and I assume the orientation is determined at that point and encoded into the PDF.
Note that I know I can use Tesseract or other OCR tools to determine whether rotation is needed, but I do not want to, as the scanner software seems to have already determined it and is telling PDF viewers whether rotation is needed.
When I use image extraction tools (like xpdf pdfimages or Python libraries), the extracted JPEG images are not rotated 180 degrees when needed.
NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
I have scanned a document twice, once at 0 degrees and once at 180 degrees. I cannot seem to reverse engineer what is telling Adobe/Foxit to rotate (or not rotate) the image when viewing. I have looked at the PDF-1.3 specification and compared the PDF binary data between the orientation-corrected file and the uncorrected one, but I cannot determine what is correcting the orientation.
- No /Page/Rotate (defaults to 0) in PDF
- No EXIF orientation in JPEG
- I do not see any transformation matrix (cm operator) in PDF
In both cases the PDF binary looks like the following (stopped at the JPEG streamed data)
UPDATED: links to PDF files rotated-180 rotated-0
%PDF-1.3
%âãÏÓ
1 0 obj
<</Metadata 20 0 R/Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</MediaBox[0.0 0.0 606.6 794.88]/Count 1/Type/Pages/Kids[4 0 R]>>
endobj
4 0 obj
<</Parent 2 0 R/Contents 18 0 R/PieceInfo<</PSL<</Private<</V(3.2.9)>>/LastModified(D:20190201125524-00'00')>>>>/MediaBox[0.0 0.0 606.6 794.88]/Resources<</XObject<</Im0 5 0 R>>/Font<</C0_0 11 0 R/T1_0 16 0 R>>/ProcSet[/PDF/Text/ImageC]>>/Type/Page/LastModified(D:20190201085524-04'00')>>
endobj
5 0 obj
<</Subtype/Image/Length 433576/Filter/DCTDecode/Name/X/BitsPerComponent 8/ColorSpace/DeviceRGB/Width 1685/Height 2208/Type/XObject>>stream
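One way to double-check the `cm` point in the list above: the dump stops before the page's content stream (`/Contents 18 0 R`), and if that stream is Flate-compressed (an assumption, not something the dump confirms), a `cm` operator would not be visible in the raw bytes. The following is a minimal sketch using only the Python standard library. It naively splits the file on `stream`/`endstream`, tries to inflate each stream, and reports the rotation angle implied by any `cm` matrix it finds. The function names are hypothetical, and the regex-based parsing is a shortcut rather than a real PDF parser.

```python
import math
import re
import zlib

# One PDF number, e.g. "606.6", "-1", ".5".
NUM = rb"([-+]?(?:\d+\.?\d*|\.\d+))"
# Six numbers followed by the `cm` (concatenate matrix) operator.
CM_RE = re.compile(rb"\s+".join([NUM] * 6) + rb"\s+cm\b")
# Everything between a `stream` keyword and the next `endstream`.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield each stream body, inflated when it is zlib/Flate data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before `endstream`.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate data (e.g. the JPEG itself); scan as-is


def find_cm_rotations(pdf_bytes):
    """Return [(matrix, angle_in_degrees), ...] for every `cm` found.

    For a matrix [a b c d e f], the angle is atan2(b, a). This ignores
    mirroring, so a horizontal flip also shows up as 180.
    """
    results = []
    for stream in iter_streams(pdf_bytes):
        for m in CM_RE.finditer(stream):
            a, b, c, d, e, f = (float(x) for x in m.groups())
            angle = round(math.degrees(math.atan2(b, a))) % 360
            results.append(((a, b, c, d, e, f), angle))
    return results


# Synthetic example: a compressed content stream that draws /Im0 upside down.
content = b"q -606.6 0 0 -794.88 606.6 794.88 cm /Im0 Do Q"
sample = (
    b"18 0 obj\n<</Filter/FlateDecode>>stream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
for matrix, angle in find_cm_rotations(sample):
    print(matrix, "->", angle, "degrees")
```

On a real file, the same function could be called on the bytes read from each of the two scans and the results compared. A matrix with negative `a` and `d` values (as in the synthetic sample) would indicate a 180 degree rotation applied when the image is drawn.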
Does anyone know how PDF viewers know whether to rotate an image 180 degrees? Is it metadata within the PDF or the JPEG image that can be extracted? Do Adobe and other viewers do something dynamically on opening a document to determine whether orientation correction is needed?
I'm no expert on the PDF specification, but I was hoping someone may have already found a solution to this problem.