3

I have a PDF that contains a long list of numbers and was compressed using the JBIG2 algorithm. When I look at the internal structure of the file, I find that my pages are built from two different XObjects:

(Pictured is Adobe Acrobat Preflight -> Internal structure.)

I can easily look at the specifics of the first one, called "XIPLAYER0" (not pictured); it even gives me the information bit by bit if I want. The second one is the one I am interested in, though. In it I can see that the image is built using 2 "Symbol Dictionaries" (the first one is marked grey). Is it possible to see the individual entries in such a dictionary? Or maybe even get some metadata for just one of them?

Sample PDF (outside link)

SirHawrk
  • Can you include a sample PDF? Also, how do you want to view the symbols, in Acrobat? – Zach Young May 24 '22 at 13:33
  • @ZachYoung I don't really care where I can see the symbols. I am comfortable with Python, and I'd guess that would be the most used language for something like this. I also included a sample PDF. It is an outside link, though – SirHawrk May 24 '22 at 13:40
  • @KJ I am not entirely certain I follow, but I am interested in the specific files as this is a faulty Xerox scan (yes, from that story ~9 years ago) – SirHawrk May 24 '22 at 17:25
  • Ah no it really is faulty. The numbers are not the same ones as in the original that was scanned lol – SirHawrk May 24 '22 at 19:25
  • This input of yours is not helpful. I __know__ that it is faulty. I am writing a paper about __why__ it is faulty and what __mistakes__ were made by the printer company – SirHawrk May 25 '22 at 04:53

2 Answers

1

This is not really about PDF. PDF is just the container for the JBIG2 format and its symbol dictionaries, which are what you're really interested in.

But, as a first step, you'll need to get the JBIG2 images out of the PDF:

Extract images from PDF, how to handle JBIG2 encoded

That SO question mentions Poppler, and Poppler does have a Python binding/wrapper:

https://pypi.org/project/python-poppler/

Once you get those JBIG2 files, maybe this can help:

jbig2_symbol_dict.c

The larger project (jbig2dec) has a command-line utility with a "dump" option, but the source says it's not implemented:

case dump:
    fprintf(stderr, "Sorry, segment dump not yet implemented\n");
    break;

So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
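That said, some metadata can be read without a dump tool, because the segment headers of an embedded JBIG2 stream have a simple layout. Below is a minimal sketch in Python, assuming a stream that has already been extracted from the PDF and that uses the sequential (embedded) organization PDF requires. The function names `parse_segments`, `symbol_dict_counts` and `summarize` are my own, the field layout is my reading of the JBIG2 specification (ITU-T T.88), and I have not run it against the sample file. It only reports how many symbols each dictionary holds; it does not decode the symbol bitmaps.

```python
import struct

SYMBOL_DICTIONARY = 0  # JBIG2 segment type number for a symbol dictionary


def parse_segments(data):
    """Yield one dict per segment of an embedded (sequential) JBIG2 stream."""
    pos = 0
    while pos + 11 <= len(data):  # 11 bytes is the shortest possible header
        number, flags = struct.unpack_from(">IB", data, pos)
        pos += 5
        seg_type = flags & 0x3F          # bits 0-5: segment type
        wide_page = bool(flags & 0x40)   # bit 6: 4-byte page association
        count = data[pos] >> 5           # short form: count in the top 3 bits
        if count == 7:                   # long form: 29-bit count + retain bits
            (long_form,) = struct.unpack_from(">I", data, pos)
            count = long_form & 0x1FFFFFFF
            pos += 4 + (count + 8) // 8
        else:
            pos += 1
        # The width of each reference depends on this segment's own number.
        ref_size = 1 if number <= 256 else 2 if number <= 65536 else 4
        refs = [int.from_bytes(data[pos + i * ref_size:pos + (i + 1) * ref_size], "big")
                for i in range(count)]
        pos += count * ref_size
        if wide_page:
            (page,) = struct.unpack_from(">I", data, pos)
            pos += 4
        else:
            page = data[pos]
            pos += 1
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if length == 0xFFFFFFFF:
            return  # unknown-length segment: not handled in this sketch
        yield {"number": number, "type": seg_type, "page": page,
               "refers_to": refs, "data": data[pos:pos + length]}
        pos += length


def symbol_dict_counts(body):
    """Return (exported, new) symbol counts from a symbol dictionary segment."""
    (flags,) = struct.unpack_from(">H", body, 0)
    pos = 2
    huffman = flags & 1
    refagg = (flags >> 1) & 1
    template = (flags >> 10) & 3
    rtemplate = (flags >> 12) & 1
    if not huffman:
        pos += 8 if template == 0 else 2  # adaptive template pixel bytes
    if refagg and rtemplate == 0:
        pos += 4                          # refinement template pixel bytes
    return struct.unpack_from(">II", body, pos)


def summarize(path):
    """Print one line per segment of the stream stored at path."""
    with open(path, "rb") as handle:
        for seg in parse_segments(handle.read()):
            line = f"segment {seg['number']}: type {seg['type']}, {len(seg['data'])} bytes"
            if seg["type"] == SYMBOL_DICTIONARY:
                exported, new = symbol_dict_counts(seg["data"])
                line += f", {exported} exported / {new} new symbols"
            print(line)
```

Calling `summarize` on an extracted stream should print one line per segment. Getting at the individual glyph bitmaps would still need a real decoder, since the symbols are arithmetic- or Huffman-coded inside the segment data.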

Zach Young
  • This sadly is an academic question in the sense that I need this for university. I will check these things out tomorrow; I am already at home but big thanks already – SirHawrk May 24 '22 at 17:26
0

The file in question has a known problem. A JBIG2 scan is supposed to be a highly compressed, clean pixel scan without some of the issues that a low-quality JPEG may introduce. However, the format as used by some commercial scanners can notoriously infill a 6 so that it looks like an 8, as seen in this sequence from page 1. See https://en.wikipedia.org/wiki/JBIG2#Disadvantages

(Pictured is the digit sequence from page 1.)

For several reasons, some organisations suggest that it not be used for critical documents, where image fidelity needs to match that of more conventional monochrome TIFF, GIF or PNG scans.

Extracting such an image takes 2 lines of code using 2 libraries (Poppler and jbig2dec):

poppler\bin>pdfimages -all 7535-7pt.pdf out

and a for loop (in this case over 001 to 81, for the 243 outputs) running something similar to

jbig2\Library\bin>jbig2dec -o out-001 -t pbm out-001.jb2g out-001.jb2e
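Since Python was mentioned in the comments, the loop could be sketched like this. It is only a sketch under assumptions: `build_commands` and `decode_all` are names I made up, it assumes `jbig2dec` is on the PATH, and it assumes `pdfimages -all` wrote each JBIG2 image as a `.jb2g`/`.jb2e` pair as in the command above. Unlike the command above, it gives the output file a `.pbm` extension.

```python
import subprocess
from pathlib import Path


def build_commands(folder, jbig2dec="jbig2dec"):
    """Return one jbig2dec argument list per .jb2g/.jb2e pair in folder."""
    commands = []
    for embedded in sorted(Path(folder).glob("*.jb2e")):
        globals_file = embedded.with_suffix(".jb2g")
        if not globals_file.exists():
            # An embedded stream without its globals is skipped in this sketch.
            continue
        output = embedded.with_suffix(".pbm")
        commands.append([jbig2dec, "-o", str(output), "-t", "pbm",
                         str(globals_file), str(embedded)])
    return commands


def decode_all(folder):
    """Run jbig2dec once per pair; raises CalledProcessError on failure."""
    for command in build_commands(folder):
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it easy to print and check the commands before anything is decoded.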

Metadata for the first 3 pages can be seen here (a poor 200 dpi equivalent was used):

23.01.0\Library\bin>pdfimages -list 7535-7pt.pdf  

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1184   832  gray    1   8  jpeg   no         6  0   100   100  554B 0.1%
   1     1 stencil  1967  1230  -       1   1  jbig2  no         8  0   200   200 7885B 2.6%
   2     2 image    1184   832  gray    1   8  jpeg   no        13  0   100   100  573B 0.1%
   2     3 stencil  1966  1200  -       1   1  jbig2  no        15  0   200   200 7415B 2.5%
   3     4 image    1184   832  gray    1   8  jpeg   no        19  0   100   100  552B 0.1%
   3     5 stencil  1967  1201  -       1   1  jbig2  no        21  0   200   200 7829B 2.7%
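If the listing is needed inside a script, its rows can be split on whitespace. The sketch below assumes the 16-column layout shown above; `parse_pdfimages_list` is a hypothetical helper name, and the column names are simply the ones from the header, with the two-part "object ID" split into `object` and `generation`.

```python
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "generation", "x_ppi", "y_ppi",
           "size", "ratio"]
NUMERIC = {"page", "num", "width", "height", "comp", "bpc",
           "object", "generation", "x_ppi", "y_ppi"}


def parse_pdfimages_list(text):
    """Turn the text printed by `pdfimages -list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed rule, and anything not shaped like a row.
        if len(parts) != len(COLUMNS) or not parts[0].isdigit():
            continue
        row = dict(zip(COLUMNS, parts))
        for key in NUMERIC:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def jbig2_rows(rows):
    """Keep only the images stored with JBIG2 encoding."""
    return [row for row in rows if row["enc"] == "jbig2"]
```

Filtering on the `enc` column separates the JBIG2 stencils from the JPEG background images on each page.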

The 81 PBMs will be a faithful copy of the poor, variable inputs, which typically look like this:

/MediaBox [0 0 842 596] /Rotate 270 
/Image
/BitsPerComponent 1
/Width 1967
/Height 1230
/ImageMask true
/Filter
/JBIG2Decode

The old 243 images can then be discarded (the PDF file should have been discarded anyway, and the paper source rescanned at a higher resolution), as the images are of no use except to show the errors as above.

K J