7

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

pdf2txt.py 2.pdf 

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

A similar file (1.pdf) doesn't cause any problem.

I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?


Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.

$ pdf2txt.py 2.pdf
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
jacefarm
Danil
  • I suspect you've hit the end of the file before the parser expected it due to a bug. Try running [dumppdf.py](https://euske.github.io/pdfminer/#dumppdf) instead and see if there is obviously bad data just before this error. – Peter Brittain Oct 25 '16 at 16:19
  • this is what I get https://gist.github.com/danmash/a8b42f72787ca0c329a0b2c2ae6aeea3 – Danil Oct 26 '16 at 08:34
  • I think you also want to use the `-a` option too... – Peter Brittain Oct 26 '16 at 09:16
  • so.. can you explain what can I do with this [dump](https://gist.github.com/danmash/d1f4e41385e71c49382e0cfb171ee857) ? – Danil Oct 26 '16 at 10:04
  • Looking at the stack trace, you can see that it died processing a font. There are only 2 of that type in the dump and both the streams used by these fonts are present, so it's not obvious what's wrong. – Peter Brittain Oct 26 '16 at 15:41
  • That said, your code references StringIO, which means it is at least 2 years old... Have you tried updating? – Peter Brittain Oct 26 '16 at 15:42
  • Do you mean updating the pdf2txt.py tool? I tried to [install pdfminer](https://github.com/euske/pdfminer#how-to-install) directly from the GitHub repository, but received the same error – Danil Oct 27 '16 at 09:25
  • Did the stack trace still include the StringIO reference? If so, your install failed... – Peter Brittain Oct 27 '16 at 09:27
  • sorry, you're right. I get a similar error with `BytesIO` instead of `StringIO`. I updated my question. – Danil Oct 27 '16 at 09:38
  • Ok - I think I've found the bug. I suggest you link my answer to this issue: https://github.com/euske/pdfminer/issues/144 – Peter Brittain Oct 29 '16 at 16:22
  • The file [1.pdf](https://yadi.sk/i/Z37JK5S9xZeoX) has only invalid `/Info` object (number 5) that is luckily unused by pdfminer, so no problem. – hynekcer Oct 30 '16 at 02:48

6 Answers

5

TL;DR

Thanks to @mkl and @hynekcer for the extra info. With that I can confirm that there is a bug in both pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...

So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynekcer.

The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.
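The idea behind the fix, treating the next obj header as an implicit endobj, can be sketched in isolation. This is not the actual patch and not pdfminer's parser: split_indirect_objects is a hypothetical helper written for Python 3, and it assumes that the pattern "N G obj" never occurs inside a string or stream (which happens to hold for 2.pdf).

```python
import re

# Matches the header of an indirect object, e.g. b"18 0 obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def split_indirect_objects(data):
    """Map (objid, genno) -> raw body bytes, tolerating a missing endobj.

    Simplification: assumes "N G obj" never appears inside a string or stream.
    """
    headers = list(OBJ_HEADER.finditer(data))
    objects = {}
    for i, match in enumerate(headers):
        start = match.end()
        # The body can extend no further than the next object header.
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        end = data.find(b"endobj", start, limit)
        if end == -1:
            # Broken object: treat the next header as an implicit endobj.
            end = limit
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = data[start:end].strip()
    return objects


sample = (
    b"18 0 obj\n<< /Length 4 >>\nstream\nFONT\nendstream\n"  # endobj missing
    b"17 0 obj\n<< /Length 4 >>\nstream\nCMAP\nendstream\nendobj\n"
)
objects = split_indirect_objects(sample)
# Object 18 keeps its own stream and does not swallow object 17's data.
print(b"FONT" in objects[(18, 0)], b"CMAP" in objects[(18, 0)])
```

With this boundary rule each object ID resolves to its own stream, which is the property the fix has to restore.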

Peter Brittain
  • Are you sure about your analysis of the PDF? I inspected 2.pdf using Adobe Acrobat Preflight, in particular the object 18, and it looks like [this](https://i.stack.imgur.com/rvaLQ.png), i.e. in particular the contents clearly look like a font file. Using Preflight to check for PDF syntax errors, it merely warns about missing **FontName** entries... – mkl Oct 28 '16 at 22:42
  • Interesting... I used dumppdf. Maybe there's a bug in its stream handling, which is also affecting pdfminer? – Peter Brittain Oct 28 '16 at 23:58
  • @mkl OK - so digging in to the stream parsing, I see that it always returns the last stream, no matter what ID was requested. This is a bug. I'll dig a little more and update my answer... – Peter Brittain Oct 29 '16 at 10:32
  • Ah, the joy of bug hunting... ;) – mkl Oct 29 '16 at 12:56
  • Yes it is simple enough and correct if a new indirect object immediately follows the missing endobj. It could be a little problematic for it to be accepted as a patch for pdfminer if a more general solution becomes necessary later, but for the current state of pdfminer it seems good enough. I added the word "broken" to the patch, in case it must be refactored later. – hynekcer Oct 31 '16 at 14:55
  • @PeterBrittain can you make pull request to pdfminer repository? – Danil Oct 31 '16 at 16:26
4

I fixed your problem in the source code, and tried it on your file 2.pdf to make sure it worked.

In the file pdffont.py I replaced:

class TrueTypeFont(object):

    class CMapNotFound(Exception):
        pass

    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
            self.tables[name] = (offset, length)
        return

with this:

class TrueTypeFont(object):

    class CMapNotFound(Exception):
        pass

    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            fp_bytes = fp.read(16)
            if len(fp_bytes) < 16:
                break
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
            self.tables[name] = (offset, length)
        return

Explanations

@Nabeel Ahmed was right:

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So, the problem can only be with the buffer stream it's reading, i.e. the content of your specific PDF file.

In the code we see that fp.read(16) is called in a loop without any check. Thus, we don't know for sure whether it successfully read all 16 bytes. It could, for instance, have reached EOF.

To avoid this problem, I just break out of the for loop when a short read occurs.

    for _ in xrange(ntables):
        fp_bytes = fp.read(16)
        if len(fp_bytes) < 16:
            break

In the normal case, where every table record is complete, this changes nothing.
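To see the guard working outside pdfminer, here is a minimal standalone sketch of the same loop. It is written for Python 3 (range instead of xrange), read_table_directory is a hypothetical function name, and the extra length check on the 8-byte header is my own addition rather than part of the change above.

```python
import struct
from io import BytesIO


def read_table_directory(fp):
    """Read TrueType table records, stopping early at a truncated record."""
    tables = {}
    fp.read(4)  # font type / sfnt version, unused in this sketch
    header = fp.read(8)
    if len(header) < 8:
        return tables  # not even a complete header (assumed extra guard)
    (ntables, _1, _2, _3) = struct.unpack('>HHHH', header)
    for _ in range(ntables):
        record = fp.read(16)
        if len(record) < 16:
            break  # stream ended before the declared number of tables
        (name, tsum, offset, length) = struct.unpack('>4sLLL', record)
        tables[name] = (offset, length)
    return tables


# The header announces 2 tables, but only one complete record follows.
header = b'\x00\x01\x00\x00' + struct.pack('>HHHH', 2, 0, 0, 0)
record = struct.pack('>4sLLL', b'cmap', 0, 28, 100)
truncated = BytesIO(header + record + b'\x00\x00\x00')
print(read_table_directory(truncated))
```

Keep in mind that a guard like this only stops the crash. If the bytes handed to it are not the font file at all, the resulting table directory is still meaningless.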

I will try to open a pull request on GitHub, but I'm not sure it will be accepted, so I suggest you monkey-patch for now and modify your /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py file directly.

Kruupös
4

This really is an invalid PDF, because the endobj keyword is missing after three indirect objects (objects 5, 18 and 22).

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj. (chapter 7.3.10 in PDF reference)

The example 2.pdf is a simple PDF 1.3 file that uses a plain uncompressed cross-reference table and uncompressed object separators. The failure is easy to find with grep and a general file viewer: the PDF has 22 indirect objects, and the pattern " obj" is found exactly 22 times (fortunately never by accident inside a string object or a stream), but the keyword endobj is missing three times.

$ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
...
18 0 obj
<< /Length 451967/Length1 451967/Filter [/FlateDecode] >> 
stream
...
endstream                 % # see the missing "endobj" here
17 0 obj
<< /Length 12743 /Filter [/FlateDecode] >> 
stream
...
endstream
endobj
...

Similarly the object 5 has no endobj before object 1 and the object 22 has no endobj before object 21.
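The same check can be scripted. The following is a rough Python 3 sketch, not a PDF parser: objects_missing_endobj is a hypothetical helper, and like the grep above it assumes that " obj" and "endobj" never occur by accident inside a string or a compressed stream.

```python
import re

# Header of an indirect object: object number, generation number, "obj".
OBJ_START = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def objects_missing_endobj(data):
    """Return object numbers with no endobj before the next object header."""
    headers = list(OBJ_START.finditer(data))
    missing = []
    for i, match in enumerate(headers):
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        if data.find(b"endobj", match.end(), limit) == -1:
            missing.append(int(match.group(1)))
    return missing


# Tiny made-up body that mimics the layout described above.
sample = (
    b"5 0 obj\n<< /Producer (x) >>\n"                    # endobj missing
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"22 0 obj\n<< /Length 0 >>\nstream\n\nendstream\n"  # endobj missing
    b"21 0 obj\n<< >>\nendobj\n"
)
print(objects_missing_endobj(sample))  # [5, 22]
```

To run it on a real file, read the file in binary mode and pass the bytes to the function.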

It is known that broken cross-references in a PDF can, and usually should, be reconstructed from the obj/endobj keywords (see the PDF reference, chapter C.2). Some applications probably do the reverse and repair a missing endobj when the cross-references are correct, but no written advice recommends it.

hynekcer
  • Good catch. So there are issues both in pdfminer and the pdf... ;) – mkl Oct 30 '16 at 07:57
  • @mkl Do you think that a rule can be quoted from the documentation to explain that pdfminer is not a "conforming reader" according to the PDF reference? I would like to write a patch, but I know that some strange implementations are correct and should not be fixed. I see a [discussion](https://feliam.wordpress.com/2010/08/14/pdf-a-broken-spec/#div-comment-122) with Leonard Rosenthol - PDF Standards Architect in Adobe - "... as long as the object that it points to is valid then it can be anywhere – even in the middle of an uncompressed stream" – hynekcer Oct 30 '16 at 22:14
  • @hynekcer I don't think we need to fall back on xrefs. pdfminer is actually parsing at the (indirect) object level and so only needs to know when one ends. Since (I believe) indirect objects cannot be nested, you can detect and use the next obj tag to be an implicit endobj. Coding that up has worked on these 2 files for me. – Peter Brittain Oct 30 '16 at 23:04
  • Can you send a link to a working branch to check it? I can find some counterexamples for tests until it is ok. I think to create some conditional breaks to terminate the main loop `while not self.results:` in psparser.PSStackParser.nextobject(). Indirect object can contain indirect objects and it is not a problem because the referencing is postponed until you call the `resolve()` method of the indirect object. The possibilities are `any_object [ comment | whitespace ]* [ stream .. endstream ] [ comment | whitespace ]* { endobj | other_token }`. Other token denotes a lost endobj. – hynekcer Oct 31 '16 at 00:35
  • @hynekcer Wouldn't an indirect object contain a reference to another indirect object? In which case, using the obj tag would be fine as it would just be another entry in the sequence in the PDF body (and not nested)? Anyway, the code is pasted into my answer: http://stackoverflow.com/a/40295837/4994021 – Peter Brittain Oct 31 '16 at 07:56
  • First of all a conforming pdf processor **must** use the information from the cross references. In particular if there are multiple objects with the same object and generation number, the "right" one is not necessarily the last one in the file (even though it usually is). Furthermore, even if there is only a single occurrence of an object with a given object and generation number, that object has to be ignored if it is not in the cross references or if it has been marked deleted. And there are certain extreme cases which are ambiguous without the references... – mkl Oct 31 '16 at 09:04
  • @mkl Sorry - I wasn't clear. Of course pdfminer uses the cross-references. However, it also extracts indirect objects from the file as required. The issue is with the latter processing. It simply did not recognize the end of the object and so pulled out more data than it should (and then used the wrong stream because of that). – Peter Brittain Oct 31 '16 at 11:00
  • Ah OK. In that case how about using the length stream dictionary entry? – mkl Oct 31 '16 at 14:36
  • @mkl The length of the stream is clear. There was never a problem with the length of the stream. Between the keywords "obj" and "endobj" are two objects of any type and an additional two integer objects and the keyword "obj". The PDF structure is based on Postscript, among other things. Operands in Postscript are pushed on the stack. Then should come an operator (e.g. "endobj") that consumes the data. The behavior of pdfminer was logical in that the last object on the stack was used, but we agree that "endobj" should not be used like a simple Postscript operator in a PS processor. (I accept Peter Brittain's solution. The world is.. – hynekcer Oct 31 '16 at 21:45
  • ... the world is not perfect.) – hynekcer Oct 31 '16 at 21:46
2

The last error message tells you a lot:

  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

You can easily debug what is going on, for example by putting debug statements directly into the pdffont.py file. My guess is that there is something special about your PDF's contents. Judging by the name of the class that throws the error, TrueTypeFont, there is some incompatibility with the font type.
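One debug check that is cheap to add is to look at the first four bytes of the data before it is parsed as a font. The sketch below is a hypothetical helper, not part of pdfminer; it only compares against the well-known TrueType/OpenType signatures, so treat a negative result as a hint rather than proof.

```python
# Leading 4-byte signatures of TrueType/OpenType font data.
SFNT_SIGNATURES = (
    b'\x00\x01\x00\x00',  # TrueType outlines (version 1.0)
    b'true',              # Apple TrueType
    b'OTTO',              # OpenType with CFF outlines
    b'ttcf',              # TrueType collection
)


def looks_like_font_file(data):
    """Rough check: does the data start like a TrueType/OpenType file?"""
    return data[:4] in SFNT_SIGNATURES


print(looks_like_font_file(b'\x00\x01\x00\x00\x00\x0c'))
print(looks_like_font_file(b'/CIDInit /ProcSet findresource begin'))
```

Printing the result of such a check, together with the first few bytes themselves, right before the font is constructed shows quickly whether the stream being parsed is a font at all.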

Jacobian
2

Let's start by explaining the statement where you're getting the exception:

struct.unpack('>4sLLL', fp.read(16))

where the synopsis is:

struct.unpack(fmt, buffer)

The unpack method unpacks from the buffer buffer (presumably packed earlier by pack(fmt, ...)) according to the format string fmt. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().

The most common cause of this error is passing the wrong number of bytes for the format used. For example, for a format expecting 4 bytes, you might supply only 3:

(a, b) = struct.unpack('BH', fp.read(3))

for this you'll get

struct.error: unpack requires a string argument of length 4

The reason: the format 'BH' expects 4 bytes, i.e. when we pack something using the 'BH' format it occupies 4 bytes of memory (with native alignment, one pad byte sits between the 1-byte B and the 2-byte H). A good explanation here.


To clarify further, let's look at the >4sLLL format string and verify the size unpack expects for the buffer (the bytes you're reading from the PDF file). Quoting from the docs:

The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().

>>> import struct 
>>> struct.calcsize('>4sLLL')
16
>>> 

To this point we can say there's nothing wrong with the statement:

(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.
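A short read is easy to reproduce without any PDF. The following Python 3 sketch uses a made-up 19-byte buffer, so the first read returns a full 16-byte record and the second returns only 3 bytes, which is the situation unpack complains about. The wording of the error message differs between Python versions, so the sketch prints whatever message it gets.

```python
import struct
from io import BytesIO

fmt = '>4sLLL'
size = struct.calcsize(fmt)  # 16

# 19 bytes: one complete 16-byte record followed by 3 stray bytes.
fp = BytesIO(b'cmap' + b'\x00' * 12 + b'abc')
first = fp.read(size)
second = fp.read(size)  # read() returns fewer bytes at end of stream
print(len(first), len(second))

struct.unpack(fmt, first)  # fine: exactly 16 bytes
try:
    struct.unpack(fmt, second)
    failed = False
except struct.error as exc:
    failed = True
    print('struct.error:', exc)
```

The point is that read(16) asks for at most 16 bytes; it never guarantees them.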


It may be a bug, as per this comment:

This is a bug in the upstream PDFminer by @euske There seems to be patches for this so it should be an easy fix. Beyond this I also need to strengthen the pdf parsing such that we never error out from a failed parse

I'll edit this answer if I find something helpful to add here, such as a solution or a patch.

Nabeel Ahmed
1

In case you still get struct errors after applying Peter's patch, especially when parsing many files in a single run of a script (using os.listdir), try turning off the resource manager's caching.

rsrcmgr = PDFResourceManager(caching=False)

It helped me get rid of the remaining errors after applying the solutions above.
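For context, a batch run with caching turned off might look roughly like the sketch below. It assumes Python 3 with a pdfminer build that provides the usual PDFPage, PDFPageInterpreter and TextConverter classes (for example the pdfminer.six fork, which is my assumption, not something stated above). The function names pdf_paths, extract_text and extract_all are hypothetical.

```python
import os
from io import StringIO


def pdf_paths(directory):
    """Return the sorted paths of all .pdf files directly inside directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith('.pdf')
    )


def extract_text(path):
    """Extract text from one PDF with resource caching disabled."""
    # Imported lazily so pdf_paths() works even without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager(caching=False)
    out = StringIO()
    device = TextConverter(rsrcmgr, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return out.getvalue()


def extract_all(directory):
    """Map each PDF path to its text, or to the exception it raised."""
    results = {}
    for path in pdf_paths(directory):
        try:
            results[path] = extract_text(path)
        except Exception as exc:  # one bad file should not stop the run
            results[path] = exc
    return results
```

Creating a fresh resource manager per file, as extract_text does, also keeps state from one PDF from leaking into the next.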

murnko