
While processing a file with pdfminer (pdf2txt.py) I received empty output:

dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 

dan@work:~/project$ 

Can anybody tell me what is wrong with this file and what I can do to get data out of it?

Here's the output of `dumppdf.py docs/homericaeast.pdf`:

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
Danil
  • [This](http://stackoverflow.com/questions/17193839/where-can-i-a-mapping-of-identity-h-encoded-characters-to-ascii-or-unicode-chara) question/answer might help – Wokpak May 09 '17 at 16:48
  • @Daniel Just in case you want an alternative: the `pdftotext` utility gives good results and keeps the layout as well http://dpaste.com/3EV77FE – Aamir Rind May 10 '17 at 16:55
  • @J.Hollom `pdf2txt.py -d homericaeast.pdf` gives me empty result as well – Danil May 10 '17 at 17:05
  • @AamirAdnan I prefer to use pdfminer, because I already have a big project that uses pdfminer and I have to integrate the new code into it. But I'll have a look at pdftotext, thank you – Danil May 10 '17 at 17:09
  • @DanielM apologies, the `-d` flag is not relevant so have deleted that comment. I was able to get it to produce an output, although garbled, by exporting the original file as a pdf using Preview on Mac and then running `pdf2txt.py` – Wokpak May 11 '17 at 09:43

2 Answers


I have now fixed the problem with /OneByteIdentityH, handling it much like the existing code for the two-byte Unicode mapping /Identity-H. The patch is in PR #179.

hynekcer

The problem is that pdfminer doesn't understand the CMap that you are using in this PDF.
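One rough way to see which CMap names a PDF declares, without touching pdfminer at all, is to scan the raw bytes for `/CMapName` entries. This is only a sketch under assumptions: `find_cmap_names` is a name made up for illustration, and the scan only sees dictionaries stored uncompressed in the file, so it will miss anything packed inside compressed object streams.

```python
import re

# Matches "/CMapName /SomeName" in an uncompressed PDF dictionary.
# The name ends at whitespace or at any PDF delimiter character.
CMAP_NAME = re.compile(rb"/CMapName\s*/([^\s/<>\[\]()]+)")


def find_cmap_names(pdf_bytes: bytes) -> list:
    """Return the sorted, de-duplicated CMap names found in the raw bytes.

    Hypothetical helper for diagnosis only; it does not parse the PDF.
    """
    names = {m.group(1).decode("latin-1") for m in CMAP_NAME.finditer(pdf_bytes)}
    return sorted(names)


if __name__ == "__main__":
    import sys

    # Usage: python find_cmaps.py docs/homericaeast.pdf
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print(find_cmap_names(fh.read()))
```

If the list contains a name other than the usual predefined ones (for example something other than Identity-H), that is a hint the extractor may not know how to map it.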

If you do a custom build of pdfminer with STRICT=1 switched on in psparser.py, you'll get an error a bit like this:

pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>

I'm not hugely familiar with the code, but even allowing this through doesn't help, because pdfminer doesn't recognize the mapping (even if I hard-code the name to OneByteIdentityH and ask it to look that up). The net result is that the CMap contains no mappings, so it translates every character in your PDF to an empty string (well, None, if I'm being picky).

The fix is probably to create a mapping for this CMap that simply returns the character that was passed in, similar to the other Identity maps already implemented in cmapdb.py.
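To make the idea concrete, here is a minimal standalone sketch of what such an identity mapping could look like. It does not import pdfminer, and both class names are hypothetical, chosen only for illustration; the real classes in cmapdb.py may be shaped differently. The assumption is that a two-byte identity map reads each pair of bytes as one code, so the one-byte version would read each single byte as one code.

```python
import struct


class TwoByteIdentityCMap:
    """Hypothetical stand-in for an Identity-H style map: two bytes per code."""

    def decode(self, code: bytes) -> tuple:
        n = len(code) // 2  # number of complete two-byte codes
        # Big-endian unsigned shorts; a trailing odd byte is ignored.
        return struct.unpack(">%dH" % n, code[: 2 * n])


class OneByteIdentityCMap:
    """Hypothetical one-byte counterpart: every byte is its own code."""

    def decode(self, code: bytes) -> tuple:
        # Unsigned bytes, one code per input byte.
        return struct.unpack(">%dB" % len(code), code)


if __name__ == "__main__":
    print(OneByteIdentityCMap().decode(b"Hi"))
    print(TwoByteIdentityCMap().decode(b"\x00H\x00i"))
```

Both calls print `(72, 105)`: the same two codes, reached from one-byte and two-byte input respectively. A map like this never returns an empty result for non-empty input, which is the property the broken lookup was missing.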

Peter Brittain
  • I am pleased to note that the two of us have both taken part in exactly two questions about pdfminer: here and on [struct.error: unpack requires...](http://stackoverflow.com/questions/40158637/struct-error-unpack-requires-a-string-argument-of-length-16), where you in turn wrote a patch. – hynekcer May 15 '17 at 20:30