Why is character ID 160 not recognised as Unicode in PDFMiner?

Question

I am converting .pdf files into .xml files using PDFMiner.

For each character in the .pdf file, PDFMiner checks whether it is Unicode or not (among many other things). If it is, it returns the character; if it is not, it raises an exception and returns the string "(cid:%d)", where %d is the character id, which I think is the Unicode decimal.

This is well explained in the edit part of this question: "What is this (cid:51) in the output of pdf2txt?" I report the code here for convenience:

def render_char(self, matrix, font, fontsize, scaling, rise, cid):
    try:
        text = font.to_unichr(cid)
        assert isinstance(text, unicode), text
    except PDFUnicodeNotDefined:
        text = self.handle_undefined_char(font, cid)


def handle_undefined_char(self, font, cid):
    if self.debug:
        print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
    return '(cid:%d)' % cid

I usually get this exception for .pdf files written in Cyrillic. However, there is one file that uses plain English and where I get this exception for non-breaking spaces (that have cid=160). I do not understand why this character is not recognised as Unicode, while all others in the same file are.

If, in the same environment, I run isinstance(u'160', unicode) in the console I get True, while an (apparently) equivalent command returns False when it is run inside PDFMiner.

If I debug, I see that the font is properly recognised, i.e. I get:

cid = 160
font =  <PDFType1Font: basefont='Helvetica'>

PDFMiner accepts the codec as a parameter. I have chosen utf-8, which has 160 as the Unicode decimal for the non-breaking space (http://dev.networkerror.org/utf8/).

If it might help, here is the code for to_unichr:

def to_unichr(self, cid):
    if self.unicode_map:
        try:
            return self.unicode_map.get_unichr(cid)
        except KeyError:
            pass
    try:
        return self.cid2unicode[cid]
    except KeyError:
        raise PDFUnicodeNotDefined(None, cid)

Is there a way to set/change the character map recognised by the code?

What do you think I should change, or where do you think I should investigate, so that cid=160 does not raise the Exception?

Here is the [test file](https://db.tt/dyK5S7Vw). When running `pdf2txt.py -o cidIssue_160_4times.xml cidIssue_160_4times.pdf` I get the string "(cid:160)" 4 times. They can be found in the .xml file at lines: 4264, 4266, 4269, 4272. — Luca, Dec 06 '15 at 10:39

score 2 · Accepted Answer · answered Dec 07 '15 at 16:37

The font in question in the sample document is a Simple Font and uses WinAnsiEncoding. This encoding is defined in the PDF specification ISO 32000-1 as one of four special encodings in a table in Annex D.2 Latin Character Set and Encodings. This table does not contain an entry for 240 in the WIN column. (The table entries are given as octal numbers, so 240 there is decimal 160.)

This table is extracted as the ENCODING array in latin_enc.py. From this array, maps for those four encodings are generated in encodingdb.py, which are then used, e.g., for fonts with that very encoding, cf. PDFSimpleFont in pdffont.py.

Thus, the code 160 is not recognized by PDFMiner as having any associated character in WinAnsiEncoding. This causes your problem.
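The failure can be reproduced in miniature. This is only a sketch: `UnicodeNotDefined`, `CID2UNICODE`, and `render` are hypothetical stand-ins for PDFMiner's exception, its per-font code-to-character map, and its `render_char` logic, and the map holds just three sample entries. The point it illustrates is that the cid is looked up as a code in the font's encoding, so a missing table entry produces the placeholder even though U+00A0 is a perfectly valid Unicode character.

```python
class UnicodeNotDefined(Exception):
    """Hypothetical stand-in for pdfminer's PDFUnicodeNotDefined."""


# Hypothetical excerpt of a code -> character map for WinAnsiEncoding.
# As in the table discussed above, there is no entry for 160.
CID2UNICODE = {32: " ", 65: "A", 97: "a"}


def to_unichr(cid):
    """Look the code up, raising when the map has no entry for it."""
    try:
        return CID2UNICODE[cid]
    except KeyError:
        raise UnicodeNotDefined(cid)


def render(cid):
    """Return the character, or the '(cid:%d)' placeholder on failure."""
    try:
        return to_unichr(cid)
    except UnicodeNotDefined:
        return "(cid:%d)" % cid


print(render(65))   # known code
print(render(160))  # code missing from the map
```

The codec passed to PDFMiner only affects how the output is written; it plays no part in this lookup.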

Looking only at the table, that seems correct, but if one reads the notes below the table, one finds:

The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE.

This seems to have been overlooked during PDFMiner development.
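Since the specification gives these codes in octal while cids are reported in decimal, the conversion is easy to check. The variable names below are illustrative only.

```python
# Codes from the note in the specification, written there in octal
win_nbsp = int("240", 8)  # WinAnsiEncoding
mac_nbsp = int("312", 8)  # MacRomanEncoding

print(win_nbsp, mac_nbsp)

# And back again, from decimal to Python's octal notation
print(oct(160), oct(202))
```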

This oversight might be fixed by adding a second entry for space

('nbspace', None, 202, 160, None)

to the ENCODING array (which uses decimal numbers); if you prefer, you might want to use the name space instead of nbspace.

(I say might because I'm not into Python programming and, therefore, cannot check, in particular not for unwanted side effects.)
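A minimal sketch of how such a row would feed into a code-to-character map. It is not PDFMiner's actual code: `GLYPH_TO_UNICODE`, `ENCODING`, and `build_map` are hypothetical miniatures of glyphlist.py, latin_enc.py, and the map generation in encodingdb.py, and the column order (name, std, mac, win, pdf) is assumed from the proposed tuple.

```python
# Hypothetical miniature of the glyph name -> Unicode table
GLYPH_TO_UNICODE = {"A": "A", "space": " ", "nbspace": "\u00a0"}

# Hypothetical miniature of the encoding table.
# Assumed column order: (name, std, mac, win, pdf), decimal codes.
ENCODING = [
    ("A", 65, 65, 65, 65),
    ("space", 32, 32, 32, 32),
]

MAC, WIN = 2, 3  # column indexes under the assumed order


def build_map(rows, column):
    """Build a code -> character map for one encoding column."""
    mapping = {}
    for row in rows:
        name, code = row[0], row[column]
        if code is not None:  # None means "not in this encoding"
            mapping[code] = GLYPH_TO_UNICODE[name]
    return mapping


patched = ENCODING + [("nbspace", None, 202, 160, None)]

before = build_map(ENCODING, WIN)
after = build_map(patched, WIN)

print(160 in before)     # the code is unknown without the extra row
print(repr(after[160]))  # the code resolves once the row is added
```

Because the new row has `None` in the std and pdf columns, only the MacRomanEncoding and WinAnsiEncoding maps gain an entry in this sketch.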

Great answer, thank you! **encodingdb.py** takes the name of the character from **latin_enc.py** and the Unicode value for that name from **glyphlist.py**. So I believe that if one adds the nbspace line to latin_enc.py, then one should also add the following line to glyphlist.py: `'nbspace': u'\xa0'`. "space" is already present in the encoding table (with cid = 32 for the four encodings), so I think it is better to treat them as two separate entries. — Luca, Dec 08 '15 at 09:58
Actually, **glyphlist.py** already contains the key 'nbspace' with the value u'\u00A0'. — Luca, Dec 08 '15 at 10:10
Actually, I only proposed adding the `('nbspace', None, 202, 160, None)` line after I found `nbspace` in **glyphlist.py**; before that I wanted to propose adding `('space', ...)` — mkl, Dec 08 '15 at 14:23

Sue Dunham · Answer 2 · 2019-12-11T17:48:02.417

One solution that works for me, for similar characters in a different file, is to use ftfy.fix_text(). I was drawn to this package for fixing mojibake baked into a PDF's Unicode text, basically the typical curly-quote hijinks between different encodings. pdfminer reported them as "(cid:146)", etc., but I wanted to clean them up further. This class works on that one file so far; it includes the minimum to make it print something, but there would probably be more pdfminer elements in a working module. If one is using pdf2txt.py, perhaps one could put a copy somewhere safe, redirect the pdfminer.high_level.extract_text_to_fp(fp, **locals()) line to a safe copy of that module, tack this class onto the end of that, and swap it in for the base class it inherits from. I've only done the HTMLConverter, but the other converters could probably be handled similarly.

from io import BytesIO

import ftfy
import six
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LTChar
from pdfminer.pdffont import PDFUnicodeNotDefined
from pdfminer.pdfinterp import PDFResourceManager


class HTMLConvertOre(HTMLConverter):
    # The inherited HTMLConverter.__init__ is used unchanged.

    def render_char(self, matrix, font, fontsize, scaling, rise, cid, ncs,
                    graphicstate):
        """Mod invoking ftfy.fix_text() to possibly rescue bad cids."""
        try:
            text = font.to_unichr(cid)
            assert isinstance(text, six.text_type), str(type(text))
        except PDFUnicodeNotDefined:
            try:
                # Treat the cid as a code point and let ftfy repair it.
                text = ftfy.fix_text(six.unichr(cid), uncurl_quotes=False)
                assert isinstance(text, six.text_type), str(type(text))
                cid = ord(text)  # TypeError unless exactly one character
            except (TypeError, ValueError):
                # Nothing usable came back: fall back to "(cid:%d)".
                text = self.handle_undefined_char(font, cid)
        textwidth = font.char_width(cid)
        textdisp = font.char_disp(cid)
        item = LTChar(matrix, font, fontsize, scaling, rise, text, textwidth,
                      textdisp, ncs, graphicstate)
        self.cur_item.add(item)
        return item.adv


if __name__ == '__main__':
    rsrcmgr = PDFResourceManager()
    outfp = BytesIO()
    device = HTMLConvertOre(rsrcmgr, outfp)
    print(device)

score -1 · Answer 3 · answered Oct 14 '19 at 07:52

For those who get the above error, the code below might help.

import minecart
from PIL import Image
import io

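# Open the file in a context manager so it is closed afterwards
with open('sample.pdf', 'rb') as pdffile:
    doc = minecart.Document(pdffile)

    for page in doc.iter_pages():
        if not page.images:
            continue  # skip pages without images
        im = page.images[0]  # taking only one image per page
        byteArray = im.obj.get_data()
        image = Image.open(io.BytesIO(byteArray))
        image.show()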

Hope it helps!!

Please refer to https://github.com/felipeochoa/minecart/issues/16 .

Modem Rakesh goud

What has this got to do with characters? – snakecharmerb Oct 14 '19 at 08:06
please refer to https://github.com/felipeochoa/minecart/issues/16 – Modem Rakesh goud Oct 14 '19 at 10:33
I did. The github issue, and your code, are about extracting images. This question is about how pdfminer handles a particular character. I fail to see the connection. – snakecharmerb Oct 14 '19 at 10:40
