I am converting .pdf files into .xml files using PDFMiner.
For each character in the .pdf file, PDFMiner checks (among many other things) whether it can be mapped to Unicode. If it can, it returns the character; if it cannot, it raises an exception and returns the string "(cid:%d)", where %d is the character id, which I think is the Unicode decimal code point.
This is well explained in the edit to this question: "What is this (cid:51) in the output of pdf2txt?". I report the code here for convenience:
    def render_char(self, matrix, font, fontsize, scaling, rise, cid):
        try:
            text = font.to_unichr(cid)
            assert isinstance(text, unicode), text
        except PDFUnicodeNotDefined:
            text = self.handle_undefined_char(font, cid)

    def handle_undefined_char(self, font, cid):
        if self.debug:
            print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
        return '(cid:%d)' % cid
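To make the flow above concrete, here is a minimal sketch in Python 3 (the excerpt above is Python 2). ToyFont, ToyConverter, PatchedConverter and UnicodeNotDefined are hypothetical stand-ins that I made up for illustration; they are not PDFMiner's classes. The sketch only shows how the "(cid:%d)" fallback arises, and how overriding handle_undefined_char in a subclass might substitute a known character. That override is an assumption on my part, not something I have confirmed against the library.

```python
class UnicodeNotDefined(Exception):
    """Toy stand-in for PDFMiner's PDFUnicodeNotDefined."""


class ToyFont:
    """Hypothetical font that knows only a few character ids."""

    def __init__(self, table):
        self.table = table

    def to_unichr(self, cid):
        try:
            return self.table[cid]
        except KeyError:
            raise UnicodeNotDefined(cid)


class ToyConverter:
    """Mirrors the render_char / handle_undefined_char split."""

    def render_char(self, font, cid):
        try:
            return font.to_unichr(cid)
        except UnicodeNotDefined:
            return self.handle_undefined_char(font, cid)

    def handle_undefined_char(self, font, cid):
        return '(cid:%d)' % cid


class PatchedConverter(ToyConverter):
    """Assumed workaround: substitute known cids before giving up."""

    FALLBACK = {160: '\u00a0'}  # cid 160 -> NO-BREAK SPACE (my assumption)

    def handle_undefined_char(self, font, cid):
        if cid in self.FALLBACK:
            return self.FALLBACK[cid]
        return super().handle_undefined_char(font, cid)


font = ToyFont({65: 'A'})  # this toy font has no entry for cid 160
plain = [ToyConverter().render_char(font, c) for c in (65, 160)]
patched = [PatchedConverter().render_char(font, c) for c in (65, 160, 200)]
```

With the toy font, the plain converter falls back to the "(cid:160)" string, while the patched one returns the non-breaking space and still falls back for ids it does not know.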
I usually get this exception for .pdf files written in Cyrillic. However, there is one file in plain English where I get this exception for non-breaking spaces (which have cid=160). I do not understand why this character is not recognised as Unicode while all the others in the same file are.
If, in the same environment, I run isinstance(u'160', unicode) in the console, I get True, while an (apparently) equivalent command returns False when it is run inside PDFMiner.
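One caveat about that console check, sketched in Python 3 (where str plays the role of Python 2's unicode, and chr the role of unichr): the literal '160' is a three-character string of digits, so the isinstance test on it says nothing about the character whose code point is 160. The variable names below are only illustrative.

```python
literal = '160'   # the three characters '1', '6', '0'
nbsp = chr(160)   # the single character at code point 160

# Both values are text, so an isinstance check passes for either one.
both_are_text = isinstance(literal, str) and isinstance(nbsp, str)

# But they are different strings of different lengths.
lengths = (len(literal), len(nbsp))
same_value = (literal == nbsp)
```

So the console result and the behaviour inside PDFMiner may not be testing the same thing: in the excerpt above, the assert is only reached when to_unichr returns a value, and is skipped when it raises.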
If I debug, I see that the font is properly recognised, i.e. I get:
    cid = 160
    font = <PDFType1Font: basefont='Helvetica'>
PDFMiner accepts the codec as a parameter. I have chosen utf-8. In Unicode, the non-breaking space has decimal code point 160, i.e. U+00A0 (http://dev.networkerror.org/utf8/).
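To separate the code point from its encoding, here is a short Python 3 check that uses only the standard library. It shows that 160 identifies the character itself, while utf-8 only determines which bytes are written out. My assumption, which I have not verified in the PDFMiner source, is therefore that the codec choice is unlikely to decide whether cid 160 gets mapped at all.

```python
import unicodedata

nbsp = chr(160)                    # code point 160, i.e. U+00A0
name = unicodedata.name(nbsp)      # official Unicode character name

utf8_bytes = nbsp.encode('utf-8')      # two bytes in UTF-8
latin1_bytes = nbsp.encode('latin-1')  # one byte in Latin-1
```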
If it might help, here is the code for to_unichr:
    def to_unichr(self, cid):
        if self.unicode_map:
            try:
                return self.unicode_map.get_unichr(cid)
            except KeyError:
                pass
        try:
            return self.cid2unicode[cid]
        except KeyError:
            raise PDFUnicodeNotDefined(None, cid)
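The two-stage lookup in to_unichr can be imitated with plain dictionaries. This is a Python 3 sketch with a hypothetical ToyFont. The attribute names unicode_map and cid2unicode are taken from the code above, but here they are ordinary dicts, and whether the real font object lets me add entries in the same way is an assumption I still need to check.

```python
class UnicodeNotDefined(Exception):
    """Toy stand-in for PDFMiner's PDFUnicodeNotDefined."""


class ToyFont:
    """Hypothetical font using the same lookup order as to_unichr."""

    def __init__(self, cid2unicode, unicode_map=None):
        self.unicode_map = unicode_map    # first choice, may be absent
        self.cid2unicode = cid2unicode    # second choice

    def to_unichr(self, cid):
        if self.unicode_map:
            try:
                return self.unicode_map[cid]  # toy: dict lookup, not get_unichr
            except KeyError:
                pass
        try:
            return self.cid2unicode[cid]
        except KeyError:
            raise UnicodeNotDefined(cid)


def lookup(font, cid):
    """Return the mapped text, or None when neither map knows the cid."""
    try:
        return font.to_unichr(cid)
    except UnicodeNotDefined:
        return None


font = ToyFont({65: 'A'})          # no entry for cid 160 in either map
before = lookup(font, 160)         # both lookups miss

font.cid2unicode[160] = '\u00a0'   # assumed fix: add the missing entry
after = lookup(font, 160)
```

If this mirrors what happens in my file, the exception would mean that neither map of the Helvetica font has an entry for 160, so the font's encoding tables may be the place to investigate.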
Is there a way to set/change the character map recognised by the code?
What do you think I should change, or where do you think I should investigate, so that cid=160 does not raise the Exception?