Following on from "Replace (cid:<number>) with chars using Python when extracting text from PDF files" (I can't add a comment there), I tried to convert the text below with @josefz's script, but I get unrecognisable strings that are not in the original PDF. The data was originally extracted with PDFPlumber.
import re

def cidToChar(cidx):
    # Take the number out of a "(cid:N)" token and shift it by 29, as in @josefz's script
    return chr(int(re.findall(r'\(cid\:(\d+)\)', cidx)[0]) + 29)

xx = '''(cid:50)(cid:54)(cid:47)(cid:48)(cid:49)(cid:47)(cid:50)(cid:48)(cid:50)(cid:50)(cid:32)(cid:49)(cid:48)(cid:58)(cid:52)(cid:48)(cid:97)(cid:109) (cid:50)(cid:48)(cid:50)(cid:50)(cid:48)(cid:49)(cid:49)(cid:48)(cid:57)(cid:57) (cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116)(cid:32)(cid:73)(cid:115)(cid:115)(cid:117)(cid:101) (cid:65)(cid:115)(cid:115)(cid:105)(cid:103)(cid:110)(cid:101)(cid:100)'''

for x in xx.split('\n'):
    if x != '' and x != '(cid:3)':  # merely to compact the output
        abc = re.findall(r'\(cid\:\d+\)', x)  # every (cid:N) token on this line
        if len(abc) > 0:
            for cid in abc:
                x = x.replace(cid, cidToChar(cid))
        print(repr(x).strip("'"))
The output is unrecognisable:

OSLMNLOMOO=NMWQM~\x8a OMOOMNNMVV m~\x96\x8a\x82\x8b\x91=f\x90\x90\x92\x82 ^\x90\x90\x86\x84\x8b\x82\x81
Am I doing something incorrect in the above?
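
One thing that may be worth checking is the hard-coded `+ 29` shift. The cid numbers in this sample look like plain ASCII codes already (50 is `2`, 47 is `/`, 80 is `P`), so the shift that suited the PDF in the linked question may not suit this one. The mapping from cid to character depends on the font, so no single offset is guaranteed to work for every file. Below is a minimal sketch with the offset exposed as a parameter; `decode_cids`, `CID_PATTERN` and `offset` are hypothetical names introduced here for illustration and are not part of @josefz's script or of PDFPlumber.

```python
import re

# Matches one "(cid:N)" token and captures the number N.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")

def decode_cids(text, offset=0):
    """Replace every (cid:N) token in text with chr(N + offset).

    offset is an assumption about the font: 0 treats N as a code point
    directly, 29 reproduces the shift used in the script above.
    """
    return CID_PATTERN.sub(lambda m: chr(int(m.group(1)) + offset), text)

sample = ("(cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116)"
          "(cid:32)(cid:73)(cid:115)(cid:115)(cid:117)(cid:101)")

print(decode_cids(sample, offset=0))         # Payment Issue
print(repr(decode_cids(sample, offset=29)))  # same garbage as in the output above
```

With `offset=0` this fragment of the sample reads as "Payment Issue", while `offset=29` gives the same `m~\x96\x8a...` bytes shown in the output, which suggests the shift rather than the regex is what produces the unreadable text here.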