Following on from "Replace (cid:<number>) with chars using Python when extracting text from PDF files" (I can't add a comment there), I tried to convert the text below with @josefz's script, but I get unrecognisable strings that are not in the original PDF. The text was originally extracted with pdfplumber.

import re

def cidToChar(cidx):
    return chr(int(re.findall(r'\(cid\:(\d+)\)',cidx)[0]) + 29)

xx = '''(cid:50)(cid:54)(cid:47)(cid:48)(cid:49)(cid:47)(cid:50)(cid:48)(cid:50)(cid:50)(cid:32)(cid:49)(cid:48)(cid:58)(cid:52)(cid:48)(cid:97)(cid:109) (cid:50)(cid:48)(cid:50)(cid:50)(cid:48)(cid:49)(cid:49)(cid:48)(cid:57)(cid:57) (cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116)(cid:32)(cid:73)(cid:115)(cid:115)(cid:117)(cid:101) (cid:65)(cid:115)(cid:115)(cid:105)(cid:103)(cid:110)(cid:101)(cid:100)'''

for x in xx.split('\n'):
    if x != '' and x != '(cid:3)':         # merely to compact the output
        abc = re.findall(r'\(cid\:\d+\)', x)
        if len(abc) > 0:
            for cid in abc:
                x = x.replace(cid, cidToChar(cid))
        print(repr(x).strip("'"))

The output is unrecognisable:

OSLMNLOMOO=NMWQM~\x8a OMOOMNNMVV m~\x96\x8a\x82\x8b\x91=f\x90\x90\x92\x82 ^\x90\x90\x86\x84\x8b\x82\x81

Am I doing something wrong in the code above?
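
One observation on the sample: read directly as character codes, the CID numbers already decode to readable text (50 is '2', 54 is '6', 47 is '/'), which suggests that the +29 shift is what produces the garbage for this particular PDF. Below is a minimal sketch, assuming the CIDs in this font map straight to Unicode code points. That assumption does not hold for every PDF font, and the `decode_cids` helper and its `offset` parameter are hypothetical names used only to make the assumption explicit.

```python
import re

# Matches tokens such as "(cid:50)" and captures the number.
CID_PATTERN = re.compile(r'\(cid:(\d+)\)')

def decode_cids(text, offset=0):
    """Replace each (cid:N) token with chr(N + offset).

    offset=0 assumes the CID is already the code point (an assumption
    that fits the sample here); other fonts may need a different mapping.
    """
    return CID_PATTERN.sub(lambda m: chr(int(m.group(1)) + offset), text)

# First group of tokens from the sample in the question.
sample = ('(cid:50)(cid:54)(cid:47)(cid:48)(cid:49)(cid:47)(cid:50)(cid:48)'
          '(cid:50)(cid:50)(cid:32)(cid:49)(cid:48)(cid:58)(cid:52)(cid:48)'
          '(cid:97)(cid:109)')

print(decode_cids(sample))  # 26/01/2022 10:40am
```

Calling `decode_cids(sample, offset=29)` reproduces the start of the unreadable output shown above, which is consistent with the offset being the cause rather than the extraction itself.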

DaveC