Using PyPDF2, I am able to parse the content of a PDF, but for some PDFs I get the following data:
Decoded Data:
...
BT
310.00 621.52 TD
/F1 12 Tf
[<0030004B004E004B005000460003003000430046004A0057004D00430054000300270047005100540047>] TJ
ET
...
How can I decode the hex string that precedes the TJ operator, i.e.
[<0030004B004E004B005000460003003000430046004A0057004D00430054000300270047005100540047>]?
My decoding code is something like the following:
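import chardet
from PyPDF2 import PdfFileReader

# Get the content stream of every page (excerpt from a method; self.in_file is the input PDF)
pdf = PdfFileReader(self.in_file)
for page_number in range(0, pdf.getNumPages()):
    page = pdf.getPage(page_number)
    contents = page.getContents()
    # ... adjust the contents ...
    data = contents.get_data()
    # Guess the byte encoding of the raw stream and decode it to text
    encoding = chardet.detect(data)['encoding']
    decoded_data = data.decode(encoding)
    print(f'Decoded Data: {decoded_data}')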
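For reference, the operand of TJ can at least be split into numeric character codes. The sketch below is only an illustration under assumptions: it assumes every code is two bytes wide (as the 00xx pattern above suggests), and the names glyph_codes, decode_codes and to_unicode are hypothetical helpers, not part of PyPDF2. Turning the codes into readable text would still need a code-to-character table, which normally comes from the /ToUnicode CMap of the font (/F1 here); the tiny table in the example is made up.

```python
import re

# Matches a PDF hex string such as <0030004B>; whitespace inside is allowed.
HEX_STRING = re.compile(r"<([0-9A-Fa-f\s]*)>")

def glyph_codes(tj_operand, code_bytes=2):
    """Split every <...> hex string in a TJ operand into integer codes.

    Assumption: each character code is `code_bytes` bytes wide.
    """
    codes = []
    for hex_text in HEX_STRING.findall(tj_operand):
        hex_text = "".join(hex_text.split())
        if len(hex_text) % 2:
            hex_text += "0"  # an odd-length PDF hex string is padded with 0
        raw = bytes.fromhex(hex_text)
        for i in range(0, len(raw), code_bytes):
            codes.append(int.from_bytes(raw[i:i + code_bytes], "big"))
    return codes

def decode_codes(codes, to_unicode):
    """Map codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

operand = ("[<0030004B004E004B005000460003003000430046004A0057004D"
           "00430054000300270047005100540047>]")
print(glyph_codes(operand)[:7])  # [48, 75, 78, 75, 80, 70, 3]

# Hypothetical mapping, only to show the lookup step.
print(decode_codes(glyph_codes("[<00010002>]"), {1: "A", 2: "B"}))  # AB
```

The codes are font-specific glyph identifiers, so decoding them as UTF-16 or running chardet over them is not expected to give the original text.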