
Using PyPDF2, I am able to parse the content of most PDFs, but for some PDFs I get the following data:

Decoded Data:
...

BT
310.00 621.52 TD
/F1 12 Tf
[<0030004B004E004B005000460003003000430046004A0057004D00430054000300270047005100540047>] TJ
ET

...

How do I decode the string before the TJ operator, i.e. [<0030004B004E004B005000460003003000430046004A0057004D00430054000300270047005100540047>]?

My decoding code looks like this:

import chardet
from PyPDF2 import PdfFileReader

# Get the content
pdf = PdfFileReader(self.in_file)

for page_number in range(pdf.getNumPages()):
    page = pdf.getPage(page_number)
    contents = page.getContents()

    # Adjust the contents...
    data = contents.get_data()
    # chardet may report None as the encoding, so fall back to latin-1
    encoding = chardet.detect(data)['encoding'] or 'latin-1'
    decoded_data = data.decode(encoding)
    print(f'Decoded Data: {decoded_data}')
Milind Deore
  • In the line before the **TJ** text-showing instruction there is a **Tf** font-setting instruction. Simply look up the font **F1** set there in the font resources of your content stream. If it has a **ToUnicode** stream, use that. Otherwise try to use the **Encoding** entry. For more details see [this answer](https://stackoverflow.com/a/33416913/1729265). – mkl Jun 16 '23 at 08:42
  • @mkl thanks for guiding me. I could find both `ToUnicode` and `Encoding`. I have only a limited idea of how to extract text from a PDF with `ToUnicode`, but let me give it a try. Could you share some code for guidance? Thanks a lot! – Milind Deore Jun 16 '23 at 13:30
  • I don't have any Python code. I would assume, though, that there are numerous Python PDF libraries that support text extraction out-of-the-box, applying **ToUnicode** or **Encoding** under the hood. – mkl Jun 16 '23 at 16:47
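
The **ToUnicode** lookup suggested in the comments could be sketched roughly as follows. This is only a minimal illustration, not a complete CMap parser: the helper names `parse_tounicode` and `decode_hex_string` and the `SAMPLE_CMAP` text are hypothetical, the sample codes are invented and do not come from the PDF in the question, and the sketch assumes fixed 2-byte character codes, single-character `bfrange` destinations, and no array-form `bfrange` entries. With PyPDF2, the CMap bytes can probably be read with something like `page['/Resources']['/Font']['/F1']['/ToUnicode'].get_data()`, assuming the font resources sit directly on the page and the font actually has a **ToUnicode** stream.

```python
import re

# Hypothetical ToUnicode CMap text, for illustration only.
SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
1 beginbfrange
<0010> <0012> <0041>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build {character code: Unicode string} from ToUnicode CMap text."""
    mapping = {}
    # bfchar sections hold <source code> <UTF-16BE destination> pairs.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange sections hold <low> <high> <first destination> triples.
    # Simplification: the destination is treated as one code point.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        triple = HEX + r"\s*" + HEX + r"\s*" + HEX
        for low, high, dst in re.findall(triple, section):
            first = int(dst, 16)
            for offset, code in enumerate(range(int(low, 16), int(high, 16) + 1)):
                mapping[code] = chr(first + offset)
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a TJ/Tj hex string operand such as '<00010002>'."""
    digits = re.sub(r"[<>\s]", "", hex_string)
    step = code_bytes * 2  # hex digits per character code
    codes = [int(digits[i:i + step], 16) for i in range(0, len(digits), step)]
    # Codes missing from the CMap become U+FFFD (replacement character).
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_tounicode(SAMPLE_CMAP)
print(decode_hex_string("<00010002>", mapping))  # prints "Hi"
```

For the string in the question, the same approach would mean building the mapping from the **ToUnicode** stream of font **F1** and passing the hex operand between `<` and `>` to `decode_hex_string`. If the font has no **ToUnicode** stream, the **Encoding** entry has to be used instead, which this sketch does not cover.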
