How to convert bytes from PDF to string in Python?

Question

I am trying to convert the bytes I get from book_download_page = requests.get(link), followed by content = book_download_page.content, into a string.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading

Thank you!

Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) — metatoaster, Jun 25 '20 at 03:39
Try other decodings like 'latin-1'. Please share the link and I will check it and suggest a solution. — NAGA RAJ S, Jun 25 '20 at 03:40

score 1 · Answer 1 · answered Jun 25 '20 at 03:46

A PDF is not plain UTF-8 text. Its contents are made up of tokens and binary data, so decoding the raw bytes as a string fails; see here:

You can parse PDFs and extract text with tools like PoDoFo in C++ and PDFBox in Java; there is also a PDF text stripper for Python (python-pdfbox).

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # writes the extracted text to directory/originalPDF.txt

This simple example is paraphrased from python-pdfbox, which is also useful if you want to convert other things, such as images.
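To connect this to the question: the bytes from requests should be written to a file unchanged and then handed to the extractor, rather than decoded. The sketch below is only an illustration under some assumptions. The helper names looks_like_pdf and save_pdf_bytes, the file name book.pdf, and the sample bytes are hypothetical and not from the question. The sample is just a PDF header followed by the kind of binary marker comment many PDF writers emit, which is enough to reproduce the same sort of decode failure at position 10.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF header."""
    return content.startswith(PDF_MAGIC)


def save_pdf_bytes(content: bytes, path: str) -> Path:
    """Write downloaded bytes to disk unchanged (binary, no decoding)."""
    if not looks_like_pdf(content):
        raise ValueError("content does not look like a PDF")
    target = Path(path)
    target.write_bytes(content)
    return target


# Hypothetical sample: a header plus a binary marker comment.
# This is not a complete, valid PDF; it only illustrates the error.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

decode_error = None
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc  # byte 0xe2 is not followed by a continuation byte
    print(exc)

# With a real download, the flow would look roughly like this
# (requires the requests and python-pdfbox packages):
#     content = requests.get(link).content
#     pdf_path = save_pdf_bytes(content, "book.pdf")
#     pdfbox.PDFBox().extract_text(str(pdf_path))  # writes book.txt
```

Checking the header first is a cheap way to catch the case where the server returned an HTML error page instead of the PDF.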
