
I am trying to convert the bytes I get from book_download_page = requests.get(link), followed by content = book_download_page.content, into a string.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading.

Thank you!

parth shukla
  • Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – metatoaster Jun 25 '20 at 03:39
  • Try other encodings such as 'latin-1', and please share the link so I can check it and suggest a solution. – NAGA RAJ S Jun 25 '20 at 03:40

1 Answer

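A PDF is a binary format rather than plain text, so its raw bytes cannot simply be decoded as UTF-8. PDF contents are made up of tokens, see here: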

Adobe PDF Reference

You can parse PDFs and extract their text with tools like PoDoFo in C++ or PDFBox in Java, and there is also a PDF text stripper in Python:

import pdfbox  # python-pdfbox, a Python interface to Apache PDFBox

pdf_ref = pdfbox.PDFBox()
# Extract the text of the PDF; the result is written to directory/originalPDF.txt
pdf_ref.extract_text('directory/originalPDF.pdf')

This is a simple example paraphrased from python-pdfbox, which is also useful if you want to convert other things, such as images.
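To tie this back to the question, here is a minimal sketch, assuming the downloaded bytes really are a PDF, of why decode('utf-8') fails and how the bytes could be written to disk unchanged so a PDF tool can read them. The names SAMPLE, looks_like_pdf and save_pdf are hypothetical and exist only for illustration; SAMPLE imitates the first bytes of a typical PDF rather than the actual file from the question.

```python
import tempfile
from pathlib import Path

# Hypothetical sample: the opening bytes of a typical PDF. The header line is
# followed by a comment line holding non-ASCII bytes that mark the file as binary.
SAMPLE = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return content.startswith(b"%PDF-")


def save_pdf(content: bytes, path: str) -> Path:
    """Write the raw bytes to disk unchanged (binary write, no decoding)."""
    target = Path(path)
    target.write_bytes(content)
    return target


if __name__ == "__main__":
    # Decoding binary PDF bytes as text raises UnicodeDecodeError.
    try:
        SAMPLE.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("decode failed:", exc)

    # Saving the bytes as-is keeps the file intact for a PDF library.
    if looks_like_pdf(SAMPLE):
        with tempfile.TemporaryDirectory() as tmp:
            saved = save_pdf(SAMPLE, str(Path(tmp) / "book.pdf"))
            print("wrote", saved.stat().st_size, "bytes")
```

With requests, the same idea would be to pass book_download_page.content to save_pdf and then hand the saved path to a text extractor such as the one above, instead of calling decode on the bytes.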

user176692