How can I parse an online PDF file with Python?

I just need the second line of the first page. I need to do this without saving the file to disk, and I am using Python 3.5.

I tried something like this, based on "Using PDFMiner (Python) with online pdf files. Encode the url?", but it didn't work:

from pdfminer.pdfparser import PDFParser
import urllib.request
from io import StringIO
import io

url = 'url_with_the_pdf'

open = urllib.request.urlopen(url).read()

memoryFile = io.StringIO(open)

parser = PDFParser(memoryFile)

I get this error:

memoryFile = io.StringIO(open)
TypeError: initial_value must be str or None, not bytes
Laura
  • did you check this? https://stackoverflow.com/a/16575064/435089 – Kannappan Sirchabesan Jan 20 '19 at 19:37
  • what version of python are you using? the answer changes in python3.6 which allows `loads(bytes)`. Also I suspect you don't actually have json if you need to `.replace("'", '"')` – anthony sottile Jan 20 '19 at 19:38
  • I am using 3.6. So I don't need the line my_json = data.decode('utf-8').replace("'", '"')? – Laura Jan 20 '19 at 19:41
  • `If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised.` -- https://docs.python.org/3/library/json.html#json.JSONDecodeError The decoder is probably hitting a string value which is wrapped in single quotes because of your call to `replace` and freaking out because single quotes do not encapsulate strings in JSON. You would do well to post document you are trying to parse so we can see it. – Pocketsand Jan 20 '19 at 19:44
  • You could also convert the PDF to an image and run OCR on it; the OCR can then recognize the text in the image. – amcoder Jan 21 '19 at 00:39

1 Answer

urllib.request.urlopen(url).read() returns bytes, not str, so in Python 3 use io.BytesIO instead of io.StringIO, i.e.

memoryFile = io.BytesIO(open)
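
A minimal sketch of that fix, using only the standard library. The helper names fetch_pdf_bytes and to_memory_file are made up for illustration, and the sample payload is a stand-in so the snippet runs without a network connection. Note that the bytes are still transferred over HTTP; they are just held in memory rather than written to disk.

```python
import io
import urllib.request


def fetch_pdf_bytes(url):
    """Read the whole HTTP response body into memory as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def to_memory_file(data):
    """Wrap raw bytes in a seekable, binary file-like object (hypothetical helper)."""
    return io.BytesIO(data)


# Stand-in payload so the sketch runs offline; a real call would be
# to_memory_file(fetch_pdf_bytes(url)).
sample = b"%PDF-1.4 placeholder bytes, not a valid PDF"
memory_file = to_memory_file(sample)
header = memory_file.read(5)  # first five bytes of the payload
memory_file.seek(0)           # rewind before handing the object to a parser

# io.StringIO only accepts str, which is the TypeError from the question.
try:
    io.StringIO(sample)
    string_io_rejected_bytes = False
except TypeError:
    string_io_rejected_bytes = True
```

With a real URL, PDFParser(to_memory_file(fetch_pdf_bytes(url))) should then receive the binary stream it expects. Avoiding the variable name open also keeps the built-in open() usable.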

Details: https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

...import the io module and use io.StringIO or io.BytesIO for text and data respectively
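
To get from the in-memory bytes to the second line of the first page, something like the following might work. It is a sketch that assumes a recent pdfminer.six release providing pdfminer.high_level.extract_text and that this function accepts a binary file-like object; older PDFMiner versions need the lower-level parser and interpreter classes instead. The function names are hypothetical, and "second line" is taken here to mean the second non-empty line of the extracted text, which may not match the visual layout of every PDF.

```python
import io


def second_line(text):
    """Return the second non-empty line of text, or None if there are fewer than two."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[1] if len(lines) > 1 else None


def second_line_of_first_page(pdf_bytes):
    """Extract the text of page 1 from an in-memory PDF and return its second line.

    Assumption: pdfminer.six is installed and exposes high_level.extract_text.
    """
    # Third-party import kept local so second_line() works without pdfminer.
    from pdfminer.high_level import extract_text

    # page_numbers is zero-based, so [0] restricts extraction to the first page.
    text = extract_text(io.BytesIO(pdf_bytes), page_numbers=[0])
    return second_line(text)
```

The pure-Python second_line helper is split out so the line-picking logic can be checked on plain strings before involving a real PDF.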

funkifunki