Parse online PDF file with Python and PDFMiner

Question

How can I parse an online PDF file with Python?

I just need the second line of the first page. I need to do this without downloading the file, and I am using Python 3.5.

I have tried something like this, but it didn't work: Using PDFMiner (Python) with online pdf files. Encode the url?

from pdfminer.pdfparser import PDFParser
import urllib.request
from io import StringIO
import io

url = 'url_with_the_pdf'

open = urllib.request.urlopen(url).read()

memoryFile = io.StringIO(open)

parser = PDFParser(memoryFile)

I get this error:

memoryFile = io.StringIO(open)
TypeError: initial_value must be str or None, not bytes

did you check this? https://stackoverflow.com/a/16575064/435089 — Kannappan Sirchabesan, Jan 20 '19 at 19:37
what version of python are you using? the answer changes in python3.6 which allows `loads(bytes)`. Also I suspect you don't actually have json if you need to `.replace("'", '"')` — anthony sottile, Jan 20 '19 at 19:38
I am using 3.6. So I don't need the line my_json = data.decode('utf-8').replace("'", '"')? — Laura, Jan 20 '19 at 19:41
`If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised.` -- https://docs.python.org/3/library/json.html#json.JSONDecodeError The decoder is probably hitting a string value which is wrapped in single quotes because of your call to `replace` and freaking out because single quotes do not encapsulate strings in JSON. You would do well to post document you are trying to parse so we can see it. — Pocketsand, Jan 20 '19 at 19:44
You could use OCR: convert the PDF to an image and let the OCR engine recognize the text in the image. — amcoder, Jan 21 '19 at 00:39

score 0 · Answer 1 · answered Dec 24 '19 at 11:27


In Python 3, urlopen(url).read() returns bytes, not str, so use io.BytesIO instead of io.StringIO, i.e.

memoryFile = io.BytesIO(open)

Details: https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

...import the io module and use io.StringIO or io.BytesIO for text and data respectively
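
Putting this together, here is a minimal sketch of reading the PDF into memory and pulling the second line of the first page. The helper names `fetch_pdf_stream` and `second_line_of_first_page` are made up for illustration. The second helper assumes the pdfminer.six fork is installed and uses its `pdfminer.high_level.extract_text` function, which is not part of the code in the question; what counts as the "second line" depends on how PDFMiner lays out the text, so treat the result as approximate.

```python
import io
import urllib.request


def fetch_pdf_stream(url):
    """Read the PDF into memory and return it as a seekable binary stream.

    Nothing is written to disk; the bytes live only in the BytesIO object.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()  # bytes in Python 3, not str
    return io.BytesIO(data)     # BytesIO accepts bytes; StringIO would raise TypeError


def second_line_of_first_page(stream):
    """Return the second non-empty text line of page 1, or None if there is none.

    Assumption: pdfminer.six is installed. The import is kept local so that
    fetch_pdf_stream stays usable without it.
    """
    from pdfminer.high_level import extract_text

    # page_numbers is zero-indexed, so [0] selects the first page only.
    text = extract_text(stream, page_numbers=[0])
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[1] if len(lines) > 1 else None


if __name__ == "__main__":
    url = 'url_with_the_pdf'  # placeholder from the question; replace with a real URL
    print(second_line_of_first_page(fetch_pdf_stream(url)))
```

The same BytesIO object can also be passed to PDFParser(memoryFile) as in the question, since the parser only needs a binary file-like object.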


funkifunki
