0

I am trying to get a PDF online using something like requests and convert it to a string in Python. I don't want to end up with the PDF on my hard disk. Instead, I want to fetch it from the web and work on it as text (a string) in Python 3.

For example say you have a pdf file with the contents: I love programming.

import requests

url = 'xyzzy.org/g.pdf'
response = requests.get(url)
# do something to response and assign it to `pdf`
convert_to_string(pdf)  # -> "I love programming"
mkrieger1
chez93
  • 1
    Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – Jonathan Ciapetti Jun 22 '22 at 20:53
  • 1
    @user_ I believe those answers assume you already have the PDF on your computer's hard disk. I want to get the PDF from a URL and convert it to a string without having to save it on my computer, i.e. opening the PDF directly in Python. – chez93 Jun 22 '22 at 21:02
  • 4
    @sphereInAManifold so combine your work with the answers: fetch the PDF from a URL, feed the bytes of the download to, say, `PdfReader()` from `PyPDF2`, and then extract the text. You'll probably want to read about [`io.BytesIO`](https://docs.python.org/3/library/io.html#binary-i-o). – wkl Jun 22 '22 at 21:05

2 Answers

3

Update: read K J's answer

As pointed out in the comments, you can divide this task into two parts:

  1. Download the pdf through a stream object
  2. Convert the in-memory pdf into a string

This should do the job (it needs the PyMuPDF package):

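import io

import fitz  # PyMuPDF
import requests

url = "http://.../sample.pdf"

# 1. Download the PDF into memory; nothing is written to disk
response = requests.get(url)
response.raise_for_status()  # stop here on an HTTP error
pdf = io.BytesIO(response.content)

# 2. Open the in-memory PDF and collect the text of every page
with fitz.open(stream=pdf, filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text)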
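
If you need this in more than one place, the two steps can be wrapped in a function. The following is only a sketch under some assumptions: the names `looks_like_pdf` and `pdf_text_from_url` are made up for illustration, the 30-second timeout is arbitrary, and the header check is a rough guard against servers that answer with an HTML error page instead of a PDF.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of data."""
    # PDF files carry a "%PDF-" marker at (or very near) the beginning;
    # an HTML error page served in place of the file will not have it.
    return b"%PDF-" in data[:1024]


def pdf_text_from_url(url: str, timeout: float = 30.0) -> str:
    """Download a PDF into memory and return the text of all its pages."""
    # Third-party imports are kept local so that looks_like_pdf() can be
    # used (and tested) without requests or PyMuPDF installed.
    import requests
    import fitz  # PyMuPDF

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    if not looks_like_pdf(response.content):
        raise ValueError(f"{url} did not return a PDF")

    # The downloaded bytes are passed directly as the stream;
    # filetype="pdf" tells PyMuPDF how to interpret them.
    with fitz.open(stream=response.content, filetype="pdf") as doc:
        return "".join(page.get_text() for page in doc)


# Hypothetical usage, with the URL from the question:
# text = pdf_text_from_url("https://xyzzy.org/g.pdf")
```

Nothing is written to disk here; the PDF only ever exists as `response.content` in memory.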
Jonathan Ciapetti
1

Whilst the correct answer given above uses an in-memory stream for sample.pdf

pdf = io.BytesIO(response.content)
with fitz.open(stream=pdf) as doc:

the file still had to be transferred over HTTPS as an HTTP response (a download) and then decoded by fitz, in effect as a %tmp%MemoryBlob.pdf (that is, a file that was downloaded and discarded after extraction).

If you want to do something similar using just the OS and Poppler, so that you can try out different options, the sequence is simply:

curl -o "%tmp%\temp.pdf" RemoteURL
pdftotext [options] "%tmp%\temp.pdf" filename.txt

This lets you rerun the last line as often as you like, and the next download simply overwrites the same %tmp% file. If you set options and change filename.txt to simply - you can review the output in the console. Beware, however, that for non-native encodings the console output may look cruder than what would be stored in a file.
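
If you would rather drive the same two commands from Python, a rough sketch could look like the one below. It assumes Poppler's pdftotext is installed and on the PATH. The function names are invented for illustration, and `urllib.request` stands in for curl. The temporary file is deleted as soon as the text has been captured, and the output is decoded in Python, so the console encoding never gets involved.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def build_pdftotext_cmd(pdf_path, options=()):
    """Build the argument list for: pdftotext [options] PDF-file -"""
    # A text-file argument of "-" makes pdftotext write to stdout.
    return ["pdftotext", *options, str(pdf_path), "-"]


def pdf_url_to_text(url, options=("-enc", "UTF-8")):
    """Download url to a temporary file, run pdftotext on it, return the text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "temp.pdf"
        # Roughly equivalent to: curl -o temp.pdf RemoteURL
        with urllib.request.urlopen(url) as response:
            pdf_path.write_bytes(response.read())
        # check=True raises CalledProcessError if pdftotext fails.
        result = subprocess.run(
            build_pdftotext_cmd(pdf_path, options),
            capture_output=True,
            check=True,
        )
    # Matches the "-enc UTF-8" option requested above.
    return result.stdout.decode("utf-8")
```

Other pdftotext options, for example `-layout`, can be passed through the `options` argument to experiment in the same way as on the command line.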


K J