Extracting text from PDF url file with Python

Question

I want to extract text from a PDF file that is hosted on a website. The website contains a link to the PDF document, but when I click that link the file is downloaded automatically. Is it possible to extract text from that file without downloading it?

import fitz  # this is pymupdf lib for text extraction
from bs4 import BeautifulSoup
import requests
from io import StringIO

url = "https://www.blv.admin.ch/blv/de/home/lebensmittel-und-ernaehrung/publikationen-und-forschung/statistik-und-berichte-lebensmittelsicherheit.html"

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}


response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

all_news = soup.select("div.mod.mod-download a")[0]
pdf = "https://www.blv.admin.ch"+all_news["href"]

#https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf

This is the code for extracting text from a PDF. It works well when the file has already been downloaded:

my_pdf_doc = fitz.open(pdf)
text = ""
for page in my_pdf_doc:
    text += page.getText()

print(text)

The same question applies when the link does not download the PDF file automatically, for example this link:

"https://amsoldingen.ch/images/files/Bekanntgabe-Stimmausschuss-13.12.2020.pdf"

How can I extract text from that file?

I have also tried this:

pdf_content = requests.get(pdf)
print(type(pdf_content.content))

file = StringIO() 
print(file.write(pdf_content.content.decode("utf-32")))

But I get error:

Traceback (most recent call last):
  File "/Users/aleksandardevedzic/Desktop/pdf extraction scrapping.py", line 25, in <module>
    print(file.write(pdf_content.content.decode("utf-32")))
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

You can download into memory using BytesIO: https://stackoverflow.com/questions/22340265/python-download-file-using-requests-directly-to-memory – Ramon Medeiros, Nov 24 '20 at 12:43
Can you show me how to apply that to my code? Maybe I'm doing something wrong – taga, Nov 25 '20 at 00:11

Ramon Medeiros · Answer 1 · 2020-11-25T09:45:45.003

5

Here is an example using PyPDF2.

To install

pip install PyPDF2

import requests
from io import BytesIO
from PyPDF2 import PdfReader  # recent PyPDF2 releases; older ones called this PdfFileReader

url = 'https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf'
response = requests.get(url)
my_raw_data = response.content

# Wrap the downloaded bytes in an in-memory file object; nothing is written to disk.
with BytesIO(my_raw_data) as data:
    read_pdf = PdfReader(data)

    # Older releases spelled this getNumPages() / getPage(i).extractText().
    for page in read_pdf.pages:
        print(page.extract_text())

Output:

' 1/21  Fad \nŒ 24.08.2020\n      Bericht 2017\n Œ 2019: Öffentliche Warnungen, \nRückrufe und Schnellwarnsystem RASFF\n      '
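
As a side note on why BytesIO is used here rather than StringIO: response.content is raw bytes, and a PDF is a binary format, so trying to decode it as text (as with decode("utf-32") in the question) fails. Below is a minimal sketch of that idea using only the standard library. The byte string is a hand-made stand-in for a real download, not a parseable PDF, and looks_like_pdf and sample_bytes are hypothetical names used only for illustration.

```python
from io import BytesIO


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the usual PDF header marker."""
    return data.startswith(b"%PDF-")


# Stand-in for response.content (assumption: a real PDF normally begins like this).
sample_bytes = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

# The bytes stay binary; BytesIO just gives them a file-like interface.
stream = BytesIO(sample_bytes)
header = stream.read(5)   # peek at the first five bytes
stream.seek(0)            # rewind so a PDF parser would start at the beginning

print(looks_like_pdf(sample_bytes), header)
```

A check like this can be run before handing the stream to a PDF library, for example to catch the case where the server returned an HTML error page instead of the document.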

edited Nov 25 '20 at 09:45

answered Nov 25 '20 at 01:15

Just added the loop – Ramon Medeiros Nov 25 '20 at 09:45
The problem is that the result I get is all over the place; it's not organised – taga Nov 25 '20 at 10:03
This is getting out of the scope of the question. The original question (how to read a PDF file without downloading it) has been answered. I suggest you learn how to read the PDF from the library's docs: https://pythonhosted.org/PyPDF2/ – Ramon Medeiros Nov 25 '20 at 10:05

Vihaan Thora · Answer 2 · 2022-03-06T18:40:07.567

PyMuPDF allows us to open a BytesIO stream directly, as mentioned in the documentation.

import requests
import fitz
import io

url = "your-url.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
pdf = fitz.open(stream=filestream, filetype="pdf")

pdf can then be parsed like a regular PyMuPDF document, as shown here.

P.S. This is my first answer on Stack Overflow, and any improvements/suggestions are welcome.

score 0 · Answer 3 · answered May 02 '22 at 11:49

I used @Vihaan Thora's solution and it worked for me:

!pip install PyMuPDF

import requests
import fitz
import io

url = "https://www.livelaw.in/pdf_upload/vsa02052022matfc1162021145829-416435.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
with fitz.open(stream=filestream, filetype="pdf") as doc:
    detail_judgement = ""
    for page in doc:
        detail_judgement += page.get_text()
print(detail_judgement)

K J · Answer 4 · 2022-05-04T02:14:36.280

It is IMPOSSIBLE to read an application/pdf file that sits at a remote location such as a web server without a "download". The browser, reader, or text extractor runs locally, so the file's bytes have to be transferred to your machine over HTTP(S) first (unless the server has been specifically configured to let clients work on its served files remotely, which is unlikely).

BOTH of your example links download immediately in my browser, because I have set it to download PDFs only and NOT to open them in the built-in viewer, which can be exploited.

So to extract text you always end up with a temporary copy on the local device, either in memory or in a cache (which is often backed by the hard drive). Others have suggested doing this in Python with an in-memory byte stream (io.BytesIO). However, that is not much different from how a download works.
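
To illustrate the point that both routes leave you with a local copy, here is a small sketch using only the standard library. It keeps one copy of some bytes in memory and writes another to a temporary directory that is removed afterwards. The payload is a made-up stand-in for downloaded PDF bytes, and the file name download.pdf is an arbitrary choice for the example.

```python
import os
import tempfile
from io import BytesIO

# Stand-in for the bytes a real HTTP request would return.
payload = b"%PDF-1.7\n(placeholder content, not a real PDF)\n"

# Route 1: keep the copy in RAM behind a file-like interface.
in_memory = BytesIO(payload)

# Route 2: write the copy to a temporary directory that is deleted on exit.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "download.pdf")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
    existed_during = os.path.exists(tmp_path)
    on_disk_size = os.path.getsize(tmp_path)

# After the block, the temporary directory and the file inside it are gone.
exists_after = os.path.exists(tmp_path)
print(existed_during, on_disk_size, exists_after)
```

Either way the bytes were transferred to the local machine; the only difference is where the copy lives and how long it stays there.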

The file can also be fetched into the current directory with

curl -O https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf

and the text then extracted with an OS command, which can be run from Python. The trailing - makes pdftotext write to standard output so that the result can be piped (find is the Windows filter; on Linux or macOS use grep instead):

pdftotext Jahresbericht_2017-2019_DE.pdf - | find "whatever you need"
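
A rough sketch of how that command could be driven from Python with the standard subprocess module follows. It assumes the pdftotext tool (from Poppler or Xpdf) is installed and on the PATH; the helper names build_pdftotext_cmd, matching_lines, and extract_text are made up for this example.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_cmd(pdf_path: str) -> List[str]:
    """Build the argument list; the trailing '-' sends the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def matching_lines(text: str, needle: str) -> List[str]:
    """Keep only the lines that contain the search string (like find/grep)."""
    return [line for line in text.splitlines() if needle in line]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on a local file and return its output as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

With the file from the curl step in the current directory, usage would look like matching_lines(extract_text("Jahresbericht_2017-2019_DE.pdf"), "whatever you need").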
