I want to extract text from PDF file thats on one website. The website contains link to PDF doc, but when I click on that link it automaticaly downloads that file. Is it possible to extract text from that file without downloading it
import fitz # this is pymupdf lib for text extraction
from bs4 import BeautifulSoup
import requests
from io import StringIO
url = "https://www.blv.admin.ch/blv/de/home/lebensmittel-und-ernaehrung/publikationen-und-forschung/statistik-und-berichte-lebensmittelsicherheit.html"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
all_news = soup.select("div.mod.mod-download a")[0]
pdf = "https://www.blv.admin.ch"+all_news["href"]
#https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf
This is code for extracting text from pdf. It works good when file is downloaded:
my_pdf_doc = fitz.open(pdf)
text = ""
for page in my_pdf_doc:
text += page.getText()
print(text)
The same question is if link does not downloads the pdf file automatically, for example this link:
"https://amsoldingen.ch/images/files/Bekanntgabe-Stimmausschuss-13.12.2020.pdf"
How can I extract text from that file
I have also tried this:
pdf_content = requests.get(pdf)
print(type(pdf_content.content))
file = StringIO()
print(file.write(pdf_content.content.decode("utf-32")))
But I get error:
Traceback (most recent call last):
File "/Users/aleksandardevedzic/Desktop/pdf extraction scrapping.py", line 25, in <module>
print(file.write(pdf_content.content.decode("utf-32")))
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)