3

I have come across a few posts that deal with converting PDFs to HTML or to text, but they all work from a file saved to the computer. Is there a way to extract the text from a PDF hosted on a web page without downloading the PDF file itself? I will be doing this for a large number of files by iterating through a list of URLs.

I am also curious which library is best for this: pdfkit, pdf2txt, pdfminer, etc.?

Here is an example website with the format I will be dealing with: http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf

rahlf23
    Even when viewing a PDF in the web-browser, you download a copy to your local cache. Your browser just still shows you the remote URL, even though what you are looking at has been saved to disk in your browser's tmp directory. Why not just do the same? – Matt Clark Aug 02 '17 at 21:06

3 Answers

6

You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:

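import io

import requests
from pyPdf import PdfFileReader  # legacy package; PyPDF2 and pypdf are its successors

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

# Download the PDF into memory; nothing is written to disk
r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
f = io.BytesIO(r.content)

# Read the first page and split its text into lines
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')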

f is a file-like object that you can use just as if you had opened a PDF file. This way the file exists only in memory and is never saved locally.

To get the text out of the PDF file you can use pyPdf.
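A server can return an HTML error page with a 200 status, so it may be worth checking that the downloaded bytes really are a PDF before handing them to the reader. Below is a minimal sketch of that idea, not part of the original answer: the helper names `as_pdf_stream` and `fetch_pdf_stream` and the 30-second timeout are made up for illustration, and the check relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
import io


def as_pdf_stream(data):
    """Wrap raw bytes in a BytesIO, refusing anything that is not a PDF.

    Hypothetical helper: it only checks the '%PDF-' signature at the start.
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("response body does not look like a PDF")
    return io.BytesIO(data)


def fetch_pdf_stream(url, timeout=30):
    """Download url into memory and return a file-like object for a PDF reader."""
    # Imported here so the helper above can be used without requests installed.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return as_pdf_stream(response.content)
```

The returned object can be passed to the PDF reader in place of `f` above.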

Dror Av.
  • Edited the answer to give a more complete one, thanks @Milk for the link and the second part. – Dror Av. Aug 02 '17 at 21:45
  • @Dror Av., I have used your code chunk to help another user at this link https://stackoverflow.com/questions/67931135/how-do-i-obtain-redirected-urls-in-python. Thank you. It helped to help others. – Raky Jun 11 '21 at 07:06
1

Here is the code updated for the PyPDF2 library:

import io
import requests
import PyPDF2

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PyPDF2.PdfReader(f)
# pages is zero-indexed, so pages[2] is the third page; use pages[0] for the first
contents = reader.pages[2].extract_text().split('\n')
Andriy125
0

Just a minor update to the above answer, extracting the text of every page:

import PyPDF2
import requests
import io


url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

response = requests.get(url)
f = io.BytesIO(response.content)
reader = PyPDF2.PdfReader(f)
pages = reader.pages
# get all pages data
text = "".join([page.extract_text() for page in pages])
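Since the question is about iterating through a list of URLs, the same steps could be wrapped in a loop that keeps going when one download or parse fails. This is only a sketch under assumptions: the function names (`download`, `pdf_text_from_bytes`, `texts_for_urls`) and the timeout value are hypothetical, and it assumes a PyPDF2 version that provides `PdfReader`, as used above.

```python
import io


def download(url, timeout=30):
    """Fetch url and return the response body as bytes (kept in memory only)."""
    import requests  # imported lazily so the loop below works with any fetcher

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def pdf_text_from_bytes(data):
    """Return the text of every page of an in-memory PDF, joined by newlines."""
    from PyPDF2 import PdfReader  # imported lazily, same reason as above

    reader = PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def texts_for_urls(urls, fetch=download, parse=pdf_text_from_bytes):
    """Map each URL to its extracted text; collect failures instead of stopping."""
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = parse(fetch(url))
        except Exception as exc:  # one bad URL should not abort the whole batch
            errors[url] = exc
    return results, errors
```

Calling `texts_for_urls(list_of_urls)` returns two dictionaries, one with the extracted text per URL and one with the exception raised for each URL that failed. Passing `fetch` and `parse` as parameters is a design choice that makes it easy to swap in another library such as pdfminer.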
Ankesh
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/34895573) – doneforaiur Aug 27 '23 at 06:57