Converting PDF Table from URL into a Pandas Dataframe?

Question

Having issues converting PDF data into a dataframe depending on how the PDF is uploaded to the website.

Hi all,

Does anyone have any ideas on how to read an uploaded PDF's data into a pandas dataframe? I am having issues doing it with certain PDFs.

For example, with this URL, https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf, I was able to easily get the data like so:

import tabula as tb

# read_pdf accepts a URL and returns a list of DataFrames (one per detected table)
pdf_url = 'https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf'
tb.read_pdf(pdf_url, pages=1, guess=True)

But for other links where I cannot highlight values in the PDF (it looks like it was just faxed in), such as https://rrc.texas.gov/media/uzzdihmq/2023-july-10-0026.pdf, I get stuck. I have tried tabula, pdfplumber, and pytesseract so far, with no success.
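One way to tell the two kinds of PDF apart in code is to check whether any page has an extractable text layer. The following is only a rough sketch: the helper names `has_text_layer` and `extract_page_texts` and the `min_chars` threshold are made up for illustration, and it assumes `requests` and `pdfplumber` are installed.

```python
import io


def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least `min_chars` non-whitespace characters.

    `min_chars` is an arbitrary illustrative threshold, not a library default.
    """
    return any(
        len("".join((text or "").split())) >= min_chars
        for text in page_texts
    )


def extract_page_texts(pdf_url):
    """Download a PDF and return the text pdfplumber finds on each page."""
    # Third-party imports are kept local so has_text_layer has no dependencies.
    import pdfplumber
    import requests

    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # extract_text() may return None or an empty string for image-only pages.
        return [page.extract_text() for page in pdf.pages]
```

Calling `has_text_layer(extract_page_texts(pdf_url))` on each link would then suggest whether tabula is likely to work or whether the pages are scanned images.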

Does anyone have any ideas? Thanks in advance!

You're right, that one is images rather than text. You'll need some sort of OCR / image recognition software such as [AWS Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html). — stdunbar, Aug 18 '23 at 21:03
@stdunbar Ahh ok, thank you. It does not seem worth the trouble to set up code with tools I am not familiar with, such as AWS (guessing it costs $$ too), for data that has only monthly granularity. It might be easier to just go to the website once a month and manually enter the data into a master Excel spreadsheet. Thanks for your response! — jare2620, Aug 18 '23 at 21:12
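For anyone who wants to try the OCR route locally instead of a cloud service, here is a minimal sketch, not a tested solution for this particular PDF. It assumes `requests`, `pdf2image` (which needs Poppler), `pytesseract` (which needs the Tesseract binary), and `pandas` are installed. The function names `split_columns` and `ocr_pdf_to_dataframe`, the 300 dpi setting, and the "two or more spaces means a new column" rule are all illustrative assumptions, and results on faxed scans may still need manual cleanup.

```python
import re


def split_columns(ocr_text):
    """Split OCR output into rows of cells.

    Assumes columns are separated by a tab or by two or more whitespace
    characters; single spaces are kept inside a cell.
    """
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def ocr_pdf_to_dataframe(pdf_url, dpi=300):
    """Download a scanned PDF, OCR each page, and return one DataFrame."""
    # Third-party imports are kept local so split_columns has no dependencies.
    import pandas as pd
    import pytesseract
    import requests
    from pdf2image import convert_from_bytes

    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()

    rows = []
    # Rasterize each PDF page to a PIL image, then run Tesseract on it.
    for image in convert_from_bytes(response.content, dpi=dpi):
        text = pytesseract.image_to_string(
            image,
            # psm 6: treat the page as a single uniform block of text;
            # preserve_interword_spaces keeps wide gaps between columns.
            config="--psm 6 -c preserve_interword_spaces=1",
        )
        rows.extend(split_columns(text))
    # Rows of unequal length are padded with missing values by pandas.
    return pd.DataFrame(rows)
```

The resulting frame has no header and every cell is a string, so column names and numeric conversion would still have to be handled per document.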
