Having issues converting PDF data into a dataframe depending on how the PDF is uploaded to the website.

Hi all,

Does anyone have any ideas on how to read an uploaded PDF's data into a pandas dataframe? I am having issues doing it with certain PDFs.

For example, with this URL, https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf, I was able to easily get the data like so:

    import tabula as tb

    pdf_url = 'https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf'
    # read_pdf returns a list of DataFrames, one per table detected on the page
    dfs = tb.read_pdf(pdf_url, pages=1, guess=True)

But for other links where I cannot highlight values in the PDF (it looks like a scanned or faxed document), such as https://rrc.texas.gov/media/uzzdihmq/2023-july-10-0026.pdf, I get stuck. I have tried tabula, pdfplumber, and pytesseract so far, with no success.

Does anyone have any ideas? Thanks in advance!

jare2620
  • You're right, that one is images rather than text. You'll need some sort of OCR / image recognition software such as [AWS Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html). – stdunbar Aug 18 '23 at 21:03
  • @stdunbar Ahh ok, thank you. It does not seem worth the trouble to set up code with tools I am not familiar with, such as AWS (guessing it costs $$ too), for data that has only monthly granularity. It might be easier to just go to the website once a month and manually enter the data into a master Excel spreadsheet. Thanks for your response! – jare2620 Aug 18 '23 at 21:12
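
For anyone landing here with the same image-only PDFs, a local OCR pass is one free alternative to a cloud service. The following is only a rough sketch, not a tested solution for these particular files. It assumes the PDF has already been downloaded to disk, that `pdf2image` (with poppler) and `pytesseract` (with the Tesseract binary) are installed, and that columns in the OCR output end up separated by two or more spaces, which may not hold for a noisy fax. The function names `ocr_pdf_pages`, `text_to_rows`, and `pdf_to_dataframe` are made up for illustration.

```python
import re


def ocr_pdf_pages(pdf_path, dpi=300):
    """Render each PDF page to an image and return the OCR text per page.

    Assumes pdf2image (plus poppler) and pytesseract (plus Tesseract) are installed.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    # --psm 6 treats the page as one uniform block of text;
    # preserve_interword_spaces keeps the wide gaps between columns.
    config = "--psm 6 -c preserve_interword_spaces=1"
    return [pytesseract.image_to_string(img, config=config) for img in images]


def text_to_rows(text, min_columns=2):
    """Split OCR text into rows, treating runs of 2+ spaces as column breaks."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in re.split(r"\s{2,}", line.strip()) if c.strip()]
        # Skip blank lines and lines that do not look like table rows.
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def pdf_to_dataframe(pdf_path, page_index=0):
    """OCR one page and load the detected rows into a DataFrame (no header handling)."""
    import pandas as pd

    page_text = ocr_pdf_pages(pdf_path)[page_index]
    return pd.DataFrame(text_to_rows(page_text))
```

Calling `pdf_to_dataframe("2023-july-10-0026.pdf")` on a local copy would give a raw DataFrame with numbered columns. Expect to clean it up by hand, since OCR on faxed scans tends to misread digits and merge or split columns.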
