Having issues converting PDF data into a dataframe depending on how the PDF is uploaded to the website.
Hi all,
Does anyone have any ideas on how to read an uploaded PDF's data into a pandas dataframe? I am having issues doing it with certain PDFs.
For example, with this URL, https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf, I was able to get the data easily like so:
import tabula as tb
pdf_url = 'https://www.rrc.texas.gov/media/ep0le0dv/2022-january-01-0692.pdf'
# read_pdf returns a list of DataFrames, one per table detected on the page
dfs = tb.read_pdf(pdf_url, pages=1, guess=True)
But for other links where I cannot highlight values in the PDF (it appears to be a scanned or faxed image rather than selectable text), such as https://rrc.texas.gov/media/uzzdihmq/2023-july-10-0026.pdf, I get stuck. I have tried tabula, pdfplumber, and pytesseract so far, with no success.
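For reference, here is a rough sketch of the OCR route for the scanned case, not a known-working solution. It assumes the PDF has already been downloaded to a local path, that pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed, and that words on the same table row sit at roughly the same vertical position. The helper names `words_to_rows` and `ocr_pdf_to_dataframe` and the `row_tol` threshold are made up for illustration.

```python
def words_to_rows(words, row_tol=10):
    """Group OCR word boxes into rows by vertical position.

    `words` is a list of dicts with 'text', 'left' and 'top' keys.
    `row_tol` (pixels) is a guessed tolerance, not a tuned value.
    """
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["left"])):
        # Same row if this word's top is close to the top of the row's first word
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, order the words left to right and keep only the text
    return [[w["text"] for w in sorted(r, key=lambda w: w["left"])] for r in rows]


def ocr_pdf_to_dataframe(pdf_path, page=1, dpi=300):
    """Hypothetical helper: OCR one page of a scanned PDF into a DataFrame."""
    # Third-party imports are local so words_to_rows works without them
    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path

    # Render the requested page to a PIL image
    image = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)[0]
    # image_to_data gives per-word text plus bounding-box coordinates
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = [
        {"text": t, "left": l, "top": tp}
        for t, l, tp in zip(data["text"], data["left"], data["top"])
        if t.strip()  # skip empty entries for block/line-level records
    ]
    # Rows may have different lengths; pandas pads the short ones
    return pd.DataFrame(words_to_rows(words))
```

This splits each row into one cell per word, so multi-word cells would still need to be merged, for example by looking at horizontal gaps between words. OCR quality on faxed pages may also be too poor for this to work without image cleanup first.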
Does anyone have any ideas? Thanks in advance!