
I want to extract tables from a scanned booklet into Excel. A sample PDF can be found here. The file is in .pdf format, but because it was scanned, the tables appear as images. I tried Camelot and PyMuPDF, but I seem to be going wrong somewhere.

Here is the code I used:

import camelot
import pandas as pd

file = r"C:/Users/Vibes/Desktop/Projects/.pdf/Camelot.pdf"

# The "stream" flavor targets tables without ruling lines; it needs text-based pages
tables = camelot.read_pdf(file, pages='all', flavor="stream", encoding="utf-8")

master_DF = pd.DataFrame()
new_header = None

for i in range(tables.n):
    df = tables[i].df
    if i == 0:
        # In the first table, row 4 holds the column headers and the data starts at row 5
        new_header = df.iloc[4]
        df = df[5:]
    else:
        # Later tables start with one header row, which is dropped
        df = df[1:]
    # Reuse the header taken from the first table for every table
    df.columns = new_header
    master_DF = pd.concat([master_DF, df], axis=0, ignore_index=True)
    if i > 0:
        print(df)
        print("")

Running this with Camelot gives the following error:

page-1 is image-based, camelot only works on text-based pages.

Toni

1 Answer


Apologies, I don't have 50 points, so it would not let me add this as a comment, but here is why that error may be generated: https://camelot-py.readthedocs.io/en/master/user/faq.html#:~:text=Does%20Camelot%20work%20with%20image,is%20text%2Dbased%E2%80%9D.)

If the PDF was produced by scanning and each page is just an image, not actual text that could be selected or edited as text, then Camelot cannot extract tables from it. If it's an image, your only resort may be a text recognition (OCR) program like pytesseract to try to pull the text from the image.

The PDF you linked looks like its pages are scanned images and not actual text-based tables.
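The OCR route could look roughly like the sketch below. It is not a tested solution for your file: it assumes the pdf2image and pytesseract packages are installed along with the Poppler and Tesseract binaries they wrap, and the helper names `ocr_page_words` and `group_words_into_rows`, the 300 dpi setting, and the 10-pixel row tolerance are all my own illustrative choices. It renders one page to an image, asks Tesseract for word positions, and groups words with similar vertical positions into rows.

```python
def ocr_page_words(pdf_path, page_number=1, dpi=300):
    """Render one PDF page to an image and OCR it into positioned words."""
    # Imported lazily because both packages depend on external binaries
    # (Poppler for pdf2image, Tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    image = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )[0]
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words = []
    for text, left, top in zip(data["text"], data["left"], data["top"]):
        if text.strip():  # skip the empty entries Tesseract emits for layout blocks
            words.append({"text": text, "left": left, "top": top})
    return words


def group_words_into_rows(words, row_tolerance=10):
    """Group words whose 'top' values are within row_tolerance pixels into rows."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        # Compare against the first word of the current row to decide membership
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the words left to right
    return [[w["text"] for w in sorted(row, key=lambda w: w["left"])] for row in rows]
```

You would call it as `rows = group_words_into_rows(ocr_page_words(file))`, then pass `rows` to `pd.DataFrame(rows)` and `to_excel`. Note that this splits on individual words, so a cell containing several words ends up in several columns; merging words by horizontal gaps, and cleaning up OCR misreads, would still be needed for a real table.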

Chris