I want to extract tables from a scanned booklet into Excel. A sample PDF can be found here. The file is in .pdf format, but because it was scanned, the tables are images rather than text. I tried Camelot and PyMuPDF but seem to be getting something wrong somewhere.
Here is the code I used:
import camelot
import PyPDF2
import pandas as pd

file = r"C:/Users/Vibes/Desktop/Projects/.pdf/Camelot.pdf"
tables = camelot.read_pdf(file, pages='all', flavor="stream", encoding="utf-8")

master_DF = pd.DataFrame()
for i in range(tables.n):
    if i == 0:
        # First table: row 4 holds the column headers, data starts at row 5
        new_header = tables[i].df.iloc[4]
        tables[i].df = tables[i].df[5:]
        tables[i].df.columns = new_header
        master_DF = pd.concat([master_DF, tables[i].df], axis=0, ignore_index=True)
    else:
        # Later tables: drop the first row and reuse the header from the first table
        tables[i].df = tables[i].df[1:]
        tables[i].df.columns = new_header
        master_DF = pd.concat([master_DF, tables[i].df], axis=0, ignore_index=True)
    print(tables[i].df)
    print("")
Running Camelot gives this error:
page-1 is image-based, camelot only works on text-based pages.
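Since the message says page 1 is image-based, one quick way to confirm which pages have no text layer is to look at how much text can be extracted from each page. The following is only a minimal sketch: it assumes PyMuPDF is installed and importable as `fitz`, and the names `image_only_pages` and `extract_page_texts`, as well as the `min_chars` threshold, are made up for illustration.

```python
def image_only_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (almost) empty.

    min_chars is an arbitrary cut-off chosen for this illustration.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Read the text layer of every page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Hand-written page texts standing in for a real PDF:
sample_texts = ["", "   \n", "Name  Qty  Price\nWidget  4  9.99\n"]
print(image_only_pages(sample_texts))  # [1, 2]
```

With a real file, `image_only_pages(extract_page_texts(file))` would list the pages that likely need an OCR step before a text-based parser such as Camelot can read them.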