I am trying to extract tables and their names from a PDF file using Camelot in Python. I know how to extract the tables themselves (which is pretty straightforward with Camelot), but I am struggling to find any help on how to extract the table names. The intention is to extract this information and show the tables alongside their names, so that a user can select the relevant tables from the list.
I have tried extracting the tables and, separately, extracting the text from the PDFs. Both work, but I cannot connect each table name to its table.
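import glob
import os

import camelot
import PyPDF2


def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        # Count the pages so each page can be parsed separately
        # (PdfFileReader/getNumPages is the older, pre-3.0 PyPDF2 API)
        with open(file, 'rb') as f:
            n = PyPDF2.PdfFileReader(f).getNumPages()
        print(n)
        tables_dict = {}
        # Camelot page numbers start at 1, not 0
        for i in range(1, n + 1):
            tables = camelot.read_pdf(file, pages=str(i))
            tables_dict[i] = tables
        # Use the file name without its extension as the key
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys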
The expected result is each table together with its name as stated in the PDF file. For instance: Table on page x of the PDF, name: Table 1. Blah Blah blah '''Table'''
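To make the goal more concrete, here is a rough sketch of the kind of matching I have in mind, not a working solution. It assumes I can somehow get a bounding box for each table as (x1, y1, x2, y2) in PDF coordinates (origin at the bottom-left of the page), plus the text lines of the same page with their vertical positions. I have not verified how to obtain either of these from Camelot or a text extractor. The names `TextLine`, `find_caption` and `max_gap` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional, Sequence, Tuple


class TextLine(NamedTuple):
    """Hypothetical container for one line of page text."""
    text: str
    y_bottom: float  # bottom edge of the line, in PDF points from the page bottom


# Assumption: captions look like "Table 1. Some title"
CAPTION_RE = re.compile(r"^\s*Table\s+\d+\b")


def find_caption(
    table_bbox: Tuple[float, float, float, float],
    lines: Sequence[TextLine],
    max_gap: float = 50.0,
) -> Optional[str]:
    """Return the closest caption-like line just above the table, or None."""
    table_top = table_bbox[3]  # y2 is assumed to be the top edge of the table
    best = None
    for line in lines:
        gap = line.y_bottom - table_top  # positive when the line sits above the table
        if 0 <= gap <= max_gap and CAPTION_RE.match(line.text):
            if best is None or gap < best[0]:
                best = (gap, line.text)
    return best[1] if best else None


if __name__ == "__main__":
    page_lines = [
        TextLine("Some paragraph text", 700.0),
        TextLine("Table 1. Blah Blah blah", 520.0),
    ]
    print(find_caption((50.0, 350.0, 500.0, 510.0), page_lines))
```

This only handles captions placed above the table. If the PDF puts captions below the tables, the gap would need to be measured from the bottom edge of the table instead.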