3

I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The intention is to extract this information and show a visual of the tables and their names for a user to select relevant tables from the list.

I have tried extracting tables and then extracting text as well from pdfs. I am successful at both but not at connecting the table name to the table.

import glob
import os

import camelot
import PyPDF2

def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        # PdfFileReader/getNumPages is the PyPDF2 < 3.0 API; newer releases use len(PdfReader(f).pages)
        with open(file, 'rb') as fh:
            n = PyPDF2.PdfFileReader(fh).getNumPages()
        print(n)
        tables_dict = {}
        # camelot page numbers are 1-based, so iterate over 1..n
        for i in range(1, n + 1):
            tables = camelot.read_pdf(file, pages=str(i))
            tables_dict[i] = tables
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys

The expected result is each table together with its name as stated in the PDF file. For instance, for the table on page x of the PDF, the name would be "Table 1. Blah Blah blah", followed by the table itself.

Thomas
Vijay
  • The code you posted does not show anything you tried for fetching the table name. Camelot-py does not give you what you are looking for. I would suggest using pdfminer or PyPDF2 to read the PDF objects with location bindings and extract the table name. – ExtractTable.com Oct 03 '19 at 14:03
  • Please read this: https://stackoverflow.com/questions/58185404/python-pdf-parsing-with-camelot-and-extract-the-table-title There aren't general solutions. – Stefano Fiorucci - anakin87 Oct 04 '19 at 07:13
  • Does this answer your question? [Python PDF Parsing with Camelot and Extract the Table Title](https://stackoverflow.com/questions/58185404/python-pdf-parsing-with-camelot-and-extract-the-table-title) – Brian Wylie Feb 17 '21 at 02:34
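
The location-based approach suggested in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: `caption_above`, `page_text_boxes`, and the `max_gap` threshold are hypothetical names and values, not part of Camelot or pdfminer, and the heuristic (take the nearest text box directly above the table) will not suit every layout.

```python
def caption_above(table_bbox, text_boxes, max_gap=60.0):
    """Return the text of the closest box lying just above the table, or None.

    table_bbox and each box are (x0, y0, x1, y1) in PDF points with the
    origin at the bottom-left of the page, so "above" means a larger y.
    max_gap is an arbitrary, hypothetical threshold in points.
    """
    tx0, ty0, tx1, ty1 = table_bbox
    best_text, best_gap = None, None
    for (x0, y0, x1, y1), text in text_boxes:
        gap = y0 - ty1                    # distance from table top to box bottom
        overlaps = x0 < tx1 and x1 > tx0  # box shares horizontal extent with the table
        if overlaps and 0 <= gap <= max_gap and text.strip():
            if best_gap is None or gap < best_gap:
                best_text, best_gap = text.strip(), gap
    return best_text


def page_text_boxes(pdf_path, page_number):
    """Collect (bbox, text) pairs for one 1-based page using pdfminer.six."""
    # Imported lazily so caption_above can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    boxes = []
    # extract_pages expects zero-based page numbers.
    for layout in extract_pages(pdf_path, page_numbers=[page_number - 1]):
        for element in layout:
            if isinstance(element, LTTextContainer):
                boxes.append((element.bbox, element.get_text()))
    return boxes
```

With a Camelot table in hand, the call might look like `caption_above(table._bbox, page_text_boxes(file, table.page))`. Note that `_bbox` is a private Camelot attribute, so treating it as the table's bounding box in PDF coordinates is an assumption that may not hold across versions.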

2 Answers

0

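I found a workaround that works for me, at least: OCR the PDF into text, find the line that looks like the table's header row, and take the lines just above it as the table name.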

import pytesseract
from pdf2image import convert_from_path
import camelot
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarityAmt = 0.6 # find with 60% similarity
def find_table_name(dataframe, documentString):

    # Get the column labels from the dataframe as a list of strings
    keys = [str(key) for key in dataframe.keys()]

    # If the first label is a digit (camelot numbers its columns 0, 1, 2, ...), we assume that the real header is the first row of the table instead
    if keys and keys[0].isdigit():
        keys = [str(cell) for cell in dataframe.iloc[0]]

    # Put all of the keys in a single string
    keysAll = "".join(keys)

    # Assuming that you extracted the text from a PDF, it should be multi-lined. We split by line
    stringsSeparated = documentString.split("\n")
    for i, string in enumerate(stringsSeparated):

        # Since a row should be horizontal, we check the similarity between the header and each line of text.
        similarRating = similar(string, keysAll)
        if similarRating > similarityAmt: # If similarity rating (which is a ratio from 0 to 1) is above the similarity amount, we approve of it
            # Walk upwards, at most 10 lines, until we find a line that is longer than 4 characters (this is an arbitrary number just to ignore blank lines).
            for j in range(i - 1, max(i - 11, -1), -1):
                separatedString = stringsSeparated[j]
                if len(separatedString) > 4:
                    # Return that line together with the one above it to hopefully have an accurate name
                    previous = stringsSeparated[j - 1] if j > 0 else ""
                    return previous + separatedString
    return "Unnamed"

# Retrieve the text from the pdf
pages = convert_from_path(pdf_path, 500) # pdf_path would be the path of the PDF which you extracted the table from
pdf_text = ""
# Add all page strings into a single string, so the entire PDF is one single string
for pageNum, imgBlob in enumerate(pages):
    extractedText = pytesseract.image_to_string(imgBlob, lang='eng')
    pdf_text += extractedText + "\n"

# Get the name of the table using the table itself and pdf text
tableName = find_table_name(table.df, pdf_text) # A table you extracted with your code, which you want to find the name of
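
If the captions follow the "Table 1. Blah Blah blah" pattern from the question, a simpler variant is to scan the extracted text for lines that look like captions. The sketch below assumes captions start a line with "Table", a number, and a period or colon; `CAPTION_RE` and `find_captions` are hypothetical names, and the pattern would need adjusting for other caption styles.

```python
import re

# Assumed caption shape: "Table <number>" followed by "." or ":" and a title.
CAPTION_RE = re.compile(r"^\s*Table\s+\d+[.:]\s+\S.*$", re.IGNORECASE)


def find_captions(page_text):
    """Return every line of page_text that looks like a table caption."""
    return [line.strip() for line in page_text.splitlines() if CAPTION_RE.match(line)]
```

Running this per page (for example on the OCR output of each image above) gives candidate names in reading order, which can then be paired with the tables Camelot found on the same page.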
-3

Camelot exposes tables through the TableList and Table classes, documented in the API reference: https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList

Start at the section of that page headed "Lower-Lower-Level Classes".

Camelot keeps no reference to a table's name, only the cell data. It does expose each table as a pandas DataFrame, though, and a name can be attached to that DataFrame.

Combine Camelot and pandas to give each table a name; see "Get the name of a pandas DataFrame" (https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe).

Update appended to the answer. The following is from https://camelot-py.readthedocs.io/en/master/


>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> df_table = tables[0].df # get a pandas DataFrame!

# added: attach a name to the DataFrame
>>> df_table.name = 'name here'


# from https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'

print(df.name)

Note: the added 'name' attribute is not part of the DataFrame itself, so it is lost when the DataFrame is serialized.
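
Because the attribute does not survive serialization, one option is to keep the name next to the table rather than on it. The sketch below is a hypothetical illustration, not part of Camelot or pandas: `NamedTable` and `build_catalogue` are made-up names, and the label format is arbitrary.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class NamedTable:
    name: str    # caption found in the PDF, or a placeholder such as "Unnamed"
    page: int    # 1-based page number the table came from
    table: Any   # e.g. a camelot Table or its pandas DataFrame


def build_catalogue(entries):
    """Map a display label to its NamedTable so a user can pick by label."""
    return {f"p{entry.page}: {entry.name}": entry for entry in entries}
```

The keys of the returned dict can be shown to the user as a selection list, and the chosen key leads back to the table without relying on an attribute stored on the DataFrame.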


Further update: the 'Name' that pandas prints for a row comes from the DataFrame's 'index' (the row labels), as this `.loc` example from the pandas documentation shows.


Getting values

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64
Joe McKenna
  • The name we are looking for doesn't belong to the table, so it's not part of the dataframe. I think that your answer doesn't solve the problem. – Stefano Fiorucci - anakin87 Oct 04 '19 at 07:17
  • Hi Joe, Thanks for the response. I looked through the documentation and still could not find an answer. I am relatively new to text related packages (and mainly to camelot). Can you please guide me a bit more and show me the functions that can be used? Thanks, Vijay – Vijay Oct 04 '19 at 07:36
  • Yes, done. Be careful, you have to add the 'name' attribute to the df but some of the scenarios with it will lose that data. – Joe McKenna Oct 04 '19 at 09:11
  • Thanks Joe. I think the code is assigning a name rather than extracting it from the pdf. Anakin87 suggested that the name does not belong to the table and so what we extract will not contain the name. I am trying to get the table name from the pdf file the way the author has written it :) – Vijay Oct 11 '19 at 09:25
  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc – Joe McKenna Oct 11 '19 at 12:30
  • The 'loc' function is what people are looking for? See the example at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc. Notice 'name' is actually called 'index'. – Joe McKenna Oct 11 '19 at 12:39