I am using the Python library Camelot to parse multiple PDFs and pull out all of the tables in those files. The first line of code returns all of the tables that were scraped from the PDF as a list. I am looking for one table in particular, which contains a string that appears in no other table, so I can, theoretically, use that string to isolate the table I want.
These PDFs are created in more or less the same format, but there is enough variance that I can't hard-code the index of the table I want. For example, sometimes the table I want is the first table scraped, and sometimes it is the third. Therefore, I need to write some code that selects the table dynamically.
The workflow I have in my mind logically goes like this:
Create an empty list before the for loop to append the tables to. Loop over each table in the list returned by Camelot. If the table does not contain the string I am looking for, delete all data in that table and then append the empty data frame to the list. If it does contain the string, append it to the list without deleting anything.
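A minimal sketch of that workflow, assuming the string fills a whole cell exactly. The helper name `blank_unless_match` and the stand-in `frames` list are hypothetical; with Camelot the list would be built from each table's `.df`.

```python
import pandas as pd

def blank_unless_match(frames, needle):
    """Return the frames, with those lacking `needle` replaced by empty ones."""
    kept = []
    for df in frames:
        # True if at least one cell equals the search string exactly
        has_needle = df.eq(needle).any().any()
        # iloc[0:0] keeps the columns but drops every row
        kept.append(df if has_needle else df.iloc[0:0])
    return kept

# Stand-in data (assumption); with Camelot: frames = [t.df for t in tables]
frames = [
    pd.DataFrame([["a", "b"], ["c", "d"]]),
    pd.DataFrame([["Unique Value", "x"], ["y", "z"]]),
]
result = blank_unless_match(frames, "Unique Value")
combined = pd.concat(result, ignore_index=True)
```

Here `combined` ends up holding only the rows of the second table, because the first one was emptied before concatenation.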
Is there a better way to go about this? I'm sure there probably is.
Below is the code I have put together so far. I'm struggling to write a conditional statement that drops all of the rows of the dataframe if the string is not present. I have found plenty of examples of dropping individual columns and rows based on a string, but nothing for the entire data frame.
import camelot
import pandas as pd

# this creates a list of all the tables that Camelot scrapes from the pdf
tables = camelot.read_pdf('pdffile', flavor='stream', pages='1-end')

# empty list to append the tables to
elist = []

for t in tables:
    dftemp = t.df

    # my attempt at dropping all the values if the unique value isn't found. THIS DOESN'T WORK
    dftemp[dftemp.values != "Unique Value", dftemp.iloc[0:0]]

    # append to the list
    elist.append(dftemp)

# combine all the dataframes in the list into one dataframe
dfcombined = pd.concat(elist)
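One possibly simpler route, sketched here under assumptions rather than offered as the definitive fix: instead of emptying the unwanted tables, keep only the ones that mention the string. The helpers `contains_text` and `select_tables` and the stand-in `frames` list are hypothetical names. The check is a plain substring match, so it also works when the string is only part of a cell.

```python
import pandas as pd

def contains_text(df, needle):
    """True if any cell, read as text, contains `needle` as a substring."""
    return any(
        col.astype(str).str.contains(needle, regex=False).any()
        for _, col in df.items()
    )

def select_tables(frames, needle):
    """Keep only the frames that mention `needle`."""
    return [df for df in frames if contains_text(df, needle)]

# Stand-in data (assumption); with Camelot: frames = [t.df for t in tables]
frames = [
    pd.DataFrame([["a", "b"], ["c", "d"]]),
    pd.DataFrame([["Total Unique Value 2020", "x"], ["y", "z"]]),
]
matches = select_tables(frames, "Unique Value")

# pd.concat raises on an empty list, so fall back to an empty frame
dfcombined = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()
```

Filtering the list this way avoids carrying empty data frames into `pd.concat`, and the fallback covers a PDF where no table matches.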