How to iterate through a list of DataFrames and drop all data if a specific string isn't found

Question

I am using the Python library Camelot to parse multiple PDFs and pull out all of the tables within those files. The first line of code returns all of the tables scraped from the PDF as a list. I am looking for one table in particular that contains a unique string. Thankfully, this string appears only in that table, so I can, theoretically, use it to isolate the table I want to grab.

These PDFs are created in more or less the same format, but there is enough variance that I can't just make a static call on the table I want. For example, sometimes the table I want will be the first table scraped, and sometimes it will be the third. Therefore, I need to write some code that selects the table dynamically.

The workflow I have in my mind logically goes like this:

Create an empty list before the for loop to append the tables to. Call a for loop and iterate over each table in the list outputted by the Camelot code. If the table does not have the string I am looking for, delete all data in that table and then append the empty data frame to the empty list. If it does have the string I am looking for, append it to the empty list without deleting anything.

Is there a better way to go about this? I'm sure there probably is.

Below is what I have put together so far. I'm struggling to write a conditional statement that drops all of the rows of the DataFrame if the string is not present. I have found plenty of examples of dropping columns and rows when a string is present, but nothing for an entire DataFrame.

import camelot
import pandas as pd

#this creates a list of all the tables that Camelot scrapes from the pdf
tables = camelot.read_pdf('pdffile', flavor ='stream', pages = '1-end')

#empty list to append the tables to
elist = []

for t in tables:
    dftemp = t.df

    #my attempt at dropping all the value if the unique value isnt found. THIS DOESNT WORK
    dftemp[dftemp.values  != "Unique Value", dftemp.iloc[0:0]]

    #append to the list
    elist.append(dftemp)

#combine all the dataframes in the list into one dataframe
dfcombined = pd.concat(elist)

You need a if condition. Something like this: if string_found: elist.append([]) else: elist.append(t) — Hello.World, Mar 07 '19 at 21:27
How about dftemp = t.df[t.df.isin(['Unique Value'])].dropna() — run-out, Mar 07 '19 at 22:02

score 3 · Accepted Answer · answered Mar 07 '19 at 22:00

You can use the 'in' operator on the NumPy array returned by dftemp.values:

for t in tables:
    dftemp = t.df

    #keep the table only if some cell equals the unique string
    if "Unique Value" in dftemp.values:
        #append to the list
        elist.append(dftemp)
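
As a rough, self-contained illustration of that check, the sketch below uses hand-built DataFrames as hypothetical stand-ins for the t.df frames Camelot would return (no PDF is read here). It also shows one caveat worth keeping in mind: 'in' on a NumPy array compares whole cells for equality, so a cell that merely contains the string as part of longer text is not matched.

```python
import pandas as pd

# Hypothetical stand-ins for the t.df frames that Camelot would return.
frames = [
    pd.DataFrame([["Name", "Qty"], ["bolt", "4"]]),
    pd.DataFrame([["Unique Value", "Total"], ["a", "1"]]),
    pd.DataFrame([["Note", "Unique Value here"]]),
]

# `in` on a NumPy array tests each element for equality, so only a cell
# that is exactly "Unique Value" counts as a match.
elist = [df for df in frames if "Unique Value" in df.values]

print(len(elist))  # 1 (the third frame only contains the string inside a longer cell)
```

If the string may be embedded in a longer cell, which can happen with scraped tables, a substring test would be needed instead of plain equality.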

kudeh


score 2 · Answer 2 · answered Mar 07 '19 at 22:16

You can do it in a single line:

dfcombined = pd.concat([t.df if "Unique Value" in t.df.values else pd.DataFrame() for t in tables])
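
A possible variation, sketched under the assumption that the string might sit inside a longer cell and that some PDFs might contain no matching table at all: the helpers below (contains_text and select_tables are hypothetical names, not part of Camelot or pandas) filter with a substring test and return an empty DataFrame when nothing matches, since pd.concat raises a ValueError on an empty list. The sample frames again stand in for each t.df.

```python
import pandas as pd

def contains_text(df, needle):
    """Return True if any cell, converted to str, contains `needle` as a substring."""
    return any(needle in str(cell) for cell in df.to_numpy().ravel())

def select_tables(frames, needle):
    """Concatenate the frames that mention `needle`; return an empty DataFrame if none do."""
    matches = [df for df in frames if contains_text(df, needle)]
    # pd.concat raises ValueError on an empty list, so guard that case explicitly.
    return pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()

# Hypothetical stand-ins for the t.df frames that Camelot would return.
sample_frames = [
    pd.DataFrame([["Name", "Qty"], ["bolt", "4"]]),
    pd.DataFrame([["Unique Value", "Total"], ["a", "1"]]),
    pd.DataFrame([["Note", "Unique Value here"]]),
]

dfcombined = select_tables(sample_frames, "Unique Value")
no_match = select_tables(sample_frames, "not in any table")
```

With real Camelot output, the call would presumably look like select_tables([t.df for t in tables], "Unique Value").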

OSainz

