
I have a Pandas DataFrame, df, with a path column containing paths to image files for analysis. Some of the images in this dataset do not actually exist, so I need to selectively remove the rows whose image path does not point to an existing file.

Currently, I am looping through the entire dataframe and reassigning it like so:

import os

for index, sample in df.iterrows():
    if not os.path.isfile(sample['path']):
        df = df.drop(index)

However, as my dataset contains tens of thousands of images, this is extremely slow.

I've also looked at using an approach like the one in this more general question:

df = df.drop(df[not os.path.isfile(df['path'])].index)

However, this does not work: os.path.isfile is not vectorized, so it expects a single path string and cannot be applied to a whole Pandas Series.

I feel like there must be a better way to approach this problem. Any ideas?

Oliver

2 Answers


Try using .apply on rows (axis=1) to get a boolean mask of which rows match your condition, then drop those rows by their index:

df = df.drop(df[df.apply(lambda row: not os.path.isfile(row['path']), axis=1)].index)
Adam.Er8
  • note: using `.apply` with a python function might not have a performance advantage over looping with `iterrows` or a list-comprehension, as it needs to execute a python function on each row (so it leaves the Cython space that makes "native" pandas/numpy functions so fast). – Adam.Er8 Jun 30 '19 at 12:16
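For illustration, here is the .apply-based pattern end to end on hypothetical data (a temporary file created just for the example, plus one path that does not exist):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: one file that really exists and one path that does not.
tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
tmp.close()
df = pd.DataFrame({"path": [tmp.name, "/no/such/image.png"]})

# Boolean mask of rows whose file is missing, built row-by-row with .apply.
missing = df.apply(lambda row: not os.path.isfile(row["path"]), axis=1)

# Drop the matching rows by their index labels.
df = df.drop(df[missing].index)

print(len(df))  # 1 — only the row with the existing file remains
os.unlink(tmp.name)
```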

I would vote for a list comprehension instead of apply() for performance, using the output as a boolean index for slicing:

df[[os.path.isfile(i) for i in df['path']]]
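A quick sketch of this approach on hypothetical data (a temporary directory with two real files and one missing path):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: two files that exist and one path that does not.
tmpdir = tempfile.mkdtemp()
paths = []
for name in ("a.png", "b.png"):
    p = os.path.join(tmpdir, name)
    open(p, "w").close()  # create an empty file
    paths.append(p)
paths.append(os.path.join(tmpdir, "missing.png"))

df = pd.DataFrame({"path": paths})

# A plain list comprehension builds the boolean list; slicing keeps True rows.
df = df[[os.path.isfile(p) for p in df["path"]]]

print(len(df))  # 2
```

Because the comprehension calls os.path.isfile directly on each string, it avoids the per-row overhead of constructing Series objects that .apply(axis=1) incurs.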
anky