
I have a Pandas DataFrame, df, with a path column containing paths to image files for analysis. Some of the images in this dataset do not actually exist, so I need to selectively remove the rows whose image path does not point to an existing file.

Currently, I am looping through the entire dataframe and reassigning it like so:

import os

for index, sample in df.iterrows():
    if not os.path.isfile(sample['path']):
        df = df.drop(index)

However, as my dataset contains tens of thousands of images, this is extremely slow.

I've also looked at using an approach like the one in this more general question:

df = df.drop(df[not os.path.isfile(df['path'])].index)

However, this does not work: os.path.isfile is not vectorized, so it expects a single path string and cannot be applied to a whole Pandas Series.

I feel like there must be a better way to approach this problem. Any ideas?

Oliver

2 Answers


Try using .apply on rows (axis=1) to get a boolean mask of which rows match your condition, then drop those rows by their index:

df = df.drop(df[df.apply(lambda row: not os.path.isfile(row['path']), axis=1)].index)
Adam.Er8
  • note: using `.apply` with a python function might not have a performance advantage over looping with `iterrows` or a list-comprehension, as it needs to execute a python function on each row (so it leaves the Cython space that makes "native" pandas/numpy functions so fast). – Adam.Er8 Jun 30 '19 at 12:16
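For illustration, here is the .apply-based pattern end to end on hypothetical data (a temporary file created just for the example, plus one path that does not exist):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: one file that really exists and one path that does not.
tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
tmp.close()
df = pd.DataFrame({"path": [tmp.name, "/no/such/image.png"]})

# Boolean mask of rows whose file is missing, built row-by-row with .apply.
missing = df.apply(lambda row: not os.path.isfile(row["path"]), axis=1)

# Drop the matching rows by their index labels.
df = df.drop(df[missing].index)

print(len(df))  # 1 — only the row with the existing file remains
os.unlink(tmp.name)
```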

I would vote for a list comprehension instead of apply() for performance, using the output as a boolean index for slicing:

df[[os.path.isfile(i) for i in df['path']]]
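A quick sketch of this approach on hypothetical data (a temporary directory with two real files and one missing path):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: two files that exist and one path that does not.
tmpdir = tempfile.mkdtemp()
paths = []
for name in ("a.png", "b.png"):
    p = os.path.join(tmpdir, name)
    open(p, "w").close()  # create an empty file
    paths.append(p)
paths.append(os.path.join(tmpdir, "missing.png"))

df = pd.DataFrame({"path": paths})

# A plain list comprehension builds the boolean list; slicing keeps True rows.
df = df[[os.path.isfile(p) for p in df["path"]]]

print(len(df))  # 2
```

Because the comprehension calls os.path.isfile directly on each string, it avoids the per-row overhead of constructing Series objects that .apply(axis=1) incurs.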
anky