I am processing a pandas dataframe and want to remove every row whose "Full Path" lies under another "Full Path" already present in the dataframe.
In the example below I want to remove rows 1, 2, 3 and 4 because c:/dir/ "contains" them (these are file system paths):
   Full Path          Value
0  c:/dir/            x
1  c:/dir/sub1/       x
2  c:/dir/sub2/       x
3  c:/dir/sub2/a      x
4  c:/dir/sub2/b      x
5  c:/anotherdir/     x
6  c:/anotherdir_A/   x
7  c:/anotherdir_C/   x
Rows 6 and 7 are kept because the path in row 5 is not contained in their paths (this is the a in b test in my code below).
Here is the code I came up with, where res is the initial dataframe:
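to_drop = []
# Compare every path a against every other path b
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        # b is nested under a if a is a different path that appears inside b
        if a != b and a in b:
            to_drop.append(idx)
# Keep only the rows whose index was not flagged
res2 = res.loc[~res.index.isin(to_drop)]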
It works, but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant or clever way to do this. Any ideas?
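For anyone who wants to reproduce this, here is a minimal, self-contained sketch that rebuilds the example dataframe and shows one possible tighter formulation of the same check. It is only a sketch under a few assumptions: the "x" values are placeholders, is_nested is a name made up for illustration, and the test uses str.startswith (a prefix match) instead of the substring test a in b from the loop, which gives the same result on this data but is not identical in general. It is still quadratic in the number of rows.

```python
import pandas as pd

# Sample data copied from the table in the question; "x" is a placeholder value.
res = pd.DataFrame({
    "Full Path": [
        "c:/dir/",
        "c:/dir/sub1/",
        "c:/dir/sub2/",
        "c:/dir/sub2/a",
        "c:/dir/sub2/b",
        "c:/anotherdir/",
        "c:/anotherdir_A/",
        "c:/anotherdir_C/",
    ],
    "Value": ["x"] * 8,
})

paths = res["Full Path"]

# For each path b, check whether any *other* path a is a prefix of b.
# Assumption: a prefix test (startswith) replaces the substring test (a in b).
is_nested = paths.apply(
    lambda b: any(a != b and b.startswith(a) for a in paths)
)

# Keep only the rows that are not nested under another path.
res2 = res.loc[~is_nested]
print(res2.index.tolist())  # → [0, 5, 6, 7]
```

Rows 6 and 7 survive here because "c:/anotherdir/" ends with a slash, so it is not a prefix of "c:/anotherdir_A/" or "c:/anotherdir_C/".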