
I am processing a pandas dataframe and want to remove a row when its "Full Path" falls under another "Full Path" already present in the dataframe.

In the example below I want to remove rows 1, 2, 3 and 4 because c:/dir/ "contains" them (we are talking about file system paths here):

     Full Path        Value
0    c:/dir/            x
1    c:/dir/sub1/       x
2    c:/dir/sub2/       x
3    c:/dir/sub2/a      x
4    c:/dir/sub2/b      x
5    c:/anotherdir/     x
6    c:/anotherdir_A/   x
7    c:/anotherdir_C/   x

Rows 6 & 7 are kept because their paths are not contained in row 5's path (the `a in b` test in my code below).
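
For clarity, the expected result keeps rows 0, 5, 6 and 7:

     Full Path        Value
0    c:/dir/            x
5    c:/anotherdir/     x
6    c:/anotherdir_A/   x
7    c:/anotherdir_C/   x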

The code I came up with is the following (`res` is the initial dataframe):

to_drop = []
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        if a != b and a in b:
            to_drop.append(idx)
res2 = res.loc[~res.index.isin(to_drop)]

It works but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant/clever way to do this. Any idea?

wotter
  • Are you only considering subdirectories as "contained", or should rows containing `c:/dira` and `c:/dirb` also be considered as duplicates? – DeepSpace Sep 14 '20 at 15:57
  • No, they are different directories, not duplicates in my case (the `a in b` test is enough for me; it's the double loop that seems clumsy) – wotter Sep 14 '20 at 16:00
  • Does this answer your question? [Drop all duplicate rows across multiple columns in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – Trenton McKinney Sep 14 '20 at 16:01
  • Does this answer your question? [How do I get a list of all the duplicate items using pandas in python?](https://stackoverflow.com/questions/14657241) – Trenton McKinney Sep 14 '20 at 16:02
  • @Trenton McKinney It would if I could define what a duplicate is (in my case it's "a in b and a != b") – wotter Sep 14 '20 at 16:10

2 Answers

pd.concat([df, df['Full Path'].str.extract(r'(.*:/.*?/)')], axis = 1)\
  .drop_duplicates([0])\
  .drop(columns = 0)

You can use `.str.extract` with a regex to pull out the base directory, concatenate the extract back onto the original df, drop the duplicates of the base directory, and finally drop the extracted column.

Edit: an alternative if the paths are not in order:

df[df['Full Path'] == df['Full Path'].str.extract(r'(.*:/.*?/)', expand = False)]
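
As a quick check, here is a sketch using the question's sample data (the frame name `df` mirrors the snippet above):

import pandas as pd

df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub1/', 'c:/dir/sub2/',
                                 'c:/dir/sub2/a', 'c:/dir/sub2/b',
                                 'c:/anotherdir/', 'c:/anotherdir_A/',
                                 'c:/anotherdir_C/'],
                   'Value': ['x'] * 8})

# pull out the top-level directory of each path and keep only the rows
# whose full path already equals that top-level directory
base = df['Full Path'].str.extract(r'(.*:/.*?/)', expand=False)
print(df[df['Full Path'] == base])

which keeps rows 0, 5, 6 and 7.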
Ben Pap
  • If the dataframe is not sorted by `Full Path` this will not always provide the correct output (assuming OP wants to always get the top-most-not-duplicated entry) – DeepSpace Sep 14 '20 at 16:58
  • That's a good point. In that case you can use the extraction as a filter to match against the Path column. See my edit. This is probably better than my original answer. – Ben Pap Sep 14 '20 at 17:16

The time complexity of this is in the tank (no matter how you turn it, you have to check every path against every other path), but here is a single-line solution using `str.startswith`:

df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/', 'c:/anotherdir/dir',
                                 'c:/anotherdir/'],
                   'Value': ['A', 'B', 'C', 'D']})

print(df[[not any(b.startswith(a) if a != b else False for a in df['Full Path'])
          for b in df['Full Path']]])

output

        Full Path Value
0         c:/dir/     A
3  c:/anotherdir/     D
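
Applied to the full sample from the question (a sketch reusing the same filter), the standalone directories are kept as well:

import pandas as pd

paths = ['c:/dir/', 'c:/dir/sub1/', 'c:/dir/sub2/', 'c:/dir/sub2/a',
         'c:/dir/sub2/b', 'c:/anotherdir/', 'c:/anotherdir_A/', 'c:/anotherdir_C/']
df = pd.DataFrame({'Full Path': paths, 'Value': 'x'})

# keep a row only when no other path is a prefix of it
keep = [not any(b.startswith(a) if a != b else False for a in df['Full Path'])
        for b in df['Full Path']]
print(df[keep])  # rows 0, 5, 6 and 7 remain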
DeepSpace