
So, I have this DataFrame with almost 3,000 rows, which looks something like this:

        CITIES
0       ['A','B']
1       ['A','B','C','D']
2       ['A','B','C']
4       ['X']
5       ['X','Y','Z']
...     ...
2670    ['Y','Z']

I would like to remove from the DF all rows whose 'CITIES' list is contained in another row's list (the order does not matter). In the example above, I would like to remove 0 and 2, since both are contained in 1, and also remove 4 and 2670, since both are contained in 5. I tried something, and it kinda worked, but it was really stupid and took almost 10 minutes to compute. This was it:

indexesToRemove = []
for index, row in entrada.iterrows():
    citiesListFixed = row['CITIES']
    # compare against the rows after this one; stop at the first superset found
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent = row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break

Is there a more efficient way to do this?

Levy Barbosa
    Does this answer your question? [Pandas: remove duplicates that exist in any order](https://stackoverflow.com/questions/51603520/pandas-remove-duplicates-that-exist-in-any-order) – artemis Feb 05 '21 at 15:55
  • What does "contained" mean in this context? Should ["A", "C"] or ["A", "X"] be removed or not? – Mr. T Feb 05 '21 at 16:02
  • I don't see how his case could relate to mine; there's no list in his DataFrame – Levy Barbosa Feb 05 '21 at 16:02
  • @Mr.T, sorry if I was not clear enough; I think subset is a better word for it. A row should only be removed if all the elements in its list are contained in another list. Your example would not be removed because "C" is not an element in the 2nd list – Levy Barbosa Feb 05 '21 at 16:04

2 Answers


First create a DataFrame of dummies; then we can use matrix multiplication to see whether one row is a complete subset of another row, by checking whether the sum of its multiplication with another row equals the number of elements in that row. (This is going to be memory intensive.)

import pandas as pd
import numpy as np

df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
                              ['X'], ['X','Y','Z'], ['Y','Z']]})    

# one dummy column per city; collapse the exploded rows back to one row per
# original index (dtype=int so the matrix product counts overlaps)
arr = (pd.get_dummies(df['Cities'].explode(), dtype=int)
         .groupby(level=0).max()
         .to_numpy())
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]

subsets = np.matmul(arr, arr.T)  # subsets[i, j] = size of the intersection of rows i and j
np.fill_diagonal(subsets, 0)     # so the same row doesn't exclude itself

# subsets[i, j] equals row j's element count exactly when row j is a subset
# of row i; keep only the rows that are a subset of no other row
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)

df[mask]
#         Cities
#1  [A, B, C, D]
#4     [X, Y, Z]

As it stands, if you have two rows which tie for the longest subset (i.e. two rows with ['A','B','C','D']), both are dropped. If this is not desired you can first drop_duplicates on 'Cities' (you will need to convert to a hashable type like frozenset) and then apply the above, as in the sketch below.
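A minimal sketch of that deduplication step (assuming, as the question states, that order within each list does not matter):

df['Cities'] = df['Cities'].map(frozenset)   # hashable and order-insensitive
df = df.drop_duplicates('Cities')            # drop exact duplicates
df['Cities'] = df['Cities'].map(list)        # back to lists

After this, rebuild arr and the mask exactly as above.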

ALollz
  • Very neat and tidy answer, thanks for that! What would be the time complexity of this approach? (I don't know numpy under the hood that well yet) – rodrigocfaria Feb 05 '21 at 16:35
  • @rodrigocfaria I'm not the best with time complexity, and numpy can get complicated. But the multiplications and sums should all be vectorized so it should scale pretty well over a large range of N. – ALollz Feb 05 '21 at 16:54
  • No problem at all, I'll check numpy documentation to see if this point is cited somewhere. Thank you – rodrigocfaria Feb 05 '21 at 17:07
  • This one is really fast, but it uses more than 4GB of RAM during execution, so I'm probably going to use @rodrigocfaria's solution, even if it's slower – Levy Barbosa Feb 05 '21 at 19:24
  • @MárcioLevy yup that makes sense! Definitely will become prohibitively expensive for larger datasets – ALollz Feb 05 '21 at 19:26
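Given the memory concern in the comments above, one possible middle ground (a sketch, not from either answer; the helper name and chunk size are invented) is to run the same subset test in row chunks, so the full n x n product is never held in memory at once:

import numpy as np

def subset_mask_chunked(arr, chunk=256):
    """Boolean mask of rows that are NOT a subset of another row."""
    arr = np.asarray(arr, dtype=np.int64)    # ensure integer math
    n = arr.shape[0]
    sizes = arr.sum(1)                       # number of cities per row
    is_subset = np.zeros(n, dtype=bool)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        inter = arr[start:stop] @ arr.T      # intersection sizes, shape (chunk, n)
        # block row i is a subset of row j when their intersection
        # is as large as block row i itself
        eq = inter == sizes[start:stop, None]
        # a row must not count as its own superset
        eq[np.arange(stop - start), np.arange(start, stop)] = False
        is_subset[start:stop] = eq.any(1)
    return ~is_subset

# df[subset_mask_chunked(arr)] should give the same rows as df[mask] above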

A possible, more didactic approach would be the following:

import pandas as pd
import numpy as np
import time

start = time.process_time()
  
lst1 = [0,1,2,3,4,2670]   
lst2 = [['A','B'], ['A','B','C','D'], ['A','B','C'], ['X'], ['X','Y','Z'], ['Y','Z']] 

df = pd.DataFrame(list(zip(lst1, lst2)), columns =['id', 'cities'])

df['contained'] = 0

n = df.shape[0]

# pairwise comparison: mark row i if its city set is a subset of row j's
for i in range(n):
    for j in range(n):
        if i != j:
            if (set(df.loc[i,'cities']) & set(df.loc[j,'cities'])) == set(df.loc[i,'cities']):
                df.loc[i,'contained'] = 1


print(df)

print("\nTime elapsed:", time.process_time() - start, "seconds")

The time complexity of this solution is O(n^2): every pair of rows is compared in both directions.
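A small constant-factor tweak (a sketch, not part of the code above): build each row's set once before the loops instead of rebuilding it on every comparison, and stop at the first superset found:

sets = df['cities'].map(set).tolist()       # build each set only once
for i in range(n):
    for j in range(n):
        if i != j and sets[i] <= sets[j]:   # subset-or-equal test
            df.loc[i, 'contained'] = 1
            break                           # one superset is enough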

You end up with this data frame as a result:

     id        cities  contained
0     0        [A, B]          1
1     1  [A, B, C, D]          0
2     2     [A, B, C]          1
3     3           [X]          1
4     4     [X, Y, Z]          0
5  2670        [Y, Z]          1

Then you just have to exclude the rows where contained == 1.
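For example (a sketch; names as in the code above):

result = df[df['contained'] == 0].drop(columns='contained')
print(result)
#    id        cities
# 1   1  [A, B, C, D]
# 4   4     [X, Y, Z]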

rodrigocfaria
  • Your answer works well; it takes around 180 sec, compared to the @ALollz one that takes around 30 s, but yours uses WAY less RAM, so I think I will use this one. The other one uses around 4GB on my 2670-row DataFrame – Levy Barbosa Feb 05 '21 at 19:23
  • I'm glad it helped @MárcioLevy! Feel free to accept my answer :) – rodrigocfaria Feb 05 '21 at 19:42