
I want to count the number of DataFrame rows which contain all of the values from a list.

Let's say I've got:

import numpy as np
import pandas as pd

abc = pd.DataFrame(np.random.randint(1, 100, (100, 10)))

and I've got the following list (with unique values):

x = [10, 32, 8]

What is the simplest way, using pandas, to check whether a row contains all of the given values? I could probably convert each row to a list, compare its contents with the given list, and increment my counter when the comparison succeeds, but I guess that is not the fastest way in this case...
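For reference, the naive loop described above could be sketched like this (my illustration of the approach the question mentions, not a recommendation):

# convert each row to a list and check that every value of x appears in it
count = sum(all(v in row for v in x) for row in abc.values.tolist())
print(count)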

4 Answers


It is necessary to mask duplicates first. Then we can use `DataFrame.isin` and `DataFrame.sum` to count the number of matches per row, and check with `DataFrame.ge` whether that count is greater than or equal to the length of the list, which is sufficient since the list has no duplicates. Finally we take the sum of the rows that meet this condition.

df = pd.DataFrame(np.sort(abc, axis=1))  # sort each row so duplicate values sit next to each other
# mask duplicates (a value equal to its left neighbour) with NaN, then count matches per row
df.where(df.ne(df.shift(axis=1))).isin(x).sum(axis=1).ge(len(x)).sum()
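To see the masking pipeline in action, here is a small illustration on a hypothetical 3x4 frame (the names `small`, `s` and `masked` are mine, not from the answer):

import numpy as np
import pandas as pd

x = [10, 32, 8]
small = pd.DataFrame([[10, 32, 8, 10],
                      [10, 10, 10, 32],
                      [8, 32, 10, 5]])

s = pd.DataFrame(np.sort(small, axis=1))  # sort each row so duplicates are adjacent
masked = s.where(s.ne(s.shift(axis=1)))   # a value equal to its left neighbour becomes NaN
print(masked.isin(x).sum(axis=1).ge(len(x)).sum())  # 2 -- only rows 0 and 2 contain all of x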

If there are no duplicates within rows:

abc.isin(x).sum(axis=1).ge(len(x)).sum()
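Why the masking matters: with duplicates in a row, the plain version can over-count, as this small counter-example (mine, not from the answer) shows:

import pandas as pd

x = [10, 32, 8]
bad = pd.DataFrame([[10, 10, 32]])         # 8 is missing, 10 is duplicated
print(bad.isin(x).sum(axis=1).ge(len(x)))  # True for row 0: a false positive without the mask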


ansev
  • This piece - `abc.stack().drop_duplicates()` - is wrong. You will also drop the same number appearing for the second (and any further) time in the same column, hence it will disappear from all the rows below. – Grzegorz Skibinski Mar 19 '20 at 23:34
  • This is what I was looking for! Thank you! I'm just wondering how `df.ne(df.shift(axis=1))` works. I can see that it is creating a mask and marking the duplicated values in the row. Are the values in the row shifted one by one? Which values are compared by `df.ne()`? – jack.sparrow Mar 20 '20 at 11:31
  • Hi, thanks for accepting the answer. The problem here is that duplicates need to be removed in order to compare against the list size. To eliminate duplicates while maintaining the shape of the DataFrame (the number of duplicates can differ in each row), it is necessary to convert them to NaN. There is no built-in function for this; we could use apply, but that is generally discouraged: https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code. – ansev Mar 20 '20 at 11:39
  • This is why it occurred to me that **if all the rows were sorted in ascending or descending order, we could detect duplicates simply by checking whether a value equals the previous one (along the row)**. The shift operation creates a DataFrame by moving all the row values one cell to the right. Then with `.ne()` we mark the cells that are not duplicates, and with `DataFrame.where()` we set the rest to `NaN` (see the small illustration after these comments). It may seem complicated, but it is a way to do this with `numpy` and `pandas` functions that remain fast across DataFrame sizes. – ansev Mar 20 '20 at 11:45
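A small illustration of the `shift`/`ne` step on a single sorted row (my sketch, assuming pandas defaults):

import pandas as pd

row = pd.DataFrame([[8, 10, 10, 32]])  # one already-sorted row
print(row.shift(axis=1))               # NaN  8.0  10.0  10.0 -- every value moved one cell to the right
print(row.ne(row.shift(axis=1)))       # True True False True -- the False marks the duplicated 10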

Something like

# check that every element of x appears in the row, then count the True rows
abc.apply(lambda y: pd.Series(x).isin(y).all(), axis=1).sum()
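An equivalent row-wise check (my variation, not part of the original answer) uses a plain Python set instead of building a Series per row:

# assumes the abc and x from the question
abc.apply(lambda row: set(x).issubset(row), axis=1).sum()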
BENY

You can do:

# in case x is not unique
# x = set(x)

# long format: one (row index, value) pair per cell, with duplicates within a row removed
df = abc.stack().reset_index(level=1, drop=True).reset_index().drop_duplicates().rename(columns={0: "num"})

# keep only values from x, count the distinct hits per row, and count the rows that hit all of them
cnt = df.loc[df["num"].isin(x)].groupby("index")["num"].count().eq(len(x)).sum()

print(cnt)
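To see what the long-format frame looks like, here is a tiny example (mine, not from the answer) with one row containing a duplicate:

import pandas as pd

tiny = pd.DataFrame([[10, 10, 32]])
d = (tiny.stack()                          # one (row, column) -> value pair per cell
         .reset_index(level=1, drop=True)  # drop the column level, keep the row index
         .reset_index()                    # row index becomes the "index" column
         .drop_duplicates()                # duplicate (index, value) pairs disappear
         .rename(columns={0: "num"}))
print(d)  # keeps (0, 10) and (0, 32); the second 10 is gone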
Grzegorz Skibinski

You can use NumPy's `setdiff1d` of the list and each row; if its length is zero, all of the elements are present.

abc.apply(lambda row: len(np.setdiff1d(x, row)) == 0, axis=1)
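The expression above returns a boolean Series, one entry per row; to get the count the question asks for, one can sum it (my addition, not from the answer):

abc.apply(lambda row: len(np.setdiff1d(x, row)) == 0, axis=1).sum()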
Vaishali