Filter Pandas dataframe by the number of user-item interactions

Question

I'm given a Pandas dataframe with columns user_id, item_id, timestamp . Suppose that there is only one user_id - item_id interaction, i.e., there exist only one timestamp with this particular interaction:

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145
  2      4       6         123
  3      5       7         198

Given a parameter threshold I filter out those user_ids that appear <= threshold number of times:

data = data.groupby("user_id").filter(lambda x: len(x) > threshold)

and get (for threshold = 1):

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145

because only user "1" has interacted more than threshold items.

Now, suppose that there may be multiple user_id - item_id interactions, i.e., there could be several timestamps with a particular interaction:

     user_id  item_id   timestamp
  0      1       2         123
  1      1       2         145
  2      4       6         123
  3      4       7         198

What would be the most elegant (the fastest) way to filter out those users that have <= threshold number of unique interactions? The desired output would then be:

     user_id  item_id   timestamp
  0      4       6         123
  1      4       7         198

(because user "1" has interacted only with 1 item, and user "4" remains there, because he has interacted with 2 items).

One way I thought of (not that elegant, huh?):

data_cold = data.groupby('user_id').agg({'item_id':lambda x: x.nunique()})
data_cold = data_cold.reset_index()
data_cold = data_cold[data_cold.item_id > threshold]
data = data[data['user_id'].isin(data_cold.user_id)]

Welcome to SO. I suggest adding sample input and expected output. It is rather difficult to follow what you try to achieve. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Mr. T, Jan 28 '22 at 13:12
Try to update your post with the output of `print(df.head(10).to_string())` — Corralien, Jan 28 '22 at 13:21

Mr. T · Accepted Answer · 2022-01-31T09:30:16.497

I would also entertain the idea to use groupby with both columns user_id and item_id but aggregate differently:

import pandas as pd

#sample data
from io import StringIO
data1 = """
  user_id  item_id   timestamp
  0      1       2         123
  1      1       2         145
  2      4       6         123
  3      4       7         198
  4      4       7         172
  5      1       1         163
  6      1       1         172
  7      2       4         871
  8      4       3         883   """
df = pd.read_csv(StringIO(data1), sep = "\s{2,}", engine="python")


threshold = 2

#strategy: count number of user_id-item_id pairs
#remove rows with at least n non-NaN values, where n is defined by your threshold
filtered_ID = df.groupby(["user_id", "item_id"]).size().unstack().dropna(thresh=threshold+1).index

#then filter the original dataframe for the retrieved user_id's
print(df[df["user_id"].isin(filtered_ID)])

Sample output:

   user_id  item_id  timestamp
2        4        6        123
3        4        7        198
4        4        7        172
8        4        3        883

In my example, threshold = 3 would mean that users who have interacted with 3 or less item get filtered out. Here, in your sample output, user "4" interacted with item "7" twice, which means that the number of his unique interactions is 3 (with "6", "7", "3"). — kevin_was_here, Jan 31 '22 at 08:58
It filters for unique interactions, just an oversight that it included the threshold value. Fixed it. — Mr. T, Jan 31 '22 at 09:33

Anynamer · Answer 2 · 2022-01-28T13:45:53.053

0

if I understand correctly, you could group by two columns and then count groups with size(). After that, the unique interactions will have count of 1 and can be filtered if wanted:

data = data.groupby(["user_id", "item_id"]).size().reset_index(name = 'counts')
uniques = data[data['counts'] == 1]

edited Jan 28 '22 at 13:45

answered Jan 28 '22 at 13:34

Anynamer

334
1
6

Thanks, but that's not what I was looking for. Please check the code I provided - I wonder if it can be done more efficiently. – kevin_was_here Jan 28 '22 at 13:56
I edited my post, explaining why user "1" get filtered out, while user "4" remains. – kevin_was_here Jan 28 '22 at 13:59

Filter Pandas dataframe by the number of user-item interactions

2 Answers2