
I have a dataframe like this:

 index   customerID    item_tag   orderID    Amount
   0         23            A         1        34.50
   1         55            B         2        11.22
   2         23            A         3         9.34
   3         55            D         4       123.44
   4         55            F         5       231.40

I also have a list containing item_tags like this:

my_list = ['A', 'B', 'D']

Now I want to check how many item types from my_list were ordered by each customer. For example, for customer 23 this number would be 1, as customer 23 ordered only the item tagged A, but neither B nor D. Customer 55, however, ordered items B and D, so this indicator would be 2, as two item types from my_list are present in his orders (he also ordered item F, but that item is not in my_list).

So far I have tried groupby(['customerID', 'item_tag'], as_index=False).count(), but this requires creating a new dataframe (maybe not necessarily?) and then using an if statement for each element in the list. I suspect there is a more elegant way, but I could not find one, neither on Google nor here. My dataframe has millions of rows, so I am looking for the most efficient solution.

As a result, I want dataframe like this:

 index   customerID   if_A  if_B  if_D  sum_in_list
   0         23         1     0    0        1
   1         55         0     1    1        2
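For reference, the sample frame above can be reconstructed like this (a sketch; integer IDs and float amounts assumed):

```python
import pandas as pd

# Reconstruct the sample data from the question
df = pd.DataFrame({
    'customerID': [23, 55, 23, 55, 55],
    'item_tag':   ['A', 'B', 'A', 'D', 'F'],
    'orderID':    [1, 2, 3, 4, 5],
    'Amount':     [34.50, 11.22, 9.34, 123.44, 231.40],
})

my_list = ['A', 'B', 'D']
```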
Bonzogondo

3 Answers


Here's one way using get_dummies + groupby, you get counts for free:

res = pd.get_dummies(df[['customerID', 'item_tag']], columns=['item_tag'])\
        .groupby(['customerID'], as_index=False).sum()

print(res)

   customerID  item_tag_A  item_tag_B  item_tag_D  item_tag_F
0          23           2           0           0           0
1          55           0           1           1           1

There are a couple of additional steps if you want a binary result and limit to specific tags:

L = ['A', 'B', 'D']

df_filtered = df.loc[df['item_tag'].isin(L), ['customerID', 'item_tag']] 

res = pd.get_dummies(df_filtered, columns=['item_tag'])\
        .groupby(['customerID']).any().astype(int).reset_index()

res['total_count'] = res.iloc[:, 1:].sum(axis=1)

print(res)

   customerID  item_tag_A  item_tag_B  item_tag_D  total_count
0          23           1           0           0            1
1          55           0           1           1            2
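If only the final sum_in_list count is needed, without the per-tag indicator columns, the same filter plus a nunique groupby gives it directly — a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'customerID': [23, 55, 23, 55, 55],
    'item_tag':   ['A', 'B', 'A', 'D', 'F'],
})
L = ['A', 'B', 'D']

# Distinct wanted tags per customer; repeat orders of the same tag count once
counts = (df[df['item_tag'].isin(L)]
          .groupby('customerID')['item_tag']
          .nunique())

print(counts)
# customerID
# 23    1
# 55    2
# Name: item_tag, dtype: int64
```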
jpp

My solution filters out the unwanted products and then does the grouping:

wanted = df[df['item_tag'].isin(my_list)]
wanted.groupby(['customerID', 'item_tag'])\
      .count().unstack()['Amount'].fillna(0).astype(int)

#item_tag    A  B  D
#customerID         
#23          2  0  0
#55          0  1  1
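Note the table above holds raw counts (customer 23 ordered A twice). To match the binary indicators and sum_in_list column requested in the question, the counts can be capped at 1 and summed across each row — a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'customerID': [23, 55, 23, 55, 55],
    'item_tag':   ['A', 'B', 'A', 'D', 'F'],
    'orderID':    [1, 2, 3, 4, 5],
    'Amount':     [34.50, 11.22, 9.34, 123.44, 231.40],
})
my_list = ['A', 'B', 'D']

wanted = df[df['item_tag'].isin(my_list)]
table = (wanted.groupby(['customerID', 'item_tag'])
               .count().unstack()['Amount'].fillna(0).astype(int))

# Cap each count at 1 to get 0/1 indicators, then sum across the row
result = table.clip(upper=1)
result['sum_in_list'] = result.sum(axis=1)
```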
DYZ

This is a filtered cross tabulation, and we can see several options for performing them here under the answer to question #9.

Use crosstab and clip_upper

pd.crosstab(df.customerID, df.item_tag).clip_upper(1)[my_list]

item_tag    A  B  D
customerID         
23          1  0  0
55          0  1  1

Add assign to get the summation, using a lambda to keep it inline:

pd.crosstab(df.customerID, df.item_tag).clip_upper(1)[my_list].assign(
    Total=lambda d: d.sum(1))

item_tag    A  B  D  Total
customerID                
23          1  0  0      1
55          0  1  1      2

pandas.Series

An interesting alternative is to construct a new Series object. I build it so that the item_tags land in the first level of a MultiIndex, leaving it convenient to use loc and slice just the tags I care about.

s = pd.Series(1, list(set(zip(df.item_tag, df.customerID))))
s.loc[my_list].unstack(0, fill_value=0).assign(
    Total=lambda d: d.sum(1))

    A  B  D  Total
23  1  0  0      1
55  0  1  1      2
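Assembled as a self-contained sketch on the question's data, with the MultiIndex built explicitly:

```python
import pandas as pd

df = pd.DataFrame({
    'customerID': [23, 55, 23, 55, 55],
    'item_tag':   ['A', 'B', 'A', 'D', 'F'],
})
my_list = ['A', 'B', 'D']

# Unique (item_tag, customerID) pairs as a MultiIndex, tags in level 0
pairs = pd.MultiIndex.from_tuples(set(zip(df.item_tag, df.customerID)))
s = pd.Series(1, index=pairs)

# Slice the wanted tags, pivot tags to columns, append the row total
res = s.loc[my_list].unstack(0, fill_value=0).assign(
    Total=lambda d: d.sum(1))
```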
piRSquared