
I have a dataframe that records, over time, the occurrences of two events (booking and search) for user IDs. It looks like this:

event    user_id
booking  asdf81
search   frjl22
search   frjl22
booking  frjl22
search   asdf81
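
For reference, this input can be reproduced with the snippet below (a minimal construction; it assumes plain string columns and uses the same `df` name the answer below refers to):

import pandas as pd

df = pd.DataFrame({
    'event':   ['booking', 'search', 'search', 'booking', 'search'],
    'user_id': ['asdf81', 'frjl22', 'frjl22', 'frjl22', 'asdf81'],
})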

I'd like to calculate two columns based on this: num_bookings and num_searches. These should be running (cumulative) counts of each event per user_id, computed from the top of the dataframe down and including the current row.

So, the output dataframe would look like this:

event    user_id    num_bookings   num_searches
booking  asdf81     1              0     
search   frjl22     0              1
search   frjl22     0              2
booking  frjl22     1              2
search   asdf81     1              1

How can I achieve this in pandas?

neuron
  • Unfortunately, this doesn't quite answer my question. :( – neuron Feb 23 '20 at 10:01
  • I added an example of the result I'm looking for. I don't have any code, as I'm not sure how to do this in pandas. Thanks! – neuron Feb 23 '20 at 10:12
  • No, it's not. :) – neuron Feb 23 '20 at 10:16
  • It's just an example. Is this enough to reopen the question, please? I'd much appreciate it. – neuron Feb 23 '20 at 10:18
  • Let me change the input dataframe as well, then, for clarity. – neuron Feb 23 '20 at 10:19
  • OK, I've changed the input dataframe to match the order of events and user ID's in the output dataframe. – neuron Feb 23 '20 at 10:21
  • I got different output; can you explain why `0,1,2` for the `frjl22` user, and why `0` in the last row? – jezrael Feb 23 '20 at 10:27
  • Because it's cumulatively counting the number of events per user_id. So `asdf81` had 0 bookings initially, but had 1 search, so it gets a `0` and a `1`. The calculation is done from the bottom up, if it makes sense. – neuron Feb 23 '20 at 10:32
  • Oh sorry, this is my bad. The algorithm works from the top down. I'll amend the example now. – neuron Feb 23 '20 at 10:33

1 Answer


The idea is to reshape the event column into one column per event, compute a running count within each group with GroupBy.cumcount (after removing missing rows with DataFrame.dropna), then forward fill the missing values per user with GroupBy.ffill, replace the remaining missing values with 0, and finally add the result back to the original with DataFrame.join:

# pivot the event column into one column per event, holding the user_id
df1 = df.set_index('event', append=True)['user_id'].unstack().add_prefix('num_')
# 1-based cumulative count of each event per user, one column per event
df1 = pd.concat([df1.dropna(subset=[c]).groupby(c).cumcount().add(1)
                 for c in df1.columns], axis=1, keys=df1.columns)
print(df1)
event  num_booking  num_search
0              1.0         NaN
1              NaN         1.0
2              NaN         2.0
3              1.0         NaN
4              NaN         1.0

# forward fill the counts within each user, treat events not yet seen as 0,
# and join back to the original dataframe
df = df.join(df1.groupby(df['user_id']).ffill().fillna(0).astype(int))
print(df)
     event user_id  num_booking  num_search
0  booking  asdf81            1           0
1   search  frjl22            0           1
2   search  frjl22            0           2
3  booking  frjl22            1           2
4   search  asdf81            1           1
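
For comparison, the same running counts can be computed more directly with indicator columns and a per-user cumulative sum. This is a minimal sketch, not part of the original answer; it starts from the original `df` and uses the question's column names:

# one 0/1 indicator column per event, then a cumulative sum within each user;
# the count includes the current row, matching the expected output
counts = pd.get_dummies(df['event']).astype(int).groupby(df['user_id']).cumsum()
df[['num_bookings', 'num_searches']] = counts[['booking', 'search']]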
jezrael
  • This doesn't solve the problem, unfortunately. It doesn't create the two columns, but creates a column per user_id. :( – neuron Feb 23 '20 at 10:28
  • Wait, it works now. I did something wrong. Sorry about that. – neuron Feb 23 '20 at 10:36
  • 1
    Hey, I'm so sorry, but this doesn't quite solve my problem. I don't need the total count joined straight away. I need the cumulative count. So as new events happen, the counts per user_id should be increasing. That's why they start at 0 in my example. – neuron Feb 23 '20 at 10:39
  • @neuron - Is the output correct? Why is the last row `search asdf81 1 1`? Shouldn't it be `search asdf81 0 1`? – jezrael Feb 23 '20 at 10:45
  • Yes, the last row is correct. User `asdf81` first had one `booking` in the beginning, and then had one `search` at the end. So the cumulative sum of bookings is `1`, and the cumulative sum of searches is `1` as well. – neuron Feb 23 '20 at 10:51
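
To make that comment concrete, a quick tally of `asdf81`'s events over the whole frame (a hypothetical check against the question's input) agrees with the final row:

# asdf81 has one booking (row 0) and one search (row 4) in total
print(df.loc[df['user_id'].eq('asdf81'), 'event'].value_counts())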