1

This question is in continuation with this one which I asked here:

Now I have data something like this:

Sno   User  Cookie
 1     1       A
 2     1       A
 3     1       A
 4     1       B
 5     1       C
 6     1       D
 7     1       A
 8     1       B
 9     1       D
 10    1       E
 11    1       D
 12    1       A
 13    2       F
 14    2       G
 15    2       F
 16    2       G
 17    2       H
 18    2       H

So lets say we have 5 cookies for user 1 'A,B,C,D,E'. Now I want to count if any cookie has reoccurred after a new cookie was encountered. For example, in the above example, cookie A was encountered again at 7th place and then at 12th place also. NOTE We wouldn't count A at 2nd place as it came simultaneously, but at position 7th and 12th we had seen many new cookies before seeing A again, hence we count that instance. So this is what I will get if I run code mentioned in my previous post:

For User 1

Sno Cookie  Count
 1    A     2
 2    B     1
 3    C     0
 4    D     2
 5    E     0

For User 2

Sno Cookie  Count
 6    F     1
 7    G     1
 8    H     0

Now comes the tricky part, now we know by the count, that for user 1, three cookies "A, B and D" re-occurred. Similarly for User 2 "F and G" reoccurred. I want to aggregate these results like this:

Sno User Reoccurred_Instances
 1   1    3
 2   2    2

Is there any easier way without using a loop to get this result.

Kshitij Yadav
  • 1,357
  • 1
  • 15
  • 35

2 Answers2

1

Following the same first steps as I took in my answer to your previous question, to get rid of consecutive Cookie values and find the duplicates:

no_doubles = df[df.Cookie != df.Cookie.shift()]

no_doubles['dups'] = no_doubles.Cookie.duplicated()

Then use a groupby to group by User on the subset of data that are indeed duplicated (no_doubles[no_doubles['dups']]), and find the number of unique Cookies for each user using nunique:

no_doubles[no_doubles['dups']].groupby('User')['Cookie'].nunique().reset_index()

This returns:

   User  Cookie
0     1       3
1     2       2

You can rename the columns as desired

[EDIT]:

To deal with different cases, you can just add to this logic. For example, considering the following dataframe with no repeats in User number 3:

Sno   User  Cookie
 1     1       A
 2     1       A
 3     1       A
 4     1       B
 5     1       C
 6     1       D
 7     1       A
 8     1       B
 9     1       D
 10    1       E
 11    1       D
 12    1       A
 13    2       F
 14    2       G
 15    2       F
 16    2       G
 17    2       H
 18    2       H
 18    3       H
 18    3       I
 18    3       J

You can do:

no_doubles = df[(df.Cookie != df.Cookie.shift()) | (df.User != df.User.shift())]

no_doubles['dups'] = no_doubles.duplicated(['Cookie', 'User'])

no_doubles.groupby('User').apply(lambda x: x[x.dups]['Cookie'].nunique()).to_frame('Reoccurred_Instances')

To get:

      Reoccurred_Instances
User                      
1                        3
2                        2
3                        0
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • 1
    Like always! Perfect logic. @sacul, so its been 2 years since I am using python, but such level of logic if a little difficult for me to apply. Could you tell me how can I improve upon it ? – Kshitij Yadav Aug 29 '18 at 18:59
  • 1
    One point I want to add is that the result of the above query would be missing those users in which there is no re-occurrence of Cookie. How can I tackle that ? – Kshitij Yadav Aug 29 '18 at 19:02
  • 1
    I think this sort of problem gets overwhelming when trying to think of a solution in one step.... Things tend to be a lot more manageable if you break it down into steps. That, and looking at all of the `pandas` functions, there is almost always a solution using functions that have already been written and vectorized by the `pandas` developers – sacuL Aug 29 '18 at 19:02
  • See updates for your no re-occurrence case (it also deals with a couple other edge cases too) – sacuL Aug 29 '18 at 19:09
  • Hey there is some error I am encountering. Using this formula, cases such as users having no Reoccurred_Instances but having 2 cookies are also being counted as 2 since we are doing groupby of the cookie. For example for a user there is no re-occurred cookie but then 2 distinct cookie (in case of USER 3) the value I am getting is 2, I want to get value 0 as there was no re-occurance although it had 2 distinct ones. Is it possible ? – Kshitij Yadav Aug 29 '18 at 20:10
  • I'm not sure I understand: In the case of user 3, there were 3 distinct cookies, but no re-occurences, so the code I provided gave `0`. Is that not the expected output? – sacuL Aug 29 '18 at 21:29
  • The shift() function is failing when I am processing this for 100k records. For cookies which are same its adding only True even if its encountering the same cookie again and again. – Kshitij Yadav Aug 29 '18 at 21:39
  • So I dont know why, but when I am applying df[df.Cookie != df.Cookie.shift()] the shift function is not filtering correctly. It giving me incorrect results and there is no reason why :/ I am applying it to 100k records and its breaking down and dups flags which I am getting are wrong and hence giving me incorrect results. – Kshitij Yadav Aug 29 '18 at 22:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179060/discussion-between-sacul-and-kshitij-yadav). – sacuL Aug 29 '18 at 22:01
  • Sorry my chat is blocked by the firewall :/ I have one more question, https://stackoverflow.com/questions/52170493/track-users-location-for-detection-of-account-credential-sharing – Kshitij Yadav Sep 04 '18 at 16:25
  • If you have free time do look at it. Thank you :) – Kshitij Yadav Sep 04 '18 at 16:25
  • can u help me with this: https://stackoverflow.com/questions/52433293/merge-multiple-rows-with-same-id-into-one-delimited-row-in-pandas – Kshitij Yadav Sep 20 '18 at 21:04
1

Another approach to this, which I think should be pretty flexible:

dups  = df.loc[:, ['User', 'Cookie']].duplicated()
diffs = df.Cookie != df.Cookie.shift()
flags = np.logical_and(dups, diffs)

df['flag'] = flags

result_1 = df.groupby(['User', 'Cookie'])['flag'].agg([('Count', sum)])
result_2 = result_1.groupby('User')['Count'].agg([('Reoccurred_Instances', lambda x: (x > 0).sum())])

It'll take duplicates in terms of User and Cookie, and then it'll compare Cookie values to their neighbors. Finally, append a column of True/False corresponding to the two aforementioned flags. Use that column to create the two summary tables you mentioned (result_1 and result_2, below).

             Count
User Cookie       
1    A         2.0
     B         1.0
     C         0.0
     D         2.0
     E         0.0
2    F         1.0
     G         1.0
     H         0.0

      Reoccurred_Instances
User                      
1                      3.0
2                      2.0

EDIT: As mentioned in a comment below, let's assume a third user with Cookie values H, H, H, J, J, J. Like this:

    Sno  User Cookie
0     1     1      A
1     2     1      A
2     3     1      A
3     4     1      B
4     5     1      C
5     6     1      D
6     7     1      A
7     8     1      B
8     9     1      D
9    10     1      E
10   11     1      D
11   12     1      A
12   13     2      F
13   14     2      G
14   15     2      F
15   16     2      G
16   17     2      H
17   18     2      H
18   19     3      H
19   20     3      H
20   21     3      H
21   22     3      J
22   23     3      J
23   24     3      J

Running this through the code above, we get the following result dataframes:

             Count
User Cookie       
1    A         2.0
     B         1.0
     C         0.0
     D         2.0
     E         0.0
2    F         1.0
     G         1.0
     H         0.0
3    H         0.0
     J         0.0

And:

      Reoccurred_Instances
User                      
1                      3.0
2                      2.0
3                      0.0

The way the code is set up, it'll still show the third user (who does not have any repeat Cookie values) with a Reoccurred_Instances value of 0.

cpander
  • 374
  • 2
  • 9
  • Hey you answer is great but there is one problem which I am facing with it. Let say there is a user C which has following cookies: "H,H,H,J,J,J,J", since there is no re occurrence, if we apply the shift function, it will leave us with "H,J" and will count it as 2 when it should have counted it as 0 in case of re occurrence. Is it possible to accommodate this in the code ? – Kshitij Yadav Aug 29 '18 at 20:33
  • Not sure that that's an issue with this; see my edit above. – cpander Aug 30 '18 at 02:29
  • Worked thank you. I do have one more question: https://stackoverflow.com/questions/52170493/track-users-location-for-detection-of-account-credential-sharing – Kshitij Yadav Sep 04 '18 at 16:30