
I am trying to do the equivalent of Excel's COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.

Say I have a list of customers and the day on which they visited. I want to identify new customers based on two logical conditions:

  • They must be the same customer (same Guest ID)
  • They must have been there on the previous day

If both conditions are met, they are a returning customer; if not, they are new (hence the newby = 1 - ... in the code below, to flag new customers).

I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.

How can I wrap the following code into something smarter than a loop?

for i in range(0, len(df)):
    # count visits by the same guest on the previous day; no match means a new customer
    df.loc[df.index[i], "newby"] = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))

This post does not help, as the condition there is static. I would also like to avoid introducing "dummy columns" (e.g. by transposing the df), because I will have many categories (many customer names) and would like to build more complex logical statements; I do not want to run the risk of ending up with many auxiliary columns.

I have the following input

df
   Day     Guest ID
0  3230    Tom
1  3230    Peter
2  3231    Tom
3  3232    Peter
4  3232    Peter

and expect this output

df
   Day     Guest ID      newby
0  3230    Tom           1
1  3230    Peter         1
2  3231    Tom           0
3  3232    Peter         1
4  3232    Peter         1

Note that rows 3 and 4 are not necessarily duplicates, given there might be additional, varying columns (such as what they ordered).


1 Answer


Do:

# ensure the df is sorted by date 
df = df.sort_values('Day')

# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)

Output (for the original four-row input, before the duplicate row was added)

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1
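
To make the diff trick easier to follow, here is a small, self-contained walk-through on that four-row sample (a sketch; it uses groupby(...).diff() directly, which should behave the same as transform('diff') here):

import pandas as pd

df = pd.DataFrame({'Day': [3230, 3230, 3231, 3232],
                   'Guest ID': ['Tom', 'Peter', 'Tom', 'Peter']})

df = df.sort_values('Day')

# per-guest gap (in days) since that guest's previous visit; NaN marks a first visit
gaps = df.groupby('Guest ID')['Day'].diff()
print(gaps)  # Tom: NaN, 1   Peter: NaN, 2

# a gap of exactly 1 means "was here yesterday" -> returning (0); anything else -> new (1)
df['newby'] = (gaps.fillna(2) > 1).astype(int)
print(df)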

UPDATE

If multiple visits are allowed per day, you could do:

# only keep unique visits per day
uniques = df.drop_duplicates()

# ensure the df is sorted by date
uniques = uniques.sort_values('Day')

# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)

# merge the unique visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])

print(res)

Output

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1
4  3232    Peter      1
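
One thing to watch: df.merge returns a result with a fresh RangeIndex, so df's original index labels are lost (this comes up in the comments below). A minimal sketch for carrying them through, assuming 'orig_idx' is a free (hypothetical) column name:

# stash the original index in a temporary 'orig_idx' column, merge, then restore it
res = (
    df.rename_axis('orig_idx')                 # name the index so reset_index keeps it
      .reset_index()                           # move the index into a regular column
      .merge(uniques, on=['Day', 'Guest ID'])  # same merge as above
      .set_index('orig_idx')                   # restore the original labels
)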

As an alternative, without sorting or merging, you could do:

lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)

Output

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1
4  3232    Peter      1
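
The idea behind the lookup set is that every observed visit is shifted one day forward, so the membership test effectively asks "was this guest also here yesterday?". A quick check on the sample data (a sketch; set ordering is arbitrary):

# lookup contains (3231, 'Tom'), (3231, 'Peter'), (3232, 'Tom'), (3233, 'Peter')
print((3231, 'Tom') in lookup)    # True  -> Tom visited on 3230, so on 3231 he is returning
print((3232, 'Peter') in lookup)  # False -> Peter did not visit on 3231, so on 3232 he is new
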
  • thanks! This only works if each combination of `Day` and `Guest ID` occurs exactly once. It fails when there are multiple rows (i.e., it only finds the first occurrence of "same day, same guest", not the others) -- I'll update my input posting to include another row to make this clear – KingOtto Dec 20 '20 at 09:35
  • It looks like the merge command kills the indexes... if the former indexes were not [0, 1, 2..] but rather something meaningful, how can I keep them (from the left set, i.e. from `df`)? I tried `left_index=True`, but that resulted in wrong order – KingOtto Dec 20 '20 at 11:21
  • @KingOtto Perhaps this could help, maintain the index: https://stackoverflow.com/questions/11976503/how-to-keep-index-when-using-pandas-merge – Dani Mesejo Dec 20 '20 at 11:27
  • Hm this is not working.. We just sorted the uniques in a different order, and I don't have a link between old and new sorting. So now we would need to introduce yet another column to memorize the order. This is super complex. This is crazy simple in Excel [COUNTIFS(D:D,D3,B:B,B3-1)], where D is the customer column and B is the date column. It is also super easy in a `for` loop. Does Pandas really force me to write 7 lines of code, re-sort multiple times and merge several aux dataframes? – KingOtto Dec 20 '20 at 11:41
  • @KingOtto Do you mean the for loop that you wrote in the question? Are you sure that is what you want? – Dani Mesejo Dec 20 '20 at 11:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226190/discussion-between-dani-mesejo-and-kingotto). – Dani Mesejo Dec 20 '20 at 12:11