I am tyring to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby
, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
- They must be the same customer (same
Guest ID
) - They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (Hence newby = 1-...
to identify new customers.
I managed to do this with a for
loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range (0, len(df)):
newby = 1-np.sum((df["Day"] == df.iloc[i]["Day"]-1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition is static. I would like to avoid introducting "dummy columns", such as transposing the df
, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3
and 4
are not necessarily duplicates - given there might be additional, varying columns (such as their order).