2

I am trying to calculate in Pandas a rolling window over one date column and count the distinct values in another column. Let's say I have this df dataframe:

date    customer
2020-01-01  A
2020-01-02  A
2020-01-02  B
2020-01-03  A
2020-01-03  C
2020-01-03  D
2020-01-04  E

I would like to group by the datecolumn, create a rolling window of two days and count the distinct values in the column customer. The expected output would be something like:

date       distinct_customers
2020-01-01  NaN --> (first value)
2020-01-02  2.0 --> (distinct customers between 2020-01-01 and 2020-01-02: [A, B]) 
2020-01-03  4.0 --> (distinct customers between 2020-01-02 and 2020-01-03: [A, B, C, D])
2020-01-04  4.0 --> (distinct customers between 2020-01-03 and 2020-01-04: [A, C, D, E])

It seems easy but I don't seem to find any straight-forward way to achieve that, I've tried using groupby or rolling. I don't find other posts solving this issue. Does someone have any idea how to do this? Thanks a lot in advance!

david fdez
  • 163
  • 9
  • 1
    You should use crosstab method of df. [Examples](https://mlcourse.ai/articles/topic1-exploratory-data-analysis-with-pandas/) – Musulmon Jul 15 '20 at 14:47

1 Answers1

1

Based on the idea of @Musulmon, this one liner should do it:

pd.crosstab(df['date'], df['customer']).rolling(2).sum().clip(0,1).sum(axis=1)

Thanks!

david fdez
  • 163
  • 9
  • 1
    thanks @Manakin, it's the `axis=0`which should not be there. I've removed it, thanks for pointing that out! :) this works: https://hub.gke.mybinder.org/user/jupyterlab-jupyterlab-demo-tyv31p59/lab/tree/demo/Untitled.ipynb – david fdez Jul 15 '20 at 16:45