1

I'm looking to count the number of occurrences in a column within a certain time delta. Consider the following example dataset:

         Date  Type
0  2016-04-01  A
1  2016-04-01  A
2  2016-04-02  A
3  2016-04-02  B
4  2016-04-03  C
5  2016-04-04  B
6  2016-04-05  C    
7  2016-04-06  C
8  2016-04-06  A

I want to create a column that contains how often the same type has been registered within, for example, the last two days, e.g.:

         Date  Type N_Occs_Last_2_Days
0  2016-04-01  A    1
1  2016-04-01  A    2
2  2016-04-02  A    3
3  2016-04-02  B    1
4  2016-04-03  C    1
5  2016-04-04  B    1
6  2016-04-05  C    1    
7  2016-04-06  C    2
8  2016-04-06  A    1

I've found some information on a similar problem, how to extract the average value, but I'm having a hard time extracting the counts. Can anyone point me in the right direction?
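
For reference, the example frame can be reconstructed with pandas (a minimal sketch; the values are copied from the table above):

import pandas as pd

df = pd.DataFrame({'Date': ['2016-04-01', '2016-04-01', '2016-04-02', '2016-04-02',
                            '2016-04-03', '2016-04-04', '2016-04-05', '2016-04-06',
                            '2016-04-06'],
                   'Type': ['A', 'A', 'A', 'B', 'C', 'B', 'C', 'C', 'A']})
df['Date'] = pd.to_datetime(df['Date'])  # dates as datetime, as the answers below assume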

ThomasW
    Shouldn't the column of occurrences of the last 2 days be [1, 2, 3, 1, ...], since you don't know what's before 2016-04-01? – Swedgin Dec 16 '19 at 16:31
  • @Swedgin, good point. I was thinking of putting 2 in the first two rows because two type 'A's occur on 2016-04-01. However, in reality this would be updated in real time, so you would not know how many type 'A's would occur after that moment. I'll edit it. – ThomasW Dec 16 '19 at 16:44

4 Answers

2

Use groupby.cumcount with pd.Grouper:

df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime

# Count occurrences of each Type within fixed 2-day bins
df['N_Occs_Last_2_Days'] = (df.groupby(['Type', pd.Grouper(key='Date', freq='2D')])
                              .Type
                              .cumcount()
                              .add(1))

Output

print(df)
        Date Type  N_Occs_Last_2_Days
0 2016-04-01    A                   1
1 2016-04-01    A                   2
2 2016-04-02    A                   3
3 2016-04-02    B                   1
4 2016-04-03    C                   1
5 2016-04-04    B                   1
6 2016-04-05    C                   1
7 2016-04-06    C                   2
8 2016-04-06    A                   1
ansev
  • This nearly works, but it resets every two days. Say you would have three type A's on three different days, the N_occs column would be: 1 2 1. – ThomasW Dec 17 '19 at 09:08
  • Do you get the correct Series if you use `freq='1D'` in my solution? – ansev Dec 17 '19 at 09:54
  • No, sadly I don't. I think the downside of this method is that it creates static groups of 2 days, when they should be rolling. I'm working on using the pd.rolling function, but my problem there is how to realize this whilst already having grouped by Type (see the sketch below). – ThomasW Dec 17 '19 at 09:58
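
To make the fixed-bin behaviour from the last comment concrete, here is a small sketch (not from the original thread) with three consecutive days of type 'A':

import pandas as pd

demo = pd.DataFrame({'Date': pd.to_datetime(['2016-04-01', '2016-04-02', '2016-04-03']),
                     'Type': ['A', 'A', 'A']})

# Fixed 2-day bins: the third day starts a new bin, so the count resets
fixed = (demo.groupby(['Type', pd.Grouper(key='Date', freq='2D')])
             .Type.cumcount().add(1))
print(fixed.tolist())    # [1, 2, 1]

# A rolling 2-day window keeps counting instead (no grouping needed here, all rows are type 'A')
rolling = demo.assign(n=1).rolling('2D', on='Date')['n'].sum()
print(rolling.tolist())  # [1.0, 2.0, 2.0]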
1

I've managed to do it using a lot of workarounds. If anyone has any tips to improve this code, they would be more than welcome, since it will have to be used professionally eventually.

test['Date'] = pd.to_datetime(test['Date'])

# One-hot encode the Type column and index by date so a time-based rolling window can be used
df = pd.get_dummies(test, columns=['Type']).set_index('Date')
df = df.apply(lambda x: x.rolling('2D').sum())  # rolling 2-day count per dummy column
df['Type'] = test['Type'].values

# Prevent a double index error (dates are not unique)
df.reset_index(inplace=True)

# For each row, pick the count from the dummy column that matches its Type
type_cols = {'A': 'Type_A',
             'B': 'Type_B',
             'C': 'Type_C'}

for typ, col in type_cols.items():
    df.loc[df['Type'] == typ, 'N_occs'] = df[col]

# Copy the result back to the original frame
test['N_occs'] = df['N_occs'].values

First, we extract the dummy encoding, and after setting the index to the date we can apply pandas' rolling function to sum those counts. We then select the appropriate column based on the type and copy it to N_occs, which is then copied back to the original dataframe. Both frames are shown below:

print(df)
        Date  Type_A  Type_B  Type_C Type  N_occs
0 2016-04-01     1.0     0.0     0.0    A     1.0
1 2016-04-01     2.0     0.0     0.0    A     2.0
2 2016-04-02     3.0     0.0     0.0    A     3.0
3 2016-04-02     3.0     1.0     0.0    B     1.0
4 2016-04-03     1.0     1.0     1.0    C     1.0
5 2016-04-04     0.0     1.0     1.0    B     1.0
6 2016-04-05     0.0     1.0     1.0    C     1.0
7 2016-04-06     0.0     0.0     2.0    C     2.0
8 2016-04-06     1.0     0.0     2.0    A     1.0

print(test)
        Date Type  N_occs
0 2016-04-01    A       1
1 2016-04-01    A       2
2 2016-04-02    A       3
3 2016-04-02    B       1
4 2016-04-03    C       1
5 2016-04-04    B       1
6 2016-04-05    C       1
7 2016-04-06    C       2
8 2016-04-06    A       1
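
One possible refinement of the above (a sketch, not part of the original answer): the Type-to-column mapping can be derived from the dummy columns that pd.get_dummies created instead of being hard-coded, so new types are picked up automatically:

# Assumes the default 'Type_<value>' naming produced by pd.get_dummies
type_cols = {col.replace('Type_', '', 1): col
             for col in df.columns
             if col.startswith('Type_')}

for typ, col in type_cols.items():
    df.loc[df['Type'] == typ, 'N_occs'] = df[col]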
ThomasW
1

Here is my solution, inspired by the OP's. It does not use the set_index or get_dummies methods, or explicit loops.

df["Date"]= pd.to_datetime(df.Date)

df2=df.groupby("Type") \
      .apply(lambda grp:grp.assign(Type=1) \
                           .rolling("2D",on="Date").sum())

                 Date  Type
    Type                   
    A    0 2016-04-01   1.0
         1 2016-04-01   2.0
         2 2016-04-02   3.0
         8 2016-04-06   1.0
    B    3 2016-04-02   1.0
         5 2016-04-04   1.0
    C    4 2016-04-03   1.0
         6 2016-04-05   1.0
         7 2016-04-06   2.0

# Drop the Type level of the index so df2 aligns with the original frame again
df2 = df2.reset_index(level=0, drop=True)

df["N_occs"] = df2["Type"]

        Date Type  N_occs
0 2016-04-01    A     1.0
1 2016-04-01    A     2.0
2 2016-04-02    A     3.0
3 2016-04-02    B     1.0
4 2016-04-03    C     1.0
5 2016-04-04    B     1.0
6 2016-04-05    C     1.0
7 2016-04-06    C     2.0
8 2016-04-06    A     1.0
kantal
0

I would count the occurrences in a separate data structure.

Loop over each date together with its n previous dates and count the occurrences of the Type.

Since dates are awkward to handle directly, hash them (or assign each a unique number) before grouping them together; a rough sketch of this idea is shown below.
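
A rough sketch of how this could look (my reading of the answer, not the answerer's code; pandas Timestamps are used directly instead of hashing the dates, and the 'Date' column is assumed to already be datetime, sorted as in the question):

from datetime import timedelta

counts = []
for i, row in df.iterrows():
    window_start = row['Date'] - timedelta(days=1)   # current day plus the previous day
    in_window = ((df['Type'] == row['Type'])         # same type
                 & (df['Date'] >= window_start)      # within the last two days
                 & (df.index <= i))                  # only rows registered up to this one
    counts.append(int(in_window.sum()))

df['N_occs'] = counts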

Liuk