
I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values, and with sum (rather than count) it gives a FutureWarning.

Basically, for each row in my df, I want to count how many dates in a column are within a range of +/- 1 days of other dates in the same column.

If I were doing it in Excel, the following multi-conditional SUMPRODUCT or COUNTIFS formulas would work:

= SUMPRODUCT(--(AN2>=$AN$2:$AN$35000-1),--(AN2<=$AN$2:$AN$35000+1)),

or

=countifs($AN$2:$AN$35000,">="&AN2-1,$AN$2:$AN$35000,"<="&AN2+1)

In python, trying the approach in the linked thread, I believe the code would be:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &\
                             (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()

The output should be: 2, 2, 2, 3, 3, 2. Instead it is wholemeal nan!

Is it correct to assume that because I'm testing conditions, it doesn't matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting NaN?

DrWhat

2 Answers


I think what you're trying to sum doesn't match the logic you're trying to apply.
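As a rough sketch of why the NaN appears (based only on the code in the question): both conditions compare each date with itself shifted by a day, so the boolean mask is all True, and .sum() then reduces the whole frame to a Series indexed by column labels rather than by row numbers. Assigning that Series to a new column aligns on the 0..5 row index, finds no matching labels, and fills the column with NaN.

# Reconstruction of what the question's code appears to do (assuming the df from the question):
mask = (df['datet'] <= df['datet'] + datetime.timedelta(days=1)) & \
       (df['datet'] >= df['datet'] - datetime.timedelta(days=1))
print(mask.all())   # True: each date is only compared with itself, so the filter keeps every row

# df[mask].sum() therefore sums the whole frame column-wise and returns a Series indexed by
# column labels (pandas warns because the datetime column cannot be meaningfully summed).
# Assigning that Series to df["caseIntensity"] aligns on the 0..5 row index, finds no
# matching labels, and produces NaN everywhere.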

Use the following approach:

1. Create a function that counts the number of dates in that range.
2. Call that function for each row and save the result as the value of the new column.

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

def get_dates_in_range(df_copy, row):
    # Count the rows whose date falls within +/- 1 day of this row's date.
    return df_copy[(df_copy['datet'] <= row['datet'] + datetime.timedelta(days=1)) &
                   (df_copy['datet'] >= row['datet'] - datetime.timedelta(days=1))].shape[0]


df["caseIntensity"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)


       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2
Shubham Periwal
  • Also a very good solution. It works with my real data and could be extended with more conditions, similar to a SUMPRODUCT or COUNTIFS. I'd love to know why pandas (or numpy) doesn't sum my logic. If I separate or print the logic, it's True or False in all the right places. – DrWhat Sep 23 '21 at 09:40
  • While the np.sum method is a lot quicker, I needed to use this sumproduct-like solution when my data got too large, and started causing: MemoryError: Unable to allocate 55.9 GiB for an array with shape (244969, 244969) and data type bool. Thanks again Shubham – DrWhat May 04 '22 at 08:17
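Regarding the MemoryError mentioned in the comment above, a possible middle ground (a sketch, not from either answer) is to sort the dates once and count each +/- 1 day window with np.searchsorted, which stays vectorized without materialising the n x n boolean mask:

import numpy as np

# Assumes the question's df and its 'datet' column.
dates = np.sort(df['datet'].to_numpy())
lo = np.searchsorted(dates, df['datet'].to_numpy() - np.timedelta64(1, 'D'), side='left')
hi = np.searchsorted(dates, df['datet'].to_numpy() + np.timedelta64(1, 'D'), side='right')
df['caseIntensity'] = hi - lo   # 2, 2, 2, 3, 3, 2 for the sample data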

Instead of looping with apply, a vectorized solution is possible: first create numpy arrays for the dates and their +/- 1 day bounds, compare them with the two conditions chained by &, and then count the True values per row with sum:

a = df['datet']
b = a + pd.Timedelta(days=1)   # upper bound of each row's window
c = a - pd.Timedelta(days=1)   # lower bound of each row's window

# Broadcasting: row i of the mask flags every date that falls inside row i's window.
mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])

df["caseIntensity"] = mask.sum(axis=1)
print(df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2

Here is the performance for 6k rows:

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
                         pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
                         pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)


In [140]: %%timeit
     ...: a = df['datet']
     ...: b = a + pd.Timedelta(days=1)
     ...: c = a - pd.Timedelta(days=1)
     ...:     
     ...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
     ...: 
     ...: df["caseIntensity"]  = mask.sum(axis=1)
     ...: 
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [141]: %%timeit
     ...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
     ...: 
     ...: 
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • @DrWhat - Not sure if I understand, `I am basically creating a mask with as many columns as rows` – jezrael Sep 23 '21 at 09:24
  • This extends a good solution you provided elsewhere Mr Jezrael, which I couldn't get to work on this problem. It's also very compact and clear. Please correct me if I am wrong - I am basically creating a np array mask with as many columns as rows in the original series. Each column in the mask considers the defined condition, true or false, for the row. I then simply sum the rows of the mask. Could you please clarify why I need the [:, None] – DrWhat Sep 23 '21 at 09:29
  • @DrWhat - Sure, because numpy broadcasting is used - check [this](https://numpy.org/doc/stable/user/basics.broadcasting.html) – jezrael Sep 23 '21 at 09:31
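To illustrate the broadcasting behind the [:, None] (a small sketch, not part of the answer above): indexing with [:, None] turns the 1-D array of dates into a column vector, so comparing it against the original 1-D array produces an n x n matrix of pairwise comparisons, one row per window.

import numpy as np
import pandas as pd

a = pd.to_datetime(["2020-03-04", "2020-03-05", "2020-03-09"]).to_numpy()

# a has shape (3,); a[:, None] has shape (3, 1).
# Comparing a (3,) array against a (3, 1) array broadcasts to a (3, 3) matrix,
# where element [i, j] checks date j against row i's upper bound.
pairwise = a <= a[:, None] + np.timedelta64(1, "D")
print(pairwise.shape)   # (3, 3)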