1

I got 2 columns stating the start and end dates of an event in my data frame. I need to calculate the number of 31.12.XXXX's within that range (while treating it as an open interval). Any help is appreciated, thank you.

Larx
  • 101
  • 7

3 Answers3

1

Use list comprehension with date_range:

df['sum'] = [(pd.date_range(s,e).strftime('%d.%m') == '31.12').sum() 
               for s, e in zip(df['start'], df['end'])]

Another idea with numpy broadcasting by all possible datetimes between start and end columns if performance is important:

s = pd.Series(pd.date_range(df['start'].min(), df['end'].max()))
a = s[s.dt.strftime('%d.%m').eq('31.12')].to_numpy()[:, None]

df['sum'] = ((df['start'].to_numpy() <= a) & (df['end'].to_numpy() >= a)).sum(axis=0)

np.random.seed(123)

#https://stackoverflow.com/a/50559079/2901002
def random_dates(start, end, unit='D', seed=None):

    ndays = (end - start).days + 1
    n = len(start)
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

start = pd.date_range('1980-01-01', freq='10D', periods=1000)
end = pd.to_datetime('2030-01-01')


df = pd.DataFrame({'start':start,
                    'end': random_dates(start, end).normalize()})

df['sum1'] = [(pd.date_range(s,e).strftime('%d.%m') == '31.12').sum() 
                   for s, e in zip(df['start'], df['end'])]

s = pd.Series(pd.date_range(df['start'].min(), df['end'].max()))
a = s[s.dt.strftime('%d.%m').eq('31.12')].to_numpy()[:, None]
df['sum'] = ((df['start'].to_numpy() <= a) & (df['end'].to_numpy() >= a)).sum(axis=0)

print (df)
         start        end  sum1  sum
0   1980-01-01 2014-10-29    34   34
1   1980-01-11 1994-04-30    14   14
2   1980-01-21 1991-05-21    11   11
3   1980-01-31 2007-08-09    27   27
4   1980-02-10 2016-01-03    36   36
..         ...        ...   ...  ...
995 2007-03-30 2013-04-05     6    6
996 2007-04-09 2022-12-23    15   15
997 2007-04-19 2015-03-07     8    8
998 2007-04-29 2007-06-02     0    0
999 2007-05-09 2014-01-11     7    7

[1000 rows x 4 columns]

In [68]: %timeit df['sum1'] = [(pd.date_range(s,e).strftime('%d.%m') == '31.12').sum() for s, e in zip(df['start'], df['end'])]
39.7 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [69]: %%timeit
    ...: s = pd.Series(pd.date_range(df['start'].min(), df['end'].max()))
    ...: a = s[s.dt.strftime('%d.%m').eq('31.12')].to_numpy()[:, None]
    ...: df['sum'] = ((df['start'].to_numpy() <= a) & (df['end'].to_numpy() >= a)).sum(axis=0)
    ...: 
108 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Assuming you mean "open interval" in the ordinary mathematical sense, being that the end dates themselves are excluded, then...

If you just want to work with the textual form, split each date into DD.MM and YYYY components j1=d1[0:5]; y1=int(d1[6:]); y2=int(d2[6:]).

Subtract the "start" year from the "finish" year, subtract one more if the first date is 31.12: n = y2 - y1 - int(j1 == "31.12")

Take the maximum of this and zero (because the answer cannot be negative): if (n<0): n=0

Alternatively, if you have dates represented in a computable form (e.g. Julian day numbers, Unix Epoch seconds, etc) start by adding one day to the "start" date; then take the year of each date, subtract one from the other, and you have your answer.

Sanity checking both approaches, consider:

  1. any two dates in the same year: answer is 0.
  2. 31.12.2020 to 31.12.2021: answer is 0.
  3. 30.12.2020 to 01.01.2021: answer is 1.
Martin Kealey
  • 546
  • 2
  • 11
1

You can use a fully vectorial approach by computing the difference between the years, and correct one edge case:

# ensure datetime
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

m1 = df['start'].dt.strftime('%m-%d').eq('12-31')
m2 = df['end'].dt.strftime('%m-%d').ne('12-31')

df['n'] = df['end'].dt.year.sub(df['start'].dt.year) - (m1 & m2)

Output:

        start        end  expected  n
0  2020-01-01 2020-06-15         0  0
1  2020-01-01 2020-12-31         0  0
2  2020-01-01 2021-12-31         1  1
3  2020-01-01 2021-06-15         1  1
4  2020-01-01 2025-06-15         5  5
5  2020-01-01 2025-12-31         5  5
6  2020-06-15 2021-01-15         1  1
7  2020-12-30 2021-01-01         1  1
8  2020-12-31 2021-01-01         0  0
9  2020-12-31 2020-12-31         0  0
10 2020-12-31 2021-12-31         1  1
mozway
  • 194,879
  • 13
  • 39
  • 75