2

I have Pandas Data Frame in Python like below ("col1" is in datetime64 data format):

col1
--------
23-11-2020
25-05-2021
...

Moreover I have list of special dates like below (values are as "object" data type in list):

special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021", ...] 

And I need to create 2 more columns in my DataFrame:

  • col2 - numer of days until the nearest date from special_dates list
  • col3 - numer of days since the lates date from special_dates list

Be aware that some months have 31 or 30 days and in delinquent years, February has a different (28 or 29) number of days

So as a result i need something like below:

col1       | col2 | col3
-----------|------|......
23-11-2020 | 2    | 0
25-05-2021 | 7    | 5
...        | ...  | ...

How can I do that in Python Pandas ?

dingaro
  • 2,156
  • 9
  • 29
  • 1
    Why not just convert the members of your list to datetimes and find the difference from `col1`? – MattDMo Jul 18 '22 at 13:14
  • what means "*numer of days since the lates date from special_dates list*"? Can you give an example? – mozway Jul 18 '22 at 13:45
  • numer of days since the lates date from special_dates list - for example you have date in "col1" - 23.11.2020 - the latest date from special_list for this date is 23.11.2020, so in this case number of days is 0, and for example the nearest date from special list for mentioned date is 25.11.2020 so number of days is 2 :) – dingaro Jul 18 '22 at 13:50
  • how come the first value in col2 is 2 and not 0? `23.11.2020` is present in the special_dates, do you want to exclude self match? – mozway Jul 18 '22 at 14:24
  • Hmmm you are right, so for 23.11.2020 in col2 and col3 should be 0 because 23.11.2020 is in special dates – dingaro Jul 18 '22 at 14:28

2 Answers2

0

vectorial merge:

df = pd.DataFrame({'col1':["23.11.2020", "25.05.2021", "26.05.2021", "26.05.2022", "26.05.2018"]})
s = pd.Series(pd.to_datetime(special_dates, dayfirst=True)).sort_values()
df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)

df = df.sort_values(by='col1').reset_index()

df['col2'] = (pd.merge_asof(df, s.rename('other'), 
                            left_on='col1', right_on='other',
                            direction='forward', allow_exact_matches=True)['other']
                .sub(df['col1']).dt.days
             )

df['col3'] = (pd.merge_asof(df, s.rename('other'),
                            left_on='col1', right_on='other',
                            direction='backward', allow_exact_matches=True)['other']
                .rsub(df['col1']).dt.days
             )

df = df.set_index('index').sort_index()

output:

            col1   col2   col3
index                         
0     2020-11-23    0.0    0.0
1     2021-05-25    7.0    5.0
2     2021-05-26    6.0    6.0
3     2022-05-26    NaN  359.0
4     2018-05-26  912.0    NaN

older answer (misunderstanding of question)

You can use numpy broadcasting:

special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"] 

df['col1'] = pd.to_datetime(df['col1'], dayfirst=True)

a = pd.to_datetime(special_dates, dayfirst=True).to_numpy()
out = (df
       .join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
                          index=df.index,
                          columns=range(1, len(special_dates)+1))
               .add_prefix('date_')
               .clip('0')
               #.apply(lambda c: c.dt.days) # uncomment for days as int
             )
      )

output:

        col1 date_1 date_2   date_3   date_4
0 2020-11-23 2 days 0 days 190 days 178 days
1 2021-05-25 0 days 0 days   7 days   0 days

output as integers (last line uncommented):

        col1  date_1  date_2  date_3  date_4
0 2020-11-23       2       0     190     178
1 2021-05-25       0       0       7       0

Variant with dates as headers:

out = (df
       .join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]),
                          index=df.index,
                          columns=special_dates)
               .clip('0')
               .apply(lambda c: c.dt.days)
             )
      )

output:

        col1  25.11.2020  23.11.2020  01.06.2021  20.05.2021
0 2020-11-23           2           0         190         178
1 2021-05-25           0           0           7           0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • mozway, the first solution (where output is not as intiger) works great, but if I try to implement the second solution, (where output is as intiger) I have error like: "AttributeError: Can only use .dt accessor with datetimelike values" in line: .join(pd.DataFrame((a-df['col1'].to_numpy()[:,None]), how to solve that ? – dingaro Jul 18 '22 at 13:42
  • And your solution mozway is not ideally, because I need to have only TWO new columns "col3" and "col4" number of days until the nearest and since the latest special date, not number of days until and since every date from list – dingaro Jul 18 '22 at 13:43
  • strange, what is your pandas version? Can you update your question with a reproducible minimal DataFrame as input? Regarding col2/col3 I read the question too quickly – mozway Jul 18 '22 at 13:43
  • I have Python version 3.8.8 – dingaro Jul 18 '22 at 13:57
  • I was referring to the **pandas** (not python) version, anyway I updated the answer with a vectorial solution. If you want to enable self-match use `allow_exact_matches=True` – mozway Jul 18 '22 at 14:27
  • mozway, you are right, so for 23.11.2020 in col2 and col3 should be 0 because 23.11.2020 is in special dates, could you modify your answer ? – dingaro Jul 18 '22 at 14:37
  • @ritendro sure, see update. Btw, the chosen answer is not vectorial and also quadratic. It should be particularly slow on large dataframes and large lists of special dates. You should almost never have to loop with pandas. – mozway Jul 18 '22 at 14:53
  • maybe you are right that your answer is better but there is one mistake because when I creater df like: df = pd.DataFrame({'col1':["23.11.2020", "25.05.2021", "26.05.2021", "25.05.2021", "20.11.2020"]}) I have error ValueError: left keys must be sorted in line: df['col2'] = (pd.merge_asof(df, s.rename('other'), because i need to check whether for values in col1 for which there is not date higher or date lower in special dates there should be NaN not 0 as is now in your output – dingaro Jul 18 '22 at 14:58
  • Because for date for example: "25.05.2021" I should have NaN in col2 and 5 in col3, and for date for example: "20.11.2020" I should have 5 in col2 and NaN in col3 – dingaro Jul 18 '22 at 15:00
  • The dates must be sorted for a `merge_asof` (see update). For the NaNs, you need to comment the `fillna` line. – mozway Jul 18 '22 at 15:02
  • It look almost perfect but if I have in col1 date higher than dates in special_dates for example "26.05.2022" everything is calculated correct, but try to run your code on the following data frame with date which is lower than any date in special_dates (26.05.2018) and you will se that there are some problems: df = pd.DataFrame({'col1':["23.11.2020", "25.05.2021", "26.05.2021", "26.05.2022", "26.05.2018"]}) – dingaro Jul 18 '22 at 15:08
  • First of all I do not want to have "-" and secondly for 2020-11-23 in col3 is NaN not 0 – dingaro Jul 18 '22 at 15:10
  • mozway could you solve your solution for this df: df = pd.DataFrame({'col1':["23.11.2020", "25.05.2021", "26.05.2021", "26.05.2022", "26.05.2018"]}) ? – dingaro Jul 18 '22 at 15:21
  • I see, this is an issue of index alignment, see update – mozway Jul 18 '22 at 15:24
0

Probably not the best/most efficient way, but you can use the days_between function from this post and then compute the difference between the number of days. This would give you:

import pandas as pd
import numpy as np
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%d-%m-%Y")
    d2 = datetime.strptime(d2, "%d.%m.%Y")
    return (d2 - d1).days

df = pd.DataFrame({'col1':["23-11-2020", "25-05-2021"]})
special_dates = ["25.11.2020", "23.11.2020", "01.06.2021", "20.05.2021"] 

for idx, date in enumerate(df['col1']):
    col2=np.inf
    col3=np.inf
    for special_date in special_dates:
        delta = days_between(date, special_date)
        if delta >= 0 and delta < col2:
            col2 = delta
        if delta <= 0 and delta > -col3:
            col3 = -delta

    df.loc[df.index[idx], 'col2'] = col2
    df.loc[df.index[idx], 'col3'] = col3

df.replace(np.inf, np.nan, inplace=True)
df[['col2','col3']].round(0)
  • Maximilian is it possible to modify your code so as to have integer in output in "col2" and "col3"? I would like to have someting like 2 instead of 2.0 :) – dingaro Jul 18 '22 at 13:53
  • Yes, of course, just add: ```df[["col2", "col3"]] = df[["col2", "col3"]].astype(int) ``` at the end to convert the columns to ```int``` (I just updated the code) – Maximilian Gangloff Jul 18 '22 at 13:57
  • Maximilian, youranswer is perfect, but could you, as the last thing, add support for the situation when there is no earlier or later date in the list of special dates ? For instance for date 01.01.2019 there is no earlier date ("col3") in special dates and for example for 01.01.2026 there is not next special date ("col2") and in this situation I prefer to have NaN, but will it be possible to combine in column NaN and int values ? – dingaro Jul 18 '22 at 14:07
  • It is not possible to convert NaNs as int afaik. I see 2 options: - Replace the NaNs by -1 so you can check for -1 to find all invalid values since all the other values are valid or, - Leave the columns as float but change their representation format to display 2 instead of 2.0 e.g. – Maximilian Gangloff Jul 18 '22 at 14:15
  • How to change their representation format to display 2 instead of 2.0 e.g. ? because the second option could be better I think – dingaro Jul 18 '22 at 14:17
  • I jsut updated the answer. You can change the rounding of the columns actually using ```df[['col2','col3']].round(0)``` – Maximilian Gangloff Jul 18 '22 at 14:22
  • Maximilian, I see one problem, because for date 23.11.2020 both in col2 and col3 should be value 0 because 23.11.2020 is in special_dates list, how to solve that ? It is really the last question, sorry but I have only just noticed this mistake – dingaro Jul 18 '22 at 14:31
  • No problem, I thought it was meant to be that way. I just updated the answer. Basically, the condition for col 2 changes and we now check if the value is greater or equal 0 instead of strictly greater. – Maximilian Gangloff Jul 18 '22 at 14:40
  • Thank youvery much! Now is perfect, I gave best answer :) – dingaro Jul 18 '22 at 14:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246546/discussion-between-maximilian-gangloff-and-ritendro). – Maximilian Gangloff Jul 18 '22 at 14:47
  • Why? Do you want to add something to this solution ? – dingaro Jul 18 '22 at 14:51
  • No, not to the solution, that's why I created the discussion chat. I just wrote that you could also upvote the answers since both given answers work :) – Maximilian Gangloff Jul 18 '22 at 15:00