0

I am trying to apply the next function in which two datetime64 pandas dataframe columns are arguments:

import datetime
import pandas as pd

def set_dif_months_na(start_date, end_date):
    if (pd.isnull(start_date) and pd.notnull(end_date)):
        return None
    elif (pd.notnull(start_date) and pd.isnull(end_date)):
        return None
    elif (pd.isnull(start_date) and pd.isnull(end_date)):
        return None
    else:
        start_date = datetime.strptime(start_date, "%d/%m/%Y")
        end_date = datetime.strptime(end_date, "%d/%m/%Y")
    return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))

This function is intended to get month difference as integer given two dates as arguments, else it has to return None.

When I apply it to a new pandas dataframe column as this:

df['new_col'] = [set_dif_months_na(date1, date2)
                                          for date1,date2 in 
                                          zip(df['date1'], df['date2'])]

The next error arises:

TypeError: strptime() argument 1 must be str, not Timestamp

How could I adjust the function in order to properly apply it over a new pandas dataframe column?

AlSub
  • 1,384
  • 1
  • 14
  • 33
  • The error says you already have pd.Timestamp, why would you need strptime at all? To only get the date, use `.floor('d')` – FObersteiner Feb 03 '21 at 16:04
  • With `strptime` I am trying to declare arguments as a `datetime64` object, how could I properly set the arguments as `datetime64` ? – AlSub Feb 03 '21 at 16:08
  • It seems you *already have* that dtype... – FObersteiner Feb 03 '21 at 16:26
  • Yes the columns are `datetime64`, however the function does not recognize them as `datetime64` – AlSub Feb 03 '21 at 16:30
  • 1
    What the error tells you is that `start_date = datetime.strptime(start_date, "%d/%m/%Y")` makes no sense since `start_date` already is datetime64 / pd.Timestamp – FObersteiner Feb 03 '21 at 17:17

2 Answers2

1

You see, pandas uses numpy to parse dates, and numpy.datetime64 is not directly compatible with datetime.datetime, which you are trying to use.

There's a couple of different solutions, but if you want to use datetime, which is more readable in my opinion, you may do something like this. First we define a function to convert between both data types (got it from here):

def numpy2datetime(date):
    return (datetime.
            datetime.
            utcfromtimestamp(
                            (date - np.datetime64('1970-01-01T00:00:00')) / 
                            np.timedelta64(1, 's'))
                            )

Then you may be able to do what you want by changing your function from :

start_date = datetime.strptime(start_date, "%d/%m/%Y")
end_date = datetime.strptime(end_date, "%d/%m/%Y")

to

start_date = numpy2datetime(start_date)
end_date = numpy2datetime(end_date)

This should work. However, I may have some additional suggestions for you. First, you can change all your if and elif to a single one by using the or logical operator:

if pd.isnull(start_date) or pd.isnull(end_date):
    return None
else:
    start_date = numpy2datetime(start_date)
    end_date = numpy2datetime(end_date)
    return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))

And a last one is regarding your list comprehension. You don't need zip at all, since both columns are within the same dataframe. You can simply do:

df['new_col'] = [set_dif_months_na(date1, date2)
                                          for date1,date2 in 
                                          df[['date1','date2']].values]

Don't know if it's faster, but at least it's clearer.

Hope it's useful. And let us know if you have any further issues.

Ralubrusto
  • 1,394
  • 2
  • 11
  • 24
  • 1
    pandas handles datetime.datetime objects pretty well but that is not the point here I think. The OP should work with pandas datetime only, and in fact the error tells me he already does – FObersteiner Feb 03 '21 at 17:20
  • I've just tryed to follow OP's line of thought, converting dates to `datetime` and then getting the difference in months. Feel free to suggest a solution using pure pandas, which will indeed be more efficient – Ralubrusto Feb 03 '21 at 17:25
1

By changing start_date and end_date setting from strptime to pd.to_datetime the function worked without any error:


def set_dif_months_na(start_date, end_date):
    if (pd.isnull(start_date) and pd.notnull(end_date)):
        return None
    elif (pd.notnull(start_date) and pd.isnull(end_date)):
        return None
    elif (pd.isnull(start_date) and pd.isnull(end_date)):
        return None
    else:
        start_date = pd.to_datetime(start_date,  format="%d/%m/%Y")
        end_date = pd.to_datetime(end_date, format="%d/%m/%Y")
    return abs((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month))
AlSub
  • 1,384
  • 1
  • 14
  • 33