1

I have a data frame dft:

Date              Total Value
02/01/2022          2
03/01/2022          6 
N/A                 4
03/11/2022          4
03/15/2022          4
05/01/2022          4

For each date in the data frame, I want to calculate the how many days from today and I want to add these calculated values in a new column called Days.

I have tried the following code:

newdft = []
for item in dft:
  temp = item.copy()
  timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
  temp["Days"] = timediff.days
  newdft.append(temp)   

But the third date value is N/A, which caused an error. What should I add to my code so that I only conduct the calculation only when the date value is valid?

Emi OB
  • 2,814
  • 3
  • 13
  • 29
Mary
  • 231
  • 1
  • 3
  • 12
  • If you convert your Date column to a date time object, you can get do `datetime.now() - dft['Date']` and get the column in one go – Emi OB Jun 29 '22 at 14:21
  • Oh actually, reading a little closer, I'm not sure what your code is doing. Normally, `for item in df` iterates over the column labels, not the row values. So just in case the linked question doesn't work for you, you need to provide a [mre] including complete code, expected output, and the [full error message with traceback](https://meta.stackoverflow.com/q/359146/4518341). (See also [How to make good reproducible pandas examples](/q/20109391/4518341).) – wjandrea Jun 29 '22 at 14:31

3 Answers3

4

I would convert the whole Date column to be a date time object, using pd.to_datetime(), with the errors set to coerce, to replace the 'N/A' string to NaT (Not a Timestamp) with the below:

dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')

So the column will now look like this:

0   2022-02-01
1   2022-03-01
2          NaT
3   2022-03-11
4   2022-03-15
5   2022-05-01
Name: Date, dtype: datetime64[ns]

You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:

dft['Days'] = datetime.now() - dft['Date']

This will make dft look like below:

        Date  Total Value                     Days
0 2022-02-01            2 148 days 15:49:03.406935
1 2022-03-01            6 120 days 15:49:03.406935
2        NaT            4                      NaT
3 2022-03-11            4 110 days 15:49:03.406935
4 2022-03-15            4 106 days 15:49:03.406935
5 2022-05-01            4  59 days 15:49:03.406935

If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:

df['Days'] = (datetime.now() - df['Date']).dt.days

Which will give you:

        Date  Total Value   Days
0 2022-02-01            2  148.0
1 2022-03-01            6  120.0
2        NaT            4    NaN
3 2022-03-11            4  110.0
4 2022-03-15            4  106.0
5 2022-05-01            4   59.0
Emi OB
  • 2,814
  • 3
  • 13
  • 29
  • 1
    Another option to get just the days is `df['Days'] = date.today() - df['Date'].dt.date`, which is dtype `timedelta64[ns]`. – wjandrea Jun 29 '22 at 14:54
  • Thank you so much, but `dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')` will give me the error: "list indices must be integers or slices, not str" How do I fix this? – Mary Jun 29 '22 at 19:39
  • @Mary hmmmm, that's odd. Would you be able to provide me with the outputs of `dft.dtypes` and `dft['Dates'].to_list()`? – Emi OB Jun 30 '22 at 06:38
  • @EmiOB The output of `dft.dtypes` is 'list' object has no attribute 'dtypes'. The output of `dft['Dates'].to_list()` is list indices must be integers or slices, not str. I think I know why. My dft is a list of dictionaries, not a data frame – Mary Jun 30 '22 at 12:32
  • ah ok, my answer assumes it's a dafaframe. You can easily turn it into one with pd.DataFrame(dft) and then it'll work – Emi OB Jun 30 '22 at 12:45
1

Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.

import datetime
import pandas as pd

dft = pd.DataFrame({'Date': [
  '02/01/2022',
  '03/01/2022',
  None,
  '03/11/2022',
  '03/15/2022',
  '05/01/2022'],
  'Total Value': [2,6,4,4,4,4]})

dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)

This would result in integer values in the Days column:

         Date  Total Value  Days
0  02/01/2022            2   148
1  03/01/2022            6   120
2        None            4  None
3  03/11/2022            4   110
4  03/15/2022            4   106
5  05/01/2022            4    59

Dima Chubarov
  • 16,199
  • 6
  • 40
  • 76
  • I'm not familiar with `DataFrame.mask()` (still pretty new to Pandas), but this doesn't seem to work. I get an error: `ValueError: Columns must be same length as key` at `dft['Days'] = dft.mask(...`. Is this something that might have changed between Pandas versions? – wjandrea Jun 29 '22 at 15:13
  • Is it supposed to be `dft.mask(...)['Days']`? But then you get `0` instead of `None` in the `Days` column. – wjandrea Jun 29 '22 at 15:14
  • @wjandrea Interesting, I could not reproduce your error, but in general your point is correct, that is the number of columns should be the same. I believe the best fix would be to use the `inplace=True` flag to avoid the confusion. – Dima Chubarov Jun 30 '22 at 05:03
1

In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.

df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
    )

Result:

         Date  Total Value   Days
0  02/01/2022            2  148.0
1  03/01/2022            6  120.0
2         N/A            4    NaN
3  03/11/2022            4  110.0
4  03/15/2022            4  106.0
5  05/01/2022            4   59.0

And for what it's worth, another option is date.today() instead of datetime.now():

    .apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())

And the result is a timedelta instead of float:

         Date  Total Value      Days
0  02/01/2022            2  148 days
1  03/01/2022            6  120 days
2         N/A            4       NaT
3  03/11/2022            4  110 days
4  03/15/2022            4  106 days
5  05/01/2022            4   59 days

See also: How do I select rows from a DataFrame based on column values?

wjandrea
  • 28,235
  • 9
  • 60
  • 81