
I have a dataset called weather and it contains one column 'Date' that looks like this.

Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-01-01
2020-02-01
2020-04-01
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2020-01-01

The problem is the year is always 2020 when it should be 2020, 2021, and 2022.

The desired column looks like this

Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2021-01-01
2021-02-01
2021-04-01
2021-05-01
2021-06-01
2021-07-01
2021-08-01
2021-09-01
2021-10-01
2021-11-01
2021-12-01
2022-01-01

Each year's last month is not necessarily 12, but the new year starts with month 01.

Here is my code:

month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in range(len(weather['Date'])):
    year = 2022
    for j in range(len(month)):
        if weather['Date'][i][5:7] == '01':
            weather['Date'][i] = weather['Date'][i].apply(lambda x: 'year' + x[5:])

Is there any suggestion for fixing my code and getting the desired column?

Pranav Hosangadi
mmmmmm

2 Answers


Here's one approach:

  • Turn the date strings in column Date into datetime using pd.to_datetime, then apply Series.diff and chain Series.dt.days to get the number of days between consecutive rows.
  • Each negative value (i.e. a negative number of days) means the date jumped backwards, which marks the start of a new year. Apply Series.lt(0) to turn all values below 0 into True and the rest into False.
  • At this stage, chain Series.cumsum to end up with a Series containing 0, ..., 1, ..., 2. These are the values that need to be added to the year 2020 to get the correct years.
  • Finally, create the correct dates by passing year (original year + addition), month, and day to pd.to_datetime again (cf. this SO answer).
import pandas as pd

# Parse the date strings so that .diff() and the .dt accessor are available
df['Date'] = pd.to_datetime(df['Date'])

df['Date'] = pd.to_datetime(dict(year=(df['Date'].dt.year 
                                       + df['Date'].diff().dt.days.lt(0).cumsum()), 
                                 month=df['Date'].dt.month, 
                                 day=df['Date'].dt.day))

df['Date']

0    2020-01-01
1    2020-01-02
2    2020-02-01
3    2020-02-04
4    2020-03-01
5    2020-04-01
6    2020-04-02
7    2020-04-03
8    2020-04-04
9    2020-05-01
10   2020-06-01
11   2020-07-01
12   2020-08-01
13   2020-09-01
14   2020-10-01
15   2020-11-01
16   2021-01-01
17   2021-02-01
18   2021-04-01
19   2021-05-01
20   2021-06-01
21   2021-07-01
22   2021-08-01
23   2021-09-01
24   2021-10-01
25   2021-11-01
26   2021-12-01
27   2022-01-01
Name: Date, dtype: datetime64[ns]
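
To make the chain easier to follow, here is a minimal sketch of the intermediate steps on a small made-up sample. The names dates, day_gaps, new_year, and years_to_add are only illustrative and do not appear in the code above.

```python
import pandas as pd

# Hypothetical sample: the date jumps backwards twice, so it spans three years.
dates = pd.to_datetime(pd.Series(
    ["2020-11-01", "2020-12-01", "2020-01-01", "2020-02-01", "2020-01-01"]
))

# Step 1: days between consecutive rows (the first row has no predecessor, so NaN).
day_gaps = dates.diff().dt.days

# Step 2: a negative gap marks the first row of a new year.
new_year = day_gaps.lt(0)

# Step 3: the running count of those markers is the number of years to add.
years_to_add = new_year.cumsum()

print(years_to_add.tolist())
```

The resulting offsets are what gets added to df['Date'].dt.year in the one-liner above.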

You don't need to convert to datetime, of course. You can also rebuild the date strings directly, starting from the following line, which computes the same year offsets from the month part of each string:

df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
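
One possible way to finish the string-only variant is sketched below. It assumes every value is a 'YYYY-MM-DD' string; the sample data and the names offset and new_year are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same 'YYYY-MM-DD' string format as the question.
df = pd.DataFrame({"Date": ["2020-11-01", "2020-12-01", "2020-01-01",
                            "2020-02-01", "2020-01-01"]})

# Year offset: goes up by 1 every time the month number drops.
offset = df["Date"].str[5:7].astype(int).diff().lt(0).cumsum()

# Corrected year = original year + offset, then glue the '-MM-DD' tail back on.
new_year = df["Date"].str[:4].astype(int) + offset
df["Date"] = new_year.astype(str) + df["Date"].str[4:]

print(df["Date"].tolist())
```

Note that this variant only looks at the month, so it would miss a new year that starts within the same month as the previous row (for example '2020-01-05' followed by '2020-01-01'), whereas the datetime-based diff catches that case.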
ouroboros1

Similar to @ouroboros1's answer, but using numpy to compute the number of years to add to each date, and then pd.offsets.DateOffset(years=...) for the addition.

import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
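s = df['Date'].values
# y[i] counts how many times the date has gone backwards up to row i,
# i.e. the number of years to add to that row
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]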

At this point, it would be tempting to do:

df['Date'] += y * pd.offsets.DateOffset(years=1)

But then we would get a warning: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.

So instead, we group by number of years to add, and add the relevant offset to all the dates in the group.

def add_years(g):
    return g['Date'] + pd.offsets.DateOffset(years=g['y'].iloc[0])

df['Date'] = df.assign(y=y).groupby('y', sort=False, group_keys=False).apply(add_years)

This groupby approach is reasonably fast (4.25 ms for 1000 rows and 10 distinct y values) and, for situations other than yours, is a bit more general than @ouroboros1's answer:

  1. It handles date changes due to leap years. This does not arise in your example, which contains no February 29, but if one of the dates were '2020-02-29' and we tried to add 1 year to it using the construct dt = df['Date'].dt; pd.to_datetime(dict(year=dt.year + y, month=dt.month, ...)), we would get ValueError: cannot assemble the datetimes: day is out of range for month.
  2. It preserves any time of day and timezone information (again, not in your case, but in the general case one would retain those).
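
The leap-year point can be checked with a small sketch. The single-date Series d and the flag assembled_ok are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

# Hypothetical one-row sample containing a leap day.
d = pd.Series(pd.to_datetime(["2020-02-29"]))

# DateOffset rolls the day back to the last valid day of the target month.
shifted = d + pd.DateOffset(years=1)

# Re-assembling from year/month/day components asks for 2021-02-29,
# which does not exist, so pandas raises a ValueError.
try:
    pd.to_datetime(dict(year=d.dt.year + 1, month=d.dt.month, day=d.dt.day))
    assembled_ok = True
except ValueError:
    assembled_ok = False

print(shifted.dt.strftime("%Y-%m-%d").tolist(), assembled_ok)
```

Whether rolling back to February 28 is the right outcome depends on the data; the sketch only shows that the offset-based addition does not fail where the component-based reconstruction does.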
Pierre D