30

I have a datetime series, and need to change the day to 1 for each entry. I have thought of numerous simple solutions, but none of them works for me. For now, the only thing that actually works is

  • set the series as the index
  • Query month and year from the index
  • Reconstruct a new time series using year, month and 1

It can't really be that complicated, can it? There is month start, but is unfortunately an offset, that's of no use here. There seems to be no set() function for the method, and even less functionality while the series is a column, and not (part of) the index itself.

The only related question was this, but the trick used there is not applicable here.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
FooBar
  • 15,724
  • 19
  • 82
  • 171

3 Answers3

47

You can use .apply and datetime.replace, eg:

import pandas as pd
from datetime import datetime

ps = pd.Series([datetime(2014, 1, 7), datetime(2014, 3, 13), datetime(2014, 6, 12)])
new = ps.apply(lambda dt: dt.replace(day=1))

Gives:

0   2014-01-01
1   2014-03-01
2   2014-06-01
dtype: datetime64[ns]
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
37

The other answer works, but any time you use apply, you slow your code down a lot. I was able to get an 8.5x speedup by writing a quick vectorized Datetime replace for a series.

def vec_dt_replace(series, year=None, month=None, day=None):
    return pd.to_datetime(
        {'year': series.dt.year if year is None else year,
         'month': series.dt.month if month is None else month,
         'day': series.dt.day if day is None else day})

Apply:

%timeit dtseries.apply(lambda dt: dt.replace(day=1))
# 4.17 s ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized:

%timeit vec_dt_replace(dtseries, day=1)
# 491 ms ± 6.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that you could face errors by trying to change dates to ones that don't exist, like trying to change 2012-02-29 to 2013-02-29. Use the errors argument of pd.to_datetime to ignore or coerce them.

Data generation: Generate series with 1 million random dates:

import pandas as pd
import numpy as np

# Generate random dates. Modified from: https://stackoverflow.com/a/50668285
def pp(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9

    return pd.Series(
        (10 ** 9 * np.random.randint(start_u, end_u, n)).view('M8[ns]'))

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
dtseries = pp(start, end, 1000000)
# Remove time component
dtseries = dtseries.dt.normalize()
Kyle Barron
  • 2,452
  • 22
  • 17
13

The other two answers work, but neither is very elegant nor in the spirit of the pandas library. Instead, consider this, which is also ever so slightly faster in my tests then Kyle Barron's vectorized answer. It's a one liner that does not require defining any outside functions, is vectorized, and stays within the pandas ecosystem:

import pandas as pd
dtseries.dt.to_period('M').dt.to_timestamp()

This method has the added benefit of supporting many other frequencies to floor to, such as weekly ('W') or business days ('B') that would be trickier to implement with the vectorized approach above.

You can find the abbreviations for various other frequencies in the relevant doc page.

This of course assumes that dtseries is a datetime series, if not you can easily convert it with pd.to_datetime(my_series).

This solution also allows for great flexibility in using various offsets. For example, to use the tenth day of the month:

from pandas.tseries.offsets import DateOffset
dtseries.dt.to_period('M').dt.to_timestamp() + DateOffset(days=10)

I recommend you check the doc for pandas offsets. The offsets pandas provides support a lot of rather complex offsets, such as business days, holidays, business hours, etc... Those would be extremely cumbersome to implement by hand as proposed by the answers of @KyleBarron and @JonClements. Consider this example for instance, to get dates offset 5 business days from the start of the month:

from pandas.tseries.offsets import BusinessDay
dtseries.dt.to_period('M').dt.to_timestamp() + BusinessDay(n=5)
Charles
  • 3,116
  • 2
  • 11
  • 20
  • Note that this applies only to the special case where you want to floor to a predefined frequency. If you want to set all dates to, say, day 10, then I don't see a way to use this. – Kyle Barron Apr 07 '21 at 20:47
  • @KyleBarron No, that would still work. You would only have to add a [date offset](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.DateOffset.html). See my edit :) – Charles Apr 10 '21 at 00:13
  • @KyleBarron In fact, I would argue that my solution is on the contrary much more flexible in dealing with complex offsets. How would yours deal with the business day case for example? – Charles Apr 10 '21 at 00:25
  • Those are good points, and the edit explains it well. From a simple test your answer looks to be 5% faster than mine. I don't work with dates in pandas anymore, but I think yours is most generalized. – Kyle Barron Apr 12 '21 at 18:27
  • I only had to call the offset function like this and it worked perfectly as I previously had imported pandas as pd: `pd.DateOffset(days=10)` – Elias Jun 26 '22 at 12:06