I made an example that gives us 1000 rows covering 165 years:
from datetime import datetime, timedelta
import numpy as np  # needed later for the np.unique option
import pandas as pd

base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})
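As a quick sanity check on the year span (the exact count depends on today's date):

df.times.dt.year.nunique()  # number of distinct calendar years in the column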
I tried a bunch of different options and listed them in order from fastest to slowest(ish):
Using dt.year.unique() directly on the times column:
df.times.dt.year.unique()
398 µs ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
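(These numbers are the kind of output IPython's %timeit magic prints; presumably each snippet was measured along these lines:)

%timeit df.times.dt.year.unique()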
Using set with dt.year.values, we get about the same performance as dt.year.unique(), but with greater variability:
set(df.times.dt.year.values)
422 µs ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
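Note that dt.year.values is a plain NumPy integer array, so set() here iterates over NumPy scalars rather than Timestamp objects. A quick check (the exact dtype varies by pandas version and platform):

df.times.dt.year.values.dtype  # int32 or int64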
Using apply:
df.times.apply(lambda x: x.year).unique()
5.51 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
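apply is slower because it calls a Python function once per row, and each element arrives as a full pandas Timestamp object, so the .year lookup happens in pure Python for every row. You can see what apply receives with:

type(df.times.iloc[0])  # pandas Timestamp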
Using set and a list comprehension:
years = set([time.year for time in df.times])
3.48 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Interestingly, using set with the almost syntactically identical generator expression, we get something closer to df.apply, likely because the list comprehension runs in one tight loop while the generator pays per-item overhead:
years = set((time.year for time in df.times))
5.85 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And using np.unique with a list comprehension (note that np.unique also sorts its result, unlike Series.unique(), which preserves order of appearance):
np.unique([time.year for time in df.times])
6.09 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It looks like, at least on my machine, df.times.dt.year.unique() is the fastest so far. If I think of others I will add them to the list.
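If you want to reproduce the comparison outside IPython, here is a minimal standalone sketch using the standard-library timeit module (the repeat and number values are my own choices, not what produced the numbers above):

import timeit
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})

candidates = {
    'dt.year.unique()':      lambda: df.times.dt.year.unique(),
    'set(dt.year.values)':   lambda: set(df.times.dt.year.values),
    'apply + unique':        lambda: df.times.apply(lambda x: x.year).unique(),
    'set + list comp':       lambda: set([t.year for t in df.times]),
    'set + generator':       lambda: set(t.year for t in df.times),
    'np.unique + list comp': lambda: np.unique([t.year for t in df.times]),
}

for name, func in candidates.items():
    # best of 5 repeats of 100 calls each, reported per call in ms
    best = min(timeit.repeat(func, repeat=5, number=100)) / 100
    print(f'{name:24s} {best * 1e3:8.3f} ms')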