
I have a pandas column of datetime objects, and I want to extract a list of the unique years from it. What would be the most efficient way of doing this?

Edit: I wasn't sure how pandas treated the order in which these operations are performed.

big11mac
    Possible duplicate of [Find the unique values in a column and then sort them](https://stackoverflow.com/questions/32072076/find-the-unique-values-in-a-column-and-then-sort-them) -- provided you know how to extract the year from a datetime which you can get from [this question](https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python). – pault Jul 10 '18 at 14:58
  • @pault I have to disagree. First, getting the years from a datetime column adds syntactic complexity over the question you link. Second, the question is how to do so *efficiently*. – Engineero Jul 10 '18 at 15:09

3 Answers


You can use pd.Series.dt.year followed by pd.Series.unique.
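For example, on a small illustrative frame (a minimal sketch; the sample dates are made up, and the column name 'times' matches @Engineero's data below):

import pandas as pd

df = pd.DataFrame({'times': pd.to_datetime(['2015-03-01', '2016-07-15', '2016-12-31'])})
df['times'].dt.year.unique().tolist()  # [2015, 2016]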

Timings below are on Python 3.6 / Pandas 0.19, using the data from @Engineero's answer.

%timeit df['times'].dt.year.unique().tolist()                  # 739 µs per loop
%timeit df['times'].apply(lambda x: x.year).unique().tolist()  # 5.9 ms per loop
%timeit list(set(df['times'].dt.year.values))                  # 823 µs per loop
jpp

If speed is the concern, using the data from @Engineero:

list(set(df['times'].dt.year.values))
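For instance, to get a sorted list of years (a minimal sketch, assuming the same 'times' frame built in @Engineero's answer below):

sorted(set(df['times'].dt.year.values))  # e.g. [2015, 2016]

Using .values hands set a plain NumPy array of integer years rather than a pandas Series, which sidesteps some per-element Series iteration overhead.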
BENY

I made an example that gives us 1000 rows covering about 165 years:

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
base = datetime.today()
date_list = [base + x*timedelta(days=60) for x in range(0, 1000)]
df = pd.DataFrame(data={'times': date_list})

Trying a bunch of different options and listing them in order from fastest to slowest(ish):

  • Using dt.year.unique() directly on the times column:

    df.times.dt.year.unique()
    398 µs ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
  • Using set with dt.year.values we get about the same performance as dt.year.unique(), but with greater variability:

    set(df.times.dt.year.values)
    422 µs ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
  • Using apply:

    df.times.apply(lambda x: x.year).unique()
    5.51 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • Using set and a list comprehension:

    years = set([time.year for time in df.times])
    3.48 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • Interestingly, using the almost syntactically identical set with a generator expression, we get something closer to df.apply:

    years = set((time.year for time in df.times))
    5.85 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • And using np.unique with a list comprehension:

    np.unique([time.year for time in df.times])
    6.09 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

It looks like, at least on my machine, df.times.dt.year.unique() is the fastest so far. If I think of others I will add them to the list.
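If you want to reproduce rough numbers outside of IPython's %timeit, here is a sketch using the standard-library timeit module (absolute timings will of course vary by machine and pandas version):

import timeit
from datetime import datetime, timedelta
import pandas as pd

base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})

statements = [
    'df.times.dt.year.unique()',
    'set(df.times.dt.year.values)',
    'df.times.apply(lambda x: x.year).unique()',
]
for stmt in statements:
    # average seconds per run over 1000 runs, printed in µs
    seconds = timeit.timeit(stmt, globals=globals(), number=1000) / 1000
    print(f'{stmt}: {seconds * 1e6:.0f} µs per loop')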

Engineero