
I have a pandas column of datetime objects, and I want to extract a list of the unique years from it. What would be the most efficient way of doing this?

Edit: I wasn't sure how pandas treated the order in which these operations are performed.

big11mac
    Possible duplicate of [Find the unique values in a column and then sort them](https://stackoverflow.com/questions/32072076/find-the-unique-values-in-a-column-and-then-sort-them) -- provided you know how to extract the year from a datetime which you can get from [this question](https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python). – pault Jul 10 '18 at 14:58
  • @pault I have to disagree. First, getting the years from a datetime column adds syntactic complexity over the question you link. Second, the question is how to do so *efficiently*. – Engineero Jul 10 '18 at 15:09

3 Answers


You can use pd.Series.dt.year followed by pd.Series.unique.
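For example, on a small illustrative frame (a minimal sketch; the sample dates are made up, and the column name 'times' matches @Engineero's data below):

import pandas as pd

df = pd.DataFrame({'times': pd.to_datetime(['2015-03-01', '2016-07-15', '2016-12-31'])})
df['times'].dt.year.unique().tolist()  # [2015, 2016]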

Timings below are on Python 3.6 / Pandas 0.19, using the data from @Engineero's answer.

%timeit df['times'].dt.year.unique().tolist()                  # 739 µs per loop
%timeit df['times'].apply(lambda x: x.year).unique().tolist()  # 5.9 ms per loop
%timeit list(set(df['times'].dt.year.values))                  # 823 µs per loop
jpp

If speed is the concern, using the data from @Engineero:

list(set(df['times'].dt.year.values))
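For instance, to get a sorted list of years (a minimal sketch, assuming the same 'times' frame built in @Engineero's answer below):

sorted(set(df['times'].dt.year.values))  # e.g. [2015, 2016]

Using .values hands set a plain NumPy array of integer years rather than a pandas Series, which sidesteps some per-element Series iteration overhead.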
BENY

I made an example that gives us 1000 rows covering about 165 years:

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
base = datetime.today()
date_list = [base + x*timedelta(days=60) for x in range(0, 1000)]
df = pd.DataFrame(data={'times': date_list})

Trying a bunch of different options and listing them in order from fastest to slowest(ish):

  • Using dt.year.unique() directly on the times column:

    df.times.dt.year.unique()
    398 µs ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
  • Using set with dt.year.values we get about the same performance as dt.year.unique(), but with greater variability:

    set(df.times.dt.year.values)
    422 µs ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
  • Using apply:

    df.times.apply(lambda x: x.year).unique()
    5.51 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • Using set and a list comprehension:

    years = set([time.year for time in df.times])
    3.48 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • Interestingly, using the almost syntactically identical set with a generator expression, we get something closer to df.apply:

    years = set((time.year for time in df.times))
    5.85 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • And using np.unique with a list comprehension:

    np.unique([time.year for time in df.times])
    6.09 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

It looks like, at least on my machine, df.times.dt.year.unique() is the fastest so far. If I think of others I will add them to the list.
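If you want to reproduce rough numbers outside of IPython's %timeit, here is a sketch using the standard-library timeit module (absolute timings will of course vary by machine and pandas version):

import timeit
from datetime import datetime, timedelta
import pandas as pd

base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})

statements = [
    'df.times.dt.year.unique()',
    'set(df.times.dt.year.values)',
    'df.times.apply(lambda x: x.year).unique()',
]
for stmt in statements:
    # average seconds per run over 1000 runs, printed in µs
    seconds = timeit.timeit(stmt, globals=globals(), number=1000) / 1000
    print(f'{stmt}: {seconds * 1e6:.0f} µs per loop')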

Engineero