Create a set from a series in pandas

Question

I have a dataframe extracted from Kaggle's San Francisco Salaries dataset: https://www.kaggle.com/kaggle/sf-salaries and I wish to create a set of the values of a column, for instance 'Status'.

This is what I have tried, but it returns a list of all the records instead of the set (sf is the name of the data frame).

a = set(sf['Status'])
print(a)

According to this page, this should work: "How to construct a set out of list items in python?"

Perhaps I used the term incorrectly. I mean that it gives me all the values from the column, regardless of whether they are all NaNs, for instance. – Julio Arriaga, Sep 17 '16 at 21:10

score 111 · Accepted Answer · answered Sep 17 '16 at 21:33

If you only need a list of the unique values, you can just use the unique method. If you want a Python set, use set(some_series):

In [1]: s = pd.Series([1, 2, 3, 1, 1, 4])

In [2]: s.unique()
Out[2]: array([1, 2, 3, 4])

In [3]: set(s)
Out[3]: {1, 2, 3, 4}

If you have a DataFrame, first select the series from it (some_data_frame['<col_name>']).
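Applied to the question's case, this could look like the sketch below. The small DataFrame is a made-up stand-in for the Kaggle data (the real 'Status' values may differ), and the dropna() call is an assumption about what is wanted: it discards missing values so that NaN entries do not end up in the set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle frame; only the column name comes from the question.
sf = pd.DataFrame({"Status": ["FT", np.nan, "PT", np.nan, "FT"]})

# Select the column, drop missing values, then build a plain Python set.
statuses = set(sf["Status"].dropna().unique())
print(statuses)
```

Leaving out dropna() keeps the missing values in the result, which may or may not be what you want.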

score 41 · Answer 2 · edited Jun 20 '20 at 09:12

With a large series containing duplicates, set(some_series) becomes noticeably slow as the series grows.

A better practice is to use set(some_series.unique()).

A simple example showed a 16x difference in execution time.
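A rough way to check this yourself is sketched below. The series sizes, value ranges and the time_both helper are arbitrary choices for illustration, not the original benchmark, and the measured ratio will depend on your machine and pandas version. It times both approaches on a duplicate-heavy series and on an all-unique one.

```python
import timeit

import numpy as np
import pandas as pd


def time_both(some_series, repeats=3):
    """Return (seconds for set(series), seconds for set(series.unique()))."""
    t_plain = timeit.timeit(lambda: set(some_series), number=repeats)
    t_unique = timeit.timeit(lambda: set(some_series.unique()), number=repeats)
    return t_plain, t_unique


rng = np.random.default_rng(0)

# Many duplicates: 200,000 values drawn from only 100 distinct integers.
with_duplicates = pd.Series(rng.integers(0, 100, size=200_000))
# No duplicates at all: every value appears exactly once.
all_unique = pd.Series(np.arange(200_000))

for name, series in [("duplicates", with_duplicates), ("all unique", all_unique)]:
    t_plain, t_unique = time_both(series)
    print(f"{name}: set(s) {t_plain:.4f}s, set(s.unique()) {t_unique:.4f}s")

# Both approaches should produce the same set of values.
same_result = set(with_duplicates) == set(with_duplicates.unique())
```

No timings are quoted here on purpose; run it on your own data to see whether the difference matters in your case.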

answered Jul 07 '18 at 23:10

Adrien Pacifico

Could anyone explain why the execution time evolves exponentially without using unique? – justinpc Apr 05 '21 at 08:34

some_series.unique() gives every unique item in the series, which is basically a set. Creating a set from that is fast because there are no duplicates --> fewer items to work on --> less work to do --> fast. Pandas is mostly C under the hood, so maybe set() is not as optimized as .unique()? The question is: does this speedup still hold if your series is already unique, so that .unique() doesn't throw anything out? – Vega May 21 '21 at 09:15
