
I have a Pandas Series formed by lists of terms:

import pandas as pd
my_series = pd.Series([['a','b','c'], ['a','d'], [], ['e']])

Is there a better/more elegant/faster way of getting a set of unique terms than doing it like this?

lt = set()
for l in my_series.tolist():
    lt = lt.union(l)
jpp

4 Answers


Unpack the lists directly into set().union; this is O(n).

>>> set().union(*my_series)
{'a', 'b', 'c', 'd', 'e'}

If you prefer something more old-fashioned, there's the set-comprehension equivalent -

>>> {y for x in my_series for y in x}
{'a', 'b', 'c', 'd', 'e'}
cs95
  • My speed tests seem to show that your first solution is the fastest for large lists, but they are all (including OP's solution) very close. – pault Apr 20 '18 at 15:27

sum with [] to flatten the lists, then set to get the unique terms

set(sum(my_series, []))  # equivalent: set(my_series.sum())

Out[85]: {'a', 'b', 'c', 'd', 'e'}

Or using reduce

import functools

set(functools.reduce(lambda x, y: x+y, my_series.tolist()))
Out[90]: {'a', 'b', 'c', 'd', 'e'}

With pandas unique

pd.DataFrame(my_series.tolist()).stack().unique()
Out[93]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
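
If your pandas has Series.explode (0.25 or later, so newer than this answer and an assumption here), a similar idea works as a single chain; the empty list becomes NaN, which dropna removes:

my_series.explode().dropna().unique()
# array(['a', 'b', 'c', 'd', 'e'], dtype=object)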

With numpy

import numpy as np
np.unique(np.concatenate(my_series))
Out[95]: array(['a', 'b', 'c', 'd', 'e'], dtype='<U32')

Or with a loop

set(x for y in my_series for x in y)
BENY
  • using `sum` or `reduce` to flatten the lists can be slow: [why sum on lists is (sometimes) faster than itertools.chain?](https://stackoverflow.com/questions/41772054/why-sum-on-lists-is-sometimes-faster-than-itertools-chain) – pault Apr 20 '18 at 15:11
  • Not sure if this is better, but another way: `functools.reduce(lambda x, y: x|y, map(set, my_series))` – pault Apr 20 '18 at 15:15
  • 1
  • @pault sum and reduce are all slow for sure. Performance-wise, OP's solution can beat most of my answers – BENY Apr 20 '18 at 15:18
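
A self-contained sketch of the reduce-over-sets alternative suggested in the comments above; the set() initializer is added here so it also works if the Series is empty:

import functools
import pandas as pd

my_series = pd.Series([['a','b','c'], ['a','d'], [], ['e']])

# turn each list into a set, then fold the sets together with union (|)
functools.reduce(lambda x, y: x | y, map(set, my_series), set())
# {'a', 'b', 'c', 'd', 'e'}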

One way is to use itertools.chain with set:

import pandas as pd
from itertools import chain

s = pd.Series([['a','b','c'], ['a','d'], [], ['e']])

res = set(chain.from_iterable(s))

print(res)

# {'b', 'a', 'c', 'd', 'e'}

Performance benchmarking

Note: performance will be system and data dependent. Do test on your own data.

import pandas as pd
from itertools import chain

lst = [['a','b','c'], ['a','d'], [], ['e']]

s = pd.Series(lst*1000000)

def cs(my_series):
    return set().union(*my_series)

def cs2(my_series):
    return {y for x in my_series for y in x}

def jp(my_series):
    return set(chain.from_iterable(my_series))

def pt(my_series):
    return {x for x in chain.from_iterable(my_series)}

%timeit cs(s)   # 333 ms per loop
%timeit cs2(s)  # 433 ms per loop
%timeit jp(s)   # 294 ms per loop
%timeit pt(s)   # 402 ms per loop
jpp
  • my benchmarking has your solution as virtually tied with cᴏʟᴅsᴘᴇᴇᴅ's. But like I said, they're all very close so it's hard to make a definitive call. (I ran the test multiple times and the ordering was consistent) – pault Apr 20 '18 at 18:15
  • Yeh I believe you. Performance will be system specific, OP should test with their data. – jpp Apr 20 '18 at 18:23

You can use set comprehension with itertools.chain.from_iterable:

import pandas as pd
from itertools import chain

my_series = pd.Series([['a','b','c'], ['a','d'], [], ['e']])
print({x for x in chain.from_iterable(my_series)})
#{'a', 'b', 'c', 'd', 'e'}

Timing Results (Python 2.7)

import string
import numpy as np
N = 1000
a = list(string.ascii_lowercase)
my_series = pd.Series(
    [
        np.random.choice(a, size=np.random.randint(1,10), replace=False) 
        for _ in range(N)
    ]
)

%%timeit
lt = set()
for l in my_series.tolist():
    lt = lt.union(l)
#1000 loops, best of 3: 1.66 ms per loop (OP)

%%timeit
lt = {x for x in chain.from_iterable(my_series)}
#1000 loops, best of 3: 1.25 ms per loop (pault)

%%timeit
lt = set().union(*my_series)
#1000 loops, best of 3: 1.16 ms per loop (cᴏʟᴅsᴘᴇᴇᴅ)

%%timeit
lt = set(chain.from_iterable(my_series))
#1000 loops, best of 3: 1.17 ms per loop (jpp)
pault