
I have a pandas dataframe as below:

  id                  val
0  a         [val1, val2]
1  b  [val33, val9, val6]
2  c   [val2, val6, val7]

How can I combine all the lists (in the 'val' column) into a unique list (set), e.g. [val1, val2, val33, val9, val6, val7]?

I can solve this with the following code. I wonder if there is an easier way to get all unique values from a column without iterating the dataframe rows?

import ast

def_contributors = []
for index, row in df.iterrows():
    contri = ast.literal_eval(row['val'])
    def_contributors.extend(contri)
def_contributors = list(set(def_contributors))
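For comparison, a minimal sketch without iterrows (assuming, as in the loop above, that the 'val' column holds string representations of lists; the sample data here is hypothetical):

```python
import ast

import pandas as pd

# Hypothetical sample data mirroring the question's setup
df = pd.DataFrame({'val': ["['val1', 'val2']",
                           "['val33', 'val9', 'val6']",
                           "['val2', 'val6', 'val7']"]})

# Parse each cell into a list, flatten with a comprehension, deduplicate with set
parsed = df['val'].apply(ast.literal_eval)
def_contributors = list(set(v for row in parsed for v in row))
```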
Mark Rotteveel
kitchenprinzessin

4 Answers


Another solution: export the Series to a nested list, flatten it with a comprehension, and apply set to get the unique values:

df = pd.DataFrame({'id':['a','b', 'c'], 'val':[['val1','val2'],
                                               ['val33','val9','val6'],
                                               ['val2','val6','val7']]})

print (df)
  id                  val
0  a         [val1, val2]
1  b  [val33, val9, val6]
2  c   [val2, val6, val7]

print (type(df.val.iloc[0]))
<class 'list'>

print (df.val.tolist())
[['val1', 'val2'], ['val33', 'val9', 'val6'], ['val2', 'val6', 'val7']]

print (list(set([a for b in df.val.tolist() for a in b])))
['val7', 'val1', 'val6', 'val33', 'val2', 'val9']

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [307]: %timeit (df['val'].apply(pd.Series).stack().unique()).tolist()
1 loop, best of 3: 410 ms per loop

In [355]: %timeit (pd.Series(sum(df.val.tolist(),[])).unique().tolist())
10 loops, best of 3: 31.9 ms per loop

In [308]: %timeit np.unique(np.hstack(df.val)).tolist()
100 loops, best of 3: 10.7 ms per loop

In [309]: %timeit (list(set([a for b in df.val.tolist() for a in b])))
1000 loops, best of 3: 558 µs per loop
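In newer pandas (0.25+), `Series.explode` offers a readable alternative; it was not part of the original timings, so performance may differ:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'val': [['val1', 'val2'],
                           ['val33', 'val9', 'val6'],
                           ['val2', 'val6', 'val7']]})

# explode gives one row per list element; unique deduplicates
# in order of first appearance
unique_vals = df['val'].explode().unique().tolist()
print(unique_vals)  # ['val1', 'val2', 'val33', 'val9', 'val6', 'val7']
```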

If the values are not lists but strings, use str.strip and str.split:

df = pd.DataFrame({'id':['a','b', 'c'], 'val':["[val1,val2]",
                                               "[val33,val9,val6]",
                                               "[val2,val6,val7]"]})

print (df)
  id                val
0  a        [val1,val2]
1  b  [val33,val9,val6]
2  c   [val2,val6,val7]

print (type(df.val.iloc[0]))
<class 'str'>

print (df.val.str.strip('[]').str.split(','))
0           [val1, val2]
1    [val33, val9, val6]
2     [val2, val6, val7]
Name: val, dtype: object

print (list(set([a for b in df.val.str.strip('[]').str.split(',') for a in b])))
['val7', 'val1', 'val6', 'val33', 'val2', 'val9']
jezrael
  • I have added `converters={"val": literal_eval}` when importing the csv file, so that the val column is recognized as a list object type – kitchenprinzessin Aug 12 '16 at 03:15
  • What's the most concise way to get around the issue of one of the rows being NaN? Having at least one NaN row gets TypeError: 'float' object is not iterable – gregorio099 Mar 28 '23 at 18:56
  • 1
    `print (list(set([a for b in df.val.dropna().str.strip('[]').str.split(',') for a in b])))` does the trick! – gregorio099 Mar 28 '23 at 19:34
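The NaN-safe variant from the comment can be sketched as follows (assuming one hypothetical missing row):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'val': ["[val1,val2]", None, "[val2,val6,val7]"]})

# dropna removes missing rows before the string operations,
# avoiding "TypeError: 'float' object is not iterable"
unique_vals = list(set(a for b in df.val.dropna().str.strip('[]').str.split(',')
                       for a in b))
```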

Convert that column into a DataFrame with .apply(pd.Series). If you stack the columns, you can call the unique method on the returned Series.

df
Out[123]: 
            val
0      [v1, v2]
1      [v3, v2]
2  [v4, v3, v2]

df['val'].apply(pd.Series).stack().unique()
Out[124]: array(['v1', 'v2', 'v3', 'v4'], dtype=object)
ayhan

One way would be to extract those elements into an array using np.hstack and then using np.unique to give us an array of such unique elements, like so -

np.unique(np.hstack(df.val))

If you want a list as output, append .tolist() -

np.unique(np.hstack(df.val)).tolist()
Divakar
  • Very interesting, I thought your solution would be faster than the list comprehension with `set`, but it is not. – jezrael Aug 11 '16 at 13:11
  • @jezrael Yeah that `hstack` isn't helping much I guess. Ah nevermind I did at my end and is even slower! – Divakar Aug 11 '16 at 13:12
  • It is even slower: `In [310]: %timeit np.unique(np.concatenate(df.val)) 10 loops, best of 3: 39.6 ms per loop` – jezrael Aug 11 '16 at 13:14
  • @jezrael Yeah, even slower with `np.concatenate`. Guess not much NumPy can do here :) – Divakar Aug 11 '16 at 13:15

You can use str.cat followed by some string manipulations to obtain the desired list.

In [60]: import re
    ...: from collections import OrderedDict

In [62]: s = df['val'].str.cat()

In [63]: L = re.sub('[[]|[]]',' ', s).strip().replace("  ",',').split(',')

In [64]: list(OrderedDict.fromkeys(L))
Out[64]: ['val1', 'val2', 'val33', 'val9', 'val6', 'val7']
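Since Python 3.7, a plain dict also preserves insertion order, so `dict.fromkeys` can replace `OrderedDict.fromkeys` in the deduplication step; a self-contained sketch of the same pipeline:

```python
import re

import pandas as pd

df = pd.DataFrame({'val': ["[val1,val2]",
                           "[val33,val9,val6]",
                           "[val2,val6,val7]"]})

# Concatenate all strings, replace brackets with spaces, rebuild a
# comma-separated string, split, then deduplicate preserving order
s = df['val'].str.cat()
L = re.sub('[[]|[]]', ' ', s).strip().replace("  ", ',').split(',')
unique_vals = list(dict.fromkeys(L))
print(unique_vals)  # ['val1', 'val2', 'val33', 'val9', 'val6', 'val7']
```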
Nickil Maveli