1

New to pandas python.

I have a dataframe (df) with two columns of cusips. I want to turn those columns into a list of the unique entries of the two columns.

My first attempt was to do the following:

cusips = pd.concat(df['long'], df['short']).

This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

I have read a few postings, but I am still having trouble with why this comes up. What am I missing here?

Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I call it in one function? Does the function differ if I want to create a list or a new, one-coulmn dataframe?

Thank you.

user1911092
  • 3,941
  • 9
  • 26
  • 30

2 Answers2

1

To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).

df.long.unique()
# returns numpy array of unique values

You could then use numpy.append:

np.append(df.long.unique(), df.short.unique())

Note: This just appends the two unique results together and so itself is not unique!

.

Here's a (trivial) example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])

In [4]: df
Out[4]: 
   long  short
0     1      2
1     1      4

In [5]: df.long.unique()
Out[5]: array([1])

In [6]: df.short.unique()
Out[6]: array([2, 4])

And then appending the resulting two arrays:

In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])

Using @Zalazny7's set is significantly faster (since it runs over the array only once) and somewhat upsettingly it's even faster than np.unique (which sorts the resulting array!).

Community
  • 1
  • 1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
1

Adding to Hayden's answer, you could also use the set() method for the same result. The performance is slightly better if that's a consideration:

In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop

In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop
Zelazny7
  • 39,946
  • 18
  • 70
  • 84
  • set is definitely the way to go here, I am upset it is faster than `numpy.unique` (which sorts)! – Andy Hayden Jan 02 '13 at 16:23
  • Quick follow-up. Does Set take all or just the unique of df[0] and df[1]. I am assuming the unique only. Any thoughts on this: This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). What should I make of that? – user1911092 Jan 02 '13 at 16:34
  • Try enclosing your Series in []: cusips = pd.concat([df['long'], df['short']]) – Zelazny7 Jan 02 '13 at 17:08