Efficient way to get the unique values from 2 or more columns in a Dataframe

Question

Given a matrix from an SFrame:

>>> from sframe import SFrame
>>> sf =SFrame({'x':[1,1,2,5,7], 'y':[2,4,6,8,2], 'z':[2,5,8,6,2]})
>>> sf
Columns:
    x   int
    y   int
    z   int

Rows: 5

Data:
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 2 | 2 |
| 1 | 4 | 5 |
| 2 | 6 | 8 |
| 5 | 8 | 6 |
| 7 | 2 | 2 |
+---+---+---+
[5 rows x 3 columns]

I want to get the unique values for the x and y columns and I can do it as such:

>>> sf['x'].unique().append(sf['y'].unique()).unique()
dtype: int
Rows: 7
[2, 8, 5, 4, 1, 7, 6]

This way I get the unique values of x and unique values of y then append them and get the unique values of the appended list.

I could also do it as such:

>>> sf['x'].append(sf['y']).unique()
dtype: int
Rows: 7
[2, 8, 5, 4, 1, 7, 6]

But that way, if my x and y columns are huge with lots of duplicates, I would be appending it into a very huge container before getting the unique.

Is there a more efficient way to get the unique values of a combined columns created from 2 or more columns in an SFrame?

What is the equivalence in pandas of the efficent way to get unique values from 2 or more columns in pandas?

Does the order of elems in output matter? Would it be okay to have the output as a list or an array? — Divakar, Aug 19 '16 at 07:54
[SFrame's API](https://turi.com/products/create/docs/generated/graphlab.SFrame.html) is pretty poor compared to Pandas API, so i don't think you can do it in a much more efficient way compared to your solutions. Maybe it's time to consider using [Apache Spark](http://spark.apache.org/)? — MaxU - stand with Ukraine, Aug 19 '16 at 09:39

score 2 · Answer 1 · answered Aug 03 '16 at 04:46

2

I dont have SFrame but tested on pd.DataFrame:

  sf[["x", "y"]].stack().value_counts().index.tolist()
  [2, 1, 8, 7, 6, 5, 4]

answered Aug 03 '16 at 04:46

Merlin

24,552
41
131
206

Quick question, in this case, the one column (with duplicate) is appended to another (with duplicate) before doing `.value_counts()` right? – alvas Aug 03 '16 at 05:33
Yes, the code stacks the data, in order to access the underlying values. – Merlin Aug 03 '16 at 06:37
It works, I've upvoted. But i think an sframe answer would be more appropriate. Hope you don't mind =) – alvas Aug 15 '16 at 14:41
I believe `.unique()` is more efficient then `value_counts()`. – ptrj Aug 19 '16 at 17:15

score 2 · Answer 2 · answered Aug 19 '16 at 17:12

2

The easiest way I can think of is to convert to a numpy array then find unique values

np.unique(sf[['x', 'y']].to_numpy())

array([1, 2, 4, 5, 6, 7, 8])

If you needed it in an sframe

SFrame({'xy_unique': np.unique(sf[['x', 'y']].to_numpy())})

answered Aug 19 '16 at 17:12

piRSquared

285,575
57
475
624

score 2 · Answer 3 · answered Aug 19 '16 at 17:12

SFrame

I haven't used SFrame and don't know on which conditions it copies data. (Does selection sf['x'] or append copy data to memory?). There are pack_columns and stack methods in SFrame and if they don't copy data, then this should work:

sf[['x', 'y']].pack_columns(new_column_name='N').stack('N').unique()

pandas

If your data fit into memory then you can probably do it in pandas efficiently without extra copy.

# copies the data to memory
df = sf[['x', 'y']].to_dataframe()

# a reference to the underlying numpy array (no copy)
vals = df.values

# 1d array: 
# (numpy.ravel doesn't copy if it doesn't have to - it depends on the data layout)
if np.isfortran(vals):
    vals_1d = vals.ravel(order='F')
else:
    vals_1d = vals.ravel(order='C')

uniques = pd.unique(vals_1d)

pandas's unique is more efficient than numpy's np.unique because it doesn't sort.

score 2 · Answer 4 · edited May 23 '17 at 12:17

2

Take a look at this answer to a similar question. Note that Pandas' pd.unique function is considerably faster than Numpy's.

>>> pd.unique(sf[['x','y']].values.ravel())
array([2, 8, 5, 4, 1, 7, 6], dtype=object)

edited May 23 '17 at 12:17

Community

1
1

answered Aug 22 '16 at 05:27

PQL

132
8

score 1 · Answer 5 · answered Aug 03 '16 at 05:27

Although I don't know how to do it in SFrame, here's a longer explanation of @Merlin's answer:

>>> import pandas as pd
>>> df = pd.DataFrame({'x':[1,1,2,5,7], 'y':[2,4,6,8,2], 'z':[2,5,8,6,2]})
>>> df[['x', 'y']]
   x  y
0  1  2
1  1  4
2  2  6
3  5  8
4  7  2

To extract only columns X and Y

>>> df[['x', 'y']] # Extract only columns x and y
   x  y
0  1  2
1  1  4
2  2  6
3  5  8
4  7  2

To stack the 2 columns per row into 1 column row, while still being able to access them as a dictionary:

>>> df[['x', 'y']].stack()                       
0  x    1
   y    2
1  x    1
   y    4
2  x    2
   y    6
3  x    5
   y    8
4  x    7
   y    2
dtype: int64
>>> df[['x', 'y']].stack()[0]      
x    1
y    2
dtype: int64
>>> df[['x', 'y']].stack()[0]['x']
1
>>> df[['x', 'y']].stack()[0]['y']
2

Count the individual values of all elements within the combined columns:

>>> df[['x', 'y']].stack().value_counts() # index(i.e. keys)=elements, Value=counts
2    3
1    2
8    1
7    1
6    1
5    1
4    1

To access the index and counts:

>>> df[['x', 'y']].stack().value_counts().index      
Int64Index([2, 1, 8, 7, 6, 5, 4], dtype='int64')
>>> df[['x', 'y']].stack().value_counts().values  
array([3, 2, 1, 1, 1, 1, 1])

Convert to a list:

>>> sf[["x", "y"]].stack().value_counts().index.tolist()
[2, 1, 8, 7, 6, 5, 4]

Still an SFrame answer would be great too. The same syntax doesn't work for SFrame.

IIUC, `stack()` will very likely copy the underlying data - probably something you'd like to avoid. — ptrj, Aug 19 '16 at 17:24

score 1 · Answer 6 · answered Aug 25 '16 at 20:18

Here's a little benchmark between three possible methods:

from sframe import SFrame
import numpy as np
import pandas as pd
import timeit

sf = SFrame({'x': [1, 1, 2, 5, 7], 'y': [2, 4, 6, 8, 2], 'z': [2, 5, 8, 6, 2]})


def f1(sf):
    return sf['x'].unique().append(sf['y'].unique()).unique()


def f2(sf):
    return sf['x'].append(sf['y']).unique()


def f3(sf):
    return np.unique(sf[['x', 'y']].to_numpy())

N = 1000

print timeit.timeit('f1(sf)', setup='from __main__ import f1, sf', number=N)
print timeit.timeit('f2(sf)', setup='from __main__ import f2, sf', number=N)
print timeit.timeit('f3(sf)', setup='from __main__ import f3, sf', number=N)

# 13.3195129933
# 4.66225642657
# 3.65669089489
# [Finished in 23.6s]

Benchmark using python2.7.11 x64 on windows7+i7_2.6ghz

Conclusion: I'd suggest you use np.unique, that's basically f3.

Efficient way to get the unique values from 2 or more columns in a Dataframe

6 Answers6