multi-column factorize in pandas

Question

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
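
For reference, here is a minimal sketch of the single-column behaviour described above. The sample values are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the behaviour.
s = pd.Series(['b', 'a', 'b', 'c'])

# codes[i] is the position of s[i] in uniques;
# uniques are listed in order of first appearance.
codes, uniques = pd.factorize(s)

print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['b', 'a', 'c']
```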

I'd like to accomplish the equivalent of pandas.factorize on multiple columns:

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]

That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.

Factorize only works on single columns. Is there a multi-column equivalent function in pandas?

the list in the comment -- a unique, sequential index for each distinct (x, y) value — ChrisB, May 09 '13 at 02:49

score 14 · Accepted Answer · answered May 09 '13 at 08:30

You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very quickly in a Cython loop.

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# pd.lib is the module path from 2013-era pandas; see the comments below for newer versions
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])

The output is:

[0 1 2 2 1 0]

answered May 09 '13 at 08:30

HYRY


Thanks -- that gives the answer I'm looking for, in a reasonably compact form – ChrisB May 09 '13 at 11:56

I get the following error: {AttributeError}module 'pandas' has no attribute 'lib' – Rylan Schaeffer Aug 26 '20 at 03:09

The function can be found under ``pd._libs.lib.fast_zip``. Not sure when it changed. – tobiasraabe Nov 23 '20 at 16:01
you need to use `df.x.values` and `df.y.values` (https://stackoverflow.com/a/23434439/3965888) – Ismael EL ATIFI Mar 25 '22 at 13:58
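
Since fast_zip now sits in a private module, a sketch that stays on the public API may be useful. This is not from the answers in this thread; it assumes a pandas version that provides GroupBy.ngroup, and the variable name codes is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False keeps the groups in order of first appearance,
# so the numbering matches what pd.factorize would assign.
codes = df.groupby(['x', 'y'], sort=False).ngroup().to_numpy()

print(codes.tolist())  # [0, 1, 2, 2, 1, 0]
```

With the default sort=True, the groups would instead be numbered in sorted key order, which happens to give the same result for this particular frame but not in general.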

score 1 · Answer 2 · answered May 09 '13 at 04:40

I am not sure if this is an efficient solution. There might be better solutions for this.

arr = []  # this will hold the unique rows of the DataFrame
for i in range(len(df)):  # loop by position, since iloc expects positions, not labels
    row = df.iloc[i].tolist()
    if row not in arr:
        arr.append(row)

Printing arr gives:

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

To hold the indices, I would declare an ind list:

ind = []
for i in range(len(df)):
    ind.append(arr.index(df.iloc[i].tolist()))

Printing ind gives:

>>> print(ind)
[0, 1, 2, 2, 1, 0]
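
The two loops above scan arr once per row, so the cost grows quadratically with the number of distinct rows. A possible single-pass variant, sketched here with a dictionary (the name seen is an assumption for illustration, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each distinct (x, y) tuple to its id
ind = []
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free id the first time a tuple is seen
    # and returns the stored id on every later occurrence.
    ind.append(seen.setdefault(row, len(seen)))

print(ind)  # [0, 1, 2, 2, 1, 0]
```

The keys of seen, in insertion order, play the role of arr.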

waitingkuo · Answer 3 · 2013-05-09T03:27:37.620

0

You can use drop_duplicates to drop the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
   x  y
0  1  1
1  1  2
2  2  2

EDIT

To achieve your goal, you can join your original df to the de-duplicated frame, so that each row picks up the index label of its first occurrence:

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
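
Note that the index column produced by this join holds the row label of each pair's first occurrence. Those labels are 0, 1, 2 here only because the first three rows are all distinct. A sketch of one way to get sequential ids regardless, using a made-up frame and a hypothetical column name newID:

```python
import pandas as pd

# Hypothetical frame whose first occurrences sit at rows 0, 2 and 4.
df = pd.DataFrame({'x': [2, 2, 1, 2, 1], 'y': [2, 2, 1, 2, 2]})

# One row per distinct (x, y), in order of first appearance,
# renumbered 0..k-1 instead of keeping the original row labels.
uniq = df.drop_duplicates().reset_index(drop=True)
uniq['newID'] = range(len(uniq))

# A left merge keeps the row order of df.
out = df.merge(uniq, on=['x', 'y'], how='left')

print(out['newID'].tolist())  # [0, 0, 1, 0, 2]
```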

edited May 09 '13 at 03:27

answered May 09 '13 at 02:58

waitingkuo


I'm not looking to drop them, but to assign a unique index to each pair of distinct values (i.e. I eventually want to add a new column to the data frame, with values [0, 1, 2, 2, 1, 0]). – ChrisB May 09 '13 at 03:10

score 0 · Answer 4 · answered Sep 13 '17 at 19:58

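import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# Collapse each row's (x, y) values into one tuple so factorize sees a single column
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]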

answered Sep 13 '17 at 19:58

David Hagar

Please explain what your code does differently from OP's and how that solves the problem. I recommend this guide on creating a useful answer https://stackoverflow.com/help/how-to-answer – Will Barnwell Sep 13 '17 at 20:32
