assign hash to row of categorical data in pandas

Question

So I have many pandas data frames with 3 columns of categorical variables:

             D              F     False
             T              F     False
             D              F     False
             T              F     False

The first and second columns can take one of three values. The third one is binary. So there are a grand total of 18 possible rows (not all combination may be represented on each data frame).

I would like to assign a number 1-18 to each row, so that rows with the same combination of factors are assigned the same number and vise-versa (no hash collision).

What is the most efficient way to do this in pandas?

So, all_combination_df is a df with all possible combination of the factors. I am trying to turn df such as big_df to a Series with unique numbers in it

import pandas, itertools

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

all_combination_df = expand_grid(
                           {'variable_1': ['D', 'A', 'T'],
                           'variable_2': ['C', 'A', 'B'],
                           'variable_3'     : [True, False]})

big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])

`pandas.DataFrame.to_dict()` should do this if you first find unique rows (remove duplicates). — p-robot, Nov 05 '16 at 12:37
@P-robot: this is a big aspect of the problem. Finding unique rows and to dict() would take too much time (the data is huge and it is coming in chunks). I know there are 18 possible combinations. Is there a way to use this information? — user189035, Nov 05 '16 at 12:39
OK. I see. Perhaps you could create a dictionary of the 18 possible combinations and then map those combinations to the data? — p-robot, Nov 05 '16 at 12:46
@P-robot: interresting, how do you map a dictionary to a data frame? — user189035, Nov 05 '16 at 12:57
I edited my answer below to illustrate how you could use `map`. — p-robot, Nov 05 '16 at 13:01

score 4 · Accepted Answer · edited May 23 '17 at 11:48

4

UPDATE: as @user189035 mentioned in the comment it's much better to use categorical dtype as it'll save a lot of memory

I would try to use factorize method:

In [112]: df['category'] = \
     ...:     pd.Categorical(
     ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
     ...:

In [113]: df
Out[113]:
   a  b      c category
0  A  X   True        0
1  B  Y  False        1
2  A  X   True        0
3  C  Z  False        2
4  A  Z   True        3
5  C  Z   True        4
6  B  Y  False        1
7  C  Z  False        2

In [114]: df.dtypes
Out[114]:
a             object
b             object
c               bool
category    category
dtype: object

Explanation: this simple way we can glue all columns into a single series:

In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0    A~X~1
1    B~Y~0
2    A~X~1
3    C~Z~0
4    A~Z~1
5    C~Z~1
6    B~Y~0
7    C~Z~0
dtype: object

edited May 23 '17 at 11:48

Community

1
1

answered Nov 05 '16 at 13:09

MaxU - stand with Ukraine

205,989
36
386
419

That's the ticket! – p-robot Nov 05 '16 at 13:10
1

@P-robot, thank you! Now i know what [this idiom](http://idioms.thefreedictionary.com/That's+the+ticket!) means ;) – MaxU - stand with Ukraine Nov 05 '16 at 13:12
Isn't there a risk of hash collision if columns a and b have one modality with the same name in comon? – user189035 Nov 05 '16 at 13:18
1

A small comment: if (like I did) need the new colunm to be the same length as the old one, you should use: `(df.a+~+df.b+~df.c.astype(str)).astype('category')` – user189035 Nov 07 '16 at 08:47

score 2 · Answer 2 · edited May 23 '17 at 12:19

Without taking into account issues of efficiency, this would find duplicate rows and give you a dictionary (similar to the question here).

import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"], 
["F", "F", "F", "J", "K"], 
[False, False, False, False, True]])
df = pd.DataFrame(d.T)

# Find and remove duplicate rows
df_nodupe = df[~df.duplicated()]

# Make a list
df_nodupe.T.to_dict('list')

{0: ['D', 'F', 'False'],
 1: ['T', 'F', 'False'],
 3: ['T', 'J', 'False'],
 4: ['U', 'K', 'True']}

Otherwise, you could use map, like so:

import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"], 
["F", "F", "F", "J", "K"], 
[False, False, False, False, True]])
df = pd.DataFrame(d.T)
df.columns = ['x', 'y', 'z']

# Define your dictionary of interest
dd = {('D', 'F', 'False'): 0,
 ('T', 'F', 'False'): 1,
 ('T', 'J', 'False'): 2,
 ('U', 'K', 'True'): 3}

# Create a tuple of the rows of interest
df['tupe'] = zip(df.x, df.y, df.z)

# Create a new column based on the row values
df['new_category'] = df.tupe.map(dd)

assign hash to row of categorical data in pandas

2 Answers2