5

So I have many pandas data frames with 3 columns of categorical variables:

             D              F     False
             T              F     False
             D              F     False
             T              F     False

The first and second columns can take one of three values. The third one is binary. So there are a grand total of 18 possible rows (not all combination may be represented on each data frame).

I would like to assign a number 1-18 to each row, so that rows with the same combination of factors are assigned the same number and vise-versa (no hash collision).

What is the most efficient way to do this in pandas?

So, all_combination_df is a df with all possible combination of the factors. I am trying to turn df such as big_df to a Series with unique numbers in it

import pandas, itertools

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

all_combination_df = expand_grid(
                           {'variable_1': ['D', 'A', 'T'],
                           'variable_2': ['C', 'A', 'B'],
                           'variable_3'     : [True, False]})

big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
user189035
  • 5,589
  • 13
  • 52
  • 112
  • `pandas.DataFrame.to_dict()` should do this if you first find unique rows (remove duplicates). – p-robot Nov 05 '16 at 12:37
  • @P-robot: this is a big aspect of the problem. Finding unique rows and to dict() would take too much time (the data is huge and it is coming in chunks). I know there are 18 possible combinations. Is there a way to use this information? – user189035 Nov 05 '16 at 12:39
  • 1
    OK. I see. Perhaps you could create a dictionary of the 18 possible combinations and then map those combinations to the data? – p-robot Nov 05 '16 at 12:46
  • @P-robot: interresting, how do you map a dictionary to a data frame? – user189035 Nov 05 '16 at 12:57
  • I edited my answer below to illustrate how you could use `map`. – p-robot Nov 05 '16 at 13:01

2 Answers2

4

UPDATE: as @user189035 mentioned in the comment it's much better to use categorical dtype as it'll save a lot of memory

I would try to use factorize method:

In [112]: df['category'] = \
     ...:     pd.Categorical(
     ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
     ...:

In [113]: df
Out[113]:
   a  b      c category
0  A  X   True        0
1  B  Y  False        1
2  A  X   True        0
3  C  Z  False        2
4  A  Z   True        3
5  C  Z   True        4
6  B  Y  False        1
7  C  Z  False        2

In [114]: df.dtypes
Out[114]:
a             object
b             object
c               bool
category    category
dtype: object

Explanation: this simple way we can glue all columns into a single series:

In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0    A~X~1
1    B~Y~0
2    A~X~1
3    C~Z~0
4    A~Z~1
5    C~Z~1
6    B~Y~0
7    C~Z~0
dtype: object
Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
2

Without taking into account issues of efficiency, this would find duplicate rows and give you a dictionary (similar to the question here).

import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"], 
["F", "F", "F", "J", "K"], 
[False, False, False, False, True]])
df = pd.DataFrame(d.T)

# Find and remove duplicate rows
df_nodupe = df[~df.duplicated()]

# Make a list
df_nodupe.T.to_dict('list')

{0: ['D', 'F', 'False'],
 1: ['T', 'F', 'False'],
 3: ['T', 'J', 'False'],
 4: ['U', 'K', 'True']}

Otherwise, you could use map, like so:

import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"], 
["F", "F", "F", "J", "K"], 
[False, False, False, False, True]])
df = pd.DataFrame(d.T)
df.columns = ['x', 'y', 'z']

# Define your dictionary of interest
dd = {('D', 'F', 'False'): 0,
 ('T', 'F', 'False'): 1,
 ('T', 'J', 'False'): 2,
 ('U', 'K', 'True'): 3}

# Create a tuple of the rows of interest
df['tupe'] = zip(df.x, df.y, df.z)

# Create a new column based on the row values
df['new_category'] = df.tupe.map(dd)
Community
  • 1
  • 1
p-robot
  • 4,652
  • 2
  • 29
  • 38