How to get columns containing names of pre-defined equivalence classes of values in each row of a Pandas dataframe?

Question

# import package
import pandas as pd

The problem

I have a dataframe:

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index', columns=['col'+str(x) for x in range(4)])

which looks like:

     col0 col1 col2 col3
row1    a    A    B    b
row2    a    b    c    d
row3    a    b    d    D

I also have a dict of equivalence classes. Each equivalence class consists of items that are treated as equivalent.

equivalenceClasses={'classA':['a','A'],
                    'classB':['b','B'],
                    'classC':['c','C'],
                    'classD':['d','D']}

I would like to create a dataframe in which each row of the above dataframe is replaced by the names of the equivalence classes that the letters in that row belong to. (Each equivalence class should appear no more than once in a row, and NaNs should post-pad any row in which not all columns are filled by the name of an equivalence class.) I.e. I want this output:

My method

I achieve the goal by:

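def differentClasses(colvalues):
    # Collect the name of every class that contains at least one of the row's values.
    # The set drops repeated names; note that set order is not guaranteed.
    return list(set([equivalenceClassName
                     for colvalue in colvalues
                     for equivalenceClassName, equivalenceClass in equivalenceClasses.items()
                     if colvalue in equivalenceClass]))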

(On list comprehension, on nested list comprehension.)

df['classes'] = df.apply(lambda row : differentClasses(row['col'+str(x)] for x in range(4)), axis = 1)

(Influenced by this.)

The df at this point looks like this:

Finish by:

result_df = pd.DataFrame(df['classes'].tolist(),index=df.index,columns=['classcol'+str(x) for x in range(4)])

result_df is the desired output above.
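One caveat with the method above: list(set(...)) does not guarantee the order of the class names, so the column a class lands in may vary. A minimal sketch of an order-preserving variant, assuming first-seen order is what is wanted (the names ordered_classes and ordered_df are hypothetical):

```python
import pandas as pd

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])

equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}

def ordered_classes(colvalues):
    # Same lookup as differentClasses, but dict.fromkeys keeps the
    # first-seen order while dropping repeated class names.
    names = (name
             for value in colvalues
             for name, members in equivalenceClasses.items()
             if value in members)
    return list(dict.fromkeys(names))

rows = [ordered_classes(row) for row in df.to_numpy()]

# Shorter rows are padded with missing values on the right.
ordered_df = pd.DataFrame(rows, index=df.index)
# Force one output column per input column, even if every row is short.
ordered_df = ordered_df.reindex(columns=range(df.shape[1]))
ordered_df.columns = ['classcol' + str(x) for x in range(df.shape[1])]
```

With the example data, row1 should come out as classA, classB followed by two missing values.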

The question

Is there a more standard way of doing this? Something like:

df.equivalenceClassify(equivalenceClassList)

and I get my output?

BENY · Accepted Answer · 2020-08-03T21:20:57.817

2

We need to build a new dict, based on your original equivalenceClasses, that maps each value to its class name, and then just call replace:

from collections import ChainMap
d = dict(ChainMap(*[dict.fromkeys(y,x) for x , y in equivalenceClasses.items()]))
df = df.replace(d)
Out[299]: 
        col0    col1    col2    col3
row1  classA  classA  classB  classB
row2  classA  classB  classC  classD
row3  classA  classB  classD  classD
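For readers unfamiliar with ChainMap, the dict d built above is simply a flat letter-to-class-name lookup. A sketch of an equivalent construction with a plain dict comprehension (the names d_chainmap and replaced are only for illustration):

```python
import pandas as pd
from collections import ChainMap

equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}

# The construction used in the answer.
d_chainmap = dict(ChainMap(*[dict.fromkeys(y, x)
                             for x, y in equivalenceClasses.items()]))

# Equivalent flat lookup: one entry per letter, pointing at its class name.
d = {letter: name
     for name, letters in equivalenceClasses.items()
     for letter in letters}

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])

# Every letter is swapped for the name of the class it belongs to.
replaced = df.replace(d)
```

Both dicts hold the same pairs, so either can be passed to replace.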

Then mask the values that repeat within each row:

df = df.mask(df.apply(pd.Series.duplicated, axis=1))
Out[307]: 
        col0    col1    col2    col3
row1  classA     NaN  classB     NaN
row2  classA  classB  classC  classD
row3  classA  classB  classD     NaN
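Note that the masked frame leaves NaN in the middle of a row (row1 is classA, NaN, classB, NaN), whereas the question asks for the NaNs to be post-padded. A minimal sketch of one way to shift the remaining names to the left, assuming a row-wise apply is acceptable for the data size (left_align and padded are hypothetical names):

```python
import numpy as np
import pandas as pd

equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}
d = {letter: name
     for name, letters in equivalenceClasses.items()
     for letter in letters}

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])

# Replace letters by class names, then blank out repeats within each row.
replaced = df.replace(d)
masked = replaced.mask(replaced.apply(pd.Series.duplicated, axis=1))

def left_align(row):
    # Keep the surviving names in order and pad the tail with NaN.
    kept = row.dropna().tolist()
    return pd.Series(kept + [np.nan] * (len(row) - len(kept)), index=row.index)

padded = masked.apply(left_align, axis=1)
# Optional: match the column names used in the question.
padded.columns = ['classcol' + str(x) for x in range(padded.shape[1])]
```

After this step row1 should read classA, classB, NaN, NaN.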

edited Aug 03 '20 at 21:20

answered Aug 03 '20 at 21:18

BENY

Maybe on a large dataframe, could we do better than .replace()? Reading [this](https://stackoverflow.com/questions/42012339/using-replace-efficiently-in-pandas) currently... – zabop Aug 30 '20 at 17:29
@zabop Yes, we could. You can try creating the data in long form, then converting it back via pivot. – BENY Aug 30 '20 at 17:35
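One possible reading of the pivot suggestion, sketched under the assumption that "the data" means a long (melted) copy of the frame: map the letters with Series.map instead of replace, drop repeats per row, number the survivors, and pivot back to wide form. The column names row, col, letter, cls and pos are made up for this sketch, and whether it is actually faster than replace would need timing on the real data.

```python
import pandas as pd

equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}
d = {letter: name
     for name, letters in equivalenceClasses.items()
     for letter in letters}

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])

# Long form: one line per (row, column) cell.
long_df = (df.rename_axis('row')
             .reset_index()
             .melt(id_vars='row', var_name='col', value_name='letter'))

# Dict lookup per cell; letters without a class become NaN and are dropped.
long_df['cls'] = long_df['letter'].map(d)
long_df = long_df.dropna(subset=['cls'])

# Keep the first occurrence of each class within a row.
long_df = long_df.drop_duplicates(subset=['row', 'cls'])

# Position of each surviving class inside its row: 0, 1, 2, ...
long_df['pos'] = long_df.groupby('row').cumcount()

# Back to wide form; positions that are missing in a row come out as NaN.
wide = long_df.pivot(index='row', columns='pos', values='cls')
wide = wide.reindex(index=df.index, columns=range(df.shape[1]))
wide.columns = ['classcol' + str(x) for x in range(df.shape[1])]
```

Because the positions are assigned after the duplicates are dropped, the NaNs end up at the end of each row.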
