# import package
import pandas as pd

The problem

I have a dataframe:

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index', columns=['col'+str(x) for x in range(4)])

which looks like:

     col0 col1 col2 col3
row1    a    A    B    b
row2    a    b    c    d
row3    a    b    d    D

I also have a list of equivalence classes. Each equivalence class consists of items which are taken as equivalent.

equivalenceClasses={'classA':['a','A'],
                    'classB':['b','B'],
                    'classC':['c','C'],
                    'classD':['d','D']}

I would like to create a dataframe in which each row of the above dataframe is replaced by the names of the equivalence classes that the letters in that row belong to. (Each equivalence class should appear no more than once in a row, and rows in which not every column is filled by the name of an equivalence class should be post-padded with NaNs.) I.e. I want this output:

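     classcol0 classcol1 classcol2 classcol3
row1    classA    classB       NaN       NaN
row2    classA    classB    classC    classD
row3    classA    classB    classD       NaN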


My method

I achieve the goal by:

def differentClasses(colvalues):
    # Collect the name of every class that contains at least one of the values;
    # the set removes repeated class names
    return list(set([equivalenceClassName for colvalue in colvalues
                                          for equivalenceClassName, equivalenceClass in equivalenceClasses.items()
                                          if colvalue in equivalenceClass]))

(On list comprehension, on nested list comprehension.)

df['classes'] = df.apply(lambda row: differentClasses(row['col'+str(x)] for x in range(4)), axis=1)

(Influenced by this.)

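The df at this point looks like this (the order of the names within each list may vary, because they come from a set):

     col0 col1 col2 col3                           classes
row1    a    A    B    b                  [classA, classB]
row2    a    b    c    d  [classA, classB, classC, classD]
row3    a    b    d    D          [classA, classB, classD]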

Finish by:

result_df = pd.DataFrame(df['classes'].tolist(),index=df.index,columns=['classcol'+str(x) for x in range(4)])

result_df is the desired output above.


The question

Is there a more standard way of doing this? Something like:

df.equivalenceClassify(equivalenceClassList)

and I get my output?

zabop

1 Answer


We need to build a new dict from your original equivalenceClasses that maps each letter to the name of its class, then just call replace:

from collections import ChainMap
# Invert the mapping: each member letter -> the name of its class
d = dict(ChainMap(*[dict.fromkeys(y, x) for x, y in equivalenceClasses.items()]))
df = df.replace(d)
df
Out[299]: 
        col0    col1    col2    col3
row1  classA  classA  classB  classB
row2  classA  classB  classC  classD
row3  classA  classB  classD  classD

Then

# Replace every repeat of a class name within a row by NaN
df = df.mask(df.apply(pd.Series.duplicated, axis=1))
df
Out[307]: 
        col0    col1    col2    col3
row1  classA     NaN  classB     NaN
row2  classA  classB  classC  classD
row3  classA  classB  classD     NaN
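
The masked frame leaves each NaN where the duplicate was, whereas the question asks for the NaNs to be post-padded. Below is a minimal sketch of one way to push the class names to the left of each row. It assumes the df and equivalenceClasses from the question. The helper name left_justify is made up for illustration, the classcol labels are borrowed from the question, and a plain dict comprehension stands in for the ChainMap construction.

```python
import numpy as np
import pandas as pd

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])
equivalenceClasses = {'classA': ['a', 'A'], 'classB': ['b', 'B'],
                      'classC': ['c', 'C'], 'classD': ['d', 'D']}

# Letter -> class name lookup
d = {letter: name
     for name, letters in equivalenceClasses.items()
     for letter in letters}

replaced = df.replace(d)
# NaN out every repeat of a class name within a row
masked = replaced.mask(replaced.apply(pd.Series.duplicated, axis=1))

def left_justify(row):
    # Hypothetical helper: keep the non-NaN values in their original order
    # and pad the end of the row with NaN
    kept = row.dropna().tolist()
    return pd.Series(kept + [np.nan] * (len(row) - len(kept)), index=row.index)

result_df = masked.apply(left_justify, axis=1)
result_df.columns = ['classcol' + str(x) for x in range(result_df.shape[1])]
print(result_df)
```

This keeps the class names in order of first appearance within each row. The row-wise apply is simple rather than fast, so it is probably best suited to small or medium frames.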
BENY
  • Maybe on a large dataframe, could we do better than .replace()? Reading [this](https://stackoverflow.com/questions/42012339/using-replace-efficiently-in-pandas) currently... – zabop Aug 30 '20 at 17:29
  • @zabop yes, we could. You can try to create the data, then convert it back via pivot ~ – BENY Aug 30 '20 at 17:35
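
One possible reading of the pivot suggestion, as a rough sketch rather than the commenter's exact approach: reshape to long format, look up each letter with Series.map, drop repeated classes per row, then pivot back to wide. It assumes the df and equivalenceClasses from the question. The column names row, letter, cls and pos are made up for illustration. Whether this beats .replace() on a large dataframe would need to be timed.

```python
import pandas as pd

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])
equivalenceClasses = {'classA': ['a', 'A'], 'classB': ['b', 'B'],
                      'classC': ['c', 'C'], 'classD': ['d', 'D']}

# Letter -> class name lookup
d = {letter: name
     for name, letters in equivalenceClasses.items()
     for letter in letters}

# Long format: one line per cell, tagged with the label of its original row
long = (df.rename_axis('row').reset_index()
          .melt(id_vars='row', value_name='letter'))
long['cls'] = long['letter'].map(d)

# Drop letters that belong to no class, then keep the first occurrence
# of each class within a row
long = long.dropna(subset=['cls']).drop_duplicates(['row', 'cls'])

# Number the surviving classes within each row: 0, 1, 2, ...
long['pos'] = long.groupby('row').cumcount()

# Back to wide format; rows with fewer classes end up with trailing NaN
wide = long.pivot(index='row', columns='pos', values='cls')
wide = wide.reindex(index=df.index, columns=range(df.shape[1]))
wide.columns = ['classcol' + str(x) for x in wide.columns]
wide.index.name = None
print(wide)
```

Because the classes are numbered within each row before pivoting, the NaNs land at the end of each row without a separate padding step.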