# import package
import pandas as pd
The problem
I have a dataframe:
data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index', columns=['col'+str(x) for x in range(4)])
which looks like:
     col0 col1 col2 col3
row1    a    A    B    b
row2    a    b    c    d
row3    a    b    d    D
I also have a list of equivalence classes. Each equivalence class consists of items which are taken as equivalent.
equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}
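Since each item belongs to exactly one class here, the same information can also be read "backwards", from item to class name. A minimal sketch of that inversion follows; the name memberToClass is made up for illustration, and the snippet assumes no item appears in more than one class.

```python
# Same equivalence classes as above, repeated so the snippet runs on its own.
equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}

# Hypothetical helper: maps each item to the name of the class containing it.
# Assumes every item belongs to exactly one class; if an item were listed
# twice, the later class would silently win.
memberToClass = {member: className
                 for className, members in equivalenceClasses.items()
                 for member in members}

print(memberToClass['B'])  # classB
```

With this, looking up the class of a single item is one dictionary access instead of a scan over all classes.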
I would like to create a dataframe in which each row of the above dataframe is replaced by the names of the equivalence classes that the letters in that row belong to. Each equivalence class should appear at most once per row, and rows in which not every column is filled by a class name should be post-padded with NaNs. That is, I want this output:
My method
I achieve the goal by:
def differentClasses(colvalues):
    # Distinct class names for the given values; set order is arbitrary.
    return list(set([equivalenceClassName
                     for colvalue in colvalues
                     for equivalenceClassName, equivalenceClass in equivalenceClasses.items()
                     if colvalue in equivalenceClass]))
(On list comprehension, on nested list comprehension.)
df['classes'] = df.apply(lambda row : differentClasses(row['col'+str(x)] for x in range(4)), axis = 1)
(Influenced by this.)
At this point, df looks like this:
Finish by:
result_df = pd.DataFrame(df['classes'].tolist(),index=df.index,columns=['classcol'+str(x) for x in range(4)])
result_df is the desired output above.
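For comparison, here is a self-contained sketch of a slightly shorter variant of the same idea. It is not a built-in pandas feature, just plain Python around the dataframe; the names lookup and classLists are made up, and it assumes every value in the dataframe belongs to exactly one class. Unlike the set-based version, it keeps class names in order of first appearance within each row.

```python
import pandas as pd

data = {'row1': ['a', 'A', 'B', 'b'],
        'row2': ['a', 'b', 'c', 'd'],
        'row3': ['a', 'b', 'd', 'D']}
df = pd.DataFrame.from_dict(data, orient='index',
                            columns=['col' + str(x) for x in range(4)])

equivalenceClasses = {'classA': ['a', 'A'],
                      'classB': ['b', 'B'],
                      'classC': ['c', 'C'],
                      'classD': ['d', 'D']}

# Hypothetical helper: item -> name of its class (assumes one class per item).
lookup = {member: className
          for className, members in equivalenceClasses.items()
          for member in members}

# dict.fromkeys drops duplicates while keeping first-appearance order.
classLists = [list(dict.fromkeys(lookup[value] for value in row))
              for row in df.itertuples(index=False)]

# Shorter rows are padded with missing values when the frame is built.
result_df = pd.DataFrame(classLists, index=df.index,
                         columns=['classcol' + str(x) for x in range(4)])
print(result_df)
```

A value that is missing from every class would raise a KeyError here, whereas the set-based version above would silently skip it.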
The question
Is there a more standard way of doing this? Something like:
df.equivalenceClassify(equivalenceClassList)
and I get my output?