I have some boolean variables in a pandas DataFrame and I need to get all unique tuples. My idea was to create a new column that combines the values of my variables, then call pandas.Series.unique() on that column to get all unique tuples.
To combine them, I want to read each row as the binary expansion of an integer. For instance, for the DataFrame:
import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})
I could create the column like this:
df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8
My idea was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):
variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:], start=1):
    df['added'] = df['added'] + (df[var] << ind)
This, however, throws an error: "TypeError: unsupported operand type(s) for <<: 'Series' and 'int'".
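As far as I can tell, the error comes from Series not defining the << operator, whereas NumPy's np.left_shift ufunc does accept a Series. Below is a minimal sketch of the same loop written that way, assuming every column holds only integer 0/1 values. The intermediate name `added` is just something I picked for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']

# Start from the lowest bit (v1), then add each later column shifted
# left by its position: v2 by 1, v3 by 2, v4 by 3.
added = df[variables[0]].copy()
for ind, var in enumerate(variables[1:], start=1):
    # np.left_shift works element-wise, unlike `df[var] << ind`.
    added = added + np.left_shift(df[var], ind)

df['added'] = added
```

This still loops over the columns in Python, but each step is a single vectorized operation rather than a per-element call.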
I can solve my problem with pandas.Series.apply() like this:
variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:], start=1):  # v2 is bit 1, v3 is bit 2, ...
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)
However, apply is (typically) slow. How can I do things more efficiently?
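One direction that might avoid the loop entirely is to multiply the 0/1 matrix by a vector of powers of two. This is only a sketch, assuming integer 0/1 columns and few enough of them that the result fits in a 64-bit integer. The names `weights`, `unique_codes` and `unique_rows` are mine. The last line shows drop_duplicates() as a possible alternative that skips the encoding step altogether.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']

# One weight per column: 1, 2, 4, 8, ... for however many columns there are.
weights = 2 ** np.arange(len(variables))

# Matrix-vector product: each row becomes v1*1 + v2*2 + v3*4 + v4*8.
df['added'] = df[variables].to_numpy() @ weights

# Distinct encoded values, in order of first appearance.
unique_codes = df['added'].unique()

# Alternative without any encoding: keep the distinct rows themselves.
unique_rows = df[variables].drop_duplicates()
```

The matrix product works for any number of columns in `variables`, which matters here because the real column count is not known in advance.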
Thanks in advance
M