I have a pandas data frame like this:
dx1 dx2 dx3 dx4
25041 40391 5856 0
25041 40391 25081 5856
25041 40391 42822 0
25061 40391 0 0
25041 40391 0 5856
40391 25002 5856 3569
Using pandas' get_dummies, I created a dummy table like this:
dummyData = pd.get_dummies(dataFrame, prefix='dx')
dummyData
dx_25041 dx_25061 dx_40391 dx_25002 dx_40391 dx_0 dx_25081 dx_42822 dx_5856 dx_0 dx_3569 dx_5856
1 0 0 0 1 0 0 0 1 1 0 0
1 0 0 0 1 0 1 0 0 0 0 1
1 0 0 0 1 0 0 1 0 1 0 0
0 1 0 0 1 1 0 0 0 1 0 0
1 0 0 0 1 1 0 0 0 0 0 1
0 0 1 1 0 0 0 0 1 0 1 0
The dummy columns are repeated, e.g. dx_40391, dx_0 and dx_5856 here. A name may appear twice or many times. I want to merge such duplicate dummy columns with a UNION (logical OR) operation and keep only one column per name in the data frame, so that, for example, the single dx_40391 column has value 1 in every row. The same goes for all other duplicated dummy variables. I have hundreds of thousands of dummy variables and hundreds of thousands of rows. Is there an efficient way to do this?
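To make the intended result concrete, here is a minimal sketch on the sample data above. It assumes the codes are stored as strings, and it treats the union as a maximum over all dummy columns that share a name. The variable name `merged` and the transposed `groupby` are only one illustrative way to express this, and nothing here has been checked at the scale described.

```python
import pandas as pd

# Sample data from the question; codes are assumed to be strings.
dataFrame = pd.DataFrame({
    "dx1": ["25041", "25041", "25041", "25061", "25041", "40391"],
    "dx2": ["40391", "40391", "40391", "40391", "40391", "25002"],
    "dx3": ["5856", "25081", "42822", "0", "0", "5856"],
    "dx4": ["0", "5856", "0", "0", "5856", "3569"],
})

# A single prefix for every source column yields repeated names such as dx_40391.
dummyData = pd.get_dummies(dataFrame, prefix="dx", dtype="uint8")

# Union of same-named columns: transpose so the column names become the index,
# group by that index, take the maximum (logical OR for 0/1), transpose back.
merged = dummyData.T.groupby(level=0).max().T

print(merged)
```

In `merged` each dx_* name appears once, and a cell is 1 if the code occurs in any of dx1..dx4 for that row. Transposing a very wide table may be costly, so this is meant to show the desired output rather than an efficient solution.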