
I have a pandas data frame like this:

dx1      dx2    dx3    dx4
25041   40391   5856    0
25041   40391   25081   5856
25041   40391   42822   0
25061   40391   0       0
25041   40391   0       5856
40391   25002   5856    3569

Using the get_dummies method, I created a dummy table like this:

dummyData = pd.get_dummies(dataFrame, prefix='dx')
dummyData


dx_25041    dx_25061    dx_40391    dx_25002    dx_40391    dx_0    dx_25081    dx_42822    dx_5856 dx_0    dx_3569 dx_5856
1           0           0           0           1           0       0           0           1       1       0       0
1           0           0           0           1           0       1           0           0       0       0       1
1           0           0           0           1           0       0           1           0       1       0       0
0           1           0           0           1           1       0           0           0       1       0       0
1           0           0           0           1           1       0           0           0       0       0       1
0           0           1           1           0           0       0           0           1       0       1       0

Some dummy columns are repeated, e.g. dx_40391, dx_0, and dx_5856 here; a name can occur twice or many times. I want to merge each set of duplicate dummy columns with a UNION operation, so that, for example, a single remaining dx_40391 column has value 1 in every row where any of its duplicates was 1, and only one column per name is kept in the data frame. Similarly for all other duplicated dummy variables. I have hundreds of thousands of dummy variables and hundreds of thousands of rows. Is there an efficient way to do this?
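For reference, here is a minimal setup that reproduces the frame above (note that get_dummies only encodes object/category columns, so if the codes are read as integers they must be cast to str first):

import pandas as pd

# rebuild the example frame from the question
dataFrame = pd.DataFrame({
    'dx1': [25041, 25041, 25041, 25061, 25041, 40391],
    'dx2': [40391, 40391, 40391, 40391, 40391, 25002],
    'dx3': [5856, 25081, 42822, 0, 0, 5856],
    'dx4': [0, 5856, 0, 0, 5856, 3569],
})

# cast to str so get_dummies treats the codes as categories;
# a single string prefix is applied to every column
dummyData = pd.get_dummies(dataFrame.astype(str), prefix='dx')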

Sanoj

1 Answer


If you use the usual column indexer df[col] with a duplicated column name, you get back all columns sharing that name, so you can take the max along the rows (max with axis=1) to union them into a single column:

pd.DataFrame({c: pd.DataFrame(df[c]).max(axis=1) for c in df.columns.unique()})

   dx_0  dx_25002  dx_25041  dx_25061  dx_25081  dx_3569  dx_40391  dx_42822  \
0     1         0         1         0         0        0         1         0
1     0         0         1         0         1        0         1         0
2     1         0         1         0         0        0         1         1
3     1         0         0         1         0        0         1         0
4     1         0         1         0         0        0         1         0
5     0         1         0         0         0        1         1         0

   dx_5856
0        1
1        1
2        0
3        0
4        1
5        1
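As a side note, the same union can also be written as a column-axis groupby instead of the dict comprehension; a sketch, assuming df is the dummy frame with the duplicated names:

# group the duplicated column labels and take the row-wise max (= union)
merged = df.groupby(level=0, axis=1).max()

# equivalent form for recent pandas, where axis=1 groupby is deprecated
merged = df.T.groupby(level=0).max().T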
maxymoo
  • When I applied it to my data frame I got this error: ValueError: No axis named 1 for object type – Sanoj Jan 12 '16 at 17:06
  • @maxymoo: I have df = pd.get_dummies(dataFrame, prefix='dx'), and then I applied your expression to df as stated above, but I get the error mentioned in the comment above. Am I missing any step(s)? – Sanoj Jan 12 '16 at 17:34
  • oh there was a typo in my answer, try it again now – maxymoo Jan 12 '16 at 22:32
  • Thanks, it works for smaller data, but for larger data (a CSV file over 100 MB, shape (172214, 198)) it gives a memory error on my machine while creating the dummy variables. Before this I was using the following code, which also works for smaller data but fails for larger: mat = X11.as_matrix(columns=None); values, counts = np.unique(mat.astype(str), return_counts=True); for x in values: X11[x] = X11.isin([x]).any(1).astype(int) – Sanoj Jan 12 '16 at 23:22
  • @maxymoo: do you know any way other than pd.get_dummies()? There are many hundreds of thousands of codes embedded in the dx columns, and it tries to build dummies for all of them and fails due to the memory crunch. – Sanoj Jan 12 '16 at 23:25
  • Maybe try using `np.bincount`; take a look at this question: http://stackoverflow.com/questions/34755244/apply-bincount-to-each-row-of-a-2d-numpy-array – maxymoo Jan 12 '16 at 23:36
  • Actually you can use a HashingVectorizer; take a look at the answer to that question. – maxymoo Jan 13 '16 at 00:49
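Following up on the last two comments, here is a sketch of the HashingVectorizer idea (assumptions not in the original thread: scikit-learn is installed, dataFrame holds the raw codes, and 2**20 hash buckets suffice; the token_pattern override keeps single-character codes such as '0', which the default pattern would drop):

from sklearn.feature_extraction.text import HashingVectorizer

# join each row's codes into one space-separated "document"
docs = dataFrame.astype(str).apply(' '.join, axis=1)

# binary=True yields 0/1 presence flags (the union semantics);
# norm=None keeps raw indicators instead of L2-normalized rows
vectorizer = HashingVectorizer(n_features=2**20, binary=True,
                               norm=None, token_pattern=r'\S+')
X = vectorizer.transform(docs)  # scipy.sparse matrix of shape (n_rows, 2**20)

This never materializes the dense dummy frame, at the cost of unnamed columns and possible hash collisions; pd.get_dummies(dataFrame.astype(str), sparse=True) is another memory-saving option that keeps the column names.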