
This is a follow-up question to "Pivot a dataframe with two columns as the index".

My data is in this format:

Record ID  Para  Col2  Col3
1          A     x     a
1          A     x     b
2          B     y     a
2          B     y     b
1          A     z     c
1          C     x     a

I would like to reshape it into:

Record  Para  a  b  c  x  y  z
1       A     1  1  1  1  0  1
1       C     1  1  1  1  0  1
2       B     1  1  0  0  1  0

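For reference, the sample frame can be built like this (a minimal sketch; column names follow the table above, while the answers below write the first column as RecordID, without the space):

    import pandas as pd

    # Sample input as shown in the question
    csv2 = pd.DataFrame({
        'Record ID': [1, 1, 2, 2, 1, 1],
        'Para':      ['A', 'A', 'B', 'B', 'A', 'C'],
        'Col2':      ['x', 'x', 'y', 'y', 'z', 'x'],
        'Col3':      ['a', 'b', 'a', 'b', 'c', 'a'],
    })
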
I tried

    csv3 = csv2.pivot_table(index=['Record ID', 'Para'], columns=csv2.iloc[:,2:], aggfunc='size', fill_value=0).reset_index()

but I don't get the columns right. What do I need to do differently?

UPDATE 1:

I have 10s of columns.


2 Answers


You can aggregate each column to a set and then use get_dummies.

# Collect the values of each non-key column (everything after the first two)
# into a set per (RecordID, Para)
df2 = df.groupby(['RecordID', 'Para'])[df.columns[2:]].aggregate(set)

# Union the per-column sets row-wise, then expand them into 0/1 indicator columns
values = df2.apply(lambda x: set().union(*x.values), axis=1)
dummies = values.str.join('|').str.get_dummies()

res = dummies.reset_index()

print(res)

   RecordID Para  a  b  c  x  y  z
0         1    A  1  1  1  1  0  1
1         2    B  1  1  0  0  1  0
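
Since the question mentions tens of columns, the value columns can also be picked by name rather than by position; a small sketch, assuming the key columns are called RecordID and Para as in the code above:

# Select every column that is not one of the two keys
value_cols = [c for c in df.columns if c not in ('RecordID', 'Para')]
df2 = df.groupby(['RecordID', 'Para'])[value_cols].aggregate(set)
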
  • Edited my comment. This approach seems to be cumbersome if I have 10s of columns – kurious Jun 13 '18 at 00:34
  • @kurious, Then you can use a list comprehension instead of explicitly defining your list. But please update your question specifying *all* your requirements. – jpp Jun 13 '18 at 00:35

IIUC, use get_dummies:

pd.get_dummies(df.set_index(['RecordID','Para']),prefix='',prefix_sep = '').sum(level=[0,1]).gt(0).astype(int)
Out[272]: 
               x  y  z  a  b  c
RecordID Para                  
1        A     1  0  1  1  1  1
2        B     0  1  0  1  1  0
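
On recent pandas versions the level argument of sum has been removed, so the same chain can be written with an explicit groupby on the index levels (a sketch, assuming the same df):

# groupby on the index levels replaces the removed sum(level=...) call
out = (pd.get_dummies(df.set_index(['RecordID', 'Para']), prefix='', prefix_sep='')
       .groupby(level=[0, 1]).sum()
       .gt(0)
       .astype(int))
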

Update

pd.get_dummies(df.set_index(['RecordID','Para']),prefix='',prefix_sep = '').sum(level=[0,1]).gt(0).astype(int).replace(0,np.nan).groupby(level=0).ffill().fillna(0)
Out[292]: 
                 x    y    z  a    b    c
RecordID Para                            
1        A     1.0  0.0  1.0  1  1.0  1.0
2        B     0.0  1.0  0.0  1  1.0  0.0
1        C     1.0  0.0  1.0  1  1.0  1.0
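
What the Update adds over the original solution is the replace/ffill step: zeros become NaN and are forward-filled within each RecordID, so a later Para row (1/C here) inherits the attributes already recorded for that record (1/A), and the leftover NaN go back to 0. Reusing out from the sketch further up, and assuming import numpy as np, roughly:

import numpy as np

# out is the 0/1 frame from the previous sketch (get_dummies + groupby + gt(0))
filled = (out.replace(0, np.nan)       # hide the zeros so ffill can overwrite them
          .groupby(level=0).ffill()    # within each RecordID, copy earlier Para rows forward
          .fillna(0)                   # anything the record never had goes back to 0
          .astype(int))
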
  • Upon trying your solution, I realized I omitted data where a `Record ID` can have multiple `Para`s. Can you please update your solution as appropriate? – kurious Jun 13 '18 at 01:56
  • @kurious Check the update. I do not think your expected output is reasonable, can you explain it? – BENY Jun 13 '18 at 01:58
  • Sure. My record's `Para` is my outcome of interest. I'm trying to predict it using the `Cols`. As a record may have more than 1 para, I want to keep the rest of attributes identical to see if each of the `Para`s can be correctly identified. – kurious Jun 13 '18 at 02:04
  • Also, I don't see a difference between your original solution and the update. Am I missing something here? – kurious Jun 13 '18 at 02:07
  • Can you explain why C has the same values as A in your output? @kurious – BENY Jun 13 '18 at 02:08
  • The record identifier is 1; A and C are its Paras. All the columns are attributes of 1, not `A` or `C`, hence they are identical for the record. – kurious Jun 13 '18 at 02:09
  • hi @wen. I just realized an issue. In my output dataframe, there are some columns with no headers and `1` for almost all records. This is very unlike other columns. I'm unable to debug what's causing the issue. The least amount of code that I've seen cause the issue is: `test_df = pd.get_dummies(df.set_index(['RecordID','Para']),prefix='',prefix_sep = '').sum(level=[0,1]).gt(0).astype(int)` – kurious Jun 19 '18 at 04:00
  • It's most likely the blanks – kurious Jun 19 '18 at 04:49
  • @kurious What are you going to do with those blanks? Drop them? – BENY Jun 19 '18 at 23:54
  • Yeah. I want to drop the blanks. – kurious Jun 20 '18 at 01:51
  • Then `df.replace('', np.nan).dropna()` @kurious – BENY Jun 20 '18 at 02:30