Changing the structure of dataframe

Question

I have a dataframe like this:

id|sem|stu
1|sem2|A
1|sem1|B
1|sem1|C
2|sem1|A
2|sem3|A

I want this to be re-modeled like this:

   sem1|sem2|sem3|sem4
1  B|A|NaN|NaN
1  C|A|NaN|NaN
2  A|NaN|A|NaN

I have tried the following code:

df.pivot(index='stu',columns='id',values='sub')

The Output is not what I expected.Could someone give their ideas?

`df.pivot_table(index = 'id',columns = 'sem',values = 'stu').rename_axis(columns = None,index = None)` — ansev, Jan 10 '20 at 18:10
@ansev Throws an exception DataError: No numeric types to aggregate — Sri Test, Jan 10 '20 at 18:17
Ups, sorry, use: `df.pivot_table(index = 'id',columns = 'sem',values = 'stu',aggfunc = ''.join).rename_axis(columns = None,index = None)` — ansev, Jan 10 '20 at 18:18
https://stackoverflow.com/questions/39229005/pivot-table-no-numeric-types-to-aggregate — ansev, Jan 10 '20 at 18:21
@SriTest Can you update the dataframe in the question to show why the suggested answers didnot work(This will be beneficial to future readers as well) , thanks — anky, Jan 10 '20 at 18:28
@ansev It worked.but when I have more values for 'sem' column,as it is aggregate,it joins them as a single one.Is it possible to display as separate rows.For example,for sem1 if id=1 have A and B too,is it possible to display both of them as separate values. — Sri Test, Jan 10 '20 at 18:38

ansev · Accepted Answer · 2020-01-10T21:12:34.577

I think you need DataFrame.pivot_table witg aggfunc = ''.join or another that is valid for str type.

new_df = (df.pivot_table(index = 'id',columns = 'sem',
                         values = 'stu',aggfunc = ''.join)
            .rename_axis(columns = None,index = None))
print(new_df)
  sem1 sem2 sem3
1    B    A  NaN
2    A  NaN    A

You could use another function to treat the values deduplicated for the same ID and sem, for example first, although the way to not lose information here is ''.join

UPDATE

print(df)
   id   sem stu
0   1  sem2   A
1   1  sem1   B
2   1  sem1   A
3   2  sem1   A
4   2  sem3   A

new_df=( df.assign(count=df.groupby(['id','sem']).cumcount())
           .pivot_table(index = 'id',columns = ['sem','count'],
                        values = 'stu',aggfunc = ''.join)
           .rename_axis(columns = [None,None],index = None) )
print(new_df)
  sem1      sem2 sem3
     0    1    0    0
1    B    A    A  NaN
2    A  NaN  NaN    A

new_df=( df.assign(count=df.groupby(['id','sem']).cumcount())
           .pivot_table(index = ['id','count'],columns = 'sem',
                        values = 'stu',aggfunc = ''.join)
           .rename_axis(columns = None,index = [None,None]) )
print(new_df)
    sem1 sem2 sem3
1 0    B    A  NaN
  1    A  NaN  NaN
2 0    A  NaN    A

Solution without MultIndex:

new_df=( df.assign(count=df.groupby(['id','sem']).cumcount())
           .pivot_table(index = 'id',columns = ['sem','count'],
                        values = 'stu',aggfunc = ''.join)
           .rename_axis(columns = [None,None],index = None) )

#Solution with duplicates names of columns
#new_df.columns = new_df.columns.droplevel(1)
#  sem1 sem1 sem2 sem3
#1    B    C    A  NaN
#2    A  NaN  NaN    A

new_df.columns = [f'{x}_{y}' for x,y in new_df.columns]
print(new_df)
  sem1_0 sem1_1 sem2_0 sem3_0
1      B      C      A    NaN
2      A    NaN    NaN      A

true, really the function to apply depends on how his data is — ansev, Jan 10 '20 at 18:32
It worked.but when I have more values for 'sem' column,as it is aggregate,it joins them as a single one.Is it possible to display as separate rows.For example,for sem1 if id=1 have A and B too,is it possible to display both of them as separate values. — Sri Test, Jan 10 '20 at 18:40
Check update section:) In that case I am forced to duplicate the names of the columns (I do not recommend it) or use MultiIndex — ansev, Jan 10 '20 at 18:43
@ansev MultiIndex is not ideal for me in this scenario. Is there any other workaround? — Sri Test, Jan 10 '20 at 19:44
check solution without MultiIndex, another option is `new_df.columns = new_df.columns.droplevel(1)`, but it gets a duplicates columns — ansev, Jan 10 '20 at 21:13

score 1 · Answer 2 · answered Jan 10 '20 at 18:09

1

Looks like you got wrong the arguments for pivot:

df.pivot('id', 'sem', 'stu')

sem sem1 sem2 sem3
id                
1      B    A  NaN
2      A  NaN    A

answered Jan 10 '20 at 18:09

yatu

86,083
12
84
139

shows an exception as my id contains duplicate entries – Sri Test Jan 10 '20 at 18:14
Should't be a problem for pivot. If anything, its for these cases that it most useful @sri – yatu Jan 10 '20 at 18:15
i agree this is not a dupe since the OP already knew about pivot. However implementing , hence not a dupe +1 – anky Jan 10 '20 at 18:17
Thanks @anky sure I do agree with helping out, perhaps it is fine to close as dupe once the issue is solved thoug :) – yatu Jan 10 '20 at 18:18
@anky_91 ValueError: Index contains duplicate entries, cannot reshape. This is the error I got – Sri Test Jan 10 '20 at 18:21
@yatu may be groupby +cumcount with pivot ,else pivot table. but unclear with existing example :) – anky Jan 10 '20 at 18:21

score 0 · Answer 3 · answered Jan 10 '20 at 18:27

It was simply an error in your parameter args, you 'values' param is incorrect, it should be 'stu', not 'sub'.

You're trying to go from long to wide format, where you're indexing by index, setting your columns with columns, and filling indices with the values parameter.

Docs Link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html

id = [1,1,2,2]
sem = ['sem2','sem1','sem1','sem3']
stu = ['A','B','A','A']
data = {'id':id, 'sem':sem, 'stu':stu}

df = pd.DataFrame(data)

df.pivot(index='id', columns='sem', values='stu')

sem sem1 sem2 sem3
id                
1      B    A  NaN
2      A  NaN    A

Changing the structure of dataframe

3 Answers3