0

I have a data frame like:

person_ID, first_name, last_name, feature_1, feature_2, feature_3
1, John, Talbert, 1,2,3
2, Ted, Thompson, 4,5,6
1, John, Talbert, 7,8,9
2, Ted, Thompson, 13,14,15

And I'd like to re-format it like:

person_ID, first_name, last_name, feature_1_A, feature_2_A, feature_3_A, feature_1_B, feature_2_B, feature_3_B, 
1, John, Talbert, 1,2,3, 7,8,9
2, Ted, Thompson, 4,5,6,13,14,15

Any idea what would be an efficient way to do this? The dataset is small so it doesn't really have to be super efficient.

Thank you in advance for your help.

khemedi
  • 774
  • 3
  • 9
  • 19

3 Answers3

2

This is essentially pivot and rename:

df['col'] = df.groupby('person_ID').cumcount()

out=df.pivot_table(index=['person_ID','first_name','last_name'], 
                   columns='col', aggfunc='first')
out.columns = [f'{x}_{y}' for x,y in out.columns]
out = out.reset_index()

Output:

   person_ID first_name last_name  feature_1_0  feature_1_1  feature_2_0  feature_2_1  feature_3_0  feature_3_1
0          1       John   Talbert            1            7            2            8            3            9
1          2        Ted  Thompson            4           13            5           14            6           15
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
2

You can also do it like this:

idx = df.groupby(['person_ID','first_name', 'last_name']).cumcount().map({0:'A',1:'B'})

df_out = df.set_index(['person_ID', 'first_name', 'last_name', idx]).unstack().sort_index(level=1, axis=1)

df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_ind

ex()

Output:

   person_ID first_name last_name  feature_1_A  feature_2_A  feature_3_A  feature_1_B  feature_2_B  feature_3_B
0          1       John   Talbert            1            2            3            7            8            9
1          2        Ted  Thompson            4            5            6           13           14           15
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
1

As suggested by @Quang Hoang in comments, here is the set_index().unstack() approach.

df['col'] = df.groupby('person_ID').cumcount()
out = df.set_index(['person_ID', 'first_name', 'last_name','col']).unstack()
out.columns = [f'{x}_{y}' for x,y in out.columns]
out = out.reset_index()
print(out)
   person_ID  first_name  last_name   feature_1_0   feature_1_1   feature_2_0  \
0          1        John    Talbert             1             7             2   
1          2         Ted   Thompson             4            13             5   

    feature_2_1   feature_3_0   feature_3_1  
0             8             3             9  
1            14             6            15  
Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
  • yea just fixed that. thanks a ton. already getting a hang of the `set_index().unstack()`, alternative to pivots (which i dearly hate). – Akshay Sehgal Feb 18 '21 at 03:14