0

Suppose I have a data frame like this:

   A   B  C  D
0  1  10  x  5
1  1  20  y  5
2  1  30  z  5
3  2  40  x  6
4  2  50  y  6
5  2  60  z  6

This, can be viewed, as a table that stores the value of B as a function of A, C, and D. Now, I would look like to transform the B column into three columns B_x, B_y, B_z, like this:

   A  B_x  B_y  B_z  D
0  1   10   20   30  5
1  2   40   50   60  6

I.e., B_x stores B(A, D) when C = 'x', B_y stores B(A, D) when C = 'y', etc.

What is the most efficient way to do this?

I have a found a solution like this:

frames = []

for c, subframe in df.groupby('C'): 
     subframe = subframe.rename(columns={'B': f'B_{c}'}) 
     subframe = subframe.set_index(['A', 'D']) 
     del subframe['C'] 
     frames.append(subframe) 

out = frames[0]
for frame in frames[1:]:
    out = out.join(frame)

out = out.reset_index()

This gives the correct response, but I feel that it is highly inefficient. I am also not too happy with the fact that to implement this solution one would need to know which columns should not get the prefix in column C explicitly. (In this MWE there were only two of them, but there could be tens in real life.)

Is there a better solution? I.e., a method that says, take a column as a suffix column (in this case C) and a set of 'value' columns (in this case only B); turn the value column names into name_prefix and fill them appropriately?

MikeL
  • 2,369
  • 2
  • 24
  • 38
  • Try `pivot`: ``df.pivot(['A', 'D'], 'C').pipe(lambda df: df.set_axis(df.columns.map('_'.join), axis = 'columns'))`` – sammywemmy Aug 30 '21 at 09:54

1 Answers1

0

Here's one way to do it:

import pandas as pd
df = pd.DataFrame( data = {'A':[1,1,1,2,2,2],
                           'B':[10,20,30,40,50,60],
                           'C':['x','y','z','x','y','z'],
                           'D':[5,5,5,6,6,6]})

df2 = df.pivot_table( index=['A','D'],
                      columns=['C'],
                      values=['B']
                     )

df2.columns = ['_'.join(col) for col in df2.columns.values]
df2 = df2.reset_index()