1

Input Table

index income Education age1to_20 pcd
1 income_1 Education_0 1 A5009
2 income_2 Education_2 1 A3450
3 income_1 Education_0 1 A5009
4 income_3 Education_1 0 A3450

How do I convert this table into

index income_1 income_2 INCOME_3 Education_0 Education_1 Education_2 age1to_20
1 A5009 0 0 A5009 0 0 A5009
2 0 A3450 0 0 0 A3450 A3450
3 A5009 0 0 A5009 0 0 A5009
4 0 0 A3450 0 A3450 0 0

UPDATED THE OUTPUT TABLE

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
nuke
  • 45
  • 6

4 Answers4

4

Another possible solution:

(pd.concat([
    df.pivot(index=['index', 'age1to_20'], columns=['income'], values='pcd'),
    df.pivot(index=['index', 'age1to_20'], columns=['Education'], values='pcd')], axis=1)
 .fillna(0).reset_index())

Output:

   index  age1to_20 income_1 income_2 income_3 Education_0 Education_1 Education_2
0      1          1    A5009        0        0       A5009           0           0
1      2          1        0    A3450        0           0           0       A3450
2      3          1    A5009        0        0       A5009           0           0
3      4          0        0        0    A3450           0       A3450           0

EDIT

In case there are a lot of columns to pivot, the following code does that, by iterating over the list of columns to pivot with map:

cols = ['income', 'Education']

(pd.concat(
    map(lambda x: df.pivot(
        index=['index', 'age1to_20'], columns=x, values='pcd'), cols), axis=1)
 .fillna(0).reset_index())
PaulS
  • 21,159
  • 2
  • 9
  • 26
  • 1
    What if I have 20 columns that I need to pivot? Do I have to concat 20 pivot tables? is there a better way to do this? – nuke Nov 09 '22 at 10:13
  • Yes, @nuke, but that can be fully automated: Please, see my edited solution. – PaulS Nov 09 '22 at 10:25
2

you need to melt and pivot:

cols_to_pivot = ['income', 'Education']

cols = df.columns.difference(cols_to_pivot)
(df.melt(cols).drop(columns='variable')
   .pivot(index=['index', 'age1to_20'], columns='value', values='pcd')
   .fillna(0).reset_index()
)

output:

value  index  age1to_20 Education_0 Education_1 Education_2 income_1 income_2 income_3
0          1          1       A5009           0           0    A5009        0        0
1          2          1           0           0       A3450        0    A3450        0
2          3          1       A5009           0           0    A5009        0        0
3          4          0           0       A3450           0        0        0    A3450
reworking column age1to_20
cols_to_pivot = ['income', 'Education']

cols = df.columns.difference(cols_to_pivot)
(df.assign(age1to_20=df['pcd'].where(df['age1to_20'].eq(1), 0))
   .melt(cols).drop(columns='variable')
   .pivot(index=['index', 'age1to_20'], columns='value', values='pcd')
   .fillna(0).reset_index()
)

output:

value  index age1to_20 Education_0 Education_1 Education_2 income_1 income_2 income_3
0          1     A5009       A5009           0           0    A5009        0        0
1          2     A3450           0           0       A3450        0    A3450        0
2          3     A5009       A5009           0           0    A5009        0        0
3          4         0           0       A3450           0        0        0    A3450
mozway
  • 194,879
  • 13
  • 39
  • 75
  • what should I update to get pcd value filled for age1to_20 column? – nuke Nov 09 '22 at 14:19
  • what do you mean? You would like to have A5009/A3450/A5009/A3450 in place of 1/1/1/0? – mozway Nov 09 '22 at 14:22
  • yes. updated table. A5009/A3450/A5009/0 – nuke Nov 09 '22 at 14:44
  • only issue is I am getting A5009 and A3450 as new columns – nuke Nov 09 '22 at 15:27
  • @nuke it doesn't do it with the provided example, if you want me to debug, please provide a DataFrame constructor that reproduces the issue – mozway Nov 09 '22 at 15:34
  • yes. it's correct. what if there are multiple columns like 'age1to_20'. Is there a better way to write rather than writing df.assign for every single column? – nuke Nov 10 '22 at 01:43
1

One option is to pivot once, run a for loop across the dataframe to get the individual dataframes, and concatenate across the columns, to get the final output:

out = df.pivot(index=['index', 'age1to_20'], 
               columns=['income', 'Education'], 
               values = 'pcd')

out = [out.droplevel(axis=1,level=n) 
       for n in range(out.columns.nlevels)]

pd.concat(out, axis = 1).fillna(0, downcast='infer').reset_index()

   index  age1to_20 Education_0 Education_2 Education_1 income_1 income_2 income_3
0      1          1       A5009           0           0    A5009        0        0
1      2          1           0       A3450           0        0    A3450        0
2      3          1       A5009           0           0    A5009        0        0
3      4          0           0           0       A3450        0        0    A3450
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
0

pandas.get_dummies allows us to get almost the desired output, but instead of getting the values as the pcd column, one has 0 or 1 for a given pcd

import pandas as pd

df2 = pd.get_dummies(df, columns=['income', 'Education'])

[Out]:

   index  age1to_20  ... Education_Education_1  Education_Education_2
0      1          1  ...                     0                      0
1      2          1  ...                     0                      1
2      3          1  ...                     0                      0
3      4          0  ...                     1                      0

enter image description here

Then, in order to change the 1s with the respective pcd value (and drop the column pcd at the end), one can use pandas.Series.mask and pandas.DataFrame.drop as follows

cols = df2.columns.drop(['index', 'pcd']) # To select the columns one wants to apply the change
df2[cols] = df2[cols].mask(df2[cols] == 1, df2['pcd'], axis=0)
df2 = df2.drop(['pcd'], axis=1) 

[Out]:

   index age1to_20  ... Education_Education_1 Education_Education_2
0      1     A5009  ...                     0                     0
1      2     A3450  ...                     0                 A3450
2      3     A5009  ...                     0                     0
3      4         0  ...                 A3450                     0

enter image description here


Notes:

  • An alternative to change the 1s to the respective pcd and drop the column pcd would be to use pandas.DataFrame.assign, numpy.where and pandas.DataFrame.drop as follows

    import numpy as np
    
    df2 = df2.assign(income_1 = np.where(df2['income_income_1'] == 1, df2['pcd'], 0),
                  income_2 = np.where(df2['income_income_2'] == 1, df2['pcd'], 0),
                  income_3 = np.where(df2['income_income_3'] == 1, df2['pcd'], 0),
                  Education_0 = np.where(df2['Education_Education_0'] == 1, df2['pcd'], 0),
                  Education_1 = np.where(df2['Education_Education_1'] == 1, df2['pcd'], 0),
                  Education_2 = np.where(df2['Education_Education_2'] == 1, df2['pcd'], 0),
                  Age1to20 = np.where(df2['age1to_20'] == 1, df2['pcd'], 0)).drop(['age1to_20', 'income_income_1', 'income_income_2', 'income_income_3', 'Education_Education_0', 'Education_Education_1', 'Education_Education_2', 'pcd'], axis=1)
    
    [Out]:
    
       index income_1 income_2  ... Education_1 Education_2 Age1to20
    0      1    A5009        0  ...           0           0    A5009
    1      2        0    A3450  ...           0       A3450    A3450
    2      3    A5009        0  ...           0           0    A5009
    3      4        0        0  ...       A3450           0        0
    

    enter image description here

  • There are additional ways to select the columns one wants to consider in a given dataframe. For that this might be useful: Selecting multiple columns in a Pandas dataframe

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83