4

I've searched quite a bit and I can't seem to find something along the line of pivot functionality for my particular problem. I'll convey a simple example of what I'm looking for:

Long Table

dependent_variable  step a  b
         5.5          1  20 30
         5.5          2  25 37
         6.1          1  22 19
         6.1          2  18 29

Desired Wide Table

dependent_variable   a_step1 a_step2 b_step1  b_step2
         5.5            20       25      30       37
         6.1            22       18      19       29

Effectively I would like to pivot on the Step column, and to make the column name for the rest of the independent variables (in this case a and b) include the step number and the a/b value associated with it.

Once pivoted, then I will use the dependent variable column and as a numpy array and the newly pivoted dependent variables to feed into various machine learning algorithms.

When I attempted piRSquared's suggestion (thank you) I got the error: Index contains duplicate entries, cannot reshape.

I then tried (from Here)

d1 =data.set_index(['dependent_variable','step'], append=True).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)

And (using the example table) got the following:

level_0   dependent_variable a_step1 a_step2 b_step1 b_step2
  1               5.5           20      NaN    30       NaN
  2               5.5           NaN     25     NaN      37
  3               6.1           22      NaN    19       NaN
  4               6.1           NaN     18     NaN      29

So, I'm still missing a step

Community
  • 1
  • 1
chemnteach
  • 375
  • 8
  • 23
  • 2
    Someone else may understand what you're after, but I don't. I pass on trying to answer this unless you do a better job presenting what you have, what you want, and what you've tried. I'm going to take a guess that others may feel similarly. Read [MCVE](http://stackoverflow.com/help/mcve), [HowToAsk](http://stackoverflow.com/help/how-to-ask), and [GoodQuestions](http://stackoverflow.com/q/20109391/2336654) to get a better idea on how to maximize your chances of getting your question answered. – piRSquared Jan 03 '17 at 20:07
  • Thank you for the feedback, does the newer revision make more sense? – chemnteach Jan 03 '17 at 21:59
  • @chemnteach, It has been a while, so hopefully you still know. How exactly did you solve the `NaN`'s throughout the table? I tried all solutions in this post, but I can't get it fixed. Probably getting the order or the commands wrong or something alike. – Lena Jun 18 '19 at 13:57

3 Answers3

3

Assuming the name of your dataframe is df and dependent_variable, step are not already in the index

d1 = df.set_index(['dependent_variable', 'step']).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)

print(d1)

   dependent_variable  a_step1  a_step2  b_step1  b_step2
0                 5.5       20       25       30       37
1                 6.1       22       18       19       29
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

It looks like you are looking for pd.pivot

"If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column- https://pandas.pydata.org/pandas-docs/stable/reshaping.html."

df = pd.DataFrame({'dependent_variable':[5.5,5.5,6.1,6.1],
          'step':[1,2,1,2],
          'a':[20,25,22,18],
          'b':[30,37,19,29],
         })
df = df.pivot(index='dependent_variable',
     columns='step')

yields

        a       b
step    1   2   1   2
dependent_variable              
5.5     20  25  30  37
6.1     22  18  19  29

it has a hierarchical index, which might be more helpful than the output you indicated. However, you can change to a single column index by

df.columns = df.columns.tolist()

the columns don't have the exact names you wanted, but you could then rename.

DVL
  • 354
  • 3
  • 9
0

Note that if your combination of index and columns is duplicated, the solution preferred below won't work, as internally it relies on pd.pivot, which fails on duplicate index/column combo.

One option is with pivot_wider from pyjanitor, using the names_glue parameter to reshape the column names:

# currently in dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

df.pivot_wider(
    index = 'dependent_variable', 
    names_from = 'step', 
    values_from = ['a', 'b'],
    names_glue = "{_value}_step{step}"
  )

   dependent_variable  a_step1  a_step2  b_step1  b_step2
0                 5.5       20       25       30       37
1                 6.1       22       18       19       29

in the names_glue string template, _value serves as a placeholder for values from values_from, which are a and b. Anything in the {} brackets should either be from names_from or _value to represent values_from.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31