5

I have a Pandas dataframe with data in a very wide form... for example:

ID  Equipment   Function    Task    exprt_cond1_time    exprt_cond2_time    exprt_cond1_freq    exprt_cond2_freq    novce_cond1_time    novce_cond2_time    novce_cond1_freq    novce_cond2_freq
0   eq_type_1   Fxn_a       task_1  12                  24                  0.031               0.055               15                  31                  0.042               0.059
1   eq_type_1   Fxn_a       task_2  10                  22                  0.028               0.052               12                  29                  0.039               0.055
2   eq_type_1   Fxn_b       task_3  13                  25                  0.033               0.057               18                  34                  0.047               0.062
3   eq_type_1   Fxn_b       task_4  9                   19                  0.027               0.051               10                  28                  0.038               0.054
4   eq_type_2   Fxn_a       task_1  14                  27                  0.036               0.056               16                  32                  0.043               0.061
5   eq_type_2   Fxn_a       task_2  11                  26                  0.030               0.054               14                  30                  0.041               0.058

but I am wanting to convert it to a more tidy long format using the text in the column label to make new columns...e.g., data from the first and last rows from the above might look something more like this:

ID  Equipment   Function    Task    Experience  Condition   Time    Freq
0   eq_type_1   Fxn_a       task_1  expert      cond1       12      0.031
1   eq_type_1   Fxn_a       task_1  expert      cond2       24      0.055
2   eq_type_1   Fxn_a       task_1  novice      cond1       15      0.042
3   eq_type_1   Fxn_a       task_1  novice      cond2       31      0.059
...
16  eq_type_2   Fxn_a       task_2  expert      cond1       11      0.030
17  eq_type_2   Fxn_a       task_2  expert      cond2       26      0.054
18  eq_type_2   Fxn_a       task_2  novice      cond1       14      0.041
19  eq_type_2   Fxn_a       task_2  novice      cond2       30      0.058

I can't figure out the right combination of melt / stack / reshape / MultiIndex or other translation functions to make this happen efficiently, or without my code becoming ugly, unwieldy, & nearly unreadable. This question and this question are close and help me some, but they only seem to convert based on a single attribute in the label. Would love any help or tips from the SO community!

CJH
  • 1,295
  • 1
  • 11
  • 15

3 Answers3

8

Let's try pd.wide_to_long twice with some column renaming make it all possible:

rename_d = {'exprt_cond1_time':'Time_exprt_cond1',
        'exprt_cond2_time':'Time_exprt_cond2',
        'exprt_cond1_freq':'Freq_exprt_cond1',
        'exprt_cond2_freq':'Freq_exprt_cond2',
        'novce_cond1_time':'Time_novce_cond1',
        'novce_cond2_time':'Time_novce_cond2',
        'novce_cond1_freq':'Freq_novce_cond1',
        'novce_cond2_freq':'Freq_novce_cond2'}

f = df.rename(columns=rename_d)

df1 = pd.wide_to_long(df, ['Time_exprt','Freq_exprt','Time_novce','Freq_novce'],i=['Equipment','Function','Task'],j='Condition',sep='_',suffix='.')

df1 = df1.reset_index()

df_out = pd.wide_to_long(df1,['Time','Freq'],i=['Equipment','Function','Task','Condition'],j='Experience',sep='_',suffix='').reset_index().drop('ID',axis=1)

Output:

    Equipment Function    Task Condition Experience  Time   Freq
0   eq_type_1    Fxn_a  task_1     cond1    exprt    12  0.031
1   eq_type_1    Fxn_a  task_1     cond1    novce    15  0.042
2   eq_type_1    Fxn_a  task_1     cond2    exprt    24  0.055
3   eq_type_1    Fxn_a  task_1     cond2    novce    31  0.059
4   eq_type_1    Fxn_a  task_2     cond1    exprt    10  0.028
5   eq_type_1    Fxn_a  task_2     cond1    novce    12  0.039
6   eq_type_1    Fxn_a  task_2     cond2    exprt    22  0.052
7   eq_type_1    Fxn_a  task_2     cond2    novce    29  0.055
8   eq_type_1    Fxn_b  task_3     cond1    exprt    13  0.033
9   eq_type_1    Fxn_b  task_3     cond1    novce    18  0.047
10  eq_type_1    Fxn_b  task_3     cond2    exprt    25  0.057
11  eq_type_1    Fxn_b  task_3     cond2    novce    34  0.062
12  eq_type_1    Fxn_b  task_4     cond1    exprt     9  0.027
13  eq_type_1    Fxn_b  task_4     cond1    novce    10  0.038
14  eq_type_1    Fxn_b  task_4     cond2    exprt    19  0.051
15  eq_type_1    Fxn_b  task_4     cond2    novce    28  0.054
16  eq_type_2    Fxn_a  task_1     cond1    exprt    14  0.036
17  eq_type_2    Fxn_a  task_1     cond1    novce    16  0.043
18  eq_type_2    Fxn_a  task_1     cond2    exprt    27  0.056
19  eq_type_2    Fxn_a  task_1     cond2    novce    32  0.061
20  eq_type_2    Fxn_a  task_2     cond1    exprt    11  0.030
21  eq_type_2    Fxn_a  task_2     cond1    novce    14  0.041
22  eq_type_2    Fxn_a  task_2     cond2    exprt    26  0.054
23  eq_type_2    Fxn_a  task_2     cond2    novce    30  0.058

pd.wide_to_long handles "simultaneous melts" in Pandas. First, we need to rename those columns to make the stubnames in pd.wide_to_long work.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
4

Here is my attempt - i tried to use pd.wide_to_long only once:

import re

cols = ['Equipment', 'Function', 'Task']

renamer = lambda c: re.sub(r'([^_]*)_(cond\d+)_(time|freq)', r'\3_\1_\2', c)

r = pd.wide_to_long(df.drop('ID',1).rename(columns=renamer),
                    stubnames=['time','freq'], i=cols, j='Measures',
                    sep='_',suffix='(?:exprt|novce)_cond\d+') \
      .reset_index()

r[['Experience','Condition']] = r.pop('Measures').str.split('_', expand=True)

Result:

In [192]: r
Out[192]:
    Equipment Function    Task  time   freq Experience Condition
0   eq_type_1    Fxn_a  task_1    12  0.031      exprt     cond1
1   eq_type_1    Fxn_a  task_1    24  0.055      exprt     cond2
2   eq_type_1    Fxn_a  task_1    15  0.042      novce     cond1
3   eq_type_1    Fxn_a  task_1    31  0.059      novce     cond2
4   eq_type_1    Fxn_a  task_2    10  0.028      exprt     cond1
5   eq_type_1    Fxn_a  task_2    22  0.052      exprt     cond2
6   eq_type_1    Fxn_a  task_2    12  0.039      novce     cond1
7   eq_type_1    Fxn_a  task_2    29  0.055      novce     cond2
8   eq_type_1    Fxn_b  task_3    13  0.033      exprt     cond1
9   eq_type_1    Fxn_b  task_3    25  0.057      exprt     cond2
10  eq_type_1    Fxn_b  task_3    18  0.047      novce     cond1
11  eq_type_1    Fxn_b  task_3    34  0.062      novce     cond2
12  eq_type_1    Fxn_b  task_4     9  0.027      exprt     cond1
13  eq_type_1    Fxn_b  task_4    19  0.051      exprt     cond2
14  eq_type_1    Fxn_b  task_4    10  0.038      novce     cond1
15  eq_type_1    Fxn_b  task_4    28  0.054      novce     cond2
16  eq_type_2    Fxn_a  task_1    14  0.036      exprt     cond1
17  eq_type_2    Fxn_a  task_1    27  0.056      exprt     cond2
18  eq_type_2    Fxn_a  task_1    16  0.043      novce     cond1
19  eq_type_2    Fxn_a  task_1    32  0.061      novce     cond2
20  eq_type_2    Fxn_a  task_2    11  0.030      exprt     cond1
21  eq_type_2    Fxn_a  task_2    26  0.054      exprt     cond2
22  eq_type_2    Fxn_a  task_2    14  0.041      novce     cond1
23  eq_type_2    Fxn_a  task_2    30  0.058      novce     cond2

Explanation:

In [198]: df.drop('ID',1).rename(columns=renamer)
Out[198]:
   Equipment Function    Task  time_exprt_cond1  time_exprt_cond2  freq_exprt_cond1  freq_exprt_cond2  time_novce_cond1  \
0  eq_type_1    Fxn_a  task_1                12                24             0.031             0.055                15
1  eq_type_1    Fxn_a  task_2                10                22             0.028             0.052                12
2  eq_type_1    Fxn_b  task_3                13                25             0.033             0.057                18
3  eq_type_1    Fxn_b  task_4                 9                19             0.027             0.051                10
4  eq_type_2    Fxn_a  task_1                14                27             0.036             0.056                16
5  eq_type_2    Fxn_a  task_2                11                26             0.030             0.054                14

   time_novce_cond2  freq_novce_cond1  freq_novce_cond2
0                31             0.042             0.059
1                29             0.039             0.055
2                34             0.047             0.062
3                28             0.038             0.054
4                32             0.043             0.061
5                30             0.041             0.058

In [199]: pd.wide_to_long(df.drop('ID',1).rename(columns=renamer),
     ...:                 stubnames=['time','freq'], i=cols,j='Measures',
     ...:                 sep='_',suffix='(?:exprt|novce)_cond\d+') \
     ...:   .reset_index()
     ...:
Out[199]:
    Equipment Function    Task     Measures  time   freq
0   eq_type_1    Fxn_a  task_1  exprt_cond1    12  0.031
1   eq_type_1    Fxn_a  task_1  exprt_cond2    24  0.055
2   eq_type_1    Fxn_a  task_1  novce_cond1    15  0.042
3   eq_type_1    Fxn_a  task_1  novce_cond2    31  0.059
4   eq_type_1    Fxn_a  task_2  exprt_cond1    10  0.028
5   eq_type_1    Fxn_a  task_2  exprt_cond2    22  0.052
6   eq_type_1    Fxn_a  task_2  novce_cond1    12  0.039
7   eq_type_1    Fxn_a  task_2  novce_cond2    29  0.055
8   eq_type_1    Fxn_b  task_3  exprt_cond1    13  0.033
9   eq_type_1    Fxn_b  task_3  exprt_cond2    25  0.057
10  eq_type_1    Fxn_b  task_3  novce_cond1    18  0.047
11  eq_type_1    Fxn_b  task_3  novce_cond2    34  0.062
12  eq_type_1    Fxn_b  task_4  exprt_cond1     9  0.027
13  eq_type_1    Fxn_b  task_4  exprt_cond2    19  0.051
14  eq_type_1    Fxn_b  task_4  novce_cond1    10  0.038
15  eq_type_1    Fxn_b  task_4  novce_cond2    28  0.054
16  eq_type_2    Fxn_a  task_1  exprt_cond1    14  0.036
17  eq_type_2    Fxn_a  task_1  exprt_cond2    27  0.056
18  eq_type_2    Fxn_a  task_1  novce_cond1    16  0.043
19  eq_type_2    Fxn_a  task_1  novce_cond2    32  0.061
20  eq_type_2    Fxn_a  task_2  exprt_cond1    11  0.030
21  eq_type_2    Fxn_a  task_2  exprt_cond2    26  0.054
22  eq_type_2    Fxn_a  task_2  novce_cond1    14  0.041
23  eq_type_2    Fxn_a  task_2  novce_cond2    30  0.058
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
0

One option is with pivot_longer from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index = slice('ID','Task'), 
                names_to = ('Experience', 'Condition', '.value'), 
                names_sep = '_')

    ID  Equipment Function    Task Experience Condition  time   freq
0    0  eq_type_1    Fxn_a  task_1      exprt     cond1    12  0.031
1    1  eq_type_1    Fxn_a  task_2      exprt     cond1    10  0.028
2    2  eq_type_1    Fxn_b  task_3      exprt     cond1    13  0.033
3    3  eq_type_1    Fxn_b  task_4      exprt     cond1     9  0.027
4    4  eq_type_2    Fxn_a  task_1      exprt     cond1    14  0.036
5    5  eq_type_2    Fxn_a  task_2      exprt     cond1    11  0.030
6    0  eq_type_1    Fxn_a  task_1      exprt     cond2    24  0.055
7    1  eq_type_1    Fxn_a  task_2      exprt     cond2    22  0.052
8    2  eq_type_1    Fxn_b  task_3      exprt     cond2    25  0.057
9    3  eq_type_1    Fxn_b  task_4      exprt     cond2    19  0.051
10   4  eq_type_2    Fxn_a  task_1      exprt     cond2    27  0.056
11   5  eq_type_2    Fxn_a  task_2      exprt     cond2    26  0.054
12   0  eq_type_1    Fxn_a  task_1      novce     cond1    15  0.042
13   1  eq_type_1    Fxn_a  task_2      novce     cond1    12  0.039
14   2  eq_type_1    Fxn_b  task_3      novce     cond1    18  0.047
15   3  eq_type_1    Fxn_b  task_4      novce     cond1    10  0.038
16   4  eq_type_2    Fxn_a  task_1      novce     cond1    16  0.043
17   5  eq_type_2    Fxn_a  task_2      novce     cond1    14  0.041
18   0  eq_type_1    Fxn_a  task_1      novce     cond2    31  0.059
19   1  eq_type_1    Fxn_a  task_2      novce     cond2    29  0.055
20   2  eq_type_1    Fxn_b  task_3      novce     cond2    34  0.062
21   3  eq_type_1    Fxn_b  task_4      novce     cond2    28  0.054
22   4  eq_type_2    Fxn_a  task_1      novce     cond2    32  0.061
23   5  eq_type_2    Fxn_a  task_2      novce     cond2    30  0.058

The .value in names_to determines which parts of the column labels stay as headers, while the others are collated into Experience and Condition columns.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31