0

I was hoping someone could help me convert my current dataframe from a wide to long format. I am using Pandas 0.18.0 and I can't seem to find any other solution on stackoverflow that fits my need.

Any help would be greatly appreciated!

I have 50 steps each with two categories(status/time) that I need to melt, these categories alternate in my dataframe. Below is an example with only 3 sets but this pattern continues until it reaches 50.

status can be either: yes/no/NaN

time can be either: timestamp/NaN

Current Dataframe:

       cl_id  cl_template_id status-1 time-1                     status-2 time-2                     status-3 time-3                    
0      18434   107            NaN                            NaN  NaN                            NaN  NaN                            NaN
1      18280   117            yes      2016-12-28T18:21:58+00:00  yes      2016-12-28T20:47:31+00:00  yes      2016-12-28T20:47:32+00:00
2      18356   413            yes      2017-01-11T19:23:10+00:00  yes      2017-01-11T19:23:11+00:00  yes      2017-01-11T19:23:11+00:00
3      18358   430            NaN                            NaN  NaN                            NaN  NaN                            NaN
4      18359   430            yes      2017-01-11T19:20:32+00:00  yes      2017-01-11T19:20:34+00:00  NaN                            NaN
.
.
.

Target Dataframe:

cl_id cl_template_id   step   status   time
18434 107               1      NaN      NaN
18434 107               2      NaN      NaN
18434 107               3      NaN      NaN
18280 117               1      yes      2016-12-28T18:21:58+00:00
18280 117               2      yes      2016-12-28T20:47:31+00:00
18280 117               3      yes      2016-12-28T20:47:32+00:00
18356 413               1      yes      2017-01-11T19:23:10+00:00
18356 413               2      yes      2017-01-11T19:23:11+00:00
18356 413               3      yes      2017-01-11T19:23:11+00:00
.
.
.
Jeremy L
  • 1
  • 1

2 Answers2

0

Hopefully this answer provides some insight to the problem.

First, I'll recreate an example from your dataframe:

# Make example dataframe
df = pd.DataFrame({'cl_id' : [18434, 18280, 18356, 18358, 18359],
                   'cl_template_id' : [107, 117, 413, 430, 430],
                   'status_1' : [np.NaN, 'yes', 'yes', np.NaN, 'yes'],
                   'time_1' : [np.NaN, '2016-12-28T18:21:58+00:00', '2017-01-11T19:23:10+00:00', np.NaN, '2017-01-11T19:20:32+00:00'],
                   'status_2' : [np.NaN, 'yes', 'yes', np.NaN, 'yes'],
                   'time_2' : [np.NaN, '2016-12-28T20:47:31+00:00', '2017-01-11T19:23:11+00:00', np.NaN, '2017-01-11T19:20:34+00:00'],
                   'status_3' : [np.NaN, 'yes', 'yes', np.NaN, np.NaN],
                   'time_3' : [np.NaN, '2016-12-28T20:47:32+00:00', '2017-01-11T19:23:11+00:00', np.NaN, np.NaN]})

Second, convert time_1,2,3 into datetimes:

# Convert time_1,2,3 to datetime
df.loc[:, 'time_1'] = pd.to_datetime(df.loc[:, 'time_1'])
df.loc[:, 'time_2'] = pd.to_datetime(df.loc[:, 'time_2'])
df.loc[:, 'time_3'] = pd.to_datetime(df.loc[:, 'time_3'])

Third, split the dataframe into two, one with status and the other with time:

# Split df into a status, time dataframe
df_status = df.loc[:, :'status_3']
df_time = df.loc[:, ['cl_id', 'cl_template_id']].merge(df.loc[:, 'time_1':],
                                                       left_index = True,
                                                       right_index = True)

Fourth, melt the status and time dataframes:

# Melt status
df_status = df_status.melt(id_vars = ['cl_id',
                                      'cl_template_id'],
                           value_vars = ['status_1',
                                         'status_2',
                                         'status_3'],
                           var_name = 'step',
                           value_name = 'status')

# Melt time
df_time = df_time.melt(id_vars = ['cl_id',
                                  'cl_template_id'],
                       value_vars = ['time_1',
                                     'time_2',
                                     'time_3'],
                       var_name = 'step',
                       value_name = 'time')

Fifth, clean up the 'step' column in both the status and time dataframes to only keep the number:

# Clean step in status, time
df_status.loc[:, 'step'] = df_status.loc[:, 'step'].str.partition('_')[2]
df_time.loc[:, 'step'] = df_time.loc[:, 'step'].str.partition('_')[2]

Sixth, merge the status and time dataframes back together into the final dataframe:

# Merge status, time back together on cl_id, cl_template_id
final = df_status.merge(df_time,
                        how = 'inner',
                        on = ['cl_id',
                              'cl_template_id',
                              'step']).sort_values(by = ['cl_template_id',
                                                         'cl_id']).reset_index(drop = True)

Voila! The answer you were looking for:

final df

Ian Thompson
  • 2,914
  • 2
  • 18
  • 31
0

Old thread, but I was facing the same issue and I think this answer by Ted Petrou could have helped you perfectly here: Pandas Melt several groups of columns into multiple target columns by name

pd.wide_to_long(df, stubnames, i, j, sep, suffix) 

In short: The pd.wide_to_long() function allows you to specify the common component between the various columns you're trying to unpivot.

For example, my dataframe was similar to yours, as follows: pd.wide_to_long to unpivot groups of columns

pd.melt and pd.unstack get you close, but don't allow you to target these incremental groups of columns by their common denominator.

Robin Teuwens
  • 121
  • 1
  • 5