1

Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?

For this question the underlying schema has the following rules:

  • Each ProjectID appears once
  • Each ProjectID is associated to a single PM
  • Each ProjectID is associated to a single Category
  • Multiple ProjectIDs can be associated with a single Category
  • Multiple ProjectIDs can be associated with a single PM

Input Data Set A

df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
          'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
          'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
          'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
          'Score':[10,7,10,5,15,10,0,2]})

enter image description here

Desired Output enter image description here Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.

I have tried so far:

  • df_A.set_index(['Category','ProjectID'],append=True).unstack() - This would only work if I first create a nested index of ['Category','ProjectID] and ADD that to the original numerical index created with a standard dataframe, however it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
  • df_A.groupby() - I wasn't able to use this because it appears to force aggregation of some sort in order to get all of the values of a single category on a single row.
  • df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation and it starts to look similar to my intended output, but can only see the 'Comments' field and also cannot set nested columns this way. I receive an error when trying to set values=['Comments','Score'] in the pivot statement.

I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.

I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.

  • 1
    I suggest editing your question to include a toy dataset that people can use. Not only that it is a lot to ask from volunteers to type data from a picture, but their dataframe might not correctly reflect the type of your data. Please read [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). If it is text (code, data, error messages), you should post it as a text on SO, not as a link or a picture. – Mr. T Jul 15 '18 at 07:25
  • Apart from that: [I think #7 applies here.](https://stackoverflow.com/q/47152691/8881141) – Mr. T Jul 15 '18 at 07:29

1 Answers1

0

I think this is what you are looking for:

pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)

Out[4]:
PM                        Amy                    Bob        ...              Jill
ProjectID                   6                      1        ...                 5                      7
                     Comments Score         Comments Score  ...          Comments Score         Comments Score
  Category                                                  ...
0 Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
  Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
  Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
  Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN

[4 rows x 16 columns]

EDIT: To select rows by category you should get rid of the row index 0 by adding .xs():

In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)

In [4]: df_A_transformed
Out[4]:
PM                      Amy                    Bob        ...              Jill
ProjectID                 6                      1        ...                 5                      7
                   Comments Score         Comments Score  ...          Comments Score         Comments Score
Category                                                  ...
Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN

[4 rows x 16 columns]

In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM    ProjectID
Amy   6          Comments                NaN
                 Score                   NaN
Bob   1          Comments                NaN
                 Score                   NaN
Jack  3          Comments                NaN
                 Score                   NaN
      4          Comments    Justification 4
                 Score                     5
      8          Comments    Justification 8
                 Score                     2
Jill  2          Comments    Justification 2
                 Score                     7
      5          Comments                NaN
                 Score                   NaN
      7          Comments    Justification 7
                 Score                     0
Name: Category B, dtype: object
Benjamin
  • 571
  • 4
  • 8
  • I was able to run your answer over df_A and it worked perfectly. Thank you Benjamin, I'm relieved to know this is possible! Also, It looks like I got the same answer by modifying your answer to exclude the first '.T' instance. In other words, the following code: pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2) produces the same result. What does the first .T (transpose) after sort_index actually do? – Dave Guenther Jul 16 '18 at 19:15
  • Hi Dave, you're right. The first .T does nothing and can be removed. I'll edit my answer accordingly. – Benjamin Jul 18 '18 at 09:43