0

I have a flat table of records about objects. Object have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects. However, they are not of interest at present. So, basically, the data looks like this:

Id  ObjName XT  ObjType ContainerId
2   name1   x1  A   2
3   name2   x5  B   2
22  name5   x3  D   7
25  name6   x2  E   7
35  name7   x3  G   7
..
..
92  name23  x2  A   17
95  name24  x8  B   17
99  name25  x5  A   21

What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.

So, I am looking to convert the above data to the form below:

ObjType        A    B    C    D    E    F    G
ContainerId                                   
2            2.0  1.0  1.0  0.0  0.0  0.0  0.0
7            0.0  0.0  0.0  1.0  2.0  1.0  1.0
9            1.0  1.0  0.0  1.0  0.0  0.0  0.0
11           0.0  0.0  0.0  2.0  3.0  1.0  1.0
14           1.0  1.0  0.0  1.0  0.0  0.0  0.0
17           1.0  1.0  0.0  0.0  0.0  0.0  0.0
21           1.0  0.0  0.0  0.0  0.0  0.0  0.0

This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)

import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')

#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()

#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'. 
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])

#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()

#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)

Ask: Can someone please share other possible approaches that could perform this transformation?

hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51
  • Indeed, my bad. I searched a lot using 'groupby' and 'dataframe' during the whole effort. Got the couple of snippets from a separate blog about pivot-table towards the very end of my ordeal and I think it didn't occur to me to search once more on SoF just for 'pandas' and 'pivot-table'. I read so many other things along the way...but hitting upon [this](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe) would really have helped. Wonderful write-up! – mprabhu11 Jun 11 '18 at 05:44

1 Answers1

1

This is a use case for pd.crosstab. Docs.

e.g.

In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]: 
ObjType      A  B  D  E  G
ContainerId
2            1  1  0  0  0
7            0  0  1  1  1
17           1  1  0  0  0
21           1  0  0  0  0
hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51