0

Related to following question I asked earlier: Python pandas dataframe pivot only works with pivot_table() but not with set_index() and unstack()

I have been able to pivot the following sample data successfully, using both set_index() with unstack() and using pivot_table() with aggfunc=first parameter.

Sample Data:

id  responseTime    label   answers
ABC 2018-06-24  Category_1  [3]
ABC 2018-06-24  Category_2  [10]
ABC 2018-06-24  Category_3  [10]
DEF 2018-06-25  Category_1  [7]
DEF 2018-06-25  Category_8  [10]
GHI 2018-06-28  Category_3  [7]

Desired Output:

id  responseTime    category_1  category_2 category_3 category_8
ABC  2018-06-24           [3]     [10]         [10]       NULL
DEF  2018-06-25           [7]     NULL         NULL       [10]
GHI  2018-06-28           NULL    NULL         [7]        NULL

Code:

#this works but having issues with reset_index so leaving it here as comment. 
#df=pdDF.pivot_table(index=['items_id','responseTime'], columns='label', values='answers', aggfunc='first')

df=pdDF.set_index(['items_id','responseTime','label']).unstack('label')

#reset the index so all columns can be preserved for table creation
df.reset_index(inplace=True)

#create pyspark dataframe from pandas dataframe after pivoting is done.
psDF=spark.createDataFrame(df)

#create hive table
psDF.write.mode('overwrite').saveAsTable('default.test_table')

When I use the second piece of code with set_index() and unstack(), the resulting output has additional header answers when printing the dataframe. This causes duplicate columns when I create a hive table out of this dataframe.

Dataframe header before reset_index():

                                   answers
id  responseTime    category_1  category_2 category_3 category_8

Dataframe columns after reset_index:

('items_id', '')|('responseTime', '')|('answers', u'category_1')|('answers', u'category_2')|('answers', u'cateogry_3')|('answers', u'category_8')

Hive column names:

_'items_id'_''_     
_'responsetime'_''_
_'answers'_u'category_1'_
_'answers'_u'category_2'_
_'answers'_u'category_3'_
_'answers'_u'category_8'_

I believe this is happening because unstack() creates hierarchical columns with multiple levels. Is there a way to make answer level disappear and to drop these junk underscore characters and answer references in the dataframe itself so I can create normal hive columns?

rp1
  • 159
  • 2
  • 10

1 Answers1

1

Answering my own question here.

I can use droplevel() function to drop the top most level from the dataframe.

Right after set_index() and unstack(), I can add following line to drop answer level from the dataframe.

df.columns = df.columns.droplevel(0)

After this, reset_index() can be called to preserve all columns in dataframe just like the code above.

My dataframe columns and hive columns now don't contain level information with underscores.

|items_id|responseTime|category_1|category_2|category_3|category_8|

Additional reference to droplevel() is available on:

Stackoverlfow Question: Pandas: drop a level from a multi-level column index?

Pandas API: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.droplevel.html#pandas.MultiIndex.droplevel

rp1
  • 159
  • 2
  • 10