Related to following question I asked earlier: Python pandas dataframe pivot only works with pivot_table() but not with set_index() and unstack()
I have been able to pivot the following sample data successfully, using both set_index()
with unstack()
and using pivot_table()
with aggfunc=first
parameter.
Sample Data:
id responseTime label answers
ABC 2018-06-24 Category_1 [3]
ABC 2018-06-24 Category_2 [10]
ABC 2018-06-24 Category_3 [10]
DEF 2018-06-25 Category_1 [7]
DEF 2018-06-25 Category_8 [10]
GHI 2018-06-28 Category_3 [7]
Desired Output:
id responseTime category_1 category_2 category_3 category_8
ABC 2018-06-24 [3] [10] [10] NULL
DEF 2018-06-25 [7] NULL NULL [10]
GHI 2018-06-28 NULL NULL [7] NULL
Code:
#this works but having issues with reset_index so leaving it here as comment.
#df=pdDF.pivot_table(index=['items_id','responseTime'], columns='label', values='answers', aggfunc='first')
df=pdDF.set_index(['items_id','responseTime','label']).unstack('label')
#reset the index so all columns can be preserved for table creation
df.reset_index(inplace=True)
#create pyspark dataframe from pandas dataframe after pivoting is done.
psDF=spark.createDataFrame(df)
#create hive table
psDF.write.mode('overwrite').saveAsTable('default.test_table')
When I use the second piece of code with set_index()
and unstack()
, the resulting output has additional header answers
when printing the dataframe. This causes duplicate columns when I create a hive table out of this dataframe.
Dataframe header before reset_index():
answers
id responseTime category_1 category_2 category_3 category_8
Dataframe columns after reset_index:
('items_id', '')|('responseTime', '')|('answers', u'category_1')|('answers', u'category_2')|('answers', u'cateogry_3')|('answers', u'category_8')
Hive column names:
_'items_id'_''_
_'responsetime'_''_
_'answers'_u'category_1'_
_'answers'_u'category_2'_
_'answers'_u'category_3'_
_'answers'_u'category_8'_
I believe this is happening because unstack()
creates hierarchical columns with multiple levels. Is there a way to make answer
level disappear and to drop these junk underscore characters and answer
references in the dataframe itself so I can create normal hive columns?