I am attempting to convert some code my organization uses from pandas DataFrames to pandas API on Spark DataFrames. We have run into a problem when converting our pivot functions: the pandas API on Spark does not allow pivot operations on string columns. I have recreated a small example to demonstrate the problem. The following pandas operation completes successfully.
import pandas as pd
pd_df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
pd_df = pd_df.pivot(columns='bar').sort_index()
pd_df.head()
The output for this is the following:
index | (foo, A) | (foo, B) | (foo, C) | (baz, A) | (baz, B) | (baz, C) | (zoo, A) | (zoo, B) | (zoo, C) |
---|---|---|---|---|---|---|---|---|---|
0 | one | NaN | NaN | 1.0 | NaN | NaN | 10.0 | NaN | NaN |
1 | NaN | one | NaN | NaN | 2.0 | NaN | NaN | 11.0 | NaN |
2 | NaN | NaN | one | NaN | NaN | 3.0 | NaN | NaN | 12.0 |
3 | two | NaN | NaN | 4.0 | NaN | NaN | 13.0 | NaN | NaN |
4 | NaN | two | NaN | NaN | 5.0 | NaN | NaN | 14.0 | NaN |
5 | NaN | NaN | two | NaN | NaN | 6.0 | NaN | NaN | 15.0 |
Note that the (foo, A)-style headers are our multi-indexed column names (the first level is the original column, the second level is the pivoted 'bar' category); they are part of the header, not the data.
While the pandas version runs without a problem, our pandas API on Spark pivot fails with the error "values should be a numeric type". This is because our 'foo' column contains strings.
import pyspark.pandas as ps
ps_df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
ps_df = ps_df.pivot(columns='bar', values=['foo', 'baz', 'zoo']).sort_index()
ps_df.head()
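For what it's worth, leaving the string column out of values seems to avoid the error, which is how I narrowed the failure down to 'foo' (the failing pivot above raises before the reassignment, so ps_df is still the original frame here):
# Pivoting only the numeric columns does not hit the
# "values should be a numeric type" error, so 'foo' is the culprit.
ps_df.pivot(columns='bar', values=['baz', 'zoo']).sort_index().head()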
I am trying to figure out how to reproduce the output of the pandas pivot call using the pandas API on Spark. Is there any way to do this? I have seen this question (Pivot String column on Pyspark Dataframe), but it uses PySpark rather than the pandas API on Spark.
I want to stick to the pandas API on Spark as much as possible, since much of our team isn't familiar with Spark, so I was hoping to find a solution that would not require dropping down to PySpark. Does anyone know a way to perform this operation with the pandas API on Spark?
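For reference, the closest I have come is to manually encode the string column as numbers before pivoting. This is only a rough sketch (the foo_codes mapping and the pivoted name are just illustrative, and it relies on knowing the distinct values of 'foo' ahead of time), and it still leaves numeric codes in the 'foo' block that would need to be mapped back to strings afterwards:
import pyspark.pandas as ps
ps_df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
# Illustrative, hard-coded encoding of 'foo' so every pivoted column is numeric.
foo_codes = {'one': 0, 'two': 1}
ps_df['foo'] = ps_df['foo'].map(foo_codes)
# With only numeric values columns, the pivot no longer raises.
pivoted = ps_df.pivot(columns='bar', values=['foo', 'baz', 'zoo']).sort_index()
pivoted.head()
# The ('foo', A/B/C) columns now hold 0/1 codes instead of 'one'/'two',
# so I would still have to map them back to the original strings somehow,
# which is the part I have not found a clean pandas-on-Spark way to do.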