
I have a Spark DataFrame that looks like this:

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, LongType
from pyspark.sql.functions import sum, mean

rdd = sc.parallelize([('retail', 'food'),
                      ('retail', 'food'),
                      ('retail', 'auto'),
                      ('retail', 'shoes'),
                      ('wholesale', 'healthsupply'),
                      ('wholesale', 'foodsupply'),
                      ('wholesale', 'foodsupply'),
                      ('retail', 'toy'),
                      ('retail', 'toy'),
                      ('wholesale', 'foodsupply')])
schema = StructType([StructField('division', StringType(), True),
                     StructField('category', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

I want to generate a table like this: the division name, the division's total record count, and the top-1 and top-2 categories within each division along with their record counts:

 division   division_total  cat_top_1   top1_cnt  cat_top_2     top2_cnt
 retail     6               food        2         toy           2
 wholesale  4               foodsupply  3         healthsupply  1

Now I can generate cat_top_1 and cat_top_2 using window functions in Spark, but I don't know how to pivot the ranked rows into a single row per division, or how to add the division_total column:

from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

# per-(division, category) record counts
df_by_div = df.groupBy('division', 'category').count()

windowSpec = Window.partitionBy("division").orderBy(col("count").desc())

# keep the two most frequent categories per division
df_list = df_by_div.withColumn("rn", row_number().over(windowSpec))\
                   .where(col("rn") <= 2)\
                   .orderBy("division", desc("count"))
  • Possible duplicate of [Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames](http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes) –  Oct 10 '16 at 17:30
  • Thanks. It's good to read through it. The case when solution is similar to mine. – newleaf Oct 10 '16 at 19:05
