I have a table like this:
+-------+-----+-----------+------+
|user_id|apple|good banana|carrot|
+-------+-----+-----------+------+
| user_0|    0|          3|     1|
| user_1|    1|          0|     2|
| user_2|    5|          1|     2|
+-------+-----+-----------+------+
Here, for each item, I want to get the list of customers who bought the most of it. The required output is as follows:
            max_user         max_count
apple       [user_2]         5
good banana [user_0]         3
carrot      [user_1, user_2] 2
MWE
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# pandas dataframe
pdf = pd.DataFrame({'user_id': ['user_0', 'user_1', 'user_2'],
                    'apple': [0, 1, 5],
                    'good banana': [3, 0, 1],
                    'carrot': [1, 2, 2]})

# spark dataframe (SQLContext is deprecated; use the SparkSession directly)
df = spark.createDataFrame(pdf)
# df.show()
df.createOrReplaceTempView("grocery")
spark.sql('select * from grocery').show()
Question 1
How can I get the required output using the PySpark DataFrame API?
Question 2
How can I get the required output using Spark SQL?
References
I have already done some research and searched multiple pages. So far I have found one close answer, but it requires a transposed table, whereas my table is in its normal orientation. I am also interested in learning multiple methods, such as the DataFrame API method and the SQL method.