I have a list of maps that contains something like this:
fields = [{"trials": 1.0, "name": "Alice", "score": 8.0}, {"trials": 2.0, "name": "Bob", "score": 10.0}]
The list of maps is returned as a JSON blob from an API call. When I convert this to a dataframe in PySpark, I'll get the following:
+-------------------+----+
|fields             |key |
+-------------------+----+
|[1.0, Alice, 8.0]  |key1|
|[2.0, Bob, 10.0]   |key2|
|[1.0, Charlie, 8.0]|key3|
|[2.0, Sue, 10.0]   |key4|
|[1.0, Clark, 8.0]  |key5|
|[3.0, Sarah, 10.0] |key6|
+-------------------+----+
I would like to get it into this form:
+------+-------+-----+----+
|trials|name   |score|key |
+------+-------+-----+----+
|1.0   |Alice  |8.0  |key1|
|2.0   |Bob    |10.0 |key2|
|1.0   |Charlie|8.0  |key3|
|2.0   |Sue    |10.0 |key4|
|1.0   |Clark  |8.0  |key5|
|3.0   |Sarah  |10.0 |key6|
+------+-------+-----+----+
What is the best way of going about this? This is what I have so far:
import json

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# `results` is the list of records returned by the API call.
# read.json expects an RDD of JSON strings, so serialize each record first.
rdd = sc.parallelize([json.dumps(r) for r in results])
df = sqlContext.read.json(rdd)
df.show()