
I have a list of maps that contains something like this:

fields = [{"trials": 1.0, "name": "Alice", "score": 8.0}, {"trials": 2.0, "name": "Bob", "score": 10.0}]

The list of maps is returned as a JSON blob from an API call. When I convert this to a DataFrame in PySpark, I get the following:

+-------------------------------------------+---------+
|fields                                     |key      |
+-------------------------------------------+---------+
|[1.0, Alice, 8.0]                          |key1     |
|[2.0, Bob, 10.0]                           |key2     |
|[1.0, Charlie, 8.0]                        |key3     |
|[2.0, Sue, 10.0]                           |key4     |
|[1.0, Clark, 8.0]                          |key5     |
|[3.0, Sarah, 10.0]                         |key6     |
+-------------------------------------------+---------+

I would like to get it into this form:

+------+-------+-----+---------+
|trials|name   |score|key      |
+------+-------+-----+---------+
|1.0   |Alice  |8.0  |key1     |
|2.0   |Bob    |10.0 |key2     |
|1.0   |Charlie|8.0  |key3     |
|2.0   |Sue    |10.0 |key4     |
|1.0   |Clark  |8.0  |key5     |
|3.0   |Sarah  |10.0 |key6     |
+------+-------+-----+---------+

What is the best way of going about this? This is what I have so far:

import json

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# results is the list of maps returned by the API call;
# read.json expects an RDD of JSON strings, so serialize each record first
rdd = sc.parallelize([json.dumps(r) for r in results])
df = sqlContext.read.json(rdd)
df.show()
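
If fields is inferred as a struct column (which is what the bracketed values in df.show() suggest), you can flatten it by selecting its subfields. A minimal sketch, assuming the schema is struct<trials:double, name:string, score:double> alongside a top-level key column:

from pyspark.sql.functions import col

# Pull each subfield of the fields struct up into its own top-level column,
# keeping the key column alongside them
flat_df = df.select(col("fields.trials"),
                    col("fields.name"),
                    col("fields.score"),
                    col("key"))

# Equivalent shorthand that expands all subfields of the struct at once
flat_df = df.select("fields.*", "key")

flat_df.show()

Worth checking df.printSchema() first: if fields was instead inferred as a map or array type, you would need getItem() or explode() rather than the struct-style select above.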