I'm working in PySpark to do some graph analysis, and I need to read in the data for the vertices from a JSON file. The file contains a single object, which is really just a dictionary:
{"id_1": [array of features], "id_2": [array of features], ... "id_n": [array of features]}
I want to create a Spark DataFrame with only two columns, ['id','features']. Each entry in the JSON dictionary should become one row.
I'm loading the data as follows:
vertices = spark.read.json("filepath/json_file.json")
This output is not what I want: Spark treats every id as its own column, so the schema is a mess (small sample below):
StructType(List(StructField(0,ArrayType(LongType,true),true),StructField(1,ArrayType(LongType,true),true),StructField(10,ArrayType(LongType,true),true),StructField(100,ArrayType(LongType,true),true),StructField(1000,ArrayType(LongType,true),true),StructField(1001,ArrayType(LongType,true),true),StructField(1002,ArrayType(LongType,true),true),StructField(1003,ArrayType(LongType,true),true),StructField(1004,ArrayType(LongType,true),true)...
Desired output is as follows:
+-------+---------+
| id |features |
+-------+---------+
| id_1 | [array] |
+-------+---------+
| id_2 | [array] |
+-------+---------+
| id_3 | [array] |
+-------+---------+
Alternatively, I could also work with it as a series of (id, feature) pairs (either in a DataFrame or an RDD is fine):
+-------+---------+
| id |features |
+-------+---------+
| id_1 |feature1 |
+-------+---------+
| id_1 |feature2 |
+-------+---------+
| id_1 |feature3 |
+-------+---------+
Is there anything I can do, either on the DataFrame or with rdd.map/flatMap, to get the desired output?
This seems like it should be very simple, but I can't quite seem to parse it correctly, and I haven't found a related answer that works. Not sure what I'm missing here.
Thanks!