I'm working in PySpark to do some graph analysis, and I need to read in the vertex data from a JSON file. The file contains a single object, which is really just a dictionary:

{"id_1": [array of features], "id_2": [array of features], ... "id_n": [array of features]}

I want to create a Spark DataFrame with only two columns, ['id', 'features']. Each entry in the JSON dictionary should become a row.

I'm loading the data as follows:

vertices = spark.read.json("filepath/json_file.json")

But the output is incorrect: every key becomes its own column, so the schema is a mess (small sample below):

StructType(List(StructField(0,ArrayType(LongType,true),true),StructField(1,ArrayType(LongType,true),true),StructField(10,ArrayType(LongType,true),true),StructField(100,ArrayType(LongType,true),true),StructField(1000,ArrayType(LongType,true),true),StructField(1001,ArrayType(LongType,true),true),StructField(1002,ArrayType(LongType,true),true),StructField(1003,ArrayType(LongType,true),true),StructField(1004,ArrayType(LongType,true),true)...

Desired output is as follows:

+-------+---------+
|  id   |features |
+-------+---------+
| id_1  | [array] | 
+-------+---------+
| id_2  | [array] | 
+-------+---------+
| id_3  | [array] | 
+-------+---------+

Alternatively, I could also work with it as a series of (key, feature) pairs (either in a DataFrame or in an RDD is fine):

+-------+---------+
|  id   |features |
+-------+---------+
| id_1  |feature1 | 
+-------+---------+
| id_1  |feature2 | 
+-------+---------+
| id_1  |feature3 | 
+-------+---------+

Is there anything I can do, either within the DataFrame API or with rdd.map/flatMap, to get the desired output?

This seems like it should be very simple, but I can't quite seem to parse it correctly and I haven't found a related answer that works. Not sure what I am missing here.

Thanks!

  • If you can post a sample JSON with nested structure and your expected output, that would help. This post may also help you, as all the details are available here. (https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark) – H Roy Apr 15 '20 at 04:45
  • probably this can also help https://stackoverflow.com/questions/61218462/using-spark-to-expand-json-string-by-rows-and-columns – H Roy Apr 15 '20 at 04:51
  • @HRoy The sample is included in the description. There is no nested structure. It essentially looks like a Python dictionary, where each key/value is a unique string: array of features. – washedupengineer Apr 15 '20 at 04:57

0 Answers