How to process JSON data inside a CSV file

I am trying to use from_json, but it requires me to specify a schema, whereas my schema keeps varying.

sample input:-

userid   type   data
26594    p.v    {}
26594    s.s    {"sIds":["1173","1342","12345"]}
26594    s.r    {"bp":"sw"}
26594    p.v    {}
26594    s.r    {"c":"tracking","action":"n","label":"ank"}
26593    p.v    {}
26594    p.sr   {"pId":"11234","pName":"sahkas","s":"n","is":"F","totalCount":0,"scount":0}

I want to convert this into a DataFrame in which the JSON fields can be queried.

Looking for output like:-

userid    type    data_sids    data_bp    data_c    data_action    data_label
26594     p.v     null         null       null      null           null
26594     s.s     1173         null       null      null           null
26594     s.s     1173         null       null      null           null
26594     s.s     1342         null       null      null           null
26594     s.s     12345        null       null      null           null
26594     s.r     null         sw         null      null           null

Is this doable?

Could you please help me with this.

Thanks,

Ankush Reddy.

ankush reddy
  • Possible duplicate of [How to query JSON data column using Spark DataFrames?](https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes) – Alper t. Turker May 27 '18 at 18:20
  • I cannot specify a schema here because it keeps varying. I tried that approach, but it didn't solve my problem. – ankush reddy May 27 '18 at 19:43
  • I was directed here from a comment in the above related question. I agree with the answer given by @dogli980. DataFrame doesn't have the flexibility to handle this for you. – ImDarrenG May 31 '18 at 16:42

1 Answer

My advice is to work with RDDs for this task. Write something like this:

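import json
from pyspark.sql import Row

# rdd: an RDD of Rows with the following fields: userid, type, data - your CSV
def flatten_json(userid, type, data_json):
    data = json.loads(data_json)  # the data column arrives as a JSON string
    # .get() returns None when a key is absent from this record
    final_row = {"userid": userid, "type": type, "data_sids": data.get("sIds")}
    # ... add the remaining data_* fields to final_row in the same way
    return Row(**final_row)
rdd = rdd.map(lambda row: flatten_json(row["userid"], row["type"], row["data"]))
df = spark.createDataFrame(rdd)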

And that is it :)
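
To make the flattening step more concrete, here is a minimal sketch in plain Python (no Spark required to run it). The column list is an assumption taken from the desired output in the question, and the names COLUMNS and flatten_record are hypothetical, not part of any library. It lower-cases the JSON keys so that "sIds" lands in data_sids, fills missing keys with None, and returns one row per id when "sIds" is a list.

```python
import json

# Assumed output columns, copied from the desired output in the question.
COLUMNS = ["data_sids", "data_bp", "data_c", "data_action", "data_label"]


def flatten_record(userid, type_, data_json):
    """Return a list of flat dicts: one per element of "sIds", else a single one."""
    # Treat an empty or blank data cell like an empty JSON object.
    data = json.loads(data_json) if data_json and data_json.strip() else {}
    # Lower-case the keys so that "sIds" maps to the "data_sids" column.
    flat = {"data_" + key.lower(): value for key, value in data.items()}
    base = {"userid": userid, "type": type_}
    # Keep only the expected columns; keys missing from this record become None.
    base.update({col: flat.get(col) for col in COLUMNS})
    sids = base["data_sids"]
    if isinstance(sids, list):
        # One row per id; an empty list yields a single row with None.
        return [dict(base, data_sids=sid) for sid in (sids or [None])]
    return [base]


if __name__ == "__main__":
    for row in flatten_record("26594", "s.s", '{"sIds":["1173","1342","12345"]}'):
        print(row)
```

With Spark, a function like this would typically be passed to rdd.flatMap rather than rdd.map, since one input row can produce several output rows. Because many values are None, passing an explicit schema to spark.createDataFrame is probably safer than relying on type inference.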

dogli980