How can I process JSON data stored in a column of a DataFrame?

Question

I have a DataFrame in Spark with one column that holds JSON data.

column3:
z:{
    k:{
        q1:null,
        q2:1,
        q3:23,
        q4:null,
        q5:{v1:null, v2:wers, v3:null}
        a1:['sdsad','wqeqw'],
        d1:'123_23'
    },
    l:{ 
        w1:wwew
        w2:null
        w4:123
    }
}

How can I process the content of the JSON above and perform operations such as splitting d1:'123_23' on '_' and adding the result as another column in the DataFrame?

How can I count how many keys inside the JSON have non-null values? And if there is an array, how can I count its elements?

Below is an example DataFrame:

col1 : gf23431  
col2 : 6728103  
col3 : "z:{
 k:{
  q1:null,
  q2:1,
  q3:23,
  q4:null,
  q5:{v1:null, v2:wers, v3:null}
  a1:['sdsad','wqeqw'],
  d1:'123_23'
 },
 l:{ 
  w1:wwew
  w2:null
  w4:123
 }
}"  
col4 : 3658

Desired Output columns:

Total keys under "k:" 7
Total non-null values under key "k:" 5 //count of keys having non-null values

Total keys under key "q5:" 3
Total non-null values under key "q5:" 1
Total values under "a1:" 2
Split the value under "d1:" on '_' and add another column: 246 //multiply the 1st value by 2 and add it as another column in the DataFrame

So the output columns will be:

col5 : 7
col6 : 5
col7 : 3
col8 : 1
col9 : 2
col10: 246

It's hard to understand what you are asking. Can you please [edit] your question to include a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and the corresponding desired output? — pault, Sep 07 '18 at 21:15

score 0 · Answer 1 · answered Sep 08 '18 at 19:36

Use something like the get_json_object function to extract the field you want. You can then compare it with null, etc., as if these fields were just regular DataFrame columns. Also check out the other functions for things like array length, maps, etc.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.get_json_object
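
If the per-field route gets unwieldy, another option is to parse the whole string in plain Python and compute the counts there. The following is only a minimal sketch of that alternative (it uses Python's json module, not get_json_object). It assumes col3 holds valid JSON; the sample in the question would first need quoted keys and strings plus the missing commas. The names SAMPLE, count_non_null and summarize are made up for illustration.

```python
import json

# Assumption: a valid-JSON version of the col3 value from the question.
SAMPLE = """
{"z": {
  "k": {"q1": null, "q2": 1, "q3": 23, "q4": null,
        "q5": {"v1": null, "v2": "wers", "v3": null},
        "a1": ["sdsad", "wqeqw"],
        "d1": "123_23"},
  "l": {"w1": "wwew", "w2": null, "w4": 123}
}}
"""


def count_non_null(obj):
    """Count the keys of a dict whose value is not None (JSON null)."""
    return sum(1 for value in obj.values() if value is not None)


def summarize(json_text):
    """Hypothetical helper returning the values wanted as col5..col10."""
    k = json.loads(json_text)["z"]["k"]
    q5 = k.get("q5") or {}   # treat a missing or null q5 as empty
    a1 = k.get("a1") or []   # treat a missing or null a1 as empty
    first = (k.get("d1") or "").split("_")[0]
    return (
        len(k),                # total keys under k
        count_non_null(k),     # non-null values under k
        len(q5),               # total keys under q5
        count_non_null(q5),    # non-null values under q5
        len(a1),               # number of elements in a1
        int(first) * 2 if first.isdigit() else None,  # first part of d1, doubled
    )


print(summarize(SAMPLE))  # (7, 5, 3, 1, 2, 246)
```

In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf and a struct return type, and the struct fields then selected as col5 to col10. For a single field such as d1, get_json_object(col3, '$.z.k.d1') followed by split avoids the UDF entirely.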
