I have a dataframe with a schema like below (I have a large number of keys):

 |-- loginRequest: struct (nullable = true)
 |    |-- responseHeader: struct (nullable = true)
 |    |    |-- status: long (nullable = true)
 |    |    |-- code: long (nullable = true)
 |-- loginResponse: struct (nullable = true)
 |    |-- responseHeader: struct (nullable = true)
 |    |    |-- status: long (nullable = true)
 |    |    |-- code: long (nullable = true)

I want to create a column with the status taken from every key's responseHeader.status.

Expected

+--------------------+--------------------+------+
|        loginRequest|       loginResponse|status|
+--------------------+--------------------+------+
|               [0,1]|                null|     0|
|                null|               [0,1]|     0|
|                null|               [0,1]|     0|
|                null|               [1,0]|     1|
+--------------------+--------------------+------+

Thanks in advance.

Learnis

2 Answers


A simple select will solve your problem.

You have a nested field:

 |-- loginResponse: struct (nullable = true)
 |    |-- responseHeader: struct (nullable = true)
 |    |    |-- status: long (nullable = true)

A quick way would be to flatten your dataframe.

Doing something like this:

df.select(df.col("loginRequest.*"), df.col("loginResponse.*"))

And get it working from there.

Or, you could use something like this:

val explodeDF = df.withColumn("statusRequest", df("loginRequest.responseHeader.status"))

In order to get the column to populate from either the response or the request, you can use a when condition in Spark; this question helped me figure it out: How to use AND or OR condition in when in Spark
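
For illustration, a minimal sketch of that when/otherwise approach (the column paths come from the question's schema; the output column name is an assumption):

import org.apache.spark.sql.functions.when

// Take the request status when it is present; otherwise fall back to the response.
val withStatus = df.withColumn(
  "status",
  when(df("loginRequest.responseHeader.status").isNotNull,
       df("loginRequest.responseHeader.status"))
    .otherwise(df("loginResponse.responseHeader.status"))
)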

OBarros

  • Hi OBarros, I have a large number of keys, and I don't know the count or the exact key names. – Learnis Jun 30 '19 at 10:26
  • What does it mean to have a large number of keys? Can you create an example that we can run, or show a schema that matches the problem better? – OBarros Jun 30 '19 at 13:16
  • I can use the hint you have provided. The example mentioned here has only two keys (loginRequest, loginResponse), but in my case there are hundreds of keys like loginRequest and loginResponse. I cannot write "var explodeDF = df.withColumn("statusRequest", df("loginRequest. responseHeader"))" for every key. I already tried "df.withColumn("statusRequest", df("*. responseHeader.status"))", but it is not working. – Learnis Jun 30 '19 at 13:28
  • It's just a file read with a JSON operation; the JSON schema in every record is different. – Learnis Jun 30 '19 at 14:58

You can access the subfields with the . delimiter in a select statement, and with the help of the coalesce function you should get exactly what you aim for. Let's call the input dataframe df with your specified input schema; then this piece of code should do the work:

import org.apache.spark.sql.functions.{coalesce, col}

val df_status = df.withColumn(
  "status",
  coalesce(
    col("loginRequest.responseHeader.status"),
    col("loginResponse.responseHeader.status")
  )
)

What coalesce does is take the first non-null value, in the order of the input columns; if there is no non-null value, it returns null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).
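
Since the comments note that there are hundreds of keys, here is a hedged sketch that builds the coalesce arguments from the schema instead of listing them by hand; it assumes every relevant top-level column is a struct containing a responseHeader.status subfield, and the variable names are illustrative:

import org.apache.spark.sql.functions.{coalesce, col}
import org.apache.spark.sql.types.{StructField, StructType}

// One column reference per top-level struct key that carries a
// responseHeader child, pointing at its nested status value.
val statusCols = df.schema.fields.collect {
  case StructField(name, st: StructType, _, _)
      if st.fieldNames.contains("responseHeader") =>
    col(s"$name.responseHeader.status")
}

// Coalesce across all keys at once, so no key name is hard-coded.
val df_status = df.withColumn("status", coalesce(statusCols: _*))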

Richard Nemeth

  • Hi Richard, yes, I can do this. But the main thing is, I have a large number of keys. – Learnis Jun 30 '19 at 10:28
  • The example mentioned here has only two keys (loginRequest, loginResponse), but in my case there are hundreds of keys like loginRequest and loginResponse; I cannot add all the keys in coalesce. – Learnis Jun 30 '19 at 13:30