
I have a DataFrame like the one below.

ID,       details_Json
1         {"name":"Anne","Age":"12","country":"Denmark"}
2         {"name":"Zen","Age":"24"}
3         {"name":"Fred","Age":"20","country":"France"}
4         {"name":"Mona","Age":"18","country":"Denmark"}

As you can see, the fields in the JSON are not fixed; a row can include more than the fields shown. Sometimes it is name, Age, country, and other times it can be something like name, Age, country, University, or just name, Age, university.

I want to filter the rows whose JSON includes a country field and where that country is equal to Denmark.

My output should look like this:

ID,       details_Json
1         {"name":"Anne","Age":"12","country":"Denmark"}
4         {"name":"Mona","Age":"18","country":"Denmark"}

Is there any way to do that?

Thank you:)

CRV
  • What does your RDD look like? Can you tell me? – Raman Mishra Jan 17 '19 at 06:45
  • Possible duplicate of [How to query JSON data column using Spark DataFrames?](https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes) – NNK Jan 17 '19 at 07:48

1 Answer


Here is a way:

// Construct the example DataFrame
val df = sc.parallelize(Seq((1,"{\"name\":\"Anne\",\"Age\":\"12\",\"country\":\"Denmark\"}"), 
             (2, "{\"name\":\"Zen\",\"Age\":\"24\"}"), 
             (3, "{\"name\":\"Fred\",\"Age\":\"20\",\"country\":\"France\"}"), 
             (4, "{\"name\":\"Mona\",\"Age\":\"18\",\"country\":\"Denmark\"}"))).toDF("ID", "details_Json")

df.show

+---+--------------------+
| ID|        details_Json|
+---+--------------------+
|  1|{"name":"Anne","A...|
|  2|{"name":"Zen","Ag...|
|  3|{"name":"Fred","A...|
|  4|{"name":"Mona","A...|
+---+--------------------+

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import spark.implicits._  // for the $"..." column syntax (already in scope in spark-shell)

// Schema for the fields we care about; fields missing from a row simply come back as null
val struct =
  StructType(
    StructField("name", StringType, true) ::
    StructField("Age", StringType, true) ::
    StructField("country", StringType, true) :: Nil)

val df2 = df
  .withColumn("details_Struct", from_json($"details_Json", struct))
  .withColumn("country", $"details_Struct".getField("country"))
  .filter($"country".equalTo("Denmark"))
  .drop("country", "details_Struct")

df2.show
+---+--------------------+
| ID|        details_Json|
+---+--------------------+
|  1|{"name":"Anne","A...|
|  4|{"name":"Mona","A...|
+---+--------------------+
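
For completeness, the same filter can be written without declaring a schema at all, using the standard get_json_object function (this is an alternative sketch, not part of the original answer): it extracts a single field from the JSON string and returns null when that field is missing, so rows without a country simply fall out of the filter.

import org.apache.spark.sql.functions.get_json_object

// Pull only the "country" field out of the raw JSON string and filter on it.
// Missing paths yield null, which never equals "Denmark", so those rows are dropped.
val df2Alt = df.filter(get_json_object($"details_Json", "$.country") === "Denmark")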

The answer above uses Apache Spark 2.3.1. Which version do you use? From version 2.4 onwards there is a schema_of_json function that infers the schema automatically; you might want to check that as well. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@schema_of_json(json:String):org.apache.spark.sql.Column
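
As a minimal sketch of that approach (assuming Spark 2.4+ and that one representative row contains every field you care about; fields absent from the sample would not be inferred):

import org.apache.spark.sql.functions.{from_json, schema_of_json}

// Infer the schema from a single sample row, then parse every row with it.
// Rows that are missing a field get null for that field after parsing.
val sample = """{"name":"Anne","Age":"12","country":"Denmark"}"""
val df3 = df
  .withColumn("details_Struct", from_json($"details_Json", schema_of_json(sample)))
  .filter($"details_Struct".getField("country") === "Denmark")
  .drop("details_Struct")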

Gofrette
  • Thank you very much for your answer. I tried it, but it always gave me an empty result, even in cases where I'm sure I should get a match. Is that because my JSON has no fixed structure? I mean, sometimes it can have more than the mentioned fields `(name, Age, country)` – CRV Jan 17 '19 at 10:09