
I'm creating a Dataset<Person> like so:

Dataset<Person> personDs = sparkSession.read().json("people.json").as(Encoders.bean(Person.class));

where Person is

class Person {
    private String name;
    private String placeOfBirth;

    //Getters and setters
    ...
}

If my input data only contains a name ({"name" : "bob"}), I get the error org.apache.spark.sql.AnalysisException: cannot resolve 'placeOfBirth' given input columns: [name].

Is there any way for me to tell Spark that placeOfBirth (or any other field) can be null?

  • Have you tried to add second record with both - name and place of birth? This should do the trick. – Vladislav Varslavans May 09 '18 at 13:35
  • Well yes, that works as expected. But my problem is that my data doesn't always have both fields (my actual bean contains about 20 fields, and input data can contain any subset of those fields; I just simplified it into a MVCE). – Ben Watson May 09 '18 at 13:39

1 Answer


In Spark 2.3.0 with Scala 2.11.12, this code worked for me:

sparkSession.read.schema("name String, placeOfBirth String").json("people.json").as(Encoders.bean(classOf[Person])).show()

Output:

+----+------------+
|name|placeOfBirth|
+----+------------+
| bob|        null|
+----+------------+
  • Thanks. This works but it's not ideal - I have to specify the schema twice - once in my bean and another time as a plaintext schema. Whilst it works, I'd much prefer a solution that doesn't require two changes if my data model changes. – Ben Watson May 09 '18 at 14:14
  • I think you can use something similar as in [this](https://stackoverflow.com/questions/36746055/generate-a-spark-structtype-schema-from-a-case-class) question to do it automatically – Vladislav Varslavans May 09 '18 at 14:17
  • Magical, that linked answer (specifically `Encoders.bean(Person.class).schema();`) was exactly what I needed. I'll accept this and also mark my question as a duplicate. – Ben Watson May 09 '18 at 14:27
  • 1
    You can do it without specifying it twice : val schema = Encoders.bean(classOf[Person]).schema val person = spark.read.schema(schema).json("/path/test.json"), this will give the desired result. – Neha Kumari Apr 05 '20 at 14:53
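Putting the answer and the last comment together, here is a minimal Java sketch of the full approach (the class name `ReadPartialJson` and the local-mode `SparkSession` setup are illustrative assumptions; `Person` is the bean from the question):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class ReadPartialJson {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partial-json")
                .master("local[*]")
                .getOrCreate();

        // Derive the schema once from the bean itself, so the data model
        // is only defined in one place. Fields absent from the input JSON
        // (e.g. placeOfBirth) are then read as null instead of raising an
        // AnalysisException.
        Encoder<Person> encoder = Encoders.bean(Person.class);
        StructType schema = encoder.schema();

        Dataset<Person> people = spark.read()
                .schema(schema)
                .json("people.json") // e.g. a file containing {"name" : "bob"}
                .as(encoder);

        people.show();
        spark.stop();
    }
}
```

If the `Person` bean gains or loses fields, the schema picks up the change automatically, which addresses the "specify the schema twice" concern from the comments.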