Inconsistent JSON schema guess with Spark dataframes

Question

I am trying to read a JSON file with Spark 1.4.1 DataFrames and navigate inside it. The inferred schema seems to be incorrect.

The JSON file is:

{
    "FILE": {
        "TUPLE_CLI": [{
            "ID_CLI": "C3-00000004",
            "TUPLE_ABO": [{
                "ID_ABO": "T0630000000000004",
                "TUPLE_CRA": {
                    "CRA": "T070000550330",
                    "EFF": "Success"
                },
                "TITRE_ABO": ["Mr",
                "OOESGUCKDO"],
                "DATNAISS": "1949-02-05"
            },
            {
                "ID_ABO": "T0630000000100004",
                "TUPLE_CRA": [{
                    "CRA": "T070000080280",
                    "EFF": "Success"
                },
                {
                    "CRA": "T070010770366",
                    "EFF": "Failed"
                }],
                "TITRE_ABO": ["Mrs",
                "NP"],
                "DATNAISS": "1970-02-05"
            }]
        },
        {
            "ID_CLI": "C3-00000005",
            "TUPLE_ABO": [{
                "ID_ABO": "T0630000000000005",
                "TUPLE_CRA": [{
                    "CRA": "T070000200512",
                    "EFF": "Success"
                },
                {
                    "CRA": "T070010410078",
                    "EFF": "Success"
                }],
                "TITRE_ABO": ["Miss",
                "OB"],
                "DATNAISS": "1926-11-22"
            }]
        }]
    }
}

The Spark code is:

val j = sqlContext.read.json("/user/arthur/test.json")
j.printSchema

The result is:

root
 |-- FILE: struct (nullable = true)
 |    |-- TUPLE_CLI: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- ID_CLI: string (nullable = true)
 |    |    |    |-- TUPLE_ABO: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- DATNAISS: string (nullable = true)
 |    |    |    |    |    |-- ID_ABO: string (nullable = true)
 |    |    |    |    |    |-- TITRE_ABO: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- TUPLE_CRA: string (nullable = true)

It seems pretty obvious that TUPLE_CRA is an array of structs, yet it is inferred as a string, and I can't understand why. In my opinion, the inferred schema should be:

root
 |-- FILE: struct (nullable = true)
 |    |-- TUPLE_CLI: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- ID_CLI: string (nullable = true)
 |    |    |    |-- TUPLE_ABO: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- DATNAISS: string (nullable = true)
 |    |    |    |    |    |-- ID_ABO: string (nullable = true)
 |    |    |    |    |    |-- TITRE_ABO: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- TUPLE_CRA: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- CRA: string (nullable = true)
 |    |    |    |    |    |    |    |-- EFF: string (nullable = true)

Does anyone have an explanation? Is there an easy way to tell Spark what the actual schema is, especially when the JSON structure is far more complex?

score 3 · Accepted Answer · answered Nov 26 '15 at 15:25

Well, I finally understood that the JSON is not what I expected. Notice that the first TUPLE_CRA is a single object without brackets [], while the other TUPLE_CRA values are arrays with brackets and several elements inside. That is why Spark is unable to accurately infer the structure. So the problem comes from how this JSON is generated: I need to modify the generation so that every TUPLE_CRA is an array, even when it contains only one element.
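
For anyone who cannot change the generator, here is a minimal sketch of the same fix done on the Spark side, combined with an explicit schema so nothing is left to inference. As far as I can tell, when a field is an object in one place and an array in another, the JSON inference gives up and falls back to string, which matches the output in the question. The sketch assumes the spark-shell of Spark 1.4.x (so sc and sqlContext already exist) and uses json4s, which ships with Spark. The helper name normalize and the intermediate value names are mine, not part of any API, and I have not run this against the original file.

```scala
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Schema the question expected, written out by hand (all fields nullable by default).
val craType = StructType(Seq(
  StructField("CRA", StringType),
  StructField("EFF", StringType)))

val aboType = StructType(Seq(
  StructField("DATNAISS", StringType),
  StructField("ID_ABO", StringType),
  StructField("TITRE_ABO", ArrayType(StringType)),
  StructField("TUPLE_CRA", ArrayType(craType))))

val cliType = StructType(Seq(
  StructField("ID_CLI", StringType),
  StructField("TUPLE_ABO", ArrayType(aboType))))

val fileType = StructType(Seq(
  StructField("TUPLE_CLI", ArrayType(cliType))))

val schema = StructType(Seq(StructField("FILE", fileType)))

// Hypothetical helper: wrap a lone TUPLE_CRA object in a one-element array.
// Fields that are already arrays do not match the pattern and are left untouched.
def normalize(json: String): String = {
  val fixed = parse(json).transformField {
    case ("TUPLE_CRA", obj: JObject) => ("TUPLE_CRA", JArray(List(obj)))
  }
  compact(render(fixed))
}

// wholeTextFiles yields (path, content) pairs, one per file, so each file
// is parsed as a single JSON document and re-emitted on one line.
val fixedJson = sc.wholeTextFiles("/user/arthur/test.json").map { case (_, content) => normalize(content) }

// Read the normalized JSON with the explicit schema instead of inferring it.
val j = sqlContext.read.schema(schema).json(fixedJson)
j.printSchema
```

Note that passing the schema alone would not be enough here: the record whose TUPLE_CRA is a bare object still would not match array-of-struct, so the normalization step (or a fixed generator) is what actually solves the problem. Also, wholeTextFiles loads each file entirely into memory, so this only suits files of moderate size.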
