Spark - JSON empty Array of String show up as String

Question

I am trying to load a few JSON files' data to HIVE using spark-shell and Scala.

 import org.apache.spark.SparkConf
 import org.apache.spark.sql.SQLContext

 val conf = new SparkConf().setAppName("App").setMaster("local")

 import org.apache.spark.sql.hive._

 val hiveContext = new HiveContext(sc)

 val df = hiveContext.read.schema(buildSchema()).json(<path to json>)
 df.printSchema()
 df.show()

 df.write.mode("append").saveAsTable("data")

The issue is that some of the fields in my json files are Array of Strings. If any given file has even one record with the Array of Strings field with some valid value, then the resulting dataframe has the right data type (i.e. Array of Strings) for the said field but if all the records in a given json file has empty value in the Array of Strings fields, then the record is written as String type in the dataframe. And when eventually the data has to be appended from dataframe to the HIVE table, the records are rejected because of the data type mismatch. How do I ensure that this data type mismatch is avoided and irrespective of any value or not in the said field, it's data type is read as Array of Strings (maybe with null value if the Array is empty)?

Can you post a few lines of sample data explaining both cases? — shanmuga, May 21 '18 at 11:40
You can look at [this](https://stackoverflow.com/questions/50254423/explicitly-specify-schema-for-reading-json-and-mark-missing-fields-as-null/50255555#50255555) post for an answer. — Vladislav Varslavans, May 21 '18 at 11:47
cast the columns as Array[String] before you save them to Hive then — Ramesh Maharjan, May 21 '18 at 11:54

score 0 · Answer 1 · edited May 22 '18 at 05:38

0

Try below code, I am casting to Array[String] and works for my sample data.

df.select(df("Your_Column_Name").cast(ArrayType(StringType))).write.mode("append").saveAsTable("data")

edited May 22 '18 at 05:38

vindev

2,240
2
13
20

answered May 21 '18 at 20:15

Looks like this would mean that I need to specify a whole select query before write. My json file structure is quite big and I don't want to specify the whole cast and select query before writing to the table. Is there any other way? – Neha May 22 '18 at 18:33
This select is overriding the existing column , you would not have to select all columns again. – May 22 '18 at 18:46
I tried this but got an error message:0org.apache.spark.sql.AnalysisException: cannot resolve 'cast( as array)' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true); – Neha May 23 '18 at 17:46

Spark - JSON empty Array of String show up as String

1 Answers1