I am trying to load data from a few JSON files into Hive using spark-shell and Scala.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive._

val conf = new SparkConf().setAppName("App").setMaster("local")
val hiveContext = new HiveContext(sc) // sc is already provided by spark-shell
val df = hiveContext.read.schema(buildSchema()).json(<path to json>)
df.printSchema()
df.show()
df.write.mode("append").saveAsTable("data")
The issue is that some of the fields in my JSON files are arrays of strings. If a given file has even one record with a valid value in such a field, the resulting DataFrame infers the correct data type (Array of Strings) for that field. But if every record in a file has an empty value in that field, the field is inferred as String in the DataFrame. When the DataFrame is then appended to the Hive table, those records are rejected because of the data type mismatch. How do I ensure this mismatch is avoided, so that the field is always read as Array of Strings regardless of whether it has a value (perhaps with null when the array is empty)?
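For context, my `buildSchema()` returns a `StructType` along these lines (the field names here are placeholders, not the real ones), with the problematic field declared explicitly as `ArrayType(StringType)`:

```scala
import org.apache.spark.sql.types._

// Placeholder field names for illustration; the real schema has more fields.
// The array field is declared up front as ArrayType(StringType), nullable,
// which is what I expect the DataFrame to use even when all values are empty.
def buildSchema(): StructType = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = true)
))
```

Even with this schema passed to `read.schema(...)`, the mismatch described above still occurs for files where the field is always empty.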