I have created a hive external table stored as textfile partitioned by event_date Date.
How do we have to specify a specific format of csv while reading in spark from Hive table ?
The environment is
1. 1.Spark 1.5.0 - cdh5.5.1 Using Scala version 2.10.4(Java HotSpot(TM) 64 - Bit Server VM, Java 1.7.0_67)
2. Hive 1.1, CDH 5.5.1
scala script
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]
scala > distData_1.show
+ ---+---+---+----------+
|_1 |_2 |_3 | event_date |
| 1 | 1 | 1 | 2016-03-25 |
| 2 | 2 | 2 | 2016-03-25 |
| 3 | 3 | 3 | 2016-03-25 |
distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
scala > sqlContext.sql("select * from part_table").show
| a | b | c | event_date |
|1,1,1 | null | null | 2016-03-25 |
|2,2,2 | null | null | 2016-03-25 |
|3,3,3 | null | null | 2016-03-25 |
Hive table
create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile LOCATION "/user/hdfs/hive/part_table";
select * from part_table shows
|part_table.a | part_table.b | part_table.c | part_table.event_date |
|1 |1 |1 |2016-03-25
|2 |2 |2 |2016-03-25
|3 |3 |3 |2016-03-25
Looking at the hdfs
The path has 2 part files /user/hdfs/hive/part_table/event_date=2016-03-25
part-00000
part-00001
part-00000 content
1,1,1
part-00001 content
2,2,2
3,3,3
P.S. if we store the table as orc it writes and reads the data as expected.
If the 'fields terminated by' is default then Spark can read the data as expected hence i guess this would be a bug.