
I have created a Hive external table, stored as textfile and partitioned by event_date (Date).

How do we specify the CSV format (the field delimiter) when reading this Hive table from Spark?

The environment is

 1. Spark 1.5.0 (CDH 5.5.1), Scala 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
 2. Hive 1.1, CDH 5.5.1

Scala script:

    sqlContext.setConf("hive.exec.dynamic.partition", "true")
    sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
    val distData_1 = distData.withColumn("event_date", current_date())
    distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]

    scala> distData_1.show
    +---+---+---+----------+
    | _1| _2| _3|event_date|
    +---+---+---+----------+
    |  1|  1|  1|2016-03-25|
    |  2|  2|  2|2016-03-25|
    |  3|  3|  3|2016-03-25|
    +---+---+---+----------+

    distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")

    scala> sqlContext.sql("select * from part_table").show
    +-----+----+----+----------+
    |    a|   b|   c|event_date|
    +-----+----+----+----------+
    |1,1,1|null|null|2016-03-25|
    |2,2,2|null|null|2016-03-25|
    |3,3,3|null|null|2016-03-25|
    +-----+----+----+----------+
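
For comparison, here is a sketch (untested) of an alternative write path that routes the insert through HiveContext SQL, so the write goes through the table's own SerDe rather than the DataFrame writer; the dynamic-partition settings above are assumed to still be in effect, and the column renaming is my addition.

    // Hedged sketch, not from the original script: insert via HiveContext SQL.
    // distData_1 has columns _1, _2, _3, event_date; rename them to match the
    // Hive table before inserting.
    val staged = distData_1.toDF("a", "b", "c", "event_date")
    staged.registerTempTable("staging")

    // Dynamic partitioning on event_date, relying on the setConf calls above.
    sqlContext.sql(
      """INSERT INTO TABLE part_table PARTITION (event_date)
        |SELECT a, b, c, event_date FROM staging""".stripMargin)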

Hive table

    create external table part_table (a String, b int, c bigint) 
    partitioned by (event_date Date)
    row format delimited fields terminated by ',' 
    stored as textfile  LOCATION "/user/hdfs/hive/part_table";

    select * from part_table shows
    | part_table.a | part_table.b | part_table.c | part_table.event_date |
    | 1            | 1            | 1            | 2016-03-25            |
    | 2            | 2            | 2            | 2016-03-25            |
    | 3            | 3            | 3            | 2016-03-25            |
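
As a quick cross-check from the Spark shell (standard Spark 1.5 calls; output not reproduced here), the schema Spark resolves from the metastore can be compared against the DDL above:

    // Hedged check: print the schema Spark sees for the Hive table.
    sqlContext.table("part_table").printSchema()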

Looking at HDFS:

    The partition path /user/hdfs/hive/part_table/event_date=2016-03-25 has 2 part files:
    part-00000
    part-00001

      part-00000 content
        1,1,1
      part-00001 content
        2,2,2
        3,3,3
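
As a sanity check (my addition, not part of the original steps), reading the partition directory as plain text confirms the rows on disk are comma-delimited and can be parsed by hand:

    // Hedged sketch: bypass the Hive table and read the partition files directly.
    val raw = sc.textFile("/user/hdfs/hive/part_table/event_date=2016-03-25")
    raw.collect().foreach(println)

    // Splitting on the comma rebuilds the expected columns.
    val parsed = raw.map(_.split(","))
                    .map(r => (r(0), r(1).toInt, r(2).toLong))
                    .toDF("a", "b", "c")
    parsed.show()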

P.S. If we store the table as ORC, it writes and reads the data as expected.

If 'fields terminated by' is left at its default, Spark reads the data as expected, so I guess this is a bug.
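
For what it's worth, one way to force a comma delimiter on the read side is to bypass the Hive read path entirely. The sketch below assumes the Databricks spark-csv package (com.databricks:spark-csv_2.10) is on the classpath; it is not part of Spark 1.5 itself, it reads a single partition directory, and the partition value has to be added back by hand.

    // Hedged workaround sketch, assuming the external spark-csv package is available.
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types._

    val csvSchema = StructType(Seq(
      StructField("a", StringType),
      StructField("b", IntegerType),
      StructField("c", LongType)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", ",")
      .schema(csvSchema)
      .load("/user/hdfs/hive/part_table/event_date=2016-03-25")
      .withColumn("event_date", lit("2016-03-25"))  // partition value, added manually

    df.show()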

