
I've created the following table in Hive:

CREATE TABLE mytable (..columns...) PARTITIONED BY (load_date string) STORED AS ...

And I'm trying to insert data into my table with Spark as follows:

Dataset<Row> dfSelect = df.withColumn("load_date", functions.lit("15_07_2018"));
dfSelect.write().mode("append").partitionBy("load_date").save(path);

I've also set the following configuration:

sqlContext().setConf("hive.exec.dynamic.partition","true");
sqlContext().setConf("hive.exec.dynamic.partition.mode","nonstrict");

After the write command runs I can see the directory /myDbPath/load_date=15_07_2018 on HDFS, and it contains the file that I've written, but when I run a query like:

show partitions mytable

or

select * from mytable where load_date="15_07_2018"

I get 0 records.

What happened and how can I fix this?

EDIT

If I run the following command in Hue:

msck repair table mytable

the problem is solved. How can I do this from my code?


1 Answer


Hive stores a list of partitions for each table in its metastore. If, however, new partitions are added directly to HDFS (say by using the hadoop fs -put command, or by a Spark .save(...) to the table's path), the metastore (and hence Hive) will not be aware of these partitions unless the user runs one of the two commands below:

  1. The metastore check command (msck repair table):

msck repair table <db.name>.<table_name>;

(or)

  2. An ALTER TABLE <table_name> ADD PARTITION command for each of the newly added partitions.

We can also add partitions with an ALTER TABLE statement, but this way each and every newly created partition has to be added to the table explicitly:

alter table <db.name>.<table_name> add partition(load_date="15_07_2018") location <hdfs-location>;

Run either of the above statements and then query the data again for load_date="15_07_2018".
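
Since the edit asks how to do this from code, here is a minimal sketch in Java, assuming a Hive-enabled Spark 2.x SparkSession; the database name mydb, the app name, and the HDFS location are illustrative placeholders, not taken from the question:

import org.apache.spark.sql.SparkSession;

public class RegisterPartitions {
    public static void main(String[] args) {
        // Requires spark-hive on the classpath and a reachable metastore.
        SparkSession spark = SparkSession.builder()
                .appName("register-partitions")   // placeholder name
                .enableHiveSupport()
                .getOrCreate();

        // Option 1: let Hive scan the table's directory on HDFS and
        // register every partition directory it finds.
        spark.sql("msck repair table mydb.mytable");

        // Option 2: register only the newly written partition.
        spark.sql("alter table mydb.mytable add if not exists "
                + "partition (load_date='15_07_2018') "
                + "location '/myDbPath/load_date=15_07_2018'");

        spark.stop();
    }
}

On Spark 1.6 the same statements can be run through hiveContext.sql(...), as the comments below mention.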

For more details, refer to these links: add partitions and msck repair table.

  • Probably this link will give you some insight into adding partitions using Java: https://stackoverflow.com/questions/39807098/extract-hive-table-partition-in-spark-java/39807932#39807932 – notNull Jul 15 '18 at 12:41
  • Thank you. Can I do it from Spark SQL? – Ya Ko Jul 15 '18 at 13:31
  • Yes, initialize and use a Hive context, then execute the msck repair table statement: **hiveContext.sql("msck repair table .")**. If you are using Spark 2.0+, then **spark.sql("msck repair table .")** – notNull Jul 15 '18 at 13:44
  • OK, but do I need to do it every time I run the save command? Won't that cause a performance problem? – Ya Ko Jul 15 '18 at 13:59
  • If you are writing data **directly to an HDFS location**, then **every time** you need to execute the **msck repair table** statement to add the newly created partitions to the Hive table, and I don't think there are any performance issues. Another way of doing this is to create dynamic partitions in the Hive table using Spark; refer to this link: https://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive. If you create partitions this way, you don't have to run msck repair table again (see the sketch after these comments). – notNull Jul 15 '18 at 14:15
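
A minimal sketch of the dynamic-partition alternative from the last comment, again in Java and under stated assumptions: the Hive table mydb.mytable already exists with load_date as its last column, and mydb.source_table is a hypothetical source. Writing with insertInto goes through the metastore, so new partitions are registered automatically and msck repair table is not needed:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.lit;

public class DynamicPartitionInsert {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dynamic-partition-insert") // placeholder name
                .enableHiveSupport()
                .getOrCreate();

        // Same settings as in the question, applied through Spark SQL.
        spark.sql("set hive.exec.dynamic.partition=true");
        spark.sql("set hive.exec.dynamic.partition.mode=nonstrict");

        // Hypothetical source; any Dataset<Row> with matching columns works.
        Dataset<Row> df = spark.table("mydb.source_table");
        Dataset<Row> dfSelect = df.withColumn("load_date", lit("15_07_2018"));

        // insertInto matches columns by position, so the partition column
        // load_date must come last; do not combine it with partitionBy().
        dfSelect.write().mode("append").insertInto("mydb.mytable");

        spark.stop();
    }
}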