Hive queries using partitioned columns does not retrieve all the records. Why?

Question

I have a Hive statement as below:

INSERT INTO TABLE myTable partioned (myDate) SELECT * from myOthertable

myOthertable contains 1 million records and, while executing the above Insert, not all rows are inserted into myTable. As it is a SELECT * query without any WHERE clause ideally the Insert should be done for all the rows from myOthertable into myTable. It ignores some of the rows while inserting.

Can anyone suggest why this is happening?

Try not using INSERT INTO, but instead INSERT OVERWRITE. Also explicitly list your column names and make sure there exists a column named myDate. — Bryan, Nov 15 '12 at 15:11
How do you check that not all rows was inserted? If you are using `count(*)` then this may be a problem with statistics, see here: https://stackoverflow.com/a/39914232/2700344 — leftjoin, Apr 01 '18 at 19:55

score 0 · Answer 1 · answered Dec 01 '14 at 06:58

The issue may be due to ,If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task.

So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .

FROM myOthertable 
INSERT OVERWRITE TABLE myTable(myDate) 
SELECT other1, other2 DISTRIBUTE BY myDate;

This link may help

Hive queries using partitioned columns does not retrieve all the records. Why?

1 Answers1