I have a partitioned table tab
and I want to create some tmp table test1
from it. Here is how I created the tmp table:
CREATE TABLE IF NOT EXISTS test1
(
COL1 string,
COL2 string,
COL3 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Write to this table with:
INSERT OVERWRITE TABLE test1
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
Then I count records in test1, it has 94493486 records, while the following SQL returns count 149248486:
SELECT COUNT(*) FROM
(SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01') AS TMP;
Also, when I save the selected partition(PT is the partition column) to HDFS, the record count is correct:
INSERT OVERWRITE directory '/user/me/wtfhive' row format delimited fields terminated by '|'
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
My Hive version is 3.1.0 coming with Ambari 2.7.1.0. Anyone have any idea what may cause this issue? Thanks.
=================== UPDATE =================
I find something might related to this issue.
The table tab
uses ORC as storage format. Its data is imported from ORC data file of another table, in another Hive cluster, with following script:
LOAD DATA INPATH '/data/hive_dw/db/tablename/pt=2019-04-16' INTO TABLE tab PARTITION(pt='2019-04-16');
As the 2 table shares same format, the loading procedure is basically just moves data file from HDFS source directory to Hive directory.
In following procedure, I can load without any issue:
- export data from ORC table
tab
to HDFS text file - load from the text file to a Hive temp table
- load data back to
tab
from the temp table - now I can select/export from
tab
to other tables without any record missing
I suspect the issue is in ORC format. I just don't understand why it can export to HDFS text file without any problem, but export to another table(no matter what storage format the other table uses) will loss data.