2

I have a partitioned table tab and I want to create some tmp table test1 from it. Here is how I created the tmp table:

CREATE TABLE IF NOT EXISTS test1
(
 COL1 string,
 COL2 string,
 COL3 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Write to this table with:

INSERT OVERWRITE TABLE test1 
SELECT TAB.COL1 as COL1,
    TAB.COL2 as COL2,
    TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';

Then I count records in test1, it has 94493486 records, while the following SQL returns count 149248486:

SELECT COUNT(*) FROM
(SELECT TAB.COL1 as COL1,
    TAB.COL2 as COL2,
    TAB.COL3 as COL3
  FROM TAB
  WHERE PT='2019-05-01') AS TMP;

Also, when I save the selected partition(PT is the partition column) to HDFS, the record count is correct:

INSERT OVERWRITE directory '/user/me/wtfhive' row format delimited fields terminated by '|'
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';

My Hive version is 3.1.0 coming with Ambari 2.7.1.0. Anyone have any idea what may cause this issue? Thanks.

=================== UPDATE =================

I find something might related to this issue. The table tab uses ORC as storage format. Its data is imported from ORC data file of another table, in another Hive cluster, with following script:

LOAD DATA INPATH '/data/hive_dw/db/tablename/pt=2019-04-16' INTO TABLE tab PARTITION(pt='2019-04-16');

As the 2 table shares same format, the loading procedure is basically just moves data file from HDFS source directory to Hive directory.

In following procedure, I can load without any issue:

  1. export data from ORC table tab to HDFS text file
  2. load from the text file to a Hive temp table
  3. load data back to tab from the temp table
  4. now I can select/export from tab to other tables without any record missing

I suspect the issue is in ORC format. I just don't understand why it can export to HDFS text file without any problem, but export to another table(no matter what storage format the other table uses) will loss data.

tuo
  • 586
  • 1
  • 4
  • 20
  • This can be an issue with statistics: https://stackoverflow.com/a/39914232/2700344 – leftjoin May 07 '19 at 13:11
  • @leftjoin That's what I thought at first, but after I set `hive.compute.query.using.stats` to false, the counts was still the same. I tried calculating other metrics of the tmp table "test1", the sum of column values also reflected the missing of records. – tuo May 08 '19 at 02:20
  • May be records get appended into the table test1 – saravanatn Oct 25 '19 at 04:50

1 Answers1

-1

use this below serde properties :

) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"" ) STORED as TEXTFILE