A weird behaviour I never saw.
A have parquet files in my HDFS generated by Spark, split into 3 partitions.
Here is the count for all partitions. It was calculated by Spark.
+----------+--------+
|created_at|count(1)|
+----------+--------+
| 20190101|12774895|
| 20181231|18648432|
| 20190102|30010065|
+----------+--------+
Total: 61433392 records
Hive
Created a table.
CREATE EXTERNAL TABLE raw.event (
account_type STRING,
event_name STRING,
payload MAP<STRING, STRING>
)
PARTITIONED BY(created_at INT, product_name STRING, event_type STRING)
STORED AS PARQUET LOCATION '/datalake/raw/event'
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY');
Added all partitions.
show partitions raw.event;
+---------------------------------------------------------------------+--+
| partition |
+---------------------------------------------------------------------+--+
| created_at=20181231/product_name=A/event_type=X |
| created_at=20190101/product_name=A/event_type=X |
| created_at=20190102/product_name=A/event_type=X |
+---------------------------------------------------------------------+--+
Execute a full count to make sure everything run smoothly:
0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event;
+-----------+--+
| _c0 |
+-----------+--+
| 61433392 |
+-----------+--+
Expected result! Ok :)
Now count just from one partition.
0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event where created_at=20190102 and product_name='A' and event_type='X';
Instead get the expected 12774895 count rows I am still getting the full count.
+-----------+--+
| _c0 |
+-----------+--+
| 61433392 |
+-----------+--+
Now I am trying to count and group by the created_at partition.
0: jdbc:hive2://headnodehost:10001/> SELECT created_at, count(*) FROM raw.event group by created_at;
+-------------+-----------+--+
| created_at | _c1 |
+-------------+-----------+--+
| 20190102 | 61433392 |
+-------------+-----------+--+
I am still getting the full count of rows, and just the last added partition.
I am pretty sure the parquet files contains different created_at values.
Avoid using stats, also does not help.
set hive.compute.query.using.stats=false;
Analyse table also does not help:
0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at, product_name, event_type) COMPUTE STATISTICS;
INFO : Partition raw.event{created_at=20181231, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]
INFO : Partition raw.event{created_at=20190101, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]
INFO : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]
Forcing a partition brings the same result.
0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at=20190102, product_name, event_type) COMPUTE STATISTICS;
INFO : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]
Any suggestion?