2

A weird behaviour I never saw.

A have parquet files in my HDFS generated by Spark, split into 3 partitions.

Here is the count for all partitions. It was calculated by Spark.

+----------+--------+
|created_at|count(1)|
+----------+--------+
|  20190101|12774895|
|  20181231|18648432|
|  20190102|30010065|
+----------+--------+

Total: 61433392 records

Hive

Created a table.

CREATE EXTERNAL TABLE raw.event (
    account_type STRING,
    event_name STRING,
    payload MAP<STRING, STRING>
)
PARTITIONED BY(created_at INT, product_name STRING, event_type STRING)
STORED AS PARQUET LOCATION '/datalake/raw/event'
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY');

Added all partitions.

show partitions raw.event;

+---------------------------------------------------------------------+--+
|                              partition                              |
+---------------------------------------------------------------------+--+
| created_at=20181231/product_name=A/event_type=X  |
| created_at=20190101/product_name=A/event_type=X  |
| created_at=20190102/product_name=A/event_type=X  |
+---------------------------------------------------------------------+--+

Execute a full count to make sure everything run smoothly:

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event;

+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Expected result! Ok :)

Now count just from one partition.

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event where created_at=20190102 and product_name='A' and event_type='X';

Instead get the expected 12774895 count rows I am still getting the full count.

+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Now I am trying to count and group by the created_at partition.

0: jdbc:hive2://headnodehost:10001/> SELECT created_at, count(*) FROM raw.event group by created_at;

+-------------+-----------+--+
| created_at  |    _c1    |
+-------------+-----------+--+
| 20190102    | 61433392  |
+-------------+-----------+--+

I am still getting the full count of rows, and just the last added partition.

I am pretty sure the parquet files contains different created_at values.

Avoid using stats, also does not help.

set hive.compute.query.using.stats=false;

Analyse table also does not help:

0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at, product_name, event_type) COMPUTE STATISTICS;

INFO  : Partition raw.event{created_at=20181231, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]
INFO  : Partition raw.event{created_at=20190101, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]
INFO  : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]

Forcing a partition brings the same result.

0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at=20190102, product_name, event_type) COMPUTE STATISTICS;

INFO  : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]

Any suggestion?

10465355
  • 4,481
  • 2
  • 20
  • 44
bsao
  • 411
  • 4
  • 7

1 Answers1

0

This is probably related to how you added the partitions. It looks like all your data is under the same directory in the HDFS. Partitions point to a specific location in the hdfs (not multiple locations), so I would assume 1 partition is pointing to the whole dataset. You can run the following command to verify partition location

describe formatted raw.event  partition ( created_at=20181231,product_name="A",event_type="X");

Also, partitions do not depend on the data itself or even directory, location and value depend on how you added them to the table

hlagos
  • 7,690
  • 3
  • 23
  • 41
  • There should be a space in between `raw.event` and `partition` in the command given. The minimum edit is 6 characters so I could not edit it myself. Also, I'm on the same team as OP and running that command verified the partition location was at `/raw/event/created_at=20190101/product_name=A/event_type=X` – Vishaal Kalwani Jan 24 '19 at 19:02
  • @VishaalKalwani, what about the rest of the partition? whjat is the location and is there any data there? – hlagos Jan 24 '19 at 19:15
  • For all 3 partitions (20191291, 20190101, 20190102), the locations all follow the same pattern. In fact, the query is coming back with entirely different results tan before (using both `*.stats=true` and `*.stats=false`, but it's the *same* different result). Is there a propagation period for configuration settings? The query `SELECT created_at, count(*) FROM raw.event group by created_at;` now returns: `{ 20181231 => 1262881758, 20190101 => 1177216179, 20190102 => 2148549023 }`. I modified the format of the output to be legible in this comment. – Vishaal Kalwani Jan 24 '19 at 19:19
  • Also, it looks like it matches the Databricks results now, albeit with different numbers than OP posted. I will confer with OP about these results. Very strange that it did not initially work. – Vishaal Kalwani Jan 24 '19 at 19:43