Spark SQL: why does an IN clause on a partition column lead to a full table scan?

Asked Dec 17 '19 at 13:24

Active Dec 17 '19 at 13:24

Viewed 243 times

I want to get the most recent data from a partitioned Hive table. I used SQL like `select * from table where date in (select max(date) from table t)`, where `date` is the partition column, but it triggered a full scan of the Hive table. Why can't Spark SQL query the HDFS directories, get the max date, and then scan only that one partition? I found many answers that explain how to avoid the full table scan, but what I really want to know is why!
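To make the behaviour asked about here concrete, below is a minimal pure-Python sketch of the two-step lookup: resolve the maximum partition value from directory names alone, then read only that partition. It is not Spark's implementation and uses no Spark API. It assumes a Hive-style layout (`date=<value>` subdirectories) and ISO-formatted dates that sort correctly as strings. All function names (`list_partition_values`, `read_latest_partition`, `make_demo_table`) are hypothetical.

```python
import os
import tempfile


def list_partition_values(table_dir, column="date"):
    """Return sorted partition values taken from Hive-style dir names (column=value)."""
    prefix = column + "="
    return sorted(
        name[len(prefix):]
        for name in os.listdir(table_dir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(table_dir, name))
    )


def read_partition(table_dir, value, column="date"):
    """Read the rows of a single partition; no other partition directory is opened."""
    part_dir = os.path.join(table_dir, f"{column}={value}")
    rows = []
    for file_name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, file_name)) as f:
            rows.extend(line.rstrip("\n") for line in f)
    return rows


def read_latest_partition(table_dir, column="date"):
    """Step 1: find the max partition value from metadata. Step 2: read only it."""
    values = list_partition_values(table_dir, column)
    if not values:
        return None, []
    latest = max(values)  # assumption: ISO dates, so string order == date order
    return latest, read_partition(table_dir, latest, column)


def make_demo_table(table_dir, dates, column="date"):
    """Create a toy partitioned table with one single-row file per partition."""
    for d in dates:
        part_dir = os.path.join(table_dir, f"{column}={d}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000"), "w") as f:
            f.write(f"row-{d}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_table(root, ["2019-12-15", "2019-12-17", "2019-12-16"])
        latest, rows = read_latest_partition(root)
        print(latest, rows)
```

In Spark terms, the equivalent workaround is usually to run the `max(date)` query first and pass its result into the second query as a literal, since a literal predicate on the partition column can be resolved at planning time while a subquery result is only known at run time.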

asked Dec 17 '19 at 13:24

sevenitch

Welcome to Stack Overflow! Unfortunately, this question is not detailed enough to give you any meaningful help. Please edit your question to include a minimal reproducible example for the issue, including sample input, preferred output, and code for what you've tried so far. Also, since you have an error, please include the full error traceback in the text of the question. – E. Zeytinci Dec 17 '19 at 14:22
Someone else has had the same issue. You can try reading this [post](https://stackoverflow.com/questions/56994923/spark-subquery-scan-whole-partition). – Bin Zhang Feb 20 '20 at 15:25


0 Answers