
I am executing a query like `select <column> from <mytable> where <partition_key> = <value> limit 10`

and it is taking FOREVER to execute. I looked at the physical plan and saw a `HiveTableScan` in there, which looked fishy. Does that mean the query is scanning the entire table? I was expecting the query to:

A. exactly scan 1 partition and no more

B. end the scan as soon as it returns 10 rows

Is my understanding incorrect? How do I make Spark perform exactly this?
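For context, here is a toy sketch (plain Python, not Spark) of the two behaviors I expected. Hive-style partitioning lays a table out as one directory per partition value (`<table>/<partition_key>=<value>/`), so pruning should open only the matching directory, and a `limit` should stop reading once enough rows are collected. All names here are made up for illustration:

```python
import csv
import os
import tempfile

def write_partitioned_table(root, rows, partition_key):
    # Hive-style layout: one directory per partition value,
    # e.g. <root>/pk=1/part.csv
    for row in rows:
        part_dir = os.path.join(root, f"{partition_key}={row[partition_key]}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part.csv"), "a", newline="") as f:
            csv.writer(f).writerow([row["col"]])

def pruned_scan(root, partition_key, value, limit):
    # (A) Partition pruning: only the directory for the requested
    #     value is opened; other partitions are never touched.
    part_dir = os.path.join(root, f"{partition_key}={value}")
    out = []
    with open(os.path.join(part_dir, "part.csv")) as f:
        for line in csv.reader(f):
            out.append(line[0])
            if len(out) >= limit:
                # (B) LIMIT early exit: stop scanning after 10 rows.
                break
    return out

root = tempfile.mkdtemp()
rows = [{"pk": i % 3, "col": f"v{i}"} for i in range(30)]
write_partitioned_table(root, rows, "pk")
print(pruned_scan(root, "pk", 1, 10))
```

This is only a sketch of the semantics I assumed; whether Spark's `HiveTableScan` actually prunes and short-circuits like this is exactly what I'm asking.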

user1639848
  • LIMIT, but no ORDER BY? – jarlh Mar 06 '19 at 08:38
  • @jarlh what do you mean? I don't have an order by in my query. It looks exactly like what I have mentioned. I think you mean an order by might need a global scan to do a full sort on the partition, but I am definitely not asking it to do that – user1639848 Mar 06 '19 at 09:02
  • Without an ORDER BY those 10 rows will be just 10 arbitrary rows, but perhaps that doesn't matter? – jarlh Mar 06 '19 at 09:12
  • It seems that your Spark version doesn't optimize the limit clause yet. What's the Spark version you're using? – Jiayi Liao Mar 06 '19 at 09:23
  • Please read [How to make good reproducible Apache Spark examples](https://stackoverflow.com/q/48427185/10465355) and [edit] your question accordingly. At least we'll need a definition of the table, how it was created (Hive or Spark) and the execution plan. – 10465355 Mar 06 '19 at 10:06

0 Answers