
I am executing a query like `select <column> from <mytable> where <partition_key> = <value> limit 10`

and it is taking FOREVER to execute. I looked at the physical plan and saw a `HiveTableScan` in there, which looked fishy. Does that mean the query is scanning the entire table? I was expecting the query to:

A. exactly scan 1 partition and no more

B. end the scan as soon as it returns 10 rows

Is my understanding incorrect? How do I make Spark perform exactly this?
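For context, here is a toy sketch (plain Python, not Spark) of the two behaviors I expected. Hive-style partitioning lays a table out as one directory per partition value (`<table>/<partition_key>=<value>/`), so pruning should open only the matching directory, and a `limit` should stop reading once enough rows are collected. All names here are made up for illustration:

```python
import csv
import os
import tempfile

def write_partitioned_table(root, rows, partition_key):
    # Hive-style layout: one directory per partition value,
    # e.g. <root>/pk=1/part.csv
    for row in rows:
        part_dir = os.path.join(root, f"{partition_key}={row[partition_key]}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part.csv"), "a", newline="") as f:
            csv.writer(f).writerow([row["col"]])

def pruned_scan(root, partition_key, value, limit):
    # (A) Partition pruning: only the directory for the requested
    #     value is opened; other partitions are never touched.
    part_dir = os.path.join(root, f"{partition_key}={value}")
    out = []
    with open(os.path.join(part_dir, "part.csv")) as f:
        for line in csv.reader(f):
            out.append(line[0])
            if len(out) >= limit:
                # (B) LIMIT early exit: stop scanning after 10 rows.
                break
    return out

root = tempfile.mkdtemp()
rows = [{"pk": i % 3, "col": f"v{i}"} for i in range(30)]
write_partitioned_table(root, rows, "pk")
print(pruned_scan(root, "pk", 1, 10))
```

This is only a sketch of the semantics I assumed; whether Spark's `HiveTableScan` actually prunes and short-circuits like this is exactly what I'm asking.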

user1639848
  • LIMIT, but no ORDER BY? – jarlh Mar 06 '19 at 08:38
  • @jarlh what do you mean? I don't have an order by in my query. It looks exactly like what I have mentioned. I think you mean an order by might need a global scan to do a full sort on the partition, but I am definitely not asking it to do that – user1639848 Mar 06 '19 at 09:02
  • Without an ORDER BY those 10 rows will be just 10 arbitrary rows, but perhaps that doesn't matter? – jarlh Mar 06 '19 at 09:12
  • It seems that your Spark version doesn't optimize the limit clause yet. What's the Spark version you're using? – Jiayi Liao Mar 06 '19 at 09:23
  • Please read [How to make good reproducible Apache Spark examples](https://stackoverflow.com/q/48427185/10465355) and [edit] your question accordingly. At least we'll need a definition of the table, how it was created (Hive or Spark) and the execution plan. – 10465355 Mar 06 '19 at 10:06

0 Answers