15

The documentation of HIVE notes that LIMIT clause returns rows chosen at random. I have been running a SELECT table on a table with more than 800,000 records with LIMIT 1, but it always return me the same record.

I'm using the Shark distribution, and I am wondering whether this has got anything to do with this not expected behavior? Any thoughts would be appreciated.

Thanks, Visakh

ChrisGPT was on strike
  • 127,765
  • 105
  • 273
  • 257
visakh
  • 2,503
  • 8
  • 29
  • 55
  • 4
    I do not know HIVE, so I will not answer. In most RDBMS, the randomness declared in the use of LIMIT is due to the fact that which rows you get depends on how the optimizer decides to get them. In other words, if you want specific rows, you'll need to declare it explicitly (usually with an ORDER BY). It does not mean that the engine is rolling die every time you execute the query, it just means you cannot expect the result to always be the same. – Frazz May 22 '14 at 08:59
  • In that case, do you think it would be better to create a sequence, assign to the output records and then pick one by random? Are you aware of any industry standard way of retrieving truly random records from any RDBMS? – visakh May 22 '14 at 09:02
  • 1
    Seems `HIVE` supports `RAND()`. So, I think by doing `ORDER BY RAND()` and then `LIMIT 1` does the job. – visakh May 22 '14 at 09:05

3 Answers3

12

Even though the documentation states it returns rows at random, it's not actually true.

It returns "chosen rows at random" as it appears in the database without any where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.

As soon as you slap a order by x DESC limit 5 on there, it returns the last 5 rows of whatever you're selecting from.

To get rows returned at random, you would need to use something like: order by rand() LIMIT 1

However it can have a speed impact if your indexes aren't setup properly. Usually I do a min/max to get the ID's on the table, and then do a random number between them, then select those records (in your case, would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset

user3036342
  • 1,023
  • 7
  • 15
  • Thank you...just saw the same [here](http://mail-archives.apache.org/mod_mbox/hive-user/201208.mbox/%3C899850488-1345046651-cardhu_decombobulator_blackberry.rim.net-1080470187-@b2.c15.bise7.blackberry%3E) – visakh May 22 '14 at 09:06
  • Since I'm on `Shark`, indexes aren't supported anyway. And my table is cached, so for now, performance is not an issue...:-) – visakh May 22 '14 at 09:07
11

To be safe you want to use

select * from table

distribute by rand()

sort by rand()

limit 10000;

Community
  • 1
  • 1
Keith
  • 4,646
  • 7
  • 43
  • 72
4

The documentation may have been updated since this question was originally posted in 2014, but as of December,2017, the documentation now reads, "The following query returns 5 arbitrary customers".

In this case, "arbitrary" means method of selecting either is not deterministic or may not be worth the trouble to document. In other words, you shouldn't count on it as a reliable method for getting specific subset of records (e.g., for sampling). You should only use the Limit clause without an Order By clause if you are looking for expediency and want to get a small result set as quickly as possible (e.g., for QA purposes). Otherwise, use one of Order By, Cluster By, or Distribute By/Sort By as appropriate.

Rick
  • 41
  • 1