
Most resources that describe a SELECT TOP ... query in Postgres say that you should use LIMIT instead, possibly with an ORDER BY clause if you need to select the top elements by some ordering.
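For reference, the usual mapping looks like this (the table and column names here are only illustrative):

-- SQL Server / Sybase:
--   SELECT TOP 10 * FROM events ORDER BY created_at DESC;

-- Postgres equivalent:
SELECT *
FROM   events
ORDER  BY created_at DESC
LIMIT  10;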

What do you do if you need to select the top N elements from a recursive query, where there is no ordering and the query can return fewer than N rows before the recursion kicks in? In that case a TOP-style clause would be needed to ensure the result set has at least N rows, whereas LIMIT allows fewer.

My specific use case is a modification of the dynamic SQL pattern for selecting a random subsample of a table.

Here is a link to the SQL source of my modification. The easiest thing is to look at the final function defined there, _random_select. It follows the above link very closely, but has been modified to be polymorphic in the input table and output result set, and to return only the columns that already existed in the input table (yet another dynamic SQL hack, needed to exclude the interim row_number column from the final result set).

It's an eyesore, but it's the closest thing I have to a reproducible example. If you use _random_select and ask for around 4500 rows from a table larger than 4500 rows, you start seeing smaller result sets with high probability, and it only gets worse as you increase the desired sample size (because duplicates become more frequent as the sample grows).

Note that in my modification, I am not using the _gaps trick from this link, meant to over-sample to offset sampling inefficiency if there are gaps in a certain index column. That part doesn't relate to this question, and in my case I am using row_number to ensure that there is an integer column with no possible gaps.

The CTE is recursive to ensure that if the first, non-recursive part of the CTE doesn't give you enough rows (because of duplicates removed by the UNION), it goes back through another round of the recursive term and keeps tacking on more results until you've got enough.
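As a rough sketch of the pattern (my own simplification, not the exact function from the linked source; it assumes a table tbl of 5000 rows with a gap-free integer id column and a target of 1000 rows):

WITH RECURSIVE random_pick AS (
    -- non-recursive term: draw a batch of random ids and join them to the table
    SELECT t.*
    FROM  (
        SELECT 1 + trunc(random() * 5000)::int AS id
        FROM   generate_series(1, 1000)
        ) r
    JOIN   tbl t USING (id)

    UNION                         -- removes duplicate rows

    -- recursive term: draw more random ids (one per row of the working table)
    SELECT t.*
    FROM  (
        SELECT 1 + trunc(random() * 5000)::int AS id
        FROM   random_pick
        ) r
    JOIN   tbl t USING (id)
)
SELECT *
FROM   random_pick
LIMIT  1000;   -- caps the output at 1000 rows, but does not guarantee 1000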

In the linked example, LIMIT is used, but I am finding that this does not work: the method returns fewer results because LIMIT is only an at-most-N-rows guarantee.

How do you get an at-least-N-rows guarantee? Selecting the TOP N rows seemed like the natural way to do this (so that the recursive CTE has to keep chugging along until it gets enough rows to satisfy the TOP condition), but this is not available in Postgres.

ely
  • Say, you aim for top 5 categories, but your shop has only 3. What should be shown for the extra 2 — "Future category 1" and "Future category 2"? – vyegorov Mar 20 '16 at 20:50
  • It's a recursive CTE so this is literally impossible. It's generating random rows and can continue doing so forever if needed. The query is shorting out before it reaches the requested number of samples, even though that number is less than the available rows in the table. – ely Mar 20 '16 at 20:55
  • `TOP` is a keyword used by SQL Server, Sybase and maybe some other databases. It works pretty much the same as `LIMIT` in Postgres. I'm not aware of any database that would have your desired behaviour (if you request N rows from a data stream that can only provide fewer than N rows, the desired behavior is that it should "hang"). Maybe you'd better focus on your actual problem, rather than on your solution. If you want to select a random subsample from a table, then ask a question about that, rather than about `LIMIT`. `TOP`/`LIMIT` only literally limits the result set, never expands it. – Vladimir Baranov Mar 23 '16 at 02:38
  • The question then is how to continue requesting rows from a recursive CTE until a minimum set of rows is returned. The example problem I'm working on is to randomly sample without duplicates, but it could be any problem where you need exactly N rows from an infinite data stream. What are the rules that govern when it will return early? One would think that a recursive CTE with no explicit stopping criterion would continue producing results until `LIMIT` stopped it. But it seems that intermediate results, before being joined to the recursive call, can be returned. How does that work? – ely Mar 23 '16 at 03:18
  • A simpler version of the problem might be to generate a column with the first N fibonacci numbers, by unioning the calculation at step i with a recursive call for step i+1, and then allow `LIMIT` to be the mechanism by which the recursion terminates. [This link](http://blog.mno2.org/posts/2015-03-07-fib-in-postgres.html) discusses it, even with the same comparison to a lazy stream like a Haskell list. But extending it to work with a more generic "pull until you get enough" through `LIMIT` (or something else) is exactly the problem. – ely Mar 23 '16 at 03:21
  • What's really puzzling me is that in my GitHub example, even if you remove all usage of `LIMIT` from *within* the CTE, it *still* terminates early. In that case, you would certainly expect that the only way it can terminate at all is because of the `LIMIT` outside of the CTE, in which case it should provide exactly the limit number of rows. How could the query internal to the CTE be finishing early and return a result set? – ely Mar 23 '16 at 03:29
  • I do apologize for all of the comments, but the closest idea I have so far is that it must be a property of `UNION`: since `UNION` requires sorting the result set for the sake of removing duplicates, and sorting the result set of a possibly infinite recursive stream is impossible, it must have some rule where it cuts off the recursive call if doing so wouldn't violate any higher-level `LIMIT` constraints that the optimizer is aware of. So even a top-most `LIMIT` won't be enough, since `UNION` will decide that returning fewer rows than that is fine for the sake of sorting. `UNION ALL` does cause it to hang in my code. – ely Mar 23 '16 at 03:41
  • Just a quick question: you apply the sampling at the table level; as PostgreSQL is a relational DBMS and is thus based on tables with a primary key, there should be no duplicate rows, so no need to remove them (and if there are duplicates and you remove them, it's not a truly random sample). And why don't you use `TABLESAMPLE`, which was introduced in PG 9.5, the version your code is supposed to work with? – dnoeth Mar 23 '16 at 16:29
  • Randomly sampling with replacement always has the chance of producing duplicates, and so you always need some other process to ensure duplicates are removed. This is using the `random` function and converting it to the right range either based on a primary key or on `row_number`. A column generated by `random` can have duplicates, so the overall sample can. One alternative which I'm exploring is to instead first select a random shuffle of the key column (or of `row_number`), then use `LIMIT` and then join. But that random shuffle is no easier. – ely Mar 23 '16 at 16:39
  • As for `TABLESAMPLE` see the first link that I shared, it specifically mentions that `TABLESAMPLE` results are not actually uniformly random, and that it too returns a variable number of rows and can be very inefficient when an exact row count is needed. – ely Mar 23 '16 at 16:41
  • @ely I think this could conceptually be done by loading all relevant ids into an array (or temp table?), and then shifting them out one-by-one in a recursive call, choosing at random based on the current array length. I'm not sure if that can be done in a single query though. – PinnyM Jun 18 '17 at 21:55

3 Answers


This is too long for a comment, but is informative about what's going on with my existing query. From the documentation on recursive query evaluation, the steps that a recursive query will take are:

1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.

2. So long as the working table is not empty, repeat these steps:

   a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

   b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

So my hunch in the comments (after trying with UNION ALL) was mostly on the right track.

As the documentation states, this is actually just a form of iteration that substitutes the previous round's results (the working table) for the recursive name used in the recursive term.

So it's more like an ever-shrinking process: the "working table" substituted for the recursive name consists only of those rows from the most recent round that were not duplicates of previous results.

Here's an example. Say we have 5000 rows in the table and we want to sample 3000 unique rows, doing a recursive sample of 1000 (potentially not unique) samples at a time.

We do the first batch of 1000 and remove duplicates, ending up with about 818 non-dupes based on the binomial distribution for these large numbers (N=5000, m=1000, k=1; rearrange terms to avoid overflow).
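As a quick way to reproduce that figure (my own back-of-envelope: the expected number of ids drawn exactly once when making m = 1000 uniform draws from N = 5000 ids, i.e. N * C(m,1) * (1/N) * (1 - 1/N)^(m-1) = m * (1 - 1/N)^(m-1)):

SELECT round(1000 * power(1 - 1.0/5000, 999)) AS expected_non_dupes;   -- about 819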

These 818 become the working table and this result set is subbed in as the recursive term for our next pass. We draw another set of about 818 unique rows, but then have to remove duplicates (UNION) when comparing to the original 818 that are in the working table. Two different random draws of 818 will have significant overlap (somewhere around 150 on average), so all of those are discarded and whatever new unique rows remain become the new working table.

So you'll add about 818 unique samples on the first draw, then the working table shrinks; you'll add about 650 on the second draw, the working table shrinks again, ... and you keep doing this until either you reach the overall total of samples desired (3000 in this case) or the working table ends up empty.

Once the working table is small enough, there will be a very high probability that everything within it will be duplicated in the next draw of 1000, at which point the working table becomes empty and the recursion terminates.

For drawing 3000 samples, you might be able to repeat this working-table update enough times. But as you move from 3000 closer to the table's total size of 5000, the probability of finishing before the working table empties shrinks toward zero very fast.

So instead of this being an optimizer issue that's short-circuiting with a smaller result set, it's actually an issue with the specific way "recursion" is implemented in Postgres -- it's not actually recursion but simple iteration that operates on the set difference between the current working table and the most recent recursive query result set. For random sampling like this, that working table will get smaller very fast with each iteration, until it's extremely likely to be empty due to the high probability of selecting a duplicate.
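To see this behavior in isolation, here is a toy query I put together (not the sampling function itself): each pass draws one random value per working-table row from the range 1..10, UNION discards anything already produced, and the recursion frequently dies before the outer LIMIT is satisfied. Switching to UNION ALL makes it run forever instead.

WITH RECURSIVE draws AS (
    SELECT (1 + floor(random() * 10))::int AS n   -- initial batch of 3 draws from 1..10
    FROM   generate_series(1, 3)
    UNION                                         -- discard duplicates
    SELECT (1 + floor(random() * 10))::int        -- one new draw per working-table row
    FROM   draws
)
SELECT n
FROM   draws
LIMIT  8;   -- often returns fewer than 8 rows, because the working table empties first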

ely
  • So, it boils down to how recursive queries with `UNION` (not ALL) are implemented in Postgres. It is good to see that you got to the bottom of it and spelled it out in the answer. I never looked into it in such detail and it was interesting to learn. From what you've written so far, it seems to me that for this particular task of selecting a subsample from a table your recursive approach is not really appropriate. I'd try to look at other (non-recursive) methods. – Vladimir Baranov Mar 24 '16 at 00:44

Your assessment is to the point. The recursive query in my referenced answer is only somewhat more flexible than the original simple query. It still requires relatively few gaps in the ID space and a sample size that is substantially smaller than the table size to be reliable.

While we need a comfortable surplus ("limit + buffer") in the simple query to cover the worst case scenario of misses and duplicates, we can work with a smaller surplus that's typically enough - since we have the safety net of a recursive query that will fill in if we should fall short of the limit in the first pass.
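For context, a condensed sketch of that "limit + buffer" surplus in the simple, non-recursive query (numbers and names are illustrative and assume a gap-free integer id column):

SELECT t.*
FROM  (
    SELECT 1 + trunc(random() * 5000)::int AS id
    FROM   generate_series(1, 1030)   -- 1000 wanted, plus a buffer of 30 for misses and duplicates
    GROUP  BY 1                       -- fold duplicate ids
    ) r
JOIN   tbl t USING (id)
LIMIT  1000;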

Either way, the technique is intended to get a small random selection from a big table quickly.

The technique is pointless for cases with too many gaps or (your focus) a sample size that is too close to the total table size - so that the recursive term might run dry before the limit is reached. For such cases a plain old:

SELECT *   -- or DISTINCT * to fold duplicate rows like UNION does
FROM   tbl
ORDER  BY random()
LIMIT  n;

... is more efficient: you are going to read most of the table anyway.

Erwin Brandstetter
  • Though the question happens to be in service of the random sampling problem, the *real* question is how to create a recursive query that acts as a lazy infinite stream, such that only the use of `LIMIT` would cause the query to terminate and produce a result set. It appears that recursive CTEs are not capable of this in general (I guess you could get an infinite query *with duplicates* if you used `UNION ALL`, which could be useful), because of the way it re-uses the most recent unique result set. Perhaps recursive functions are the right way to do it in Postgres. – ely Mar 26 '16 at 15:46
  • Do you think a temporary indexed array (or temp table) could help here? There would be no gaps, and the structure length could be used to get unique elements one-by-one in a recursive loop. The structure would need to shift out chosen elements during this loop (or virtually resize and translate to an unchanging structure)... I have no idea if this is viable or would perform as needed, but could you see this working? – PinnyM Jun 18 '17 at 21:59
  • @PinnyM: Once you can work *without gaps*, the solution is very fast and simple. There may be complications with newly added rows under concurrent load ... Maybe ask a new *question* with details of your idea? Comments are not the place for new aspects. You can always link to this one for context. – Erwin Brandstetter Jun 19 '17 at 14:31

It is possible to generate a known number of rows using Set Returning Functions (SRFs). Also, OUTER joins guarantee that one side of the join will be returned in full.

In this case, I assume a FULL JOIN against generate_series(1, 100) (if you aim for at least 100 rows) should do the trick. In fact, a LEFT JOIN might also work, but it could filter out extra rows that are actually wanted, so I would choose FULL.
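A minimal sketch of that idea (my own illustration of the shape; some_table is a hypothetical source and 100 is the target row count):

WITH sampled AS (
    SELECT row_number() OVER () AS rn, sub.*
    FROM  (
        SELECT *
        FROM   some_table             -- hypothetical table to sample from
        ORDER  BY random()
        LIMIT  100
        ) sub
)
SELECT s.*
FROM   generate_series(1, 100) AS g(i)
FULL   JOIN sampled s ON s.rn = g.i;  -- always yields at least 100 rows,
                                      -- padding with NULLs if the sample falls short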

P.S. If you could show your code, it'd be easier to provide some examples.

vyegorov
  • I did link to my code above. I think you misunderstand. The `generate_series` is not part of the join, rather a random integer column that is the same *size* as the `generate_series` is used for the join. The problem has nothing to do with that join. The problem is that duplicate results have to be removed, which is what the `UNION` does. But the query optimizer looks at it and says "Aha, after I remove duplicates with `UNION` but *before* I bother with a recursive call, the smaller set of non-dupes is already less than or equal to the `LIMIT` size, so just return it, don't try to get more." – ely Mar 20 '16 at 22:02
  • Whereas, the hope would be that the recursive CTE is treated like an *infinite* stream of random rows, and the only thing preventing it from hanging forever in an infinite recursion would be the outermost use of `LIMIT` ... if that was how it worked, then `LIMIT` would be enough to ensure that at least the desired number of rows was returned. Unfortunately, the optimizer seems to try to be more clever and check to see if a partial result set of a recursive CTE already has a satisfactory number of rows before entering the next round of recursion. And anything less than the limit will appear OK. – ely Mar 20 '16 at 22:06