How to EFFICIENTLY query n records per category

Question

To select N records per category one can do:

SELECT category, category_id, value FROM
(
    SELECT category, value, row_number() OVER (PARTITION by category) as category_id
    FROM myTable
)
WHERE  category_id < N;

The inner SELECT will first partition the records per category and assign each record per category an id called category_id. The outer query will then use the category_id to limit the number of records it queries per category.

This is extremely inefficient on BIG tables as it will be going through assigning ids to all the records even though we are only interested in N records per category.

The following does not work on the sql engine that I am working with - not sure if it works on any engine at all.

SELECT category, value, row_number() OVER (PARTITION by category) as category_id
FROM myTable
WHERE category_id < N

Does anyone know of any other ways of achieving this with a better time complexity?

More thoughts:

Time profiling the following algorithm against above query might provide more insights as to how the query runs behind the scene:

   1. SELECT DISTINCT(category) FROM myTable
   2. FOREACH category SELECT N rows

more info: my data is physically partitioned by category, being able to explicitly leverage that would be useful

even if your second query would work on some RDBMS, the execution plan would probably be the same as the first one — Lamak, Oct 06 '17 at 15:24
you could try to dump your derived table into a #temp and create index on that, then query it — LONG, Oct 06 '17 at 15:40
@LONG to dynamically dump into a temp table (with a guarantee that it contains at least N records per category), I will need to perform a similar query. If you know of a different way to do so, kindly write it up in an answer. — Zahra, Oct 06 '17 at 15:55
Your sample query doesn't make much sense -- you're returning `n` rows each containing the same value (`category`), you might as well just hard-code it. Can you please update the question with a more realistic query? — mustaccio, Oct 06 '17 at 17:12
@mustaccio I added a `value` column to make it easier to understand; however, this does not add any useful info to my original question which is asking about the time complexity. — Zahra, Oct 06 '17 at 18:26
@Lamak probably. I was kind of hoping since the latter keeps partitioning and the filter limit N on the same scope, sql engine would be smarter at optimizing the query. — Zahra, Oct 06 '17 at 18:35
I disagree that it's inefficient. Any algorithm would have to sort each row (w/r to its partition) to determine if it's < N and in order to sort all rows you must look at all rows at least once. — FuzzyTree, Oct 06 '17 at 18:36
@FuzzyTree it does not have to sort it unless the query explicitly specifies `(PARTITION BY category ORDER BY something)`. what I am wondering is whether I can somehow push the partitioning procedure to stop its increment as soon as it numbers N rows per category. — Zahra, Oct 06 '17 at 18:41

mustaccio · Answer 1 · 2017-10-06T19:37:31.780

As @Lamak mentioned in a comment, you cannot avoid sorting all rows in the table, but not for the reason stated. A sort is required to determine distinct categories by which the result set should be partitioned, and, in the absence of explicit ordering within each partition, row numbers are easily determined as a side effect of the category sort.

How the query runs "behind the scenes", or, if using the correct term, its execution plan is determined by the presence (or absence) of an index that might help avoid that category sort. If you had a covering index on (category, value), and whatever other columns you need in the result, your query would run much more efficiently.

In that latter case the simplified algorithm might look more like this:

Read the pre-sorted records containing all necessary columns, including row numbers, from the index.
Discard records with row number greater than n.

Your "ideal" query

SELECT category, value, row_number() OVER (PARTITION by category) as
category_id FROM myTable WHERE category_id < N

probably wouldn't run in any SQL database, because the SELECT list is processed after the WHERE clause predicates, so category_id is unknown when the predicates are evaluated.

in my case the table is physically partitioned on hdfs by `category`, db2 does not seem to be smart enough to leverage that. — Zahra, Oct 06 '17 at 19:39
You may want to describe your software stack in greater detail. DB2 proper doesn't know or care about hdfs. Are you talking about BigSQL, may be? — mustaccio, Oct 06 '17 at 19:53
do you guys know which component in ibm InfoSphere BigInsights creates the execution plan? — Zahra, Oct 10 '17 at 21:17

score 0 · Answer 2 · answered Oct 07 '17 at 07:40

Other method of rownumber, but i have doubts of performance too. I agree @mustaccio. My example take 5 rows...

select distinct f1.category, f3.*             
from yourtable f1                        
inner join lateral                                          
(                                                           
 select f2.value from yourtable f2              
 where f2.category=f1.category 
 fetch first 5 rows only                                    
) f3 on 1=1

How to EFFICIENTLY query n records per category

2 Answers2