I have a questions table and I need to get X questions to prepare a test. The questions need to be filtered according to multiple criteria (subject, institution, area, etc.), each with different weights.

The filter weights are dynamically set and normalized outside the query. For example:

  1. Subject 1 — 0.4
  2. Subject 2 — 0.1
  3. Subject 3 — 0.5
  4. Institution 1 — 0.2
  5. Institution 2 — 0.04
  6. Institution 3 — 0.76
  7. Area 1 — 1

Some other points:

  • Today I have 10 different filter types (subject, institution, area, etc.), and the user can select multiple values across types (e.g. 10 subjects, 5 institutions, 30 areas, etc.), as in the sample above.
  • The questions table has ~500k rows;
  • The filters have N-to-N relationships with the questions;
  • After the filtering, I want to limit the returned rows;
  • If some filter can't offer any more questions, the other ones must be considered (remember: I want to prepare a test -- if I have questions left, they must be used)
  • I’m very concerned with the performance of this query.

To illustrate, if I didn't want to weight the filters, I would do something like this:

SELECT
    *
FROM
    public.questions q
    INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
    INNER JOIN public.subjects s ON s.id = sq.subject_id
    INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
    INNER JOIN public.institutions i ON i.id = iq.institution_id
    INNER JOIN public.areas_questions aq ON aq.question_id = q.id
    INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
    s.id IN :subjects
    AND a.id IN :areas
    AND i.id IN :institutions
ORDER BY
    random()
LIMIT 200

Desired output:

Question — Subject — Institution — Area

I thought of something along these lines:

  1. Create a CTE with the questions returned by the filters; it must account for the same question being returned by more than one filter (do I need to evaluate each filter separately and UNION ALL them to solve this?). It must also record which filter each question came from;
  2. Create another CTE with the weights and their associated filters;
  3. JOIN the CTEs; at this point the questions must be grouped and the weights SUMmed;
  4. Apply a window function and return the results, limited to X rows (LIMIT X); see the runnable sketch below.
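
To make steps 1-4 concrete, here is a runnable toy version of the skeleton (hypothetical integer IDs and only two filter types; the real query would join the actual N-N tables):

WITH filter_matches (question_id, filter_type, filter_id) AS (
    -- step 1: one row per (question, matching filter), built with UNION ALL / VALUES
    VALUES (1, 'subject', 10), (1, 'institution', 20),
           (2, 'subject', 11), (3, 'subject', 10)
),
weights (filter_type, filter_id, weight) AS (
    -- step 2: the externally computed, normalized weights
    VALUES ('subject', 10, 0.4), ('subject', 11, 0.1), ('institution', 20, 0.2)
),
weighted AS (
    -- step 3: group per question, summing the weights of every matching filter
    SELECT m.question_id, SUM(w.weight) AS total_weight
    FROM filter_matches m
    JOIN weights w USING (filter_type, filter_id)
    GROUP BY m.question_id
)
-- step 4: the weighted-random ordering is the open part of this question
SELECT question_id, total_weight
FROM weighted
ORDER BY random() -- placeholder only; not yet weighted
LIMIT 2;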

How would you write such query / solve this problem?

delta
  • Sample data and desired results would really help. It is not necessarily possible to meet such constraints across different dimensions if you are pulling records from an existing sample. – Gordon Linoff Jul 05 '18 at 19:09
  • To clarify, with the above example you want a record with Subject = 1 to have a 40% chance of getting selected, Subject = 2 10%, ...Institution = 3 76% , etc... And these parameters are already calculated and stored in a table which you can query? – Error_2646 Jul 05 '18 at 19:14
  • @Error_2646 Yes, but they are not stored in any table (that's why I suggested the creation of a second CTE in step 2). Like I said, they are dynamically set outside the query. – delta Jul 05 '18 at 19:18
  • @GordonLinoff, I don't understand why you need sample data. You can consider that all tables mentioned in the example query have an autoincremented ID column only. It wouldn’t make any difference. – delta Jul 05 '18 at 19:19
  • @delta If a solution queries a Filter table and ignores the implementation you'd be able to make the modifications on your end right? It seems like that's a simple wrinkle to work out. – Error_2646 Jul 05 '18 at 19:25
  • @delta How many filters are present? Is it too many to have explicit where conditions for? If so I think you'll need to do this in a stored procedure with dynamic SQL. – Error_2646 Jul 05 '18 at 19:35
  • @Error_2646, edited to answer your question -- the size of the WHERE is not a problem now. It might be someday, but you can ignore that. And yes, the weights come from outside the schema, like I said, but if you prefer, you can assume they come from an existing table. – delta Jul 05 '18 at 19:43

2 Answers


What about something like this? This is just to demonstrate the idea; I'll leave the details up to you. In case you aren't familiar with this random selection method: if you randomly generate a number between 0 and 1, it has a 40% chance of being under 0.4, so rand() <= .4 will return true 40% of the time.
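
For instance, a quick sanity check of that claim (written in PostgreSQL syntax, since that is the asker's target DBMS):

-- Roughly 40% of 10,000 draws fall at or below 0.4
SELECT count(*) AS hits
FROM generate_series(1, 10000)
WHERE random() <= 0.4;
-- hits typically lands near 4,000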

This assumes you have, or can create, a "Filters" entity which looks a bit like this:

CREATE TABLE Filters
  ( FieldName VARCHAR(100), 
    FieldValue VARCHAR(100),
    Prob Float -- probability of selection based on Name and Value
  );

SELECT DISTINCT TMP.* -- The fields you want. Distinct needed to get rid of 
                      -- records which pass multiple conditions.
  FROM (SELECT YRSWF.*,
               RAND() AS rnd
          FROM YourResultSetWithoutFilters YRSWF -- You can code the details
       ) TMP  
 INNER
  JOIN Filters F
    ON (
       TMP.Subject = F.FieldValue
   AND F.FieldName = 'Subject'
   AND TMP.rnd <= F.prob
       )
    OR (
       TMP.Institution = F.FieldValue
   AND F.FieldName = 'Institution'
   AND TMP.rnd <= F.prob
       )
    OR ( 
       TMP.Area = F.FieldValue
   AND F.FieldName = 'Area'
   AND TMP.rnd <= F.prob
       );
Error_2646
  • Considering that filters are not applied in the subquery, there's the possibility of getting a question of Subject 1 and Area 7. The filters are "OR" when they have the same type, but "AND" between each type (this is stated in my example query). This can be solved by applying the filters in the subquery too. My question only makes sense when limiting the rows. If I want to return all rows, I don't have to weight anything. When you consider that, this solution can't work. – delta Jul 05 '18 at 21:37
  • This solution can return a variable number of rows, even if you have questions left that meet the filters condition. Ex.: You have Subject 1 (0.5) and Subject 2 (0.5) with 10 questions each; I want to limit 15 rows; with your query, I would not get 15 rows. I think that if I had said that I wanted to get the X questions to make a test, it would be clearer. I edited the question to reflect that. RAND() should be RANDOM(). I want PostgreSQL SQL. – delta Jul 05 '18 at 21:37
  • Sorry, I'd mixed up the DBMS. But you'd asked Gordon earlier why sample data matters, this is why. I'll update the solution when I'm back at my computer. – Error_2646 Jul 05 '18 at 22:34
  • In this case, the sample data has nothing to do with it. Like I said to him, you can consider that all tables have an autoincremented ID column only or 300 columns; there's no difference. – delta Jul 05 '18 at 23:05
  • @delta You might want to reconsider your tone, in this comment and in others. People are taking time to help on this site out of altruism and out of restitution for help they've previously received. Sample data and expected results make things easier. Period. Gordon is the best SQL person on this site (maybe in the world.) He's not asking for sample data for the fun of it. – Error_2646 Jul 05 '18 at 23:31
  • There's nothing wrong with my tone. You didn't engage with the arguments, only with what you consider "my tone". I did a detailed review of your answer and pointed out why it's wrong. All the sample data needed is right after: "[…] The filter weights are dynamically set and normalized outside the query. For example:". The expected results/output are very detailed too. I'm very grateful for the answers, but if you can't help, don't post anymore. – delta Jul 06 '18 at 00:05
  • @delta I didn't attack you, I gave some (I hope) constructive advice because my first thought was "do I really want to help this guy." There is something wrong with your tone. "All the sample data is posted just after..." You've posted no data. I, and many others, can solve your problem but your MO should be "how can I make this as easy as possible for people taking time out of their day to help me". I have a suggestion, scroll through some of the SQL questions asked today and try to answer them - I guarantee you'll have an easier time with the ones with sample data and expected result. – Error_2646 Jul 06 '18 at 00:29

OK, managed to solve it. Basically, I used the strategy already outlined in the question and a little help from here -- I had already seen this post before, but I was (and still am) trying to solve it in a more elegant way (something like this, but for multiple rows), without needing to create the "bounds" by hand.

Let's try step-by-step:

Since the filters, with the weights, come from outside the schema, let's create a CTE:

WITH filters (type, id, weight) AS (
    SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
    UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
    UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
    UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
    UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)

Now, let's filter the rows, ignoring the weight (for now), so later we don't need to work with the whole table:

WITH filtered_questions AS (
    SELECT
        q.id,
        s.id subject_id,
        a.id area_id,
        i.id institution_id
    FROM
        public.questions q
        INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
        INNER JOIN public.subjects s ON s.id = sq.subject_id
        INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
        INNER JOIN public.institutions i ON i.id = iq.institution_id
        INNER JOIN public.areas_questions aq ON aq.question_id = q.id
        INNER JOIN public.areas a ON a.id = aq.area_id
    WHERE
        s.id IN (SELECT id FROM filters WHERE type = 'subject')
        AND i.id IN (SELECT id FROM filters WHERE type = 'institution')
        AND a.id IN (SELECT id FROM filters WHERE type = 'area')
)

The same question can be selected by multiple filters, which would inflate its chance of being selected. To account for this, we sum the weights of every filter that matched each question, so each question appears exactly once:

WITH filtered_questions_weights_sum AS (
    SELECT
        q.id,
        SUM(filters.weight) weight_sum
    FROM filtered_questions q
    INNER JOIN filters
    ON (filters.type = 'subject' AND q.subject_id = filters.id)
    OR (filters.type = 'area' AND q.area_id = filters.id)
    OR (filters.type = 'institution' AND q.institution_id = filters.id)
    GROUP BY q.id
)
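
As a quick illustration (hypothetical data, integer IDs instead of UUIDs), a question matched by two filters comes out as one row with the weights summed:

WITH filters (type, id, weight) AS (
    VALUES ('subject', 1, 0.5), ('institution', 2, 0.5)
),
filtered_questions (id, subject_id, area_id, institution_id) AS (
    VALUES (42, 1, NULL::int, 2)  -- one question hit by both filters
)
SELECT q.id, SUM(filters.weight) AS weight_sum
FROM filtered_questions q
INNER JOIN filters
    ON (filters.type = 'subject' AND q.subject_id = filters.id)
    OR (filters.type = 'area' AND q.area_id = filters.id)
    OR (filters.type = 'institution' AND q.institution_id = filters.id)
GROUP BY q.id;

-- id | weight_sum
-- 42 |        1.0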

Generating the bounds, as shown here:

WITH cumulative_prob AS (
    SELECT
        id,
        SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
    FROM filtered_questions_weights_sum
),
cumulative_bounds AS (
    SELECT
        id,
        COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
        cum_prob AS upper_cum_bound
    FROM cumulative_prob
)
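
To see what the bounds look like, here is a self-contained run with three made-up weights (hypothetical IDs and weights):

WITH filtered_questions_weights_sum (id, weight_sum) AS (
    VALUES (1, 2.0), (2, 1.9), (3, 2.0)
),
cumulative_prob AS (
    SELECT id, SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
    FROM filtered_questions_weights_sum
)
SELECT
    id,
    COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
    cum_prob AS upper_cum_bound
FROM cumulative_prob;

-- id | lower_cum_bound | upper_cum_bound
--  1 |               0 |             2.0
--  2 |             2.0 |             3.9
--  3 |             3.9 |             5.9

Each question owns a slice of the [0, total] number line proportional to its summed weight.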

Generating the random series. I had to re-normalize (random() * (SELECT SUM(weight_sum) ...)) because the weights were updated in the previous step: random() alone draws from [0, 1), but the cumulative bounds span [0, SUM(weight_sum)], so each draw must be scaled by the total. The 10 is the number of rows that we want to return.

WITH random_series AS (
    SELECT generate_series (1,10),random() * (SELECT SUM(weight_sum) FROM filtered_questions_weights_sum) AS R
)

And finally:

SELECT
      id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')

And we get the following distribution:

id                                   lower_cum_bound upper_cum_bound r                   
------------------------------------ --------------- --------------- ------------------- 
380f46e9-f373-4b89-a863-05f484e6b3b6 0               2.0             0.41090718149207534 
42bcb088-fc19-4272-8c49-e77999edd01c 2.0             3.9             3.4483200465794654  
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.159445870062337   
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.524481557868421   
972d0296-acc3-4b44-b67d-928049d5e9c2 5.9             7.8             6.842470594821498   
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.207371663767844  
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.674184153741226  
c935e3de-f1b6-4399-b5eb-ed3a9194eb7b 15.5            17.5            17.16804686235264   
e5061aeb-53b7-4247-8404-87508c5ac723 21.4            23.4            22.622627633158118  
f8c37700-0c3a-457e-8882-7c65269482ea 25.4            27.3            26.841821723571048  
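
The join works because <@ tests range containment, and the '(]' bound specification makes the buckets half-open, so every draw lands in exactly one bucket:

-- A draw of 3.44 lands only in the (2.0, 3.9] bucket
SELECT 3.44::numeric <@ numrange(2.0, 3.9, '(]');  -- true
SELECT 2.0::numeric  <@ numrange(2.0, 3.9, '(]');  -- false: lower bound excluded
SELECT 2.0::numeric  <@ numrange(0, 2.0, '(]');    -- true: upper bound included

(As the sample above shows, independent draws can land in the same bucket, so the same question can appear more than once.)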

Putting it all together:

WITH filters (type, id, weight) AS (
        SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
        UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
        UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
        UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
        UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
        )
    ,
    filtered_questions AS
    (
        SELECT
            q.id,
            SUM(filters.weight) weight_sum
        FROM
        public.questions q
        INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
        INNER JOIN public.subjects s ON s.id = sq.subject_id
        INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
        INNER JOIN public.institutions i ON i.id = iq.institution_id
        INNER JOIN public.areas_questions aq ON aq.question_id = q.id
        INNER JOIN public.areas a ON a.id = aq.area_id
        INNER JOIN filters
            ON (filters.type = 'subject' AND s.id IN(filters.id))
            OR (filters.type = 'area' AND a.id IN(filters.id))
            OR (filters.type = 'institution' AND i.id IN(filters.id))
        WHERE
            s.id IN (SELECT id from filters where type = 'subject')
            and i.id IN (SELECT id from filters where type = 'institution')
            and a.id IN (SELECT id from filters where type = 'area')
        GROUP BY q.id
    )
    ,
    cumulative_prob AS (
        SELECT
            id,
            SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
        FROM filtered_questions
    )
    ,
    cumulative_bounds AS (
        SELECT
            id,
            COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
            cum_prob AS upper_cum_bound
        FROM cumulative_prob
    )
    ,
    random_series AS
    (
        SELECT generate_series(1, 10), random() * (SELECT SUM(weight_sum) FROM filtered_questions) AS R
    )
SELECT id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
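
To get the desired "Question — Subject — Institution — Area" output from this, the drawn ids just need to be joined back to questions and the same N-N tables used in the filtered_questions step.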
delta