11

I have two tables, custassets and tags. To generate some test data I'd like to do an INSERT INTO a many-to-many table with a SELECT that gets random rows from each (so that a random primary key from one table is paired with a random primary key from the second). To my surprise this isn't as easy as I first thought, so I'm persisting with this to teach myself.

Here's my first attempt. I select 10 custassets and 3 tags, but the cross join pairs every custasset with the same three tags. I'd be fine with the first table being fixed, but I'd like to randomise the tags assigned to each custasset.

SELECT
    custassets_rand.id custassets_id,
    tags_rand.id tags_rand_id
FROM
    (
        SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 10
    ) AS custassets_rand
,
    (
        SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 3
    ) AS tags_rand

This produces:

custassets_id | tags_rand_id 
---------------+--------------
          9849 |         3322  }
          9849 |         4871  } this pattern of tag PKs is repeated
          9849 |         5188  }
         12145 |         3322
         12145 |         4871
         12145 |         5188
         17837 |         3322
         17837 |         4871
         17837 |         5188
....

I then tried the following approach: doing the second RANDOM() call in the SELECT column list. However, this one was worse, as it chooses a single tag PK and sticks with it for every row.

SELECT
    custassets_rand.id custassets_id,
    (SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 1) tags_rand_id
FROM
    (
        SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 30
    ) AS custassets_rand

Result:

 custassets_id | tags_rand_id 
---------------+--------------
         16694 |         1537
         14204 |         1537
         23823 |         1537
         34799 |         1537
         36388 |         1537
....

This would be easy in a scripting language, and I'm sure it could be done quite easily with a stored procedure or a temporary table. But can I do it just with an INSERT INTO ... SELECT?

I did think of choosing integer primary keys using a random function, but unfortunately the primary keys of both tables have gaps in their sequences (so a non-existent row might be chosen in either table). That would have been fine otherwise!

halfer

6 Answers

18

Note that what you are looking for is not a Cartesian product, which would produce n*m rows; rather a random 1:1 association, which produces GREATEST(n,m) rows.

To produce truly random combinations, it's enough to randomize rn for the bigger set:

SELECT c_id, t_id
FROM  (
   SELECT id AS c_id, row_number() OVER (ORDER BY random()) AS rn
   FROM   custassets
   ) x
JOIN   (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);

If arbitrary combinations are good enough, this is faster (especially for big tables):

SELECT c_id, t_id
FROM   (SELECT id AS c_id, row_number() OVER () AS rn FROM custassets) x
JOIN   (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);

If the numbers of rows in the two tables do not match and you do not want to lose rows from the bigger table, use the modulo operator % to join rows from the smaller table multiple times:

SELECT c_id, t_id
FROM  (
   SELECT id AS c_id, row_number() OVER () AS rn
   FROM   custassets -- table with fewer rows
   ) x
JOIN  (
   SELECT id AS t_id, (row_number() OVER () % small.ct) + 1 AS rn
   FROM   tags
       , (SELECT count(*) AS ct FROM custassets) AS small
   ) y USING (rn);

Window functions were added with PostgreSQL 8.4.
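To actually fill the many-to-many table as asked, the first query can feed an INSERT directly. A minimal sketch, assuming a junction table named custassets_tags(custassets_id, tags_id) (the real name isn't given in the question) and re-applying the defunct = false filter from the question:

-- custassets_tags is a hypothetical junction table name; adjust to your schema
INSERT INTO custassets_tags (custassets_id, tags_id)
SELECT c_id, t_id
FROM  (
   SELECT id AS c_id, row_number() OVER (ORDER BY random()) AS rn
   FROM   custassets
   WHERE  defunct = false
   ) x
JOIN  (
   SELECT id AS t_id, row_number() OVER () AS rn
   FROM   tags
   WHERE  defunct = false
   ) y USING (rn);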

Erwin Brandstetter
  • Erwin, thanks for your thorough answer - much appreciated. I now ought to look up `WITH` and `USING` as well! :) – halfer May 16 '12 at 12:07
  • @halfer: No worries, both are easy to understand. CTE's are basically subqueries that can be used multiple times and `USING (rn)` is basically short for `ON x.rn = y.rn`. There are subtle differences, though. Just follow my links. – Erwin Brandstetter May 16 '12 at 12:14
5

This numbers the rows of both tables in random order, generates a series as long as the larger table, and joins each series value back to one row from each table via modulo arithmetic, so rows of the smaller table are reused:

WITH a_ttl AS (
    SELECT count(*) AS ttl FROM custassets c),
b_ttl AS (
    SELECT count(*) AS ttl FROM tags),
rows AS (
    SELECT gs.*
      FROM generate_series(1,
           (SELECT max(ttl) AS ttl FROM
              (SELECT ttl FROM a_ttl UNION SELECT ttl FROM b_ttl) AS m))
           AS gs(row)),
tab_a_rand AS (
    SELECT id AS custassets_id, row_number() OVER (order by random()) as row
      FROM custassets),
tab_b_rand AS (
    SELECT id, row_number() OVER (order by random()) as row
      FROM tags)
SELECT a.custassets_id, b.id
  FROM rows r
  JOIN a_ttl ON 1=1 JOIN b_ttl ON 1=1
  LEFT JOIN tab_a_rand a ON a.row = (r.row % a_ttl.ttl)+1
  LEFT JOIN tab_b_rand b ON b.row = (r.row % b_ttl.ttl)+1
 ORDER BY 1,2;

You can test this query on SQL Fiddle.

vyegorov
2

Here is a different approach to pick a single combination from two tables at random, assuming two tables a and b, both with primary key id. The tables needn't be the same size, and the second row is chosen independently of the first, which probably doesn't matter for test data.

SELECT * FROM a, b 
 WHERE a.id = (
    SELECT id 
    FROM a 
    OFFSET (
        SELECT random () * (SELECT count(*) FROM a)
    ) 
    LIMIT 1) 
 AND b.id = (
    SELECT id 
    FROM b 
    OFFSET (
        SELECT random () * (SELECT count(*) FROM b)
        ) 
    LIMIT 1);

Tested with two tables, one with 7,000 rows and one with 100k rows: the result came back immediately. For more than one result you have to call the query repeatedly; increasing the LIMIT and changing x.id = to x.id IN would produce (aA, aB, bA, bB) result patterns.
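A minimal sketch of that IN variant, under the same assumptions (tables a and b with primary key id); note that the rows picked on each side come from consecutive offsets, so they are adjacent rather than independently random:

SELECT * FROM a, b
 WHERE a.id IN (
    SELECT id
    FROM a
    OFFSET (SELECT random() * (SELECT count(*) FROM a))
    LIMIT 2)
 AND b.id IN (
    SELECT id
    FROM b
    OFFSET (SELECT random() * (SELECT count(*) FROM b))
    LIMIT 2);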

user unknown
1

It bugs me that after all these years of relational databases, there don't seem to be very good cross-database ways of doing things like this. The MSDN article http://msdn.microsoft.com/en-us/library/cc441928.aspx seems to have some interesting ideas, but of course that's not PostgreSQL. And even then, their solution requires a single pass, when I'd think it ought to be doable without the scan.

I can imagine a few ways that might work without a pass (at selection time), but they would involve creating another table that maps your table's primary keys to random numbers (or to a linear sequence that you later select from randomly, which in some ways may actually be better), and of course that may have issues as well.

I realize this is probably a non-useful comment, I just felt I needed to rant a bit.
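For what it's worth, a rough sketch of the mapping-table idea from the previous paragraph, in the 'linear sequence you later select from randomly' flavour; the table name custassets_shuffle is invented for illustration, and it has to be rebuilt whenever custassets changes:

-- One-off: map each custassets PK to a gapless, randomly ordered position
CREATE TABLE custassets_shuffle AS
SELECT id, row_number() OVER (ORDER BY random()) AS pos
FROM   custassets;

CREATE INDEX custassets_shuffle_pos_idx ON custassets_shuffle (pos);

-- Picking a random row is now an indexed lookup on a gapless sequence
SELECT id
FROM   custassets_shuffle
WHERE  pos = (SELECT 1 + floor(random() * count(*))::int FROM custassets_shuffle);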

JayC
  • Heh, well, if the answer is 'not possible', then that's fair enough `:)`. We'll see what other answers come in. – halfer May 15 '12 at 15:51
  • Indeed, I also want to see what other answers come in. I didn't mean to imply the answer isn't possible, I just meant "not good" as in a particular solution either seems to require a lot of setup or near full table scans. I have to admit I'm not sure what's wrong with your query. – JayC May 15 '12 at 15:56
  • 3
    You may be missing that most modern RDBMS support window functions nowadays (MySQL being the inglorious exception). All answers here should basically work the same in MSSQL, Oracle and PostgreSQL. – Erwin Brandstetter May 15 '12 at 20:30
1

If you just want to get a random set of rows from each side, use a pseudo-random number generator. I would use something like:

select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
      from a
     ) a cross join
     (select b.*,  row_number() over (order by NULL) as rownum
      from b
     ) b
where a.rownum <= 30 and b.rownum <= 30

This is doing a Cartesian product, which returns 900 rows assuming a and b each have at least 30 rows.

However, I interpreted your question as getting random combinations. Once again, I'd go for the pseudo-random approach.

select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
      from a
     ) a cross join
     (select b.*,  row_number() over (order by NULL) as rownum
      from b
     ) b
where modf(a.rownum*107+b.rownum*257+17, 101) < <some value>

This lets you get combinations among arbitrary rows.
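Postgres has no modf() (as noted in the comments below), but the same idea works with the % operator. A sketch adapted to the question's tables, with an arbitrary keep-about-one-in-ten filter and a LIMIT as the asker ended up using (the threshold 10 and LIMIT 50 are illustrative values):

SELECT c.id AS custassets_id, t.id AS tags_id
FROM  (SELECT id, row_number() OVER () AS rownum FROM custassets) c
CROSS JOIN
      (SELECT id, row_number() OVER () AS rownum FROM tags) t
WHERE (c.rownum * 107 + t.rownum * 257 + 17) % 101 < 10
LIMIT 50;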

Gordon Linoff
  • Thanks for the reply; yes, it's random combinations I need (I've added the problematic resultsets to the question for clarity). I tried your second query, but I'm not sure `OVER` is supported by Postgres (8.4). Is that a MSSQL Server-only keyword? – halfer May 15 '12 at 19:02
  • @halfer: Window functions (including `row_number()`) are [supported in Postgres 8.4](http://www.postgresql.org/docs/8.4/interactive/functions-window.html). However, `OVER (ORDER BY NULL)` is just noise and can be simplified to `OVER ()`. Neither is good at producing random results. You get an implementation specific, arbitrary order, mostly in the same sequence as the rows were entered. – Erwin Brandstetter May 15 '12 at 19:42
  • @ErwinBrandstetter - thanks for that. I thoroughly searched for 'postgresql over', but must have missed it - perhaps 'over' is too common a word! I'm not familiar with this set of functions, so I will read up on them. – halfer May 15 '12 at 20:44
  • The point of the order by in the windowing clause is to generate a sequence in an arbitrary order. Other values can definitely be included. The pseudo-random number generator is going to behave like an "nth" selection rather than a "top" selection, so even having a real ordering will probably give pseudo-random results. – Gordon Linoff May 15 '12 at 20:50
  • Gordon, many thanks. PG complained about `modf` so I put my own `%` thing in, with a `LIMIT 50`, and it worked a treat. You're right about pseudo-randomness - I tried `ORDER BY RANDOM()` and the query carried on running for many minutes after I cancelled it! (I have 50k and 5k rows in these tables respectively). And my use case doesn't need true randomness anyway. – halfer May 15 '12 at 20:54
1

Just a plain Cartesian product ON random() appears to work reasonably well. Easy as pie...

-- Cartesian product
-- EXPLAIN ANALYZE
INSERT INTO dirgraph(point_from,point_to,costs)
SELECT p1.the_point , p2.the_point, (1000*random() ) +1
FROM allpoints p1
JOIN allpoints p2 ON random() < 0.002
        ;
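Since random() < 0.002 is evaluated once per candidate pair, this keeps roughly 0.2% of the possible combinations, so the row count varies a little between runs. A quick dry run against the same allpoints table from the example shows how many rows a given factor would insert:

-- Dry run: count the pairs a given factor would generate, without inserting
SELECT count(*)
FROM   allpoints p1
JOIN   allpoints p2 ON random() < 0.002;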
wildplasser