6

I have a PostgreSQL table of this form:

base_id int | mods smallint[]
     3      |   {7,15,48}

I need to populate a table of this form:

combo_id int | base_id int | mods smallint[]
     1       |     3       |      
     2       |     3       |      {7}
     3       |     3       |      {7,15}   
     4       |     3       |      {7,48}   
     5       |     3       |      {7,15,48}
     6       |     3       |      {15}
     7       |     3       |      {15,48}
     8       |     3       |      {48}

I think I could accomplish this using a function that does almost exactly this, iterating over the first table and writing combinations to the second table: Generate all combinations in SQL

But, I'm a Postgres novice and cannot for the life of me figure out how to do this using plpgsql. It doesn't need to be particularly fast; it will only be run periodically on the backend. The first table has approximately 80 records and a rough calculation suggests we can expect around 2600 records for the second table.

Can anybody at least point me in the right direction?

Edit: Craig: I've got PostgreSQL 9.0. I was successfully able to use UNNEST():

FOR messvar IN SELECT * FROM UNNEST(mods) AS mod WHERE mod BETWEEN 0 AND POWER(2, @n) - 1
LOOP
    RAISE NOTICE '%', messvar;
END LOOP;

but then didn't know where to go next.

Edit: For reference, I ended up using Erwin's solution, with a single line added to add a null result ('{}') to each set and the special case Erwin refers to removed:

CREATE OR REPLACE FUNCTION f_combos(_arr integer[], _a integer[] DEFAULT '{}'::integer[], _z integer[] DEFAULT '{}'::integer[])
RETURNS SETOF integer[] LANGUAGE plpgsql AS
$BODY$
DECLARE
 i int;
 j int;
 _up int;
BEGIN
 IF array_length(_arr,1) > 0 THEN 
    _up := array_upper(_arr, 1);

    IF _a = '{}' AND _z = '{}' THEN RETURN QUERY SELECT '{}'::int[]; END IF;
    FOR i IN array_lower(_arr, 1) .. _up LOOP
       FOR j IN i .. _up  LOOP
          CASE j-i
          WHEN 0,1 THEN
             RETURN NEXT _a || _arr[i:j] || _z;
          ELSE
             RETURN NEXT _a || _arr[i:i] || _arr[j:j] || _z;
             RETURN QUERY SELECT *
             FROM f_combos(_arr[i+1:j-1], _a || _arr[i], _arr[j] || _z);
          END CASE;
       END LOOP;
    END LOOP;
 ELSE
    RETURN NEXT _arr;
 END IF;
END;
$BODY$

Then, I used that function to populate my table:

INSERT INTO e_ecosystem_modified (ide_ecosystem, modifiers) 
(SELECT ide_ecosystem, f_combos(modifiers) AS modifiers FROM e_ecosystem WHERE ecosystemgroup <> 'modifier' ORDER BY ide_ecosystem, modifiers);

From 79 rows in my source table with a maximum of 7 items in the modifiers array, the query took 250ms to populate 2630 rows in my output table. Fantastic.

Community
  • 1
  • 1
Kim
  • 185
  • 8
  • This is for permutations, but close: http://wiki.postgresql.org/wiki/Permutations – Craig Ringer Aug 17 '12 at 00:41
  • 3
    Before trying anything with this: What PostgreSQL version are you using? It's vital for a question like this, where some features are only available in newer versions. For example if you don't have CTEs or array_unnest, then the I'd say "install plperl; write it in plperl". – Craig Ringer Aug 17 '12 at 00:43
  • OK, 9.0 is reasonable; you have window functions, recursive CTEs, array unnesting and more to play with. A couple more questions: Is the set always 3 elements? Or it is variable size? Also, do the combo IDs need to be ordered in any way? – Craig Ringer Aug 17 '12 at 01:51
  • Thanks for the help Craig. The array is variable size and can be null (in which case the function should write only one record, with base_id set but mods null). combo_id should be a serial (I think -- what we'd call autoincrement in other dbs). – Kim Aug 17 '12 at 01:59
  • OK, what I mean is: Does combo_id have to start at 1 for each set of combos and count up, like in your example? Or can it be *any* unique number? If it has to be counting up, you'd need a `row_number` window function; if it just has to be unique a plain `serial` can be used. – Craig Ringer Aug 17 '12 at 02:17
  • @Craig: I hope I did not cause confusion with my first buggy version. This one should be good now. – Erwin Brandstetter Aug 17 '12 at 04:19
  • @ErwinBrandstetter Yeah, I'm matching your results now, cool. Thanks. – Craig Ringer Aug 17 '12 at 04:21
  • Craig and @Erwin: This was my first experience posting to SO, and I just learned a lot about combinatorics and postgres thanks to both of you. Interesting that the bitwise approach I originally assumed would be the right one turned out not to be. In the meantime: what do I do now? I posted my final solution as an edit above. It's a version of Erwin's, just because I saw his first and it seemed simpler and easier to understand for a newbie. But both answers are great. I can only accept one answer, right? But I guess I can mark both as useful? – Kim Aug 17 '12 at 17:02

2 Answers2

5

After I slept over it I had a completely new, simpler, faster idea:

CREATE OR REPLACE FUNCTION f_combos(_arr anyarray)
  RETURNS TABLE (combo anyarray) LANGUAGE plpgsql AS
$BODY$
BEGIN
    IF array_upper(_arr, 1) IS NULL THEN
        combo := _arr; RETURN NEXT; RETURN;
    END IF;

    CASE array_upper(_arr, 1)
--  WHEN 0 THEN -- does not exist
    WHEN 1 THEN
        RETURN QUERY VALUES ('{}'), (_arr);
    WHEN 2 THEN
        RETURN QUERY VALUES ('{}'), (_arr[1:1]), (_arr), (_arr[2:2]);
    ELSE
        RETURN QUERY
        WITH x AS (
            SELECT f.combo FROM f_combos(_arr[1:array_upper(_arr, 1)-1]) f
            )
        SELECT x.combo FROM x
        UNION ALL
        SELECT x.combo || _arr[array_upper(_arr, 1)] FROM x;
    END CASE;
END
$BODY$;

Call:

SELECT * FROM f_combos('{1,2,3,4,5,6,7,8,9}'::int[]) ORDER BY 1;

512 rows, total runtime: 2.899 ms

Explain

  • Treat special cases with NULL and empty array.
  • Build combinations for a primitive array of two.
  • Any longer array is broken down into:
    • the combinations for same array of length n-1
    • plus all of those combined with element n .. recursively.

Really simple, once you got it.

  • Works for 1-dimensional integer arrays starting with subscript 1 (see below).
  • 2-3 times as fast as old solution, scales better.
  • Works for any element type again (using polymorphic types).
  • Includes the empty array in the result as is displayed in the question (and as @Craig pointed out to me in the comments).
  • Shorter, more elegant.

This assumes array subscripts starting at 1 (Default). If you are not sure about your values, call the function like this to normalize:

SELECT * FROM  f_combos(_arr[array_lower(_arr, 1):array_upper(_arr, 1)]);

Not sure if there is a more elegant way to normalize array subscripts. I posted a question about that:
Normalize array subscripts for 1-dimensional array so they start with 1

Old solution (slower)

CREATE OR REPLACE FUNCTION f_combos2(_arr int[], _a int[] = '{}', _z int[] = '{}')
 RETURNS SETOF int[] LANGUAGE plpgsql AS
$BODY$
DECLARE
   i int;
   j int;
   _up int;
BEGIN
   IF array_length(_arr,1) > 0 THEN 
      _up := array_upper(_arr, 1);

      FOR i IN array_lower(_arr, 1) .. _up LOOP
         FOR j IN i .. _up  LOOP
            CASE j-i
            WHEN 0,1 THEN
               RETURN NEXT _a || _arr[i:j] || _z;
            WHEN 2 THEN
               RETURN NEXT _a || _arr[i:i] || _arr[j:j] || _z;
               RETURN NEXT _a || _arr[i:j] || _z;
            ELSE
               RETURN NEXT _a || _arr[i:i] || _arr[j:j] || _z;
               RETURN QUERY SELECT *
               FROM f_combos2(_arr[i+1:j-1], _a || _arr[i], _arr[j] || _z);
            END CASE;
         END LOOP;
      END LOOP;
   ELSE
      RETURN NEXT _arr;
   END IF;
END;
$BODY$;

Call:

SELECT * FROM f_combos2('{7,15,48}'::int[]) ORDER BY 1;

Works for 1-dimensional integer arrays. This could be further optimized, but that's certainly not needed for the scope of this question.
ORDER BY to impose the order displayed in the question.

Provide for NULL or empty array, as NULL is mentioned in the comments.

Tested with PostgreSQL 9.1, but should work with any halfway modern version. array_lower() and array_upper() have been around for at least since PostgreSQL 7.4. Only parameter defaults are new in version 8.4. Could easily be replaced.

Performance is decent.

SELECT DISTINCT * FROM f_combos('{1,2,3,4,5,6,7,8,9}'::int[]) ORDER BY 1;

511 rows, total runtime: 7.729 ms

Explanation

It builds on this simple form that only creates all combinations of neighboring elements:

CREATE FUNCTION f_combos(_arr int[])
  RETURNS SETOF int[] LANGUAGE plpgsql AS
$BODY$
DECLARE
   i  int;
   j  int;
  _up int;
BEGIN
   _up := array_upper(_arr, 1);

   FOR i in array_lower(_arr, 1) .. _up LOOP
      FOR j in i .. _up LOOP
         RETURN NEXT _arr[i:j];
      END LOOP;
   END LOOP;
END;
$BODY$;

But this will fail for sub-arrays with more than two elements. So:

  • For any sub-array with 3 elements one array with just the outer two elements is added. this is a shortcut for this special case that improves performance and is not strictly needed.

  • For any sub-array with more than 3 elements I take the outer two elements and fill in with all combinations of inner elements built by the same function recursively.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Funny. Both the approaches we took perform the same on my 9.1 system to within 5%, scaling exactly the same with number of elements. – Craig Ringer Aug 17 '12 at 04:26
  • @CraigRinger: In my tests the plpgsql function is slightly faster by ~ 10 %. That's amazingly close for completely different approaches. Maybe the fact that I ditched polymorphic types to allow for empty array as default in the function definition helps performance - for the cost of being less versatile. Results are identical except for an empty array that you return additionally. Check with `SELECT * FROM combinations('{1,2,3,4,5,6,7,8,9}'::int[]) c(x) where x not IN ( SELECT * FROM f_combos('{1,2,3,4,5,6,7,8,9}'::int[]))` – Erwin Brandstetter Aug 17 '12 at 04:37
  • I return the empty array intentionally as `{}` is a valid combination (and 'cos it's that way in the question). The other difference between the two is with the empty array as input; you return {} and I return null, so yours is more correct there. As for speed, your PL/PgSQL version is the faster one here by about 5%. The similarity of speeds is just nuts. – Craig Ringer Aug 17 '12 at 04:41
  • @CraigRinger: I ran more tests, performance diverges with higher number of array elements. 20 elements create 1048575 combination which starts to get hard on the server. Hardly relevant for the specs of the question any more. This was fun, gotta sleep, happy coding! – Erwin Brandstetter Aug 17 '12 at 05:06
  • I like it! Agreed, excellent improvement. I tested with my same dataset and got the same results in 109ms. Going to mark this as the answer, and a very good one! – Kim Aug 18 '12 at 22:11
  • If I have an array with 30 elements, is there a way to update this function so it scales better by only returning rows with at least 20 array elements? – Nick May 10 '19 at 16:11
3

One approach is with a recursive CTE. Erwin's updated recursive function is significantly faster and scales better, though, so this is really useful as an interesting different approach. Erwin's updated version is much more practical.

I tried a bit counting approach (see the end) but without a fast way to pluck arbitrary elements from an array it proved slower then either recursive approach.

Recursive CTE combinations function

CREATE OR REPLACE FUNCTION combinations(anyarray) RETURNS SETOF anyarray AS $$
WITH RECURSIVE
    items AS (
            SELECT row_number() OVER (ORDER BY item) AS rownum, item
            FROM (SELECT unnest($1) AS item) unnested
    ),
    q AS (
            SELECT 1 AS i, $1[1:0] arr
            UNION ALL
            SELECT (i+1), CASE x
                    WHEN 1 THEN array_append(q.arr,(SELECT item FROM items WHERE rownum = i))
                    ELSE q.arr END
            FROM generate_series(0,1) x CROSS JOIN q WHERE i <= array_upper($1,1)
    )
SELECT q.arr AS mods
FROM q WHERE i = array_upper($1,1)+1;
$$ LANGUAGE 'sql';

It's a polymorphic function, so it'll work on arrays of any type.

The logic is to iterate over each item in the unnested input set, using a working table. Start with an empty array in the working table, with a generation number of 1. For each entry in the input set insert two new arrays into the working table with an incremented generation number. One of the two is a copy of the input array from the previous generation and the other is the input array with the (generation-number)'th item from the input set appended to it. When the generation number exceeds the number of items in the input set, return the last generation.

Usage

You can use the combinations(smallint[]) function to produce the results you desire, using it as a set-returning function in combinatin with the row_number window function.

-- assuming table structure
regress=# \d comb
       Table "public.comb"
 Column  |    Type    | Modifiers 
---------+------------+-----------
 base_id | integer    | 
 mods    | smallint[] | 


SELECT base_id, row_number() OVER (ORDER BY mod) AS mod_id, mod 
FROM (SELECT base_id, combinations(mods) AS mod FROM comb WHERE base_id = 3) x
ORDER BY mod;

Results

regress=# SELECT base_id, row_number() OVER (ORDER BY mod) AS mod_id, mod 
regress-# FROM (SELECT base_id, combinations(mods) AS mod FROM comb WHERE base_id = 3) x
regress-# ORDER BY mod;
 base_id | mod_id |    mod    
---------+--------+-----------
       3 |      1 | {}
       3 |      2 | {7}
       3 |      3 | {7,15}
       3 |      4 | {7,15,48}
       3 |      5 | {7,48}
       3 |      6 | {15}
       3 |      7 | {15,48}
       3 |      8 | {48}
(8 rows)

Time: 2.121 ms

Zero element arrays produce a null result. If you want combinations({}) to return one row {} then a UNION ALL with {} will do the job.

Theory

It appears you want the k-combinations for all k in a k-multicombination, rather than simple combinations. See number of combinations with repetition.

In other words, you want all k-combinations of elements from your set, for all k from 0 to n where n is the set size.

Related SO question: SQL - Find all possible combination, which has the really interesting answer about bit counting.

Bit operations exist in Pg, so a bit counting approach should be possible. You'd expect it to be more efficient, but because it's so slow to select a scattered subset of elements from an array it actually works out slower.

CREATE OR REPLACE FUNCTION bitwise_subarray(arr anyarray, elements integer)
RETURNS anyarray AS $$
SELECT array_agg($1[n+1]) 
FROM generate_series(0,array_upper($1,1)-1) n WHERE ($2>>n) & 1 = 1;
$$ LANGUAGE sql;

COMMENT ON FUNCTION bitwise_subarray(anyarray,integer) IS 'Return the elements from $1 where the corresponding bit in $2 is set';

CREATE OR REPLACE FUNCTION comb_bits(anyarray) RETURNS SETOF anyarray AS $$ 
SELECT bitwise_subarray($1, x) 
FROM generate_series(0,pow(2,array_upper($1,1))::integer-1) x;
$$ LANGUAGE 'sql';

If you could find a faster way to write bitwise_subarray then comb_bits would be very fast. Like, say, a small C extension function, but I'm only crazy enough to write one of those for an SO answer.

Community
  • 1
  • 1
Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • 1
    @Erwin: Thank you for the solution! Understandable for a newbie like me. Note I added `IF _a = '{}' AND _z = '{}' THEN RETURN QUERY SELECT '{}'::int[]; END IF;` to add a null result ('{}') to each set. Also, I found that when I removed the special 3-element case I got the same performance, so I decided to remove it. – Kim Aug 17 '12 at 16:57
  • @Kim: Interesting, thanks for the feedback. Please consider my new solution. I came up with a completely new, simpler idea. – Erwin Brandstetter Aug 17 '12 at 18:58