PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

Question

I have a group of tables that define some rules that need to be followed, for example:

CREATE TABLE foo.subrules (
    subruleid SERIAL PRIMARY KEY,
    ruleid INTEGER REFERENCES foo.rules(ruleid),
    subrule INTEGER,
    barid INTEGER REFERENCES foo.bars(barid)
);

INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES 
    (1,1,1),
    (1,1,2),
    (1,2,2),
    (1,2,3),
    (1,2,4),
    (1,3,3),
    (1,3,4),
    (1,3,5),
    (1,3,6),
    (1,3,7);

What this is defining is a set of "subrules" that need to be satisfied... if all "subrules" are satisfied then the rule is also satisfied. In the above example, "subruleid" 1 can be satisfied with a "barid" value of 1 or 2. Additionally, "subruleid" 2 can be satisfied with a "barid" value of 2, 3, or 4. Likewise, "subruleid" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.

I also have a data set that looks like this:

 primarykey |  resource  |   barid  
------------|------------|------------
     1      |     A      |     1      
     2      |     B      |     2      
     3      |     C      |     8

The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule")

So, what I need is to evaluate and return the following results:

   ruleid   |   subrule  |   barid    | primarykey |  resource  
------------|------------|------------|------------|------------
     1      |     1      |     1      |     1      |     A      
     1      |     1      |     2      |    NULL    |    NULL
     1      |     2      |     2      |     2      |     B      
     1      |     2      |     3      |    NULL    |    NULL
     1      |     2      |     4      |    NULL    |    NULL
     1      |     3      |     3      |    NULL    |    NULL    
     1      |     3      |     4      |    NULL    |    NULL
     1      |     3      |     5      |    NULL    |    NULL
     1      |     3      |     6      |    NULL    |    NULL
     1      |     3      |     7      |    NULL    |    NULL
    NULL    |    NULL    |    NULL    |     3      |     C

Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.

I have tried several methods including a plpgsql function that performs a grouping by "subruleid" with ARRAY_AGG(barid) and building an array from barid and checking if each element in the barid array is in the "subruleid" group via a loop, but it just doesn't feel right.

Is a more elegant or efficient option available?

The problem is not well defined. The result depends on the sequence of assignments. To get a deterministic result you need to declare an unambiguous ranking for assignments. Each assignment potentially depends on all previous assignments, which makes this an inherently ***procedural*** task, and which is also why this is hard to solve with the set-based approach of SQL. Not saying it can't be done. A recursive CTE comes to mind. A plpgsql function iterating through 2 cursors in parallel is probably faster. But first, the definition must be clear. — Erwin Brandstetter, Feb 12 '15 at 05:04
If I understand the question correctly, if resource 'C' would have a barid = 7, then **all** the subrules would be satisfied, so rule#1 would be satisfied, too ? — wildplasser, Feb 14 '15 at 13:48
@wildplasser - yes, resource `C` could have a value of `3`, `4`, `5`, `6`, or `7` and `ruleid` `1` would be satisfied. — losthorse, Feb 14 '15 at 13:53
yes, the number if resources is limited and known in advance. Each `ruleid` has a given number of `subrule`s and the `ruleid` is selected by the number of resources. In this case there are `3` resources, so there must also be `3` `sublevel`s. — losthorse, Feb 14 '15 at 14:11
You are only interested in the satisfiability of the complete problem, or do you want the actual realisation(s) too? — wildplasser, Feb 15 '15 at 12:36
@wildplasser - (my apologies for the delayed response) I would prefer the actual realization; however, I'm grateful for any and all help. — losthorse, Feb 16 '15 at 01:57
I think this is a "big" problem. Maybe n-sat, maybe relational-devision. @ErwinBrandstetter maybe change title and tags ? — wildplasser, Feb 16 '15 at 23:28
@wildplasser: I think it's a common pattern. Not relational division, though. I don't recognize "n-sat". I am still waiting for confirmation by the OP, whether or not I interpreted the question right. — Erwin Brandstetter, Feb 17 '15 at 16:32
wildplasser & ErwinBrandstetter - I think you both seem to understand the question... wildplasser's answer seems to calculate all possible correct options and then check if the current configuration matches. This will find a reliable answer; however it requires a dynamic SQL statement (and I would like to avoid that if possible). On the other hand, ErwinBrandstetter's answer is pure SQL but seems to miss some valid configurations due to the procedural nature of the RECURSIVE CTE. — losthorse, Feb 17 '15 at 16:42
@ OP: maybe define the problem more clearly? Still think this kind of probem is in the n-sat corner. It even smells like sudoku... The question still needs a better title, though. — wildplasser, Feb 17 '15 at 22:54
@wildplasser: Yeah, the problem is very close to Sudoku. losthorse, you may be interested in [Mathematics of Sudoku](https://en.wikipedia.org/wiki/Mathematics_of_Sudoku) on Wikipedia for hints on an efficient algorithm. — Erwin Brandstetter, Feb 17 '15 at 23:35
The diffence is that sudoku is strictly limited to 3*3 (*3). The OP's problem appears to be unbounded. — wildplasser, Feb 17 '15 at 23:40

score 2 · Answer 1 · answered Feb 15 '15 at 15:06

The following fragment finds solutions, if there are any. The number three (resources) is hardcoded. If only one solution is needed some symmetry-breaker should be added.

If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?), and selecting from them, after pruning the not-satifying ones.

 -- the data
CREATE TABLE subrules
    ( subruleid SERIAL PRIMARY KEY
    , ruleid INTEGER -- REFERENCES foo.rules(ruleid),
    , subrule INTEGER
    , barid INTEGER -- REFERENCES foo.bars(barid)
);

INSERT INTO subrules(ruleid,subrule,barid) VALUES
    (1,1,1), (1,1,2),
    (1,2,2), (1,2,3), (1,2,4),
    (1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);

CREATE TABLE resources
    ( primarykey INTEGER NOT NULL PRIMARY KEY
    ,  resrc  varchar
    ,  barid  INTEGER NOT NULL
        );

INSERT INTO resources(primarykey,resrc,barid) VALUES
      (1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
        -- ################################
        -- uncomment next line to find a (two!) solution(s)
     -- ,(4, 'D', 7)
        ;

-- all matching pairs of subrules <--> resources
WITH pairs AS (
        SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
        , re.primarykey, re.resrc
        FROM subrules sr
        JOIN resources re ON re.barid = sr.barid
        )
SELECT
        p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
        , p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
        , p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
  -- self-join the pairs, excluding the ones that
  -- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
        ;

Result (after uncommenting the line with missing resource) :

 ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three 
-----+-----+-----+-----+-----+-----+-----+-----+-------
   1 |   1 | A   |   1 |   1 | B   |   1 |   3 | D
   1 |   1 | A   |   1 |   2 | B   |   1 |   3 | D
(2 rows)

The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) to serve as the missing link.

I may have misunderstood your question in the comments above... the number of resources is known in advance of the function running; however, it is not known at the time of the function being created... thus this solution would require a dynamic SQL solution to add/reduce the number of resources. — losthorse, Feb 16 '15 at 02:19

score 1 · Answer 2 · edited May 23 '17 at 12:06

1

Since you are not clarifying the question, I am going with my own assumptions.

subrule numbers are ascending without gaps for each rule.
(subrule, barid) is UNIQUE in table subrules.
If a there are multiple resources for the same barid, assignments are arbitrary among these peers.
As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).
The algorithm is as follows:
1. Pick the subrule with the smallest subrule number.
2. Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
3. After the first resource is matched, skip to the next higher subruleid and repeat 2.
4. Append all remaining resources after last subrule.

You can implement this with pure SQL using a recursive CTE:

WITH RECURSIVE cte AS ((
   SELECT s.*, r.resourceid, r.resource
        , CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
               ELSE ARRAY[r.resourceid] END AS consumed
   FROM   subrules s
   LEFT   JOIN resource r USING (barid)
   WHERE  s.ruleid = 1
   ORDER  BY s.subrule, r.barid, s.barid
   LIMIT  1
   )
   UNION ALL (
   SELECT s.*, r.resourceid, r.resource
        , CASE WHEN r.resourceid IS NULL THEN c.consumed
                                         ELSE c.consumed || r.resourceid END
   FROM   cte           c
   JOIN   subrules      s ON s.subrule = c.subrule + 1
   LEFT   JOIN resource r ON r.barid = s.barid
                         AND r.resourceid <> ALL (c.consumed)
   ORDER  BY r.barid, s.barid
   LIMIT  1
   ))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte

UNION ALL  -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL 
FROM   subrules s
LEFT   JOIN cte c USING (subruleid)
WHERE  c.subruleid IS NULL

UNION ALL  -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM   resource r
LEFT   JOIN cte c USING (resourceid)
WHERE  c.resourceid IS NULL    
ORDER  BY subrule, barid, resourceid;

Returns exactly the result you have been asking for.
SQL Fiddle.

Explain

It's basically an implementation of the algorithm laid out above.

Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires additional parentheses:
- Sum results of a few queries and then find top 5 in SQL
Collecting "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.

Or you open two cursors on the tables subrule and resource and implement the algorithm with any decent programming language (including PL/pgSQL).

edited May 23 '17 at 12:06

Community

1
1

answered Feb 16 '15 at 01:47

Erwin Brandstetter

605,456
145
1,078
1,228

My apologies for the delayed response, I started to reply to your original comment above and failed to finish it. I started trying to better define the problems you pointed out in the comments above... specifically the issue of this task being inherently procedural and thus not well defined. On the other hand, your solution is returning the requested results. The only issue I see is with the second step in the algorithm as this could ultimately cause the return of a false negative. – losthorse Feb 16 '15 at 02:12
@losthorse: I clarified item 2. of the algorithm: "the first that has a matching resource": that's what my query does. So no false negatives there. – Erwin Brandstetter Feb 16 '15 at 12:26
It looks like something is wrong with the SQL Fiddle... all versions are empty. Also, if I am understanding the updated explanation, this technique will accurately return a `true` result if, for example, `subrule` `3` was limited to a single `barid` value of `1` and `resourceid` `3` had a `barid` value of `3`... is my understanding correct? – losthorse Feb 17 '15 at 15:22
@losthorse: sqlfiddle is having major troubles lately. Overloaded most of the time, down sometimes. I also checked other, known good links. Currently, all links are dead. They will probably be back eventually. You'll have to test in your own installation. As for your example: *no*, `barid` 1 can only match `barid` 1. Two *unmatched* rows will be returned. Try it. – Erwin Brandstetter Feb 17 '15 at 16:22
I just tested on my server and found that you're correct; it does not work. Here are the updated example data sets: `INSERT INTO subrules(ruleid,subrule,barid) VALUES (1,1,1), (1,1,2), (1,2,2), (1,2,3), (1,2,4), (1,3,1);` and `INSERT INTO resource(resourceid,resource,barid) VALUES (1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 3);` This configuration should match, no? – losthorse Feb 17 '15 at 16:26
@losthorse: No, my current query consumes `barid` 1 for `subrule` 1 (lowest `barid` first) There is no `barid` 1 left for `subrule` 3. Are you after a smart algorithm that consumes `bar_id` 2 for `subrule` 1 so that `barid` 1 is still left for `subrule` 3? That's a lot harder to solve, yet. And gets extremely complex with growing sets. Combinatorics. O(NM). As a first *incomplete* optimization, you could start with the subrule that has the least rows. I remember a similar question here in the past, but can't find it right now. (I don't think we solved it.) – Erwin Brandstetter Feb 17 '15 at 16:42
Yes, im looking for an algorithm that consumes `bar_id` `2` for `subrule` `1` so that `barid` `1` is still left for `subrule` `3`. It would be best if I could return the results as displayed in the original post; however, if there is a way to return only the configuration points used to meet the rule that would be something. My current solution involves this approach: I build a set of arrays and find those with the largest intersection of values with the current configuration... but this requires dynamic SQL because the number of subrules is variable). – losthorse Feb 17 '15 at 16:52
@losthorse: Probably best to start a new question with an accurate description of the problem and add a test case like the one in your comment here to illustrate. I think I answered *this* question as posted. – Erwin Brandstetter Feb 17 '15 at 17:06

PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

2 Answers2

Explain