4

Given the 4 tables, each containing items and representing one set, how to get the count of the items in each compartment required to draw a Venn diagram as shown below. The calculation should take place in the MySQL server avoiding transmission of items to the application server.

Example tables:

s1:         s2:         s3:         s4:
+------+    +------+    +------+    +------+
| item |    | item |    | item |    | item |
+------+    +------+    +------+    +------+
| a    |    | a    |    | a    |    | a    |
+------+    +------+    +------+    +------+
| b    |    | b    |    | b    |    | c    |
+------+    +------+    +------+    +------+
| c    |    | c    |    | d    |    | d    |
+------+    +------+    +------+    +------+
| d    |    | e    |    | e    |    | e    |
+------+    +------+    +------+    +------+
| ...  |    | ...  |    | ...  |    | ...  |

Now, I think I would calculate some set powers. Some examples with I corresponding to s1, II to s2, III to s3 and IV to s4:

quadruple Venn Diagram - Venn diagram made from 4 sets using ellipses

If I reinterpret sx as being a set, I would write:

  1. |s1 ∩ s2 ∩ s3 ∩ s4| - the white 25 in the center
  2. |(s1 ∩ s2 ∩ s4) \ s3| - the white 15 below right in relation to the center
  3. |(s1 ∩ s4) \ (s2 ∪ s3)| - the white 5 on the bottom
  4. |s1 \ (s2 ∪ s3 ∪ s4)| - the dark blue 60 on the blue ground
  5. ... till 15.

How to calculate those powers efficiently on the MySQL server? Does MySQL provide a function aiding in the calculation?

A naive approach would be running a query for 1.

SELECT count(*) FROM(
SELECT item FROM s1
INTERSECT
SELECT item FROM s2
INTERSECT
SELECT item FROM s3
INTERSECT
SELECT item FROM s4);

and another query for 2.

SELECT count(*) FROM(
SELECT item FROM s1
INTERSECT
SELECT item FROM s2
INTERSECT
SELECT item FROM s4
EXCEPT
SELECT item FROM s3);

and so on, resulting in 15 queries.

Rainer Rillke
  • 1,281
  • 12
  • 24

3 Answers3

2

Try something like this:

with universe as (
    select * from s1 
    union
    select * from s2
    union
    select * from s3
    union
    select * from s4
),
regions as (
    select
        case when s1.item is null then '0' else '1' end
        ||
        case when s2.item is null then '0' else '1' end
        ||
        case when s3.item is null then '0' else '1' end
        ||
        case when s4.item is null then '0' else '1' end as Region
    from universe u
    left join s1 on u.item = s1.item
    left join s2 on u.item = s2.item
    left join s3 on u.item = s3.item
    left join s4 on u.item = s4.item
)
select Region, count(*) from regions group by Region

Disclaimer: I only tested this in SQLite. You might need to SET sql_mode='PIPES_AS_CONCAT' for the ANSI string concatenation to work in MySQL, or use the concat function instead. The WITH syntax is only supported starting from version 8.0 of MySQL, but you can use temporary tables or nested queries appropriately instead.

If the sets are very large you might want to index the item column before querying in case the SQL optimizer won't figure it out by itself.

KT.
  • 10,815
  • 4
  • 47
  • 71
1

The question is a little complex so the answers are. Let me explain K.T.'s answer

with universe as (
    select * from s1 
    union
    select * from s2
    union
    select * from s3
    union
    select * from s4
),
regions as (
    select
        case when s1.item is null then '0' else '1' end
        ||
        case when s2.item is null then '0' else '1' end
        ||
        case when s3.item is null then '0' else '1' end
        ||
        case when s4.item is null then '0' else '1' end as Region
    from universe u
    left join s1 on u.item = s1.item
    left join s2 on u.item = s2.item
    left join s3 on u.item = s3.item
    left join s4 on u.item = s4.item
)
select Region, count(*) from regions group by Region

The universe results in the UNION of all tables (duplicates eliminated), something like

+------+
| item |
+------+
| a    |
+------+
| b    |
+------+
| c    |
+------+
| d    |
+------+
| e    |
+------+
| ...  |
+------+

Then, s1, s2, s3 and s4 are joined

+------+---------+---------+---------+---------+
| item | s1.item | s2.item | s3.item | s4.item |
+------+---------+---------+---------+---------+
| a    | a       | a       | a       | a       |
+------+---------+---------+---------+---------+
| b    | b       | b       | b       | NULL    |
+------+---------+---------+---------+---------+
| c    | c       | c       | NULL    | c       |
+------+---------+---------+---------+---------+
| d    | d       | NULL    | d       | d       |
+------+---------+---------+---------+---------+
| e    | NULL    | e       | e       | e       |
+------+---------+---------+---------+---------+
| ...  | ...     | ...     | ...     | ...     |
+------+---------+---------+---------+---------+

and converted to a binary string (0: if cell is NULL; 1: else) called Region where the first digit corresponds to s1, the second to s2 and so on

+------+--------+
| item | Region |
+------+--------+
| a    | 1111   |
+------+--------+
| b    | 1110   |
+------+--------+
| c    | 1101   |
+------+--------+
| d    | 1011   |
+------+--------+
| e    | 0111   |
+------+--------+
| ...  | ...    |
+------+--------+

and finally aggregated and grouped by Region

+--------+-------+
| Region | count |
+--------+-------+
| 1111   | 1     |
+--------+-------+
| 1110   | 1     |
+--------+-------+
| 1101   | 1     |
+--------+-------+
| 1011   | 1     |
+--------+-------+
| 0111   | 1     |
+--------+-------+
| ...    |       |
+--------+-------+

Note that regions having 0 set elements in them do not show up in the results and 0000 never will (=item not part of any set s1, s2, s3, s4) so there are 15 regions.

4-set venn diagram with regions in binary representation

Rainer Rillke
  • 1,281
  • 12
  • 24
0

Following procedure:

  1. Created a stored procedure, which creates temporary in-memory tables containing the sets.
  2. Mind that MySQL does not allow you refer to a temporary in-memory table more than one time in a query.
  3. As noted, MySQL does not have an INTERSECT or EXCEPT. But you can emulate them. By removing duplicates from your raw data/ raw sets, emulation can be even more simplified.
  4. Decided to store the computed value into a variable each and output a table consisting of all 15 of those values corresponding to components.

What I came up with is currently https://gist.github.com/Rillke/c2da0921f8f2a047615f41fab8781c11

Rainer Rillke
  • 1,281
  • 12
  • 24