SQL - Many to Many alternatives?

Question

Suppose I have records:

======= =========
Element id
======= =========
        "H"
        "O"

And another like:

======== ==
Compound id
======== ==
         "Water"

With:

======== == =========== ========== ==========
Relation id compound_id element_id bond
======== == =========== ========== ==========
         1  "Water"     "H"        "Covalent"
         2  "Water"     "H"        "Covalent"
         3  "Water"     "O"        "Covalent"

Now, most of my queries aren't for an exact match, but suppose sometimes I want to find the compound with the exact elements = ["H", "H", "O"] (i.e. Water -- but not Hydroxide (["H", "O"]) or Peroxide (["H", "H", "O", "O"]).

How might I go about this?

Consensus seems to have it, that the best way to store arrays in SQL is through many-to-many intermediate tables.
However, querying for an exact match without arrays seems slow and complex, even with database-specific functions like GROUP_CONCAT.

score 3 · Answer 1 · answered Jan 22 '19 at 11:36

3

Why not just use array_agg()?

select compound_id
from t3
group by compound_id
having array_agg(element_id order by element_id) = array['H', 'H', 'O']

answered Jan 22 '19 at 11:36

Gordon Linoff

1,242,037
58
646
786

I'll consider this if I switch to Postgres, but for now `array_agg` is the Postgres-only equivalent of Mysql's `group_concat` that I mention in the question. – c z Feb 05 '19 at 12:58

Radim Bača · Accepted Answer · 2019-01-22T11:53:24.677

It is always better to keep your database normalized. In your particular case, I would store the number of elements per compound instead of having a new row for each element.

 compound_id element_id      bond         count
 -------------------------------------------------
   "Water"     "H"        "Covalent"        2
   "Water"     "O"        "Covalent"        1

The query for the exact match will be

 select compound_id
 from elements
 group by compound_id
 having count(
              case when 
                (element_id = 'H' and count = 2) or
                (element_id = 'O' and count = 1) then 1 
              end
        ) = count(*)

However, this approach will be suboptimal, since a sequential scan will be used. If denormalization is not a problem, storing a number of different elements per compound can help.

 compound_id   element_count
 ------------------------------
   "Water"          2

Then the query may be

 select e.compound_id
 from elements e
 join compounds c on e.compound_id = c.compound_id
 where c.element_count = 2 and
       ((e.element_id = 'H' and e.count = 2) or
        (e.element_id = 'O' and e.count = 1))
 group by e.compound_id
 having count(*) = 2

and if you have an index on compounds(element_count) and elements(element_id, count) then your query will use it to retrieve the results fast even if the database is large.

SQL - Many to Many alternatives?

2 Answers2