Normalize a table where there is a need to reference subsets of a column in another table and those subsets must be unique

Question

How do I normalise this relation (i.e. make it conform to 1NF, 2NF, and 3NF)

CREATE TABLE IF NOT EXISTS series (
  series_id SERIAL PRIMARY KEY,
  dimension_ids INT[] UNIQUE,
  dataset_id INT REFERENCES dataset(dataset_id) ON DELETE CASCADE
);

CREATE TABLE IF NOT EXISTS dimension (
  dimension_id SERIAL PRIMARY KEY,
  dim VARCHAR(50),
  val VARCHAR(50),
  dataset_id INT REFERENCES dataset(dataset_id) ON DELETE CASCADE,
  UNIQUE (dim, val, dataset_id)
);

Where subsets of dimension_id's uniquely identify records in the series table.

EDIT

To provide more information, the data I want to store comes from XML structures looking something like the following

<?xml version="1.0" encoding="utf-8"?>
<message:StructureSpecificData >
   <message:Header>
      <message:ID>IREF757740</message:ID>
      <message:Test>false</message:Test>
      <message:Prepared>2020-04-09T14:55:23</message:Prepared>
   </message:Header>
   <message:DataSet ss:dataScope="DataStructure" ss:structureRef="CPI" xsi:type="ns1:DataSetType">
      <Series FREQ="M" GEOG_AREA="WC" UNIT="IDX">
         <Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
      </Series>
      <Series FREQ="M" GEOG_AREA="NC" UNIT="IDX">
         <Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
         <Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
      </Series>
   </message:DataSet>
</message:StructureSpecificData>

There is a dataset that contains series (0...n) that contain observations (0...n). The series are uniquely identified by their XML attributes - what I call dimensions in my data model. In my example I have two series, differentiated by the geographical areas they cover. Any series can have an arbitrary number of dimensions. series are expected to be queried from its dimensions and the dimensions will also be queried using the series_id. The obvious solution is a bridging table:

CREATE TABLE series_dimension
  series_id INT REFERENCES series(series_id) ON DELETE CASCADE,
  dimension_id INT REFERENCES dimension(dimension_id)
);

This solution permits, however, the following scenario:

|--------------------------|
| series_dimension         |
|-----------|--------------|
| series_id | dimension_id |
|-----------|--------------|
| 1         | 1            |
| 1         | 2            |
| 1         | 3            |
| 1         | 4            |
| 2         | 1            |
| 2         | 2            |
| 2         | 3            |
| 2         | 4            |
|-----------|--------------|

That is, two different series with the same dimensions, so that if I query a series for a given set of dimensions I can't decide in the case of dimensions [1 2 3 4] whether I am looking for series_id=1 or series_id=2 which is unacceptable. Is it therefore the case that in such a situation, I must decide between having referential integrity and the uniqueness property I have just explained?

You need a bridge table, get rid of the array in `series` table as it obviously violates 3NF. — filiprem, Apr 19 '20 at 18:05
@filiprem I understand that, what I am struggling with is how to ensure that the subsets that are represented by the array (in the current set up) are unique in the proposed bridging table. — byrongibby, Apr 19 '20 at 20:11
That is, if a subset of `dimension_id`'s, say `[1 2 3 4]` refers to `series_id`=1, then `[1 2 3 4]` may not refer to `series_id`=2 as well. — byrongibby, Apr 19 '20 at 20:18
If a subset of dimensions is a "first-class citizen" in your model, it should have a name and unique ID. This ID would be related to your "Series" object. This is just a guess but looks like you're modeling something like https://mondrian.pentaho.com/documentation/schema.php ? Can you show what you did so far (the schema with bridge table included)? — filiprem, Apr 19 '20 at 21:33
Referencing the `series_id` directly in the 'dimension` table doesn't solve the problem since you still have a one-to-many relationship which doesn't guarantee that you will be able to isolate a unique `series_id` from a given set of dimensions. I haven't been able to make sense of the reference you gave, but I will give it a deeper look when I get the chance. I have provided some additional information in the OP. — byrongibby, Apr 20 '20 at 09:42
Is there an estimate of total number of possible dimensions? — Damir Sudarevic, Apr 28 '20 at 19:46
@Damir the number of dimensions according to the [schema](https://registry.sdmx.org/schemas/v2_1/SDMXStructureDataStructure.xsd) are technically unbounded. The typical case would have 1-20 dimensions I would say. — byrongibby, Apr 28 '20 at 20:57
Normalization doesn't produce constraints so it's not even clear what you want or why you gave that link. You seem to be trying to produce a design with certain properties--being basic & straightforward with certain types, NFs & constraints)--but you don't ask 1 clear question--do so. If it involves "normalization" find precise definitions & algorithms & ask where 1st stuck. (All are faqs. Though unfortunately SO answers re the relational model or mostly very poor.) PS We don't normalize to 3NF via 2NF. And why do you even think you want 3NF? — philipxy, Apr 30 '20 at 07:12

score 0 · Answer 1 · answered Apr 23 '20 at 12:11

0

My conclusion that this relationship (where a column refers to attributes whose number is not known in advance) requires that normalisation lead to a many-to-many or one-to-many relationsip being created, and this precludes a unique mapping.

Conversely, for a relationship where a column refers to attributes whose number is not known in advance, the way to make the relationship one-to-one/unique is to group those attributes into unique subsets which violates 1NF.

There is only one way to specify UNIQUE constraints, and that is on column(s)
My example requires that each series_id references a variable number of columns/dimensions
I therefore stack my columns in rows with the result that the UNIQUE construct is not available
The solution has each series_id relate to an array that specifies the row subsets, I can now specify that this column of arrays is UNIQUE
This violates 1NF, therefore this relationship cannot be normalised

answered Apr 23 '20 at 12:11

byrongibby

55
1
7

["1NF" has no single meaning.](https://stackoverflow.com/a/40640962/3404097) The various meanings (when not meaningless) (like in the Wikipedia article) are mostly about ("better" designs via) having a column in a table for any parameter or property that we want to quantify over (via "for all" or "for some") in queries. So you need to say exactly what you mean by it if you want to be understood. Your question doesn't say what it means by "normalize" either, but at least it seems to say enough about what you want without referring to it that that doesn't matter. – philipxy Apr 23 '20 at 13:27
Point taken, I have edited the question to be a little more concrete about what I mean. – byrongibby Apr 28 '20 at 21:09
Your point `5` is *not correct*. However, there is nothing more useless then arguing about `NFs` -- so, I can just point you to my rant (external site). https://www.damirsystems.com/forgotten-simplicity/ – Damir Sudarevic Apr 29 '20 at 15:02

Damir Sudarevic · Accepted Answer · 2020-04-30T10:22:38.957

Given your expectation of around 20 dimensions, the example is limited to 60. It does require a controlled process to define each set of dimensions (series).

Reasoning

-- DIM is a valid numeric identifier for a dimension.
--
valid_dim {DIM}
       PK {DIM}

CHECK ((DIM = 1) OR ((DIM > 1) AND (mod(DIM,2) = 0)))


-- data sample
  (DIM)
---------
  (2^0)
, (2^1)
, (2^2)
, ...
, (2^58)
, (2^59)

-- Dimension DIM, named DIM_NAME exists.
--
dimension {DIM, DIM_NAME}
       PK {DIM}
       AK {DIM_NAME}

FK {DIM} REFERENCES valid_dim {DIM}


-- data sample
(DIM, DIM_NAME)
---------------
  (2^0, 'FREQ')
, (2^1, 'GEOG_AREA')
, (2^2, 'UNIT')
, ...
, (2^58, 'AGE_GROUP')
, (2^59, 'HAIR_COLOR')

Loading series and ser_dim can be done from a function, application, or whatever. However, this should be a controlled process.
SER is unique for a given set of dimensions. Note that | is bitwise OR operator.

-- Series SER, named SER_NAME exists.
--
series {SER, SER_NAME}
    PK {SER}
    AK {SER_NAME}


-- data sample
(SER, SER_NAME)
--------------------------------
  ((2^0 | 2^1 | 2^2)  , 'F-G-U')
, ((2^1 | 2^58)       , 'G-A'  )
, ((2^0 | 2^58 | 2^59), 'F-A-H')

-- Series SER has dimension DIM.
--
ser_dim {SER, DIM}
     PK {SER, DIM}

FK1 {SER} REFERENCES series    {SER}
FK2 {DIM} REFERENCES dimension {DIM}

CHECK ((DIM & SER) = DIM)

-- data sample
(SER, DIM)
--------------------------------
  ((2^0 | 2^1 | 2^2) , 2^0)
, ((2^0 | 2^1 | 2^2) , 2^1)
, ((2^0 | 2^1 | 2^2) , 2^2)

, ((2^1 | 2^58) , 2^1 )
, ((2^1 | 2^58) , 2^58)

, ((2^0 | 2^58 | 2^59), 2^0)
, ((2^0 | 2^58 | 2^59), 2^58)
, ((2^0 | 2^58 | 2^59), 2^59)

Note:

All attributes (columns) NOT NULL

PK = Primary Key
AK = Alternate Key (Unique)
FK = Foreign Key

PostgreSQL

-- DIM is a valid numeric identifier
-- for a dimension.
--
CREATE TABLE valid_dim (
      DIM bigint NOT NULL

    , CONSTRAINT pk_valid_dim PRIMARY KEY (DIM)

    , CONSTRAINT chk_valid_dim
        CHECK ( (DIM = 1)
                OR ( (DIM > 1)
                     AND (mod(DIM, 2) = 0) )
              )
);

-- define some of valid DIMs
INSERT INTO valid_dim (DIM)
VALUES
  ((2^ 0)::bigint)
, ((2^ 1)::bigint)
, ((2^ 2)::bigint)
-- fill this gap
, ((2^58)::bigint)
, ((2^59)::bigint) ;

-- Dimension DIM, named DIM_NAME exists.
--
CREATE TABLE dimension (
      DIM      bigint NOT NULL
    , DIM_NAME text   NOT NULL

    , CONSTRAINT pk_dim PRIMARY KEY (DIM)
    , CONSTRAINT ak_dim UNIQUE (DIM_NAME)

    , CONSTRAINT
        fk_dim   FOREIGN KEY (DIM)
        REFERENCES valid_dim (DIM)
);

-- define few dimensions
INSERT INTO dimension (DIM, DIM_NAME)
VALUES
  ((2^ 0)::bigint, 'FREQ')
, ((2^ 1)::bigint, 'GEOG_AREA')
, ((2^ 2)::bigint, 'UNIT')
, ((2^58)::bigint, 'AGE_GROUP')
, ((2^59)::bigint, 'HAIR_COLOR') ;

-- Series SER, named SER_NAME exists.
--
CREATE TABLE series (
     SER      bigint NOT NULL
   , SER_NAME text   NOT NULL

   , CONSTRAINT pk_series PRIMARY KEY (SER)
   , CONSTRAINT ak_series UNIQUE (SER_NAME)
);

-- define three series
INSERT INTO series (SER, SER_NAME)

SELECT bit_or(DIM) as SER, 'F-G-U' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')

UNION

SELECT bit_or(DIM) as SER, 'G-A' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('GEOG_AREA', 'AGE_GROUP')

UNION

SELECT bit_or(DIM) as SER, 'F-A-H' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'AGE_GROUP', 'HAIR_COLOR') ;

-- Series SER has dimension DIM.
--
CREATE TABLE ser_dim (
     SER bigint NOT NULL
   , DIM bigint NOT NULL

   , CONSTRAINT pk_ser_dim PRIMARY KEY (SER, DIM)

   , CONSTRAINT
       fk1_ser_dim FOREIGN KEY (SER)
             REFERENCES series (SER)

   , CONSTRAINT
       fk2_ser_dim FOREIGN KEY (DIM)
          REFERENCES dimension (DIM)

   , CONSTRAINT
        chk_ser_dim CHECK ((DIM & SER) = DIM)
);

-- populate ser_dim
INSERT INTO ser_dim (SER, DIM)
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;

Another option would be to use a (materialized) view for ser_dim. That depends on the rest of the model: if a FK is needed to {SER, DIM} keep the table, otherwise a view would be better.

-- An option, instead of the table.
--
CREATE VIEW ser_dim
AS
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;

Test

-- Show already defined series
-- and their dimensions.
SELECT SER_NAME, DIM_NAME
FROM ser_dim  
JOIN series    USING (SER)
JOIN dimension USING (DIM)
ORDER BY SER_NAME, DIM_NAME ;

-- Get SER for a set of dimensions;
-- use this when defining a series.
SELECT bit_or(DIM) AS SER
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT') ;

-- Find already defined series,
-- given a set of dimensions.
SELECT x.SER
FROM (
  SELECT bit_or(DIM) AS SER
  FROM dimension
  WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')
  ) AS x
WHERE EXISTS
  (SELECT 1 FROM series AS s WHERE s.SER = x.SER) ;

Summary

Unfortunately standard SQL implementations do not support assertions, database-wide constraints. SQL standard actually defines them, but no luck yet. Hence, not every business constraint can be done in SQL elegantly, usually some creativity and compromise is required.

Thanks. This is a very inventive solution, being a beginner to SQL, I would never have come up with using binary in this way. If there is one thing I'll remember going forward is never to talk about NFs, seems like it's a touchy subject. I'm interested to see how performant this solution is (or more accurately, how performant I can make it). P.S. my apologies for the late reply, work has been busy. Thanks again. — byrongibby, May 13 '20 at 12:28