
This is a conceptual question as I am considering two approaches to a database design for a set of tables that get joined frequently.

Here is CASE A:

/*SCHEMA*/
CREATE TABLE SCHEMA (
    SCHEMA_ID   INTEGER,
    CONSTRAINT sch_pk PRIMARY KEY (SCHEMA_ID)
);

/*OBJECTS*/
CREATE TABLE OBJECT (
    OBJECT_ID   INTEGER,
    SCHEMA_ID   INTEGER NOT NULL,
    CONSTRAINT obj_pk PRIMARY KEY (OBJECT_ID),
    CONSTRAINT obj_sch_uniq UNIQUE (OBJECT_ID, SCHEMA_ID),
    CONSTRAINT obj_fk FOREIGN KEY (SCHEMA_ID) REFERENCES SCHEMA (SCHEMA_ID)
);

/*COLUMNS*/
CREATE TABLE COL (
    COL_ID      INTEGER,
    OBJECT_ID   INTEGER,
    CONSTRAINT col_pk PRIMARY KEY (COL_ID),
    CONSTRAINT col_obj_uniq UNIQUE (COL_ID, OBJECT_ID),
    CONSTRAINT col_fk FOREIGN KEY (OBJECT_ID) REFERENCES OBJECT (OBJECT_ID)
);

Here is CASE B:

/*SCHEMA*/
CREATE TABLE SCHEMA (
    schema_id               INTEGER,
    CONSTRAINT schema_pk PRIMARY KEY (schema_id)
);

/*OBJECTS*/
CREATE TABLE OBJECT (
    object_id               INTEGER,
    schema_id               INTEGER,
    CONSTRAINT object_pk PRIMARY KEY (object_id, schema_id),
    CONSTRAINT object_schema_fk FOREIGN KEY (schema_id) REFERENCES SCHEMA (schema_id)
);

/*COLUMNS*/
CREATE TABLE COL (
    column_id               INTEGER,
    object_id               INTEGER,
    schema_id               INTEGER,
    CONSTRAINT column_pk PRIMARY KEY (column_id, object_id, schema_id),
    CONSTRAINT column_object_fk FOREIGN KEY (object_id, schema_id) REFERENCES OBJECT (object_id, schema_id)
);

A common query that would be run against this set of tables is as follows:

CASE A

SELECT *
FROM METADATA_CONTROL.COL c
INNER JOIN METADATA_CONTROL.OBJECT o
    ON c.OBJECT_ID = o.OBJECT_ID
WHERE o.SCHEMA_ID = 101;

CASE B

SELECT *
FROM METADATA_CONTROL.COL
WHERE SCHEMA_ID = 101;
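As a quick sanity check that the two queries return the same column rows, here is a minimal sketch using SQLite via Python's stdlib (purely illustrative; the question is about Oracle, so the `METADATA_CONTROL` schema qualifier is dropped and the tables are prefixed `A_`/`B_` so both designs fit in one database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

con.executescript("""
-- CASE A shape: OBJECT_ID alone is the key; COL carries no SCHEMA_ID.
CREATE TABLE A_SCHEMA (SCHEMA_ID INTEGER PRIMARY KEY);
CREATE TABLE A_OBJECT (
    OBJECT_ID INTEGER PRIMARY KEY,
    SCHEMA_ID INTEGER NOT NULL REFERENCES A_SCHEMA(SCHEMA_ID)
);
CREATE TABLE A_COL (
    COL_ID INTEGER PRIMARY KEY,
    OBJECT_ID INTEGER REFERENCES A_OBJECT(OBJECT_ID)
);

-- CASE B shape: composite keys; SCHEMA_ID is propagated down to B_COL.
CREATE TABLE B_SCHEMA (SCHEMA_ID INTEGER PRIMARY KEY);
CREATE TABLE B_OBJECT (
    OBJECT_ID INTEGER, SCHEMA_ID INTEGER,
    PRIMARY KEY (OBJECT_ID, SCHEMA_ID),
    FOREIGN KEY (SCHEMA_ID) REFERENCES B_SCHEMA(SCHEMA_ID)
);
CREATE TABLE B_COL (
    COLUMN_ID INTEGER, OBJECT_ID INTEGER, SCHEMA_ID INTEGER,
    PRIMARY KEY (COLUMN_ID, OBJECT_ID, SCHEMA_ID),
    FOREIGN KEY (OBJECT_ID, SCHEMA_ID) REFERENCES B_OBJECT(OBJECT_ID, SCHEMA_ID)
);

-- Same logical data in both designs.
INSERT INTO A_SCHEMA VALUES (101);
INSERT INTO A_OBJECT VALUES (1, 101);
INSERT INTO A_COL VALUES (10, 1), (11, 1);
INSERT INTO B_SCHEMA VALUES (101);
INSERT INTO B_OBJECT VALUES (1, 101);
INSERT INTO B_COL VALUES (10, 1, 101), (11, 1, 101);
""")

# CASE A needs the join to reach SCHEMA_ID; CASE B filters directly.
case_a = con.execute("""
    SELECT c.COL_ID, c.OBJECT_ID
    FROM A_COL c JOIN A_OBJECT o ON c.OBJECT_ID = o.OBJECT_ID
    WHERE o.SCHEMA_ID = 101""").fetchall()
case_b = con.execute("""
    SELECT COLUMN_ID, OBJECT_ID
    FROM B_COL WHERE SCHEMA_ID = 101""").fetchall()

print(sorted(case_a) == sorted(case_b))  # True
```

For this data both queries agree; the question of whether the two designs admit the same data in the first place is a separate matter.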

As can be seen, CASE A requires a join while CASE B doesn't. My questions:

- Am I correct that these two table structures enforce the same business requirements?
- If they do, how do I determine which case to implement?

I'm weak in understanding the performance gains/penalties of different relational table structures. This is on Oracle 12c.

I would appreciate any guidance in this case, along with additional resources or rules I can follow with regard to query performance and how it relates to constraints and joins.

Thank you!

nojohnny101
  • How many rows do you expect in each of those tables? – Littlefoot Aug 27 '18 at 19:21
  • on table "schemas" <1000, on "objects" probably 10,000-100,000 and on "columns" 1,000,000+ – nojohnny101 Aug 27 '18 at 19:22
  • I was hoping *your numbers divided by 100*. If that was the case, I guess that it wouldn't matter whether you choose A or B. With *millions* of rows, huh, things are different. I don't know what to suggest, sorry. If possible, create sample data set, populate both cases, run queries and check what explain plan looks like. Pick the one that is less expensive. Also, I hope that someone *educated* will be able to suggest something way better. – Littlefoot Aug 27 '18 at 19:26
  • Thank you for taking the time to post. I do plan on checking the explain plan, but I was wondering conceptually which case adheres closest to "best practices" as they relate to relational database modeling. – nojohnny101 Aug 27 '18 at 19:30
  • No problem; I'll be watching this question, as I'm interested what will people say. – Littlefoot Aug 27 '18 at 19:33
  • If you opt for case B, then you don't need the OBJECT table. Case A seems to be normalized, and so would case B be, but without the OBJECT table. Though, that holds for the current tables' description, with IDs only. In *real life* they wouldn't look like that, I presume. In that case, if you want to have it normalized, then I'd say that case A is what is left, regardless of queries with JOINs. – Littlefoot Aug 27 '18 at 21:05
  • My current generic comment re "better"/"best" etc: There's no such thing as "better"/"best" in engineering unless *you* define it. Also unfortunately all reasonable practical definitions require a ridiculous amount of experience with a ridiculous number of factors that interact with chaotic sensitivity to details. Make straightforward designs. When you demonstrate via measurement that a design and all alternatives you can think of have problems (whatever that means at the time), then ask a very specific question. Which should also define "better"/"best". https://meta.stackexchange.com/q/204461 – philipxy Aug 28 '18 at 02:46
  • These are relational designs in a trivial sense only--the idiom [EAV](https://stackoverflow.com/a/24153539/3404097). Research pros & cons. Instead of having tables representing application state you have tables representing tables representing application state. (See [my answers re EAV](https://stackoverflow.com/search?q=user%3A3404097+eav+is%3Aanswer).) Even if an application offers DBMS-like functionality, the straightforward way to implement its table-like state & DDL- & DML-like operations is *simply delegating to implementation DBMS DDL & DML calls*. – philipxy Aug 28 '18 at 03:54
  • @philipxy thanks for the response. I read your linked answer on EAV. All tables I have defined above are required business cases. These 3 tables are part of a larger database that is used to store values for ingestion. We move data between sources, so we must store all of that information. Could you clarify a bit more? The reason I am redesigning this database is that currently our ingestion tools do too much of the data-integrity enforcement, and the DB too little. That is what I'm working towards. Are you implying the structure I outlined is the opposite of that? – nojohnny101 Aug 28 '18 at 15:24
  • Your code builds a DBMS. Instead, delegate to your DBMS. It already has table & metadata tables for them & DDL & DML to manipulate those. I don't get "ingestion". But if you have a table on input why do you just not use DDL & DML to put that table into your database? I am commenting rather than answering because I am suggesting that your "business case" is a wrong solution to the actual problem. (See dandarc's answer re querying your (effectively meta-data) tables. Eg a superset of a unique column set is unique--your UNIQUE is redundant. Eg in SQL PK = UNIQUE NOT NULL, period.) – philipxy Aug 28 '18 at 19:58

1 Answer


Those are not equivalent models.

Consider the following 2 rows in table OBJECT :

OBJECT_ID    SCHEMA_ID
1            1
1            2

You could insert both rows in CASE B. You could not insert both rows in CASE A. Primary keys are unique, after all - so PRIMARY KEY (OBJECT_ID) means OBJECT_ID is unique in the OBJECT table.

Now, if you're willing to let the application enforce the "OBJECT_ID is unique" constraint, then you could use CASE B to store data that meets CASE A requirements. You probably should not, but you could. That is to say everything you can put into the database in CASE A you can put into the database in CASE B. The reverse is not true.
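This difference is easy to check mechanically. A minimal sketch, using SQLite via Python's stdlib purely for illustration (in Oracle the second CASE A insert would fail with ORA-00001, unique constraint violated):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# CASE A shape: OBJECT_ID alone is the primary key.
con.execute("""CREATE TABLE OBJECT_A (
    OBJECT_ID INTEGER PRIMARY KEY,
    SCHEMA_ID INTEGER NOT NULL)""")

# CASE B shape: the (OBJECT_ID, SCHEMA_ID) pair is the primary key.
con.execute("""CREATE TABLE OBJECT_B (
    OBJECT_ID INTEGER, SCHEMA_ID INTEGER,
    PRIMARY KEY (OBJECT_ID, SCHEMA_ID))""")

rows = [(1, 1), (1, 2)]  # same OBJECT_ID, different SCHEMA_ID

# CASE B accepts both rows: the pairs differ, so the composite PK is satisfied.
con.executemany("INSERT INTO OBJECT_B VALUES (?, ?)", rows)

# CASE A rejects the second row: OBJECT_ID = 1 already exists.
con.execute("INSERT INTO OBJECT_A VALUES (?, ?)", rows[0])
try:
    con.execute("INSERT INTO OBJECT_A VALUES (?, ?)", rows[1])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

So the set of states CASE A admits is a strict subset of what CASE B admits, which is exactly why the choice should be driven by correctness first.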

So pick between Case A and B based on correctness primarily.

On Case A - the unique constraint is redundant. Looking again at the OBJECT table, OBJECT_ID is already unique on its own, so any column set containing it is unique as well. You don't need that constraint on your table.

What you probably do want is an index where the leading column is SCHEMA_ID. That could just be an index on SCHEMA_ID, or on SCHEMA_ID and OBJECT_ID. That gives the database a good access path for your typical query where you are providing the SCHEMA_ID. Same goes for the COL table - to get best performance, you likely want an index with leading column OBJECT_ID.

If Case B is correct, you can either add indexes to support your queries, or reverse the order of the fields in your primary keys - if you're always filtering by SCHEMA_ID, you want an index where the first column is SCHEMA_ID. Primary Keys always have an associated index, so you could leverage that.

Basically decide on what is correct per the requirements, then optimize for performance.

dandarc
  • I appreciate the response. I don't quite follow the example you provided. My understanding is that your example would work with CASE A because of the constraint UNIQUE (OBJECT_ID, SCHEMA_ID). I know it is a bit redundant, but in CASE A there is a difference (in my understanding) between PRIMARY KEY (OBJECT_ID) and UNIQUE (OBJECT_ID, SCHEMA_ID). The former states that OBJECT_ID is unique, identifies a row, is not nullable, and is irreducible, whereas the latter says the combination of OBJECT_ID and SCHEMA_ID must be unique. Correct? – nojohnny101 Aug 28 '18 at 15:34
  • Why not just try it. Build the table with the primary key and the unique constraint as you have done and see what happens when you try to insert those two rows. First insert will work, second will fail because you're violating the primary key constraint. – dandarc Aug 28 '18 at 23:15
  • I see. I did create some test tables and you're right, case A didn't accept your above example. I will look more into indexes as I am not very familiar with that. Thank you. – nojohnny101 Aug 29 '18 at 14:24