0

We have a database that stores large amounts of measurement data in the following way.

  • There's a table called "Instrument", where every intrument has as primary key of three columns [CustomerCode | LocationCode | InstrumentCode]. Those colums a of type VARCHAR(4), so a complete primary key looks like ABC | L001 | S001. Those codes are meaningfull to us, so we can't just change the for integers. Other relations ar defined on those columns as well but they are out of the scope for this question. This table holds about 200.000 rows representing different measurement data points.
  • There's a table called "InstrumentLoggings" where every InstrumentLogging has a primary key of four columns [CustomerCode | LocationCode | InstrumentCode | Timestamp]. The Timestamp column is of type DateTime. A Foreign Key relation is defined on the first three columns to the "Instrument" table. Then there's a fifth field of type VARCHAR(25) containing a value for that timestamp. This tables holds about 5 billion records (is this outrageous or is it not bad at all?).

Here's a short diagram of the current situation:

enter image description here

Our issue is that the InstumentLoggings table grows to over 200GB, and the performance is starting to decrease. Also back-up and restoring is too much time-consuming. We are looking for ways to eliminate all those primary key fields to a single field in the InstrumentLoggings table.

Can i simply put an extra column InstrumentId on the Instrument table, and create a InstrumentLoggings table with only three colums [InstrumentId | Timestamp | Value ] where the primary key consist of the InstrumentId and Timestamp columns?? Or is it better for the performance to add an extra column InstrumentLoggingId to the previous idea?

In the following image you can see the logging table as is is now, and the two alternatives. I am very curious about your thoughts and if there are any alternatives i do not see right now...

enter image description here

Jeroen1984
  • 1,616
  • 1
  • 19
  • 32
  • 1
    If no other table is related to instrumentLoggings, I see no benefit of adding another column to act as it's primary key. The InstrumentId combined with the Timestamp can stay the primary key of that table. As for the storage space you will save be adding the InstrumentId column - it's replacing 18 bytes with 4 bytes for each record in the table. – Zohar Peled Dec 06 '16 at 15:25
  • 1
    Since this appears to be new development I would suggest not using the name Timestamp. Not only is a datatype in sql server that has nothing to do with date or time it is incredibly ambiguous. Something like DateAdded is a much better and clear name. I would also try to avoid column names like Value. for the problem at hand I would probably use an identity as part of the composite key instead of a datetime value. – Sean Lange Dec 06 '16 at 16:08
  • 2
    Reducing the information represented in the InstrumentLoggings table may result in additional joins required to reconstruct that information in queries. Performance will depend on your queries and whether suitable indexes exist to support them. You should base optimization on actual testing and EXPLAINing your queries. – reaanb Dec 06 '16 at 21:04
  • @SeanLange The Timestamp column does represent the timestamp when the actual measurement has taken place. This can be different than the date the record is added. But i understand your suggestion, thank you. – Jeroen1984 Dec 07 '16 at 07:16

1 Answers1

2

Have a look at Why use multiple columns as primary keys (composite primary key). It looks like the consensus there is what we use for new development: single column primary key and then unique constraint when necessary on a contained compound key.

This would be your option 2 with InstrumentLoggingId. If required you could but a unique constraint or just an additional index on InstrumentId/Timestamp.

EDIT

Justification for this choice (based on experiences - I am not a trained DBA :-)):

  1. ORM simplicity and future proofing. If a new column is added to the business key, any referencing tables don't have to change and the code changes are much simpler.
  2. Uniqueness and tie breaking. Assuming you go with InstrumentId/Timestamp as your PK, what are you doing at daylight savings time...UTC to avoid duplicates? Or error at the device because of a PK conflict and lose the data? What happens if one of the devices makes a mistake or has its clock sync incorrectly...it could start spewing out duplicate times. Having a separate unique key allows for you to figure out what went on later by being able to sequence by time and that key and adjust the specific records you want to adjust where there might be duplicates.
  3. I had a third reason but can't remember what it is now unless it got mixed in with number 2. Will edit later if I can remember :-)
  4. Insert performance. AFAIK Making your unique (likely identity) PK clustered will keep records being inserted at the end rather than inserting and shuffling the physical order of records based on the business key (e.g. assuming you went with a clustered PK of InstrumentId/Timestamp, every insert for Instrument 1 would be physically inserted before the records for Instrument 2). I don't the full technical way this happens but I know there is more overhead than simply inserting at the end.
Community
  • 1
  • 1
SMM
  • 2,225
  • 1
  • 19
  • 30