
I'm looking for a best practice or solution, on a conceptual level, to a problem I'm working on.

I have a collection of data points (around 500) which are partially changed by a user over time. It is important to be able to tell which values were changed at what point in time. The data might look like this:

Data changed over time:

+--------------------------------------------------------------------------------------+
|   Date     |  Value no. 1  |  Value no. 2  |  Value no. 3  |  ...  |  Value no. 500  | 
|------------+---------------+---------------+---------------+-------+-----------------|
|  1/1/2018  |               |               |       2       |       |        1        |
|  1/3/2018  |       2       |       1       |               |       |                 |
|  1/7/2018  |               |               |       4       |       |        8        |
| 1/12/2018  |       5       |       3       |               |       |                 |
....

It must be possible to take a snapshot at a certain point in time and get the complete set of data points that were valid at that particular moment, like this:

Snapshot taken 1/3/2018 will yield:

+---------------------------------------------------------+
|  Value 1  |  Value 2  |  Value 3  |  ...  |  Value 500  | 
|-----------+-----------+-----------+-------+-------------|
|     2     |     1     |     2     |   0   |      1      |

Snapshot taken 1/9/2018 will yield:

+---------------------------------------------------------+
|  Value 1  |  Value 2  |  Value 3  |  ...  |  Value 500  | 
|-----------+-----------+-----------+-------+-------------|
|     2     |     1     |     4     |   0   |      8      |

Snapshot taken 1/13/2018 will yield:

+---------------------------------------------------------+
|  Value 1  |  Value 2  |  Value 3  |  ...  |  Value 500  | 
|-----------+-----------+-----------+-------+-------------|
|     5     |     3     |     4     |   0   |      8      |

and so on...

I'm not bound to a particular database technology, so either SQL or NoSQL will do. It is probably not possible to satisfy all the requirements in the database domain; some will probably have to be addressed in code. But my main question is: which database technology is best suited for this task?

I'm not quite sure this fits a time-series database (TSDB), since only a portion of the values are changed at a given time, and it is important to know which values changed. Maybe I'm wrong?

/Chris

  • I am not sure I understand the full problem here. Some additional questions: do you need to log/query WHO changed a specific data item and when? Is there a reason for always storing all 500 values at the same time (wouldn't a "sparse" table work better, assuming you can always query the appropriate value for all 500 data items at any point in time)? – p.marino Nov 27 '18 at 11:44
  • I'll try to explain: Yes, I do need to log/query who changed a certain value. There's no reason for storing all 500 values at the same time. Preferably, only the changed values should be stored. However, I do have to be able to retrieve a complete set of data points that were valid at a certain point in time. – Chris L. Nov 27 '18 at 13:51
  • See my solution below - with this (or something similar) you can have in one (or max 2 tables): a complete history of values for each DataPoint, a "grid" you can query to find the value of one or all the datapoints in one moment in time, and a log of who changed what and when... all using perfectly normal SQL (I did not specify the indexes but this should be trivial) – p.marino Nov 27 '18 at 14:11

2 Answers


Have a look at the SQL Server temporal tables feature, which may be a solution in your case. This approach allows you to run the queries mentioned in the question, for example:

SELECT * 
FROM my_data   
FOR SYSTEM_TIME AS OF '2018-01-01'

However, the table in the example seems to be very wide (maybe denormalized). I would suggest grouping columns by some technical or functional characteristic (vertical partitioning) to avoid maintenance drawbacks later.
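For illustration, a minimal sketch of what such a system-versioned table could look like (SQL Server 2016+). Only the table name my_data comes from the query above; the column layout here is an assumption:

-- Minimal system-versioned temporal table sketch (SQL Server 2016+).
-- SQL Server maintains the period columns automatically.
CREATE TABLE my_data (
    DataID    int          NOT NULL PRIMARY KEY,
    Value     decimal(9,5) NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.my_data_history));

Every UPDATE then moves the previous row version into the history table automatically, which is what makes the FOR SYSTEM_TIME AS OF query above possible.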

serge
  • That might be a viable option. But it doesn't solve the issue of partially changing data, i.e. only some of the data points are changed at a time. – Chris L. Nov 27 '18 at 14:06
  • @ChrisL. the clause CONTAINED IN (https://learn.microsoft.com/en-us/sql/relational-databases/tables/querying-data-in-a-system-versioned-temporal-table#query-for-changes-to-specific-rows-over-time) "returns only those that existed within specified period boundaries". Is that the solution, or did I misunderstand the problem? – serge Nov 28 '18 at 08:40

My suggestion would be to model this in a sparse format, something like:

CREATE TABLE DataPoint (
    DataID     int,        /* 1 to 500 in your example, or whatever you need to identify it */
    ValidFrom  timestamp,  /* default value 01/01/1970-00:00:00 or a suitable "Epoch" */
    ValidUntil timestamp,  /* default value 31/12/3999-00:00:00 or again something that is in the far future for your case */
    Value      number(7,5) /* again, this may be any data type, or even more than one field if needed, like Price & Currency */
);

What we have just defined is a set of data points and the "interval" in which each one has a specific value. So if you measured DataPoint 1 yesterday and got a value of 89.768, you will insert:

DataId=1 ValidFrom=26/11/2018-14:52:41 ValidUntil=31/12/3999-00:00:00 Value=89.768

Then you measure it again tomorrow and get:

DataId=1 ValidFrom=28/11/2018-14:51:23 ValidUntil=31/12/3999-00:00:00 Value=89.443

(Let's assume that you also have logic so that when you record a new value, you update the current value's record and assign ValidUntil=28/11/2018-14:51:23. This is not strictly needed, but it keeps the example query simple.)
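A minimal sketch of that update-then-insert step, assuming the DataPoint table defined above (the timestamp literals are illustrative; ideally both statements run in one transaction):

-- Close the currently "open" row for data point 1...
UPDATE DataPoint
SET ValidUntil = '2018-11-28 14:51:23'
WHERE DataID = 1
  AND ValidUntil = '3999-12-31 00:00:00';

-- ...and insert the new measurement, valid until "forever".
INSERT INTO DataPoint (DataID, ValidFrom, ValidUntil, Value)
VALUES (1, '2018-11-28 14:51:23', '3999-12-31 00:00:00', 89.443);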

One month from now you have accumulated more measurements for data point #1, and likewise, at different moments, for data points #2 to #500. You now want to find out what the values were one month "ago", i.e. at noon on 27/11/2018:

SELECT DataID, Value
FROM DataPoint
WHERE ValidFrom <= '2018-11-27 12:00:00'
  AND ValidUntil > '2018-11-27 12:00:00';

This will return:

  001,89.768
  002,45.678
  ...,...
  500,112.809

Regarding logging who did this, or for what reason, you can either log it separately (saving, for example, DataPoint ID, timestamp, user ID, ...) or make it part of the original table, so that whenever you register a new data point you also log who measured it.
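As a sketch of the second option (the ChangedBy column and its values are assumptions, not part of the original design):

-- Add an audit column directly to the sparse table.
ALTER TABLE DataPoint ADD ChangedBy varchar(100);

-- Every new measurement then also records who entered it.
INSERT INTO DataPoint (DataID, ValidFrom, ValidUntil, Value, ChangedBy)
VALUES (1, '2018-11-28 14:51:23', '3999-12-31 00:00:00', 89.443, 'chris');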

p.marino
  • That is certainly a very good solution. The drawback being that I have to UPDATE and INSERT for each value changed. But I think I can live with that. Definitely the best solution so far. – Chris L. Nov 27 '18 at 14:24
  • There are solutions around this, too, as I hinted in my comment. You could replace the ValidFrom/ValidUntil with just a single timestamp of when the data was inserted. Then (going back to my example) you select datapoint, max(timestamp) where timestamp <= 27/11/2018-12:00:00 and you are set (see the sketch after these comments)... – p.marino Nov 27 '18 at 15:45
  • See this https://stackoverflow.com/a/17328717/54504 for an example of what I mean. Depending on what DB you decide to use you may need a slightly different syntax but it should be trivial to accomplish the same. – p.marino Nov 27 '18 at 15:49
  • That's pretty brilliant. Solves my question completely. Thank you so much! – Chris L. Dec 06 '18 at 13:47
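For completeness, a sketch of the single-timestamp variant described in the last comments. RecordedAt is an assumed column name replacing ValidFrom/ValidUntil, and the query follows the greatest-n-per-group pattern from the linked answer:

-- One row per measurement; for each data point, the latest row
-- at or before the snapshot instant is the valid one.
SELECT d.DataID, d.Value
FROM DataPoint d
JOIN (
    SELECT DataID, MAX(RecordedAt) AS RecordedAt
    FROM DataPoint
    WHERE RecordedAt <= '2018-11-27 12:00:00'
    GROUP BY DataID
) latest
  ON  latest.DataID = d.DataID
  AND latest.RecordedAt = d.RecordedAt;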