
I'm on a project that involves time series analytics, and I need to let users upload a file containing their own time series (i.e. numbers with dates), for instance as a .csv file. The data contained in their files would then be accessible at any time, to be used within our system.

How could I do that? The ideas I've thought about:

  1. Create a table each time a user uploads a file (and save the name of that table somewhere). If I have lots of users uploading lots of data, I may end up with tons of tables.
  2. Create one big fat monster table with basically three or four columns: the date of the value; the value; the dataset name (and/or the dataset's owner). Everything is uploaded into that table, and when Bob needs his weather data I just select (date, value) where owner = Bob and datasetname = weatherdata (see the sketch after this list).
  3. In between solution: one table per user, and all Bob's datasets are in Bob's table.
  4. Completely different: just save the .csv file somewhere and read it when you need it.
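
For instance, with option 2, reading Bob's weather data would look something like this (all table and column names made up for illustration):

SELECT date, value
FROM   timeseries_points              -- the one big fat table; name is hypothetical
WHERE  owner = 'Bob'
AND    dataset_name = 'weatherdata'
ORDER  BY date;                       -- I'll mostly read in chronological order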

I keep reading that it's bad practice to have a varying number of tables (and I believe it). However, my situation is slightly different from other questions I've seen on this site (most people seem to want to create one table per user, when they should create one row per user).

Some additional information:

  • time series data may contain hundreds of thousands observations, maybe millions
  • a priori, saved data should not be modified afterwards. However, I guess it would be useful to let users append new data to their time series.
  • a priori, I won't need to do complicated SQL select statements. I just want to read Bob's weather data, and I'll probably use it in chronological order - although you never know what tomorrow may bring.
  • using PostgreSQL 9.1, if that's of any importance.

EDIT Reading some answers, I realize I may not have done my job very well. I should have said that I'm clearly already working in a SQL environment: I already have a User table; when I write "table" I really mean "relation"; all 4 of my ideas involve foreign keys somewhere; and RDBMS normalization is the paradigm unless something else is better. (None of this means I'm against non-SQL solutions.)

Arthur
  • One reason that PostgreSQL may be important is that Postgres's approach to partitioning ( http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html ) essentially consists of *explicitly* creating multiple tables and unioning them in a view - this may be significantly less helpful than some other SQLs' approach. I don't have direct experience of partitioning in PostgreSQL, however. – Nov 01 '11 at 17:25
  • I wouldn't say that 4 related tables automatically leads to a relational database solution. Believe me, I'm normally one of the [last people to board the nosql train](http://stackoverflow.com/questions/2571098/moving-to-nosql/2571516#2571516), and have asserted before that [relational databases can scale well into the terabytes](http://stackoverflow.com/questions/2794736/best-data-store-for-billions-of-rows/2794976#2794976). But those are talking about complex business systems; if you've got a very limited system with no isolation required (or intrinsic isolation) then you don't need an RDBMS. – Aaronaught Nov 01 '11 at 18:46

4 Answers


I'm going to have to go with the "big fat monster table". This is how relational databases are meant to work, although you should normalize it (create one table for users, another for data sets, and another for the data points). Having multiple tables with identical schemas is a bad idea from all angles - design, management, security, even querying; are you sure you'll never want to combine information from two data sets?
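
For example, with a normalized design where each data point carries a dataset ID, comparing two data sets is a single self-join (the names here are only illustrative, not a prescription):

SELECT a.date, a.value AS series_a, b.value AS series_b
FROM data_points a
JOIN data_points b ON b.date = a.date   -- line the two series up by date
WHERE a.dataset_id = 1
  AND b.dataset_id = 2
ORDER BY a.date;

With one table per data set, the same comparison means splicing table names into the query string at runtime.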

If you really are certain that each data set will be totally isolated, then you might also consider not using SQL at all. HDF (Hierarchical Data Format) was literally built for this exact purpose: efficient storage and retrieval of "scientific data sets", which are very often time-series data. "Tables" in HDF are literally called data sets; they can share definitions, they can be multidimensional (e.g. one dimension for the day, one for the time), and they are much cheaper than SQL tables.

I don't normally try to steer people away from SQL, but unusual situations sometimes call for unusual solutions. If you're going to end up with billions of rows in a SQL table (or more) and you have practically no other data to store, then SQL may not be the right solution for you.

Aaronaught
  • Or you could use a combination: a relational database for the more ordinary data and HDF for the time-series data. – HLGEM Nov 01 '11 at 18:25
  • Yes, a hybrid would be another option, with the caveat that it can be a bit tricky since HDF has no transaction semantics, so you've got to be very careful about cross-database consistency. But that's not much different from any SQL/noSQL hybrid. – Aaronaught Nov 01 '11 at 18:33
  • @Aaronaught Thanks for your answer. However, I'm not sure I understand your last sentence: "if you have practically no other data to store". I have a whole system with users and other (important) stuff saved in PostgreSQL tables, so I guess I'm not in that case? As for the dataset "feature", I just have to save the data - and be able to link datasets to their owners. – Arthur Nov 01 '11 at 23:24
  • In any case, as I already have lots of stuff in SQL, if I decide to choose the noSQL route it will be hybrid somewhere. – Arthur Nov 01 '11 at 23:27
  • @Arthur: OK, that wasn't obvious from your question/comments - it sounded as though you were just starting this design. Hybrid systems can work, although you have to be careful about the order of operations - be prepared for bugs or transient issues to interrupt the process in the middle of a cross-system write and cause inconsistent state between systems, and make sure you have a way to track and repair these. – Aaronaught Nov 02 '11 at 00:00
  • @Aaronaught, sorry, I realize it wasn't obvious. I haven't made my choice yet; but I must say I'm more in my comfort zone with SQL, and people seem to think it can handle it, so I'm (a bit) leaning this way. Thanks again! – Arthur Nov 02 '11 at 00:30

Example T-SQL* for a possible design:

CREATE TABLE dbo.Datasets (
    ID          int NOT NULL IDENTITY(1,1),
    OwnerUserID int NOT NULL,
    Loaded      datetime NOT NULL,

    -- The foreign key in DatasetValues below needs a PRIMARY KEY
    -- (or UNIQUE constraint) to reference.
    CONSTRAINT PK_Datasets
        PRIMARY KEY ( ID ),

    CONSTRAINT FK_Datasets_Users
        FOREIGN KEY ( OwnerUserID )
        REFERENCES dbo.Users ( ID )
);

CREATE TABLE dbo.DatasetValues (
    DatasetID   int NOT NULL,
    Date        datetime NOT NULL,
    Value       int NOT NULL,

    -- One value per dataset per date; the key also serves as an index
    -- for reading a dataset in chronological order.
    CONSTRAINT PK_DatasetValues
        PRIMARY KEY ( DatasetID, Date ),

    CONSTRAINT FK_DatasetValues_Datasets
        FOREIGN KEY ( DatasetID )
        REFERENCES dbo.Datasets ( ID )
);

The design models two 'entities' implied in your question – the time series data being loaded and the sets of time series data.

*For SQL Server; I know you said PostgreSQL 9.1, but I'm pretty sure you can translate easily enough.
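
For what it's worth, a rough PostgreSQL 9.1 equivalent might look like this (an untested sketch; it assumes your existing users table has an integer id primary key):

CREATE TABLE datasets (
    id            serial PRIMARY KEY,              -- serial replaces IDENTITY(1,1)
    owner_user_id int NOT NULL REFERENCES users (id),
    loaded        timestamp NOT NULL
);

CREATE TABLE dataset_values (
    dataset_id int NOT NULL REFERENCES datasets (id),
    date       timestamp NOT NULL,
    value      int NOT NULL,
    PRIMARY KEY (dataset_id, date)                 -- doubles as an index for chronological reads
);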

Kenny Evitt
  • +1 for including DDL SQL, although I think a single user may have multiple datasets - as suggested by Aaronaught, it might be better to have separate tables for users and datasets. –  Nov 01 '11 at 17:29

Your ideas are all fairly good ways of accomplishing the task (hopefully I've read it correctly).

What about a relational database? For example, a table with the username, the time uploaded, and a unique dataid, with the dataid linked via a foreign key to another table containing the raw file data. This would keep the user table to a minimum (and you could possibly merge it with another table containing the user's details, for example). Having a separate table for users, then another for passwords, another for emails, and then five more for data is probably bad practice, but personally I don't see anything wrong with separating files from user data.
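
Something along these lines, for instance (names and types are only a sketch; in PostgreSQL the raw file could go in a text or bytea column):

CREATE TABLE uploads (
    dataid   serial PRIMARY KEY,
    username varchar(100) NOT NULL,
    uploaded timestamp NOT NULL
);

CREATE TABLE upload_files (
    dataid  int NOT NULL REFERENCES uploads (dataid),  -- foreign key back to the upload
    rawdata text NOT NULL                              -- the file contents, e.g. the .csv as-is
);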

What language are you using to process the data? That could be a deciding factor as well.

Hope this helps :)

Tom

Xleedos

OK, I think option 2 is best; creating extra tables is just a nightmare to maintain and leaves you open to so many errors. Option 4 is somewhat appealing, but I still think a database should be able to cope with this kind of task.

I think I would structure my tables like so:

User Table - UserID, Name, etc.

Row - each row in your uploaded data (RowID, Date, Value)

RowInDataSet - RowID, DataSetID

DataSet - DataSetID, UploadDate, UploadedBy, etc.

This lets you break up your data a little and makes it easy to maintain. Storing large amounts of data should not be such an issue if you correctly index these tables.
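
Here's a rough sketch of that structure in PostgreSQL syntax, since that's what you're on (column names are just examples; note that row is a reserved word in PostgreSQL, so I've called that table datarow in the sketch):

CREATE TABLE users (
    userid serial PRIMARY KEY,
    name   varchar(100) NOT NULL
);

CREATE TABLE dataset (
    datasetid  serial PRIMARY KEY,
    uploaddate timestamp NOT NULL,
    uploadedby int NOT NULL REFERENCES users (userid)
);

-- "row" itself is reserved in PostgreSQL, hence "datarow"
CREATE TABLE datarow (
    rowid serial PRIMARY KEY,
    date  timestamp NOT NULL,
    value int NOT NULL
);

CREATE TABLE rowindataset (
    rowid     int NOT NULL REFERENCES datarow (rowid),
    datasetid int NOT NULL REFERENCES dataset (datasetid),
    PRIMARY KEY (rowid, datasetid)
);

-- index so reading a single data set's rows stays fast
CREATE INDEX idx_rowindataset_datasetid ON rowindataset (datasetid);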

Purplegoldfish
  • Thanks for your answer! But I'm not sure I understand it... The values and their dates are stored in "Row", right? What exactly is the point of "RowInDataSet"? Couldn't I just have a DataSetID foreign key column in the Row table? Or is it more efficient to put that in a separate table? – Arthur Nov 01 '11 at 23:54
  • @Arthur I think that is something you would have to test to get an accurate answer. I'm not a DBA, but I've found that in some cases I've had increased performance doing it this way, and I've seen other database tables designed the same way. – Purplegoldfish Nov 02 '11 at 09:08