
We have a large and growing dataset of experimental data taken from around 30,000 subjects. For each subject, there are several recordings of data. Within each recording, there is a collection of several time series of physiological data, each about 90 seconds long and sampled at 250 Hz. I should note that any given instance of a time series is never extended; only additional recordings are added to the dataset. These recordings are not all of the same length, either. Currently, the data for each recording is contained in its own flat file. These files are organized in a directory structure that is broken down hierarchically by version of the overall experiment, experiment location, date, and experiment terminal (in that hierarchical order).

Most of our analysis is done in MATLAB, and we plan to continue to use MATLAB extensively for further analysis. The situation as it stood was workable (if undesirable) when all researchers were co-located. We are now spread around the globe, and I am investigating the best solution to make all of this data available from remote locations. I am well-versed in MySQL and SQL Server and could easily come up with a way to structure this data within such a paradigm. I am, however, skeptical as to the efficiency of this approach. I would value any suggestions that might point me in the right direction. Should I be considering something different? Time series databases (though those seem to me to be tuned for extending existing time series)? Something else?

Analysis does not need to be done online, though the possibility of doing so would be a plus. For now, our typical use case would be to query for a specific subset of recordings and pull down the associated time series for local analysis. I appreciate any advice you might have!

Update:

In my research, I've found this paper, in which the authors store and analyze very similar signals. They chose MongoDB for the following reasons:

  • Speed of development
  • The ease of adding fields to existing documents (features extracted from signals, etc.)
  • Ease of MapReduce use through the MongoDB API itself

These are all attractive advantages to me, as well. The development looks dead simple, and the ability to easily augment existing documents with the results of analysis is clearly helpful (though I know this isn't exactly difficult to do in the systems with which I am already familiar).

To be clear, I know that I can leave the data stored in flat files, and I know I could simply arrange for secure access to these flat files via MATLAB over the network. There are numerous reasons I want to store this data in a database. For instance:

  • There is little structure to the flat files now, other than the hierarchical structure stated above. It is impossible, for instance, to pull all data from a particular day without pulling down every individual file for every terminal from that day.
  • There is no way to query against metadata associated with a particular recording. I shudder to think of the hoops I'd need to jump through to pull all data for female subjects, for example.

The long and short of it is that I want to store these data in a database for myriad reasons (space, efficiency, and ease-of-access considerations, among many others).

Update 2:

I seem not to be describing the nature of these data sufficiently, so I will attempt to clarify. These recordings are certainly time series data, but not in the way many people think of time series. I am not continually capturing data to be appended to an existing time series. I am really making multiple recordings, all with varying metadata, but of the same three signals. These signals can be thought of as vectors of numbers, and the lengths of these vectors vary from recording to recording. In a traditional RDBMS, I might create one table for recording type A, one for B, etc., and treat each row as a data point in the time series. However, this does not work, as the recordings vary in length. Rather, I would prefer to have an entity that represents a person, and have that entity associated with the several recordings taken from that person. This is why I have considered MongoDB, as I can nest several arrays (of varying lengths) within one object in a collection.

Potential MongoDB Structure

As an example, here's what I sketched as a potential MongoDB BSON structure for a subject:

{
    "songs": 
    {
        "order": 
        [
            "R008",
            "R017",
            "T015"
        ],
        "times": [
            { 
                "start": "2012-07-02T17:38:56.000Z",
                "finish": "2012-07-02T17:40:56.000Z",
                "duration": 119188.445
            },
            { 
                "start": "2012-07-02T17:42:22.000Z",
                "finish": "2012-07-02T17:43:41.000Z",
                "duration": 79593.648
            },
            { 
                "start": "2012-07-02T17:44:37.000Z",
                "finish": "2012-07-02T17:46:19.000Z",
                "duration": 102450.695
            }
        ] 
    },
    "self_report":
    {
        "music_styles":
        {
                "none": false,
                "world": true
        },
        "songs":
        [
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 4
            },
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 3
            },
            {
                "engagement": 2,
                "positivity": 1,
                "activity": 2,
                "power": 2,
                "chills": 4,
                "like": 1,
                "familiarity": 1
            }
        ],
        "most_engaged": 1,
        "most_enjoyed": 1,
        "emotion_indices":
        [
            0.729994,
            0.471576,
            28.9082
        ]
    },
    "signals":
    {
        "test":
        {
            "timestamps":
            [
                0.010, 0.010, 0.021, ...
            ],
            "eda":
            [
                149.200, 149.200, 149.200, ...
            ],
            "pox":
            [
                86.957, 86.957, 86.957, ...
            ]
        },
        "songs":
        [
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            },
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            },
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            }
        ]
    },
    "demographics":
    {
        "gender": "female",
        "dob": 1980,
        "nationality": "rest of the world",
        "musical_background": false,
        "musical_expertise": 1,
        "impairments":
        {
            "hearing": false,
            "visual": false
        }
    },
    "timestamps":
    {
        "start": "2012-07-02T17:37:47.000Z",
        "test": "2012-07-02T17:38:16.000Z",
        "end": "2012-07-02T17:46:56.000Z"
    }
}

Those signals are the time series.
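
For illustration, here is a rough sketch (using PyMongo; the database and collection names are hypothetical, while the field names follow the document above) of how our typical use case (query by metadata, pull down only the associated signals) might look against this structure:

# A minimal PyMongo sketch; 'experiment' and 'subjects' are made-up names.
# It pulls only the song signals and session timestamps for female subjects.
from pymongo import MongoClient

client = MongoClient("mongodb://db.example.org:27017")   # hypothetical server
subjects = client.experiment.subjects

cursor = subjects.find(
    {"demographics.gender": "female"},            # metadata query
    {"signals.songs": 1, "timestamps": 1},        # projection: signals and session times only
)
for doc in cursor:
    for i, song in enumerate(doc["signals"]["songs"]):
        print(doc["_id"], i, len(song["eda"]), "EDA samples")

From MATLAB we would presumably go through a community driver or a thin export layer, but the shape of the query would be the same.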

GarlicFries
  • Are you sure you don't overestimate the problem? See this for example: http://stackoverflow.com/questions/3779088/database-that-can-handle-500-millions-rows – Dennis Jaheruddin Jul 27 '13 at 18:37
  • I don't understand what you mean by 'overestimate the problem'. If you're saying that 30,000 subjects is not a lot of 'rows', I'm very aware of that. That's not the concern I'm trying to address here... – GarlicFries Jul 28 '13 at 02:11
  • 3
    From your info I understand that you have a well organized directory structure and routines to read in the files with MATLAB. You problem however is (concurrent?) remote access. As a simple short term solution, just put your directory structure online and restrict access by ip/user etc... Also, concurrency can be solved by MATLAB with a try catch block within a while loop or for better design by a database. What is your real problem and goal, i.e. why are you not happy with files read in by MATLAB? – Oleg Jul 28 '13 at 23:40
  • As @OlegKomarov said, what exactly is the problem you face now? You also mention that you are not comfortable with setting this up in SQL Server because of 'efficiency', but what do you mean by that? (My previous comment was based on the thought that you were concerned about the number of records.) – Dennis Jaheruddin Jul 29 '13 at 13:07
  • Are you using MAT-files for storage or a custom format for storing the time series? Also, how big is the data currently in total? I once worked with a similar dataset of biological signals. It was about 500 GB in size and was stored in a regular RDBMS database ([PostgreSQL](http://www.postgresql.org/docs/devel/static/arrays.html)). My personal advice: don't do it, keep using flat files instead :) If you need to provide remote access, set up a file server and manage access in the usual ways. – Amro Jul 29 '13 at 14:12
  • Just custom ASCII files right now. About the same size as yours, as well, @Amro. – GarlicFries Jul 29 '13 at 14:26
  • What you could do is separate the "metadata" from the actual signals, and only store the subjects information in a SQL database. That way you could run queries to perform analysis (like get all patients where gender=male, age>50 and so on). But since the signals part would not benefit from being stored in a database, keep them in flat files. Maybe use filenames as record ID for the signals. – Amro Jul 29 '13 at 14:52
  • @Amro, I've considered this, as well. Really, my only hesitance is losing the ability for online processing (MapReduce, etc.) So, even if I were to take this approach, I'd like to allow for the room to fold the signals into the database. With that in mind (in other words, assuming that the signals *will* one day move into the database), can you provide your opinion? – GarlicFries Jul 29 '13 at 17:20
  • @GarlicFries: I'm afraid I don't have much experience with the MapReduce approach; I've only started learning NoSQL databases recently. All I'm saying is that you shouldn't immediately disregard using regular files; their simplicity is an advantage, especially for other team members not familiar with the other techniques. – Amro Jul 30 '13 at 09:42

3 Answers


Quite often, when people come to NoSQL databases, they come having heard that there's no schema and life's all good. However, IMHO this is a really wrong notion.

When dealing with NoSQL, you have to think in terms of "aggregates". Typically, an aggregate is an entity that can be operated on as a single unit. In your case, one possible (but not that efficient) way would be to model a user and his/her data as a single aggregate. This ensures that your user aggregate can be data centre / shard agnostic. But if the data is going to grow, loading a user will also load all the related data and be a memory hog. (Mongo as such is a bit greedy on memory.)

Another option would be to store each recording as its own aggregate and "link" it back to the user with an id; this can be a synthetic key that you create, like a GUID. Even though this superficially seems like a join, it's just a "look up by property", since there's no real referential integrity here. This may be the approach I'd take if files are going to be added constantly.
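
As a rough sketch of that second option (PyMongo here; the database, collection, and field names are purely illustrative):

# Recordings as their own aggregates, linked to a subject by a synthetic key.
# Everything here is hypothetical; the point is the shape, not the exact API.
import uuid
from datetime import datetime
from pymongo import MongoClient

db = MongoClient().experiment
subject_id = str(uuid.uuid4())                  # the synthetic key / GUID

db.subjects.insert_one({"_id": subject_id,
                        "demographics": {"gender": "female", "dob": 1980}})

db.recordings.insert_one({"subject_id": subject_id,     # the "link" back to the subject
                          "start": datetime(2012, 7, 2, 17, 38, 56),
                          "eda": [149.2, 149.2, 149.2],
                          "pox": [86.957, 86.957, 86.957]})

# "Look up by property" rather than a join:
for rec in db.recordings.find({"subject_id": subject_id}):
    print(rec["start"], len(rec["eda"]))

New recordings simply become new documents, so the subject document never grows.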

The place where MongoDB shines is that you can do ad hoc queries by any property in the document (create an index for that property if you don't want to lose hair later down the road). You will not go wrong choosing Mongo for time series data storage. You can extract data that matches an id within a date range, for example, without doing any major stunts.
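
Continuing the sketch above (names still illustrative), the index and a date-range pull could look like:

# Index the properties you query on, then pull one subject's recordings
# that fall within a date range; no major stunts required.
db.recordings.create_index([("subject_id", 1), ("start", 1)])

window = db.recordings.find({
    "subject_id": subject_id,
    "start": {"$gte": datetime(2012, 7, 1), "$lt": datetime(2012, 8, 1)},
})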

Please do ensure that you have replica sets no matter which approach you take, and diligently choose your sharding approach early on; sharding later is no fun.

JVXR

I feel like this may not answer the right question, but here is what I would probably go for (using SQL Server):

User (table)

  • UserId
  • Gender
  • Expertise
  • etc...

Sample (table)

  • SampleId
  • UserId
  • StartTime
  • Duration
  • Order
  • etc...

Series (table)

  • SampleId
  • SecondNumber (about 1-90)
  • Values (string with values)

I think this should give you fairly flexible access, as well as reasonable memory efficiency. As the values are stored in string format, you cannot do analysis on the time series in SQL (they will need to be parsed first), but I don't think that should be a problem. Of course, you could also use MeasurementNumber and Value columns; then you have complete freedom.
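
Here is a rough, self-contained sketch of that layout (SQLite stands in for SQL Server purely so the example runs anywhere; Order and Values are renamed because they are reserved words):

# Sketch of the User / Sample / Series layout; column names follow the lists above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE User   (UserId INTEGER PRIMARY KEY, Gender TEXT, Expertise INTEGER);
    CREATE TABLE Sample (SampleId INTEGER PRIMARY KEY, UserId INTEGER REFERENCES User,
                         StartTime TEXT, Duration REAL, SongOrder TEXT);
    CREATE TABLE Series (SampleId INTEGER REFERENCES Sample,
                         SecondNumber INTEGER,   -- about 1-90
                         Vals TEXT);             -- string with the values for that second
""")
con.execute("INSERT INTO User VALUES (1, 'female', 1)")
con.execute("INSERT INTO Sample VALUES (1, 1, '2012-07-02T17:38:56Z', 119.188, 'R008')")
con.execute("INSERT INTO Series VALUES (1, 1, '149.2,149.2,149.2')")

# The metadata query stays in SQL; the packed values are parsed client-side.
row = con.execute("""SELECT se.Vals FROM Series se
                     JOIN Sample sa ON sa.SampleId = se.SampleId
                     JOIN User   u  ON u.UserId    = sa.UserId
                     WHERE u.Gender = 'female'""").fetchone()
eda = [float(v) for v in row[0].split(",")]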

Of course this is not as complete as your MongoDB setup but the gaps should be fairly easy to fill.

Dennis Jaheruddin

You should really investigate LDAP and its data model. There is clearly a strong hierarchical character to your data, and LDAP is already commonly used to store attributes about people. It's a mature, standardized network protocol so you can choose from a variety of implementations, as opposed to being locked into a particular NoSQL flavor-of-the-month choice. LDAP is designed for distributed access, provides a security model for authentication (and authorization/access control as well) and is extremely efficient. More so than any of these HTTP-based protocols.
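
As a sketch of what a remote metadata query might look like (ldap3 in Python; the directory layout, credentials, and the gender/recording attributes are entirely hypothetical and would need a custom schema):

# Sketch only: the DIT layout and attribute names are invented.
from ldap3 import Server, Connection, SUBTREE

server = Server("ldap.example.org")
conn = Connection(server, user="cn=reader,dc=lab,dc=org",
                  password="secret", auto_bind=True)

# The experiment-version / location / date / terminal hierarchy maps naturally
# onto the directory tree, so a scoped search replaces a directory walk.
conn.search(search_base="ou=2012-07-02,ou=london,ou=v2,dc=lab,dc=org",
            search_filter="(&(objectClass=inetOrgPerson)(gender=female))",
            search_scope=SUBTREE,
            attributes=["cn", "recordingRef"])
for entry in conn.entries:
    print(entry.entry_dn, entry.recordingRef)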

hyc