
I am currently employed as an undergraduate intern in a neuroscience research lab at my university, where I am the dedicated "coder/math guy". I know this description is cringey, but I think it is how I am perceived.

I have a number of tasks, including updating legacy code, which has led me to take a step back and try to conceptualize the best way to do this. I am looking for advice on how to structure my process here.

The data currently comes from two sources. One is a 3D motion-capture system that gathers data on different markers, most notably their XYZ positions; it can export this data as CSV, but there is also a weird old API written in C that I could use to fetch the data manually. The other source is an eye tracker that stores its data in a proprietary binary format, with a MEX file provided to extract the relevant info in MATLAB; however, that compiled MEX file seems to be completely black-boxed.

I was thinking of storing all that data in a relational database and using a Python library to fetch what is needed and structure it into DataFrames for analysis. We are talking about roughly 100,000 × 10 float64 values for the 3D motion capture.
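
Concretely, the fetch step I have in mind would look something like this (just a sketch; the database, table, and column names are made up):

import sqlite3
import pandas as pd

# Hypothetical layout: one row per frame per marker for a given session
conn = sqlite3.connect("lab_data.sqlite")
query = """
    SELECT frame, marker_id, x, y, z
    FROM mocap_markers
    WHERE session_id = ?
"""
df = pd.read_sql(query, conn, params=(42,))
conn.close()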

Is it worth looking into SQL and NoSQL schemas, or should I just store everything in HDF5 or JSON in structured directories?

  • Your question will probably be closed because it is either too broad or asking for a recommendation. However, both standard SQL databases and NoSQL stores can handle a million points with no problem. – Gordon Linoff Mar 14 '16 at 16:12
  • Is it that broad? I mean it is soft, but it's not like I'm asking for a tutorial or anything. I know both can handle a lot of data; I'm asking more about the best structure to adopt. – SolipsistElvis Mar 14 '16 at 16:26
  • I think it will depend on how you want to use/process your data. Storing 1 million new rows per day should not be an issue for either approach. If you choose SQL, I would recommend partitioning your data so that the tables remain maintainable. – MaxU - stand with Ukraine Mar 14 '16 at 17:28

1 Answer


I have asked myself the same question, so I will try to summarize what I have found.

Given a dataset similar to yours:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0    100000 non-null float64
1    100000 non-null float64
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(10)
memory usage: 8.4 MB

Your "basic" dataset will need ≈ 10 MB

General view

So HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis.

https://hdfgroup.org/wp/2015/03/hdf5-as-a-zero-configuration-ad-hoc-scientific-database-for-python/

Generally, HDF5 works better with numeric data than with strings, which seems to be your case. Perhaps its main limitation is concurrency:

Of greater concern are recent postings on a mailing list discussing use of netCDF and HDF5 in high performance computing applications with thousands of processors using parallel I/O, which warn of the danger of file corruption during parallel I/O if a client dies at a particular time. The HDF Group is aware of this problem and is addressing it.

https://earthdata.nasa.gov/standards/hdf5

Looking at import efficiency

Example 1:

As shown in the result, the time of data import from HDF5 is the shortest: only ~50% of the import time from CSV and ~25% of the import time from SQLite.

https://statcompute.wordpress.com/tag/hdf5/

Example 2:

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

HDF5 - concurrency, compression & I/O performance
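
The benchmark functions themselves are not shown above; a rough reconstruction of what they could look like with pandas (the file names, the SQLite engine, and the test frame are my assumptions, not taken from the linked post):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for the 100,000 x 10 float64 dataset, with string column names
df = pd.DataFrame(np.random.randn(100000, 10),
                  columns=["c%d" % i for i in range(10)])

engine = create_engine("sqlite:///test.sqlite")
df.to_sql("df", engine, index=False, if_exists="replace")
df.to_hdf("test_fixed.h5", "df", mode="w", format="fixed")
df.to_hdf("test_table.h5", "df", mode="w", format="table")

def test_sql_read():
    # Full-table read back from SQLite
    return pd.read_sql("SELECT * FROM df", engine)

def test_hdf_fixed_read():
    # "fixed" format: fastest reads, but no on-disk querying
    return pd.read_hdf("test_fixed.h5", "df")

def test_hdf_table_read():
    # "table" (PyTables) format: slightly slower, but supports where= queries
    return pd.read_hdf("test_table.h5", "df")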

SQL

Good for:

  • Your data is structured (you have a ‘Schema’)
  • Relational (tables of rows and columns)
  • Mid-size, <= several GB in total
  • Transactional operations (ensuring the DB stays consistent)

https://www.nersc.gov/assets/Uploads/09-Databases-Wahid2.pdf
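
Applied to your mocap case, a minimal sketch of that relational route (the schema, table, and marker names are invented for illustration):

import sqlite3
import numpy as np
import pandas as pd

# Invented schema: one row per frame per marker, tagged with a session id
mocap = pd.DataFrame({
    "session_id": 1,
    "frame": np.repeat(np.arange(1000), 3),
    "marker": np.tile(["head", "wrist", "elbow"], 1000),
    "x": np.random.randn(3000),
    "y": np.random.randn(3000),
    "z": np.random.randn(3000),
})

conn = sqlite3.connect("mocap.sqlite")
mocap.to_sql("markers", conn, if_exists="append", index=False)

# Later, pull back only the slice an analysis needs
head = pd.read_sql(
    "SELECT frame, x, y, z FROM markers WHERE session_id = 1 AND marker = 'head'",
    conn)
conn.close()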

In a previous SO discussion:

SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

But pandas (HDF5 through PyTables) has a rich query capacity; recently, SQL-like queries were added:

In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental
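
The same kind of query also works directly against an HDF5 store written in table format, without loading the whole file into memory first (a small sketch; the file name and key are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 3), columns=list("abc"))
# data_columns=True makes every column queryable on disk
df.to_hdf("store.h5", "df", mode="w", format="table", data_columns=True)

# Only the rows matching the condition are read back from disk
subset = pd.read_hdf("store.h5", "df", where="a < 0.1")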

NoSQL

Some insights about HDF5 vs NoSQL, from a related SO question:

  • HDF5 is not a database. MongoDB offers ACID guarantees (at the document level); HDF5 doesn't (which might be important).
  • HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).

What is a better approach of storing and querying a big dataset of meteorological data

Conclusions

Indeed, for some applications a database management system just isn’t justified:

  • If all the datasets are small,
  • If data storage and access requirements aren’t going to change (flexibility isn’t needed).

http://www.barrodale.com/docs/Why%20don't%20scientists%20use%20databases.pdf
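
If you go with the HDF5-plus-structured-directories option from your question instead, one workable pattern is a single file per experiment with one group per subject and session; a rough sketch (the group names and placeholder DataFrames are invented):

import numpy as np
import pandas as pd

# Placeholders for data you would load from the CSV export and the MEX extraction
mocap_df = pd.DataFrame(np.random.randn(100000, 10),
                        columns=["c%d" % i for i in range(10)])
eye_df = pd.DataFrame(np.random.randn(50000, 4),
                      columns=["t", "x", "y", "pupil"])

# One HDF5 file per experiment, one group per subject and session
with pd.HDFStore("experiment01.h5", mode="a", complevel=9, complib="blosc") as store:
    store.put("subject01/session01/mocap", mocap_df, format="table")
    store.put("subject01/session01/eyetracker", eye_df, format="table")

# Later, load just the piece you need for an analysis
mocap = pd.read_hdf("experiment01.h5", "subject01/session01/mocap")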

Hope it helps.

– Mike