
I am currently employed as an undergraduate intern in a neuroscience research lab at my university, where I am the dedicated "coder/math guy". I know this description is cringey, but I think it is how I am perceived.

I have a number of tasks, including updating legacy code, which has led me to take a step back and try to conceptualize the best way to do this. I am looking for advice on how to structure my process here.

The data currently comes from two sources. One is a 3D motion-capture system that gathers data on different markers, most notably their XYZ positions; it can export this data as CSV, but there is also a weird old API written in C that I could use to fetch the data manually. The other source is an eye tracker that stores its data in a proprietary binary format, with a MEX file provided to extract the relevant info in MATLAB; however, that compiled MEX file seems to be completely black-boxed.

I was thinking of storing all that data in a relational database and using a Python library to fetch what is needed and structure it into DataFrames for analysis. We are talking about roughly 100,000 × 10 float64 values for the 3D motion capture.
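
Concretely, the fetch step I have in mind would look something like this (just a sketch; the database, table, and column names are made up):

import sqlite3
import pandas as pd

# Hypothetical layout: one row per frame per marker for a given session
conn = sqlite3.connect("lab_data.sqlite")
query = """
    SELECT frame, marker_id, x, y, z
    FROM mocap_markers
    WHERE session_id = ?
"""
df = pd.read_sql(query, conn, params=(42,))
conn.close()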

Is it worth looking into SQL and NoSQL schemas, or should I just store everything in HDF5 or JSON in structured directories?

  • Your question will probably be closed because it is either too broad or asking for a recommendation. However, both standard SQL databases and NoSQL stores can handle a million points with no problem. – Gordon Linoff Mar 14 '16 at 16:12
  • Is it that broad? I mean it is soft, but it's not like I'm asking for a tutorial or anything. I know both can handle a lot of data; I'm asking more about the best structure to adopt. – SolipsistElvis Mar 14 '16 at 16:26
  • I think it will depend on how you want to use/process your data. Storing 1 million new rows per day should not be an issue for either approach. If you choose SQL, I would recommend partitioning your data so that the tables remain maintainable. – MaxU - stand with Ukraine Mar 14 '16 at 17:28

1 Answer


I have asked myself the same question, so I will try to summarize what I have found.

Given a dataset similar to yours:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0    100000 non-null float64
1    100000 non-null float64
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(10)
memory usage: 8.4 MB

Your "basic" dataset will need ≈ 10 MB

General view

So HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis.

https://hdfgroup.org/wp/2015/03/hdf5-as-a-zero-configuration-ad-hoc-scientific-database-for-python/

Generally, HDF5 works better with numeric data than with strings, which seems to be your case. Perhaps its main limitation is concurrency:

Of greater concern are recent postings on a mailing list discussing use of netCDF and HDF5 in high performance computing applications with thousands of processors using parallel I/O, which warn of the danger of file corruption during parallel I/O if a client dies at a particular time. The HDF Group is aware of this problem and is addressing it.

https://earthdata.nasa.gov/standards/hdf5

Looking at import efficiency

Example 1:

As shown in the result, the time of data import from HDF5 is the shortest: only ~50% of the import time from CSV and ~25% of the import time from SQLite.

https://statcompute.wordpress.com/tag/hdf5/

Example 2:

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

HDF5 - concurrency, compression & I/O performance
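
The benchmark functions themselves are not shown above; a rough reconstruction of what they could look like with pandas (the file names, the SQLite engine, and the test frame are my assumptions, not taken from the linked post):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for the 100,000 x 10 float64 dataset, with string column names
df = pd.DataFrame(np.random.randn(100000, 10),
                  columns=["c%d" % i for i in range(10)])

engine = create_engine("sqlite:///test.sqlite")
df.to_sql("df", engine, index=False, if_exists="replace")
df.to_hdf("test_fixed.h5", "df", mode="w", format="fixed")
df.to_hdf("test_table.h5", "df", mode="w", format="table")

def test_sql_read():
    # Full-table read back from SQLite
    return pd.read_sql("SELECT * FROM df", engine)

def test_hdf_fixed_read():
    # "fixed" format: fastest reads, but no on-disk querying
    return pd.read_hdf("test_fixed.h5", "df")

def test_hdf_table_read():
    # "table" (PyTables) format: slightly slower, but supports where= queries
    return pd.read_hdf("test_table.h5", "df")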

SQL

Good for:

  • Your data is structured (you have a ‘Schema’)
  • Relational (tables of rows and columns)
  • Mid-size, <= several GB in total
  • Transactional operations (ensuring the DB stays consistent)

https://www.nersc.gov/assets/Uploads/09-Databases-Wahid2.pdf
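
Applied to your mocap case, a minimal sketch of that relational route (the schema, table, and marker names are invented for illustration):

import sqlite3
import numpy as np
import pandas as pd

# Invented schema: one row per frame per marker, tagged with a session id
mocap = pd.DataFrame({
    "session_id": 1,
    "frame": np.repeat(np.arange(1000), 3),
    "marker": np.tile(["head", "wrist", "elbow"], 1000),
    "x": np.random.randn(3000),
    "y": np.random.randn(3000),
    "z": np.random.randn(3000),
})

conn = sqlite3.connect("mocap.sqlite")
mocap.to_sql("markers", conn, if_exists="append", index=False)

# Later, pull back only the slice an analysis needs
head = pd.read_sql(
    "SELECT frame, x, y, z FROM markers WHERE session_id = 1 AND marker = 'head'",
    conn)
conn.close()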

In a previous SO discussion:

SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

But pandas (HDF5 through PyTables) has a rich query capacity; recently, SQL-like queries were added:

In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental
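
The same kind of query also works directly against an HDF5 store written in table format, without loading the whole file into memory first (a small sketch; the file name and key are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 3), columns=list("abc"))
# data_columns=True makes every column queryable on disk
df.to_hdf("store.h5", "df", mode="w", format="table", data_columns=True)

# Only the rows matching the condition are read back from disk
subset = pd.read_hdf("store.h5", "df", where="a < 0.1")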

NoSQL

Some insights about HDF5 vs NoSQL, from a related SO question:

  • HDF5 is not a database. MongoDB offers ACID guarantees (at the document level); HDF5 doesn't (which might be important).
  • HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).

What is a better approach of storing and querying a big dataset of meteorological data

Conclusions

Indeed, for some applications a database management system just isn’t justified:

  • If all the datasets are small,
  • If data storage and access requirements aren’t going to change (flexibility isn’t needed).

http://www.barrodale.com/docs/Why%20don't%20scientists%20use%20databases.pdf
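
If you go with the HDF5-plus-structured-directories option from your question instead, one workable pattern is a single file per experiment with one group per subject and session; a rough sketch (the group names and placeholder DataFrames are invented):

import numpy as np
import pandas as pd

# Placeholders for data you would load from the CSV export and the MEX extraction
mocap_df = pd.DataFrame(np.random.randn(100000, 10),
                        columns=["c%d" % i for i in range(10)])
eye_df = pd.DataFrame(np.random.randn(50000, 4),
                      columns=["t", "x", "y", "pupil"])

# One HDF5 file per experiment, one group per subject and session
with pd.HDFStore("experiment01.h5", mode="a", complevel=9, complib="blosc") as store:
    store.put("subject01/session01/mocap", mocap_df, format="table")
    store.put("subject01/session01/eyetracker", eye_df, format="table")

# Later, load just the piece you need for an analysis
mocap = pd.read_hdf("experiment01.h5", "subject01/session01/mocap")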

Hope it helps.

– Mike