EDIT: As result of the answers so far I like to add more focus in what I like to zero in on: A database that allows writing in-memory (could be simple C# code) with persistence to storage options in order to access the data from within R. Redis so far looks the most promising. I also consider to actually use something similar to Lockfree++ or ZeroMQ, in order to avoid writing data concurrently to the database, but rather sending all to be persisted data over a message bus/other implementation and to have one "actor" handle all write operations to an in-memory db or other solution. Any more ideas aside Redis (some mentioned SQLite and I will need to still test its performance). Any other suggestions?
I am searching for the ideal database structure/solution that meets most of my below requirements but so far I utterly failed. Can you please help?
My tasks: I run a process in .Net 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications and therefore like to either preserve in-memory or persist on disk. More below. The data is generated within different tasks/threads and thus a row based data format does not lend itself well to match this situation (because the data generated in different threads is generated at different times and is thus not aligned). Thus I thought a columnar data structure may be suitable but please correct me if I am wrong.
Example:
Tasks/Thread #1 generates the following data at given time stamps
datetime.ticks / value of output data
1000000001 233.23
1000000002 233.34
1000000006 234.23 ...
Taks/Thread #2 generates the following data at given time stamps
datetime.ticks / value of output data
1000000002 33.32
1000000005 34.34
1000000015 54.32 ...
I do not need to align the time stamps at the .Net run-time, I am first and foremost after preserving the data and to process the data within R or Python at a later point.
My requirements:
Fast writes, fast writes, fast writes: It can happen that I generate 100,000- 1,000,000 data points per second and need to persist (worst case) or retain in memory the data. Its ok to run the writes on its own thread so this process can lag the data generation process but limitation is 16gb RAM (64bit code), more below.
Preference is for columnar db format as it lends itself well to how I want to query the data later but I am open to any other structure if it makes sense in regards to the examples above (document/key-value also ok if all other requirements are met, especially in terms of write speed).
API that can be referenced from within .Net. Example: HDF5 may be considered capable by some but I find their .Net port horrible.Something that supports .Net a little better would be a plus but if all other requirements are met then I can deal with something similar to the HDF5 .Net port.
Concurrent writes if possible: As described earlier I like to write data concurrently from different tasks/threads.
I am constrained by 16gb memory (run .Net process in 64bit) and thus I probably look for something that is not purely in-memory as I may sometimes generate more data than that. Something in-memory which persists at times or a pure persistence model is probably preferable.
Preference for embedded but if a server in a client/server solution can run as a windows service then no issue.
In terms of data access I have strong preference for a db solution for which interfaces from R and Python already exist because I like to use the Panda library within Python for time series alignments and other analysis and run analyses within R.
If the API/library supports in addition SQL/SQL-like/Linq/ like queries that would be terrific but generally I just need the absolute bare bones such as load columnar data in between start and end date (given the "key"/index is in such format) because I analyze and run queries within R/Python.
If it comes with a management console or data visualizer that would be a plus but not a must.
Should be open source or priced within "reach" (no, KDB does not qualify in that regards ;-)
OK, here is what I have so far, and again its all I got because most db solution simply fail already on the write performance requirement:
- Infobright and Db4o. I like what I read so far but I admit I have not checked into any performance stats
- Something done myself. I can easily store value types in binary format and index the data by datetime.ticks , I just would need to somehow write scripts to load/deserialize the data in Python/R. But it would be a massive tasks if I wanted to add concurrency, a query engine, and other goodies. Thus I look for something already out there.