
EDIT: As a result of the answers so far, I would like to add more focus on what I want to zero in on: a database that allows writing in-memory (could be simple C# code) with persistence to storage options, in order to access the data from within R. Redis so far looks the most promising. I am also considering using something like Lockfree++ or ZeroMQ, in order to avoid writing data concurrently to the database and instead send all to-be-persisted data over a message bus/other implementation, with one "actor" handling all write operations to an in-memory db or other solution. Any more ideas aside from Redis? (Some mentioned SQLite and I still need to test its performance.) Any other suggestions?

I am searching for the ideal database structure/solution that meets most of my requirements below, but so far I have utterly failed. Can you please help?

My task: I run a process in .Net 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications, and therefore would like to either preserve in memory or persist on disk. More below. The data is generated within different tasks/threads, and thus a row-based data format does not lend itself well to this situation (because the data generated in different threads is generated at different times and is thus not aligned). I therefore thought a columnar data structure may be suitable, but please correct me if I am wrong.

Example:

Tasks/Thread #1 generates the following data at given time stamps

datetime.ticks / value of output data

1000000001 233.23
1000000002 233.34
1000000006 234.23 ...

Tasks/Thread #2 generates the following data at given time stamps

datetime.ticks / value of output data

1000000002 33.32
1000000005 34.34
1000000015 54.32 ...

I do not need to align the time stamps at .Net run-time; I am first and foremost after preserving the data so I can process it within R or Python at a later point.

My requirements:

  • Fast writes, fast writes, fast writes: It can happen that I generate 100,000-1,000,000 data points per second and need to persist (worst case) or retain in memory the data. It's OK to run the writes on their own thread so this process can lag the data generation process, but the limitation is 16 GB RAM (64-bit code); more below.

  • Preference is for a columnar db format as it lends itself well to how I want to query the data later, but I am open to any other structure if it makes sense with regard to the examples above (document/key-value is also OK if all other requirements are met, especially in terms of write speed).

  • API that can be referenced from within .Net. Example: HDF5 may be considered capable by some, but I find their .Net port horrible. Something that supports .Net a little better would be a plus, but if all other requirements are met then I can deal with something similar to the HDF5 .Net port.

  • Concurrent writes if possible: As described earlier, I would like to write data concurrently from different tasks/threads.

  • I am constrained by 16 GB memory (I run the .Net process in 64-bit) and thus I am probably looking for something that is not purely in-memory, as I may sometimes generate more data than that. Something in-memory which persists at times, or a pure persistence model, is probably preferable.

  • Preference for embedded, but if the server in a client/server solution can run as a Windows service then that is no issue.

  • In terms of data access I have a strong preference for a db solution for which interfaces from R and Python already exist, because I would like to use the pandas library within Python for time series alignments and other analysis, and run analyses within R.

  • If the API/library in addition supports SQL/SQL-like/LINQ-like queries, that would be terrific, but generally I just need the absolute bare bones, such as loading columnar data between a start and end date (given the "key"/index is in such a format), because I analyze and run queries within R/Python.

  • If it comes with a management console or data visualizer that would be a plus but not a must.

  • Should be open source or priced within "reach" (no, KDB does not qualify in that regard ;-)

OK, here is what I have so far, and again it's all I've got, because most db solutions simply fail already on the write-performance requirement:

  • Infobright and Db4o. I like what I have read so far, but I admit I have not checked into any performance stats.
  • Something done myself. I can easily store value types in binary format and index the data by datetime.ticks; I just would need to somehow write scripts to load/deserialize the data in Python/R (see the rough sketch below). But it would be a massive task if I wanted to add concurrency, a query engine, and other goodies. Thus I am looking for something that is already out there.
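
For reference, a minimal sketch of that roll-your-own option: append fixed-width (ticks, value) records per task/thread so R/Python can later block-read or mmap them. The class name, file layout and buffer size are assumptions for illustration, not a finished format.

    using System;
    using System.IO;

    // Minimal sketch: one writer per task/thread appends fixed-width (ticks, value) records.
    // 16 bytes per record (8-byte long + 8-byte double) is trivial to read back from R or Python.
    public sealed class BinaryColumnWriter : IDisposable
    {
        private readonly BinaryWriter _writer;

        public BinaryColumnWriter(string path)
        {
            // Large FileStream buffer; FileShare.Read lets readers look at already-flushed data.
            _writer = new BinaryWriter(
                new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read, 1 << 20));
        }

        public void Append(long ticks, double value)
        {
            _writer.Write(ticks);
            _writer.Write(value);
        }

        public void Dispose()
        {
            _writer.Dispose();   // flushes the buffer and closes the file
        }
    }

Usage would be something like `new BinaryColumnWriter("thread1.bin")` per producing task, calling `Append(DateTime.UtcNow.Ticks, 233.23)`; the matching reader on the R/Python side can then use readBin or numpy.fromfile on the same fixed 16-byte record layout, which is the "scripts to load/deserialize" part mentioned above.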
Matt
  • Believe it or not.. SQLite does fairly decently in some benchmarks I've come across. Have you considered that? – Simon Whitehead Nov 05 '12 at 01:35
  • @SimonWhitehead, thanks, yes I do consider it, but so far I have shied away because I have not heard that it does a whole lot better than 100k inserts/writes per second; also, to my knowledge (please correct me if wrong) SQLite does not handle concurrent writes all too well... – Matt Nov 05 '12 at 01:38
  • 1
    thought about an object database like db4o? http://www.db4o.com/s/benchmarkdb.aspx – Hardrada Nov 05 '12 at 01:42
  • @Hardrada, yes, sorry, I will edit my question; I will look into db4o alongside Infobright. Thanks. – Matt Nov 05 '12 at 01:44
  • 1
    Ok. Does your current state of mind disqualify object databases as a whole? or just db4o.. I'll just drop this one in there anyway: Eloquera.com – Hardrada Nov 05 '12 at 01:47
  • 1
    'because the data generated in different threads is generated at different times and is thus not aligned' Can you expand on this? Based on the information given, I don't understand the issue. Perhaps you could explain why you can't just insert your sample data into a normal database in parallel threads? – Nick.Mc Nov 05 '12 at 01:48
  • I am open to object databases; my concern here, however, is that they work well when writing/retrieving within the same platform. However, as stated, I would like to load the data into R and would need to check each individual ooDB as to whether adapters exist or need to be written, and whether it's worthwhile to invest the effort to write something if it does not yet exist. – Matt Nov 05 '12 at 01:49
  • @ElectricLlama, simple: performance constraints. Each run will write/persist to a fresh db file/table/... thus no inserts are needed. Show me a solution that lets me write 500k items per second with inserts; I have not come across too many. Sure, if the solution offers concurrent writes (most embedded solutions don't seem to) and if performance is up there, then I don't have concerns. – Matt Nov 05 '12 at 01:54
  • ...in case I forgot to mention: I do not have much experience with databases, thus please feel free to correct me if my concerns are unwarranted or my thoughts are unsupported by standard practice or empirical evidence. Great suggestions and thoughts so far. Thanks – Matt Nov 05 '12 at 01:58
  • @Hardrada, you brought up Eloquera; such object dbs seem to work well if I want to store a .Net collection such as a list or dictionary, but how will the types and collections be resolved if I wanted to access data inside the collections from within R, or Python for that matter? Also, the performance numbers did not impress me too much; I remember I tried RavenDB and attempted to store batches totalling 500k floats, and performance lagged severely in both writes and reads... – Matt Nov 05 '12 at 02:05
  • @Hardrada, support quote from the Eloquera forums: "We ran tests using your class definition (database version 4.60 RC4), we have used a laptop and localhost connection. In C/S mode we were able to retrieve 250K objects in about 40 sec." Even this claim sounds extremely slow and does not match my performance needs at all. But thanks for the suggestion. – Matt Nov 05 '12 at 02:11
  • 1
    Is the performance requirement of 100k/500k/1M items/sec a peak or sustained throughput? If sustained, then over what period would you expect that rate to be sustained? – RBarryYoung Nov 05 '12 at 02:22
  • @RBarryYoung, definitely peak, sustained a lot lower, roughly 10k-50k/second, and needless to say batch/bulk writes are strongly preferred. – Matt Nov 05 '12 at 02:44
  • 1
    @Freddy re:python. There is a python.NET. I don't know how tied to your use of python you are but I figured the more possibilities you have the better chance you have of reaching your goal. Also, with ODB's, the less complex the class definition the faster you get. There are config options to help tweak perf. Additionally, you might be limited by hardware as well. Faster CPU, faster memory I/O, faster disk I/O. Can't hurt to have a monster of a box running the logic. – Hardrada Nov 05 '12 at 03:46
  • 1
    @Freddy I just found another one that I'm going to throw at you. I've never seen it, used it, heard of it, but their perf test says 249k rec/s : http://velocitydb.com/Compare.aspx. I actually think I found something that I like better than db4o and EQ. :D – Hardrada Nov 05 '12 at 03:54
  • @Hardrada, thanks great suggestions, will dig into the ones you mentioned. Btw, a Python adapter is not as much a requirement as it is to get the data into R. Really appreciate your help and ideas. Let me read up on those dbs you mentioned. I will definitely report back my findings and decision but I like to check a bit further and wait for more ideas, suggestions, concerns by either you or others in the meantime. Thanks a lot – Matt Nov 05 '12 at 05:36
  • 1
    @Freddy ,just me throwing more stuff at ya: http://www.valentina-db.com/en/valentina-db-kernel-features/72-ultra-fast-database http://www.vertica.com/ – Hardrada Nov 08 '12 at 00:34
  • @Hardrada, thanks I just find it hard to find any sort of performance figures for the dbs you mentioned. I am currently implementing a few, starting with InfoBright in order to evaluate its performance in terms of write and read speeds. – Matt Nov 08 '12 at 05:23
  • Same here. I couldn't find anything but what they claimed in a single paragraph on their site. That being said, it at least gives you other possibilities. Good luck! – Hardrada Nov 08 '12 at 05:25
  • thanks, really appreciate it; if you know of a specific solution you recommend that meets all requirements, please feel free to write up an answer, as I opened a bounty worth 50 points. In any case your recommendations are of much value already and really help, thanks again – Matt Nov 08 '12 at 05:44
  • 1
    You seem to be on the verge of NoSQL, go for it, run some tests, check out redis, riak and the others, because you have to jump over hoops to solve the inherent problems of relational databases. – AlfredoVR Nov 08 '12 at 07:06
  • @alfa64, am working on it, dbs such as db4o are too slow, wrote 1 million objects within a list collection and it took 45 seconds, unacceptably slow. Looking at InfoBright right now but there does not seem to be much around in terms of C# code to get someone started quickly. – Matt Nov 08 '12 at 07:14
  • 1
    Can you serialize them to access them as key-value? give redis a shot, it's very fast – AlfredoVR Nov 08 '12 at 07:30
  • Have you asked this question on http://serverfault.com/? They might have some more insight on server related stuff than we developers here. – kor_ Nov 08 '12 at 09:07
  • @alfa64, testing redis now, yes I can serialize myself, but I use Booksleeve, I would strongly assume the serialization utilizes Protocol buffers ;-) – Matt Nov 09 '12 at 06:50
  • @kor_, I checked out serverfault, there does not seem to be much content over there, so I gave up quickly. – Matt Nov 09 '12 at 06:50

5 Answers


I can't comment -- low rep (I'm new here) -- so you get a full answer instead...

First, are you sure you need a database at all? If fast write speed and portability to R are your biggest concerns, then have you considered a flat-file mechanism? According to your comments you're willing to batch writes out, but you need persistence; if those were my requirements I'd write a straight-to-disk buffering system that was lightning fast, then build a separate task that periodically took the disk files and moved them into a data store for R, and that's only if R reading the flat files wasn't sufficient in the first place.

If you can do alignment after-the-fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.

So (in crappy pseudo_code), build a thread process that you'd call with backgroundworker or some such and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):

file_name = threadname + '0001.csv' // or something
open(file_name for writing)
while(generating_data) {
    generate_data()
    while (buffer_not_full and very_busy) {
        write_data_to_buffer
        generate_data()
    }
    flush_buffer_to_disk(file_name)
    if(file is big enough or enough time has passed or we're not too busy) {
        close(file_name)
        move(file_name to bob's folder)
        increment file_name
        open(file_name for writing)
    }
}
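
A rough C# sketch of the same idea, assuming a plain buffered StreamWriter per task/thread is acceptable; the class name, the 64 MB roll threshold, the file naming scheme and the "move to the pickup folder" step are all illustrative assumptions:

    using System;
    using System.Globalization;
    using System.IO;

    // One instance per task/thread. Rolls to a new file once enough bytes have been written,
    // so the separate loader process only ever sees finished, closed files.
    public sealed class RollingCsvWriter : IDisposable
    {
        private const long RollThresholdBytes = 64L * 1024 * 1024;   // arbitrary 64 MB per file
        private readonly string _threadName;
        private int _fileIndex;
        private long _bytesWritten;
        private StreamWriter _writer;

        public RollingCsvWriter(string threadName)
        {
            _threadName = threadName;
            OpenNext();
        }

        public void Write(long ticks, double value)
        {
            string line = ticks + "," + value.ToString(CultureInfo.InvariantCulture);
            _writer.WriteLine(line);                 // buffered in memory by the StreamWriter
            _bytesWritten += line.Length + 2;        // approximate byte count (line + newline)
            if (_bytesWritten > RollThresholdBytes)
                Roll();
        }

        private void OpenNext()
        {
            string fileName = string.Format("{0}_{1:D4}.csv", _threadName, ++_fileIndex);
            _writer = new StreamWriter(fileName, false, System.Text.Encoding.ASCII, 1 << 20);
            _bytesWritten = 0;
        }

        private void Roll()
        {
            _writer.Dispose();                       // flush and close the finished file
            // move the closed file to the loader's pickup folder here, then start a new one
            OpenNext();
        }

        public void Dispose()
        {
            _writer.Dispose();
        }
    }

Each worker thread just calls Write(ticks, value); the loader process only ever touches files that have been closed and moved.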

Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:

while(file_name in list of files in bob's folder sorted by date for good measure)
{
    read bob's file
    load bob's file to database
    align dates, make pretty
}

And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.

You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.

FIFO queues would work if you're on Unix, but you're not. :-)

Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.

What I'm saying is that choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware bound most of the time otherwise. Database tuning is an art, and while you can eke out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and that you're comfortable with. If you think in columnar fashion then look at R and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.

Chipmonkey
  • agree, upvoted, definitely the best suggestion so far; given speed is my highest priority, I will shortly post my own suggested answer, but only in order to keep public track of the databases that I have tested and why each has so far failed to meet my requirements. – Matt Nov 09 '12 at 06:52

This is not to answer my own question but to keep track of all the databases I have tested so far and why they have not (yet) met my requirements: each time I attempted to write 1 million single objects (1 long, 2 floats) to the database. For ooDBs, I stuck the objects into a collection and wrote the collection itself; similar story for key/value stores such as Redis, but I also attempted to write simple ints (1 mil) to columnar dbs such as InfoBright.

  • Db4o: awfully slow writes: 1 mil objects within a collection took about 45 seconds. I later optimized the collection structure and also wrote each object individually; not much love here.
  • InfoBright: Same thing, very slow in terms of write speed, which surprised me quite a bit as it organizes data in columnar format, but I think the "knowledge tree" only kicks in when querying data rather than when saving flat, table-like data structures.
  • Redis (through BookSleeve): Great API for .Net: full Redis functionality (though there are a couple of drawbacks to running the server on Windows machines vs. a Linux or Unix box). Performance was very fast... north of 1 million items per second. I serialized all objects using Protocol Buffers (protobuf-net; both libraries are written by Marc Gravell). I still need to play a lot more with the library, but R and Python both have full access to the Redis DB, which is a big plus. Love it so far. The async framework that Marc wrote around the Redis base functions is awesome, really neat, and it works so far. I want to spend a little more time experimenting with the Redis lists/collection types as well, as I have so far only serialized to byte arrays.
  • SQLite: I ran purely in-memory and managed to write 1 million value type elements in around 3 seconds. Not bad for a pure RDBMS; obviously the in-memory option really speeds things up. I only created one connection and one transaction, created one command and one parameter, and simply adjusted the value of the parameter within a loop and ran ExecuteNonQuery on each iteration. The transaction commit was then run outside the loop (a sketch of this pattern follows the list).
  • HDF5: Though there is a .Net port and there also exists a library to somehow work with HDF5 files out of R, I strongly discourage anyone from doing so. It's a pure nightmare. The .Net port is very badly written; heck, the whole HDF5 concept is more than questionable. It's a very old and, in my opinion, outgrown solution to store vectorized/columnar data. This is 2012, not 1995. If one cannot completely delete datasets and vectors out of the file in which they were previously stored, then I do not call that an annoyance but a major design flaw. The API in general (not just .Net) is very badly designed and written imho; there are tons of class objects that nobody, without having spent hours and hours studying the file structure, understands how to use. I think that is somewhat evidenced by the very sparse amount of documentation and example code that is out there. Furthermore, the h5r R library is a drama, an absolute nightmare. It is badly written as well (often the file is not correctly closed upon writing due to a faulty flush, which corrupts files), the library has issues even installing properly on 32-bit OSs... and it goes on and on. I write the most about HDF5 because I spent most of my time on this piece of .... and ended up with the most frustration. The idea of having a fast columnar file storage system accessible from R and .Net was enticing, but it just does not deliver what it promised in terms of API integration and usability, or lack thereof.
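
For reference, a minimal sketch of the SQLite pattern described in the bullet above: one connection, one transaction, one prepared command whose parameters are re-bound inside the loop, and a single commit outside the loop. This assumes the System.Data.SQLite ADO.NET provider; the table layout, parameter count and values are illustrative.

    using System.Data;
    using System.Data.SQLite;   // System.Data.SQLite ADO.NET provider

    using (var conn = new SQLiteConnection("Data Source=:memory:"))
    {
        conn.Open();
        using (var create = new SQLiteCommand("CREATE TABLE ticks (t INTEGER, v REAL)", conn))
            create.ExecuteNonQuery();

        using (var tx = conn.BeginTransaction())
        using (var cmd = new SQLiteCommand("INSERT INTO ticks (t, v) VALUES (@t, @v)", conn, tx))
        {
            SQLiteParameter pT = cmd.Parameters.Add("@t", DbType.Int64);
            SQLiteParameter pV = cmd.Parameters.Add("@v", DbType.Double);

            for (int i = 0; i < 1000000; i++)
            {
                pT.Value = 1000000000L + i;   // illustrative tick/value data
                pV.Value = 233.23 + i;
                cmd.ExecuteNonQuery();        // statement prepared once, parameters re-bound each pass
            }
            tx.Commit();                      // single commit outside the loop
        }
    }

Committing once at the end means SQLite does not have to sync the journal per row, which is a large part of why the loop is fast.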

Update: I ditched testing VelocityDB simply because there does not seem to be any adapter available to access the db from within R. I am currently contemplating writing my own GUI with a charting library, which would access the generated data either from a written binary file, or have it sent over a broker-less message bus (ZeroMQ), or sent through LockFree++ to an "actor" (my GUI). I could then call R from within C# and have results returned to my GUI. That would possibly allow me the most flexibility and freedom, but it would obviously also be the most tedious to code. I am running into more and more limitations during my tests, such that with each db test I befriend this idea more and more.

RESULT: Thanks for the participation. In the end I awarded the bounty points to Chipmonkey because he suggested, in part, what I considered important points toward the solution to my problem (though I chose my own, different solution in the end). I ended up with a hybrid between Redis in-memory storage and direct calls out of .Net to the R.dll. Redis allows access to its data, stored in memory, by different processes. This makes it a convenient solution to quickly store the data as key/value in Redis and then access the same data out of R. Additionally, I directly send data and invoke functions in R through its .dll and the excellent R.Net library. Passing a collection of 1 million value types to R takes about 2.3 seconds on my machine, which is fast enough given that I get the convenience to just pass in the data, invoke computational functions within R out of the .Net environment, and get the results back sync or async.
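
To illustrate that last step, a hedged sketch of handing a vector to R via R.NET; initialization details differ between R.NET versions (this assumes a build where REngine.GetInstance() locates R automatically), and the symbol name and R code are placeholders:

    using RDotNet;

    // Pass a block of values produced on the .Net side into R and call functions on it.
    REngine engine = REngine.GetInstance();

    double[] values = new double[1000000];                  // data generated by the .Net process
    NumericVector prices = engine.CreateNumericVector(values);
    engine.SetSymbol("prices", prices);                     // now visible inside R as `prices`

    // Run arbitrary R code against the data and pull the results back into .Net.
    NumericVector summary = engine.Evaluate("c(mean(prices), sd(prices))").AsNumeric();
    double mean = summary[0];
    double sd = summary[1];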

Matt
  • This turned into a pretty useful thread, actually. Any chance you're going to have code available to share when this is all done? I wouldn't mind seeing the nuts and bolts of what you eventually went with. Thanks for the bounty too. ;-) – Chipmonkey Nov 15 '12 at 14:33
  • 1
    @Chipmonkey, believe it or not but I decided in the end to utilize my own custom binary file database plus mmap/index libraries in R to get the data into R. It proved to be the absolute fastest and most stable way to achieve persistence, throughput, and connectivity. The index library is awesome in that it allows me to run extremely fast queries on the stored data, despite the data being stored in a columnar similar way. I also use BookSleeve/Redis for in-memory data sharing because I really like the concept of an in-memory repo which is accessible through tcp or pipes. – Matt Nov 23 '12 at 08:59

Just a note: I once had a similar problem posted by a fellow in a Delphi forum. I could help him with a simple ID-key-value database backend I wrote at that time (kind of a NoSQL engine). Basically, it uses a B-tree to store triplets (32-bit ObjectID, 32-bit PropertyKey, 64-bit Value). I managed to save about 500k values/sec in real time (about 5 years ago). Of course, the data was indexed on all three values (ID, property ID and value). You could optimize this by ignoring the value index.

The source I still have is in Delphi, but I would think about implementing something like that using C#. I cannot tell you whether it will meet your needs for performance, but if all else fails, give it a try. Using a buffered write should also drastically improve performance.
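
This is not the Delphi engine described above, just a rough C# illustration of the triplet layout, with an in-memory SortedDictionary standing in for a disk-backed B-tree; the class and method names are made up for the sketch:

    using System.Collections.Generic;
    using System.Linq;

    // Rough illustration of the (ObjectID, PropertyKey, Value) triplet idea.
    // The two 32-bit ids are packed into one 64-bit key so triplets stay ordered by (id, property);
    // a red-black tree (SortedDictionary) stands in for the disk-backed B-tree.
    public sealed class TripletStore
    {
        private readonly SortedDictionary<ulong, long> _index = new SortedDictionary<ulong, long>();

        private static ulong Key(uint objectId, uint propertyKey)
        {
            return ((ulong)objectId << 32) | propertyKey;
        }

        public void Put(uint objectId, uint propertyKey, long value)
        {
            _index[Key(objectId, propertyKey)] = value;
        }

        public long Get(uint objectId, uint propertyKey)
        {
            return _index[Key(objectId, propertyKey)];
        }

        // All properties of one object, in property-key order.
        public IEnumerable<KeyValuePair<uint, long>> PropertiesOf(uint objectId)
        {
            return _index.Where(kv => (uint)(kv.Key >> 32) == objectId)
                         .Select(kv => new KeyValuePair<uint, long>((uint)kv.Key, kv.Value));
        }
    }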

alzaimar

I would go with a combination of persistent storage (I personally prefer db4o, but you can use files as well, as mentioned above) and storing objects in memory this way:

use BlockingCollection<T> to store objects in memory (I believe you will achieve better performance than 1,000,000/s storing objects in memory), and then have one or more processing threads that consume the objects and store them in the persistent database

// Shared, thread-safe buffer (optionally bounded to cap memory use)
var blockingCollection = new BlockingCollection<MyObject>();

// Producing thread
for (int i = 0; i < 1000000; i++)
    blockingCollection.Add(myObject);

// Consuming threads
while (true)
{
    var myObject = blockingCollection.Take();    // blocks until an item is available
    db4oSession.Store(myObject);                 // or write it to files or whatever
}

BlockingCollection pretty much solves the producer-consumer workflow, and if you use multiple instances of it together with AddToAny/TakeFromAny you can reach almost any kind of multithreaded throughput.

Each consuming thread could have a different db4o session (file) to reach the desired performance (db4o is single-threaded).
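
A hedged sketch of that multi-collection variant; the partition count, bounded capacity and per-consumer sink are placeholders, and MyObject/myObject are the same stand-ins as in the snippet above:

    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading.Tasks;

    // Several bounded buffers; producers drop items into whichever has room,
    // and each consumer writes to its own sink (its own db4o session/file, flat file, etc.).
    var buffers = new[]
    {
        new BlockingCollection<MyObject>(boundedCapacity: 100000),
        new BlockingCollection<MyObject>(boundedCapacity: 100000)
    };

    // Producer side (called from any generating thread):
    BlockingCollection<MyObject>.AddToAny(buffers, myObject);

    // Consumer side: one long-running task per partition.
    Task[] consumers = Enumerable.Range(0, buffers.Length).Select(i => Task.Run(() =>
    {
        MyObject item;
        while (true)
        {
            BlockingCollection<MyObject>.TakeFromAny(buffers, out item);
            // store item into this consumer's own db4o session/file (or flat file)
        }
    })).ToArray();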

Tomas Panik

Since you want to use ZeroMQ, why not use memcache over Redis?
ZeroMQ offers no persistence as far as I know. Memcache also offers no persistence and is a bit faster than Redis.
Or perhaps the other way around: if you use Redis, why not use beanstalk MQ?
If you want to use Redis (for the persistence) you might want to switch from ZeroMQ to beanstalk MQ (also a fast in-memory queue, but one that also has persistence via logging). Beanstalk also has C# libs.

Adrian
  • I do not need persistence for the message bus. I want to use binary data within .Net and R (preferably from the same source/DB/file structure) and thus need a unified source to store such data. That's my requirement. I think Redis meets that requirement; I don't know whether MemCache offers a neat .Net port, but BookSleeve is an excellent API to connect to Redis out of .Net. – Matt Nov 15 '12 at 06:47