Database for analytics and map/reduce

Question

I have multiple threads in my app, each generating log files based on the work it performs. They typically run multiple iterations over multiple days and generate close to 15-20 GB of data. I extract specific fields from the logs of each iteration and store them along with the log.

I need to perform data analysis on these fields, and I may extract more data from the raw logs in the future. I find myself writing more and more code to manage these files, to do analysis such as summation, averaging, min, and max, and to generate reports based on the results. I am also writing code to make sure the data generated by the threads is stored properly in files. Is it possible to abstract away some of these problems by using an appropriate database?

Is there a database that would meet the following requirements?

Document based
Allows me to do data analysis like summation, min, max, average, consolidation based on specific fields etc.
Allows extraction of new data from the log files.
I don't need high-performance writes or reads; as you can see, it takes days to generate 20 GB of data.
I might be running multiple such applications in parallel, and they would all access the same database.
I would also like to do joins.
I am working in C#/.NET.
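
To illustrate the kind of consolidation I currently write by hand, here is a minimal sketch. It is in Python only for brevity (the real application is C#), and the record layout, field names, and values are made up for the example; the `summarize` helper is hypothetical, not part of my actual code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical extracted records: one dict per test iteration.
# Field names and values are illustrative assumptions only.
records = [
    {"device": "A", "iteration": 1, "time_taken": 12.0, "temperature": 40.0},
    {"device": "A", "iteration": 2, "time_taken": 14.0, "temperature": 42.0},
    {"device": "B", "iteration": 1, "time_taken": 10.0, "temperature": 38.0},
]


def summarize(records, group_key, field):
    """Group records by group_key and compute sum/min/max/avg of field."""
    groups = defaultdict(list)
    for rec in records:
        # The set of extracted fields changes over time, so a field
        # may be missing from older records; skip those.
        if field in rec:
            groups[rec[group_key]].append(rec[field])
    return {
        key: {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
        }
        for key, values in groups.items()
    }


summary = summarize(records, "device", "time_taken")
```

Ideally the database would let me express this grouping and aggregation as a query instead of maintaining code like this for every report.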

I came across RethinkDB, which looked like the solution I wanted, but it turns out that it is not yet production ready and is supported only on Linux.

Thanks...

Just to clarify: do you want to read the documents on the fly, and what type of documents are you trying to read? My first thought was to use a relational database to store the data and use something like the Lucene Project to gather the data you are looking for. - tsells, Jan 12 '13 at 17:04
@tsells I am running a test on the target device for thousands of iterations. On every iteration, the text log from the device is collected and stored. I also extract specific values from the same log, such as time taken and temperature, and store them as key-value pairs along with the log. The same test is being performed on multiple devices in parallel. From what I can see, Lucene seems to be a search engine, so I am not sure how it would help me. Moreover, the kinds of fields I am extracting change every other month. Is using a relational DB a good idea? - Manoj, Jan 13 '13 at 08:54
How are you storing the text log? Are you storing it as binary or as actual text? - tsells, Jan 14 '13 at 17:15
@tsells I am storing it as text. I am using XML format to store the extracted values, and there is a tag called which has the log in text. I chose XML for no particular reason - just to organise the extracted fields. - Manoj, Jan 16 '13 at 18:05
