
I’ve read a few CSV vs. database debates, and in many cases people recommended the database solution over CSV. However, it has never been exactly the same setup as mine.

So here is the setup:

- Every hour, around 50 CSV files are generated, representing performance groups from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily for certain counters and nodes); this should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes), based on a variable time period, host, and counter
- In total, around 100 MB a day (across all 50 files)

Possible solutions?

1) Keep it in CSV

- Create a master CSV file for each performance group and, every hour, just append the latest CSV file
- Generate my reports using scripts with shell commands (grep, sed, cut, awk)
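For illustration, something like this minimal sketch is what I mean; the file names, column positions and timestamp format are just placeholders:

```sh
# Append the latest hourly file to the per-group master CSV,
# skipping the header row (assumes row 1 is a header).
tail -n +2 cpu_group_2015-03-16_14.csv >> cpu_group_master.csv

# Ad-hoc extract: one host over a one-day window, assuming
# column 1 = timestamp (ISO format) and column 2 = hostname.
awk -F',' '$2 == "host042" && $1 >= "2015-03-15 00:00" && $1 < "2015-03-16 00:00"' \
    cpu_group_master.csv > host042_daily.csv

# Daily report: keep only the timestamp, host and two counter columns (5 and 17).
cut -d',' -f1,2,5,17 host042_daily.csv
```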

2) Load it into a database (e.g. MySQL)

- Create tables mirroring the performance groups and load those CSV files into the tables
- Generate my reports using SQL queries
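For option 2 I imagine something like the rough sketch below (table and column names are placeholders; I haven’t actually tried this yet):

```sql
-- One table per performance group; the counter columns are placeholders.
CREATE TABLE cpu_group (
    ts        DATETIME    NOT NULL,
    host      VARCHAR(64) NOT NULL,
    counter_a DOUBLE,
    counter_b DOUBLE,
    -- ... remaining counters for this group ...
    PRIMARY KEY (host, ts)
);

-- Hourly load of the latest CSV file (header row skipped).
LOAD DATA INFILE '/data/cpu_group_2015-03-16_14.csv'
INTO TABLE cpu_group
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

-- Example predefined daily report for selected counters and hosts.
SELECT host, DATE(ts) AS day, AVG(counter_a) AS avg_a, MAX(counter_b) AS max_b
FROM cpu_group
WHERE ts >= '2015-03-15' AND ts < '2015-03-16'
  AND host IN ('host042', 'host043')
GROUP BY host, DATE(ts);
```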

When I tried a simulation using just shell commands on the CSV files, it was very fast. I worry that database queries would be slower (considering the amount of data). I also know that databases don’t like too wide tables; in my scenario I would need 100+ columns in some cases. It will be read-only most of the time (only appending new files). I’d like to keep the data for a year, so it would be around 36 GB. Would the database solution still perform OK (a 1-2 core VM with 2-4 GB of memory is expected)? I haven’t simulated the database solution, which is why I’d like to ask whether you have any views/experience with a similar scenario.

  • `I also know that databases don’t like too wide tables` Who told you that? Most popular engines can handle [even absurd numbers of columns](http://stackoverflow.com/a/3184536/2186023) in one table. A database would probably really shine performance-wise as soon as you have multiple users accessing the same data, but in your case the distribution seems quite centrally organized, so I would only opt to change to a database if you expect column (name/type) changes in the future; otherwise, **why change a running system?** – DrCopyPaste Mar 16 '15 at 16:00
  • Thanks for that. I read that when doing my research. Some people say it's OK; some recommend avoiding wide tables. Our DBA also said 100+ is too many ... I checked MySQL's limits and I'm still far away from them. And there is no running system at the moment; I just receive CSV files and append them to a master CSV. No reporting functionality is in place. Yes, I guess sooner or later new counters will be introduced, so with a DB it should be easier to change. – qwertyuser Mar 16 '15 at 16:56
  • You will really appreciate having the data in a database when you get ad hoc requests. I would not worry about "lots of columns" except it might be an indicator that simply mapping the CSV structure to a database table is probably not the optimum schema and you should look carefully at a DB design that is a bit different than the CSV flat table. – Jim Mc Mar 19 '15 at 03:52
  • Depending on the data in those columns, it might be better to split the data into more than one table. For example, if the data were 20 basic bits of info about the server and then 80 columns, each with a count for a different performance group, then each of those performance groups could instead be a row in a separate table joined to the parent. Also, the big advantage of a database is when requirements change and span a wider time scale than your CSV files are built to cope with. – Kickstart Mar 19 '15 at 09:59
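A minimal sketch of the narrower layout suggested in the last comment (table and column names are placeholders): each counter becomes a row rather than a column, so new counters need no schema change.

```sql
-- One row per (host, timestamp, counter) instead of one column per counter.
CREATE TABLE counter_value (
    ts      DATETIME    NOT NULL,
    host    VARCHAR(64) NOT NULL,
    counter VARCHAR(64) NOT NULL,   -- e.g. 'cpu_user', 'disk_reads'
    value   DOUBLE      NOT NULL,
    PRIMARY KEY (host, counter, ts)
);

-- Ad-hoc query: one counter for one host over an arbitrary period.
SELECT ts, value
FROM counter_value
WHERE host = 'host042'
  AND counter = 'cpu_user'
  AND ts >= '2015-03-01' AND ts < '2015-03-16';
```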
