I’ve read a few csv vs database debates, and in many cases people recommended the database solution over csv. However, none of them had exactly the same setup as mine.
So here is the setup:

- Every hour around 50 csv files are generated, each representing a performance group from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily for certain counters and nodes) - this part should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes) based on a variable time period, host and counter
- In total around 100MB a day (across all 50 files)
Possible solutions?
1) Keep it in csv

- Create a master csv file for each performance group and every hour just append the latest csv file to it
- Generate my reports using scripts with shell commands (grep, sed, cut, awk) - see the first sketch below
2) Load it into a database (e.g. MySQL)

- Create tables mirroring the performance groups and load the csv files into them
- Generate my reports using SQL queries - see the second sketch below
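For option 1, this is roughly what I have in mind (the cpu group, the file names and the column layout are just illustrative examples, not my real format):

```
#!/bin/sh
# Option 1 sketch. Assume each hourly file is named like cpu_2024-05-01_13.csv
# and starts with a header row such as:
#   timestamp,host,counter_1,counter_2,...,counter_N
# (group name, file names and columns are illustrative only)

GROUP=cpu                           # one master file per performance group
HOURLY=${GROUP}_$(date +%F_%H).csv  # latest hourly file
MASTER=${GROUP}_master.csv          # master file, kept without a header row

# Append the latest hourly file to the master, skipping its header row
tail -n +2 "$HOURLY" >> "$MASTER"

# Ad-hoc extraction: counter_2 for one host on one day
# (host is column 2 and counter_2 is column 4 in this illustrative layout)
awk -F',' '$1 ~ /^2024-05-01/ && $2 == "host042" {print $1, $4}' "$MASTER"
```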
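And for option 2, something along these lines (the perfdata database, the table and the column names are again just placeholders; LOAD DATA LOCAL INFILE also needs local_infile enabled on both client and server):

```
#!/bin/sh
# Option 2 sketch with MySQL, using the same illustrative cpu group
# and column layout as above.

GROUP=cpu
HOURLY=${GROUP}_$(date +%F_%H).csv

mysql --local-infile=1 perfdata <<SQL
-- One table per performance group, mirroring the csv columns
CREATE TABLE IF NOT EXISTS ${GROUP} (
    ts        DATETIME,
    host      VARCHAR(64),
    counter_1 DOUBLE,
    counter_2 DOUBLE,
    -- ... one column per counter, 100+ for the widest groups
    INDEX (ts, host)
);

-- Load the latest hourly file, skipping its header row
LOAD DATA LOCAL INFILE '${HOURLY}'
INTO TABLE ${GROUP}
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

-- Example of a predefined daily report
SELECT host, AVG(counter_2)
FROM ${GROUP}
WHERE ts >= '2024-05-01' AND ts < '2024-05-02'
GROUP BY host;
SQL
```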
When I simulated option 1 using just shell commands on the csv files, it was very fast. I worry that database queries would be slower (considering the amount of data). I also know that databases don’t like very wide tables; in my scenario some tables would need 100+ columns. The data will be read-only most of the time (only appending new files). I’d like to keep data for a year, so it would be around 36GB in total. Would the database solution still perform ok on a 1-2 core VM with 2-4GB of memory? I haven’t simulated the database solution, which is why I’d like to ask whether you have any views or experience with a similar scenario.