I have a modest volume of logs (about 200 MB per day). I want to use certain data from these logs to build some statistics and show them through a web interface. After pre-processing these files I get 4-5 files like this one:

hadooper@ubuntu:/usr/local/hadoop$ du -h part-r-00000 
4.0K    part-r-00000

hadooper@ubuntu:/usr/local/hadoop$ cat part-r-00000 
201508042015    444335775
201508042020    563
201508042025    320787123
.....

I'm planning to store all of this for at least a year, maybe even longer. Not sure yet.

My question is: where would it be better to store and retrieve the data, in files or in a database?

I'm planning to use Rails as the backend. For now, storing everything in files seems like an OK option, but there might be drawbacks in the long term that I'm not aware of right now.

I'm sure there are a lot of experienced people who have solved similar tasks. I would much appreciate your thoughts and help.

Yuriy Vasylenko

1 Answer

If you only need to store the files, keep them as flat or zipped files, or load them into the database and then export a backup file from it. A backup produced by the database makes it easier to import the data again later.
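As a minimal sketch of the flat/zipped option, assuming the pre-processed files are plain text with two whitespace-separated columns as in the question, the lines could be archived and read back with Python's standard `gzip` module. The function names and the `part-r-00000.gz` file name are hypothetical, chosen only for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def archive_lines(lines, path):
    """Write 'key<TAB>value' text lines to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")


def read_archive(path):
    """Yield (key, value) pairs back from the gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                yield parts[0], int(parts[1])


# Sample rows copied from the question; the separator is assumed to be
# whitespace (tab or spaces).
sample = [
    "201508042015\t444335775",
    "201508042020\t563",
    "201508042025\t320787123",
]

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "part-r-00000.gz"  # hypothetical archive name
    archive_lines(sample, archive)
    restored = list(read_archive(archive))

print(restored)
```

Reading a range of values back this way means decompressing and scanning the whole file, which is fine for archival but is the cost the next paragraph is about.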

If you also need to query the data during all this time, store it in a database: querying a database is faster (because of indexes) and easier (because a query language with DDL, DML, etc. is available).
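A rough sketch of what that could look like, using Python's built-in `sqlite3` as a stand-in for whatever database you pick (it is not Postgres-specific). The table name `stats`, the column names, and the reading of the first column as a `YYYYMMDDHHMM` time bucket are my assumptions, not something stated in the question.

```python
import sqlite3

# Sample rows copied from the question.
SAMPLE = """201508042015    444335775
201508042020    563
201508042025    320787123
....."""


def parse(text):
    """Yield (bucket, value) pairs, skipping lines that do not match."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        yield parts[0], int(parts[1])


conn = sqlite3.connect(":memory:")
# The PRIMARY KEY gives an index on bucket, so range lookups do not
# have to scan every row.
conn.execute(
    "CREATE TABLE stats (bucket TEXT PRIMARY KEY, value INTEGER NOT NULL)"
)
# INSERT OR REPLACE keeps reloading the same file idempotent.
conn.executemany(
    "INSERT OR REPLACE INTO stats (bucket, value) VALUES (?, ?)",
    parse(SAMPLE),
)
conn.commit()

# Fixed-width YYYYMMDDHHMM strings sort in chronological order, so a
# plain BETWEEN on the text column selects a time range.
rows = conn.execute(
    "SELECT bucket, value FROM stats "
    "WHERE bucket BETWEEN ? AND ? ORDER BY bucket",
    ("201508042015", "201508042020"),
).fetchall()

print(rows)
```

The same range query against flat files would mean opening and scanning each file, which is the speed and convenience difference described above.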

If you are worried about security, encrypt your files, or encrypt the database and then export it.

Let me know if there is some case I forgot to address.

displayName
  • I will query that data all the time and use it to draw graphs, and I will insert data once every half hour. Your point about speed and simplicity seems fair to me. Something that in my opinion should be addressed here is dimensioning. I'm just curious how to calculate the space needed for the DB. – Yuriy Vasylenko Sep 29 '15 at 17:12
  • @YuraVasylenko: You mean dimensioning as in data warehousing? I think that is tacit... otherwise what good is a database without proper E-R modeling? As for space requirements, the best approximation is that they are *proportional to your incoming data volume.* – displayName Sep 29 '15 at 18:28
  • By dimensioning I mean the required space. So is it approximately the same as if I stored the data in files, give or take a bit? – Yuriy Vasylenko Sep 30 '15 at 14:17
  • @YuraVasylenko: I'm tempted to say that if you leave out the space needed to install the database, both methods will use approximately the same space, but I'm not sure about it. I have no link to any authoritative article on this. However, I think you will find this answer pretty useful: http://stackoverflow.com/questions/2356851/database-vs-flat-files – displayName Sep 30 '15 at 14:24
  • @YuraVasylenko: Actually, a database is likely to take more space than flat zipped files. – displayName Sep 30 '15 at 14:27
  • Sure, but as you mentioned above, with files it will take more time to retrieve that data and send it to the client. Furthermore, data in a DBMS is more manageable because there are a lot of ready-to-use tools for working with it. So I accept your answer, as I have already started writing scripts to push the data into Postgres. Thanks! – Yuriy Vasylenko Sep 30 '15 at 14:46
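For the space question raised in the comments, a back-of-the-envelope estimate of the flat-file size can be sketched as below. Every input is an assumption: 5-minute buckets (inferred from the spacing of the sample rows), 5 files (the upper end of "4-5"), and roughly 26 bytes per line (12-character key, separator, up to 9 digits, newline). A database adds per-row and index overhead on top of this, which is not estimated here.

```python
# All three inputs are assumptions; adjust them to the real data.
BUCKET_MINUTES = 5    # spacing seen in the sample rows
FILES = 5             # "4-5 files" in the question, upper end
BYTES_PER_LINE = 26   # rough size of one "key    value\n" line

rows_per_day = 24 * 60 // BUCKET_MINUTES   # rows per file per day
rows_per_year = rows_per_day * 365         # rows per file per year
total_rows = rows_per_year * FILES         # rows across all files
flat_bytes = total_rows * BYTES_PER_LINE   # uncompressed text size

print(f"rows per file per year: {rows_per_year}")
print(f"rows per year, all files: {total_rows}")
print(f"flat text per year: {flat_bytes / 1024 / 1024:.1f} MiB")
```

Under these assumptions a year of aggregated data is on the order of megabytes either way, so the size difference between files and a database is unlikely to be the deciding factor.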