I am working on a project that has to manage a massive number of data files. I have a map divided into square grid cells at a resolution of roughly 900 by 700, which comes to about 630,000 cells. Each cell has an associated data file holding local weather records dating back to 1900. These files are in CSV format with two columns: a non-zero decimal value and the associated date. Each file therefore has a different number of rows, and file sizes range from nearly 0 KB up to about 1 MB.
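For illustration, a single cell's file looks something like this (the values and dates here are made up, and the exact column order and date format may differ in my actual files):

```
value,date
3.25,1900-01-01
1.70,1900-01-02
0.48,1900-01-03
```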
Currently, all 630,000+ files are kept in a single folder, which has grown to about 260 GB. As new data records come in, I need to update every one of these files. I am looking for ways to optimize the current setup and to develop tools that automate future data updates and fetching; a rough sketch of what a single update pass involves is below.
My question is: are there ways to optimize how this data is stored? There doesn't seem to be much redundancy in the data itself, but I'm having trouble even getting one year's worth of data into memory to do some shoveling. I hope someone can shed some light on how this kind of data should be stored and managed on disk in a workplace setting.