
I am working on a project that manages a massive number of data files. I have a map divided into square grid cells at a resolution of roughly 900 by 700, which comes to about 630,000 squares. Each square has an associated data file with local weather data going back to 1900. These data files are in CSV format with two columns: a non-zero decimal value and the associated date. Each file therefore has a different number of rows, and file sizes range from 0 KB to 1 MB.

The current situation is that all 630,000+ files are kept in one folder, and as new data records come in I need to update every one of these files. The folder currently sits at 260 GB. I am working on ways to optimize the current setup and to develop tools that automate future data updates and retrieval.

My question is: are there ways to optimize how the data is currently stored? There doesn't seem to be much redundancy in the data itself. I'm having trouble even getting one year's worth of data into memory to do some shuffling around. I hope someone can shed some light on how this kind of data should be stored and managed on a hard drive in a workplace.

Krish
uqji
  • Assuming NTFS, this is worth reading: http://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories – Steve Feb 03 '16 at 08:28
  • Well, I think that's a job for big-data patterns. You should use specific tools to deal with that amount of data (maybe go the NoSQL way, preprocess all the data...). But that's too broad for a simple answer here. – Pikoh Feb 03 '16 at 08:31
  • At the very least, create a subdirectory structure and move the files into it, arranged by year, etc. 630,000 files in one folder is a very bad case. – i486 Feb 03 '16 at 08:33
  • There is an ancient Microsoft project (1998), Terraserver, which may be of interest to you: http://research.microsoft.com/apps/pubs/default.aspx?id=64154. Obviously the scale is different (I would guess 10-100x compared to yours), but you may get some ideas. – Alexei Levenkov Feb 03 '16 at 08:35
  • It is not clear whether your problem lies in reading all the data into memory or in accessing the files one by one. Have you measured the time required to open/read a single file? In any case, I suggest rearranging your files so that you have folders by year, and inside each year's folder, subfolders for each row (700 subfolders) containing the files with only the data for that row and that year (a sketch of this idea follows these comments). An algorithm to reach the correct year/row should be relatively easy to implement. – Steve Feb 03 '16 at 08:36
  • The amount of data you need to load into memory (RAM) will probably not change (unless you can modify the software you are using), but you may be able to decrease disk storage if you store the data in a database. This will depend on the disk's block size and the individual file sizes. Open the folder's *Properties* dialog and compare the values for *Size* and *Size on disk*. If there are many small files, *Size* will be much smaller than 260 GB. Also, the time needed to load data into memory, or to append new data, may be less when using a database instead of individual files. – Berend Feb 03 '16 at 08:44
  • How you optimize it really depends on how the data is used. Is the data being searched mostly by date, by square, or by the decimal value? Is making the best use of disk space a priority? Speed of access? Minimizing memory usage? All of these have to be considered before you change how it is stored. For most situations a database would be better, but in certain scenarios it wouldn't. For example, if you read all the data one day at a time, you could store each day's data as a binary file. There are too many possible answers without knowing how the data will be used. – David Wilson Feb 03 '16 at 11:16
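
Picking up on the folder-by-year/row suggestion above, here is a minimal Python sketch of that reorganization. The source and target paths, the "row_col.csv" naming scheme, the ';' separator, and the dd.mm.yyyy date format are all illustrative assumptions, not the asker's actual layout.

    import csv
    import os
    from collections import defaultdict

    SRC = r"D:\weather_flat"      # assumed location of the flat folder of 630,000 files
    DST = r"D:\weather_by_year"   # target root for the year/row layout

    for name in os.listdir(SRC):
        if not name.endswith(".csv"):
            continue
        row, col = os.path.splitext(name)[0].split("_")      # assumed "row_col.csv" naming
        by_year = defaultdict(list)
        with open(os.path.join(SRC, name), newline="") as f:
            for rec in csv.reader(f, delimiter=";"):
                if len(rec) != 2:
                    continue                                  # skip blank/malformed lines
                by_year[rec[1].split(".")[-1]].append(rec)    # rec = [value, date], dd.mm.yyyy assumed
        for year, rows in by_year.items():
            out_dir = os.path.join(DST, year, row)
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, col + ".csv"), "a", newline="") as out:
                csv.writer(out, delimiter=";").writerows(rows)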

1 Answer


To me it seems that these data files contain data in the form of:

1;20.09.1983
17;16.05.1985
.
.
.

So ... let's make an assumption:

You have 1,000,000 image tiles, for each of which you have to store between 1 and 100,000 entries. Let's make that 30,000 records on average for each of the 1,000,000 image tiles. That's a total of 30,000,000,000 rows (30 billion).

Let's partition that data.

That's 30 billion rows for the entire date span. You could use different databases, different servers, or just different tables to store that data. Your job is to talk to your customer and work out what the business case is. Is the business case to show only data from one year? Partition by year. Is it to show only data from a specific region? Partition by geolocation.
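
As a rough illustration of the partition-by-year idea, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in; with MS SQL you would use partitioned tables (or one table per year) instead. The table and column names are assumptions for the example.

    import sqlite3

    conn = sqlite3.connect("weather_partitioned.db")

    def table_for(year: int) -> str:
        """Create (if needed) and return the name of the partition table for a year."""
        name = f"readings_{year}"
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} ("
            "  tile_id      INTEGER NOT NULL,"   # which of the ~630,000 squares
            "  reading_date TEXT    NOT NULL,"   # ISO date, e.g. '1983-09-20'
            "  value        REAL    NOT NULL)"
        )
        return name

    def insert(tile_id: int, reading_date: str, value: float) -> None:
        """Route one record into the table for its year."""
        year = int(reading_date[:4])
        conn.execute(
            f"INSERT INTO {table_for(year)} VALUES (?, ?, ?)",
            (tile_id, reading_date, value),
        )

    insert(42, "1983-09-20", 1.0)   # hypothetical record for tile 42
    conn.commit()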

In conclusion, I would put the CSV data in a database so that you can run queries on it. Give the image tiles a unique name and link to that name from the database.
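
Here is a hedged sketch of that idea, again with sqlite3 standing in for MS SQL: one table names the tiles, another holds the readings and references the tile. The file path, the file-name-as-tile-name convention, and the ';' separator are assumptions based on the sample rows above.

    import csv
    import os
    import sqlite3

    conn = sqlite3.connect("weather.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS tile (
        tile_id   INTEGER PRIMARY KEY,
        tile_name TEXT UNIQUE NOT NULL          -- e.g. the tile's file name
    );
    CREATE TABLE IF NOT EXISTS reading (
        tile_id      INTEGER NOT NULL REFERENCES tile(tile_id),
        reading_date TEXT    NOT NULL,          -- dd.mm.yyyy as stored in the CSVs
        value        REAL    NOT NULL
    );
    CREATE INDEX IF NOT EXISTS ix_reading_tile ON reading(tile_id, reading_date);
    """)

    def load_tile(csv_path: str) -> None:
        """Register one tile under its file name and bulk-insert its rows."""
        tile_name = os.path.splitext(os.path.basename(csv_path))[0]
        conn.execute("INSERT OR IGNORE INTO tile(tile_name) VALUES (?)", (tile_name,))
        tile_id = conn.execute(
            "SELECT tile_id FROM tile WHERE tile_name = ?", (tile_name,)
        ).fetchone()[0]
        with open(csv_path, newline="") as f:
            rows = [(tile_id, rec[1], float(rec[0]))
                    for rec in csv.reader(f, delimiter=";") if len(rec) == 2]
        conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)
        conn.commit()

    load_tile(r"D:\weather_flat\450_312.csv")   # hypothetical tile file

Once the data is in one place like this, adding a day's new readings becomes a bulk insert into one table instead of appending to 630,000 separate files.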

There is no need for NoSQL storage; MS SQL handles pretty large amounts of data.

Stephan Schinkel