I will likely be involved in a project where an important component is a storage for a large number of files (in this case images, but it should just act as a file storage).
The number of incoming files is expected to be around 500,000 per week (averaging around 100 KB each), peaking at around 100,000 files per day and 5 files per second. The total number of files is expected to reach tens of millions before settling into an equilibrium where files are expired, for various reasons, at the same rate they arrive.
So I need a system that can store around 5 files per second at peak hours, while reading around 4 and deleting around 4 files per second at any time.
My initial idea is that a plain NTFS file system with a simple service for storing, expiring and reading should actually be sufficient. I could imagine the service creating sub-folders for each year, month, day and hour to keep the number of files per folder at a minimum and to allow manual expiration in case that should be needed.
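To make that idea concrete, here is a minimal sketch of how such a service might derive the target folder from a timestamp. The names (`STORAGE_ROOT`, `store_file`) and the drive path are placeholders, not part of any actual design; given the peak of roughly 100,000 files per day, an hourly bucket would hold on the order of 4,000-5,000 files.

```python
import os
import shutil
from datetime import datetime

STORAGE_ROOT = r"D:\imagestore"  # hypothetical root folder on the NTFS volume

def target_folder(timestamp: datetime) -> str:
    """Build a year/month/day/hour path so each folder stays small
    (at ~100,000 files/day peak, roughly 4,000-5,000 files per hour)."""
    return os.path.join(
        STORAGE_ROOT,
        f"{timestamp:%Y}", f"{timestamp:%m}", f"{timestamp:%d}", f"{timestamp:%H}",
    )

def store_file(source_path: str, timestamp: datetime) -> str:
    """Move an incoming file into its time-bucketed folder and return the new path."""
    folder = target_folder(timestamp)
    os.makedirs(folder, exist_ok=True)
    destination = os.path.join(folder, os.path.basename(source_path))
    shutil.move(source_path, destination)
    return destination
```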
A large NTFS solution has been discussed here, but I could still use some advice on what problems to expect when building a store with these specifications, what maintenance issues to expect, and what alternatives exist. Preferably I would like to avoid a distributed storage system, if possible and practical.
Edit:
Thanks for all the comments and suggestions. Some more bonus info about the project:
This is not a web-application where images are supplied by end-users. Without disclosing too much, since this is in the contract phase, it's more in the category of quality control. Think production plant with conveyor belt and sensors. It's not traditional quality control since the value of the product is entirely dependent on the image and metadata database working smoothly.
The images are accessed 99% of the time by an autonomous application in first-in, first-out order, but random access by a user application will also occur. Images older than a day will mainly serve archive purposes, though that purpose is also very important.
Expiration of the images follows complex rules for various reasons, but at some date all images should be deleted. Deletion rules follow business logic dependent on metadata and user interactions.
There will be downtime each day, where maintenance can be performed.
Preferably the file storage will not have to communicate image locations back to the metadata server. Image location should be uniquely deducible from metadata, possibly through a mapping database if some kind of hashing or distributed system is chosen. One possible approach is sketched below.
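As a hedged sketch of the "location deducible from metadata" idea: hash a stable metadata key and use a fixed prefix of the digest as the folder path, so both the storing service and any reader can compute the location independently. The key name `image_id`, the `.jpg` extension, and the two-level fan-out are assumptions for illustration, not part of the spec.

```python
import hashlib
import os

STORAGE_ROOT = r"D:\imagestore"  # hypothetical root, as above

def path_from_metadata(image_id: str) -> str:
    """Derive a deterministic location from a stable metadata key.
    Two hex levels of 256 buckets each spread ~100 million files over
    ~65,536 folders, i.e. roughly 1,500 files per folder."""
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], f"{image_id}.jpg")
```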
So my questions are:
- Which technologies will do a robust job?
- Which technologies will have the lowest implementing costs?
- Which technologies will be easiest to maintain by the client's IT-department?
- What risks are there for a given technology at this scale (5-20 TB of data, 10-100 million files)?