
I have a framework in which I need to read, say, 10000 files (each approximately 10 MB), process them, and write another 10000 files. In the next step I read these 10000 files, do some processing on all of them, and write some more files to disk. This entire process happens several times.

Question: Is there an efficient way of storing these files in contiguous locations to save read/write time, something like tar? I don't need much compression; I prefer speed. If I use tar, is there a way to index (hash) these 10000 files so that I can read any particular file in O(1) time?
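A minimal sketch of one option, assuming Python: the standard-library `zipfile` module with `ZIP_STORED` keeps members uncompressed and contiguous in a single archive, and the central directory gives name-based random access without scanning the whole file. The file names below are placeholders.

```python
import zipfile

# Pack one stage's intermediate files into a single uncompressed archive.
# (Hypothetical file names; ZIP64 is used automatically for archives > 4 GB in Python 3.)
with zipfile.ZipFile("stage1.zip", "w", compression=zipfile.ZIP_STORED) as archive:
    for i in range(10000):
        archive.write(f"output_{i}.dat", arcname=f"output_{i}.dat")

# Later: random access to a single member by name.
# ZipFile builds an in-memory name-to-header map from the central directory,
# so the lookup is O(1) and only the requested member's bytes are read from disk.
with zipfile.ZipFile("stage1.zip", "r") as archive:
    data = archive.read("output_42.dat")
```

Plain tar, by contrast, has no central index, so finding one member generally means scanning headers sequentially unless you maintain a separate offset index yourself.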

  • Are these files independent during the processing? Are you interested only in the final result (after 7 iterations), or are intermediate results valuable too? – dbf Aug 10 '15 at 19:44
  • If you do not yet have a performance problem, don't try to fix it. Filesystems are pretty good. For many, many files it's a good idea to tune the filesystem by increasing its inode count. An alternative is a NoSQL database such as MongoDB (see the sketch after these comments). For some information about that see http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb and http://stackoverflow.com/questions/15030532/mongodb-as-file-storage. –  Aug 10 '15 at 19:48
  • In the entire pipeline of the framework, sometimes the files can be processed independently. But at a later stage, I need to read all files into memory and do processing. In the initial stages, the intermediate files are also important. – Santosh Aug 10 '15 at 19:49
  • Do you have a machine with 100+ gigs of RAM? If not, this will be incredibly slow. – TigerhawkT3 Aug 10 '15 at 19:51
  • I have a machine with 64GB of RAM and 12 cores. I use all of them. – Santosh Aug 10 '15 at 19:53
  • 10000 files at 10MB each is 100000MB, or 100GB. If your program attempts to hold all those files in memory at the same time, it will use virtual memory, meaning that it will use your hard drive as extra memory and become very slow. – TigerhawkT3 Aug 10 '15 at 19:56
  • You're right. Fortunately, until now all the data summed up to around 46 GB. In the near future it might cross 100 GB and I will have to buy a cluster or something. – Santosh Aug 10 '15 at 20:01
  • 1
    [performance](http://stackoverflow.com/q/26178038/4279) [questions](http://stackoverflow.com/q/11227809/4279) may surprise you. Implement the most straightforward solution that could possibly work (peek a reasonable for your task algorithm, estimate whether your task is I/O or CPU bound and choose the overall design accordingly). Measure it and only then consider whether you need to optimize. – jfs Aug 10 '15 at 20:26
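For the MongoDB route mentioned in the comments, a minimal hedged sketch using pymongo's GridFS (the database and file names are placeholders, and a local `mongod` instance is assumed):

```python
import gridfs
from pymongo import MongoClient

client = MongoClient()        # assumes mongod running on localhost
db = client["pipeline_db"]    # hypothetical database name
fs = gridfs.GridFS(db)

# Store one intermediate file; GridFS splits it into chunks (255 kB by default).
with open("output_42.dat", "rb") as f:
    file_id = fs.put(f, filename="output_42.dat")

# Retrieve it later, either by the returned ObjectId or by filename.
data = fs.get_last_version("output_42.dat").read()
```

Whether this beats a well-tuned filesystem or a plain uncompressed archive depends on access patterns; as the comments suggest, measure first.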

0 Answers