For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans). The detection program is a command-line tool. Run single-threaded on a single file on my local computer, it takes about 10 seconds. Processing results are written to a text file. A typical run would be:
- 10000 images at 300 MB each = 3 TB
- 10 seconds per image on a single core = 100000 seconds = about 28 hours
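For context, the current sequential baseline looks roughly like the sketch below. The tool name (`detect`), its output flag, the file extension and the paths are placeholders for the actual command line:

```
import subprocess
from pathlib import Path

SCAN_DIR = Path("/data/scans")      # placeholder path for the ~10000 volumes
RESULT_DIR = Path("/data/results")  # placeholder path for the per-scan result text files

# Sequential baseline: 10 s per file * 10000 files = 100000 s (about 28 h)
for scan in sorted(SCAN_DIR.glob("*.nii")):
    out = RESULT_DIR / (scan.stem + ".txt")
    # "detect" stands in for the actual detection command-line tool
    subprocess.run(["detect", str(scan), "-o", str(out)], check=True)
```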
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1 TB disks, CentOS Linux 7.2). Theoretically, the 480 cores should need only about 3.5 minutes for the task. I am considering using Hadoop, but it is not designed for processing binary data, and it splits input files, which is not an option. I probably need some kind of distributed file system. I tested NFS, and the network becomes a serious bottleneck. Each server should only process its locally stored files. The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain whether we need data locality, i.e. each node holding part of the data on a local disk and processing only its local data.
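For concreteness, the data-local approach I have in mind would amount to splitting the 10000 files evenly across the 20 nodes' local disks (about 500 files / 150 GB per node) and running something like the following sketch on every node. Again, the tool name, paths and file extension are placeholders:

```
import subprocess
from multiprocessing import Pool
from pathlib import Path

LOCAL_SCANS = Path("/local/scans")      # placeholder: this node's share (~500 files, ~150 GB)
LOCAL_RESULTS = Path("/local/results")  # placeholder: collected from all nodes afterwards

def detect_one(scan: Path) -> None:
    out = LOCAL_RESULTS / (scan.stem + ".txt")
    # "detect" stands in for the actual detection command-line tool
    subprocess.run(["detect", str(scan), "-o", str(out)], check=True)

if __name__ == "__main__":
    scans = sorted(LOCAL_SCANS.glob("*.nii"))
    # one worker per (virtual) core; 500 files * 10 s / 24 cores is roughly 3.5 minutes per node
    with Pool(processes=24) as pool:
        pool.map(detect_one, scans)
```

With 24 workers per node and 10 seconds per file, each node would finish its ~500 files in roughly 3.5 minutes, plus the time needed to distribute the inputs and collect the result files.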