For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans). The detection program is a command-line tool. Run single-threaded on a single file on my local computer, it takes about 10 seconds. Processing results are written to a text file. A typical run would be:
- 10000 images at 300 MB each = 3 TB
- 10 seconds per image on a single core = 100000 seconds = about 28 hours
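For context, the current sequential baseline looks roughly like the sketch below. The tool name (`detect`), its output flag, the file extension and the paths are placeholders for the actual command line:

```
import subprocess
from pathlib import Path

SCAN_DIR = Path("/data/scans")      # placeholder path for the ~10000 volumes
RESULT_DIR = Path("/data/results")  # placeholder path for the per-scan result text files

# Sequential baseline: 10 s per file * 10000 files = 100000 s (about 28 h)
for scan in sorted(SCAN_DIR.glob("*.nii")):
    out = RESULT_DIR / (scan.stem + ".txt")
    # "detect" stands in for the actual detection command-line tool
    subprocess.run(["detect", str(scan), "-o", str(out)], check=True)
```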
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1 TB disks, CentOS Linux 7.2). Theoretically, the 480 cores should need only about 3.5 minutes for the task. I am considering using Hadoop, but it is not designed for processing binary data, and it splits input files, which is not an option. I probably need some kind of distributed file system. I tested NFS, and the network becomes a serious bottleneck. Each server should only process its locally stored files. The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain whether we need data locality, i.e. each node holding part of the data on a local disk and processing only its local data.
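For concreteness, the data-local approach I have in mind would amount to splitting the 10000 files evenly across the 20 nodes' local disks (about 500 files / 150 GB per node) and running something like the following sketch on every node. Again, the tool name, paths and file extension are placeholders:

```
import subprocess
from multiprocessing import Pool
from pathlib import Path

LOCAL_SCANS = Path("/local/scans")      # placeholder: this node's share (~500 files, ~150 GB)
LOCAL_RESULTS = Path("/local/results")  # placeholder: collected from all nodes afterwards

def detect_one(scan: Path) -> None:
    out = LOCAL_RESULTS / (scan.stem + ".txt")
    # "detect" stands in for the actual detection command-line tool
    subprocess.run(["detect", str(scan), "-o", str(out)], check=True)

if __name__ == "__main__":
    scans = sorted(LOCAL_SCANS.glob("*.nii"))
    # one worker per (virtual) core; 500 files * 10 s / 24 cores is roughly 3.5 minutes per node
    with Pool(processes=24) as pool:
        pool.map(detect_one, scans)
```

With 24 workers per node and 10 seconds per file, each node would finish its ~500 files in roughly 3.5 minutes, plus the time needed to distribute the inputs and collect the result files.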