
I have a 50TB set of ~1GB TIFF images that I need to run the same algorithm on. Currently, I have the rectification process written in C++, and it works well; however, it would take forever to run on all these images sequentially. I understand that an implementation of MapReduce/Spark could work, but I can't figure out how to handle image input/output.

Every tutorial/example that I've seen uses plain text. Ideally, I would like to use Amazon Web Services too. If anyone has some direction for me, that would be great. I'm obviously not looking for a full solution, but maybe someone has successfully implemented something close to this? Thanks in advance.

HelloWor1d

2 Answers


Is your data in HDFS? What exactly do you expect to leverage from Hadoop/Spark? It seems to me that all you need is a queue of filenames and a bunch of machines to work through it.

You can pack your app into AWS Lambda (see Running Arbitrary Executables in AWS Lambda) and trigger an event for each file. Or you can pack your app into a Docker container, start up a bunch of them in ECS, and let them loose on a queue of filenames (or URLs, or S3 object keys) to process.
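As a rough sketch of the container route (the binary name `rectify` and the libtiff dependency are assumptions for illustration), the Dockerfile can be very small:

```dockerfile
# Minimal sketch: wrap the already-compiled rectification binary in an image.
FROM ubuntu:16.04

# Install whatever shared libraries the binary links against (libtiff is an assumption).
RUN apt-get update && apt-get install -y libtiff5 && rm -rf /var/lib/apt/lists/*

# Copy the compiled binary into the image.
COPY rectify /usr/local/bin/rectify

# One file per container invocation; input/output locations are passed as arguments at run time.
ENTRYPOINT ["/usr/local/bin/rectify"]
```

Build it once with `docker build -t rectify .`, push it to a registry (ECR, for example), and ECS can run as many copies as you like; whether the binary itself talks to S3 or a small wrapper script downloads/uploads the file around it is up to you.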

I think Hadoop/Spark is overkill, especially since they're quite bad at processing 1GB splits as input, and your processing is not really a map/reduce job (there are no key-value pairs for reducers to merge). If you must, you could adapt your C++ app to read from stdin and use Hadoop Streaming.
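If you do go that route, a map-only streaming job is roughly what it would look like. In this sketch the input is assumed to be a plain-text file listing one image path per line, which the stdin-reading `rectify` binary consumes; all paths and names are placeholders:

```sh
# Hypothetical map-only Hadoop Streaming run: no reducers, the C++ binary is the mapper.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -files rectify \
  -input /lists/image-paths.txt \
  -output /logs/rectify-run-001 \
  -mapper ./rectify
```

Note that Hadoop would only be feeding your binary lines of text (the paths); the images themselves would still have to be fetched and written by your own code, which is part of why this is overkill here.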

Ultimately, the question is: where is the 50TB of data stored, and in what format? The solution depends a lot on the answer, as you want to bring the compute to where the data is and avoid transferring 50TB into AWS, or even uploading it into HDFS.

Remus Rusanu
  • Thanks for the info, I really appreciate it. This Docker container/queue idea seems like it could work. So to make sure I understand: I would wrap the existing C++ code into a container and add all the filenames to an SQS queue, then fire up a certain number of EC2 instances depending on the size of the queue and send the container to each one. As each instance finishes a job, it writes the new image to S3 and deletes the job from the queue? (A worker loop along these lines is sketched after these comments.) – HelloWor1d Jun 23 '16 at 15:52
  • That's right. Wrapping an app into a container is trivial (just install all dependencies/libraries, copy the compiled binary in, and add a `CMD`/`ENTRYPOINT` instruction to run it; see [Dockerfile](https://docs.docker.com/engine/reference/builder/)). – Remus Rusanu Jun 23 '16 at 16:37
  • Great, I'll start looking into it. Would it make sense to send a couple of jobs to each EC2 instance, or just one each? Also, will it matter if the queue is incredibly long, or should I keep it at a certain length and add jobs as instances finish the earlier ones? – HelloWor1d Jun 23 '16 at 17:16
  • First of all, the 'queue' does not necessarily have to be SQS. It can be a table in MySQL or Postgres: add a row for each file, then mark each one as 'processing' or 'processed' as you go. – Remus Rusanu Jun 23 '16 at 17:26
  • As for containers per EC2 instance, it all depends on the resources required by your C++ app (basically, memory). You could also optimize CPU vs. I/O, especially since your app *will* spend much of its time waiting for I/O (reading the original image, I assume from some S3 bucket, then writing the new one back to S3), but doing so would require a more complex program (think boost::asio). I would keep it simple: even if you waste, say, 5s of CPU per file waiting for I/O, that is about 3 days in total, which at EC2 prices is pennies. – Remus Rusanu Jun 23 '16 at 17:33
  • Searching for Hadoop Image Processing on Google brings up some interesting articles, e.g. https://www.oreilly.com/ideas/how-to-analyze-100-million-images-for-624 – Remus Rusanu Jun 23 '16 at 17:49
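To make the workflow discussed in these comments concrete, here is a rough sketch of such a worker in C++ with the AWS SDK for C++. The queue URL, bucket names, local file names and the `rectify_image` function are all placeholders, and error handling is deliberately minimal; treat it as an outline, not a finished implementation.

```cpp
// Rough sketch of a queue-driven worker: pull an object key from SQS, fetch the
// image from S3, run the existing rectification code on it, upload the result,
// then delete the message. All names below are placeholders.
#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/ReceiveMessageRequest.h>
#include <aws/sqs/model/DeleteMessageRequest.h>
#include <fstream>
#include <string>

// Stand-in for the existing C++ rectification routine (works on local files).
void rectify_image(const std::string& in_path, const std::string& out_path);

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        const Aws::String queue_url  = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs"; // placeholder
        const Aws::String in_bucket  = "raw-tiffs";        // placeholder
        const Aws::String out_bucket = "rectified-tiffs";  // placeholder

        Aws::SQS::SQSClient sqs;
        Aws::S3::S3Client s3;

        for (;;) {
            // Ask for one message; long polling avoids hammering the queue.
            Aws::SQS::Model::ReceiveMessageRequest recv;
            recv.SetQueueUrl(queue_url);
            recv.SetMaxNumberOfMessages(1);
            recv.SetWaitTimeSeconds(20);
            auto received = sqs.ReceiveMessage(recv);
            if (!received.IsSuccess()) continue;                 // transient error: retry
            const auto& messages = received.GetResult().GetMessages();
            if (messages.empty()) break;                         // queue drained: we're done

            const auto& msg = messages[0];
            const Aws::String key = msg.GetBody();               // e.g. "scene_00042.tif"

            // Download the source image to local disk.
            Aws::S3::Model::GetObjectRequest get;
            get.SetBucket(in_bucket);
            get.SetKey(key);
            auto fetched = s3.GetObject(get);
            if (!fetched.IsSuccess()) continue;                  // leave message for another attempt
            {
                std::ofstream local("input.tif", std::ios::binary);
                local << fetched.GetResult().GetBody().rdbuf();
            }

            // Run the existing rectification code.
            rectify_image("input.tif", "output.tif");

            // Upload the rectified image.
            Aws::S3::Model::PutObjectRequest put;
            put.SetBucket(out_bucket);
            put.SetKey(key);
            put.SetBody(Aws::MakeShared<Aws::FStream>(
                "rectify-upload", "output.tif", std::ios_base::in | std::ios_base::binary));
            if (!s3.PutObject(put).IsSuccess()) continue;        // leave message for another attempt

            // Only delete the job once the result is safely in S3.
            Aws::SQS::Model::DeleteMessageRequest del;
            del.SetQueueUrl(queue_url);
            del.SetReceiptHandle(msg.GetReceiptHandle());
            sqs.DeleteMessage(del);
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```

Every EC2 instance (or ECS task) runs this same loop. Because the message is only deleted after the result is uploaded, a worker that dies mid-job simply lets the message reappear once its SQS visibility timeout expires, and another worker picks it up.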
To restate the problem:
  • You have 50TB of ~1GB .tif files.
  • You want to run the same algorithm on each file.

One aspect of solving a problem in the MapReduce paradigm that many developers are not aware of is this:

If you do heavy computation on your data nodes, the system will limp.

A big reason why you see mostly simple, text-based examples is that they are the kind of problem you can actually run on commodity hardware. In case you don't know or have forgotten, I'd like to point out that:

The MapReduce programming paradigm is for jobs that need scaling out rather than scaling up.


Some hints:

  • With data this big, it makes sense to take the computation to where the data is rather than bringing the data to the computation.
  • Running this job on commodity hardware is clearly a bad idea. You need machines with many cores - 16 or 32, maybe.
  • After you have procured the required hardware, you should optimize your software to parallelize the algorithm wherever necessary/useful (a sketch of this is shown after the list).
  • Your problem is definitely one that can benefit from scaling up. For large files, and a large collection of them, increasing RAM and using faster processors is undoubtedly a sensible thing to do.
  • Lastly, if you are concerned about how to read the input, you can read the images as raw binary. This will limit your ability to work with the .tif format directly, and you may have to rework your processing algorithm.
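As a rough illustration of that kind of scale-up parallelism (the `rectify_image` function and the output naming convention are placeholders; this is a sketch, not a tuned implementation), a simple approach is to spread the files over a pool of worker threads matching the core count:

```cpp
// Sketch: rectify a list of image files in parallel on one many-core machine.
// rectify_image() stands in for the existing C++ rectification routine.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

void rectify_image(const std::string& in_path, const std::string& out_path);

void rectify_all(const std::vector<std::string>& files) {
    const unsigned n_workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        // Each thread repeatedly claims the next unprocessed file index.
        for (std::size_t i = next.fetch_add(1); i < files.size(); i = next.fetch_add(1)) {
            rectify_image(files[i], files[i] + ".rectified.tif");
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_workers; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

One thread per hardware core is a reasonable default; if the job turns out to be I/O-bound, running a few more threads than cores (or overlapping reads with computation) can claw back some of that idle time.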
displayName