
I have a large number of images stored in an AWS S3 bucket.

Every week, I run a classification task on all these images. The way I'm doing it currently is by downloading all the images to my local PC, processing them, then making database changes once the process is complete.

I would like to reduce the amount of time spent downloading images to increase the overall speed of the classification task.

EDIT2:

I am actually required to process 20,000 images at a time to improve the performance of the classification engine. This means I can't use Lambdas, since the maximum RAM available is 3GB and I need 16GB to process all 20,000 images.

The classification task uses about 16GB of RAM. What AWS service can I use to automate this task? Is there a service that can be put on the same VLAN as the S3 Bucket so that images transfer very quickly?

The entire process takes about 6 hours. If I spin up an EC2 instance with 16GB of RAM, it would be very cost-ineffective, as it would finish after 6 hours and then spend the remainder of the week sitting there doing nothing.

Is there a service that can automate this task in a more efficient manner?

EDIT:

Each image is around 20-40KB. The classification is a neural network, so I need to download each image so I can feed it through the network.

Multiple images are processed at the same time (batches of 20,000), but the processing part doesn't actually take that long. The longest part of the whole process is the downloading. For example, downloading takes about 5.7 hours, while processing takes about 0.3 hours in total. That is why I'm trying to reduce the download time.

A_toaster
  • How large is each image? Is it really required to download the images? How long does it take to process one image? Can multiple images be processed at the same time? – Chetan Aug 05 '19 at 23:58
  • @ChetanRanpariya Each image is around 20-40KB. The classification is a neural network, so yes I need to download the image so I can feed it through the network. Multiple images are processed at the same time (batches of 20,000), but the processing part doesn't actually take that long. The longest part of the whole process is the downloading part. For reference, downloading takes about 5.7 hours, processing takes about 0.3 hours in total. Hence why I'm trying to reduce the amount of downloading time – A_toaster Aug 06 '19 at 00:01
  • If you can use the stream of the image instead of the physical file, I suggest using a Lambda function, one per file. The function can read the S3 object as a stream, run the processing on the image, and write the result to the database. A Lambda function can even download the image temporarily from S3. You just need to invoke the Lambda function for each file, using SQS or programmatically. – Chetan Aug 06 '19 at 00:08
  • @ChetanRanpariya Isn't the maximum amount of RAM available for a Lambda 3GB? Unfortunately I need a minimum of 16GB to perform this task – A_toaster Aug 06 '19 at 00:12
  • I assumed that you can process individual images... If processing an individual image does not require a lot of memory, you can use Lambda... – Chetan Aug 06 '19 at 00:16
  • Another approach is to use an EC2 machine: set it up with all the required software and start it only when you need it. A stopped instance does not incur instance charges. – Chetan Aug 06 '19 at 00:31
  • https://stackoverflow.com/questions/2549035/do-you-get-charged-for-a-stopped-instance-on-ec2 – Chetan Aug 06 '19 at 00:31
  • @ChetanRanpariya Ah, that sounds promising! Is it possible to automate the starting/stopping of the instance when the code is finished? PS: feel free to answer this question and I will give you the green tick! – A_toaster Aug 06 '19 at 00:35

2 Answers


For your purpose you can still use an EC2 instance. And if you have a large amount of data to download from S3, you can attach an EBS volume to the instance.
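As a rough sketch of the download step on the instance, the images could be fetched concurrently onto the attached volume. This assumes boto3 is installed and that, for 20-40KB objects, per-request latency rather than bandwidth is the bottleneck, which has not been measured here. `download_all`, `dest_dir`, and `max_workers` are hypothetical names chosen for illustration; `download_file` is the boto3 S3 client method.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def download_all(s3_client, bucket, keys, dest_dir, max_workers=32):
    """Download every key in `keys` from `bucket` into `dest_dir` concurrently.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    os.makedirs(dest_dir, exist_ok=True)

    def fetch(key):
        # Flatten the key into one file name so nested keys need no subdirectories.
        local_path = os.path.join(dest_dir, key.replace("/", "_"))
        s3_client.download_file(bucket, key, local_path)
        return local_path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order; a failed download raises when its result is read.
        return list(pool.map(fetch, keys))
```

It might be called as `download_all(boto3.client("s3"), "my-bucket", keys, "/mnt/images")`, where the bucket name and mount path are placeholders.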

You need to set up the instance with all the tools and software required to run your job. When you don't have any process to run, you can shut down the instance, and boot it up again when you want to run the process.
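One way to shut the instance down as soon as the job finishes is to have the job script stop its own instance. The following is a minimal sketch, assuming boto3 is installed on the instance and the instance has an IAM role that permits `ec2:StopInstances`. `run_then_stop` and `job` are hypothetical names; `stop_instances` is the boto3 EC2 client method.

```python
def run_then_stop(ec2_client, instance_id, job):
    """Run job(), then stop the instance, even if the job raises.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    try:
        return job()
    finally:
        # Stopping (not terminating) keeps the EBS volume and installed software.
        ec2_client.stop_instances(InstanceIds=[instance_id])
```

On the instance this could be invoked as `run_then_stop(boto3.client("ec2"), "i-0123456789abcdef0", classify_all_images)`, where the instance ID and `classify_all_images` are placeholders for your own values.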

EC2 instances are not charged for the time they are in the stopped state. You will still be charged for the EBS volume and any Elastic IP attached to the instance.

You will also be charged for the storage of the EC2 image on S3.

But I think these costs will be lower than the cost of running the EC2 instance all the time.

You can schedule the start and stop of the instance using the AWS Instance Scheduler.

https://www.youtube.com/watch?v=PitS8RiyDv8
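As a simpler alternative to the Instance Scheduler (not the Instance Scheduler itself), the weekly start could be done by a small Lambda function triggered on a schedule. The following is only a sketch under that assumption: `make_start_handler` is a hypothetical helper, while `start_instances` and its `StartingInstances` response field are part of the boto3 EC2 client.

```python
def make_start_handler(ec2_client, instance_id):
    """Build a Lambda-style handler that starts one EC2 instance.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """

    def handler(event, context):
        response = ec2_client.start_instances(InstanceIds=[instance_id])
        # start_instances reports one state-change entry per instance.
        return [item["InstanceId"] for item in response["StartingInstances"]]

    return handler
```

In the Lambda module this could be wired up as `lambda_handler = make_start_handler(boto3.client("ec2"), "i-0123456789abcdef0")`, with the instance ID replaced by your own. Stopping can then be left to the job itself once it completes.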

You can also use Auto Scaling, but that would be a more complex solution than using the Instance Scheduler.

Chetan

I would look into Kinesis streams for this, but it's hard to tell because we don't know exactly what processing you are doing on the images.

alex067