
Let's say you have instance1, instance2, and instance3 running in AWS.

They are all running Apache, and the web application you run needs to let users upload images, which is the case in many projects.

Also, when you show an image you need to crop it to the right size, so you basically need to make sure all instances have access to the same files at all times.

So let's say a user uploads an image to instance1. Another user visits a page where the same image is shown at 100x100 and hits instance2, while a third user wants to see the same image at 300x300 on instance3, plus many other sizes that are not easily predictable.

So you basically need a distributed file system; I'm using GlusterFS so that all instances have access to the same files. When a request to view an image comes in, a PHP script checks whether the image has already been resized to the given dimensions: if yes, it serves it; if not, it resizes it first and then serves it.
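
Roughly, the script works like this — a minimal sketch, assuming GD, JPEG-only images, and hypothetical paths under a shared /mnt/gluster mount:

```php
<?php
// Resize-on-demand over a shared GlusterFS mount (sketch).
// Paths, parameter names, and the JPEG-only assumption are illustrative.
$name   = basename($_GET['name']);
$w      = (int) $_GET['w'];
$h      = (int) $_GET['h'];
$source = '/mnt/gluster/images/original/' . $name;
$cached = sprintf('/mnt/gluster/images/cache/%dx%d-%s', $w, $h, $name);

if (!file_exists($cached)) {
    // First request for this size: resize once; every instance sees
    // the result afterwards because the mount is shared.
    $orig  = imagecreatefromjpeg($source);
    $thumb = imagecreatetruecolor($w, $h);
    imagecopyresampled($thumb, $orig, 0, 0, 0, 0, $w, $h, imagesx($orig), imagesy($orig));
    imagejpeg($thumb, $cached, 85);
    imagedestroy($orig);
    imagedestroy($thumb);
}

header('Content-Type: image/jpeg');
readfile($cached);
```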

GlusterFS is working very smoothly and I'm very happy with it, except that I think I'm reinventing the wheel and AWS should have some sort of solution for this. With the top command I can see that glusterfs is always using some of my CPU.

I also use CloudFront to cache the output of my resizing script, which reduces the server load to a good degree, but GlusterFS is still costly to run.

You could use rsync and some sort of cron job to do the same without GlusterFS, but that's a lot of work and not very reliable, because you need to know when to trigger the rsync process, and you still won't get the benefits GlusterFS provides. I also tried s3fs, and I'd just like to say it was an absolute nightmare.

NFS also seems very primitive compared to GlusterFS; I believe it can run over UDP, which treats your data like it doesn't matter.

So what's the best way of doing something like this? I tried to find a distributed file system offered by AWS, since I think many developers have the same or similar problems, but there isn't one.

You may say just upload to S3, but S3 alone doesn't help me: I need to know whether the image has already been resized, then either resize and serve or just serve, so I need something I can write a script for.

You may also say, well, why don't you resize all the images first and then upload them all to S3? I can't do that because:

  1. There are around 1 million images and about 100 sizes, so we are looking at something on the order of 100 million files to convert
  2. New sizes may be added every day, so a resize-everything-first strategy doesn't work
Yasser1984
    Excellent question. I didn't quite understand why you are limited to writing a script when verifying whether an image exists or not. Why don't you just go ahead, host everything in S3, and use the available SDKs to check whether the file exists, and to upload newly resized images to S3? – Viccari Jun 28 '13 at 11:37

1 Answer


I would approach it with two S3 buckets:

  • Main image bucket: upload images at the raw/best resolution used on the site; no expiration time.
  • Cache bucket: holds images created on demand; you can use timthumb to generate them at the requested sizes, and set an expiration time.

When a user requests an image, you check whether it exists in the cache bucket; if not, you create it, store it in the cache bucket, and serve it from there.
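
For example, here is a rough sketch of that flow in PHP with the AWS SDK for PHP (v3), using GD rather than timthumb; the bucket names, key scheme, and region are made up for illustration:

```php
<?php
// Two-bucket flow (sketch): check the cache bucket, create on demand,
// then send the client to the cached object. Names are illustrative.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

$name     = basename($_GET['name']);
$w        = (int) $_GET['w'];
$h        = (int) $_GET['h'];
$cacheKey = sprintf('%dx%d/%s', $w, $h, $name);

if (!$s3->doesObjectExist('example-cache-bucket', $cacheKey)) {
    // Miss: fetch the original from the main bucket, resize, store in cache.
    $result = $s3->getObject(['Bucket' => 'example-main-bucket', 'Key' => $name]);
    $orig   = imagecreatefromstring((string) $result['Body']);
    $thumb  = imagecreatetruecolor($w, $h);
    imagecopyresampled($thumb, $orig, 0, 0, 0, 0, $w, $h, imagesx($orig), imagesy($orig));

    ob_start();
    imagejpeg($thumb, null, 85);
    $s3->putObject([
        'Bucket'      => 'example-cache-bucket',
        'Key'         => $cacheKey,
        'Body'        => ob_get_clean(),
        'ContentType' => 'image/jpeg',
    ]);
}

// Serve by redirecting to the cached object (or put CloudFront in front of it).
header('Location: https://example-cache-bucket.s3.amazonaws.com/' . $cacheKey);
```

The cache bucket's expiration time can be handled with an S3 lifecycle rule, so rarely used sizes clean themselves up.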

Considerations:

  • Watch out with timthumb: old versions have security issues, so you may want to check for alternatives.
  • Squid can help too; you could replace the cache bucket with another EC2 instance running it.

This is only my approach, but feel free to reply and dig into it deeper.