Let's say you have instance1, instance2, and instance3 running in AWS.
They all run Apache, and the web application needs to let users upload images, as many projects do.
Also, when you show an image you need to crop it to the right size, so you basically need to make sure all instances have access to the same files at all times.
So let's say a user uploads an image to instance1, another user visits a page that shows the same image at 100x100 and that request hits instance2, and a third user is trying to see the same image at 300x300 on instance3. There are also many other sizes that are not easily predictable.
So you basically need a distributed file system; I'm using GlusterFS, so all instances have access to the same files. When a request to see an image comes in, a PHP script checks whether that image has already been resized to the given dimensions. If it has, the script serves the existing file; if not, it resizes the image and then serves the result.
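The check-then-resize flow described above can be sketched roughly as follows. This is a Python illustration rather than the actual PHP script; the function names, the `photo_100x100.jpg` naming scheme, and the injected `resize` callable are all hypothetical, and the cache directory is assumed to live on the shared file system.

```python
import os
import uuid
from pathlib import Path
from typing import Callable

# Hypothetical signature for whatever does the real resizing:
# resize(source_path, destination_path, width, height)
ResizeFn = Callable[[Path, Path, int, int], None]


def resized_path(cache_dir: Path, original: Path, width: int, height: int) -> Path:
    """Build the cache path for one size, e.g. photo_100x100.jpg (assumed naming)."""
    return cache_dir / f"{original.stem}_{width}x{height}{original.suffix}"


def get_or_create_resized(original: Path, cache_dir: Path,
                          width: int, height: int, resize: ResizeFn) -> Path:
    """Return the resized file, creating it only if it does not exist yet."""
    target = resized_path(cache_dir, original, width, height)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write to a uniquely named temporary file first, then rename it into
        # place, so a concurrent reader is less likely to see a partial file.
        tmp = target.with_name(f"{target.name}.{uuid.uuid4().hex}.tmp")
        resize(original, tmp, width, height)
        os.replace(tmp, target)
    return target
```

The resize step is passed in as a function only to keep the sketch free of any particular imaging library; two instances that miss the cache at the same moment would both resize, and the last rename wins.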
GlusterFS is working very smoothly and I'm very happy with it, except that I think I'm reinventing the wheel and AWS should have some sort of solution for this. With the top command I can see that glusterfs is always using some of my CPU.
I also use CloudFront to cache the output of my resizing script, which reduces the server load to a good degree, but GlusterFS is still costly to run.
You could use rsync and some sort of cron job to do the same without GlusterFS, but that's a lot of work and not very reliable, because you need to know when to trigger the rsync process, and you still won't get the great benefits that GlusterFS provides. I also tried s3fs, and I'd just like to say it was an absolute nightmare.
NFS drives also seem very primitive compared to GlusterFS. I think they can run over UDP, which would mean they treat your data like it doesn't matter.
So what's the best way of doing something like this? I tried to find a distributed file system offered by AWS, since I think many developers would have the same or similar problems, but I couldn't find one.
You may say just upload to S3, but S3 alone doesn't help me. I need to know whether the image has already been resized, and then either resize and serve it or just serve it, so I need something that I can write a script for.
You may also ask why I don't resize all images first and then upload them all to S3. The reasons I can't do that are:
- There are around 1 million images and 100 sizes, so we are looking at a gigantic number of files (around 100 million) to be converted
- New sizes may be added every day, so a resize-first strategy doesn't work
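To put a rough number on the first point, here is a back-of-the-envelope calculation. The image and size counts come from the figures above; the throughput of 50 resizes per second is purely an assumption for illustration, not a measurement.

```python
IMAGES = 1_000_000   # approximate number of uploaded images (from the post)
SIZES = 100          # approximate number of distinct sizes (from the post)

# Hypothetical throughput for a batch job; a real figure would need measuring.
ASSUMED_RESIZES_PER_SECOND = 50

SECONDS_PER_DAY = 86_400

total_variants = IMAGES * SIZES
days_to_pregenerate = total_variants / ASSUMED_RESIZES_PER_SECOND / SECONDS_PER_DAY

print(f"{total_variants:,} files to generate")
print(f"about {days_to_pregenerate:.1f} days at the assumed rate")
```

Each newly added size would add another million files on top of that, which is why resizing on demand looks more practical here.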