
I'm looking to build a backend for accepting user-uploaded images, renaming them, and storing them in a file system (no, it's not an Instagram).

I was thinking of simply renaming the image and storing in a user folder:

images/{userid}/{userid}_{md5(timestamp)}.jpg

The associations would also be included in the database.
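Something like this, as a minimal sketch (the `images` root, the `.jpg` extension, and the helper name are just placeholders):

```php
<?php
// Sketch of the proposed scheme: images/{userid}/{userid}_{md5(timestamp)}.jpg
// The 'images' root and the .jpg extension are assumptions for illustration.
function makeImagePath(int $userId): string
{
    $dir      = 'images/' . $userId;
    $filename = $userId . '_' . md5((string) time()) . '.jpg';
    return $dir . '/' . $filename;
}

// In the upload handler (not run here):
//   $path = makeImagePath($userId);
//   if (!is_dir(dirname($path))) { mkdir(dirname($path), 0755, true); }
//   move_uploaded_file($_FILES['image']['tmp_name'], $path);
//   // ...then INSERT the (user_id, filename) association into the DB.
```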

Is that a good/sufficient model?

Nathan Waters

3 Answers


Essentially your method is just fine, but here are my suggestions:

  • Don't use the timestamp in the filename. Since you're already storing the filename in the DB, just create extra columns for the timestamp <-> file relationship. This way it's easier to manage things like original creation, last modified, or even expiration dates.
  • Make sure the column you store the filename in has a unique constraint. You don't want to accidentally store duplicate filenames.
  • Cross-check on acceptance of the files: if the file is saved to the server successfully but the query fails, make sure to delete the file on failure. Or, if your order of operations is reversed, remove the entry in the DB if the file fails to save to the server.
  • If the images are not allowed to be publicly accessible, you can deny direct viewing of the images and instead direct users to a link (a PHP file) with the filename as a GET variable. There you can check sessions and/or cookies to determine whether they are authorized to view it. If they are, set the headers of the output to those of a JPEG (or whatever kind of file they are viewing).
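The last suggestion might look roughly like this. The folder layout comes from the question; the session key and the rule that users may only view their own folder are assumptions to adapt to your app:

```php
<?php
// Access-controlled image gateway sketch (e.g. image.php?file=42_abc.jpg).
// Assumption: a user may only view files in their own images/{userid}/ folder.
function serveImage(?int $sessionUserId, string $requested): bool
{
    $file = basename($requested);              // basename() blocks ../ path traversal
    $path = 'images/' . $sessionUserId . '/' . $file;

    if ($sessionUserId === null || !is_file($path)) {
        http_response_code(403);               // not logged in, or no such file
        return false;
    }
    header('Content-Type: image/jpeg');
    header('Content-Length: ' . (string) filesize($path));
    readfile($path);
    return true;
}

// In image.php:
//   session_start();
//   serveImage($_SESSION['user_id'] ?? null, $_GET['file'] ?? '');
```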
aowie1
  • I think storing the file name with a timestamp is good practice, as it makes all the filenames unique. – heyanshukla Apr 16 '12 at 06:27
  • This is true, but I guess what I meant to say is don't rely on them for an accurate timestamp. But I guess you couldn't if they were md5'd anyway :P Anyhow, using rand() to generate a random string performs better than md5(time()) – aowie1 Apr 16 '12 at 06:33
  • I would use the timestamp without md5(), which will be unique in all cases. – heyanshukla Apr 16 '12 at 06:35
  • 2
    On a sufficiently heavy traffic website timestamp wont necessarily be unique. – Toby Allen Apr 16 '12 at 06:40
  • Yep, renaming using time() was simply to avoid duplicates. Would you suggest rand(), md5(userid + time()), or simply uniqid()? I think I might just do {userid}_{uniqid()}.jpg – Nathan Waters Apr 16 '12 at 06:59
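For reference, a sketch of the candidates from this thread, side by side for a hypothetical user 42. Note that uniqid() without the more_entropy flag is just the microsecond clock, so it can still collide under concurrent load:

```php
<?php
// The naming candidates debated in the comments, for a hypothetical $userId.
$userId = 42;

$a = $userId . '_' . md5($userId . time()) . '.jpg';   // md5(userid + time())
$b = $userId . '_' . uniqid('', true) . '.jpg';        // uniqid() with extra entropy
```

Either way, the DB's unique constraint on the filename column remains the real safety net.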

Why not use the unique ID from the database? That makes it much easier to find a file.

It also doesn't restrict how you structure your files; perhaps you won't always want to save by username. If each file has an ID tied to the database, this may be much simpler.

user/{database_id}.jpg
Toby Allen
  • I don't really want people to be able to easily access the entire image library by going 1.jpg, 2.jpg etc – Nathan Waters Apr 16 '12 at 06:44
  • 1
    Ok, then generate a GUID put it in the database and call the filename after it, same result with less security concerns. – Toby Allen Apr 16 '12 at 06:51
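PHP has no cross-platform built-in GUID function, so a version-4 UUID generator would have to be sketched by hand (this one assumes PHP 7+ for random_bytes()):

```php
<?php
// Minimal version-4 UUID generator (PHP 7+ for random_bytes()).
function uuid4(): string
{
    $b = random_bytes(16);
    $b[6] = chr((ord($b[6]) & 0x0f) | 0x40);   // set version nibble to 4
    $b[8] = chr((ord($b[8]) & 0x3f) | 0x80);   // set RFC 4122 variant bits
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($b), 4));
}

// Store uuid4() in the DB and name the file "{$uuid}.jpg" — unguessable,
// unlike sequential 1.jpg, 2.jpg, ...
```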

Kinda depends:

  • how many images per user?
  • approx size range per image?
  • how many users?
  • what sort of concurrency do you expect?

If most of the above numbers are small, your method will probably get you a good long way, and will at least let you get started.

I know that using MySQL blob storage gets bad press, but that would also be a simple way to get started, and you could shard the database to get some scale-out without having to do any clever coding.

That said ...

If, in your system, you expect users to upload very large numbers of files you might run into limits or performance issues of the filesystem.

If you are hosting on Windows, watch out for the 8.3 filename problem (very slow when the directory gets large), as your filenames will definitely be longer than 8.3 :)

If many people will be uploading/downloading concurrently - say at peak usage periods - you will have to watch out for I/O contention. If you're on a RAID 10 volume you'll get further, and better still with an SSD (but then you'll likely have storage-capacity problems).

Your suggested method won't be the most space efficient if there's any chance that the same images might be uploaded by different people (duplication across many folders), in which case you'd be better off keying by a function of the data (e.g. md5sum) and storing just one copy (yes, then there are management issues with deletes).
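A rough sketch of that content-keyed approach; the directory layout and function name are made up for illustration:

```php
<?php
// Content-addressed storage sketch: key the stored file by the md5 of its
// bytes so the same image uploaded by many users is kept only once.
function storeDeduplicated(string $tmpPath, string $root = 'blobs'): string
{
    $hash = md5_file($tmpPath);
    // Fan out by the first hex digits so no single directory grows huge.
    $dest = sprintf('%s/%s/%s/%s.jpg', $root, substr($hash, 0, 2), substr($hash, 2, 2), $hash);

    if (!is_dir(dirname($dest))) {
        mkdir(dirname($dest), 0755, true);
    }
    if (!is_file($dest)) {          // first upload of this content
        copy($tmpPath, $dest);      // move_uploaded_file() in a real handler
    }
    // Record (user_id, hash) in the DB; delete the blob only when no rows
    // reference it any more — this is the deletion management issue above.
    return $dest;
}
```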

If you expect lots of large images from many people you will eventually have to think about scaling the underlying storage. You could maybe partition the data by some function of the {userid} and shard across different volumes or machines. This would also buy you better concurrent throughput.
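A sketch of partitioning by a function of {userid}; the volume mount points are hypothetical:

```php
<?php
// Pick one of N volumes by hashing the user id, so each user's images always
// land on the same volume. The /mnt/volN mount points are placeholders.
function volumeFor(int $userId, int $numVolumes = 4): string
{
    $bucket = hexdec(substr(md5((string) $userId), 0, 4)) % $numVolumes;
    return "/mnt/vol{$bucket}/images/{$userId}";
}
```

Because the mapping is deterministic, no lookup table is needed to find a user's volume later.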

Another question: will you always be serving out only the original image, or will you sometimes send back re-scaled copies? You'd probably want to scale once and always return the pre-scaled version, in which case you'd need to take storage of those scaled copies into account too.
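A scale-once sketch using GD (assumes ext-gd is enabled and JPEG sources; the sizes are arbitrary):

```php
<?php
// Render the thumbnail the first time it's requested, then serve the cached
// copy from disk forever after. Requires ext-gd; JPEG-only for brevity.
function thumbnail(string $src, string $thumb, int $maxW = 200): string
{
    if (!is_file($thumb)) {
        [$w, $h] = getimagesize($src);
        $newW = min($maxW, $w);
        $newH = (int) round($h * $newW / $w);    // preserve aspect ratio

        $in  = imagecreatefromjpeg($src);
        $out = imagecreatetruecolor($newW, $newH);
        imagecopyresampled($out, $in, 0, 0, 0, 0, $newW, $newH, $w, $h);
        imagejpeg($out, $thumb, 85);
        imagedestroy($in);
        imagedestroy($out);
    }
    return $thumb;   // thumbnail storage must be budgeted alongside originals
}
```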

Stevie