4

I'm trying to allow users to upload files through a PHP website. Since all the files are saved in a single folder on the server, it's conceivable (though admittedly with low probability) that two distinct users could upload two files that, while different, are named exactly the same. Or perhaps they're exactly the same file.

In both cases, I'd like to use exec("openssl md5 " . $file['upload']['tmp_name']) to determine the MD5 hash of the file immediately after it is uploaded. Then I'll check the database for an identical MD5 hash and, if one is found, I simply won't complete the upload.

However, in the move_uploaded_file documentation, I found this comment:

Warning: If you save a md5_file hash in a database to keep record of uploaded files, which is useful to prevent users from uploading the same file twice, be aware that after using move_uploaded_file the md5_file hash changes! And you are unable to find the corresponding hash and delete it in the database, when a file is deleted.

Is this really the case? Does the MD5 hash of a file in the tmp directory change after moving it to a permanent location? I don't understand why it would. And regardless, is there another, better way of ensuring the same file is not uploaded to the filesystem multiple times?

Magsol
  • `...and, if found, I simply won't complete the upload.` - I hope you have thought this through.. how would the user feel thinking they can upload their super cool meme and then are blindsided to find out that they can't because someone else already has... unless of course you offered a way to link the new 'attempted submission' to the existing entry so both users 'appear' to have their own files. – rlemon Nov 24 '11 at 21:33
  • 1
    Use sha1 or sha256 instead of md5. Make sure you hash the file contents not the filename. – Will Bickford Nov 24 '11 at 21:55
  • 1
    I don't see why sha1 or sha256 would be preferable. Unless Magsol stores very large numbers of files, there shouldn't be any difference. Could you explain why he should use SHA? – SteAp Nov 24 '11 at 22:57
  • @rlemon: Don't worry, I wouldn't do that with memes, just with the meme templates :) Also, this is for uploading videos. Since they're such large files, if the same one already exists on the server, I'd rather not allocate the space for a duplicate and instead just redirect the link--like what you suggested--to the video that was previously uploaded. – Magsol Nov 25 '11 at 17:24
  • @Magsol ok as long as it is accounted for.. – rlemon Nov 25 '11 at 18:37

5 Answers

1

Try renaming the uploaded file to a unique id. Use this:

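// RENAME_FILE, $filename and $file_ext are assumed to be defined earlier.
$dest_filename = $filename;
if (RENAME_FILE) {
    // Random, practically unique name that keeps the original extension.
    $dest_filename = md5(uniqid(rand(), true)) . '.' . $file_ext;
}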

Let me know if it helps :)

Albab
1

Shouldn't you use exec("openssl md5 " . $file['upload']['name']) instead? I'm thinking that the temporary name differs from upload to upload.

Cyclonecode
  • Is that the case? I assumed ['tmp_name'] refers to it on the filesystem, but in the temporary directory, and that `move_uploaded_file()` typically involves feeding ['name'] as the "destination" argument. If I want to access it on the filesystem prior to calling `move_uploaded_file`, should I use ['name'] instead? – Magsol Nov 25 '11 at 17:26
1

It would seem that this is indeed the case. I have briefly looked through the docs as well. But why don't you compute the MD5 checksum before using move_uploaded_file and store that value in your database, linking it directly to the new file? That way you can always check an uploaded file against it and see whether that file already exists in your filesystem.

This does require a database, but most have access to one.
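
A minimal sketch of that flow, under stated assumptions: register_upload() is a hypothetical helper, the $index array stands in for the database table, and rename() stands in for move_uploaded_file() so the snippet can run outside a real upload request. The hash is taken from the file's contents before the move.

```php
<?php
// Hypothetical helper. $index maps content hash => stored path and
// stands in for a database table with a unique column on the hash.
function register_upload(string $tmpPath, array &$index, string $destDir): string
{
    // Hash the file's contents while it is still in the temp location.
    $hash = md5_file($tmpPath);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $tmpPath");
    }

    // Same contents already stored: point the caller at the existing copy.
    if (isset($index[$hash])) {
        return $index[$hash];
    }

    // New contents: store the file under a name derived from its hash.
    $dest = $destDir . '/' . $hash;
    // In a real upload handler this would be move_uploaded_file($tmpPath, $dest).
    if (!rename($tmpPath, $dest)) {
        throw new RuntimeException("Cannot move $tmpPath to $dest");
    }
    $index[$hash] = $dest;
    return $dest;
}
```

Called twice with two temp files that hold the same bytes, the second call returns the path stored by the first and nothing new is written. In a real handler, $tmpPath would be the upload's tmp_name entry.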

Jan Dragsbaek
  • Yes! This is exactly what I'm currently doing, in fact. I realized after posting that, even if the MD5 changes when using move_uploaded_file(), performing an MD5 hash on the file *before* calling move_uploaded_file() will still hash everything to the correct domain and provide a good comparison. – Magsol Nov 25 '11 at 17:18
1

If you're convinced by all the reasons given here in the answers and decide not to use md5 at all (I'm still not sure whether you WANT to or MUST use a hash), you can just append something unique to each user, plus the time of uploading, to each file name. That way you'll end up with more readable file names. Something like: $filename = "$filename-$user_ip_string-$microtime";. Of course, you must have all three variables ready and formatted before that, it goes without saying.

No chance of the same file name, same IP address and same microtime occurring at the same time, right? You could easily get away with microtime only, but IP will make it even more certain. Of course, like I said, all this goes if you decide not to use hashing and go for a simpler solution.
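
One way the naming scheme above might look, as a sketch only: unique_upload_name() is a hypothetical helper, and the "-" separator, the character whitelist, and the placement of the extension at the end are assumptions rather than part of the suggestion itself.

```php
<?php
// Hypothetical helper: builds "<name>-<ip>-<microtime><.ext>".
function unique_upload_name(string $originalName, string $userIp): string
{
    // basename() drops any directory part the client may have sent.
    $info = pathinfo(basename($originalName));
    $ext  = isset($info['extension']) ? '.' . $info['extension'] : '';

    // Keep only characters that are safe in a file name.
    $stem = preg_replace('/[^0-9A-Za-z_-]/', '_', $info['filename']);
    // Also replaces the dots of IPv4 and the colons of IPv6 addresses.
    $ip = preg_replace('/[^0-9A-Za-z]/', '_', $userIp);

    // Seconds and microseconds as one digit string, e.g. 1322241840123456.
    $stamp = str_replace('.', '', sprintf('%.6F', microtime(true)));

    return "$stem-$ip-$stamp$ext";
}
```

For example, unique_upload_name('video.avi', '192.168.0.1') yields a name of the form video-192_168_0_1-<digits>.avi. Two uploads with the same name from the same address within the same microsecond would still collide, so checking file_exists() before moving remains a cheap safeguard.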

Shomz
  • 1
    Some sort of unique name is the idea, yes, so I'm not necessarily married to MD5. I suppose I could forgo trying to detect if the same file is accidentally (or even on purpose) uploaded and just make sure every file has its own unique name. That might be the better approach, as it certainly makes it easier on the server side! – Magsol Nov 25 '11 at 17:22
0

No, in general move_uploaded_file doesn't somehow magically change the hash of a file's contents.

But if you compute md5() over a string that includes the file's path, the hash will certainly change when the file is moved to a new path/folder.

If you md5() only the filename, nothing will change.

It's a good idea to rename uploaded files with a unique name.

But don't forget to make sure that the location where you finally store the file is outside the document root folder of your vHost. Located there, it can't be downloaded without going through a PHP script.

Final remark: While it's very, very unlikely, the MD5 hashes of two different files may be identical.
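
The difference between hashing the contents and hashing the path can be checked with a small script. This is only a sketch: rename() stands in for move_uploaded_file(), and the temp file and its ".moved" destination are made up for the demonstration.

```php
<?php
// Create a throwaway file that plays the role of the uploaded temp file.
$old = tempnam(sys_get_temp_dir(), 'up_');
file_put_contents($old, 'example video bytes');
$new = $old . '.moved'; // hypothetical destination path

$contentBefore = md5_file($old); // hash of the file's contents
$pathBefore    = md5($old);      // hash of the path string only

rename($old, $new); // stand-in for move_uploaded_file()

$contentAfter = md5_file($new);
$pathAfter    = md5($new);

var_dump($contentBefore === $contentAfter); // bool(true): contents unchanged
var_dump($pathBefore === $pathAfter);       // bool(false): path string changed

unlink($new); // clean up
```

md5_file() reads the bytes of the file, so moving the file leaves its result alone. md5() hashes whatever string it is given, so passing it a path produces a value that changes with the location.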

SteAp
  • 1
    The final remark describes a phenomenon known as a hash collision. It's what MD5 gets so much flak for these days. – BoltClock Nov 24 '11 at 21:34
  • One work-around is to have a 2-phase test; do the MD5, then compare the size & random bytes from files. Or if they're small, just compare contents. – Phil Lello Nov 24 '11 at 22:18
  • Yeah, I'm trying to hash the file's *contents*, rather than its *filename*, since two users could very easily upload two files with generic and identical names but vastly different content, and it's the content for which I want to prevent duplication. I don't care about a million files named "video.avi", so long as each video is of something different :) – Magsol Nov 25 '11 at 17:27