
I need to generate a unique ID for files of up to 200-300 MB each. The condition is that the algorithm should be quick; it should not take much time. I am selecting the files from a desktop and calculating a hash value like this:

HMACSHA256 myhmacsha256 = new HMACSHA256(key);
byte[] hashValue = myhmacsha256.ComputeHash(fileStream);

fileStream is a handle to the file from which the content is read. This method is going to take a lot of time, for obvious reasons. Does Windows generate a key for a file for its own bookkeeping that I could use directly? Is there any other way to identify whether a file is the same, instead of matching the file name, which is not very foolproof?
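
For reference, the complete call looks roughly like this; the key and the file path here are placeholders, not the values actually used:

using System.IO;
using System.Security.Cryptography;

class FileHasher
{
    static byte[] HashFile(string path, byte[] key)
    {
        // Stream the file through HMAC-SHA256; ComputeHash reads the
        // whole stream, which is why large files take noticeable time.
        using (FileStream fileStream = File.OpenRead(path))
        using (HMACSHA256 myhmacsha256 = new HMACSHA256(key))
        {
            return myhmacsha256.ComputeHash(fileStream);
        }
    }
}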

intoTHEwild
  • Maybe you should hash not the stream but just the file size? – Denis Agarev Mar 16 '12 at 12:01
  • Why reinvent the wheel? Why not just use [MD5](http://msdn.microsoft.com/en-us/library/system.security.cryptography.md5.aspx)? – mtijn Mar 16 '12 at 12:02
  • @mtijn, maybe because [MD5 is broken](http://en.wikipedia.org/wiki/MD5#Security) and should not be used for new implementations? – user Mar 16 '12 at 12:04
  • @MichaelKjörling - It's very unlikely you'll get collisions unless manufactured by a malicious user; and since the OP isn't using MD5 for security I don't see it as an issue. – Bridge Mar 16 '12 at 12:06
  • I don't see anything in the question to indicate whether collision-resistance is a requirement or not, only that this is about identifying whether "the file is [the] same". – user Mar 16 '12 at 12:09
  • You can't get unique ids from a hash. – Hans Passant Mar 16 '12 at 12:09
  • Is [this](http://stackoverflow.com/questions/1866454/unique-file-identifier-in-windows) what you are looking for? – mtijn Mar 16 '12 at 12:15

4 Answers

MD5.Create().ComputeHash(fileStream);

Alternatively, I'd suggest looking at this rather similar question.
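
A minimal sketch of how that one-liner could be wrapped, assuming the file is opened as a FileStream and the digest is wanted as a hex string (the method name is made up for illustration):

using System;
using System.IO;
using System.Security.Cryptography;

class Md5Checksum
{
    static string ChecksumOf(string path)
    {
        using (FileStream fileStream = File.OpenRead(path))
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(fileStream);
            // Render the 16-byte digest as a 32-character hex string.
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}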

mtijn
  • As I said in a previous comment, [MD5 is broken](http://en.wikipedia.org/wiki/MD5#Security) and **should not be used**. SHA256, which the OP already uses, has no known significant vulnerabilities, at least insofar as I know. – user Mar 16 '12 at 12:07
  • The OP says "the condition is that the algo should be quick"; he did not say anything about security, nor about using this for security reasons. It's just a checksum. – mtijn Mar 16 '12 at 12:10
  • MD5 is broken such that it should not be used as a mechanism in a security context; as a way of hashing general data for summary comparison it's fine and has the advantage of being very fast. – Alex K. Mar 16 '12 at 12:10

How about generating a hash from the info that's readily available from the file itself? I.e., concatenate:

  • File Name
  • File Size
  • Created Date
  • Last Modified Date

and create your own?
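
A rough sketch of that idea, assuming the four pieces of metadata are simply concatenated and run through a hash (keep in mind, as the comment below notes, that the name and dates can change while the content stays the same):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class MetadataId
{
    static string IdFor(string path)
    {
        FileInfo info = new FileInfo(path);
        // Concatenate name, size, created date and last-modified date.
        string combined = info.Name + "|" + info.Length + "|" +
                          info.CreationTimeUtc.Ticks + "|" + info.LastWriteTimeUtc.Ticks;
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(combined));
            return Convert.ToBase64String(hash);
        }
    }
}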

Marcel
  • Rolling your own cryptography is a **very very bad idea**. People spend years on these things and get them wrong - what are the odds that the OP would do better? Additionally, the information you list (with the exception of file size) can very easily change while the file content remains exactly the same. – user Mar 16 '12 at 12:06
  • If it's used for security purposes, of course. If used only to identify files (which I thought was the intention), why not? – Marcel Mar 16 '12 at 13:36

Computing hashes and comparing them requires reading through both files completely. My suggestion is to first check the file sizes and, only if they are identical, go through the files byte by byte.
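
A minimal sketch of that approach, comparing two files on disk; the buffered read is an assumption, chosen so the comparison does not literally issue one read per byte:

using System.IO;

class FileComparer
{
    static bool SameContent(string pathA, string pathB)
    {
        FileInfo a = new FileInfo(pathA);
        FileInfo b = new FileInfo(pathB);

        // Files of different sizes can never have the same content; this check is cheap.
        if (a.Length != b.Length)
            return false;

        using (FileStream streamA = a.OpenRead())
        using (FileStream streamB = b.OpenRead())
        {
            byte[] bufferA = new byte[81920];
            byte[] bufferB = new byte[81920];
            int readA;
            while ((readA = streamA.Read(bufferA, 0, bufferA.Length)) > 0)
            {
                // Fill bufferB with exactly as many bytes as were read into bufferA.
                int readB = 0;
                while (readB < readA)
                {
                    int n = streamB.Read(bufferB, readB, readA - readB);
                    if (n == 0) return false;
                    readB += n;
                }
                for (int i = 0; i < readA; i++)
                    if (bufferA[i] != bufferB[i]) return false;
            }
            return true;
        }
    }
}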

bytecode77

If you want a "quick and dirty" check, I would suggest looking at CRC-32. It is extremely fast (the algorithm is essentially just XORs and table lookups), and if you aren't too concerned about collision resistance, a combination of the file size and the CRC-32 checksum over the file data should be adequate. 28.5 bits are required to represent the file size (that gets you to about 379 MB), which means the combination is effectively an identifier of just over 60 bits. I would use a 64-bit quantity to store the file size for future proofing, but 32 bits would work too in your scenario.
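
A sketch of that combination, using a small table-driven CRC-32 (the common reflected polynomial 0xEDB88320 is an assumption; any CRC-32 variant would do) paired with a 64-bit file size:

using System;
using System.IO;

class Crc32Id
{
    // Table-driven CRC-32 (reflected polynomial 0xEDB88320), built once.
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Returns the file size (64-bit, for future proofing) plus the CRC-32
    // of the content; together they form the "quick and dirty" identifier.
    static Tuple<long, uint> IdFor(string path)
    {
        uint crc = 0xFFFFFFFFu;
        long size = 0;
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] buffer = new byte[81920];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                size += read;
                for (int i = 0; i < read; i++)
                    crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
            }
        }
        return Tuple.Create(size, crc ^ 0xFFFFFFFFu);
    }
}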

If collision resistance is a consideration, then you pretty much have to use one of the tried-and-true-yet-unbroken cryptographic hash algorithms. However, I would still concur with what Devils child wrote and include the file size as a separate (readily accessible) part of the identifier; if the sizes don't match, there is no chance that the file contents are the same, so in that case the computationally intensive hash calculation can be skipped.
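
A short sketch of that shortcut, with SHA-256 standing in as the "tried-and-true" hash (the method name is made up for illustration):

using System.IO;
using System.Linq;
using System.Security.Cryptography;

class SafeComparer
{
    static bool ProbablySame(string pathA, string pathB)
    {
        // If the sizes differ, the contents cannot match; skip the expensive hash.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        using (SHA256 sha = SHA256.Create())
        {
            byte[] hashA, hashB;
            using (FileStream a = File.OpenRead(pathA)) hashA = sha.ComputeHash(a);
            using (FileStream b = File.OpenRead(pathB)) hashB = sha.ComputeHash(b);
            return hashA.SequenceEqual(hashB);
        }
    }
}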

user