Is effective usage of hashes of changing audio (mp3) files possible

Question

I'm going to create a music library program, easy. Storing the information, easy.

I previously looked at another music library made in C#. Its author claimed that even if you move a file, on rediscovery the library will still know all the information about that file stored in the database (XML, SQL).

More info on rediscovery: when you move files, you have to get the music library to rediscover them, because its stored information (such as the file path) is now wrong. On rediscovery it finds the file, looks it up in the database, and updates any information.

I thought this was impossible, until now. If you hash a file and use that hash as the key, you can always check a file against that key to make sure it is the same one.

Please correct me if I'm wrong and confirm what I'm saying is true (that is the question).

1. The file path isn't used in hashing the file. (I don't know how to hash.)
2. Rehash after every ID3 tag write (changing the file changes the hash?).
3. Using the hash as a key/ID means that if the file is moved, it can still be matched to the information stored about it.
4. Once the information is read out of the XML file (if we're using XML as a database), storing it in a dictionary is the quickest and best way to have the contents in memory.

It is a question, it needs an answer, and it's about C#. I'm using C#, that's why it's specific. I'm doing background research and just wanted some expert opinion on what I've stated.

Your question is very unclear. What do you mean by "on rediscovery it will know all the information about that file retrieved from the database"? You *claim* this is a question, but it's very hard to see what the actual question is - or why you think it's specific to C#. — Jon Skeet, Apr 16 '13 at 09:13
A) I don't know how to hash, B) When you move files you have to get the music library to rediscover because its current information is wrong, such as the file path, on re discovery it will find the file, check it in the database, and update any information — Callum Linington, Apr 16 '13 at 09:14
I'm using c#, thats why it's specific, I'm doing background research, I just wanted some expert opinion on what i've stated — Callum Linington, Apr 16 '13 at 09:15
I don't want to test based on my knowledge, it could be wrong, then I've wasted time. I'm being efficient — Callum Linington, Apr 16 '13 at 09:17
Asking other people to do something for you isn't being efficient, it's being lazy... — James, Apr 16 '13 at 09:18
No, I'm not asking them to do things, rather to share the knowledge they already have. Point me in directions, I don't want to reinvent the wheel, therefore we ask other people to show them that wheel, it can then be modified by said person — Callum Linington, Apr 16 '13 at 09:19
@No1_Melman the point I am making here is you clearly have an idea on how to go about this why not test it out first and if you find it's not working for some reason then change your approach, as David mentioned this wouldn't take long to implement. Every application is different therefore it's difficult for someone to give you a concrete answer on whether this approach will work for you. — James, Apr 16 '13 at 09:20
Because that can waste time, if someone has already tried and tested, and if someone knows the answer to my questions its quicker for them to answer this question. If you don't know the answers, then you shouldn't be commenting because its a question - answer forum, no offence but I would rather people who are willing to help with the question rather then tell me how to ask questions etc to comment. — Callum Linington, Apr 16 '13 at 09:22
No offence taken, however, I don't help people who aren't prepared to help themselves. Good luck. — James, Apr 16 '13 at 09:31

Dariusz · Accepted Answer · 2013-04-16T09:50:09.997

Answering your questions

1. The file path should not be used when computing the hash, and neither should the file name or extension. Hash only the file's contents.
2. Rehashing after each ID3 tag write would solve your problem, provided that all changes are made through your application.
3. The hash can safely be used as a key for your purposes (see below).
4. Probably yes, if I understand you correctly.
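
A minimal sketch of the first point, assuming SHA-256 as the hash function (the thread does not prescribe one). The class and method names are made up for illustration. Only the file's bytes go into the hash, so moving or renaming the file does not change the result.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: names are illustrative, not from the thread.
public static class ContentHash
{
    // Returns a lowercase hex digest of the file's bytes.
    // The path is used only to open the file; it never enters the hash.
    public static string OfFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Two copies of the same file in different folders give the same string; changing a single byte (for example by editing a tag) gives a different one.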

Possibility of repeated hash value

Depending on the hashing function you choose, a search for another file with the same hash might succeed in a year, a millennium, or a billion years, or not before the end of the world.

It's all a matter of probabilities. Check details of each hashing function to learn how low the probability of finding another file with the same hash is.
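
As a rough worked estimate, the birthday bound gives the chance that any two of n files collide by accident. This assumes the hash behaves like a uniformly random b-bit function and that nobody is deliberately crafting collisions; the library size of one million files is an arbitrary example.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}}

n = 10^{6},\ b = 128:\quad \frac{10^{12}}{2^{129}} \approx 1.5 \times 10^{-27}

n = 10^{6},\ b = 256:\quad \frac{10^{12}}{2^{257}} \approx 4.3 \times 10^{-66}
```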

Problem of changed tags in mp3 files

While this may be a problem, what you can do is hash only the part of the file that is not the ID3 tag. ID3v1 tags occupy the last 128 bytes of the file; ID3v2 tags usually sit at the start of the file, and their header records the tag's size. Either way, the tags usually take up a very small percentage of the file.

So apply the hashing function only to the part of the file that will not change: skip the tag bytes at the start and/or end of the file when hashing.
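
A sketch of that idea, under some assumptions: SHA-256 as the hash, an ID3v2 tag (if present) at the very start of the file, and an ID3v1 tag (if present) in the last 128 bytes. Other tag formats such as APEv2 or Lyrics3 are not handled, and the whole file is read into memory for brevity. `AudioHash` and `OfAudioBytes` are made-up names.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: hashes an MP3 while skipping ID3v1/ID3v2 tag bytes.
public static class AudioHash
{
    public static string OfAudioBytes(string path)
    {
        byte[] data = File.ReadAllBytes(path); // fine for a sketch; stream large files instead
        int start = 0;
        int end = data.Length;

        // ID3v2 header: "ID3", 2 version bytes, 1 flags byte,
        // then a 4-byte "synchsafe" size (7 useful bits per byte).
        if (data.Length >= 10 &&
            data[0] == (byte)'I' && data[1] == (byte)'D' && data[2] == (byte)'3')
        {
            int size = ((data[6] & 0x7F) << 21) | ((data[7] & 0x7F) << 14)
                     | ((data[8] & 0x7F) << 7) | (data[9] & 0x7F);
            if ((data[5] & 0x10) != 0) size += 10; // ID3v2.4 footer flag
            start = Math.Min(10 + size, data.Length); // size excludes the 10-byte header
        }

        // ID3v1: exactly 128 bytes at the end of the file, starting with "TAG".
        if (end - start >= 128 &&
            data[end - 128] == (byte)'T' && data[end - 127] == (byte)'A' && data[end - 126] == (byte)'G')
        {
            end -= 128;
        }

        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(data, start, end - start);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

With this, editing a title or artist in either tag leaves the key unchanged, because only the bytes between the tags are hashed.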

@No1_Melman I edited and significantly extended the answer; hope it is more helpful. — Dariusz, Apr 16 '13 at 09:50
That is a very clever idea, skip the header and take the song portion because that should never change :) thanks for that — Callum Linington, Apr 16 '13 at 09:54

score 1 · Answer 2 · edited May 23 '17 at 11:45

Yes, if you hash the file contents, then even if the file moves somewhere else, it will still result in the same hash when you do it again. So yes, you can totally identify files based on their content’s hash value (this is what Git does for example). As for creating a hash of a file, there are several questions that will tell you how to do it, for example this one.

Note though that because of ID3 tags and the like, your files are not immutable, so hashing the whole file contents might not be the best idea after all. If you change the tags of a file, its hash will change, and the file will look like a new one (at least to your application). Of course, if you change the tags within your application, you can easily keep track of those changes and update the old record to use the new hash. The same idea applies to identifying the file by its path (if you move it within your application, you can just update its path in the database as well). The problem is that both of these actions are likely to happen outside of your application.

So both identification methods (hash of file contents, or file path) are somewhat flawed, but there is no real alternative for identifying the file.
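
A minimal sketch of the bookkeeping described above, assuming an in-memory dictionary keyed by the content hash. `TrackRecord`, `LibraryIndex`, and their members are hypothetical names, and persistence (XML or SQL) is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record type; a real library would store more fields.
public sealed class TrackRecord
{
    public string Path;
    public string Title;
}

// Hypothetical in-memory index keyed by content hash.
public sealed class LibraryIndex
{
    private readonly Dictionary<string, TrackRecord> byHash =
        new Dictionary<string, TrackRecord>();

    public void Add(string hash, TrackRecord record)
    {
        byHash[hash] = record;
    }

    public TrackRecord Find(string hash)
    {
        TrackRecord record;
        return byHash.TryGetValue(hash, out record) ? record : null;
    }

    // Rediscovery: a known hash turned up at a new location, so only the path changes.
    public bool TryRelocate(string hash, string newPath)
    {
        TrackRecord record;
        if (!byHash.TryGetValue(hash, out record)) return false;
        record.Path = newPath;
        return true;
    }

    // A tag edit made inside the application changed the hash: move the record to the new key.
    public bool Rekey(string oldHash, string newHash)
    {
        TrackRecord record;
        if (!byHash.TryGetValue(oldHash, out record)) return false;
        byHash.Remove(oldHash);
        byHash[newHash] = record;
        return true;
    }
}
```

Changes made outside the application bypass `Rekey`, which is exactly the weakness the answer points out: the file then shows up under an unknown hash and looks new.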

Thank you, very much appreciated, I've been thinking that might be the case, but as you said, there is no other alternative — Callum Linington, Apr 16 '13 at 09:36

score 1 · Answer 3 · answered Apr 16 '13 at 09:31

Hashing will work for you. It basically creates a checksum based on all bytes in the file. Using a good hash will give you a signature for each file that is effectively unique (there is more chance of winning the lottery five times in a row than of finding two different files with the same hash).

The problem is that you need to read the entire file to calculate the hash. This might hurt performance a bit.

So on rediscovery you might want to first check whether the file size matches a stored one. If it doesn't, there is no need to read the entire file and calculate the hash. But you need to store both the file size and the hash for that.
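
A sketch of that size pre-check, assuming the library keeps the set of known file sizes in memory. `Rediscovery`, `HashIfSizeKnown`, and the `hashFile` delegate are illustrative names; any content-hash function can be passed in.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for the rediscovery pass.
public static class Rediscovery
{
    // Returns the file's hash only if its size matches a size already in the library;
    // otherwise returns null without reading the file's contents.
    public static string HashIfSizeKnown(
        string path, ISet<long> knownSizes, Func<string, string> hashFile)
    {
        long size = new FileInfo(path).Length; // file system metadata, no content read
        return knownSizes.Contains(size) ? hashFile(path) : null;
    }
}
```

A `null` result means the file cannot be one of the stored tracks, so it can be treated as new without paying for a full read.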

Some info on hashing (using the MD5 method)

http://www.fastsum.com/support/md5-checksum-utility-faq/md5-hash.php

Thank you, just the answer I'm looking for — Callum Linington, Apr 16 '13 at 09:35
