2

I'd like to index files in a local database, but I do not understand how I can identify each individual file. For example, if I store the file path in the database, then the entry will no longer be valid if the file is moved or deleted. I imagine there is some way of uniquely identifying files no matter what happens to them, but I have had no success with Google.

This will be for *nix/Linux and ext4 in particular, so please nothing specific to Windows or NTFS or anything like that.

secretformula

4 Answers

7

In addition to the excellent suggestion above, you might consider using the inode number property of the files, viewable in a shell with ls -i.

Using index.php on one of my boxes:

ls -i

yields

196237 index.php

I then rename the file using mv index.php index1.php, after which the same ls -i yields:

196237 index1.php

(Note the inode number is the same)
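The same check can be done from a program. Below is a minimal Python sketch, not part of the original answer, that reads the inode number with os.stat and pairs it with the device ID, since inode numbers are only unique within a single filesystem. The helper name file_identity and the temporary file names are made up for illustration.

```python
import os
import tempfile


def file_identity(path):
    """Return (device ID, inode number) for path (hypothetical helper)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


# Demonstration in a throwaway directory: the identity survives a rename.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "index.php")
    new_path = os.path.join(tmp, "index1.php")
    with open(old_path, "w") as f:
        f.write("<?php echo 'hi'; ?>\n")

    before = file_identity(old_path)
    os.rename(old_path, new_path)  # same effect as: mv index.php index1.php
    after = file_identity(new_path)

    # Appending in place changes the content but not the identity.
    with open(new_path, "a") as f:
        f.write("// edited\n")
    after_edit = file_identity(new_path)

    print(before == after == after_edit)
```

One caveat to keep in mind: some editors save by writing a new file and renaming it over the old one, in which case the path ends up pointing at a different inode.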

chucksmash
  • +1 learn something new every day, didn't know you could do that. This would definitely be smaller in the database than storing the entire path. – secretformula Aug 21 '12 at 17:34
  • Aha, I had a general idea that this existed but I couldn't get Google to give me anything relevant. Thank you very much, I think this is exactly what I need. I'm satisfied that moving and modifying the content of a file will not change its inode number. Off to find some relevant C libraries! – user1614885 Aug 21 '12 at 18:01
  • 4
    The inode number alone is not enough; the device ID also has to be taken into account, otherwise you could have collisions: http://stackoverflow.com/a/1289864/246207 – JPvdMerwe Jan 23 '13 at 17:11
2

Try using a hashing scheme such as MD5, SHA-1, or SHA-2; these will allow you to match files by content.

Basically, when you first create the index, you hash all the files that you wish to add. The resulting hash string is a reliable indicator of whether two files have the same content. Then, when you need to see whether a file is already in the index, hash it and compare the generated hash to your table of known hashes.
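A rough Python sketch of that workflow follows. It is only an illustration: the function names sha256_of and build_index are invented here, SHA-256 (a member of the SHA-2 family) is picked arbitrarily, and a plain dict stands in for the database table.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Hash a file's content in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(paths):
    """Map content hash -> path; a dict stands in for the database table."""
    return {sha256_of(p): p for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    with open(a, "w") as f:
        f.write("hello\n")

    known = build_index([a])

    # A moved file still hashes to a known value.
    os.rename(a, b)
    found_after_move = sha256_of(b) in known

    # An edited file no longer matches its old hash.
    with open(b, "a") as f:
        f.write("changed\n")
    found_after_edit = sha256_of(b) in known

    print(found_after_move, found_after_edit)
```

The second lookup failing is the limitation raised in the comments: a hash identifies content, so any edit makes the file look like a new one.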

EDIT: As was said in the comments, it is a good idea to incorporate both pieces of data so that you can track changes more accurately.

secretformula
  • 1
    +1. You probably want to do something involving both filename and hashing though, not just hashing. A hashing only solution would let the system recognize the same file in a different place but prevent it from recognizing the same file after it has been edited. – chucksmash Aug 21 '12 at 17:26
  • Thanks for your answer. Unfortunately I don't think hashing the files is appropriate to my needs. When I said "uniquely identifying files no matter what happens to them" I meant content changes as well; I apologize for not being clear. – user1614885 Aug 21 '12 at 17:57
  • I see what you mean, the inode way is the way to go then – secretformula Aug 21 '12 at 18:02
0

If you do not consider files with the same content to be the same file, and only want to treat moved or renamed files as the same, then using the inode number will do. Otherwise you will have to hash the content.

Oleg V. Volkov
0

The only fly in the ointment with inodes is that they can be reassigned after a delete (depending on the platform), so you need to record the file creation timestamp as well as the device ID to be 100% sure. It's easier with Windows and its user file attributes.

HKalsi