4

I am investigating ways to take a directory (folder) and derive some form of unique numerical identifier. I have investigated "string to hash" methods, however, the Pigeon Hole Principle means that one can never derive a truely unique number for every single string.

String to unique hash is no good.

I have recently been investigating other means of achieving my goal and thus have the following question to ask:

Directory time stamps - how 'unique' are they? To what resolution are the time stamps reported by 'stat' as described here (second post)? if the resolution is small enough, is it possible for more than one folder to share the exact same time stamp on a Linux system?

If anyone has other methods/techniques they'd like to share, I'd be happy to listen :)

Edit 1 To clarify my use case in response to the answers posted so far: I am working on Android platforms, so the filesystem is not linked to any other (except of course for removeable media such as Micro SD cards).

I am inserting each path into a database but trying to avoid string comparisons when querying the table. The use of maps/hashmaps is not an option here. Yes, the path itself is unique, but ideally I need a numerical identifier that can be used to query the table as opposed to the path itself. The identifier must also be unique per path. I have experimented with std::collate but found there were many collides in the hashes (a dataset of 20, 000 paths yeilds approximatley 100 collides). What was even more surprising is that the hashes appeared to be largely different each time my application is run. I wonder if it's seeded somehow?

Many thanks, P

protectedmember
  • 349
  • 5
  • 20
  • Presumably each folder's absolute path is described by a unique sequence of characters. Or do you have to allow for duplicates? – juanchopanza Sep 02 '12 at 17:36
  • Are all directories on the same volume? – Ignacio Vazquez-Abrams Sep 02 '12 at 17:37
  • @juanchopanza, that wouldn't quite qualify for a numerical identifier, in the sense it's often understood. Timestamps do not fulfill the requirements, either, as you can set them to whatever you want (`stat` only reports them by the second, whatever the resolution of the FS). – eq- Sep 02 '12 at 17:38
  • Possibly inode numbers, but that would afaik require a single file system to be unique. – Joachim Isaksson Sep 02 '12 at 17:39
  • @eq- but one could be constructed from the path, given a big integer library. – juanchopanza Sep 02 '12 at 17:39
  • `st_dev` and `st_ino` would be a sort of unique combination locally on a system (bind mounts introduce a possible catch I think) but not globally and not necessarily preserved across boots. – Mat Sep 02 '12 at 17:42
  • 1
    Although in theory, you could have two paths hash to the same value, in practice this will never happen if you're using a decent hash. Is there some reason that the theoretical concern is trumping the practical one here? – David Sep 02 '12 at 17:46
  • 1
    Another question: is there a size limit on the "numerical identifier"? – p.marino Sep 02 '12 at 17:54
  • @juanchopanza: linux.die.net/man/2/stat Here it states that linux kernels from 2.5.48 report fike timestamps down to nanosecind resolution. – protectedmember Sep 08 '12 at 17:58

3 Answers3

6

On any UNIX-based system, you can use the inode number as a unique identifier within that file system. Combining it with the device number will make it unique within the machine. If you wanted it to be globally unique, you could throw in the system's primary MAC address.

Keep in mind, however, that:

  1. The inode number will "follow" the directory if it is moved or renamed. It will change if the directory is deleted and replaced.

  2. The inode number will not be stable across systems, beyond one or two really special directories. (For instance, / is usually inode 2.)

1

+1 duskwuff, good one!

Another way is to simply treat the dir's path as a number ("BigInt").

Take this dir for example: /opt/www/log
It's 12 chars long.
12 * 8bits = 96bits
Thus you have a 96 bits long number,which you can represent in hex/base64/anything (in case you need to pass it as an HTML link).

I'd personally go for duskwuff's approach though.

Poni
  • 11,061
  • 25
  • 80
  • 121
  • P.S You might save a few bits if you can know for certain that some chars are never going to be in the path. – Poni Sep 02 '12 at 17:50
0

I think it depends very much on the purpose of why you want a unique numerical identifier. Timestamps can change, inodes can change, disknumbers can change, MAC adresses can change. (Still, +1 for duskwuff)

In some scenarios you can simply make a table, where each path you add gets a new, unique number, just like a numerical key column in a database.

Although hashes can collide, in every actual environment this is absolutely unlikely (if you don't use the lousiest algorithm around...) It is much more likely that you get errors due to flaws in your implementation, for example you treat "/tmp" differently than "/tmp/", because you don't normalize the paths before hashing them. Or, you want to distinguish physical folders, but forget to check for hardlinks and symlinks to the same folder, thus you get multiple hashes / id's for the same dir.

Again, depending on your usecase, a collision is not necessarily fatal: if you find that a new path results in the same hash as an existing one (will not happen!) you can still react on that case. (*)

Just to help your imagination: If you use a 64 bit hash, you could fill 150 000 000 of 1TB hard disk drives with empty folders (nothing on them but short folder names...) and then you will have a collision for sure. If you think, that's too risky (blink, blink), use a 128 bit hash which makes it 18 446 744 073 710 000 times less likely.

Hashes are designed to make collisions unlikely, and even good old MD5 will do its job well if nobody willingly tries to produce a collision.

(*) Edit: Your pigeonhole article already points that out: a collision just mean that lookup is no longer O(1), but slightly slower. As it rarely ever happens, you can easily live with that. If you use a std::map (no hash) or std::hashmap, you'll not have to worry about collisions. See what the difference between map and hashmap in STL

Community
  • 1
  • 1
Daniel S
  • 456
  • 4
  • 17