How do I convert a directory path to a unique numerical identifier (Linux/C++)?

Question

I am investigating ways to take a directory (folder) and derive some form of unique numerical identifier. I have investigated "string to hash" methods, however, the Pigeon Hole Principle means that one can never derive a truely unique number for every single string.

String to unique hash is no good.

I have recently been investigating other means of achieving my goal and thus have the following question to ask:

Directory time stamps - how 'unique' are they? To what resolution are the time stamps reported by 'stat' as described here (second post)? if the resolution is small enough, is it possible for more than one folder to share the exact same time stamp on a Linux system?

If anyone has other methods/techniques they'd like to share, I'd be happy to listen :)

Edit 1 To clarify my use case in response to the answers posted so far: I am working on Android platforms, so the filesystem is not linked to any other (except of course for removeable media such as Micro SD cards).

I am inserting each path into a database but trying to avoid string comparisons when querying the table. The use of maps/hashmaps is not an option here. Yes, the path itself is unique, but ideally I need a numerical identifier that can be used to query the table as opposed to the path itself. The identifier must also be unique per path. I have experimented with std::collate but found there were many collides in the hashes (a dataset of 20, 000 paths yeilds approximatley 100 collides). What was even more surprising is that the hashes appeared to be largely different each time my application is run. I wonder if it's seeded somehow?

Many thanks, P

Presumably each folder's absolute path is described by a unique sequence of characters. Or do you have to allow for duplicates? — juanchopanza, Sep 02 '12 at 17:36
@juanchopanza, that wouldn't quite qualify for a numerical identifier, in the sense it's often understood. Timestamps do not fulfill the requirements, either, as you can set them to whatever you want (`stat` only reports them by the second, whatever the resolution of the FS). — eq-, Sep 02 '12 at 17:38
Possibly inode numbers, but that would afaik require a single file system to be unique. — Joachim Isaksson, Sep 02 '12 at 17:39
@eq- but one could be constructed from the path, given a big integer library. — juanchopanza, Sep 02 '12 at 17:39
`st_dev` and `st_ino` would be a sort of unique combination locally on a system (bind mounts introduce a possible catch I think) but not globally and not necessarily preserved across boots. — Mat, Sep 02 '12 at 17:42
Although in theory, you could have two paths hash to the same value, in practice this will never happen if you're using a decent hash. Is there some reason that the theoretical concern is trumping the practical one here? — David, Sep 02 '12 at 17:46
Another question: is there a size limit on the "numerical identifier"? — p.marino, Sep 02 '12 at 17:54
@juanchopanza: linux.die.net/man/2/stat Here it states that linux kernels from 2.5.48 report fike timestamps down to nanosecind resolution. — protectedmember, Sep 08 '12 at 17:58

score 6 · Accepted Answer · answered Sep 02 '12 at 17:42

6

On any UNIX-based system, you can use the inode number as a unique identifier within that file system. Combining it with the device number will make it unique within the machine. If you wanted it to be globally unique, you could throw in the system's primary MAC address.

Keep in mind, however, that:

The inode number will "follow" the directory if it is moved or renamed. It will change if the directory is deleted and replaced.
The inode number will not be stable across systems, beyond one or two really special directories. (For instance, / is usually inode 2.)

answered Sep 02 '12 at 17:42

1

Two different paths that point to the same in inode would end up with the same identifier. Don't know if that's acceptable here. – Pete Becker Sep 02 '12 at 18:58
Thank you, duskwuff. I appreciate your input! I have updated my question to clarify my use case. – protectedmember Sep 06 '12 at 17:34

score 1 · Answer 2 · answered Sep 02 '12 at 17:50

1

+1 duskwuff, good one!

Another way is to simply treat the dir's path as a number ("BigInt").

Take this dir for example: /opt/www/log
It's 12 chars long.
12 * 8bits = 96bits
Thus you have a 96 bits long number,which you can represent in hex/base64/anything (in case you need to pass it as an HTML link).

I'd personally go for duskwuff's approach though.

answered Sep 02 '12 at 17:50

Poni

11,061
25
80
121

P.S You might save a few bits if you can know for certain that some chars are never going to be in the path. – Poni Sep 02 '12 at 17:50

score 0 · Answer 3 · edited May 23 '17 at 11:53

I think it depends very much on the purpose of why you want a unique numerical identifier. Timestamps can change, inodes can change, disknumbers can change, MAC adresses can change. (Still, +1 for duskwuff)

In some scenarios you can simply make a table, where each path you add gets a new, unique number, just like a numerical key column in a database.

Although hashes can collide, in every actual environment this is absolutely unlikely (if you don't use the lousiest algorithm around...) It is much more likely that you get errors due to flaws in your implementation, for example you treat "/tmp" differently than "/tmp/", because you don't normalize the paths before hashing them. Or, you want to distinguish physical folders, but forget to check for hardlinks and symlinks to the same folder, thus you get multiple hashes / id's for the same dir.

Again, depending on your usecase, a collision is not necessarily fatal: if you find that a new path results in the same hash as an existing one (will not happen!) you can still react on that case. (*)

Just to help your imagination: If you use a 64 bit hash, you could fill 150 000 000 of 1TB hard disk drives with empty folders (nothing on them but short folder names...) and then you will have a collision for sure. If you think, that's too risky (blink, blink), use a 128 bit hash which makes it 18 446 744 073 710 000 times less likely.

Hashes are designed to make collisions unlikely, and even good old MD5 will do its job well if nobody willingly tries to produce a collision.

(*) Edit: Your pigeonhole article already points that out: a collision just mean that lookup is no longer O(1), but slightly slower. As it rarely ever happens, you can easily live with that. If you use a std::map (no hash) or std::hashmap, you'll not have to worry about collisions. See what the difference between map and hashmap in STL

Thank you, Daniel. I have updated my question to better describe my use case. — protectedmember, Sep 06 '12 at 17:35

How do I convert a directory path to a unique numerical identifier (Linux/C++)?

3 Answers3

Linked