Looking for a good 64 bit hash for file paths in UTF16

Question

I have a Unicode / UTF-16 encoded path. the path delimiters is U+005C '\'. The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"

I want to hash this path into a 64-bit unsigned integer. It does not need to be "cryptographically sound". The hashes should be case insensitive, but able to handle non-ascii letters. Obviously, the hash also should scatter well.

There are some ideas that I had though of:

A) Using the windows file identifier as a "hash". In my case i do want the hash to change if the file gets moved, so this is not an option.

B) Just use a regular sting hash: hash += prime * hash + codepoint for the whole string.

I do have the feeling that the fact that the path consists of "segements" (folder names and the final file name) can be leveraged.

To sum up the needs:

1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive

The answers below are adequate - but I was hoping to have a hash that leverages the fact that the input is a utf-16 file path — Dominik Weber, Sep 18 '10 at 00:24
To cryptographic hashes, it virtually makes no difference whether it is UTF-16 or any other encoding because they are *designed* to be unpredictable and use *all* the information provided by the input in *every* single bit of the resulting hash for a perfect distribution with minimal collision (at least in theory - the more secure a hash is, the more it is believed to satisfy that), which is why you could use any part of that hash. — Arc, Sep 18 '10 at 15:14
but there are gaps in the UTF-16 code-space the private use are and either upper or lower case characters doe not get used. Or is it pointeless to worry about the structure of the input data? — Dominik Weber, Sep 20 '10 at 18:23
In theory, it should not matter to a good crypto-hash function. Of course there are probably hidden dependencies on the structure of the hashed data but these should hardly be observable for your application as otherwise I guess that someone would have found that out while examining the hash functions, and that would probably have killed the hash right away for crypto uses. Secure hashes are believed to be the best you can get for a perfect distribution. — Arc, Sep 21 '10 at 06:28
The following page has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: http://www.partow.net/programming/hashfunctions/index.html — , Jan 01 '11 at 10:49

Jon Hanna · Answer 1 · 2010-09-23T09:58:04.357

3

I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:

ui64 res = 10000019;
for(i = 0; i < len; i += 2)
{
  ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
  res = res * 8191 + merge; // unchecked arithmetic
}
return res;

I'm assuming that path[i + 1] is safe on the basis that if len is odd then in the last case it will read the U+0000 safely.

I wouldn't make use of the fact that there are gaps caused by the gaps in UTF-16, by lower-case and title-case characters, and by characters invalid for paths, because these are not distributed in a way to make use of this fact something that could be used speedily. Dropping by 32 (all chars below U+0032 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing too much either.

edited Sep 23 '10 at 09:58

answered Sep 22 '10 at 16:19

Jon Hanna

110,372
10
146
251

This is not bad - it avoids having to capitalize the whole string - and the null termination trick is neat. – Dominik Weber Sep 22 '10 at 22:03
Well, it essentially capitalises the whole string in the UCase bit. That bit of pseudo code is meant to mean a capitalisation. Whether its done as a string or char-by-char is of no matter (including as efficiency) assuming I'm remembering correctly that the file case folding is char-by-char and culture-agnostic. I could be wrong in that memory. In either case you'd want to use the same method that windows file sys uses. – Jon Hanna Sep 22 '10 at 23:22

Arc · Accepted Answer · 2010-09-15T20:40:31.527

2

Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.

You could use a sub-string of such a hash, e.g. MD5 on your path, previously converted to lower case so that the hash is effectively case-insensitive (requires that you use a method for lower-casing which knows how to convert all UTF-16 non-standard characters that may occur in the file system).

Cryptographically secure hashes have the benefit of quite even distribution no matter which sub-string part you take because they are designed to be non-predictable, i.e. each part of the hash ideally depends on the entire hashed data as any other part of it.

edited Sep 15 '10 at 20:40

answered Sep 15 '10 at 20:34

Arc

11,143
4
52
75

Thanks - but MD5 and just about any other "cryptographic hash" I know of are larger than 64 bit. I'd have to collapse the key space for this. I do have a method that is Unicode 5.1 based to lowercase strings that is quite fast. – Dominik Weber Sep 15 '10 at 21:41
Yes, like I said, you'd have to take a 64-bit sub-string of MD5 then. Alternatively, take a look at this question: http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings – Arc Sep 15 '10 at 21:55
Yes that would work - I was confused with the "sub-sting" - I thought that referred to the file path, not the resulting hash. – Dominik Weber Sep 18 '10 at 00:28
Selected this beacuse this seems the best solution for me of and the good follow-up in the commments! Thanks! – Dominik Weber Sep 23 '10 at 21:46

score 2 · Answer 3 · edited Oct 07 '21 at 06:06

2

Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, then a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 millions messages par second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.

For the specific properties you are after, you would need:

To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.

There are MD4 implementations available in may places, including right in the RFC 1320 I link to above. You may also find opensource MD4 implementations in C and Java in sphlib.

edited Oct 07 '21 at 06:06

Community

1
1

answered Sep 16 '10 at 13:56

Thomas Pornin

72,986
14
147
189

Yeah, crypto-hashes need not be bad per se, hence I suggest doing some benchmarks for evaluation if necessary. MD4 support may be phased out for some crypto-libraries as it is considered insecure, therefore the widely implemented MD5 may be more future-proof regarding portability and compatibility. Windows supports Unicode file system names as far as I know for NTFS and VFAT (UTF-16; according to Wikipedia, NTFS uses 16 bit characters which may - but need not - be UTF-16). http://en.wikipedia.org/wiki/NTFS – Arc Sep 16 '10 at 20:04
actually every NTFS volume has a uppercase map that the file system driver uses for case-insensitive comparisons. – Dominik Weber Sep 17 '10 at 18:46
Regarding speed, the hashing code used by Java in that related question I linked to might be a sufficiently good one as well for *compiled* languages (includes C/C++ and Java) - for dynamically interpreted languages (e.g. PHP, Perl), a hash function provided by the language's function library (which usually is in native machine code) may be faster than a function you provide in the interpreted language (also depends on the length of the input). http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings/1660613#1660613 – Arc Sep 18 '10 at 15:20

reinier · Answer 4 · 2016-11-01T14:29:54.637

1

You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then use .GetHashCode() in the path, like this:

Hash = fullPath.GetHashCode();

or

int getHashCode(string uri) 
{
   if (uri == null) throw new ArgumentNullException(nameof(uri));

   FileInfo fileInfo = new FileInfo(uri);
   return fileInfo.FullName.GetHashCode();
}

Altough this is just a 32 bits code, you duplicate it or append another HashCode based on some other characteristics of the file.

edited Nov 01 '16 at 14:29

answered Nov 01 '16 at 14:24

reinier

27
6

2

The OP is looking for a hash algorithm and you are proposing to embed the GetHashCode C# function in a shared library? – arainone Feb 20 '18 at 12:27

Looking for a good 64 bit hash for file paths in UTF16

4 Answers4