1

I am using a TClientDataSet to manage objects and to give me 'database' access to the object data. So far this works well. I have two 'special' (hidden) fields within the dataset - 'ObjectName' and 'ObjectRef'. ObjectName is a conjunctions of the category and name of the object item in the form My category.my object' and is used to get me from inside an object instance to the record number. This field is indexed. 'ObjectRef' is an integer typcast of a pointer to that object's instance and is used for all other object lifetime management.

I have had to choose a size for the 'ObjectName' field in which to fit my expected max possible category and name combination, but this is only an index and I would like to keep this size as small as possible for performance and memory reasons. Is there a 'lossless' function that I can apply to my form 'My category.my name' which would still be unique that I can use as a hash? Hash functions look clever but not being a computer science guru I am never sure how to know whether their output is unique.

Thanks

RRUZ
  • 134,889
  • 20
  • 356
  • 483
Brian Frost
  • 13,334
  • 11
  • 80
  • 154
  • 4
    You are asking if it is possible to store an infinite number of objects in a finite number of slots. The answer is no. – David Heffernan Jul 26 '12 at 20:19
  • @David. I dont think I am. I am prepared to allocate whatever length of field is required - and I accept that this will be a limitation - but I am merely hoping to 'compress' my bland reference of 'My category.my name' into a unique and more compact - and therefore preumably more efficent - indexable string. – Brian Frost Jul 27 '12 at 09:04
  • You need to also place a limit on the length of the My category.my name values. Then you can use them as is. – David Heffernan Jul 27 '12 at 09:33
  • Do you really need the "hash" to be reversible, or are you just worried about collisions? – Marcus Adams Jul 27 '12 at 13:05
  • @Brian: Instead of hashing, it sounds like you need auto-incremented identity values, and use id (integer) in your database design instead of primary-keys-as-strings. Of course, these don't exist in client datasets, but maybe you should think about using a real DB layer maybe SQLite, for these temporary tables. – Warren P Jul 27 '12 at 21:54
  • @marcus Good point. The latter - no need to be reversible. I had not realised the subtlety. – Brian Frost Jul 28 '12 at 13:05

2 Answers2

6

All the hash functons has risk of collitions, but AFAIK one of the more secure is the SHA-1 algorithm, exist many delphi implementations, for example you can use the Jwscl library (JEDI Windows Security Code Lib) which is a wrapper for the Windows CryptoAPI (you can find a delphi sample on this question SHA1 hashing in Delphi XE) or use the TIdHashSHA1 class which is part of Indy.

Another alternative is use more simple hash function (non-cryptographic) like the Jenkins hash function which delphi implements in the BobJenkinsHash method.

Community
  • 1
  • 1
RRUZ
  • 134,889
  • 20
  • 356
  • 483
  • -1 listing lots of hash algos doesn't address the question that was asked – David Heffernan Jul 26 '12 at 20:24
  • 1
    @DavidHeffernan I think which my answer to the question `Is there a Delphi XE2 string hash function guaranteed to be unique that I can use for lookup?` is usefull, I just mention 2 algorithms (SHA1 and Jenkins) and explain wich all the hash functions has risk of collitions. – RRUZ Jul 26 '12 at 20:29
  • @DavidHeffernan, please explain, Why do you say which none of these algorithms can be used to generate a hash and store the value in the `ObjectName` field? – RRUZ Jul 26 '12 at 20:39
  • 1
    Hash is no good. Brian wants a function that maps arbitrarily long string to a finite length value. And with the property that different input values always lead to different output values. Obviously no such function can exist. – David Heffernan Jul 26 '12 at 20:44
  • @DavidHeffernan I believe which the OP wants to store hash of a string (category + name combination) in the `ObjectName` field and then use this generated value as a lookup. So I will wait for the OP comments to check if my interpretation of his question is correct, if not i will delete my answer :). Also feel free of add your own answer with you own interpretation of the question. – RRUZ Jul 26 '12 at 20:53
  • The key is the uniqueness in the question title. And look at the comments to the other answer. Brian wants to hash without collision. My answer is my comment to the question. – David Heffernan Jul 26 '12 at 20:56
  • @DavidHeffernan I dont need the hash to be a fixed length. I suppose I was thinking that some function might exists where an 80-character string encodes to a (say) 34 character hash. – Brian Frost Jul 27 '12 at 08:58
  • @Brian That makes no sense. Try to be precise. – David Heffernan Jul 27 '12 at 09:34
  • The OP clarified that the "hashes" do not need to be reversible, or unique, just collision free. SHA1 should suffice for this. There are no known collisions for SHA1. If there were a collision found for a plain text string, it would not resemble the original string at all. It would not be human readable, or not resemble a plain text string, or it would be far too long to resemble the original. – Marcus Adams Jul 30 '12 at 12:14
  • @MarcusAdams I am confused by your words. Are you saying that if I encode any unique string with SHA1 if will produce a unique output that will be more compact than my string? – Brian Frost Aug 04 '12 at 06:00
2

No. By definition hash functions results are not unique.

You probably need to make a local list to track ObjectNames in your application and associate unique index with every object that is added, so then you could store it in DB instead of ObjectName. Or assign globally unique indexes to your objects upon creation (e.g. UInt64)

Kromster
  • 7,181
  • 7
  • 63
  • 111
  • 2
    Hash and compression are different tasks by nature. Hash is aimed at fast comparison, its output length is always constant. Compression is aimed at reducing size, but its output has no size restrictions. With short strings you may often get package size even bigger than original string. – Kromster Jul 26 '12 at 10:40
  • @KromStern Fuzzy Hashes do not have constant length: http://fuzzyhashing.codeplex.com/ – A Lombardo Jul 26 '12 at 20:52
  • @user Brian's DB has fixed length fields. – David Heffernan Jul 26 '12 at 21:01
  • @user582118: Fuzzy hashes are used in different context – Kromster Jul 27 '12 at 05:33
  • @David Yes DB fields are fixed, but my hash would have two perceived benfits - I would (somewhat) reduce the field length but the indexing would improve. – Brian Frost Jul 27 '12 at 09:01
  • I don't think uniqueness is the issue. No one to this date has found a hash collision even for SHA1. The larger issue is that hashes are not "lossless". They're one way, by definition. – Marcus Adams Jul 27 '12 at 12:59