TDictionary Hashing is broken for records of strings

Question

This reproduces the problem :

program Project1;

{$APPTYPE CONSOLE}

uses
  Generics.Collections;
type
  TStringRec = record
    s1 : string;
    s2 : string;
  end;
  TGetHash<TKey,TValue> = class(TEnumerable<TPair<TKey,TValue>>)
    public
    type
      TItem = record
        HashCode: Integer;
        Key: TKey;
        Value: TValue;
      end;
      TItemArray = array of TItem;
    public
    FItems: TItemArray;
  end;
var
  LCrossRef : TDictionary<TStringRec, integer>;
  LRec : TStringRec;
  i : integer;
begin
  LCrossRef := TDictionary<TStringRec, integer>.Create();
  LRec.s1 := 'test1';
  LRec.s2 := 'test2';
  LCrossRef.Add(LRec, 1);
  LRec.s1 := 'test1';
  LRec.s2 := 'test2';
  if LCrossRef.TryGetValue(LRec, i) then begin
    writeln('ok');
  end else begin
    LCrossRef.Add(LRec, 1);
    for i := Low(TGetHash<TStringRec, integer>
                (LCrossRef).FItems)
          to High(TGetHash<TStringRec, integer>
                (LCrossRef).FItems) do
      WriteLn(TGetHash<TStringRec, integer>(LCrossRef).FItems[i].HashCode);
    WriteLn('not ok');
  end;
  ReadLn;
end.

The dictionary fails to retrieve the item and generates a different HashCode for records containing identical strings.

This is partially noted in QC-#122791 but the workaround to use packed records does not work for records of strings (at least the above example also fails when TStringRec is declared as packed record).

Is there a sensible workaround to this?

My current strategy is to concatenate the strings that would have otherwise gone into the record and use a TDictionary<string, TValue> instead, but that is naturally unsatisfying.

How about using a TObjectDictionary with a custom IEqualityComparer that implements GetHashCode for your specific type? — Ville Krumlinde, Oct 02 '14 at 12:44
@VilleKrumlinde You don't mean `TObjectDictionary`. There is no object ownership here. — David Heffernan, Oct 02 '14 at 12:57
This answer may be of use and save you some code: https://stackoverflow.com/questions/27820171/delphi-using-records-as-key-in-tdictionary/64744711#64744711 — JoelFan, Nov 09 '20 at 01:19

David Heffernan · Accepted Answer · 2014-10-02T13:08:29.790

7

This is a known limitation, that is by design. The default comparers and hashers for records are only intended to work for pure value type records, and for records that have no padding.

The designers could have opted to use RTTI to compare/hash records. However, they opted not to do so. Some obvious plausible reasons for that choice are:

They did not wish to force the use of RTTI on the unwillling.
There is a significant performance hit incurred from the use of RTTI.

The way to deal with this is to supply your own comparers and hashers when using the generic collections.

Your current strategy of concatenating strings won't work. Consider 'a' and 'aa', and then 'aa' and 'a'. To use a text based approach you'd want to serialize the record to, say, JSON.

edited Oct 02 '14 at 13:08

answered Oct 02 '14 at 12:53

David Heffernan

601,492
42
1,072
1,490

I agree this makes sense for reference types in general. Being special and compiler managed, as strings are, however, I would have expected at least they would be hashed by default in a sensible way (this doesn't seem extraordinary or impossible). Custom comparers and hashers it is. Agreed that concatenation is a rubbish strategy - in my case it was a hack that, by nature of the specific problem, is virtually guaranteed to not conflict in the way you've noted. It was a temporary workaround until I settled on a better strategy. – J... Oct 02 '14 at 13:08
How would that be implemented? The system would have to be able to generate RTTI based comparers. Not trivial I judge. – David Heffernan Oct 02 '14 at 13:09
Perhaps. I admittedly haven't thought it through with any rigour. In any case, the point is moot. It will need to be another solution. – J... Oct 02 '14 at 13:12
3

The answer to my question here will be useful: http://stackoverflow.com/questions/11294686/what-is-the-canonical-way-to-write-a-hasher-function-for-tequalitycomparer-const The answer deserves more upvotes than my solitary one. – David Heffernan Oct 02 '14 at 13:21
Very nice - that's going in my reference list! – J... Oct 02 '14 at 13:23
1

I was really confused when I read that answer just now because I could not understand how mjn had channeled my coding style! But it turns out that I added the code in an earlier edit. The version in my code base has this comment: see http://stackoverflow.com/questions/1646807 and http://stackoverflow.com/questions/11294686 Jon Skeet's answer to the first question is also useful – David Heffernan Oct 02 '14 at 13:28
FWIW, I think they really didn't choose to use RTTI because of argument 2: the performance hit. It would totally undo the performance gain due to a good hash function. But I like the answers using `value := value * 37 + c` repeatedly. – Rudy Velthuis Oct 02 '14 at 20:40
@RudyVelthuis But there already exists a perfectly useful string hash in the generics library. Indeed, creating a `TDictionary` works just perfectly fine. Since it is also possible to inject a hash function in a custom comparer when the dictionary is created I would think that you would only need to use RTTI when initially creating the `TDictionary` instance. – J... Oct 03 '14 at 10:45
@RudyVelthuis The performance would naturally decrease slightly since a sane hash for records containing strings would need to compute one hash for every string, plus one for the whole record (replacing the string pointers with their respective hashes), but it wouldn't require RTTI calls each time... unless I'm missing something. – J... Oct 03 '14 at 10:49
@J... How would the code be generated that would implement a hasher that way? Either there would be a single function that used RTTI. Or the compiler would have to generate hashers/comparers for every record type. Either could be done but both would be significant developments. The latter probably performs better, but is also harder to implement. I guess they decided to let the users do the work if needed. An RTTI based approach is not too hard. I bet Spring4D has an implementation. – David Heffernan Oct 03 '14 at 13:35
@DavidHeffernan I wouldn't expect a fully general implementation, surely, but strings are so ubiquitous that it would be nice to support them (and not *terribly* difficult) in records of otherwise strictly value types. All the RTTI you would need would be to inspect the record type for strings, fall back to `TComparer.Default` if any other reference types exist, and otherwise build a comparer/hasher using a derived record type with string hashes replacing the string pointers. It seems a pretty straightforward substitution. – J... Oct 03 '14 at 13:48
@DavidHeffernan In fact, thinking about it, it wouldn't be too much harder to generally support records of any reference types that have defined default comparers in the same way. Binary hash the record replacing any reference type pointers with their respective `TComparer.Default` hashes... maybe I'll work on this as a side project. – J... Oct 03 '14 at 13:52
That RTTI would be hugely costly though. And I doubt you need to do this yourself. I'd be astounded if Spring4D did not already have this covered. – David Heffernan Oct 03 '14 at 13:53
@DavidHeffernan But you'd only need to do it when the `TDictionary` is initially created, no? Once you've built the comparer/hasher you're done with RTTI, yes? I haven't used Spring4D... I suppose I should check it out. – J... Oct 03 '14 at 13:54
Well, the design for default comparers is that they can be created very quickly on demand. Look at that code in `Generics.Defaults` that uses those hand crafted interfaces. Anyway, the debate is a little moot. Only the designers really know. I'm just making (educated) guesses. – David Heffernan Oct 03 '14 at 13:58

score 2 · Answer 2 · answered Oct 02 '14 at 19:45

To expand on David's answer using an example from my codebase. I have a dictionary

  Records: TDictionary<TGazetteerRecord,TGazetteerRecord>

which is instantiated

  Records := TDictionary<TGazetteerRecord,TGazetteerRecord>.Create(InitCapacity, TGazRecordComparer.Create);

What makes this work is having a custom comparer in the construction of the dictionary.

TGazRecordComparer = class(TEqualityComparer<TGazetteerRecord>)
private
public
  function Equals(const Left, Right: TGazetteerRecord): Boolean; override;
  function GetHashCode(const Value: TGazetteerRecord): Integer;  override;
end;

Th implementation for this then replaces the default code for a record type. My example actually uses a class rather than a record but I don't see why this shouldn't work perfectly fine with a record type. Note that the comparer class is reference counted and therefore will be automatically disposed of when the dictionary is destroyed.

I don't see the point of this. Where's the implementation. You just seem to be telling us that you've implemented a comparer. FWIW the normal way to do so is to call the Construct class method. — David Heffernan, Oct 02 '14 at 22:50

TDictionary Hashing is broken for records of strings

2 Answers2