6

The C# the generic HashSet<T> search performance should be O(1), and the search performance of an ObservableCollection<T> should be O(n).

I have a large amount of unique elements, each element has a DateTime property that is not unique.

Each element calculates its HashCode by simply returning its DateTime.GetHashCode().

Now I want to get a subset of my data, e.g. all elements that have a date which is between March 2012 and June 2012.

    var result = from p in this.Elements
                 where p.Date >= new DateTime(2012, 03, 01) &&
                       p.Date <= new DateTime(2012, 30, 06
                 select p;

If I run this LINQ query on a collection of 300.000 elements, it takes ~25 ms to return 80 elements that are within the given range - it does not matter if I use a HashSet<T> or an ObservableCollection<T>.

If I loop through all elements manually and check them, it takes the same time, ~25 ms.

But I do know the HashCode of all Dates that are within the given range. Is it possible to get all elements with the given HashCodes from my HashSet<T>? I think that would be much faster...

Is it possible to speed up the LINQ query? I assume that it does not make use of the special abilities of my HashSet<T>?

Ehssan
  • 739
  • 6
  • 15
  • Is the hashcode of each element its date? – Jodrell May 17 '12 at 16:45
  • There are no special abilities of a HashSet that will allow efficient retrieval of elements whose date falls within a range. A HashSet allows rapid determination of whether a particular object or value is (or isn't) in the set. – hatchet - done with SOverflow May 17 '12 at 16:50
  • My first observation is that hash codes should be different where possible if the objects differ (this clearly can't always be the case but is what you shoudl aim for). In your case this isn't the case. You have different elements with identical hashcodes which is bad. In the worst case if you only had three different unique dates then your hashset will have only three buckets and so finding something in the hashset will have to sort through all the elements in that bucket leading it to be O(n) (give or take). Also I should note that this is a general note, not directly related to the ques :) – Chris May 17 '12 at 16:52
  • Oh, and as an added note is this.elements the hashset you are talking about? Its not clear from the question... – Chris May 17 '12 at 16:54
  • If you have 300,000 elements, are you pulling them from a database? If so, you can grab only the items in the right date range, which should be much faster. – jb. May 17 '12 at 16:54
  • No, the elements are not from a database. I am just asking, because search performance in a generic HashSet should be O(1), but LINQ queries (and my own queries) execute in O(n). Oh, and just to mention it: There are very few elements that have the same hash code... – Ehssan May 18 '12 at 16:05
  • @EhssanDoust as has been pointed out though you are **not** searching by the hash when you are doing linq queries, you are simply searching over the IEnumerable and doing a comparison of the `Date` property (which in your case just happens to be the element used to generate the hash). You realise that a HashSet **has no way to retrieve an element based on the hash** right? See [this](http://stackoverflow.com/questions/1494812/why-cant-i-retrieve-an-item-from-a-hashset-without-enumeration) You really need to use a different data structure. – Sam Holder May 18 '12 at 16:58

2 Answers2

5

You're not using the right data structure. You should be using something like a sorted list (sorted on the Date property) where you can then binary search for the beginning and end of the range.

jason
  • 236,483
  • 35
  • 423
  • 525
  • Yes, I would definitely use a SortedList or SortedDicionary, but I cannot - the 'Date' of the Element is not a unique key... – Ehssan May 18 '12 at 16:10
  • @EhssanDoust why does the fact that the date not being unique stop you from using a dictionary? As long as the Equals method correctly determines when 2 instances are equal and the gethashcode always returns the same value for 2 different objects if equals between those objects is also true, then it will work. – Sam Holder May 18 '12 at 17:01
  • @SamHolder I am not sure if I understand what you say correctly, but if I want to efficiently search for an element by its Date using a Dictionary, the key of the Dictionary should be that date, right? But there are very few dates that are not unique in my collection... So I cannot use them as keys? – Ehssan May 18 '12 at 23:03
  • @EhssanDoust yes sorry, comprehension failure on my part. I forgot that you only had a date and not the full object. A sorted list should be fine though as Jason suggested, as a list can have multiple elements with the same key. so find the index of the first element with the date you want, then find the index of the element with last date, then get all elements between those indexes. – Sam Holder May 19 '12 at 10:00
4

As has been pointed out a hash set is very efficient at determining if a given hash is in the set. Your query just uses the fact that the hashset implement IEnumerable to iterate over the entire set and do the date comparison. It will not use the hashes at all. This is why the manual way takes the same time as the query.

You cannot get an element based on a hash from a hashset, you can only test for existance of the element in the set. A dictionary is what you want if you need to get it by has (which it seems you don't)

Decide what it is that you need to do with your data and use a structure which is optimised for that. This may be your own class which maintains multiple internal structures each of which is efficient at one thing (like one for searching for ranges and another for checking by existence by multiple fields), or there may be an existing structure which fits your needs. But without knowing what it is you want to do with your data its difficult to advise.

The other thing to consider is whether you are optimising prematurely. If 25ms to search manually is fast enough then maybe any structure which implements IEnumerable will be good enough. In which case you can choose one based on the other criteria you need.

Sam Holder
  • 32,535
  • 13
  • 101
  • 181
  • Thank you for your answer. I think, that the current search performance is more than sufficient, I just thought that it might be possible to retrieve elements directly by their hash code, which is as you pointed out not possible. The Remove-method of `HashSet` is much more performant that the one that is offered by any "normal" collection, so I'll definitely use a HashSet. – Ehssan May 18 '12 at 23:16