I have two 50 KB strings in the vars `doc1` and `doc2`, coming from the database through Entity Framework. I want to compare these two vars to see if `doc1` and `doc2` are equal. I can take the hashset of the strings and compare the hashes. Or I can simply use `if (doc1 == doc2)`. Is there a third option that's better?

If there is no third option, does anyone have a suggestion (a logical argument is fine) regarding hashset vs. `==` in terms of optimization, performance, and what the IL does in the background? I would imagine that a hashset would have to scan each string linearly to the end to create a unique hash string (for both vars). So does `==`. Which one, then, is logically better?

user2767168
  • Define better. Does `==` work? If so, why are you looking for something better? Do you have a performance bottleneck? – Lasse V. Karlsen Sep 12 '14 at 06:43
  • To compare strings, whatever algorithm you use has to walk through all the characters in the strings (if their lengths do not differ). So simply use the `==` operator. – Sriram Sakthivel Sep 12 '14 at 06:43
  • Do you mean [`HashSet`](http://msdn.microsoft.com/en-us/library/bb359438%28v=vs.110%29.aspx), or [`string.GetHashCode()`](http://msdn.microsoft.com/en-us/library/system.string.gethashcode%28v=vs.110%29.aspx)? – dbc Sep 12 '14 at 06:43
  • @Lasse No. I was at a crossroad in my coding and I needed to make a decision. – user2767168 Sep 12 '14 at 07:01
  • @SriramSakthivel Makes sense. `==` as krisku mentioned has an added advantage that it stops as soon as it finds a mismatch. – user2767168 Sep 12 '14 at 07:03
  • Yes, that's how it works. The problem is that if only the last character differs, you'll have to walk all the way to the end to find it. Usually it won't be like that. – Sriram Sakthivel Sep 12 '14 at 07:05
  • @dbc I must admit I'm suddenly drawing a blank on those two. `string.GetHashCode()` gets the hash code of the string object. Is that correct? I was planning to pass the string to a function and get the hash code back. – user2767168 Sep 12 '14 at 07:07
  • @user2767168 - the `GetHashCode()` method on all C# objects returns a hash code for them. `HashSet` is a collection that allows for fast lookup using hash codes. – dbc Sep 12 '14 at 07:10

1 Answer

The `==` operator compares character by character but stops as soon as it finds a mismatch between the two strings. In that regard it will usually perform better than hashing, because it likely will not have to scan the whole strings. Even in the worst case, when the strings are equal, it does no more work than hashing them both would.
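As a minimal sketch of that comparison, assuming two in-memory strings of roughly the size described in the question: the class name `DocumentComparison` and the helper `SameContent` are hypothetical names used only for illustration. `string.Equals` with `StringComparison.Ordinal` performs the same ordinal, case-sensitive comparison as `==` on strings.

```csharp
using System;

public static class DocumentComparison
{
    // Ordinal, case-sensitive comparison; same result as doc1 == doc2.
    // Handles null arguments without throwing.
    public static bool SameContent(string doc1, string doc2)
    {
        return string.Equals(doc1, doc2, StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Stand-ins for the two documents loaded from the database.
        string doc1 = new string('a', 50000);
        string early = "b" + new string('a', 49999);   // differs at the first character
        string late = new string('a', 49999) + "b";    // differs only at the last character
        string same = new string('a', 50000);          // equal content, separate instance

        Console.WriteLine(SameContent(doc1, early)); // False, mismatch found immediately
        Console.WriteLine(SameContent(doc1, late));  // False, mismatch found at the end
        Console.WriteLine(SameContent(doc1, same));  // True, every character was compared
    }
}
```

Hashing both strings would read all 50,000 characters of each in every one of these cases, whereas the direct comparison only reads them all when the strings match or differ near the end.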

If you are really concerned about performance, you could store precomputed hashes of the long strings in the database and thus would not have to look at their content at all (provided a hash collision is not fatal).
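One way the precomputed-hash idea could look, as a sketch under stated assumptions: SHA-256 is my choice here, not something the answer specifies, and `DocumentHash`, `ComputeHash`, and `ProbablyEqual` are hypothetical names. The hash would be computed once when a document is saved and stored in an extra column next to it. A cryptographic hash is used rather than `string.GetHashCode()` because the latter is not guaranteed to be stable across processes or .NET versions, so it should not be persisted.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DocumentHash
{
    // Compute a SHA-256 digest of the document text, Base64-encoded
    // so it can be stored in a short text column (44 characters).
    public static string ComputeHash(string document)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(document));
            return Convert.ToBase64String(digest);
        }
    }

    // Compare two stored hashes instead of the 50 KB documents themselves.
    // Equal hashes mean the documents are almost certainly equal;
    // different hashes mean they are definitely different.
    public static bool ProbablyEqual(string storedHash1, string storedHash2)
    {
        return string.Equals(storedHash1, storedHash2, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string hash1 = ComputeHash(new string('a', 50000));
        string hash2 = ComputeHash(new string('a', 49999) + "b");

        Console.WriteLine(hash1.Length);                // 44
        Console.WriteLine(ProbablyEqual(hash1, hash1)); // True
        Console.WriteLine(ProbablyEqual(hash1, hash2)); // False
    }
}
```

With this layout the query can fetch and compare only the two short hash values; the full documents need to be loaded only if a collision has to be ruled out.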

krisku
  • While I accepted this as the answer I will be using `string1.Equals(string2)` as suggested by Sagar and based on reading dbc reference below to Sagar's answer. Thank you guys. You guys rock! – user2767168 Sep 12 '14 at 07:10