3

I have a program that compares two files. I ran visual studio analysis and found that my comparison time is large. Is there a quicker way to compare two string than this? (I can't use parallel foreach because it might causes errors.) Right now I'm using a concurrent dictionary but I'm open to other options. :)

var metapath = new ConcurrentDictionary<string, string>();
foreach(var me in metapath)
{
 if (line.StartsWith(me.Key.ToString()))
 {...}
}
Cœur
  • 37,241
  • 25
  • 195
  • 267
  • How large is the comparison time, does it say? – BoltClock Mar 23 '12 at 21:57
  • Do you need the line-based approach? It’s not entirely obvious from your question. Do you just want to compare entire files for equality, or the lines of individual text files? – Roman Starkov Mar 23 '12 at 22:00
  • @BoltClock well me.key.tostring is 8 characters long and line is somewhere between 200-1000 and its taking about 42 seconds for all the comparisions –  Mar 23 '12 at 22:00
  • @romkyns yes I think it needs to be line based –  Mar 23 '12 at 22:01
  • Sounds similar to this question: http://stackoverflow.com/q/8867710/409259 – Jeff Ogata Mar 23 '12 at 22:02
  • Please don't prefix your titles with "C#" and such. That's what the tags are for. – John Saunders Mar 23 '12 at 22:04
  • What's the desired result of the comparison? There are better algorithms than this; you're essentially getting O(m * n) performance when you could be getting O(m + n). – phoog Mar 23 '12 at 22:14
  • @AshBurlaczenko thanks i will do that next time –  Mar 24 '12 at 19:02
  • @JohnSaunders sorry i'm still a newb :) i'll stop prefixing with the language –  Mar 24 '12 at 19:03
  • @phoog i'm sorry i dont understand what do you mean: "what is the desired result of the comparison?" really i'm just checking a string to see if it begins with another string how could i get O(m+n)? –  Mar 24 '12 at 19:05
  • @caseyr547 I'm asking about the result of comparing the files, not of comparing each line. Are you returning the count of lines in the first that match lines in the second? Or are you returning a third file containing the matching lines? In other words, what happens in the `{...}` block? The best answer to the question depends on that. – phoog Mar 26 '12 at 13:39
  • @AshBurlaczenko it's always possible to change the accepted answer if a better one comes along. – phoog Mar 26 '12 at 13:40
  • @phoog, true but it's unlikely they would revisit this question once they'd accepted one already. – Ash Burlaczenko Mar 26 '12 at 17:29

3 Answers3

5

First of all, drop the ToString() from me.Key.ToString().

Next, use the ordinal string comparison (provided that this doesn’t impact correctness):

line.StartsWith(me.Key, StringComparison.Ordinal);

This is beneficial because standard string comparisons follow various Unicode rules on what’s equal. For example, normalized and denormalized sequences must be treated as equal. Ordinal just compares raw character data, ignoring Unicode equality rules. There is more detail on this here, for example, or here (which claims it’s faster but without quoting any numbers).

Last, profile the code. You’ll be surprised, but most of the time the slow part is not at all what you think it is. For example, it could be the part where you add things to the dictionary.

Community
  • 1
  • 1
Roman Starkov
  • 59,298
  • 38
  • 251
  • 324
1

If you compare strings exactly, String.Equals is quite good:

String.Equals(line, me.Key)

Have you seen this: What is the fastest (built-in) comparison for string-types in C#

Community
  • 1
  • 1
Andrei Drynov
  • 8,362
  • 6
  • 39
  • 39
0

It's not clear exactly what you mean by "comparision" but if you don't mean "sort" i.e. you want to check for plagiarism or something, then what about hashing the lines first and comparing the hash?

It would depend on the size of your data set as to whether there is any benefit. Large and small are highly subjective terms.

rism
  • 11,932
  • 16
  • 76
  • 116