-1

I have two list of strings:

private List<string> sortedList; //about 150k items
private List<string> mixedList; // about 200k items

One of them is sorted, the other one is not.

I use the following code to find the intersecting words:

List<string> intersectingWords=new List<string>();
StringInvariantComparer stringComparer = new StringInvariantComparer();

foreach (var item in mixedList)
{
  int res = sortedList.BinarySearch(item, stringComparer);
  if (res >= 0 && !intersectingWords.Contains(item))
    intersectingWords.Add(item);
}



public class StringInvariantComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        return string.Compare(x, y, StringComparison.InvariantCultureIgnoreCase);
    }
}

Unfortunately, it takes about 20-25 seconds which is way too much time.

If I change the StringComparison type to Ordinal, it finishes in 1-2 seconds but can't find 30% of the words.

What would you recommend to optimize the code?

UPDATE:

If I use the following code:

var wrongRes = sortedList.Intersect(mixedList).ToList();

The result:

enter image description here

If I use the following code:

var upperMixed = new List<string>();
var upperSorted = new List<string>();
foreach (var item in mixedList)
    upperMixed.Add(item.ToUpper());
foreach (var item in sortedList)
    upperSorted.Add(item.ToUpper());
foreach (var word in upperSorted.Intersect(upperMixed))
{
    intersectingWords.Add(word.ToLower());
}

The result:

enter image description here

They are clearly not the same :)

Nestor
  • 8,194
  • 7
  • 77
  • 156
  • A binary search algorithm has O (Log N) I think...better than O (N) – rory.ap Jan 07 '17 at 23:49
  • Clearly you've seen questions on "intersection of two lists" (i.e. http://stackoverflow.com/questions/7187996/intersect-two-lists-in-c-sharp) but decided to do something different for some reason. Could you please spell out the reason and provide sample that shows words that can't be found by versions of ordinal comparison (casesensitive/caseinsensitive)? (i.e. do you care about `ß` vs `ss`) – Alexei Levenkov Jan 08 '17 at 00:02
  • Would a `Parallel.ForEach` help? – BrunoLM Jan 08 '17 at 00:02
  • @BrunoLM no - parallel is not a cure for picking wrong algoritm (O(n^2) vs. O(n)). – Alexei Levenkov Jan 08 '17 at 00:05
  • I would sort the *mixedList*, and then *merge* two lists.. (only 1 pass on both lists) – L.B Jan 08 '17 at 00:13
  • To resume all comments : ```var intersectingWords = new HashSet(mixedList.AsParallel().Where(item => sortedList.BinarySearch(item, stringComparer) >= 0));``` – Kalten Jan 08 '17 at 00:14

1 Answers1

1
  1. Replace List<string> intersectingWords with HashSet<string> intersectingWords - it's better for Contains operations.
  2. Prepare words: keep in sortedList and mixedList strings in uppercase and use Ordinal comapre. If you need original strings, you can keep structure of 2 words (normal and uppercase). Compare uppercase and add normal word in intersectingWords
Backs
  • 24,430
  • 5
  • 58
  • 85
  • 1
    Uppercasing is rarely enough to normalize all text... Plus HashSet using OrdinalIgnoreCase comparer would do the same without extra effort.... – Alexei Levenkov Jan 08 '17 at 00:03
  • @AlexeiLevenkov What do you mean by not enough? In my case, the Ordinal method worked with uppercased words. What should I expect? – Nestor Jan 08 '17 at 00:10
  • @Nestor There are letters outside ASCII range - so while you say InvariantIgnoreCase worked for particular sample it is not clear if it is actually enough for you or there are more words i.e. do you care about ß vs ss, what about use of denormalized ascents? – Alexei Levenkov Jan 08 '17 at 00:15
  • @AlexeiLevenkov I updated the initial post. I get strange results, if I use the built-in comparer. If I uppercase the words, it works okay. The sorted list comes from a file. Could be an encoding issue? – Nestor Jan 08 '17 at 00:48