5

Given

var stringList = new List<string>(new string[] {
                   "outage","restoration","efficiency"});

var queryText = "While walking through the park one day, I noticed an outage",
              "in the lightbulb at the plant. I talked to an officer about", 
              "restoration protocol for public works, and he said to contact",
              "the department of public works, but not to expect much because",
              "they have low efficiency."

How do I get the overall number of occurances of all strings in stringList from queryText?

In the above example, I would want a method that returned 3;

private int stringMatches (string textToQuery, string[] stringsToFind)
{
    //
}

RESULTS

SPOKE TOO SOON!

Ran a couple of performance tests, and this branch of code from Fabian was faster by a good margin:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    int count = 0;
    foreach (var stringToFind in stringsToFind)
    {
        int currentIndex = 0;

    while ((currentIndex = textToQuery.IndexOf(stringToFind , currentIndex, StringComparison.Ordinal)) != -1)
    {
       currentIndex++;
       count++;
    }
    }
    return count;
}

Execution Time: On a 10000 iteration loop using stopwatch:

Fabian: 37-42 milliseconds

lazyberezovsky StringCompare: 400-500 milliseconds

lazyberezovsky Regex: 630-680 milliseconds

Glenn: 750-800 milliseconds

(Added StringComparison.Ordinal to Fabians answer for additional speed.)

Wesley
  • 5,381
  • 9
  • 42
  • 65

7 Answers7

6

That might also be fast:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
  int count = 0;
  foreach (var stringToFind in stringsToFind)
  {
    int currentIndex = 0;

    while ((currentIndex = textToQuery.IndexOf(stringToFind , currentIndex, StringComparison.Ordinal)) != -1)
    {
     currentIndex++;
     count++;
    }
  }
  return count;
}
Fabian Bigler
  • 10,403
  • 6
  • 47
  • 70
  • hang on. I just corrected a bug. ;) please copy the code again – Fabian Bigler Jul 26 '13 at 23:30
  • @Wesly Cheers. ;) Have you benchmarked properly with StopWatch and some million calls, though? Else the result could be flawed. – Fabian Bigler Jul 26 '13 at 23:33
  • StopWatch and 10000 calls. I just checked though, and I was only getting "1" back no matter what. Let me see if I was using an old loop. – Wesley Jul 26 '13 at 23:38
  • I checked it in a test project and I got the right result. ;) – Fabian Bigler Jul 26 '13 at 23:38
  • @Wesley Thanks for taking the time and testing it. – Fabian Bigler Jul 26 '13 at 23:40
  • 1
    Of course this is the most efficient since string methods are the most efficient way. However, this does also something different than the other approaches. It does not use `String.Split` to create a new `string[]` to find all **words**. So if a word to find is `restoration` and one word in the text is `restorationrestoration` this method returns two matches whereas the split approaches don't find a match at all since both **words** are different. Another example: this method finds a match on `inefficiency` if actually `efficiency` is the word to find. – Tim Schmelter Jul 27 '13 at 09:37
  • @TimSchmelter Good point. Though Wesley is talking about 'Occurences'. The only question is, if he wanted to get the words or literally the string. If he wanted to match the words, then I can edit my current answer accordingly. – Fabian Bigler Jul 27 '13 at 11:18
  • @TimSchmelter I added another answer to fulfill this requirement. – Fabian Bigler Jul 27 '13 at 11:51
4

This LINQ query splits text by spaces and punctuation symbols, and searches matches ignoring case

private int stringMatches(string textToQuery, string[] stringsToFind)
{
   StringComparer comparer = StringComparer.CurrentCultureIgnoreCase;
   return textToQuery.Split(new []{' ', '.', ',', '!', '?'}) // add more if need
                     .Count(w => stringsToFind.Contains(w, comparer));
}

Or with regular expression:

private static int stringMatches(string textToQuery, string[] stringsToFind)
{
    var pattern = String.Join("|", stringsToFind.Select(s => @"\b" + s + @"\b"));
    return Regex.Matches(textToQuery, pattern, RegexOptions.IgnoreCase).Count;
}
Sergey Berezovskiy
  • 232,247
  • 41
  • 429
  • 459
3

If you want to count the words in the string that are in the other collection:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    return textToQuery.Split().Intersect(stringsToFind).Count();
}
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
1

I like Tim's answer, but I try to avoid making too many strings to avoid performance issues, and I do like regular expressions, so here's another way to go:

private int StringMatches(string searchMe, string[] keys)
{
    System.Text.RegularExpressions.Regex expression = new System.Text.RegularExpressions.Regex(string.Join("|", keys), System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    return expression.Matches(searchMe).Count;
}
Glenn Cuevas
  • 166
  • 5
  • 1
    "to avoid performance issues, and I do like regular expressions" Regular expressions are very slow in this scenario compared to string operations. – Fabian Bigler Jul 26 '13 at 23:11
  • @FabianBigler I should have been more specific about what performance issue I was referring to, if the app gets a little wild creating strings, I've seen servers brought to their knees when the GC runs to clean up resources, but you very well may be right about the string approach being faster, I'm curious to see what Wesley sees in his performance testing. Also, that comment was in reference to textToQuery.Split(). – Glenn Cuevas Jul 26 '13 at 23:19
0

This is a revision of Fabian Bigler's original answer. It is about a 33% speed improvement mostly because of StringComparison.Ordinal.

Here's a link for more info on this: http://msdn.microsoft.com/en-us/library/bb385972.aspx

    private int stringMatches(string textToQuery, List<string> stringsToFind)
    {
        int count = 0, stringCount = stringsToFind.Count(), currentIndex;
        string stringToFind;
        for (int i = 0; i < stringCount; i++)
        {
            currentIndex = 0;
            stringToFind = stringsToFind[i];
            while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
            {
                currentIndex++;
                count++;
            }
        }
        return count;
    }
Chris
  • 2,766
  • 1
  • 29
  • 34
  • It is faster, but I find only 3%-5% from a few tests. Though I will say, I had to change the method signature to IList because it wouldn't allow IReadOnlyList. – Wesley Jul 27 '13 at 05:17
  • Also, added StringComparison.Ordinal to Fabians answer, and it's about %30 faster now. Not sure why it would be faster than yours, but it is but a small margin. – Wesley Jul 27 '13 at 05:30
  • I can't see any improvement to my answer, and it's not 33% faster. You're generating a lot of string references here which have to be garbage collected. line: string stringToFind = stringsToFind[i]; – Fabian Bigler Jul 27 '13 at 06:02
  • I used a normal for instead of foreach because for is typically cheaper. I should have put the string on the outside of that loop however in this test case there are so few strings to find that it wouldn't have made much of a difference. As far as the IReadOnlyList, you can change that to a List, or an IList, or whatever you want. If I'm going for speed I tend to steer clear of foreach statements. http://stackoverflow.com/questions/365615/in-net-which-loop-runs-faster-for-or-foreach – Chris Jul 29 '13 at 16:58
0

This will match only the words of your TextToQuery:

The idea of this is to check if the index before and after the match is not a letter. Also, I had to make sure to check if it's the start or end of the string.

  private int stringMatchesWordsOnly(string textToQuery, string[] wordsToFind)
        {
            int count = 0;
            foreach (var wordToFind in wordsToFind)
            {
                int currentIndex = 0;
                while ((currentIndex = textToQuery.IndexOf(wordToFind, currentIndex,         StringComparison.Ordinal)) != -1)
                {
                    if (((currentIndex == 0) || //is it the first index?
                          (!Char.IsLetter(textToQuery, currentIndex - 1))) &&
                          ((currentIndex == (currentIndex + wordToFind.Length)) || //has the end been reached?
                          (!Char.IsLetter(textToQuery, currentIndex + wordToFind.Length))))
                    {
                        count++;
                    }
                    currentIndex++;
                }
            }
            return count;
        }

Conclusion: As you can see this approach is a bit messier than my other answer and will be less performant (Still more performant than the other answers, though). So it really depends on what you want to achieve. If you have short words in your strings to find, you should probably take this answer, because e.g. an 'and' would obviously return too many matches with the first approach.

Fabian Bigler
  • 10,403
  • 6
  • 47
  • 70
0
private int stringMatches(string textToQuery, string[] stringsToFind)
{
      string[] splitArray = textToQuery.Split(new char[] { ' ', ',','.' });
      var count = splitArray.Where(p => stringsToFind.Contains(p)).ToArray().Count();
      return count;
}
Dejan Ciev
  • 142
  • 4