Occurrences of a List in a string C#

Question

Given

var stringList = new List<string>(new string[] {
                   "outage","restoration","efficiency"});

var queryText = "While walking through the park one day, I noticed an outage",
              "in the lightbulb at the plant. I talked to an officer about", 
              "restoration protocol for public works, and he said to contact",
              "the department of public works, but not to expect much because",
              "they have low efficiency."

How do I get the overall number of occurances of all strings in stringList from queryText?

In the above example, I would want a method that returned 3;

private int stringMatches (string textToQuery, string[] stringsToFind)
{
    //
}

RESULTS

SPOKE TOO SOON!

Ran a couple of performance tests, and this branch of code from Fabian was faster by a good margin:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    int count = 0;
    foreach (var stringToFind in stringsToFind)
    {
        int currentIndex = 0;

    while ((currentIndex = textToQuery.IndexOf(stringToFind , currentIndex, StringComparison.Ordinal)) != -1)
    {
       currentIndex++;
       count++;
    }
    }
    return count;
}

Execution Time: On a 10000 iteration loop using stopwatch:

Fabian: 37-42 milliseconds

lazyberezovsky StringCompare: 400-500 milliseconds

lazyberezovsky Regex: 630-680 milliseconds

Glenn: 750-800 milliseconds

(Added StringComparison.Ordinal to Fabians answer for additional speed.)

I didn't add the +, just used it as an example. I could redo the text. — Wesley, Jul 26 '13 at 22:53
http://stackoverflow.com/questions/3026571/counting-words-in-a-collection-using-linq — zs2020, Jul 26 '13 at 22:54
Looks like a way to show the count of individual words. It wouldn't be that much of a stretch to adapt, but it's not quite right. — Wesley, Jul 26 '13 at 22:56
Thanks for pointing out StringComparison.Ordinal. I edited my answer accordingly (your edit has been rejected). — Fabian Bigler, Jul 27 '13 at 05:50

Fabian Bigler · Accepted Answer · 2013-07-27T05:49:44.240

6

That might also be fast:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
  int count = 0;
  foreach (var stringToFind in stringsToFind)
  {
    int currentIndex = 0;

    while ((currentIndex = textToQuery.IndexOf(stringToFind , currentIndex, StringComparison.Ordinal)) != -1)
    {
     currentIndex++;
     count++;
    }
  }
  return count;
}

edited Jul 27 '13 at 05:49

answered Jul 26 '13 at 23:26

Fabian Bigler

10,403
6
47
70

hang on. I just corrected a bug. ;) please copy the code again – Fabian Bigler Jul 26 '13 at 23:30
@Wesly Cheers. ;) Have you benchmarked properly with StopWatch and some million calls, though? Else the result could be flawed. – Fabian Bigler Jul 26 '13 at 23:33
StopWatch and 10000 calls. I just checked though, and I was only getting "1" back no matter what. Let me see if I was using an old loop. – Wesley Jul 26 '13 at 23:38
I checked it in a test project and I got the right result. ;) – Fabian Bigler Jul 26 '13 at 23:38
@Wesley Thanks for taking the time and testing it. – Fabian Bigler Jul 26 '13 at 23:40
1

Of course this is the most efficient since string methods are the most efficient way. However, this does also something different than the other approaches. It does not use `String.Split` to create a new `string[]` to find all **words**. So if a word to find is `restoration` and one word in the text is `restorationrestoration` this method returns two matches whereas the split approaches don't find a match at all since both **words** are different. Another example: this method finds a match on `inefficiency` if actually `efficiency` is the word to find. – Tim Schmelter Jul 27 '13 at 09:37
@TimSchmelter Good point. Though Wesley is talking about 'Occurences'. The only question is, if he wanted to get the words or literally the string. If he wanted to match the words, then I can edit my current answer accordingly. – Fabian Bigler Jul 27 '13 at 11:18
@TimSchmelter I added another answer to fulfill this requirement. – Fabian Bigler Jul 27 '13 at 11:51

Sergey Berezovskiy · Answer 2 · 2013-07-26T23:12:08.640

4

This LINQ query splits text by spaces and punctuation symbols, and searches matches ignoring case

private int stringMatches(string textToQuery, string[] stringsToFind)
{
   StringComparer comparer = StringComparer.CurrentCultureIgnoreCase;
   return textToQuery.Split(new []{' ', '.', ',', '!', '?'}) // add more if need
                     .Count(w => stringsToFind.Contains(w, comparer));
}

Or with regular expression:

private static int stringMatches(string textToQuery, string[] stringsToFind)
{
    var pattern = String.Join("|", stringsToFind.Select(s => @"\b" + s + @"\b"));
    return Regex.Matches(textToQuery, pattern, RegexOptions.IgnoreCase).Count;
}

edited Jul 26 '13 at 23:12

answered Jul 26 '13 at 22:58

Sergey Berezovskiy

232,247
41
429
459

Which is faster? This might end up getting executed a few hundred times an hour. – Wesley Jul 26 '13 at 23:06
1

@Wesley I can only suggest to try both on your data and compare results. Btw regex solution have one issue - it matches text inside words, trying to solve it.. – Sergey Berezovskiy Jul 26 '13 at 23:06
@Wesley yep, I fixed regex. Please, tell me results of your tests. It's interesting to know which is faster and is difference big – Sergey Berezovskiy Jul 26 '13 at 23:19
Updates made with results. – Wesley Jul 26 '13 at 23:44
2

Regex is almost always the slower option for basic string comparisons. There is a ton of overhead to scan for matches. Compiling the Regex helps a bit, but not a whole lot. – Pete Garafano Jul 27 '13 at 03:56

score 3 · Answer 3 · answered Jul 26 '13 at 22:55

3

If you want to count the words in the string that are in the other collection:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    return textToQuery.Split().Intersect(stringsToFind).Count();
}

answered Jul 26 '13 at 22:55

Tim Schmelter

450,073
74
686
939

Not exactly correct if textToQuery.Split() contains duplicate words since Intersect() returns a set. – zs2020 Jul 26 '13 at 22:58
Will also miss `efficiency.` – Johan Larsson Jul 26 '13 at 22:59
Because of the punctuation I assume? – Wesley Jul 26 '13 at 23:00
@JohanLarsson: `Intersect` is very efficient since it uses a set. However, sza is right, duplicate words are not counted multiple times and there is another problem with the point at `efficiency.`. – Tim Schmelter Jul 26 '13 at 23:04

score 1 · Answer 4 · answered Jul 26 '13 at 23:02

1

I like Tim's answer, but I try to avoid making too many strings to avoid performance issues, and I do like regular expressions, so here's another way to go:

private int StringMatches(string searchMe, string[] keys)
{
    System.Text.RegularExpressions.Regex expression = new System.Text.RegularExpressions.Regex(string.Join("|", keys), System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    return expression.Matches(searchMe).Count;
}

answered Jul 26 '13 at 23:02

Glenn Cuevas

166
5

1

"to avoid performance issues, and I do like regular expressions" Regular expressions are very slow in this scenario compared to string operations. – Fabian Bigler Jul 26 '13 at 23:11
@FabianBigler I should have been more specific about what performance issue I was referring to, if the app gets a little wild creating strings, I've seen servers brought to their knees when the GC runs to clean up resources, but you very well may be right about the string approach being faster, I'm curious to see what Wesley sees in his performance testing. Also, that comment was in reference to textToQuery.Split(). – Glenn Cuevas Jul 26 '13 at 23:19

Chris · Answer 5 · 2013-07-29T17:07:05.523

0

This is a revision of Fabian Bigler's original answer. It is about a 33% speed improvement mostly because of StringComparison.Ordinal.

Here's a link for more info on this: http://msdn.microsoft.com/en-us/library/bb385972.aspx

    private int stringMatches(string textToQuery, List<string> stringsToFind)
    {
        int count = 0, stringCount = stringsToFind.Count(), currentIndex;
        string stringToFind;
        for (int i = 0; i < stringCount; i++)
        {
            currentIndex = 0;
            stringToFind = stringsToFind[i];
            while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
            {
                currentIndex++;
                count++;
            }
        }
        return count;
    }

edited Jul 29 '13 at 17:07

answered Jul 27 '13 at 00:23

Chris

2,766
1
29
34

It is faster, but I find only 3%-5% from a few tests. Though I will say, I had to change the method signature to IList because it wouldn't allow IReadOnlyList. – Wesley Jul 27 '13 at 05:17
Also, added StringComparison.Ordinal to Fabians answer, and it's about %30 faster now. Not sure why it would be faster than yours, but it is but a small margin. – Wesley Jul 27 '13 at 05:30
I can't see any improvement to my answer, and it's not 33% faster. You're generating a lot of string references here which have to be garbage collected. line: string stringToFind = stringsToFind[i]; – Fabian Bigler Jul 27 '13 at 06:02
I used a normal for instead of foreach because for is typically cheaper. I should have put the string on the outside of that loop however in this test case there are so few strings to find that it wouldn't have made much of a difference. As far as the IReadOnlyList, you can change that to a List, or an IList, or whatever you want. If I'm going for speed I tend to steer clear of foreach statements. http://stackoverflow.com/questions/365615/in-net-which-loop-runs-faster-for-or-foreach – Chris Jul 29 '13 at 16:58

score 0 · Answer 6 · answered Jul 27 '13 at 11:49

This will match only the words of your TextToQuery:

The idea of this is to check if the index before and after the match is not a letter. Also, I had to make sure to check if it's the start or end of the string.

  private int stringMatchesWordsOnly(string textToQuery, string[] wordsToFind)
        {
            int count = 0;
            foreach (var wordToFind in wordsToFind)
            {
                int currentIndex = 0;
                while ((currentIndex = textToQuery.IndexOf(wordToFind, currentIndex,         StringComparison.Ordinal)) != -1)
                {
                    if (((currentIndex == 0) || //is it the first index?
                          (!Char.IsLetter(textToQuery, currentIndex - 1))) &&
                          ((currentIndex == (currentIndex + wordToFind.Length)) || //has the end been reached?
                          (!Char.IsLetter(textToQuery, currentIndex + wordToFind.Length))))
                    {
                        count++;
                    }
                    currentIndex++;
                }
            }
            return count;
        }

Conclusion: As you can see this approach is a bit messier than my other answer and will be less performant (Still more performant than the other answers, though). So it really depends on what you want to achieve. If you have short words in your strings to find, you should probably take this answer, because e.g. an 'and' would obviously return too many matches with the first approach.

score 0 · Answer 7 · answered Jul 27 '13 at 18:08

private int stringMatches(string textToQuery, string[] stringsToFind)
{
      string[] splitArray = textToQuery.Split(new char[] { ' ', ',','.' });
      var count = splitArray.Where(p => stringsToFind.Contains(p)).ToArray().Count();
      return count;
}

Occurrences of a List in a string C#

7 Answers7