2

I have to count string occurences in one very loooong string (about 30mb in plain text) I use the following code for now: int count = new Regex(Regex.Escape(stringThatIamLookingFor)).Matches(stringToSearchIn).Count; but is is too slow. It takes about 3 minutes on i7 and 16gb ram. The example data is:

43.996442,-31.768039
43.996432,-31.768039
43.996432,-31.768049
43.996422,-31.768049
43.996422,-31.768059

I want to count (for example) .7 Is there faster way than regeex?

ok, solved

The fastest function so far is: (I need to check only two chars.)

public int countOccurences2(string source, char charOne, char charTwo)
    {
        int count = 0;
        for (int i=0; i<=source.Length-1; i=i+2)
            if (source[i] == charOne && source[i + 1] == charTwo) { count++; }
        return count;
    }
Community
  • 1
  • 1
pythoner
  • 79
  • 1
  • 7
  • 1
    How long does it take to convert the string into a byte array and just looking for the correct two numbers with a simple for loop? Sometimes the low tech method isn't the worst. – Jens Mar 13 '15 at 17:08
  • 1
    [Knuth–Morris–Pratt algorithm](http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) – Martin Liversage Mar 13 '15 at 17:11
  • 1
    possible duplicate of [How would you count occurrences of a string within a string?](http://stackoverflow.com/questions/541954/how-would-you-count-occurrences-of-a-string-within-a-string) – Jens Mar 13 '15 at 17:16
  • please let us know how well perform each solution. For science! :D – Sim1 Mar 13 '15 at 17:16
  • HOLY CRAP, Simone Riboldi's answer is about 10 times faster than regex. – pythoner Mar 13 '15 at 17:18
  • original solution is of Richard Watson, see the question in my answer ;) You should say thank to him and not me ;) – Sim1 Mar 13 '15 at 17:20
  • @Jens: The last time I checked that (__terrible! terrible!__) post, RegEx won for truely large needles and haystacks. – TaW Mar 13 '15 at 17:20
  • @pythoner does the solution provide the same results too? – Sim1 Mar 13 '15 at 17:24
  • Yes, but still too slow in really huge strings:c – pythoner Mar 13 '15 at 17:25
  • try with int count = source.Split(char[] source).Length - 1; – Sim1 Mar 13 '15 at 17:28
  • Since you are doing test anyway, the method posted in the original question using replace was amazingly fast, so you may want to give it a spin as well.. – TaW Mar 13 '15 at 17:28

1 Answers1

3

from this question: How would you count occurrences of a string within a string?

the following code seems the most performing:

int count = 0, n = 0;

if(substring != "")
{
    while ((n = source.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
    {
        n += substring.Length;
        ++count;
    }
}

solution provided by Richard Watson in the mentioned question

Community
  • 1
  • 1
Sim1
  • 534
  • 4
  • 24
  • Note that the accepted answer and actually most answers are plain wrong as they count characters; also note that any Linq solution breaks down for large strings! This is the worst example of vote gauging going wrong I have ever seen on SO – TaW Mar 13 '15 at 17:22
  • but this one seems a good solution accordingly to OP too (or at least I hope so!) – Sim1 Mar 13 '15 at 17:23
  • 1
    Yes, probably because the needle is so short. I did tests last year and when both needle and haystack grow regex finally wins. – TaW Mar 13 '15 at 17:25
  • 1
    @TaW good to know, very good to know! We always need this kind of function and it is really usefull to know the weakness of each approach. – Sim1 Mar 13 '15 at 17:26
  • 1
    I might also add that due to Unicode considerations, any comparison based on simple byte comparisons will give the incorrect answer on some data, so using a method like this that respects Unicode is a definite plus. – Gary Walker Mar 13 '15 at 17:28
  • 1
    I think this answer is (arguably) wrong due to the "n += substring.Length" line - if you search for "abca" in the string "abcabca" then it will only return 1 when (arguably) it should return 2. – Jared Moore Mar 13 '15 at 18:08