1

I looked on the website, but there is no direct answer to the following issue.

What is the most efficient way to find the nth occurrence of a substring in a string in C++?

The example here shows how to find the second occurrence: http://www.cplusplus.com/reference/string/string/find/

But it seems really inefficient to first find the first match, then use that location to search for the following match etc. to find the nth match. If you want the position of the 25th match, is there a faster way?

EDIT: In the greater context, I am reading a file line by line, every response to an item has a score, and some are missing, getting an NA string. All items are separated by spaces.

I want to have the option to exclude certain items, so only search from, say, item 35 till 80, 90 to 120, and 150-200. So what I do currently is this:

string blockedLine(string line)
{
  int b_start[] = {35, 90, 150};
  int b_end[] = {80, 120, 200};
  std::vector<int> space_matches = KMP(line, " ");
  string cuttedLine = "";
  for (int i = 0; i < 3; i++)
    {
      cuttedLine.append(line.substr(space_matches[b_start[i]],
                                    space_matches[b_end[i]]));
    }
  return(cuttedLine);
}

Where KMP is the function as mentioned in one of the comments, which gets me the positions of the space occurrences, and stores them in space_matches.

I then count the occurences of NA in this appended string. The thing is that without this appending, just reading the whole line only takes 1 second on roughly 200k lines. When I use this appending method to get substrings, it takes 14 seconds which is too slow.

What can be improvements to speed this up?

PascalVKooten
  • 20,643
  • 17
  • 103
  • 160
  • I would google for C++ String Tokenizer, that should produce a list or a vector. – flaschenpost Jun 10 '13 at 11:57
  • @ForEveR Isn't `rfind` to find the last occurrence? – PascalVKooten Jun 10 '13 at 11:57
  • There are some clever tricks that can be used for very large datasets, but for most smaller datasets, you really are best off just looping through form either end. (Obviously also depends on what you mean by fastest, and if you are searching for the same thing many times or just once) – Mats Petersson Jun 10 '13 at 11:59
  • @MatsPetersson I am interested in the case of very large datasets – PascalVKooten Jun 10 '13 at 12:00
  • 2
    Have a look for at this: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm – Mats Petersson Jun 10 '13 at 12:05
  • 1
    I second @MatsPetersson - sounds a lot like a job for KMP ( http://en.wikibooks.org/wiki/Algorithm_Implementation/String_searching/Knuth-Morris-Pratt_pattern_matcher ) – Najzero Jun 10 '13 at 12:17
  • Assuming of course that you are searching for the nth occurrence of `foobar` in a very long string. Searching for a single or two character string can probably not be improved very much. Searching for a long string definitely can. – Mats Petersson Jun 10 '13 at 12:25

1 Answers1

2
/// Returns the position of the 'Nth' occurrence of 'find' within 'str'.
/// Note that 1==find_Nth( "aaa", 2, "aa" ) [2nd "aa"]
/// - http://stackoverflow.com/questions/17023302/
size_t find_Nth(
    const std::string & str ,   // where to work
    unsigned            N ,     // N'th ocurrence
    const std::string & find    // what to 'find'
) {
    if ( 0==N ) { return std::string::npos; }
    size_t pos,from=0;
    unsigned i=0;
    while ( i<N ) {
        pos=str.find(find,from);
        if ( std::string::npos == pos ) { break; }
        from = pos + 1; // from = pos + find.size();
        ++i;
    }
    return pos;
/**
    It would be more efficient to use a variation of KMP to
    benefit from the failure function.
    - Algorithm inspired by James Kanze.
    - http://stackoverflow.com/questions/20406744/
*/
}

int main() {
    {
        //                       0123456789.123456789.123
        assert(  3 == find_Nth( "My gorila ate the banana", 1 , "gorila") );
        assert( 18 == find_Nth( "My gorila ate the banana", 1 , "banana") );

        //                       0123456789.123456789.123
        assert(  3 == find_Nth( "My banana ate the banana", 1 , "banana") );
        assert( 18 == find_Nth( "My banana ate the banana", 2 , "banana") );

        assert( std::string::npos == find_Nth( "My banana ate the banana", 3 , "banana") );
        assert( std::string::npos == find_Nth( "My banana ate the banana", 3 , "gorila") );
    }
        assert( 1==find_Nth( "aaa", 2, "aa" ) );
        assert( 0==find_Nth( "aaa", 1, "aa" ) );
    {
        std::string str;
        //     01234567
        str = "aaaaaaaa"; assert( 8==str.size() );
        assert( find_Nth( str, 0 , "aa") == std::string::npos );
        assert( find_Nth( str, 1 , "aa") == 0 );
        assert( find_Nth( str, 2 , "aa") == 1 );
        assert( find_Nth( str, 3 , "aa") == 2 );
        assert( find_Nth( str, 4 , "aa") == 3 );
        assert( find_Nth( str, 5 , "aa") == 4 );
        assert( find_Nth( str, 6 , "aa") == 5 );
        assert( find_Nth( str, 7 , "aa") == 6 );
        assert( find_Nth( str, 8 , "aa") == std::string::npos  );
        assert( find_Nth( str, 9 , "aa") == std::string::npos  );
    }
}
Adolfo
  • 281
  • 2
  • 5