C find all occurrences of substring

Question

Is there an efficient way to find all occurrences (including overlapping) of non const char *str2 in char *str1 and outputting the numbers position of matches in str1 in C (not in C++ as it differs)?

Do you need overlapping or non-overlapping occurrences? Are you aware of `strstr()`? What have you tried? — Jonathan Leffler, Nov 20 '12 at 21:34
I need including overlapping and I'm not aware of strstr() ;) — NGix, Nov 20 '12 at 21:38
How long is the string we're talking about? If it's really long, you might consider some pre-processing for fast operation afterwards. Same goes if the string changes infrequently and is queried often. — cxxl, Nov 20 '12 at 21:38
Here's a reference for [`strstr()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strstr.html) — yes, that's POSIX, but for this function, it matches the C standard exactly. You need to learn which functions are in the standard C library and how to use them. Then questions like this become easy. — Jonathan Leffler, Nov 20 '12 at 21:41
I saw this reference, but it finds only 1 occurrence and I'm not sure if it's faster in some algorithm then for example this method [link](http://en.wikipedia.org/wiki/Suffix_array) — NGix, Nov 20 '12 at 21:44
Try it and see. It depends on what you're doing, and how often you're doing it, but I'd lay odds that for many purposes with a single scan through the data, the cost penalty of building a Suffix Array outweighs any performance benefit in the (one time) search phase. If you're going to be repeatedly analyzing the data, possibly with different strings, then pre-processing may provide a big win. But for most naïve applications, they're unlikely to help. — Jonathan Leffler, Nov 20 '12 at 21:49
If `str2` is constant and you want to find it in several `str1`, a string search algorithm like Knuth-Morris-Pratt or Boyer-Moore can be a win over `strstr`, also for pathological cases if you're only scanning one `str1`. — Daniel Fischer, Nov 20 '12 at 21:52

score 1 · Answer 1 · answered Nov 20 '12 at 21:45

1

Your function will use strstr() in a while loop to find the first match of str2 in str1. You can then print the offset of that match. You'll continue the search at the first character after the match. You'll stop the loop when strstr() no longer finds a match (signalled by strstr() returning NULL).

If you needed non-overlapping, you'd want to know the length of str2 and you'd start the next search at the matched character plus the length of str2.

answered Nov 20 '12 at 21:45

Jonathan Leffler

730,956
141
904
1,278

That's okay, but I'm looking for more efficient algorithm e.g. suffix array, because `str1` can be really long and program execution time is limited – NGix Nov 20 '12 at 21:57
Try `strstr()`; it can be hard to beat. At least on Mac OS X, I tried BM and KMP in C and didn't get close to `strstr()` in performance. I was both surprised and disappointed. Are you doing 'bioinformatics' searching on ACGTAGGTCA type strings? – Jonathan Leffler Nov 20 '12 at 21:59

Vyktor · Answer 2 · 2012-11-20T22:03:03.693

0

Using strstr() inside a loop:

int get_substr_count(const char * haystack, const char *needle)
{
    int count = 0;
    const char *tmp = haystack;
    while( tmp = strstr( tmp, needle)){
        printf( "Position: %d\n", (int)(tmp-haystack));
        ++count;
    }
    return count;
}

edited Nov 20 '12 at 22:03

answered Nov 20 '12 at 21:45

Vyktor

20,559
6
64
96

That counts the matches; it's not the same as printing the positions of the matches. – Jonathan Leffler Nov 20 '12 at 21:50
2

you should increment tmp inside while loop, else this code goes to infinite loop because it keeps searching from same index. – Jignesh Oct 20 '14 at 12:55
1

You should increment tmp with strlen(needle), because after the first occurrence of needle, strstr will return the tmp reference and it goes to infinite loop. – Andrei Minca Mar 02 '20 at 22:50

C find all occurrences of substring

2 Answers2

Linked