Store every string that starts and ends with a special word into an array in C

Question

I have a long string, and I want to store every substring that starts and ends with a special word into an array, and then remove duplicate strings. In my long string, there is no space, comma, or any other separator between the words, so I cannot use strtok. The start marker is start and the end marker is end. This is the code I have so far (but it doesn't work because it uses strtok()).

char buf[] = "start-12-3.endstart-12-4.endstart-13-3.endstart-12-4.end";
char *array[5];
char *x;
int i = 0, j = 0;
array[i] = strtok(buf, "start");

while (array[i] != NULL) {
    array[++i] = strtok(NULL, "start");
}
//removeDuplicate(array[i]);
for (i = 0; i < 5; i++)
    for (j = 0; j < 5; j++)
        if (strcmp(array[i], array[j]) == 0)
            x[i++] = array[i];

printf("%s", x[i]);

Example input:

start-12-3.endstart-12-4.endstart-13-3.endstart-12-4.end

Output equivalent to:

char *array[]= { "start-12-3.end", "start-12-4.end", "start-13-3.end" };

The second start-12-4.end string has been eliminated in the output.

I've also tried strstr, but it has some issues:

int main(int argc, char **argv)
 {
char string[] = "This-one.testthis-two.testthis-three.testthis-two.test";
int counter = 0;

while (counter < 4)
{
    char *result1 = strstr(string, "this");
    int start = result1 - string;
    char *result = strstr(string, "test");
    int end = result - string;
    end += 4;
    printf("\n%s\n", result);
    memmove(result, result1, end += 4);
    counter++;
}
}

To put the strings into an array and remove duplicates, I tried the following code, but it has problems:

int main(void)
{
char string[] = "this-one.testthis-two.testthis-three.testthis-two.test";
int counter = 0;
const char *b_token = "this";
const char *e_token = "test";
int e_len = strlen(e_token);
char *buffer = string;
char *b_mark;
char *e_mark;
char *a[50];
int i=0, j;
char *s;

while ((b_mark = strstr(buffer, b_token)) != 0 && (e_mark =strstr(b_mark, e_token)) != 0)
{
    int length = e_mark + e_len - b_mark;

    s = (char *) malloc(length);

    strncpy(s, b_mark, length);

    a[i]=s;
    i++;
    buffer = e_mark + e_len;
}
for (i=0; i<strlen(s); i++)
       printf ("%s",a[i]);
free(s);
/*  
//remove duplicate string

for (i=0; i<4; i++)
  for (j=0; j<4; j++)
  {

    if (a[i] == NULL || a[j] == NULL || i == j)
         continue;

    if (strcmp (a[i], a[j]) == 0)  {
         free(a[i]);
         a[i] = NULL; 
   }
   printf("%s\n", a[i]);
*/

return 0; 
}

But what is your *specific* question? We won't write all the code for you. Generally you need to show your code attempt, explain what it is supposed to do and what is wrong with it or what help you need with it. — kaylum, Mar 05 '16 at 21:22
It would also help if you showed an unambiguous *example* of input and required output. — Weather Vane, Mar 05 '16 at 21:29
`strtok(NULL, "start");` will break the string where *any one or more in any sequence* of the characters in `"start"` are found. It does not use the whole string as a delimiter. And yet, `"start"` is still present in your required output. Confused. — Weather Vane, Mar 05 '16 at 21:47
Use `strstr()` to locate occurrences of your start and end markers. Then use `memmove()` (or `memcpy()`) to copy parts of the strings around. Note that since your start and end markers are adjacent in the original string, you can't simply insert extra characters into it — which is also why you can't use `strtok()`. So, you'll have to make a copy of the original string. — Jonathan Leffler, Mar 05 '16 at 21:47
@Weather Vane I put my sting into a buffer. It's like this: char buf[]= start-12-3.endstart-12-4.endstart-13-3.endstart-12-4.end I will edit it. — Matrix, Mar 05 '16 at 21:49
@Jonathan Leffler ok tnx. but with strstr I can set only the word that I want to be start of string, what about end of string? — Matrix, Mar 05 '16 at 22:02
Use `strstr()` again, starting from where the start portion ends, to find the next end marker. Then, knowing the start of the whole section, and the start of the end and the length of the end, you can arrange to copy precisely the correct number of characters into the new string, and then null terminate if that's appropriate, or comma terminate. Something like: `if ((start = strstr(source, "start")) != 0 && ((end = strstr(start, "end")) != 0)` then the data is between `start` and `end + 2` (inclusive) in your source string. Repeat starting from the character after the end of 'end'. — Jonathan Leffler, Mar 05 '16 at 22:06
@Jonathan Hi, I've tried following code but it doesn't work fine; would u please tell me what's wrong with it? int main(int argc, char** argv) {char string[]="This-one.testthis-two.testthis-three.testthis-two.test"; int counter=0; while(counter<4){ char *result1 = strstr(string, "This"); int start = result1 - string; char *result = strstr(string, "test"); int end = result - string; end+=4; printf("\n%s\n",result); memmove (result,result1,end+=4); counter++;} } — Matrix, Mar 06 '16 at 14:35
The main problem appears to be searching for `This` with a capital T but the string only contains a single capital T. You should also look at [Is there a way to specify how many characters of a string to print out using `printf()`?](https://stackoverflow.com/questions/2239519/is-there-a-way-to-specify-how-many-characters-of-a-string-to-print-out-using-pri/2239571#2239571). — Jonathan Leffler, Mar 06 '16 at 14:41
@Jonathan Leffler number of characters would be different, I can not specify it. — Matrix, Mar 06 '16 at 14:53
Note that it would probably have been best to add the new code that's in your comment into the question. Amongst other reasons, it can be formatted for legibility, which it certainly can't in a comment. I've included it in my answer, so it is available in a legible form. — Jonathan Leffler, Mar 06 '16 at 15:39

score 1 · Answer 1 · answered Mar 06 '16 at 13:27

This works with the example you provided and was checked with Valgrind for memory leaks, but it might require further testing.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned tokens_find_amount( char const* const string, char const* const delim )
{
    unsigned counter = 0;
    char const* pos = string;
    while( pos != NULL )
    {
        if( ( pos = strstr( pos, delim ) ) != NULL )
        {
            pos++;
            counter++;
        }
    }

    return counter;
}

void tokens_remove_duplicate( char** const tokens, unsigned tokens_num )
{
    for( unsigned i = 0; i < tokens_num; i++ )
    {
        for( unsigned j = 0; j < tokens_num; j++ )
        {
            if( tokens[i] == NULL || tokens[j] == NULL || i == j )
                continue;

            if( strcmp( tokens[i], tokens[j] ) == 0 )
            {
                free( tokens[i] );
                tokens[i] = NULL;
            }
        }
    }
}

void tokens_split( char const* const string, char const* const delim, char** tokens )
{
    unsigned counter = 0;
    char const* pos, *lastpos;
    lastpos = string;
    pos = string + 1;

    while( pos != NULL )
    {
        if( ( pos = strstr( pos, delim ) ) != NULL )
        {
            *(tokens++) = strndup( lastpos, (unsigned long )( pos - lastpos ));
            lastpos = pos;
            pos++;
            counter++;
            continue;
        }

        *(tokens++) = strdup( lastpos );
    }
}

void tokens_free( char** tokens, unsigned tokens_number )
{
    for( unsigned i = 0; i < tokens_number; ++i )
    {
        free( tokens[ i ] );
    }
}

void tokens_print( char** tokens, unsigned tokens_number )
{
    for( unsigned i = 0; i < tokens_number; ++i )
    {
        if( tokens[i] == NULL )
            continue;
        printf( "%s ", tokens[i] );
    }
}

int main(void)
{
    char const* buf = "start-12-3.endstart-12-4.endstart-13-3.endstart-12-4.end";
    char const* const delim = "start";

    unsigned tokens_number = tokens_find_amount( buf, delim );
    char** tokens = malloc( tokens_number * sizeof( char* ) );
    tokens_split( buf, delim, tokens );

    tokens_remove_duplicate( tokens, tokens_number );
    tokens_print( tokens, tokens_number );

    tokens_free( tokens, tokens_number );
    free( tokens );

    return 0;
}

Nice work. What follows are nit picks: minor issues. You don't check memory allocations. Your duplicate elimination doesn't report the number of unique values. You should probably compact the array so there aren't holes in it (for your print routine to skip). Your code doesn't work with the separate start and end markers; it relies on them being contiguous, and the start marker not appearing in the text between start and end marker. (That is, your code incorrectly splits `"start-startendstart-start2end"` because it only looks for start and not for end too.) It's OK on the sample data. — Jonathan Leffler, Mar 07 '16 at 15:08
Hmmm: interesting — POSIX _does_ specify [`strnlen()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strnlen.html) and [`strndup()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strndup.html); they're going to be more widely available than I realized. I hadn't noticed them in the man pages before. — Jonathan Leffler, Mar 07 '16 at 15:12
Incidentally, you're making good use of functions, too — that's important. You do not want everything in a single main program; it gets unwieldy quite quickly. — Jonathan Leffler, Mar 07 '16 at 15:21

Jonathan Leffler · Accepted Answer · 2017-05-24T19:24:38.470

Basic splitting — identifying the strings

In a comment, I suggested:

Use strstr() to locate occurrences of your start and end markers. Then use memmove() (or memcpy()) to copy parts of the strings around. Note that since your start and end markers are adjacent in the original string, you can't simply insert extra characters into it — which is also why you can't use strtok(). So, you'll have to make a copy of the original string.

Another problem with strtok() is that it looks for any one of the delimiter characters; it does not look for the characters in sequence. strtok() also modifies its input string, zapping the delimiter it finds, which is clearly not what you need. Generally, IMO, strtok() is only a source of headaches and seldom an answer to a problem. If you must use something like strtok(), use POSIX strtok_r() or Microsoft's strtok_s(). Microsoft's function is essentially the same as strtok_r() except for the spelling of the function name. (The Standard C Annex K version of strtok_s() is different from both POSIX and Microsoft; see the question "Do you use the TR 24731 'safe' functions?")
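
To make the delimiter-set behaviour concrete, here is a minimal sketch. The helper name count_strtok_tokens is hypothetical (it is not a library function); it simply counts how many pieces strtok() produces when "start" is passed as the delimiter argument.

```c
#include <string.h>

/* Hypothetical helper (not a library function): counts the tokens that
 * strtok() yields for `text` when `delims` is the delimiter argument.
 * Note that strtok() overwrites delimiter characters in `text` with
 * null bytes, so `text` must be a modifiable array. */
static int count_strtok_tokens(char *text, const char *delims)
{
    int count = 0;
    for (char *tok = strtok(text, delims); tok != NULL;
         tok = strtok(NULL, delims))
    {
        count++;
    }
    return count;
}
```

Called on a modifiable copy of "start-test.end" with "start" as the delimiter argument, this should report three tokens ("-", "e" and ".end"), because each of the characters s, t, a and r acts as a separator on its own, wherever it appears.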

In another comment, I noted:

Use strstr() again, starting from where the start portion ends, to find the next end marker. Then, knowing the start of the whole section, and the start of the end and the length of the end, you can arrange to copy precisely the correct number of characters into the new string, and then null terminate if that's appropriate, or comma terminate. Something like:
if ((start = strstr(source, "start")) != 0 && ((end = strstr(start, "end")) != 0)
then the data is between start and end + 2 (inclusive) in your source string. Repeat starting from the character after the end of 'end'.

You then said:

I've tried following code but it doesn't work fine; would u please tell me what's wrong with it?

#include <stdio.h>
#include <string.h>


int main(int argc, char **argv)
{
    char string[] = "This-one.testthis-two.testthis-three.testthis-two.test";
    int counter = 0;
    while (counter < 4)
    {
        char *result1 = strstr(string, "This");
        int start = result1 - string;
        char *result = strstr(string, "test");
        int end = result - string;
        end += 4;
        printf("\n%s\n", result);
        memmove(result, result1, end += 4);
        counter++;
    }
}

I observed:

The main problem appears to be searching for This with a capital T but the string only contains a single capital T. You should also look at Is there a way to specify how many characters of a string to print out using printf()?

Even assuming you fix the This vs this glitch, there are other issues.

You print the entire string.
You don't change the starting point for the search.
Your moving code adds 4 to end a second time.
You don't use start.
The code should print from result1, not result.

With those fixed, the code runs but produces:

testthis-two.testthis-three.testthis-two.test

testtestthis-three.testthis-two.test

testtthis-two.test

test?

and a core dump (segmentation fault).

Code identifying the strings

This is what I created, based on a mix of your code and my commentary:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char string[] = "this-one.testthis-two.testthis-three.testthis-two.test";
    int counter = 0;
    const char *b_token = "this";
    const char *e_token = "test";
    int e_len = strlen(e_token);
    char *buffer = string;
    char *b_mark;
    char *e_mark;
    while ((b_mark = strstr(buffer, b_token)) != 0 &&
           (e_mark = strstr(b_mark, e_token)) != 0)
    {
        int length = e_mark + e_len - b_mark;
        printf("%d: %.*s\n", ++counter, length, b_mark);
        buffer = e_mark + e_len;
    }
    return 0;
}

Clearly, this code does no moving of data, but being able to isolate the data to be moved is a key first step to completing that part of the exercise. Extending it to make copies of the strings so that they can be compared is fairly easy. If it is available to you, the strndup() function will be useful:

char *strndup(const char *s1, size_t n);
The strndup() function copies at most n characters from the string s1 always NUL terminating the copied string.

If you don't have it available, it is pretty straight-forward to implement, though it is more straight-forward if you have strnlen() available:

size_t strnlen(const char *s, size_t maxlen);
The strnlen() function attempts to compute the length of s, but never scans beyond the first maxlen bytes of s.

Neither of these is a standard C library function, but they're defined as part of POSIX (strnlen() and strndup()) and are available on BSD and Mac OS X; Linux has them, and probably other versions of Unix do too. The specifications shown are quotes from the Mac OS X man pages.
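
If neither function is available, fallbacks along the following lines might do. This is only a sketch: the names my_strnlen and my_strndup are hypothetical, chosen to avoid clashing with any system versions, and the behaviour follows the man page descriptions quoted above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fallback for strnlen(): the length of s, never scanning
 * beyond the first maxlen bytes. */
static size_t my_strnlen(const char *s, size_t maxlen)
{
    size_t len = 0;
    while (len < maxlen && s[len] != '\0')
        len++;
    return len;
}

/* Hypothetical fallback for strndup(): copies at most n characters from s
 * into newly allocated memory and always null terminates the copy.
 * Returns NULL if the allocation fails; the caller must free() the result. */
static char *my_strndup(const char *s, size_t n)
{
    size_t len = my_strnlen(s, n);
    char *copy = malloc(len + 1);       /* +1 for the null byte */
    if (copy != NULL)
    {
        memcpy(copy, s, len);
        copy[len] = '\0';
    }
    return copy;
}
```

In the loop above, a call such as my_strndup(b_mark, length) would give a null-terminated copy of the section that printf() currently prints with %.*s.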

Example output:

I called the program stst (for start-stop).

$ ./stst
1: this-one.test
2: this-two.test
3: this-three.test
4: this-two.test
$

There are multiple features to observe:

Since main() ignores its arguments, I removed the arguments (my default compiler options won't allow unused arguments).
I case-corrected the string.
I set up constant strings b_token and e_token for the beginning and end markers. The names are symmetric deliberately. This could readily be transplanted into a function where the tokens are arguments to the function, for example.
Similarly I created the b_mark and e_mark variables for the positions of the begin and end markers.
The name buffer is a pointer to where to start searching.
The loop uses the test I outlined in the comments, adapted to the chosen names.
The printing code determines how long the found string is and prints only that data. It prints the counter value.
The reinitialization code skips all the previously printed material.
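
The point about transplanting the search into a function that takes the tokens as arguments could be sketched roughly as follows. The function name next_section and its interface are assumptions made for illustration only. As a small refinement, the sketch starts looking for the end marker after the begin marker rather than at it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the next section at or after `buffer` that
 * starts with b_token and ends with e_token.  On success, stores the start
 * of the section in *begin and its length (both markers included) in
 * *length, and returns the position at which to resume searching.
 * Returns NULL if no complete section remains. */
static const char *next_section(const char *buffer, const char *b_token,
                                const char *e_token, const char **begin,
                                size_t *length)
{
    const char *b_mark = strstr(buffer, b_token);
    if (b_mark == NULL)
        return NULL;

    /* Search for the end marker after the begin marker. */
    const char *e_mark = strstr(b_mark + strlen(b_token), e_token);
    if (e_mark == NULL)
        return NULL;

    const char *after = e_mark + strlen(e_token);
    *begin = b_mark;
    *length = (size_t)(after - b_mark);
    return after;
}
```

A caller would loop while the return value is not NULL, feeding it back in as the next buffer and printing each section with printf("%.*s\n", (int)length, begin).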

Command line options for generality

You could generalize the code a bit by accepting command line arguments and processing each of those in turn if any are provided; you'd use the string you provide as a default when no string is provided. A next level beyond that would allow you to specify something like:

./stst -b beg -e end 'kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road'

and you'd get output such as:

1: beg-waffles-end
2: beg-for-mercy-end

Here's code that implements that, using the POSIX getopt().

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char string[] = "this-one.testthis-two.testthis-three.testthis-two.test";
    const char *b_token = "this";
    const char *e_token = "test";
    int opt;
    int b_len;
    int e_len;

    while ((opt = getopt(argc, argv, "b:e:")) != -1)
    {
        switch (opt)
        {
        case 'b':
            b_token = optarg;
            break;
        case 'e':
            e_token = optarg;
            break;
        default:
            fprintf(stderr, "Usage: %s [-b begin][-e end] ['beginning-to-end...' ...]\n", argv[0]);
            return 1;
        }
    }

    /* Use string if no argument supplied */
    if (optind == argc)
    {
        argv[argc-1] = string;
        optind = argc - 1;
    }

    b_len = strlen(b_token);
    e_len = strlen(e_token);

    printf("Begin: (%d) [%s]\n", b_len, b_token);
    printf("End:   (%d) [%s]\n", e_len, e_token);

    for (int i = optind; i < argc; i++)
    {
        char *buffer = argv[i];
        int counter = 0;
        char *b_mark;
        char *e_mark;
        printf("Analyzing: [%s]\n", buffer);
        while ((b_mark = strstr(buffer, b_token)) != 0 &&
               (e_mark = strstr(b_mark + b_len, e_token)) != 0)
        {
            int length = e_mark + e_len - b_mark;
            printf("%d: %.*s\n", ++counter, length, b_mark);
            buffer = e_mark + e_len;
        }
    }
    return 0;
}

Note how this program documents what it is doing, printing out the control information. That can be very important during debugging — it helps ensure that the program is working on the data you expect it to be working on. The searching is better too; it works correctly with the same string as the start and end marker (or where the end marker is a part of the start marker), which the previous version did not (because this version uses b_len, the length of b_token, in the second strstr() call). Both versions are quite happy with adjacent end and start tokens, but they're equally happy to skip material between an end token and the next start token.
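
The effect of adding b_len in the second strstr() call can be checked with a small counting function. This is a sketch only; count_sections and its skip_begin flag are hypothetical names introduced here to switch between the two search strategies.

```c
#include <string.h>

/* Hypothetical helper: count the sections delimited by b_token and e_token.
 * If skip_begin is nonzero, the end marker is searched for only after the
 * begin marker (as in the getopt() version); otherwise the search starts at
 * the begin marker itself (as in the first version). */
static int count_sections(const char *buffer, const char *b_token,
                          const char *e_token, int skip_begin)
{
    size_t b_len = strlen(b_token);
    size_t e_len = strlen(e_token);
    int count = 0;
    const char *b_mark;
    const char *e_mark;

    while ((b_mark = strstr(buffer, b_token)) != NULL &&
           (e_mark = strstr(b_mark + (skip_begin ? b_len : 0), e_token)) != NULL)
    {
        count++;
        buffer = e_mark + e_len;    /* Resume after the end marker */
    }
    return count;
}
```

With th as both markers on the default string, the skip_begin variant finds 2 sections, in line with the -b th -e th example run. Without the offset, every th matches itself as the end marker, so the count is 5 and each "section" is just the two-character marker.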

Example runs:

$ ./stst -b beg -e end 'kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road'
Begin: (3) [beg]
End:   (3) [end]
Analyzing: [kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road]
1: beg-waffles-end
2: beg-for-mercy-end
$ ./stst -b th -e th
Begin: (2) [th]
End:   (2) [th]
Analyzing: [this-one.testthis-two.testthis-three.testthis-two.test]
1: this-one.testth
2: this-th
$ ./stst -b th -e te
Begin: (2) [th]
End:   (2) [te]
Analyzing: [this-one.testthis-two.testthis-three.testthis-two.test]
1: this-one.te
2: this-two.te
3: this-three.te
4: this-two.te
$

After update to question

You have to account for the trailing null byte by allocating enough space for length + 1 bytes. Using strncpy() is fine but in this context guarantees that the string is not null terminated; you must null terminate it.

Your duplicate elimination code, commented out, was not particularly good — too many null checks when none should be necessary. I've created a print function; the tag argument allows it to identify which set of data it is printing. I should have put the 'free' loop into a function. The duplicate elimination code could (should) be in a function; the string extraction code could (should) be in a function — as in the answer by pikkewyn. I extended the test data (string concatenation is wonderful in contexts like this).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void dump_strings(const char *tag, char **strings, int num_str)
{
    printf("%s (%d):\n", tag, num_str);
    for (int i = 0; i < num_str; i++)
        printf("%d: %s\n", i, strings[i]);
    putchar('\n');
}

int main(void)
{
    char string[] =
        "this-one.testthis-two.testthis-three.testthis-two.testthis-one.test"
        "this-1-testthis-1-testthis-2-testthis-1-test"
        "this-1-testthis-1-testthis-1-testthis-1-test"
        ;
    const char *b_token = "this";
    const char *e_token = "test";
    int b_len = strlen(b_token);
    int e_len = strlen(e_token);
    char *buffer = string;
    char *b_mark;
    char *e_mark;
    char *a[50];
    int num_str = 0;

    while (num_str < (int)(sizeof(a) / sizeof(a[0])) &&   // Don't overflow a[]
           (b_mark = strstr(buffer, b_token)) != 0 &&
           (e_mark = strstr(b_mark + b_len, e_token)) != 0)
    {
        int length = e_mark + e_len - b_mark;
        char *s = (char *) malloc(length + 1);    // Allow for null
        if (s == NULL)                  // Stop collecting if allocation fails
            break;
        strncpy(s, b_mark, length);
        s[length] = '\0';               // Null terminate the string
        a[num_str++] = s;
        buffer = e_mark + e_len;
    }

    dump_strings("After splitting", a, num_str);

    //remove duplicate strings
    for (int i = 0; i < num_str; i++)
    {
        for (int j = i + 1; j < num_str; j++)
        {
            if (strcmp(a[i], a[j]) == 0)
            {
                free(a[j]);             // Free the higher-indexed duplicate
                a[j] = a[--num_str];    // Move the last element here
                j--;                    // Examine the new string next time
            }
        }
    }

    dump_strings("After duplicate elimination", a, num_str);

    for (int i = 0; i < num_str; i++)
        free(a[i]);

    return 0;
}

Testing with valgrind gives this a clean bill of health: no memory faults, no leaked data.

Sample output:

After splitting (13):
0: this-one.test
1: this-two.test
2: this-three.test
3: this-two.test
4: this-one.test
5: this-1-test
6: this-1-test
7: this-2-test
8: this-1-test
9: this-1-test
10: this-1-test
11: this-1-test
12: this-1-test

After duplicate elimination (5):
0: this-one.test
1: this-two.test
2: this-three.test
3: this-1-test
4: this-2-test
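
As noted above, the duplicate elimination could live in a function. A minimal sketch of that refactoring follows; the name remove_duplicates is hypothetical, and it assumes every string in the array was allocated with malloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: remove duplicates from an array of malloc()'d
 * strings, freeing each duplicate and compacting the array.
 * Returns the number of unique strings left at the front of the array.
 * Because holes are filled with the last element, the original order is
 * not guaranteed to be preserved. */
static int remove_duplicates(char **strings, int num_str)
{
    for (int i = 0; i < num_str; i++)
    {
        for (int j = i + 1; j < num_str; j++)
        {
            if (strcmp(strings[i], strings[j]) == 0)
            {
                free(strings[j]);                   /* Free the duplicate */
                strings[j] = strings[--num_str];    /* Fill the hole with the last element */
                j--;                                /* Re-examine the moved element */
            }
        }
    }
    return num_str;
}
```

In main(), the nested loops would then shrink to num_str = remove_duplicates(a, num_str);.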

Hi, I can not put them into an array based on the first code, right? to do so it needs to save output into a variable, then put it into array? — Matrix, Mar 07 '16 at 13:32
The code above allows you to identify the data you need to put into an array. You know how big the string must be; that's the calculated length. It doesn't include a null byte. You allocate enough space for the new string (remember the null byte), and then use a function to copy the data to that space and add the all-important null byte. You add the string to your array of pointers. You can decide whether you sort the data as you go (a heap sort might be appropriate) and at some point you deal with duplicates too. Or you can store in the array and sort at the end. — Jonathan Leffler, Mar 07 '16 at 14:51
sorry what do you mean of allocate enough space for the new string (remember the null byte)? for copying I can use strcpy function, if I'm right. — Matrix, Mar 07 '16 at 15:35
You need to show your best code. Do you know how to use `malloc()` and `free()`? If not, we've got some work to do — but that's why it is crucial to show your code; it tells us what you know about. You can't really use `strcpy()`; it stops at a null byte, but the strings you need to copy don't have a null byte at the point where you want copying to stop. (You should not try adding one either; that is really icky, and won't work on string literals but your code should be able to do that.) You might use `strncpy()`, or `memmove()` or `memcpy()` as I mentioned at the start. — Jonathan Leffler, Mar 07 '16 at 15:39
Strings are a sequence of characters terminated by a null byte. To be able to use functions like `strcmp()` — and `strcpy()` — and to use simple `printf()` formats like `%s` instead of needing `%.*s`, you must ensure that your strings are null terminated. — Jonathan Leffler, Mar 07 '16 at 15:40
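
The "store in the array and sort at the end" option mentioned in the comments could look roughly like this. It is a sketch under the assumption that all strings were allocated with malloc(); the names compare_strings and sort_and_unique are hypothetical. It uses the standard qsort() and then drops adjacent duplicates, so the result comes out in sorted order rather than in order of first appearance.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array whose elements are char pointers. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort an array of malloc()'d strings, then free and
 * drop adjacent duplicates.  Returns the number of unique strings, which
 * end up at the front of the array in sorted order. */
static int sort_and_unique(char **strings, int num_str)
{
    if (num_str <= 0)
        return 0;

    qsort(strings, (size_t)num_str, sizeof(strings[0]), compare_strings);

    int kept = 1;                       /* strings[0] is always kept */
    for (int i = 1; i < num_str; i++)
    {
        if (strcmp(strings[kept - 1], strings[i]) == 0)
            free(strings[i]);           /* Same as the last kept string */
        else
            strings[kept++] = strings[i];
    }
    return kept;
}
```

Sorting first makes the duplicate pass linear, at the cost of losing the original order; the nested-loop approach in the answer is quadratic but roughly keeps the order in which strings were found.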
