Basic splitting — identifying the strings
In a comment, I suggested:
Use strstr() to locate occurrences of your start and end markers. Then use memmove() (or memcpy()) to copy parts of the strings around. Note that since your start and end markers are adjacent in the original string, you can't simply insert extra characters into it, which is also why you can't use strtok(). So, you'll have to make a copy of the original string.
Another problem with strtok() is that it looks for any one of the delimiter characters; it does not look for the characters in sequence. In addition, strtok() modifies its input string, zapping the delimiter it finds, which is clearly not what you need. Generally, IMO, strtok() is only a source of headaches and seldom an answer to a problem. If you must use something like strtok(), use POSIX strtok_r() or Microsoft's strtok_s(). Microsoft's function is essentially the same as strtok_r() except for the spelling of the function name. (The Standard C Annex K version of strtok_s() is different from both POSIX and Microsoft; see Do you use the TR 24731 'safe' functions?)
In another comment, I noted:
Use strstr() again, starting from where the start portion ends, to find the next end marker. Then, knowing the start of the whole section, and the start of the end and the length of the end, you can arrange to copy precisely the correct number of characters into the new string, and then null terminate if that's appropriate, or comma terminate. Something like:
if ((start = strstr(source, "start")) != 0 && (end = strstr(start, "end")) != 0)
then the data is between start and end + 2 (inclusive) in your source string. Repeat starting from the character after the end of 'end'.
You then said:
I've tried following code but it doesn't work fine; would u please tell me what's wrong with it?
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char string[] = "This-one.testthis-two.testthis-three.testthis-two.test";
    int counter = 0;
    while (counter < 4)
    {
        char *result1 = strstr(string, "This");
        int start = result1 - string;
        char *result = strstr(string, "test");
        int end = result - string;
        end += 4;
        printf("\n%s\n", result);
        memmove(result, result1, end += 4);
        counter++;
    }
}
I observed:
The main problem appears to be searching for This with a capital T but the string only contains a single capital T. You should also look at Is there a way to specify how many characters of a string to print out using printf()?
Even assuming you fix the This vs this glitch, there are other issues.
- You print the entire string.
- You don't change the starting point for the search.
- Your moving code adds 4 to end a second time.
- You don't use start.
- The code should print from result1, not result.
With those fixed, the code runs but produces:
testthis-two.testthis-three.testthis-two.test
testtestthis-three.testthis-two.test
testtthis-two.test
test?
and a core dump (segmentation fault).
Code identifying the strings
This is what I created, based on a mix of your code and my commentary:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char string[] = "this-one.testthis-two.testthis-three.testthis-two.test";
    int counter = 0;
    const char *b_token = "this";
    const char *e_token = "test";
    int e_len = strlen(e_token);
    char *buffer = string;
    char *b_mark;
    char *e_mark;

    while ((b_mark = strstr(buffer, b_token)) != 0 &&
           (e_mark = strstr(b_mark, e_token)) != 0)
    {
        int length = e_mark + e_len - b_mark;
        printf("%d: %.*s\n", ++counter, length, b_mark);
        buffer = e_mark + e_len;
    }
    return 0;
}
Clearly, this code does no moving of data, but being able to isolate the data to be moved is a key first step to completing that part of the exercise. Extending it to make copies of the strings so that they can be compared is fairly easy. If it is available to you, the strndup() function will be useful:
char *strndup(const char *s1, size_t n);
The strndup() function copies at most n characters from the string s1 always NUL terminating the copied string.
If you don't have it available, it is pretty straightforward to implement, though it is more straightforward if you have strnlen() available:
size_t strnlen(const char *s, size_t maxlen);
The strnlen() function attempts to compute the length of s, but never scans beyond the first maxlen bytes of s.
Neither of these is a standard C library function, but they're defined as part of POSIX (strnlen() and strndup()) and are available on BSD and Mac OS X; Linux has them, and probably other versions of Unix do too. The specifications shown are quotes from the Mac OS X man pages.
Example output:
I called the program stst (for start-stop).
$ ./stst
1: this-one.test
2: this-two.test
3: this-three.test
4: this-two.test
$
There are multiple features to observe:
- Since main() ignores its arguments, I removed the arguments (my default compiler options won't allow unused arguments).
- I case-corrected the string.
- I set up constant strings b_token and e_token for the beginning and end markers. The names are symmetric deliberately. This could readily be transplanted into a function where the tokens are arguments to the function, for example.
- Similarly I created the b_mark and e_mark variables for the positions of the begin and end markers.
- The name buffer is a pointer to where to start searching.
- The loop uses the test I outlined in the comments, adapted to the chosen names.
- The printing code determines how long the found string is and prints only that data. It prints the counter value.
- The reinitialization code skips all the previously printed material.
Command line options for generality
You could generalize the code a bit by accepting command line arguments and processing each of those in turn if any are provided; you'd use the string you provide as a default when no string is provided. A next level beyond that would allow you to specify something like:
./stst -b beg -e end 'kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road'
and you'd get output such as:
1: beg-waffles-end
2: beg-for-mercy-end
Here's code that implements that, using the POSIX getopt().
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char string[] = "this-one.testthis-two.testthis-three.testthis-two.test";
    const char *b_token = "this";
    const char *e_token = "test";
    int opt;
    int b_len;
    int e_len;

    while ((opt = getopt(argc, argv, "b:e:")) != -1)
    {
        switch (opt)
        {
        case 'b':
            b_token = optarg;
            break;
        case 'e':
            e_token = optarg;
            break;
        default:
            fprintf(stderr, "Usage: %s [-b begin][-e end] ['beginning-to-end...' ...]\n", argv[0]);
            return 1;
        }
    }

    /* Use string if no argument supplied */
    if (optind == argc)
    {
        argv[argc-1] = string;
        optind = argc - 1;
    }

    b_len = strlen(b_token);
    e_len = strlen(e_token);
    printf("Begin: (%d) [%s]\n", b_len, b_token);
    printf("End: (%d) [%s]\n", e_len, e_token);

    for (int i = optind; i < argc; i++)
    {
        char *buffer = argv[i];
        int counter = 0;
        char *b_mark;
        char *e_mark;

        printf("Analyzing: [%s]\n", buffer);
        while ((b_mark = strstr(buffer, b_token)) != 0 &&
               (e_mark = strstr(b_mark + b_len, e_token)) != 0)
        {
            int length = e_mark + e_len - b_mark;
            printf("%d: %.*s\n", ++counter, length, b_mark);
            buffer = e_mark + e_len;
        }
    }
    return 0;
}
Note how this program documents what it is doing, printing out the control information. That can be very important during debugging because it helps ensure that the program is working on the data you expect it to be working on. The searching is better too; it works correctly with the same string as the start and end marker (or where the end marker is a part of the start marker), which the previous version did not (because this version uses b_len, the length of b_token, in the second strstr() call). Both versions are quite happy with adjacent end and start tokens, but they're equally happy to skip material between an end token and the next start token.
Example runs:
$ ./stst -b beg -e end 'kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road'
Begin: (3) [beg]
End: (3) [end]
Analyzing: [kalamazoo-beg-waffles-end-tripe-beg-for-mercy-end-of-the-road]
1: beg-waffles-end
2: beg-for-mercy-end
$ ./stst -b th -e th
Begin: (2) [th]
End: (2) [th]
Analyzing: [this-one.testthis-two.testthis-three.testthis-two.test]
1: this-one.testth
2: this-th
$ ./stst -b th -e te
Begin: (2) [th]
End: (2) [te]
Analyzing: [this-one.testthis-two.testthis-three.testthis-two.test]
1: this-one.te
2: this-two.te
3: this-three.te
4: this-two.te
$
After update to question
You have to account for the trailing null byte by allocating enough space for length + 1 bytes. Using strncpy() is fine but in this context guarantees that the string is not null terminated; you must null terminate it.
Your duplicate elimination code, commented out, was not particularly good — too many null checks when none should be necessary. I've created a print function; the tag argument allows it to identify which set of data it is printing. I should have put the 'free' loop into a function. The duplicate elimination code could (should) be in a function; the string extraction code could (should) be in a function — as in the answer by pikkewyn. I extended the test data (string concatenation is wonderful in contexts like this).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void dump_strings(const char *tag, char **strings, int num_str)
{
    printf("%s (%d):\n", tag, num_str);
    for (int i = 0; i < num_str; i++)
        printf("%d: %s\n", i, strings[i]);
    putchar('\n');
}

int main(void)
{
    char string[] =
        "this-one.testthis-two.testthis-three.testthis-two.testthis-one.test"
        "this-1-testthis-1-testthis-2-testthis-1-test"
        "this-1-testthis-1-testthis-1-testthis-1-test"
        ;
    const char *b_token = "this";
    const char *e_token = "test";
    int b_len = strlen(b_token);
    int e_len = strlen(e_token);
    char *buffer = string;
    char *b_mark;
    char *e_mark;
    char *a[50];
    int max_str = sizeof(a) / sizeof(a[0]);
    int num_str = 0;

    // Stop when the array is full or no further begin/end pair is found
    while (num_str < max_str &&
           (b_mark = strstr(buffer, b_token)) != 0 &&
           (e_mark = strstr(b_mark + b_len, e_token)) != 0)
    {
        int length = e_mark + e_len - b_mark;
        char *s = (char *) malloc(length + 1);  // Allow for null
        if (s == 0)
        {
            fprintf(stderr, "Out of memory\n");
            break;
        }
        strncpy(s, b_mark, length);
        s[length] = '\0';                       // Null terminate the string
        a[num_str++] = s;
        buffer = e_mark + e_len;
    }
    dump_strings("After splitting", a, num_str);

    //remove duplicate strings
    for (int i = 0; i < num_str; i++)
    {
        for (int j = i + 1; j < num_str; j++)
        {
            if (strcmp(a[i], a[j]) == 0)
            {
                free(a[j]);             // Free the higher-indexed duplicate
                a[j] = a[--num_str];    // Move the last element here
                j--;                    // Examine the new string next time
            }
        }
    }
    dump_strings("After duplicate elimination", a, num_str);

    for (int i = 0; i < num_str; i++)
        free(a[i]);
    return 0;
}
Testing with valgrind gives this a clean bill of health: no memory faults, no leaked data.
Sample output:
After splitting (13):
0: this-one.test
1: this-two.test
2: this-three.test
3: this-two.test
4: this-one.test
5: this-1-test
6: this-1-test
7: this-2-test
8: this-1-test
9: this-1-test
10: this-1-test
11: this-1-test
12: this-1-test
After duplicate elimination (5):
0: this-one.test
1: this-two.test
2: this-three.test
3: this-1-test
4: this-2-test