How to replace a substring in a string?

Question

I have a string, and in it I need to find a substring and replace it. The one to be found and the one that'll replace it are of different length. My code, partially:

char *source_str = "aaa bbb CcCc dddd kkkk xxx yyyy";
char *pattern = "cccc";
char *new_sub_s = "mmmmm4343afdsafd";

char *sub_s1 = strcasestr(source_str, pattern);

printf("sub_s1: %s\r\n", sub_s1);
printf("sub_str before pattern: %s\r\n", sub_s1 - source_str); // Memory corruption

char *new_str = (char *)malloc(strlen(source_str) - strlen(pattern) + strlen(new_sub_s) + 1);

strcat(new_str, '\0');
strcat(new_str, "??? part before pattern ???");
strcat(new_str, new_sub_s);
strcat(new_str, "??? part after pattern ???");

Why do I have memory corruption?
How do I effective extract and replace pattern with new_sub_s?

Better look here first https://www.geeksforgeeks.org/c-program-replace-word-text-another-given-word/ — FEldin, Aug 18 '20 at 15:21
@kosmosu two problems, `strcat()` assumes `new_str` is null-terminated, which as user3121023 pointed out, isn't the case with `malloc()`. Secondly, `'\0'` is a `char`, not a `char*`. — Patrick Roberts, Aug 18 '20 at 15:28

chqrlie · Accepted Answer · 2020-08-18T15:53:34.453

3

There are multiple problems in your code:

you do not test if sub_s1 was found in the string. What if there is no match?
printf("sub_str before pattern: %s\r\n", sub_s1 - source_str); passes a difference of pointers for %s that expects a string. The behavior is undefined.
strcat(new_str, '\0'); has undefined behavior because the destination string is uninitialized and you pass a null pointer as the string to concatenate. strcat expects a string pointer as its second argument, not a char, and '\0' is a character constant with type int (in C) and value 0, which the compiler will convert to a null pointer, with or without a warning. You probably meant to write *new_str = '\0';

You cannot compose the new string with strcat as posted: because the string before the match is not a C string, it is a fragment of a C string. You should instead determine the lengths of the different parts of the source string and use memcpy to copy fragments with explicit lengths.

Here is an example:

char *patch_string(const char *source_str, const char *pattern, const char *replacement) {
    char *match = strcasestr(source_str, pattern);
    if (match != NULL) {
        size_t len = strlen(source_str);
        size_t n1 = match - source_str;   // # bytes before the match
        size_t n2 = strlen(pattern);      // # bytes in the pattern string
        size_t n3 = strlen(replacement);  // # bytes in the replacement string
        size_t n4 = len - n1 - n2;        // # bytes after the pattern in the source string
        char *result = malloc(n1 + n3 + n4 + 1);
        if (result != NULL) {
            // copy the initial portion
            memcpy(result, source_str, n1);
            // copy the replacement string
            memcpy(result + n1, replacement, n3);
            // copy the trailing bytes, including the null terminator
            memcpy(result + n1 + n3, match + n2, n4 + 1);
        }
        return result;
    } else {
        return strdup(source_str);  // always return an allocated string
    }
}

Note that the above code assumes that the match in the source string has be same length as the pattern string (in the example, strings "cccc" an "CcCc" have the same length). Given that strcasestr is expected to perform a case independent search, which is confirmed by the example strings in the question, it might be possible that this assumption fail, for example if the encoding of upper and lower case letters have a different length, or if accents are matched by strcasestr as would be expected in French: "é" and "E" should match but have a different length when encoded in UTF-8. If strcasestr has this advanced behavior, it is not possible to determine the length of the matched portion of the source string without a more elaborate API.

edited Aug 18 '20 at 15:53

answered Aug 18 '20 at 15:23

chqrlie

131,814
10
121
189

`You cannot compose the new string with strcat as posted.` why not? – kosmosu Aug 18 '20 at 15:26
@kosmosu: I expanded the answer with more explanations. – chqrlie Aug 18 '20 at 15:30
Don't know whether this was just an oversight, but `strcat(new_str, '\0');` shouldn't even compile. – Patrick Roberts Aug 18 '20 at 15:33
1

@PatrickRoberts: `'\0'` is a character constant with type `int` and value `0`, hence converts to a null character pointer if the prototype for `strcat()` is in scope. This is utmostly confusing, and might be flagged by the compiler with a warning. I shall be more explicit about this error. – chqrlie Aug 18 '20 at 15:41
@chqrlie you're right, I forgot about that particular footgun... – Patrick Roberts Aug 18 '20 at 15:50
why return a newly allocated string instead of the original one? – kosmosu Aug 18 '20 at 16:05
@kosmosu: because the caller would not know whether to free the return value or not after use. `patch_string` does not modify any of the argument strings and returns a new string that should be freed after use. – chqrlie Aug 18 '20 at 16:07
@chqrlie but if I pass a dynamically allocated string to `patch_string()` as a source string, I'll need to free both: a source string and a returned one, correct? – kosmosu Aug 19 '20 at 15:30
@kosmosu: The above implementation of `patch_string()` does not care if the argument strings are allocated or not. One could imagine an alternate version that takes an allocated source string and return it if unmodified or frees it and return a new one if the replacement is done. It is usually easier to use the above approach instead of the latter but both require careful life time analysis of the source and result strings. Another approach would construct the modified string into a buffer whose address and size are received as arguments. – chqrlie Aug 19 '20 at 16:54
@chqrlie but if I pass a dynamically allocated string to patch_string() as a source string, I'll need to free both: a source string and a returned one, correct? – kosmosu Aug 19 '20 at 17:21
@kosmosu: yes you should free both when you no longer need them. Note however that on most systems, there is no need to free all allocated objects before exiting a program. – chqrlie Aug 19 '20 at 19:18
@chqrlie really? why no need? even if an app is supposed to run for weeks, months and non stop? – kosmosu Aug 20 '20 at 05:30
@kosmosu: if an app is supposed to run for a long time, or to consume a lot of memory, the programmer must be careful to free allocated memory after use, avoiding rampant memory leaks which can consume resources up to the point of exhaustion causing system instability. Conversely, when an application is about to exit, it is optional to free allocated memory because all memory will be reclaimed by the OS automatically. The main purpose is to detect memory leaks: if the application cannot free all allocated blocks, it may indicate that some are unreachable in potentially large quantities. – chqrlie Aug 20 '20 at 10:37
@chqrlie isn't it obvious that a process doesn't need to free allocated memory when a process is about to exit? – kosmosu Aug 21 '20 at 05:01
@kosmosu: it may be obvious to you and me, but not to everyone and it may even be false on some embedded systems. – chqrlie Aug 21 '20 at 17:44

score 1 · Answer 2 · answered Aug 18 '20 at 15:49

printf("sub_str before pattern: %s\r\n", sub_s1 - source_str); // Memory corruption

You're taking the difference of two pointers, and printing it as though it was a pointer to a string. In practice, on your machine, this probably calculates a meaningless number and interprets it as a memory address. Since this is a small number, when interpreted as an address, on your system, this probably points to unmapped memory, so your program crashes. Depending on the platform, on the compiler, on optimization settings, on what else there is in your program, and on the phase of the Moon, anything could happen. It's undefined behavior.

Any half-decent compiler would tell you that there's a type mismatch between the %s directive and the argument. Turn those warnings on. For example, with GCC:

gcc -Wall -Wextra -Werror -O my_program.c

char *new_str = (char *)malloc(…);
strcat(new_str, '\0');
strcat(new_str, "…");

The first call to strcat attempts to append '\0'. This is a character, not a string. It happens that since this is the character 0, and C doesn't distinguish between characters and numbers, this is just a weird way of writing the integer 0. And any integer constant with the value 0 is a valid way of writing a null pointer constant. So strcat(new_str, '\0') is equivalent to strcat(new_str, NULL) which will probably crash due to attempting to dereference the null pointer. Depending on the compiler optimizations, it's possible that the compiler will think that this block of code is never executed, since it's attempting to dereference a null pointer, and this is undefined behavior: as far as the compiler is concerned, this can't happen. This is a case where you can plausibly expect that the undefined behavior causes the compiler to do something that looks preposterous, but makes perfect sense from the way the compiler sees the program.

Even if you'd written strcat(new_str, "\0") as you probably intended, that would be pointless. Note that "\0" is a pointless way of writing "": there's always a null terminator at the end of a string literal¹. And appending an empty string to a string wouldn't change it.

And there's another problem with the strcat calls. At this point, the content of new_str is not initialized. But strcat (if called correctly, even for strcat(new_str, ""), if the compiler doesn't optimize this away) will explore this uninitialized memory and look for the first null byte. Because the memory is uninitialized, there's no guarantee that there is a null byte in the allocated memory, so strcat may attempt to read from an unmapped address when it runs out of buffer, or it may corrupt whatever. Or it may make demons fly out of your nose: once again it's undefined behavior.

Before you do anything with the newly allocated memory area, make it contain the empty string: set the first character to 0. And before that, check that malloc succeeded. It will always succeed in your toy program, but not in the real world.

char *new_str = malloc(…);
if (new_str == NULL) {
    return NULL; // or whatever you want to do to handle the error
}
new_str[0] = 0;
strcat(new_str, …);

¹ _{The only time there isn't a null pointer at the end of a "…" is when you use this to initialize an array and the characters that are spelled out fill the whole array without leaving room for a null terminator.}

`And there's another problem with the strcat calls. At this point, the content of new_str is not initialized.` --> should have I used calloc instead? — kosmosu, Aug 18 '20 at 16:07
@kosmosu You can, yes. Using `malloc` and initializing the first byte is enough if you know what you're doing. Using `calloc` is marginally slower, but that rarely matters, and it's safer. I work on security-critical code and we forbid `malloc`, we only allow `calloc`. Other projects, especially ones that allocate very large blocks of memory, have different rules. — Gilles 'SO- stop being evil', Aug 18 '20 at 16:09

user3121023 · Answer 3 · 2020-08-18T16:10:46.473

snprintf can be used to calculate the memory needed and then print the string to the allocated pointer.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main ( void) {
    char *source_str = "aaa bbb CcCc dddd kkkk xxx yyyy";
    char *pattern = "cccc";
    char *new_sub_s = "mmmmm4343afdsafd";

    char *sub_s1 = strcasestr(source_str, pattern);
    int span = (int)( sub_s1 - source_str);
    char *tail = sub_s1 + strlen ( pattern);

    size_t size = snprintf ( NULL, 0, "%.*s%s%s", span, source_str, new_sub_s, tail);

    char *new_str = malloc( size + 1);

    snprintf ( new_str, size, "%.*s%s%s", span, source_str, new_sub_s, tail);

    printf ( "%s\n", new_str);

    free ( new_str);

    return 0;
}

GNU systems also have `asprintf()` which is an even more direct solution. — chqrlie, Aug 18 '20 at 16:11

How to replace a substring in a string?

3 Answers3