2

I know C is purposefully bare-bones, but I'm curious as to why something as commonplace as a substring function is not included in <string.h>.

Is it that there is not one "right enough" way to do it? Too many domain specific requirements? Can anyone shed any light?

BTW, this is the substring function I came up with after a bit of research. Edit: I made a few updates based on comments.

void substr (char *outStr, const char *inpStr, int startPos, size_t strLen) {
    /* Cannot do anything with NULL. */
    if (inpStr == NULL || outStr == NULL) return;

    size_t len = strlen (inpStr);

    /* Allow negative positions to count from the end; a position
    before the start of the string is forced to the start. */
    if (startPos < 0) {
        startPos = len + startPos;
    }
    if (startPos < 0) {
        startPos = 0;
    }

    /* Cannot start after the end of the string; force to the end. */
    if ((size_t)startPos > len) {
        startPos = len;
    }

    len = strlen (&inpStr[startPos]);
    /* Adjust length if source string too short. */
    if (strLen > len) {
        strLen = len;
    }

    /* Copy string section */
    memcpy(outStr, inpStr+startPos, strLen);
    outStr[strLen] = '\0';
}

Edit: Based on a comment from r I also came up with this one liner. You're on your own for checks though!

#define substr(dest, src, startPos, strLen) snprintf((dest), BUFF_SIZE, "%.*s", (int)(strLen), (src) + (startPos))
Derek Springer
  • You can use a combination of strtok and strchr to create your own substring type function, but you have to watch it since strtok is destructive to the original string. – Burton Samograd Sep 13 '11 at 17:54
  • 3
    Wouldn't `strncpy` let you do the same thing? – Tom Zych Sep 13 '11 at 17:54
  • 4
    Any question that asks "Why does the X standard not include feature Y" is tricky to answer definitively. – Oliver Charlesworth Sep 13 '11 at 17:55
  • 1
    `strncpy` doesn't do quite what you think it does. I'd use `memcpy` here personally. (Also, the `size_t` type is preferred for array indices and sizes.) – Chris Lutz Sep 13 '11 at 17:57
  • nitpick - size_t is unsigned, so you don't have to check variables of type size_t for being less than zero. – selbie Sep 13 '11 at 18:00
  • @selbie - Yes. It would also lose the OP's clever negative index trick, which he is backporting from higher-level languages but is not really great in C (in my opinion). It's a tradeoff to think about (for the first parameter anyway). – Chris Lutz Sep 13 '11 at 18:03
  • A bit of clarity: I chose int instead of size_t so I could get a Python-like idiom of negative values to count from the end of the string. Perhaps it's not the *purest* version I could have made, but it's what I needed from it. – Derek Springer Sep 13 '11 at 18:04
  • 2
    If you ask 10 C programmers for a specification for a generic substring function you're likely to get 10 different answers. Should it allocate memory? Should it allow negative indexes? Do we need a substringn function that also takes the length of the destination buffer? etc. – user786653 Sep 13 '11 at 18:08
  • 2
    @Tom: `strncpy` is very unlikely to be the right answer to *any* particular problem. – Keith Thompson Sep 13 '11 at 18:16
  • @user786653 - Obviously _my_ ideas about string handling are the _right_ ones. ;) – Chris Lutz Sep 13 '11 at 18:23
  • BTW, I just changed the strncpy to memcpy based on everyone's advice. – Derek Springer Sep 13 '11 at 18:26
  • 1
    For what it's worth, the standard library would put the destination parameter before the source, as with `memcpy` and `strcpy`. – Steve Jessop Sep 13 '11 at 18:43
  • @Steve: Thanks for advice--I'll refrain from making more edits, but change my own source. – Derek Springer Sep 13 '11 at 18:50

6 Answers

7

Basic standard library functions don't burden themselves with expensive safety checks; they leave those to the user. Most of the safety checks in your implementation are of the expensive kind, which is totally unacceptable in such a basic library function. This is C, not Java.

Once you get some checks out of the picture, the "substring" function boils down to an ordinary strlcpy. I.e., ignoring the safety check on startPos, all you need to do is

char *substr(const char *inpStr, char *outStr, size_t startPos, size_t strLen) {
  strlcpy(outStr, inpStr + startPos, strLen + 1); /* size argument counts the terminating '\0' */
  return outStr;
}

strlcpy is not part of the standard library, but it can be crudely replaced by a [misused] strncpy. Again, ignoring the safety check on startPos, all you need to do is

char *substr(const char *inpStr, char *outStr, size_t startPos, size_t strLen) {
  strncpy(outStr, inpStr + startPos, strLen);
  outStr[strLen] = '\0';
  return outStr;
}

Ironically, in your code strncpy is misused in the very same way. On top of that, many of your safety checks are the direct consequence of your choosing a signed type (int) to represent indices, while the proper type would be an unsigned one (size_t).
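
To illustrate the point about unsigned indices, here is a minimal sketch of what the checks shrink to once the index is a size_t. The name substr_sz is hypothetical (it is not a standard function), and the sketch assumes that dst has room for count + 1 bytes and that src is a valid string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of <string.h>: copies at most `count`
   characters of `src`, starting at index `start`, into `dst`.
   Assumes `dst` has room for `count + 1` bytes. */
char *substr_sz(char *dst, const char *src, size_t start, size_t count)
{
    size_t len = strlen(src);

    /* An unsigned index can only be too large, never negative. */
    if (start > len)
        start = len;

    /* Shorten the request if the source runs out first. */
    if (count > len - start)
        count = len - start;

    memcpy(dst, src + start, count);
    dst[count] = '\0';
    return dst;
}
```

With these assumptions, substr_sz(buf, "hamburger", 3, 4) leaves "burg" in buf. The single strlen call is the price of the two remaining checks; dropping them gets you back to the versions above.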

AnT stands with Russia
  • What would the proper way to use it be? – Derek Springer Sep 13 '11 at 17:58
  • I'd rather see `memcpy` here than `strncpy`. – Chris Lutz Sep 13 '11 at 18:01
  • @Chris Lutz: `memcpy` will not stop at terminating `\0` in the input. To use `memcpy` you have to calculate `strlen` first. I agree that `strncpy` is totally misused here, but I was aiming for brevity. – AnT stands with Russia Sep 13 '11 at 18:04
  • @Derek Springer: `strncpy` is a function that converts C-string to fixed-width strings. That's what it is used for. Using it for "safe" string copying is a crime against programming. http://stackoverflow.com/questions/2114896/why-is-strlcpy-and-strlcat-considered-to-be-insecure/2115015#2115015 – AnT stands with Russia Sep 13 '11 at 18:08
  • @AndreyT - I was assuming that `strLen + startPos <= strlen(inpStr)` which is the kind of assumption most C string functions will make. – Chris Lutz Sep 13 '11 at 18:08
  • @Chris Lutz: Hm... I'd say that most standard functions don't make this assumption. All limited-length string functions typically look for two termination conditions: either the length is exhausted or the terminator is encountered. – AnT stands with Russia Sep 13 '11 at 18:10
  • 1
    I like how the void function returns a char *. (I actually thought that function would be better if it returned it anyway for initialization purposes) – Joe Sep 13 '11 at 18:15
  • @AndreyT - That's true. We're assuming (as most functions do) that `outStr` has `strLen + 1` bytes (which may be bad - I'd rather assume `strlen` bytes, but then the substring number will look odd). I suppose my tendency to avoid regular string functions has resulted in some data loss when it comes to their usual functioning. Hmm. – Chris Lutz Sep 13 '11 at 18:28
3

Perhaps because it's a one-liner:

snprintf(dest, dest_size, "%.*s", sub_len, src+sub_start);
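
A sketch of how that one-liner behaves when wrapped in a function. The wrapper name substr_snprintf is made up for illustration, and it assumes sub_start is no larger than strlen(src) and that sub_len is non-negative (a negative precision is treated as if none were given).

```c
#include <stdio.h>

/* Hypothetical wrapper around the one-liner. Returns snprintf's result:
   the number of characters the full substring needs, which is
   >= dest_size when the output had to be truncated. */
int substr_snprintf(char *dest, size_t dest_size,
                    const char *src, size_t sub_start, int sub_len)
{
    /* "%.*s" takes its precision as an int and stops early at '\0'. */
    return snprintf(dest, dest_size, "%.*s", sub_len, src + sub_start);
}
```

The return value is what makes truncation detectable: asking for 4 characters into a 3-byte buffer stores 2 characters plus the terminator and still returns 4.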
R.. GitHub STOP HELPING ICE
  • Do you know the relative efficiency of snprintf over memcpy? – Derek Springer Sep 13 '11 at 18:59
  • 3
    @Derek Springer: In string handling you should be more worried about safety than efficiency. If you, after lengthy and sober profiling, determine that using `snprintf` is too slow in your application, you are probably better off trying to avoid a direct substring operation rather than using `memcpy` instead of `snprintf`. – user786653 Sep 13 '11 at 19:14
  • Does snprintf guarantee null termination on the destination buffer? – selbie Sep 13 '11 at 19:32
  • @Selbie: Yes it does--I was wondering the same thing. From the man page: _The functions snprintf() and vsnprintf() write at most size bytes (including the trailing null byte ('\0')) to str._ – Derek Springer Sep 13 '11 at 19:39
  • Yes `snprintf` always terminates, unless `dest_size` is zero in which case it does not write anything. – R.. GitHub STOP HELPING ICE Sep 13 '11 at 20:19
2

You DO have strcpy and strncpy. Aren't they enough for you? With strcpy you can simulate a substring from a given character to the end; with strncpy you can simulate a substring from a given character for a number of characters (you only need to remember to add the \0 at the end of the string). strncpy is even better than the C# equivalent, because you can overshoot the length of the substring and it won't throw an error (if you have allocated enough space in dest, you can do strncpy(dest, src, 1000) even if src has length 1; in C# you can't). As noted in the comments, you can even use memcpy, but remember to always add a \0 at the end of the string, and you must know how many characters you are copying (so you must know the exact length of the src substring). It is also a little more complex to use if one day you want to refactor your code to use wchar_t, and it's not type-safe (because it accepts void* instead of char*). All this in exchange for a little more speed than strncpy.
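
A minimal sketch of the strncpy-plus-terminator combination described above. The name substr_strncpy is hypothetical; the sketch assumes dest has room for count + 1 bytes and that start is no larger than strlen(src).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: strncpy followed by an explicit terminator.
   Assumes dest has room for count + 1 bytes and start <= strlen(src). */
char *substr_strncpy(char *dest, const char *src, size_t start, size_t count)
{
    /* strncpy always writes exactly `count` bytes: it stops reading at the
       source's '\0' and pads the remainder with '\0', but if the source is
       longer than `count` it writes no terminator at all. */
    strncpy(dest, src + start, count);
    dest[count] = '\0';
    return dest;
}
```

Overshooting is safe only in the sense that the source is not read past its end; all count bytes of dest are still written, which is why the destination must really be that large.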

xanatos
  • 3
    `strncpy` isn't "a safe `strcpy`." Be careful around code that appears to use it as such. – Chris Lutz Sep 13 '11 at 17:59
  • @Chris If you always terminate it with a "bonus `\0`", I don't see any problem with using it. But yes, the fact that you have to pass it bufferlength-1 IS a big problem :-) But in the end... More money for me :-) :-) – xanatos Sep 13 '11 at 18:03
  • I prefer to use `memcpy` when I can. It doesn't perform the extra (often unnecessary) work of checking for nul-termination, or filling in unused space with zeroes, and you always know how much data it copies. – Chris Lutz Sep 13 '11 at 18:06
  • @chris Yes, if you know the length of the source string it's faster. But I have to tell the truth, I programmed for years under VC++ and every time multiplying for sizeof(TCHAR) was VERY boring (TCHAR can be wchar_t (2bytes on W) or char depending on compilation #defines). And then, it's premature optimization of the worst type :-) – xanatos Sep 13 '11 at 18:10
  • `strncpy` can easily either leave the target unterminated, or needlessly pad it with extra `'\0'` bytes. It was designed for use with a very specific data layout used to hold file names in early Unix filesystems (a fixed-size array padded at the end with 0 or more `'\0'` bytes). – Keith Thompson Sep 13 '11 at 18:18
  • @Keith I have written that you have to remember to add a \0 at the end of the buffer. I DO know this. I ALWAYS use the "combo" strncpy + zero terminator at the end of the buffer. In the end, we like C this way, unsafe at any speed :-) :-) – xanatos Sep 13 '11 at 18:20
  • @xanatos - I use `sizeof *ptr` so that if the type changes the size will stay the same (and I think VC++'s `TCHAR` is a bad idea, but I'm an OS X/Unix guy with little Windows experience so I can't have an opinion). And I don't think it's premature optimization in this case because `strncpy` and `memcpy` are different functions with different purposes. I just think `memcpy`'s purpose is closer to the "bounded `strcpy`" ideal than `strncpy`. – Chris Lutz Sep 13 '11 at 18:21
  • 3
    @xanatos: But what about the needless padding? `strncpy` was designed for a very specific purpose, one that we rarely run into these days. IMHO it shouldn't be in the standard library. – Keith Thompson Sep 13 '11 at 18:21
  • @Keith Surely, but it's premature optimization. It's a no no (yes... I hate it from the bottom of my heart... But they made me write "I mustn't do premature optimization" 1000 times on the blackboard :-) ). I always wanted to know WHY they decided that `strncpy` should pad the string. Mah. In the end, writing a macro that does strncpy + \0 at the end or memcpy + \0 at the end is a trivial thing to do. – xanatos Sep 13 '11 at 18:23
  • @xanatos - It's not premature optimization to say "these two functions are similar and can both easily work here, but let's use the one that doesn't do unnecessary work." Premature optimization is using unreadable bit-twiddling hacks instead of a few clear math functions "to avoid floating-point arithmetic." – Chris Lutz Sep 13 '11 at 18:30
  • @Chris only if you know the length of src, otherwise you'll need strlen+memcpy+\0... So it's three commands against two (strncpy+\0 or \0 of byte 0 of dest + strncat) Is strncat good enough? You only have to remember to zero the initial byte of the dest buffer. – xanatos Sep 13 '11 at 18:32
  • 1
    @Chris, I think you should replace "to avoid floating-point arithmetic" with "to be faster than floating-point arithmetic". If the reason for avoiding floating point is anything else (like wanting reproducible bit-exact answers) then it's probably a very legitimate concern, not premature optimization. – R.. GitHub STOP HELPING ICE Sep 13 '11 at 18:35
0

Here's a lighter weight version of what you want. Avoids the redundant strlen calls and guarantees null termination on the destination buffer (something strncpy won't do).

void substr(const char* pszSrc, int start, int N, char* pszDst, int lenDest)
{
    const char* psz = pszSrc + start;
    int x = 0;

    // no room even for the terminator
    if (lenDest <= 0)
    {
        return;
    }

    // copy up to N chars, leaving one byte for the null char
    while ((x < N) && (x < lenDest - 1) && (psz[x] != '\0'))
    {
        pszDst[x] = psz[x];
        x++;
    }

    // guarantee null termination
    pszDst[x] = '\0';
}

Example:
char *pszLongString = "This is a long string";
char szSub[10];
substr(pszLongString, 10, 4, szSub, 10); // copies "long" into szSub and includes the null char

So while there isn't a formal substring function in C, C++ string classes usually have such a method:

#include <cstdio>
#include <string>
...
std::string str;
std::string strSub;

str = "This is a long string";

strSub = str.substr(10, 4); // "long"

printf("%s\n", strSub.c_str());
selbie
  • I don't think that's exactly the same thing: strstr returns a pointer to the first occurrence of str2 in str1--helpful only if you know exactly what you are looking for. What I'm talking about is returning just "burg" from "hamburger." – Derek Springer Sep 13 '11 at 17:56
  • Your new function takes 5 arguments, but you only call it with 4. – Chris Lutz Sep 13 '11 at 18:38
  • @Oli - that's rather draconian of you to downvote since I gave both a "C" and "C++" answer. – selbie Sep 13 '11 at 19:30
  • @selbie - At the time of his downvote, you only had a C++ answer. – Chris Lutz Sep 13 '11 at 19:32
0

In C you have a function that returns a subset of symbols from a string via pointers: strstr.

char *ptr;
char string1[] = "Hello World";
char string2[] = "World";

ptr = strstr(string1, string2);

ptr will point to the first character of the first occurrence of string2 within string1.

BTW, strictly speaking you wrote a procedure rather than a function, since it returns nothing. The ANSI string functions are declared in string.h.
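
To make the behaviour concrete, here is a small sketch that turns the pointer from strstr into an index. The helper name find_index is hypothetical. Note that strstr only locates text: the pointer it returns refers to the rest of the original string, so nothing is copied or cut off.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first occurrence of
   needle in haystack, or (size_t)-1 if needle does not occur. */
size_t find_index(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (size_t)(hit - haystack) : (size_t)-1;
}
```

For "Hello World" and "World" this gives 6. Searching "hamburger" for "burg" gives 3, but the pointer from strstr reads as "burger", so extracting just "burg" still needs a bounded copy of some kind.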

Eder
  • 1
    This is the same comment selbie originally made--I don't think that's exactly the same thing: strstr returns a pointer to the first occurrence of str2 in str1--helpful only if you know exactly what you are looking for. – Derek Springer Sep 13 '11 at 18:19
-1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *substr(const char *string, size_t from, size_t to);

int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s string\n", argv[0]);
        return 1;
    }

    const char *string = argv[1];

    /* the substring covers [from, to); the caller owns the result */
    char *substring = substr(string, 6, 80);

    if (substring == NULL)
    {
        printf("string is [%s], no substring\n", string);
        return 1;
    }

    printf("string is [%s] substring is [%s]\n", string, substring);

    free(substring);
    return 0;
}

/* Returns a malloc'd copy of string[from..to), or NULL on error. */
char *substr(const char *string, size_t from, size_t to)
{
    if (string == NULL)
        return NULL;

    size_t len = strlen(string);

    /* clamp the end, then reject empty or out-of-range requests */
    if (to > len)
        to = len;

    if (from >= to)
        return NULL;

    char *substring = malloc((to - from) + 1);

    if (substring == NULL)
        return NULL;

    memcpy(substring, string + from, to - from);
    substring[to - from] = '\0';

    return substring;
}
johnny