-3

I have a function that is basically a rewritten strtok_r because that function is causing me grief.

char *betterStrtok(char *str, char *delim, char **loc)
{
    int iter = 0;
    char *tmp;
    if(str)
    {
        char mod[strlen(str) + 2];
        char *out = malloc(strlen(str) + 2);
        char curr = str[0];
        strcpy(mod, str);
        while(curr)
        {
            tmp = strchr(delim, curr);
            if(tmp)
            {
                mod[iter] = 0;
                strcpy(out, mod);
                *loc = &mod[iter + 1];
                //printf("Inside function: \"%s\"\n", *loc);
                return out;
            }
            if(curr)
            {
                curr = mod[++iter];
            }
            else
            {
                *loc = &mod[0];
                strcpy(out, mod);
                return out;
            }
        }
        return NULL;
    }
    else
    {
        char mod[strlen(*loc) + 2];
        strcpy(mod, *loc);
        char *tloc = malloc(sizeof loc + 2);
        char *out = malloc(strlen(*loc) + 2);
        char curr = mod[0];
        while(curr)
        {
            tmp = strchr(delim, curr);
            if(tmp)
            {
                mod[iter] = 0;
                strcpy(out, mod);
                tloc = &mod[iter + 1];
                strcpy(*loc, tloc);
                return out;
            }
            if(curr)
            {
                curr = mod[++iter];
            }
            else
            {
                *loc = &mod[0];
                strcpy(out, mod);
                return out;
            }
        }
        return NULL;
    }
}

So my issue is *loc has the appropriate thing in it after the first pass, and when I check what's in it outside the function, it's mostly there except the last character is something weird. Let's say this is the setup:

char *addr = malloc(60);
char **supaddr = &addr;
char *strtotok = "Hello, world!";
char *thetok;
thetok = betterStrtok(strtotok, ",", supaddr);
printf("Outside function: \"%s\"\n", addr);

Adding print statements right before the return and right after calling the function shows something like this:

Inside function: " world!"
Outside function: " w"

The question is: how can I prevent the string from changing or how can I do something else so I can save the "rest" of the original string without returning it?

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • In my original use of the function, the returned string in the outside function part had almost all of the string intact, save for the last character. Instead it was replaced by some unknown characters that didn't show up in this post – Keith Lara Mar 24 '17 at 00:24
  • Is there a link to your complete code? – Manav Dubey Mar 24 '17 at 00:26
  • 1
    `*loc = &mod[iter + 1];` is set local address to `addr` of `main`. (And occurs memory leak) – BLUEPIXY Mar 24 '17 at 00:27
  • 4
    @narusin: Please don't ask for a link, but a [mcve]. It must be **in the question**, – too honest for this site Mar 24 '17 at 00:28
  • How can I fix this then? I understand that, but how can I pass back the string through the pointer? – Keith Lara Mar 24 '17 at 00:29
  • `char *mod = malloc(strlen(str) + 2);`? – ad absurdum Mar 24 '17 at 00:33
  • Sorry @Olaf. I just wanted to compile the code. – Manav Dubey Mar 24 '17 at 00:37
  • Note that the delimiter in `strtok_r()` (or `strtok()` or `strtok_s()`) can list multiple characters. Using `strchr()` to find 'the delimiter' is therefore broken according to standard `strtok*()` functionality. You need to look up `strpbrk()`, probably. – Jonathan Leffler Mar 24 '17 at 00:52
  • This code looks like Java. Not good. – wildplasser Mar 24 '17 at 00:55
  • 1
    You're also calling `malloc()` once on the first invocation of your function, and twice on each subsequent invocation (when `str` is NULL). You're never calling `free()`. You have a memory leak of disastrous proportions! Maybe you should explain what you find wrong with the standard `strtok_r()`? The main criticism I have of it is that you can't tell which (of many possible) delimiters was zapped by the function — because it writes a null byte over the delimiter it matched. Anyway, as it stands, what you've got is not a better version of `strtok*()` in any of its guises. – Jonathan Leffler Mar 24 '17 at 00:55
  • So you're going to "improve" a runtime library function, are you? Well, I suppose it will be interesting to watch this educational experience in progress. – Carey Gregory Mar 24 '17 at 00:57
  • strtok_r doesn't exist with the c99 standard, and I can't use strtok because I have a need for nested calls. My only option is to write a new function, and I don't need to worry about multiple delimiters next to each other. I also use it with a few delimeters so my strchr function works fine. I'm not worried about free() or memory leaks at this time, but I guess that's bad coding on my end. I also would much rather use java if I could. Now instead of criticizing my coding, could you answer my question? – Keith Lara Mar 24 '17 at 01:45
  • 1
    `strtok_r()` is not in the C Standard, but it is [POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strtok_r.html). – ad absurdum Mar 24 '17 at 02:06
  • 1
    Maybe I missed the purpose of this. If you *want* to use `strtok_r`, but feel you can't because it isn't in the standard lib and rather is a POSIX function, then pardon the obvious question: why not just hunt down an implementation of `strtok_r` and reap it? [There's even implementations on Stack Overflow](http://stackoverflow.com/questions/12975022/strtok-r-for-mingw). And you'll quickly discover you don't need to `malloc` *anything*. That is, if replicating `strtok_r` is *really* what you want to do. – WhozCraig Mar 24 '17 at 07:44
  • I tried that but I couldn't find anything I liked or understood. Can't use POSIX cause there's a "standard" that the teacher wants to uphold. The teacher said I could write a function or figure out how to make strtok work. – Keith Lara Mar 28 '17 at 01:28

2 Answers2

1

If you start replacing standard library (or POSIX) functions with your own implementations, first look hard at how the facility is used. For example, compare fgets() and getline().

If I were you, I'd probably use

size_t  extract_token(const char *src_ptr, const size_t src_len,
                      char **token_ptr, size_t *token_size, size_t *token_len);

which extracts a token from a src_len -byte buffer at src_ptr. (Unlike string-based methods, this can handle embedded nul bytes.)

The return value is the number of characters consumed from src_ptr. The token is copied (expanded?) to dynamically allocated token_ptr. The allocated length is in token_size, and the length of the token in token_len.

If extract_token() encounters only whitespace but no token, it returns the number of whitespace chars consumed, with zero assigned to token_len. For simplicity, let's assume the function always sets errno; to zero if success, to a nonzero error code if an error occurs.

A simple loop that tokenizes lines read from standard input would be

char   *line_ptr = NULL;
size_t  line_size = 0;
ssize_t line_len;
long    line_num = 0;

char   *token_ptr = NULL;
size_t  token_size = 0;
size_t  token_len;

char   *cur, *end;
size_t  n;

while (1) {

    line_len = getline(&line_ptr, &line_size, stdin);
    line_num++;

    if (line_len < 1) {
        if (ferror(stdin) || !feof(stdin)) {
            fprintf(stderr, "Standard input: Line %ld: Read error.\n", line_num);
            return EXIT_FAILURE;
        }
        break;
    }

    cur = line_ptr;
    end = line_ptr + line_len;

    while (1) {

        if (cur >= end) {
            errno = 0;
            cur = end;
            break;
        }

        n = extract_token(cur, (size_t)(end - cur),
                          &token_ptr, &token_size, &token_len);
        if (errno) {
            /* cur + n is the offending character in input */
            fprintf(stderr, "Standard input: Line %ld: Cannot tokenize line.\n", line_num);
            exit(EXIT_FAILURE);
        }

        /* Do something with token;
            token_ptr  points to the token,
            token_len  is the length of the token
            token_size is the size allocated for the token
        */
    }                
}

/* Since the line and token buffers are no longer needed,
   free them. I like to clear the variables too, just in
   case.
*/
free(line_ptr);
line_ptr = NULL;
line_size = 0;

free(token_ptr);
token_ptr = NULL;
token_size = 0;

Note that when reading files with clear record-and-field formatting, like CSV files, I do prefer to read the tokens directly from the file using a getline()-like interface, either

int next_field(char **ptr, size_t *size, size_t *len, FILE *in);
int next_record(FILE *in);

or

int next_wfield(wchar_t **ptr, size_t *size, size_t *len, FILE *in);
int next_wrecord(FILE *in);

where next_field() (or next_wfield() for wide input) obtains the next field in the current record, preferably handling de-quoting and de-escaping, and next_wrecord() skips any leftover fields in the current record, and moves to the beginning of next record.

Using fgetc() or fgetwc() the code implementing the above is quite straightforward (even if CSV quoting rules are implemented), although it won't be as fast as possible using more advanced methods. Because CSV and other such file formats are not optimal anyway, the slight loss in speed is normally neglible/unnoticeable. Most importantly, if you try it out, you'll see that code that uses next_field()/next_record() is quite robust, and easy to read, write, and maintain in the long term.

Nominal Animal
  • 38,216
  • 5
  • 59
  • 86
0

The answer was in the comments. Turns out if I changed mod to a pointer it worked perfectly.