0

I'm trying to parse argv back into a single string to use for a system call. The string shows up fine through the printf call (even without the \0 terminator), but using it as a parameter to system creates all sorts of undefined behaviour.

How can I ensure the string is properly terminated? Is there a better and more reliable way to go about parsing char[][] into char[]?

#include <stdio.h>
int main(int argc,char *argv[]){

char cmd[255]="tcc\\tcc.exe ";
char**ptr=argv+1;
while(*ptr){
   strcat(cmd,*ptr);
   ++ptr;
   if(*ptr)cmd[strlen(cmd)]=' ';
}


printf("cmd: ***%s***\n",cmd);
system(cmd);
}

I just discovered and corrected another flaw in this code, a system commands needs (escaped) backslashes for file paths

aelgoa
  • 1,193
  • 1
  • 8
  • 24
  • 5
    @RichardJ.RossIII: Yes, `argv[argc] == NULL`. – md5 Mar 02 '13 at 16:39
  • 2
    `strcat` already properly terminates the string. Can you explain your question better? – Carl Norum Mar 02 '13 at 16:41
  • 1
    Of tangential relevance: http://en.wikipedia.org/wiki/Schlemiel_the_Painter's_algorithm. – Oliver Charlesworth Mar 02 '13 at 16:46
  • @OliCharlesworth That's essential, rather than tangential! –  Mar 02 '13 at 16:50
  • I thaught it would, but it didn't so I added a clause to manually null-terminate it, which didn't help. I removed it again now because it doesn't make sense indeed to do it twice. It doesn't work either way though. – aelgoa Mar 02 '13 at 17:13
  • Did you try to manage to use `execv(const char *path, char *const argv[]);` which takes an array of strings? – Déjà vu Mar 02 '13 at 18:06

2 Answers2

3

This instruction:

   if(*ptr)
       cmd[strlen(cmd)]=' ';
   else
       cmd[strlen(cmd)]='\0';

will break cmd, because it will overwrite its zero termination. Try instead:

l = strlen(cmd);
if (*ptr) {
    cmd[l++] = ' ';
}
cmd[l] = 0x0;

This will append a space, and zero terminate the string. Actually, since it is already zero terminated, you could do better:

if (*ptr) {
    int l = strlen(cmd);
    cmd[l++] = ' ';
    cmd[l  ] = 0x0;
}

Update

A better alternative could be this:

int main(int argc, char *argv[])
{
        char cmd[255]="tcc/tcc.exe";
        char **ptr=argv+1;
        for (ptr = argv+1; *ptr; ptr++)
        {
            strncat(cmd, " ",  sizeof(cmd)-strlen(cmd));
            strncat(cmd, *ptr, sizeof(cmd)-strlen(cmd));
        }
        printf("String: '%s'.\n", cmd);
        return 0;
}

We use strncat() to check that we're not overrunning the cmd buffer, and the space gets applied in advance. This way there's no extra space at the end of the string.

It is true that strncat() is a mite slower than directly assigning cmd[], but factoring the safety and debugging time, I think it's worthwhile.

Update 2

OK, so let's try to do this fast. We keep track of what cmd's length ought to be in a variable, and copy the string with memcpy() which is slightly faster than strcpy() and does neither check string length, nor copy the extra zero at end of string.

(This saves something - remember that strcat() has to implicitly calculate the strlen of both its arguments. Here we save that).

int main(int argc, char *argv[])
{
        #define MAXCMD  255
        char    cmd[MAXCMD]="tcc/tcc.exe";
        int     cmdlen  = strlen(cmd);
        char    **ptr=argv+1;
        for (ptr = argv+1; *ptr; ptr++)
        {
            /* How many bytes do we have to copy? */
            int l = strlen(*ptr);
            /* STILL, this check HAS to be done, or the program is going to crash */
            if (cmdlen + 1 + l + 1 < MAXCMD)
            {
                /* No danger of crashing */
                cmd[cmdlen++]   = ' ';
                memcpy(cmd + cmdlen, *ptr, l);
                cmdlen  += l;
            }
            else
            {
                printf("Buffer too small!\n");
            }
        }
        cmd[cmdlen] = 0x0;
        printf("String: '%s'.\n", cmd);
        return 0;
}

Update 3 - not really recommended, but fun

It is possible to try and be smarter than the compiler's usually built-in strlen and memcpy instructions (file under: "Bad ideas"), and do without strlen() altogether. This translates into a smaller inner loop, and when strlen and memcpy are implemented with library calls, much faster performances (look ma, no stack frames!).

int main(int argc, char *argv[])
{
        #define MAXCMD  254
        char    cmd[MAXCMD+1]="tcc/tcc.exe";
        int     cmdlen  = 11; // We know initial length of "tcc/tcc.exe"!
        char    **ptr;
        for (ptr = argv+1; *ptr; ptr++)
        {
            cmd[cmdlen++] = ' ';
            while(**ptr) {
                cmd[cmdlen++] = *(*ptr)++;
                if (MAXCMD == cmdlen)
                {
                        fprintf(stderr, "BUFFER OVERFLOW!\n");
                        return -1;
                }
            }
        }
        cmd[cmdlen] = 0x0;
        printf("String: '%s'.\n", cmd);
        return 0;
}

Discussion - not so fun

Shamelessly cribbed from many a lecture I received from professors I thought shortsighted, until they were proved right each and every time.

The problem here is to exactly circumscribe what are we doing - what's the forest this particular tree is part of.

We're building a command line that will be fed to a exec() call, which means that the OS will have to build another process environment and allocate and track resources. Let's step a bit backwards: an operation will be run that will take about one millisecond, and we're feeding it a loop that might take ten microseconds instead of twenty.

The 20:10 (that's 50%!) improvement we have on the inner loop translates in a 1020:1010 (that's about 1%) just the overall process startup operation. Let's imagine the process takes half a second - five hundred milliseconds - to complete, and we're looking at 500020:500010 or 0.002% improvement, in accord with the never-sufficiently-remembered http://en.wikipedia.org/wiki/Amdahl%27s_law .

Or let's put it another way. One year hence, we will have run this program, say, one billion times. Those 10 microseconds saved now translate to a whopping 10.000 seconds, or around two hours and three quarters. We're starting to talk big, except that to obtain this result we've expended sixteen hours coding, checking and debugging :-)

The double-strncat() solution (which is actually the slowest) supplies code that is easier to read and understand, and modify. And reuse. The fastest solution, above, implicitly relies on the separator being one character, and this fact is not immediately apparent. Which means that reusing the fastest solution with ", " as separator (let's say we need this for CSV or SQL) will now introduce a subtle bug.

When designing an algorithm or piece of code, it is wise to factor not only tightness of code and local ("keyhole") performances, but also things like:

  • how that single piece affects the performances of the whole. It makes no sense to spend 10% of development time on less than 10% of the overall goal.
  • how easy it is for the compiler to interpret it (and optimize it with no further effort on our part, maybe even optimize specifically for different platforms -- all at no cost!)
  • how easy will it be for us to understand it days, weeks, or months down the line.
  • how non-specific and robust the code is, allowing to reuse it somewhere else (DRY).
  • how clear its intent is - allowing to reengineer it later, or replace with a different implementation of the same intent (DRAW).

A subtle bug

This in answer to WilliamMorris's question, so I'll use his code, but mine has the same problem (actually, mine is - not completely unintentionally - much worse).

This is the original functionality from William's code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    #define CMD "tcc/tcc.exe"
    char cmd[255] = CMD;
    char *s = cmd + sizeof CMD - 1;
    const char *end = cmd + sizeof cmd - 1;
    // Cycle syntax modified to pre-C99 - no consequences on our code
    int i;
    for (i = 1; i < argc; ++i) {
        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }
        // Here (will) be dragons
        //*s++ = '.';
        //*s++ = '.';
        //*s++ = '.';

        *s++ = ' ';
        memcpy(s, argv[i], len);
        s += len;
    }
    *s = '\0';
    // Get also string length, which should be at most 254
    printf("%s: string length is %d\n", cmd, (int)strlen(cmd));
    return 0;
}

The buffer overrun check verifies that the string written so far, plus the string that has yet to be written, together do not exceed the buffer. The length of the separator itself is not counted, but things will work out somehow:

        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }

Now we add on the separator in the most expeditious way - by repeating the poke:

        *s++ = ', ';
        *s++ = ' ';

Now if s + len is equal to end - 1, the check will pass. We now add two bytes. The total length will be s + len + 2, which is equal to end plus one:

tcc/tcc.exe, It, was, the, best, of, times, it, was, the, worst, of, times, it, was, the, age, of, wisdom, it, was, the, age, of, foolishness, it, was, the, epoch, of, belief, it, was, the, epoch, of, incredulity, it, was, the, season, of, Light, it, was: string length is 254

tcc/tcc.exe, It, was, the, best, of, times, it, was, the, worst, of, times, it, was, the, age, of, wisdom, it, was, the, age, of, foolishness, it, was, the, epoch, of, belief, it, was, the, epoch, of, incredulity, it, was, the, season, of, Light, it, ouch: string length is 255

With a longer separator, such as "... ", the problem is even more evident:

tcc/tcc.exe... It... was... the... best... of... times... it... was... the... worst... of... times... it... was... the... age... of... wisdom... it... was... the... age... of... foolishness... it... was... the... epoch... of... belief... it... was... longer: string length is 257

In my version, the fact that the check requires an exact match leads to catastrophic results, since once the buffer is overrun, the match will always fail and result in a massive memory overwrite.

If we modify my version with

if (cmdlen >= MAXCMD)

we will get a code that always intercepts buffer overruns, but still does not prevent them up to the delimiter's length minus two; i.e., a hypothetical delimiter 20 bytes long could overwrite 18 bytes past cmd's buffer before being caught.

I would point out that this is not to say that my code had a catastrophic bug (and so, once fixed, it'll live happy ever after); the point was that the code was structured in such a way that, for the sake of squeezing a little speed, a dangerous bug could easily go unnoticed, or the same bug could easily be introduced upon reuse of what looked like "safe and tested" code. This is a situation that one would be well advised to avoid.

(I'll come clean now, and confess that I myself rarely did... and too often still don't).

LSerni
  • 55,617
  • 10
  • 65
  • 107
  • 2
    `if(*ptr) strcat(cmd, " ");` would be simpler – David Heffernan Mar 02 '13 at 16:48
  • Why would I null-terminate my string at each iteration? I removed the else clause, cause it was indeed not necessary. Thank you for the suggested alternative, but I am looking for the fastest, most compact way to go about this, and using a for-loop, introducing more variables and post-incrementation don't suit that purpose. Thanks for showing and explaining a usage of `strncat`. I still don't get why the string doesn't get null-terminated by strcat in the last iteration. – aelgoa Mar 02 '13 at 17:12
  • 1
    The problem is that `strcat` does not know where the string will end, since you obliterated the initial `\x00` in `cmd`. I'll try to update the code. – LSerni Mar 02 '13 at 17:19
  • Ah of course, I'm gonna see if I can make it work this way. There really isn't a more direct way I could write directly to the memory location at the pointer and finish it with 0x0 ? – aelgoa Mar 02 '13 at 17:28
  • OK, it *seems* to be working, but... maintenance is going to be something else. I don't think this is going to save more than a few microseconds. I think this is the fastest you can go and still be reasonably sure the program isn't going to be shot out of the sky by a rogue input (e.g. 256 bytes worth of arguments). – LSerni Mar 02 '13 at 17:28
  • cool, very nice work, thanks. I'll play around with it, see how much could be scaped off the execution time using lighter constructs and I'll post my findings soon. – aelgoa Mar 02 '13 at 17:40
  • @aelgoa You are misguided if you think execution time is relevant here. What's more the question is not about performance. You are about to call system!! It's also disappointing that you accept the answer that doesn't answer the question that you asked. And disappointing that you don't vote. This is the answer that you should accept. Up vote the other one too. – David Heffernan Mar 02 '13 at 19:18
  • concatenation of array of strings comes up in a lot of places, so I wanted the best possible Implementation. i'm really happy about all the work and explanation of Isemi, but William's is a slightly better and faster solution. I haven't upvoted the other answers yet, but I will! – aelgoa Mar 03 '13 at 09:33
  • wait, wait, wait. String array concatenation is a slightly different can of worms altogether! You might want to look at http://stackoverflow.com/questions/1383649/concatenating-strings-in-c-which-method-is-more-efficient . As you can imagine, I support Ned Batchelder's point of view :-) – LSerni Mar 03 '13 at 10:11
  • @lserni - in what way does my solution (or your update 2, which looks the same) "implicitly relies on the separator being one character"? (perhaps comment on my answer) – William Morris Mar 03 '13 at 14:11
  • @DavidHeffernan - does it really not answer the question? It does terminate the string and it is better... – William Morris Mar 03 '13 at 14:13
  • @WilliamMorris, I need some code to answer that, so I'll just add to my answer. – LSerni Mar 03 '13 at 15:21
  • Ah, ok, I see what you mean. It is safe as it stands. I tested it before posting: safety comes from the '-1' in `sizeof cmd - 1` (space for the terminator) and the '=' in `if (s + len >= end) {`, ensuring space for the separator. But as you say, it is implicit. To be fair, anyone adding extra separator chars without adjusting the limits would be asking for whatever they got. – William Morris Mar 03 '13 at 17:17
1

This might be a more complicated than you want but it avoids buffer overflows and it also exits if the buffer is too small. Note that continuing to loop once there is not enough space in the buffer for argv[N] can result in trailing strings (argv[N+1] etc) that are shorter than argv[N] being added to the string even though argv[N] was omitted...

Note that I'm using memcpy, because by that point I already know how long argv[i] is.

int main(int argc, char **argv)
{
#define CMD "tcc/tcc.exe"
    char cmd[255] = CMD;
    char *s = cmd + sizeof CMD - 1;
    const char *end = cmd + sizeof cmd - 1;

    for (int i = 1; i < argc; ++i) {
        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }
        *s++ = ' ';
        memcpy(s, argv[i], len);
        s += len;
    }
    *s = '\0';
    printf("%s\n", cmd);
    return 0;
}
William Morris
  • 3,554
  • 2
  • 23
  • 24
  • i did a little profiling and your version came out slightly faster on my machhine. Here are the mean times: **my original code**: 0.023916 / **Isemi's**: 0.011689 / **William's**: 0.010003 so I'm accepting your answer as the best one. – aelgoa Mar 02 '13 at 19:06