0

Consider this simple program that concatenates all specified parameters and prints them in standard output. I used 2 for loops to append the strings, one to calculate the length of that string and one to concatenate the strings. Is there a way doing it with only one loop? It wouldn't be more efficient reallocating memory for each string to concatenate, would it? How would Java's StringBuilder be implemented in C? Would it loop twice as I did?

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    size_t len = 0;

    // start for loop at i = 1 to skip the program name specified in argv
    for(int i = 1; i < argc; i++)
        len += strlen(argv[i]) + 1; // +1 for the space 

    char* toAppend = (char*)malloc(len * sizeof(char) + 1);
    toAppend[0] = '\0'; // first string is empty and null terminated 

    for(int i = 1; i < argc; i++)
    {
        strcat(toAppend, argv[i]);
        strcat(toAppend, " ");
    }

    printf(toAppend);
    free(toAppend);
}
Winter
  • 3,894
  • 7
  • 24
  • 56
  • 5
    The problem with your code is `strcat`, every iteration it goes through the entire string to find the last character, use a moving pointer instead. – imreal Sep 20 '18 at 00:22
  • 3
    Side tip: a more readable way to write `(char*)malloc(len * sizeof(char) + 1)` is `malloc(len + 1)`. – Ry- Sep 20 '18 at 00:28
  • `stpncpy` is probably better than `strcat`. If you're interested in the Java `StringBuilder` implementation, you can find the "good" stuff in [`AbstractStringBuilder`](https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/lang/AbstractStringBuilder.java) (I suggest you look at `expandCapacity(int minimumCapacity)`) – Elliott Frisch Sep 20 '18 at 00:29
  • @Ry- Is that because sizeof(char) is always one? – Winter Sep 20 '18 at 00:31
  • 1
    @Winter: Yep, and because `void*` doesn’t need to be cast. – Ry- Sep 20 '18 at 00:31
  • @ElliottFrisch Oh so basically it's just like an arraylist that reallocates whenever needed? I'm wondering how someone would implement that without encapsulation provided by OO languages to do the reallocation automatically. – Winter Sep 20 '18 at 00:34
  • @Winter: The definition of sizeof is a size measured in units of char. Writing `sizeof(char)` is like saying "how many meters are in a meter?" – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:34
  • 1
    @ElliottFrisch: You can implement something like java AbstractStringBuilder in C, but it's highly inefficient compared to the right way to do it, allocating the right total size to begin with. Doing that kind of thing largely defeats the purpose of writing C; you get the difficulty of writing C and the slowness of Java. – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:35
  • @R.. Thank you, that's exactly the kind of detail I wanted to know. – Winter Sep 20 '18 at 00:36
  • @Winter: POSIX does have a way to do it with friendly function calls without repeatedly growing the buffer (which is costly; each growth is expected to cost O(n)); see the notes on `fmemopen` I added to my answer. – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:40
  • 2
    Depending on your usage patterns, you may want to consider sized strings, rather than NUL-terminated ones. – Acorn Sep 20 '18 at 00:45
  • [don't cast the result of `malloc` in C](https://stackoverflow.com/q/605845/995714) – phuclv Sep 20 '18 at 01:11
  • What about keeping track of everything in a struct? https://pastebin.com/ekQV3NaW – Brandon Sep 20 '18 at 01:29

3 Answers3

2

Your method of allocation is efficient, measuring the total length and allocating just once. But the concatenation loop repeatedly measures the length of the output buffer from the start to concatenate to it, resulting in quadratic runtime.

To fix it track your position as you go:

size_t pos = 0;
for(int i = 1; i < argc; i++) {
    size_t len = strlen(argv[i]);
    memcpy(toAppend+pos, argv[i], len);
    pos += len;
    toAppend[pos] = ' ';
    pos++;
}
toAppend[pos] = 0;

This is the most efficient way to actually concatenate in memory, but the most efficient of all is not to concatenate. Instead:

for(int i = 1; i < argc; i++)
    printf("%s ", argv[i]);

The whole reason stdio is buffered is so you don't have to build arbitrary-length in-memory buffers to do efficient output; instead it buffers up to a fixed size automatically and flushes when the buffer is full.

Note that your usage of printf is wrong and dangerous in the event that your input contains a % character anywhere; it should be printf("%s", toAppend);.

If you're writing to POSIX (or POSIX-ish) systems rather than just plain C, another option would be fmemopen, which would allow you to write the loop just as:

for(int i = 1; i < argc; i++)
    fprintf(my_memfile, "%s ", argv[i]);
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • The OP's method is NOT the most efficient, and clearly the OP is interested in learning about concatenation, part of the question mentions `StringBuilder` in Java. – imreal Sep 20 '18 at 00:24
  • Thank you for the answer and the warning. However, it goes that the point of this program was to create a minified, verifiable example. What I actually wanted to do requires using the string in memory. – Winter Sep 20 '18 at 00:24
  • Even better than `printf("%s", toAppend)` would be `fputs(toAppend, stdout)`. (I think GCC will actually make this optimization for you.) – Jonathon Reinhart Sep 20 '18 at 00:29
  • @JonathonReinhart: There’s a space, though. – Ry- Sep 20 '18 at 00:29
  • @R.. `fmemopen` appears to be a great idea but I still have to specify the length of the buffer, so it doesn't remove the first loop. Would it be faster in that case? It's indeed simpler at least. – Winter Sep 20 '18 at 00:48
  • @Winter: Keeping the first loop is what makes it more efficient, and why my original answer mistakenly said you were doing things right. If you keep enlarging the buffer as you go, the naive resizing pattern will take quadratic time and the optimal one is still O(n log n), whereas getting the buffer size right to begin with is linear-time. – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:58
2

efficient way to concatenate strings in c

An efficient way is to calculate the string lengths - and remember them.

size_t sum = 1; // for \0
if (argc > 2) sum += argc - 2.  // spaces
size_t length[argc];  // This is a VLA, available C99 and optionally in C11
for(int i = 1; i < argc; i++)
  length[i] = strlen(argv[i]);
  sum += length[i];
}

Then allocate, and then check for errors.

char *dest = malloc(sum);
if (dest == NULL) Handle_OutOfMemory();

Copy each string in turn

char *p = dest;
for(int i = 1; i < argc; i++)
  // Use either memcpy() or strcpy().
  // memcpy() tends to be faster for long strings than strcpy().
  memcpy(p, argv[i], length[i]);  
  p += length[i]; // advance insertion point
  if (i > 1) {
    *p++ = ' '; // space separators
  }
}
*p = '\0';

Now use dest[].

printf("<%s>\n", dest);

Free resources when done.

free(dest);

It wouldn't be more efficient reallocating memory for each string to concatenate, would it?

Usually repetitive re-allocations is best avoided, yet for small short strings it really makes scant difference. Focus on big O. My answer is O(n). Relocating in a loop tends to be O(n*n).

If performance was critical, try various approaches and profile for the intended system. The point being what is fast on one machine may differ on another. Usually it is best to first code a reasonable clear approach.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
0

The most efficient way is probably to not use any str functions and copy the characters "by hand":

char* toAppend = malloc(len + 1);

size_t j = 0;
for(size_t i = 1; i < argc; i++)
{
  for(size_t k = 0; argv[i][k]; k++)
    toAppend[j++] = argv[i][k];
  toAppend[j++] = ' ';
}
toAppend[j - 1] = '\0'; // Remove the last space and NULL-terminate the string
DYZ
  • 55,249
  • 10
  • 64
  • 93
  • `j` and `k` have the wrong type. They need to be `size_t`. – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:38
  • There's no point having `i` as `size_t` though, as argc is an int – Winter Sep 20 '18 at 00:39
  • @Winter But it doesn't hurt, either. – DYZ Sep 20 '18 at 00:40
  • Yeah, making `i` have type `size_t` is nice too, but not necessary like the others. This is because `i` is bounded between 0 and `argc`, which is bounded by `INT_MAX` since it has type `int`. – R.. GitHub STOP HELPING ICE Sep 20 '18 at 00:40
  • Could you declare main like this? `int main(size_t argc, char** argv)` It appears so, gcc doesn't make any warning. – Winter Sep 20 '18 at 00:55
  • 1
    `i` as `size_t` may emit some warnings with `i < argc` due to comparison of mis-matched integer sign-ness. e.g. "warning: comparison between signed and unsigned integer expressions [-Wsign-compare]". There is little practical down-side either way _here_, yet it does encourage non-use of the finicky _[-Wsign-compare]_. That is a love/hate compiler option. Since `argc` is signed, I'd go with `int i` here. – chux - Reinstate Monica Sep 20 '18 at 01:09
  • libc functions usually yield better results than simple loops. Pointers usually result in better performance than array access, in particular 2-dmentional arrays. So, it is not the most efficient method. – Serge Sep 20 '18 at 02:23
  • @Serge Any proofs to your claims? – DYZ Sep 20 '18 at 03:04
  • 1
    @DYZ glibc functions are usually optimized using hardware features and/or built-in compiler features for the specific operation and it is very difficult to match. Thougth it depends on the library, compiler and os version. Arrays require position calculation(at least addition with single dimension), in particular 2-dimensional arrays (multiplication and addition). – Serge Sep 20 '18 at 11:43
  • @Serge But you know that there are no 2d arrays in this snippet, right? An array of pointers is not a 2d array. As for the efficiency of indexes vs pointers, any modern compiler will optimize both constructs, at least in this context. – DYZ Sep 20 '18 at 14:17
  • @DYZ `argv[i][k];` – Serge Sep 20 '18 at 14:37
  • 1
    @Serge Not everything that looks like a 2d array, is a 2d array. This one is _not_ a 2d array. – DYZ Sep 20 '18 at 15:05
  • @DYZ what is it then? you access your `char**` or `char *[]` using 2-dimensional indexing, causing compiler to calculate 2-dimensional position in there. – Serge Sep 20 '18 at 15:22
  • @Serge It is an array of pointers. If I replace the indexed accesses in the inner loop with pointer-based accesses, my compiler (gcc 5.4.0) generates exactly the same code. What you say about performance penalties may have been true 15-20 years ago, but not anymore. – DYZ Sep 20 '18 at 18:37