My answer won't start with how to concatenate strings correctly. Instead, it first addresses some issues in your code as it stands and gives you some background that may help clarify how to think about things in C. After that we'll look at concatenating the strings.
Before we start, some thoughts on the structure of C-strings
Thinking in C is very much like thinking like a computer (CPU, memory, etc.). So for data types that work natively on a CPU, C has chars (single bytes), shorts (typically 2 bytes), ints and longs (typically 4 or 8 bytes, depending on the platform), floats, and doubles: all things that a CPU natively understands. On top of that you get the ability to create arrays of these things, or pointers to the memory locations where they live.
So how do we create a string then? Do we create a new type?
Well, since CPUs don't understand strings, neither does C... not in its most primitive form anyway (the C parser has no type associated with strings).
But strings are very useful, so a reasonably simple notion of what a string should be had to be decided upon.
All a C-string is, is a run of bytes in sequential memory that ends at the first NUL char; the bytes before it never include a NUL.
NUL (pronounced something like "nool") is the name we give to a byte in memory that has a value of 0. In C this is written '\0'. So if I write NUL, I mean the character \0.
NOTE 1: This is different from the C NULL, which is a null pointer (a pointer that points at no object; conventionally address zero);
NOTE 2: NUL of course is not character zero ('0') which has a value of 48;
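To make the NUL / NULL / '0' distinction concrete, here is a minimal sketch. The helper names is_nul and is_null_pointer are made up for this illustration; they are not part of the C library.

```c
#include <stddef.h>  /* NULL */

/* Returns 1 when c is NUL, the terminator byte whose value is 0. */
int is_nul(char c)
{
    return c == '\0';
}

/* Returns 1 when p is a null pointer, i.e. it points at no object at all. */
int is_null_pointer(const char *p)
{
    return p == NULL;
}
```

With these, is_nul('\0') is true but is_nul('0') is not, and an empty string "" is not a null pointer: it is a valid pointer to a single NUL byte.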
So any function that works on strings starts at a memory location pointed to by a char * (read: char pointer) and just keeps doing its operations byte (character) by byte (character) until it runs into a byte with value 0, which indicates the end of the string. At that point, hopefully, it stops what it's doing because the string has ended, and returns the results of its operations.
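As a rough illustration of that pattern, here is a small function that walks a string byte by byte until it meets the 0 byte and then returns its result. The name count_char is hypothetical, chosen only for this sketch.

```c
#include <stddef.h>  /* size_t */

/* Counts how many times `wanted` occurs in the NUL-terminated string s.
   The function is never told the length: it simply stops at the 0 byte. */
size_t count_char(const char *s, char wanted)
{
    size_t hits = 0;

    while (*s != '\0') {   /* keep going until the terminator */
        if (*s == wanted) {
            hits++;
        }
        s++;               /* step to the next byte */
    }
    return hits;
}
```

For example, count_char("Hello", 'l') gives 2. Hand it a char array that is missing its terminator and it will happily run off the end, which is why the terminator matters so much.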
So if we define our strings as arrays of characters that end in 0, we completely avoid having to create any artificial concept of a string beyond that.
And that's exactly what C does; it just settled on this notion as the convention to use; and the compiler just provides a simple shortcut to declare arrays of characters that are NUL terminated using the double quotes and that's it. There is no special type for strings in C.
So with all of this in mind let's look at your code:
char hello[] = { 'H', 'e', 'l', 'l', 'o', '\0' };
char world[] = { ',', ' ', 'W', 'o', 'r', 'l', 'd', '!', '\0' };
You declared two arrays of single byte (char) and terminated them with \0;
This is IDENTICAL to the following C statements:
char hello[] = "Hello";
char world[] = ", World!";
When compiled on a Linux machine running on a 64-bit Intel computer, both your pair and the one above emit the following (identical) bytes in the object file's data section:
Disassembly of section .data:
0000000000000000 <hello>:
0: 48 65 6c 6c 6f 00 Hello.
0000000000000006 <world>:
6: 2c 20 57 6f 72 6c 64 21 00 , World!.
If you're using Linux you can try that out; the commands are shown in the "Disassembling on Linux" part at the end of this answer.
Notice that in both cases a 00 byte appeared at the end. In your case it was explicitly declared by you in the array; in the second case it was implicitly injected by the C compiler when emitting the data corresponding to the <hello> and <world> symbols.
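If you'd rather not read a disassembly, the same equivalence can be checked from inside C. This is only a sketch; the names hello_braces, hello_quotes and the two helper functions are invented for the example.

```c
#include <stddef.h>  /* size_t */
#include <string.h>  /* memcmp */

/* The same text declared both ways. */
const char hello_braces[] = { 'H', 'e', 'l', 'l', 'o', '\0' };
const char hello_quotes[] = "Hello";

/* Size of the quoted version in bytes, terminator included. */
size_t quoted_size(void)
{
    return sizeof hello_quotes;
}

/* Returns 1 when both arrays have the same size and the same bytes. */
int same_size_and_bytes(void)
{
    return sizeof hello_braces == sizeof hello_quotes
        && memcmp(hello_braces, hello_quotes, sizeof hello_quotes) == 0;
}
```

Here quoted_size() is 6, not 5, because the compiler counted the 00 byte it appended, and same_size_and_bytes() reports that the two arrays match.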
Okay, so now that you understand how that works; you can see that:
// This is bad: :-)
for (int i = 0; i < 6; i++) {
if (hello[i] == '\0') {
for (int j = 0; j < 9; j++) {
int index = 5 + j;
hello[index] = world[j];
}
}
}
The looping above is very weird. Actually there are a bunch of things wrong with it (e.g. the loop nested inside the outer for loop is wrong);
But rather than pointing out the problems, let's just look at the basic correct solution.
When you write code that handles strings you usually DON'T know how big they are, so a condition of the form i < N in a for loop dealing with strings is not the usual way to go.
Here is a way to loop through the characters in a string (a char array terminated by \0):
char *p; /* Points to the characters in strings */
char str[] = "Hello";
for ( p = str; *p != 0; p++ ) {
printf("%c\n", *p);
}
So let's figure out what's happening here:
for ( p = str; ...
^^^^^^^^^
p is a char pointer. In the beginning we point it at str (which is where the variable str is loaded in memory when you run the program) and check whether the value at this memory location (obtained by *p) is equal to '\0' or not:
for (p = str; *p != 0; ...)
^^^^^^^
If it's not, we enter the for loop because the condition is true; in our case *p == 'H', so we enter the loop:
for (p = str; *p != 0; p++)
^^^
This third clause is where we do our increment / decrement / something else. It does NOT run first: it runs at the END of each pass, after the statements in the { ... } body. (That is how every for loop works; writing ++p instead of p++ would make no difference here.) So the loop enters the { ... }, does its thing, and at the end the ++ happens, moving p (which is a memory address) to the next character. Then we enter the condition check again:
for (p = str; *p != 0; p++)
^^^^^^^
So you can see that this sets p to point to the memory locations of 'H', 'e', 'l', 'l', 'o' and '\0' in turn; when it hits '\0' the loop exits.
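One detail of this walk is worth pinning down in code: the third clause of a for statement always runs after the body, so p++ and ++p behave identically there. The sketch below counts characters both ways; the function names length_postfix and length_prefix are made up for the comparison.

```c
#include <stddef.h>  /* size_t */

/* Walks the string with p++ in the third clause of the for loop. */
size_t length_postfix(const char *str)
{
    const char *p;
    size_t n = 0;

    for (p = str; *p != '\0'; p++) {
        n++;    /* one more character seen */
    }
    return n;
}

/* Same walk with ++p; the result is identical because the third
   clause runs after the body either way and its value is discarded. */
size_t length_prefix(const char *str)
{
    const char *p;
    size_t n = 0;

    for (p = str; *p != '\0'; ++p) {
        n++;
    }
    return n;
}
```

Both functions return 5 for "Hello": the body runs once for each of 'H', 'e', 'l', 'l', 'o', and the condition fails when p reaches the '\0'.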
Concatenating strings:
So now we want to concatenate "Hello" and ", World!".
First we need to find the end of Hello, and then we need to start sticking ", World!" onto the end of it.
Well, we know that our for loop above finds the end of hello; so if we do nothing in its body, then when it finishes p will point to where the '\0' at the end of Hello is:
char str1[] = "Hello";
char str2[] = ", World!";
char *p; /* points into str1 */
char *q; /* points into str2 */
for (p = str1; *p!=0; p++) {
/* Skip along till the end */
}
/* Here p points to '\0' in str1 */
/* Now we start to copy characters from str2 to str1 */
for (q = str2; *q != 0; p++, q++ ) {
*p = *q;
}
Note that in the first pass p was pointing to the '\0' at the end of str1, so when we assign *p = *q that '\0' gets replaced by ','. The '\0' disappears from str1 completely, so we'll have to inject a new one at the end. Note also that we increment both p and q at the end of each pass and keep looping while *q != 0.
Once the loop ends we stick a '\0' at the end, because we destroyed the one we had:
*p = 0;
And that is concatenation.
Important part about memory
If you look at the disassembly above, Hello\0 took up six bytes and , World!\0 started at address 0000000000000006 (hello started at 0000000000000000) in the data segment.
This means that if str1[] doesn't have enough space and you write beyond its last byte, which is our case (why is explained below), you'll end up overwriting memory that belongs to something else (str2[], for example).
The reason we don't have enough memory is that we declared a character array that's only just big enough to hold its initialization value:
char str[] = "Foofoo";
will make str be exactly 7 bytes.
But we can ask C to give str more space than just the initialization value needs. For example,
char str[20] = "Foofoo";
This will give str 20 bytes and set the first seven to "Foofoo\0". The rest are set to \0 as well (C zero-fills whatever part of the array the initializer doesn't cover).
So the disassembly for this array would look like:
Disassembly of section .data:
0000000000000000 <str>:
0: 46 6f 6f 66 6f 6f 00 00 00 00 00 00 00 00 00 00 Foofoo..........
10: 00 00 00 00 ....
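The sizes and the zero fill can also be confirmed without a disassembler. A small sketch, with array and function names (exact, roomy, and the three helpers) invented for the purpose:

```c
#include <stddef.h>  /* size_t */

char exact[]   = "Foofoo";  /* sized by the initializer: 6 letters + '\0' */
char roomy[20] = "Foofoo";  /* explicitly sized: 20 bytes */

size_t exact_size(void) { return sizeof exact; }
size_t roomy_size(void) { return sizeof roomy; }

/* Returns 1 when every byte after the six letters of "Foofoo" is 0. */
int roomy_tail_is_zero(void)
{
    size_t i;

    for (i = 6; i < sizeof roomy; i++) {
        if (roomy[i] != '\0') {
            return 0;
        }
    }
    return 1;
}
```

exact_size() is 7 and roomy_size() is 20, and roomy_tail_is_zero() holds, so roomy has 13 spare bytes after its terminator that a concatenation can safely use.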
Remember in C you have to think like a computer. If you don't explicitly ask for memory you won't have it. So if we are to do your concatenation either we have to use an array that's big enough because we explicitly declared it that way:
char foo[1000]; /* Lots of room */
Or we ask for memory at run time using malloc (a topic for another post).
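Without going into the details of dynamic memory, a rough sketch of the malloc route might look like the following. The function name concat_alloc is hypothetical, and the sketch leans on the standard library's strlen and memcpy instead of hand-written loops.

```c
#include <stdlib.h>  /* malloc */
#include <string.h>  /* strlen, memcpy */

/* Returns a freshly allocated string holding a followed by b,
   or NULL if the allocation fails. The caller must free() the result. */
char *concat_alloc(const char *a, const char *b)
{
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    char *out = malloc(len_a + len_b + 1);  /* +1 for the terminator */

    if (out == NULL) {
        return NULL;
    }
    memcpy(out, a, len_a);                  /* a, without its '\0' */
    memcpy(out + len_a, b, len_b + 1);      /* b, including its '\0' */
    return out;
}
```

The point is the same as with the big array: the room for the result is asked for explicitly, here at run time and sized exactly to fit.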
Let's just look at a working solution then:
concat.c:
#include <stdio.h>
char str1[100] = "Hello";
char str2[] = ", World!"; /* No need to make this big */
int main()
{
char *p;
char *q;
printf("str1 (before concat): %s\n", str1);
for (p = str1; *p != 0; p++) {
/* Skip along to find the end */
}
for (q = str2; *q != 0; p++, q++ ) {
*p = *q;
}
*p = 0; /* Set the last character to 0 */
printf("str1 (after concat): %s\n", str1);
return 0;
}
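The program above works because str1 was declared with room to spare. As a hedged variant, the same two loops can be wrapped in a function that first checks whether the result fits. The name concat_checked and its return convention are assumptions made for this sketch, and it expects dst to already hold a NUL-terminated string.

```c
#include <stddef.h>  /* size_t */

/* Appends src to the string already in dst, where dst_size is the total
   number of bytes in the dst array. Returns 0 on success, or -1 (leaving
   dst untouched) when the combined string plus its '\0' would not fit. */
int concat_checked(char *dst, size_t dst_size, const char *src)
{
    char *p;
    const char *q;
    size_t used = 0;
    size_t need = 0;

    for (p = dst; *p != '\0'; p++) {
        used++;                 /* skip along to the end of dst */
    }
    for (q = src; *q != '\0'; q++) {
        need++;                 /* measure src */
    }
    if (used + need + 1 > dst_size) {
        return -1;              /* not enough room: refuse to write */
    }
    for (q = src; *q != '\0'; p++, q++) {
        *p = *q;                /* copy, overwriting dst's old '\0' first */
    }
    *p = '\0';                  /* put the terminator back */
    return 0;
}
```

Called as concat_checked(str1, sizeof str1, str2) with the declarations from concat.c it succeeds, while the same call on a char str1[] = "Hello" that is only six bytes long returns -1 instead of scribbling over neighbouring memory.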
Disassembling on Linux:
If you compile the above into JUST an object file and don't link it to an executable, you'll keep things less messy:
gcc -c concat.c -o concat.o
You can disassemble concat.o using objdump:
objdump -d concat.o
You'll notice a LOT of unnecessary code in the dump dealing with the printf statements:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: be 00 00 00 00 mov $0x0,%esi
d: bf 00 00 00 00 mov $0x0,%edi
12: b8 00 00 00 00 mov $0x0,%eax
17: e8 00 00 00 00 callq 1c <main+0x1c>
So to get rid of it, just comment out the printfs in your code. Then recompile, this time with optimization turned on:
gcc -O3 -c concat.c -o concat.o
Now you'll get much cleaner output. The -O3 removes some of the frame-pointer related instructions (a MUCH later subject), so the assembly that remains is specific to your code:
Here is the concat.o output when compiled as above and dumped using:
objdump -S -s concat.o
concat.o: File format elf64-x86-64
Contents of section .text:
0000 803d0000 000000b8 00000000 740b6690 .=..........t.f.
0010 4883c001 80380075 f70fb615 00000000 H....8.u........
0020 84d2741d b9000000 000f1f80 00000000 ..t.............
0030 4883c101 88104883 c0010fb6 1184d275 H.....H........u
0040 efc60000 31c0c3 ....1..
Contents of section .data:
0000 48656c6c 6f000000 00000000 00000000 Hello...........
0010 00000000 00000000 00000000 00000000 ................
0020 00000000 00000000 00000000 00000000 ................
0030 00000000 00000000 00000000 00000000 ................
0040 00000000 00000000 00000000 00000000 ................
0050 00000000 00000000 00000000 00000000 ................
0060 00000000 2c20576f 726c6421 00 ...., World!.
Contents of section .comment:
0000 00474343 3a202844 65626961 6e20342e .GCC: (Debian 4.
0010 342e352d 38292034 2e342e35 00 4.5-8) 4.4.5.
Contents of section .eh_frame:
0000 14000000 00000000 017a5200 01781001 .........zR..x..
0010 1b0c0708 90010000 14000000 1c000000 ................
0020 00000000 47000000 00000000 00000000 ....G...........
Disassembly of section .text:
0000000000000000 <main>:
0: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # 7 <main+0x7>
7: b8 00 00 00 00 mov $0x0,%eax
c: 74 0b je 19 <main+0x19>
e: 66 90 xchg %ax,%ax
10: 48 83 c0 01 add $0x1,%rax
14: 80 38 00 cmpb $0x0,(%rax)
17: 75 f7 jne 10 <main+0x10>
19: 0f b6 15 00 00 00 00 movzbl 0x0(%rip),%edx # 20 <main+0x20>
20: 84 d2 test %dl,%dl
22: 74 1d je 41 <main+0x41>
24: b9 00 00 00 00 mov $0x0,%ecx
29: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
30: 48 83 c1 01 add $0x1,%rcx
34: 88 10 mov %dl,(%rax)
36: 48 83 c0 01 add $0x1,%rax
3a: 0f b6 11 movzbl (%rcx),%edx
3d: 84 d2 test %dl,%dl
3f: 75 ef jne 30 <main+0x30>
41: c6 00 00 movb $0x0,(%rax)
44: 31 c0 xor %eax,%eax
46: c3 retq