6

I am attempting to create a UNIX shell in C. If it were in Java, it would be a piece of cake, but I am not so experienced in C. Arrays in C confuse me a bit. I am not sure how to declare or access certain data structures.

I would like to create a string to read in each line. Easy enough: simply an array of characters. I would initialize it as follows:

char line[256]; //Maximum size of each line is 255 characters

And to access an element of this array, I would do as follows:

line[0] = 'a'; //Sets element 0 to 'a'
fgets( line, sizeof line, stdin ); //Gets a line from stdin and places it in line

How does declaring and using a string in this manner differ from declaring it as a pointer? From my understanding, an array in C decays to a pointer. So, would the following be equivalent?

char *line = (char*) malloc( sizeof(char) * 256 );
line[0] = 'a';
fgets( *line, sizeof(line), stdin );

When do you use the pointer character '*', and when don't you? In the example above, is including the '*' in fgets necessary, or correct?

Now, I would like to create an array of strings, or rather, an array of pointers which point to strings. Would I do so as follows?

char *arr[20]; // Declares an array of strings with 20 elements

And how would I access it?

arr[0] = "hello" // Sets element zero of arr to "hello"

Is this correct?

How would I pass this array to a function?

execvp("ls", arr); // Executes ls with argument vector arr

Is that correct, or would I use the pointer *arr? If so, why?

Now even worse, I would like an array of arrays of strings (for example, if I wanted to hold multiple argument vectors, in order to execute multiple commands in pipe sequence). Would it be declared as follows?

char **vector_arr[20]; // An array of arrays of strings

And how would I access an element of this array?

execvp("ls", vector_arr[0]); // Executes ls with first element of vector_arr as argument vector

I thought that I grasped a decent understanding of what a pointer is, and even how arrays relate to pointers, however I seem to be having trouble relating this to the actual code. I guess that when dealing with pointers, I don't know when to reference *var, var, or &var.

Johndt
  • 3
    Too many questions here. Suggest reading a good C book. – OldProgrammer Feb 14 '14 at 02:50
  • Actually I think OP managed to hit most every common array/string question in a single post and to do so rather logically and eloquently. Should the post follow the site dictates and be broken up? Maybe. But a good answer would make nice reference in a single page. – Duck Feb 14 '14 at 02:52
  • It might make a nice reference, but I'm not sure StackOverflow is about writing references/tutorials? – keshlam Feb 14 '14 at 02:55
  • I think it's essentially one question (how do arrays work in C), taken to a few different levels of abstraction. I can use "strings" in C all day, but when I attempt to abstract the idea of an array to create an array of strings it seems to reveal a lack of understanding of the underlying concept. – Johndt Feb 14 '14 at 03:00
  • possible duplicate of [Making an Array to Hold Arrays of Character Arrays in C](http://stackoverflow.com/questions/16091848/making-an-array-to-hold-arrays-of-character-arrays-in-c) – jww Feb 14 '14 at 07:29
  • I am not sure that "writing a Unix shell" is a good idea if one does not understand "C arrays." I think it would make more sense to master the lower level concepts before tackling the higher-level ones. This is assuming, of course, that this is an academic endeavor and not a practical one. There are already too many Unix shells, we do not need another one. –  Feb 14 '14 at 17:26
  • 1
    @John Gaughan: Writing a Unix shell is a very common assignment for CS students in an introductory Operating Systems course. In addition to giving good coding practice with arrays (e.g. for command-line parsing), it also introduces job-control concepts and may be a student's first non-trivial program. – par Feb 14 '14 at 20:27
  • @John Gaughan I understand arrays fine. I understand programming fine. This is, however, my first use of the C language. All of the concepts of a UNIX shell are quite simple (command line parsing, forking, piping, redirects, etc.). The difficult part is learning a new language. I do thank you for your concern, however. – Johndt Feb 17 '14 at 16:11

3 Answers

7

Let's talk about expressions and types as they relate to arrays in C.

Arrays

When you declare an array like

char line[256];

the expression line has type "256-element array of char"; except when this expression is the operand of the sizeof or unary & operators, it will be converted ("decay") to an expression of type "pointer to char", and the value of the expression will be the address of the first element of the array. Given the above declaration, all of the following are true:

 Expression             Type            Decays to            Equivalent value
 ----------             ----            ---------            ----------------
       line             char [256]      char *               &line[0]
      &line             char (*)[256]   n/a                  &line[0]
      *line             char            n/a                  line[0]
    line[i]             char            n/a                  n/a
   &line[0]             char *          n/a                  n/a
sizeof line             size_t          n/a                  Total number of bytes 
                                                               in array (256)

Note that the expressions line, &line, and &line[0] all yield the same value (the address of the first element of the array is the same as the address of the array itself); only the types are different. In the expression &line, the array expression is the operand of the & operator, so the conversion rule above doesn't apply; instead of a pointer to char, we get a pointer to a 256-element array of char. Type matters; if you write something like the following:

char line[256];
char *linep = line;
char (*linearrp)[256] = &line;

printf( "linep    + 1 = %p\n", (void *) (linep + 1) );
printf( "linearrp + 1 = %p\n", (void *) (linearrp + 1) );

you'd get different output for each line; linep + 1 would give the address of the next char following line, while linearrp + 1 would give the address of the next 256-element array of char following line.
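To put numbers on that difference, here is a minimal sketch that measures how many bytes each kind of pointer advances when 1 is added to it. The function names are made up for illustration and are not part of the snippet above.

```c
#include <stddef.h>

/* Bytes skipped when adding 1 to a pointer to char. */
size_t char_pointer_step(void)
{
    char line[256];
    char *linep = line;                 /* line decays to &line[0] */
    return (size_t)((linep + 1) - linep) * sizeof *linep;
}

/* Bytes skipped when adding 1 to a pointer to a 256-element array of char. */
size_t array_pointer_step(void)
{
    char line[256];
    char (*linearrp)[256] = &line;      /* no decay: operand of unary & */
    return (size_t)((char *)(linearrp + 1) - (char *)linearrp);
}
```

The first function should report a step of 1 byte and the second a step of 256 bytes, even though both pointers start at the same address.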

The expression line is not a modifiable lvalue; you cannot assign to it, so something like

char temp[256];
...
line = temp;

would be illegal. No storage is set aside for a variable line separate from line[0] through line[255]; there's nothing to assign to.

Because of this, when you pass an array expression to a function, what the function receives is a pointer value, not an array. In the context of a function parameter declaration, T a[N] and T a[] are interpreted as T *a; all three declare a as a pointer to T. The "array-ness" of the parameter has been lost in the course of the call.
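A small sketch of that loss of "array-ness", assuming nothing beyond standard C. The two function names are hypothetical; they only exist to compare what sizeof reports on each side of a call.

```c
#include <stddef.h>

/* buf is a pointer here, whatever the caller passed: sizeof yields the
   size of a char *, not the size of the caller's array. */
size_t size_seen_by_callee(char *buf)
{
    return sizeof buf;
}

/* In the scope where the array is declared, sizeof yields the whole array. */
size_t size_seen_by_caller(void)
{
    char line[256] = "";
    return sizeof line;
}
```

Calling size_seen_by_callee(line) with a 256-element array still returns sizeof (char *), which is why functions that work on buffers normally take the size as a separate argument.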

All array accesses are done in terms of pointer arithmetic; the expression a[i] is evaluated as *(a + i). The array expression a is first converted to an expression of pointer type as per the rule above, then we offset i elements from that address and dereference the result.
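As a quick illustration of that equivalence, the sketch below reads the same element both ways. The helper names are invented for this example.

```c
#include <stddef.h>

/* Reads element i with the subscript operator. */
char element_by_subscript(const char *a, size_t i)
{
    return a[i];
}

/* Reads element i by offsetting the pointer and dereferencing the result. */
char element_by_arithmetic(const char *a, size_t i)
{
    return *(a + i);
}
```

Both functions return the same character for any valid index, whether the argument started life as an array or as a pointer.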

Unlike Java, C does not set aside storage for a pointer to the array separate from the array elements themselves: all that's set aside is the following:

+---+
|   | line[0]
+---+
|   | line[1]
+---+
 ...
+---+
|   | line[255]
+---+

Nor does C allocate memory for arrays from the heap (for whatever definition of heap). If the array is declared auto (that is, local to a block and without the static keyword), the memory will be allocated from wherever the implementation gets memory for local variables (what most of us call the stack). If the array is declared at file scope or with the static keyword, the memory will be allocated from a different memory segment, and it will be allocated at program start and held until the program terminates.

Also unlike Java, C arrays contain no metadata about their length; C assumes you knew how big the array was when you allocated it, so you can track that information yourself.
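One common way to track that information, sketched under the assumption that the array declaration is still in scope where the macro is used. ARRAY_LEN and count_char are illustrative names, not standard library facilities.

```c
#include <stddef.h>

/* Element count of a true array. Only valid where the array declaration is
   in scope; applied to a pointer it silently gives the wrong answer. */
#define ARRAY_LEN(a) (sizeof (a) / sizeof (a)[0])

/* A function only receives a pointer, so the length travels as a second
   argument. Counts how many of the first len elements equal c. */
size_t count_char(const char *buf, size_t len, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == c)
            n++;
    return n;
}
```

A caller that owns char line[256] can write count_char(line, ARRAY_LEN(line), 'a'); a caller that only has a char * has to get the length from somewhere else.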

Pointers

When you declare a pointer like

char *line;

the expression line has type "pointer to char" (duh). Enough storage is set aside to store the address of a char object. Unless you declare it at file scope or with the static keyword, it won't be initialized and will contain some random bit pattern that may or may not correspond to a valid address. Given the above declaration, all of the following are true:

 Expression             Type            Decays to            Equivalent value
 ----------             ----            ---------            ----------------
       line             char *          n/a                  n/a
      &line             char **         n/a                  n/a
      *line             char            n/a                  line[0]
    line[i]             char            n/a                  n/a
   &line[0]             char *          n/a                  n/a
sizeof line             size_t          n/a                  Total number of bytes
                                                               in a char pointer
                                                               (anywhere from 2 to
                                                               8 depending on the
                                                               platform)

In this case, line and &line do give us different values, as well as different types; line is a simple scalar object, so &line gives us the address of that object. Again, array accesses are done in terms of pointer arithmetic, so line[i] works the same whether line is declared as an array or as a pointer.

So when you write

char *line = malloc( sizeof *line * 256 ); // note no cast, sizeof expression

this is the case that works like Java; you have a separate pointer variable that references storage that's allocated from the heap, like so:

+---+ 
|   | line -------+
+---+             |
 ...              |
+---+             |
|   | line[0] <---+
+---+
|   | line[1]
+---+
 ...
+---+
|   | line[255]
+---+

Unlike Java, C won't automatically reclaim this memory when there are no more references to it. You'll have to explicitly deallocate it when you're finished with it using the free library function:

free( line );
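Putting malloc, fgets and free together, here is a rough sketch of reading one line into heap storage. read_line is a hypothetical helper, and stripping the newline is a choice made for this example rather than something fgets does.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reads one line from in into a freshly malloc'ed buffer of capacity bytes.
   Returns NULL on end of file, read error, or allocation failure.
   The caller owns the result and must free() it. */
char *read_line(FILE *in, int capacity)
{
    if (capacity < 2)
        return NULL;

    char *line = malloc((size_t)capacity);  /* sizeof(char) is 1 by definition */
    if (line == NULL)
        return NULL;

    if (fgets(line, capacity, in) == NULL) {
        free(line);                         /* nothing read: do not leak the buffer */
        return NULL;
    }
    line[strcspn(line, "\n")] = '\0';       /* drop the trailing newline, if any */
    return line;
}
```

A shell loop might call read_line(stdin, 256), use the result, and then pass it to free before reading the next line.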

As for your specific questions:

fgets( *line, sizeof(line), stdin );

When do you use the pointer character '*', and when don't you? In the example above, is including the '*' in fgets necessary, or correct?

It is not correct; fgets expects the first argument to have type "pointer to char"; the expression *line has type char. This follows from the declaration:

char *line; 

Secondly, sizeof(line) only gives you the size of the pointer, not the size of what the pointer points to; unless you want to read exactly sizeof (char *) bytes, you'll have to use a different expression to specify the number of characters to read:

fgets( line, 256, stdin );

Now, I would like to create an array of strings, or rather, an array of pointers which point to strings. Would I do so as follows?

char *arr[20]; // Declares an array of strings with 20 elements

C doesn't have a separate "string" datatype the way C++ or Java do; in C, a string is simply a sequence of character values terminated by a 0. They are stored as arrays of char. Note that all you've declared above is a 20-element array of pointers to char; those pointers can point to things that aren't strings.

If all of your strings are going to have the same maximum length, you can declare a 2D array of char like so:

char arr[NUM_STRINGS][MAX_STRING_LENGTH + 1]; // +1 for 0 terminator

and then you would assign each string as

strcpy( arr[i], "some string" );
strcpy( arr[j], some_other_variable );
strncpy( arr[k], another_string_variable, MAX_STRING_LENGTH );

although beware of strncpy; it won't append the 0 terminator to the destination if the source string has MAX_STRING_LENGTH or more characters. You'll have to make sure the terminator is present before using the result with the rest of the string library.
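One way to guard against the missing terminator is to wrap strncpy in a helper that always writes it. This is only a sketch; copy_string is a made-up name, and silently truncating is an assumption about what the caller wants.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst (dst_size bytes available), truncating if needed,
   and always leaves dst 0-terminated. dst_size must be at least 1. */
void copy_string(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size - 1);    /* leave room for the terminator */
    dst[dst_size - 1] = '\0';           /* strncpy may not have written one */
}
```

With the 2D array from above, a call would look like copy_string( arr[k], sizeof arr[k], another_string_variable ); since arr[k] is itself an array, sizeof gives its full size.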

If you want to allocate space for each string separately, you can declare the array of pointers, then allocate each pointer:

char *arr[NUM_STRINGS];
...
arr[i] = malloc( strlen("some string") + 1 );
strcpy( arr[i], "some string" );
...
arr[j] = strdup( "some string" ); // not available in all implementations, calls
                                  // malloc under the hood
...
arr[k] = "some string";  // arr[k] contains the address of the *string literal*
                         // "some string"; note that you may not modify the contents
                         // of a string literal (the behavior is undefined), so 
                         // arr[k] should not be used as an argument to any function
                         // that tries to modify the input parameter.

Note that each element of arr is a pointer value; whether these pointers point to strings (0-terminated sequences of char) or not is up to you.
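For implementations without strdup, a stand-in is easy to sketch with malloc and strcpy. copy_to_heap and free_strings are hypothetical helpers written for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for strdup: returns a malloc'ed, modifiable copy of s,
   or NULL if the allocation fails. */
char *copy_to_heap(const char *s)
{
    char *copy = malloc(strlen(s) + 1);     /* +1 for the 0 terminator */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Frees the first n strings of arr and resets the pointers to NULL.
   The array itself is not freed. */
void free_strings(char *arr[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        free(arr[i]);
        arr[i] = NULL;
    }
}
```

Unlike a pointer to a string literal, a copy made this way may be modified, for example arr[0][0] = 'p'.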

Now even worse, I would like an array of arrays of strings (for example, if I wanted to hold multiple argument vectors, in order to execute multiple commands in pipe sequence). Would it be declared as follows?

char **vector_arr[20]; // An array of arrays of strings

What you've declared is an array of pointers to pointers to char; note that this is perfectly valid if you don't know how many pointers to char you need to store in each element. However, if you know the maximum number of arguments per element, it may be clearer to write

char *vector_arr[20][N];

Otherwise, you'd have to allocate each array of char * dynamically:

char **vector_arr[20] = { NULL }; // initialize all the pointers to NULL

for ( i = 0; i < 20; i++ )
{
  // the type of the expression vector_arr is 20-element array of char **, so
  // the type of the expression vector_arr[i] is char **, so
  // the type of the expression *vector_arr[i] is char *, so
  // the type of the expression vector_arr[i][j] is char *, so
  // the type of the expression *vector_arr[i][j] is char

  vector_arr[i] = malloc( sizeof *vector_arr[i] * num_args_for_this_element );
  if ( vector_arr[i] )
  {
    for ( j = 0; j < num_args_for_this_element; j++ )
    {
      vector_arr[i][j] = malloc( sizeof *vector_arr[i][j] * (size_of_this_element + 1) );
      // assign the argument
      strcpy( vector_arr[i][j], argument_for_this_element );
    }
  }
}

So, each element of vector_arr points to the first element of an N-element array of pointers, each of which points to the first element of an M-element array of char.
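Since these vectors are meant for execvp, which expects a NULL pointer after the last argument, one possible shape for the allocation is a pair of helpers that build and release a NULL-terminated vector. make_argv and free_argv are names invented for this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Frees a NULL-terminated argument vector built by make_argv. */
void free_argv(char **argv)
{
    if (argv == NULL)
        return;
    for (size_t i = 0; argv[i] != NULL; i++)
        free(argv[i]);
    free(argv);
}

/* Builds a heap-allocated, NULL-terminated copy of the n strings in words.
   Returns NULL if any allocation fails. */
char **make_argv(const char *const words[], size_t n)
{
    char **argv = malloc((n + 1) * sizeof *argv);   /* +1 for the NULL slot */
    if (argv == NULL)
        return NULL;
    for (size_t i = 0; i <= n; i++)
        argv[i] = NULL;                 /* keeps free_argv safe on partial builds */

    for (size_t i = 0; i < n; i++) {
        argv[i] = malloc(strlen(words[i]) + 1);
        if (argv[i] == NULL) {
            free_argv(argv);
            return NULL;
        }
        strcpy(argv[i], words[i]);
    }
    return argv;
}
```

Each slot of char **vector_arr[20] can then hold the result of one make_argv call, and every slot that was filled needs a matching free_argv.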

John Bode
  • 1
    Thank you. A very informative answer. It really helped to clarify everything. I think I've got a pretty good grasp now. I've successfully parsed the command line input and executed single commands. Now it's just a matter of setting up piping and redirects, which shouldn't be too difficult. – Johndt Feb 17 '14 at 16:59
3

You're really on the right track.

In your second example, where you use malloc(), the fgets() command would be called like so:

fgets( line, 256, stdin ); /* vs. fgets( *line ... ) as you have; with malloc(), sizeof(line) is only the size of the pointer */

The reason for this is that in C a named array variable is always just a pointer. So:

char line[256];

declares (and defines) a pointer called line that points to 256 bytes of memory allocated at compile time (probably on the stack).

char *line; also declares a pointer, but the memory it points to is not assigned by the compiler. When you call malloc you typecast the return value to char * and assign it to line so the memory is allocated dynamically on the heap.

Functionally though, the variable line is just a char * (pointer to char) and if you look at the declaration of fgets in the <stdio.h> file, you'll see what it expects as its first argument:

char *fgets(char * restrict str, int size, FILE * restrict stream);

... namely a char *. So you could pass line either way you declared it (as a pointer or as an array).

With respect to your other questions:

char *arr[20]; declares an array of 20 uninitialized pointers to char. To use this array, you would iterate 20 times over the elements of arr and assign each one with some result of malloc():

arr[0] = (char *) malloc( sizeof(char) * 256 );
arr[1] = (char *) malloc( sizeof(char) * 256 );
...
arr[19] = (char *) malloc( sizeof(char) * 256 );

Then you could use each of the 20 strings. To pass the second one to fgets, which expects a char * as its first argument, you would do this:

fgets( arr[1], ... );

Then fgets gets the char * it expects.

Be aware of course that you have to call malloc() before you attempt this or arr[1] would be uninitialized.

Your example using execvp() is correct (assuming you allocated all these strings with malloc() first). vector_arr[0] is a char **, which execvp() expects. [Remember also that execvp() expects the last pointer of your vector array to have the value NULL; see the man page for clarification.]

Note that execvp() is declared like so (see <unistd.h>)

int execvp(const char *file, char *const argv[]);

removing the const attribute for clarity, it could also have been declared like so:

int execvp( const char *file, char **argv );

The declaration of char **array being functionally equivalent to char *array[].
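For context, here is a minimal sketch of how a shell might hand such a vector to execvp(), assuming a POSIX system. run_command is a hypothetical helper, and the exit code 127 for a failed exec follows the usual shell convention rather than anything execvp() requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] with the NULL-terminated vector argv in a child process.
   Returns the child's exit status, or -1 if the child could not be
   created, waited for, or did not exit normally. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */

    if (pid == 0) {
        execvp(argv[0], argv);          /* only returns on failure */
        _exit(127);                     /* conventional "command not found" */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that argv is passed as is, with no * or &: the array of pointers decays to a char ** on its own, for example char *args[] = { "ls", "-l", NULL }; run_command( args );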

Remember also that in every example where we use malloc(), you'll have to at some point use a corresponding free() or you'll leak memory.

I'll also point out that, generally speaking, although you can do an array of vectors (and arrays of arrays of vectors and so on), as you extend your arrays more and more dimensionally you'll find the code gets harder and harder to understand and maintain. Of course you should learn how this all works and practice until you understand it fully, but if in the course of designing your code you find yourself thinking you need arrays of arrays of arrays you are probably overcomplicating things.

par
  • So, from my understanding, declaring an array using arr[int] notation makes the array constant. In the case of a string, I would then have to use the string functions (strcpy, strcat) to modify the string, correct? When declared as a pointer however *arr, the array is dynamic. Do you have to use malloc to create space for a value before you assign it? Or does assigning a value to the array create the space? Thanks for your answer, it helped a lot. – Johndt Feb 14 '14 at 03:18
  • 1
    Unless you use the `const` keyword, the array content is *not* constant. `const` is a separate topic so don't worry about it yet. Regardless of which way you declare `line` (with malloc or not) you can still say `line[0] = 'a';` (try it!). Just don't pass the one that was allocated by the compiler to `free()`. – par Feb 14 '14 at 03:23
  • 2
    *makes the array constant* is a bad way to think of it. Think of it as `arr[int]` has an assigned fixed address. With `*arr` the pointer variable arr has a fixed address but what it points to (its value) is variable and can change. – Duck Feb 14 '14 at 03:24
  • In C you *must* make sure you have space available for a value before you assign it. If you declare line like so: `char line[256];` you tell the compiler to allocate 256 bytes for you, so it is ready to use at the next line of code. `char *line;` simply says `line` can be used as an array, but it says nothing about how big the memory is that `line` points to. You determine that by using `malloc` and you have to keep track of how big the array is. `sizeof(line)` will behave differently depending on how you declare line. One returns the array size, the other returns the pointer size (try it!) – par Feb 14 '14 at 03:27
  • @par: The statement "declares (and defines) a pointer called line that points to 256 bytes of memory allocated at compile time (probably on the stack)." is not correct; no storage is set aside for a pointer value separately from the array elements. There is no `line` variable that's separate from `line[0]` through `line[255]`. – John Bode Feb 14 '14 at 16:05
  • For comparison, the `nrutil` appendix of "Numerical Recipes in C" has examples of functions that allocate `char *cvector` and `char **cmatrix`. It may also have had `int ***itensor` (or maybe I only extrapolated that). Useful as a reference to study examples, but not exactly pedagogical in explaining why. – kbshimmyo Feb 14 '14 at 17:45
  • @John Bode: My statement is correct--the *value* of `line` is the address of the first byte of the array. It *is* a pointer (and of course `line` can be used as one). I did not mean to imply that `char line[256];` allocates separately a pointer and then memory as you'd see with the `malloc()` use case, but I do mean to say it should still be thought of as distinct: both a pointer (the address of the array) and the memory for the array. Under the covers the compiler is either managing the value of `line` in a register or a stack-offset instruction. Either way it truly is independent. – par Feb 14 '14 at 20:15
2

Here is a partial answer to the OP.

char *line = (char*) malloc( sizeof(char) * 256 );
line[0] = 'a';
fgets( *line, sizeof(line), stdin );

The arguments to fgets() are wrong; it should be fgets( line, 256, stdin );.

Explanation:

  1. fgets() expects its first argument to be a char *, so you can pass either a pointer to char or an array of char (the array name will decay to char * in this case).

    When used as an argument to a function, an array name decays to a pointer.

  2. Because line is a pointer, sizeof(line) will give you the size of a pointer (usually 4 on a 32-bit system); but if line is an array, such as char line[100], sizeof(line) will give you the size of the array, in this case 100 * sizeof(char).

    When used as the operand of the sizeof operator, an array name does not decay to a pointer.
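The three sizes that tend to get mixed up here can be compared side by side. This is only an illustrative sketch; the function names are invented, and the buffer contents are arbitrary.

```c
#include <stddef.h>
#include <string.h>

/* Array operand: no decay, so this is 100 * sizeof(char). */
size_t array_capacity(void)
{
    char line[100] = "ls";
    return sizeof line;
}

/* Pointer operand: the size of the pointer itself, whatever it points to. */
size_t pointer_size(void)
{
    char line[100] = "ls";
    char *p = line;
    return sizeof p;
}

/* Number of characters before the 0 terminator, not the storage size. */
size_t string_length(void)
{
    char line[100] = "ls";
    return strlen(line);
}
```

For fgets() the relevant number is the capacity of the buffer, which sizeof can only supply when the buffer is a real array in scope.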

Community
Lee Duhem
  • Thanks for your answer. So then for the size of *line, I would want to use strlen(line) instead, correct? If I wanted the size of the array of strings, then I would have to have the size stored, as sizeof would not work? – Johndt Feb 14 '14 at 03:22
  • 1
    @JohnT You cannot use `strlen(line)` to get the length of `line`, because `strlen()` expects a string. In C, this means a sequence of characters followed by a `\0`, but the content of this `malloc`ed memory is unknown; it could be anything. – Lee Duhem Feb 14 '14 at 03:34
  • 1
    You're kind of trying to run before you walk, but strlen() will tell you the logical length of line (which is what you want), not the physical length. In C, you denote the end of a string by having the last byte set to zero. So `line[0] = 'a'; line[1] = 0;` would create a string that is logically one character long, and strlen() would return 1. Notice though that you actually had to use two bytes, one for 'a' and one for the NUL (zero) terminator. And *that* says nothing about the physical size of line, which we know is 256 bytes! – par Feb 14 '14 at 03:35
  • @JohnT By 'the size of the array of strings', which size do you want to get? The number of strings in this array? The sum of the lengths of all strings in this array? Or something else? – Lee Duhem Feb 14 '14 at 03:37
  • 1
    Okay, I get it now. If using fgets(), you would want the amount of space the array has (256), so strlen() would not be appropriate in this case, but when you want the length of a string. I guess I sorta knew that, just slipped my mind. – Johndt Feb 14 '14 at 03:40