0

I'm relatively new to C, so any help understanding what's going on would be awesome!!!

I have a struct called Token that is as follows:

//Token struct
struct Token {
    char type[16];
    char value[1024];
};

I am trying to read from a file and append characters read from the file into Token.value like so:

    struct Token newToken;
    char ch;
    ch = fgetc(file);
    strncat(newToken.value, &ch, 1);

THIS WORKS!

My problem is that Token.value begins with several values I don't understand, preceding the characters that I appended. When I print the result of newToken.value to the console, I get @�����TheCharactersIWantedToAppend. I could probably figure out a band-aid solution to retroactively remove or work around these characters, but I'd rather not if I don't have to.

In analyzing the � characters, I see them as (in order from index 1-5): \330, \377, \377, \377, \177. I read that \377 is a special character for EOF in C, but also 255 in decimal? Do these values make up a memory address? Am I adding the address to newToken.value by using &ch in strncat? If so, how can I keep them from getting into newToken.value?

Note: I get a segmentation fault if I use strncat(newToken.value, ch, 1) instead of strncat(newToken.value, &ch, 1) (ch vs. &ch).

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 4
    This struct is uninitialized. See here for how to initialize a struct: https://stackoverflow.com/questions/330793/how-to-initialize-a-struct-in-accordance-with-c-programming-language-standards The consequence of not intializing it is that it may have garbage data in it before you start writing to it. That data looks like a string, so strcat appends to the end of it. – Nick ODell Feb 16 '23 at 02:55
  • There may be worth remembering that C "strings" are just arrays of chars, and the "end" of a string is signaled by the first byte with a \000 value (starting to look from the beginning of the string). – Diego Ferruchelli Feb 16 '23 at 03:01
  • Ah that was it. Is there any problem with initializing the variables in a struct as NULL? @Nick ODell – Andrew Estes Feb 16 '23 at 03:02
  • @AndrewEstes You have to initialize each char of the array (actually just the first one) with a (char) 0 value. If you're using pointers instead of arrays (or a pointer to the first char of the array), don't mistake initializing the array elements with initializing the pointer itself. NULL is defined as (void *) 0, not as (char) 0. – Diego Ferruchelli Feb 16 '23 at 03:10
  • 1
    The use of `strncat()` doesn’t work like you think it does, either. Using it that way is asking for a non-terminated string. – Dúthomhas Feb 16 '23 at 03:18
  • I could manually terminate the strings though, right? Once I've read the characters I want to read I could append a `\0` no? @Dúthomhas – Andrew Estes Feb 16 '23 at 03:24
  • @AndrewEstes Two things. You have to initialize the target "string" (the array inside your struct) (actually, just set its first char to 0) BEFORE adding other chars to it. You don't need to write a 0 AFTER each call to strncat() because the function does it. But Dúthomhas is right: the second parameter to strncat() must also be a 0 terminated string (that is, the char you have just read followed by another char with a 0 after it). – Diego Ferruchelli Feb 16 '23 at 03:46
  • 2
    A better/faster solution (if you're always appending one char at a time) would be to have a variable (maybe an int field inside your struct) with the next position to be written, beginning with 0. Let's call it `index`. Then, you just do `newToken.value [index++] = ch`. That's it. Just remember to check that `index` is not exceeding the size of the target string minus two (one because arrays begin at [0] and another one reserved for writing the final 0) (yes, in this scenario, you have to manually add the final 0) – Diego Ferruchelli Feb 16 '23 at 03:54
  • "_THIS WORKS!_" - clearly it doesn't, as you then go on to explain. – Clifford Feb 16 '23 at 07:32
  • 2
    Take a look also at https://www.joelonsoftware.com/2001/12/11/back-to-basics/ for a detailed explanation why `strncat()` is often a bad solution. – Clifford Feb 16 '23 at 08:02

2 Answers2

2

I'll try to consolidate the answers already given in the comments.

This version of the code uses strncat(), as yours, but solving the problems noted by Nick (we must initialize the target) and Dúthomhas (the second parameter to strncat() must be a string, and not a pointer to a single char) (Yes, a "string" is actually a char[] and the value passed to the function is a char*; but it must point to an array of at least two chars, the last one containing a '\0'.)

Please be aware that strncat(), strncpy() and all related functions are tricky. They don't write more than N chars. But strncpy() only adds the final '\0' to the target string when the source has less than N chars; and strncat() always adds it, even if it the source has exactly N chars or more (edited; thanks, @Clifford).

#include <stdio.h>
#include <string.h>

int main() {
    FILE* file = stdin; // fopen("test.txt", "r");
    if (file) {
        struct Token {
            char type[16];
            char value[1024];
        };

        struct Token newToken;
        newToken.value[0] = '\0';                       // A '\0' at the first position means "empty"

        int aux;
        char source[2] = "";                            // A literal "" has a single char with value '\0', but this syntax fills the entire array with '\0's
        while ((aux = fgetc(file)) != EOF) {
            source[0] = (char)aux;
            strncat(newToken.value, source, 1);         // This appends AT MOST 1 CHAR (and always adds a final '\0')
        }

        strncat(newToken.value, "", 1);                 // As the source string is empty, it just adds a final '\0' (superfluous in this case)
        printf(newToken.value);
    }
    return 0;
}

This other version uses an index variable and writes each singe char directly into the "current" position of the target string, without using strncat(). I think is simpler and more secure, because it doesn't mix the confusing semantics of single chars and strings.

#include <stdio.h>
#include <string.h>

int main() {
    FILE* file = stdin; // fopen("test.txt", "r");
    if (file) {
        struct Token {
            int index = 0;
            char type[16];
            char value[1024];        // Max size is 1023 chars + '\0'
        };

        struct Token newToken;
        newToken.value[0] = '\0';    // A '\0' at the first position means "empty". This is not really necessary anymore

        int aux;
        while ((aux = fgetc(file)) != EOF)
            // Index will stop BEFORE 1024-1 (value[1022] will be the last "real" char, leaving space for a final '\0')
            if (newToken.index < sizeof newToken.value -1)
                newToken.value[newToken.index++] = (char)aux;

        newToken.value[newToken.index++] = '\0';
        printf(newToken.value);
    }
    return 0;
}

Edited: fgetc() returns an int and we should check for EOF before casting it to a char (thanks, @chqrlie).

  • 1
    See https://linux.die.net/man/3/strncat and note "_src does not need to be null-terminated if it contains n or more bytes._", which makes `&c` valid, when `num == 1`. – Clifford Feb 16 '23 at 16:54
  • @Clifford You're right, but Microsoft documentation is not so clear. I guess it's implementation dependant (or, more probably, a de facto standard derived from what the "original" version of the function did). Anyway, it's not a good practice to use a non-terminated string, and it's not a good practice using the "classic", unsecure versions of these functions. What surprised me was that `strncat()` writes at most N characters plus the `'\0'`. That is, it *always* appends the terminator to the target string! This is totally inconsistent with the behaviour of `strncpy()`, which I had in mind. – Diego Ferruchelli Feb 16 '23 at 21:25
  • `(source[0] = (char)fgetc(file)) != -1` cannot detect end of file if `char` type is unsigned. – chqrlie Feb 16 '23 at 22:06
  • @chqrlie The default (sgned or unsigned char) is implementation defined. I ran it on x86. But yes, you should explicitly declare it as signed. It won't affect the rest of the code. – Diego Ferruchelli Feb 17 '23 at 00:26
  • 1
    @DiegoFerruchelli: declaring the array as `signed char` is not the correct way to fix this issue: you would be stopping the loop if a byte value of `0xff` is read from the file. The loop test should use an `int` variable to store the return value of `fgetc()` and compare it to `EOF`, and only store the byte to the array if not at end of file. The second program would be OK if you change the type of `source` to `int` and change the test to `while ((source = fgetc(file)) != EOF)`. Note that `EOF` is not necessarily defined as `-1`. – chqrlie Feb 17 '23 at 08:52
  • @DiegoFerruchelli: also you should probably not increment the index in `newToken.value[newToken.index++] = '\0';` – chqrlie Feb 17 '23 at 08:55
  • @chqrlie No, that's perfectly right and **very** usual in C. The form `index++` takes the value of `index` for using it in the expression (in this case, for pointing to a specific char in the string) and then (that is, AFTER that) increments the value of the variable. It"s called the post-increment operator. There's also a form `++index`, which first increments.and then takes the value. – Diego Ferruchelli Feb 17 '23 at 11:54
  • @DiegoFerruchelli: regarding `newToken.value[newToken.index++] = '\0';` the code is perfectly defined, but I think `newToken.index` should not be incremented when you write the null terminator. This makes it the length of the string and the place where further characters can be appended. – chqrlie Feb 17 '23 at 14:46
  • @chqrlie Oh! You were talking about THAT `index++`! Yes, I thought about it, and both options are acceptable. But I think it's more consistent to _always_ advance the counter, even after adding the terminator (which is also a char and also takes space in the buffer). This way, `index` _always_ point to the next unused place in the buffer (not the next writable place in the string) and _always_ contains the used length of the buffer (not the `strlen()`, which will be `index-1`). My decision was made in the context of this specific question and having in mind the need to prevent buffer overruns. – Diego Ferruchelli Feb 17 '23 at 15:34
  • @chqrlie You were right about the EOF detection (although it only affects one specific case). I already corrected that, adding an explicit cast to char when inserting the value into the buffer (just for clarity). Thanks. – Diego Ferruchelli Feb 17 '23 at 15:42
1

You are appending string that is not initialised, so can contain anything. The end I'd a string is indicated by a NUL(0) character, and in your example there happened to be one after 6 bytes, but there need not be any within the value array, so the code is seriously flawed, and will result in non-deterministic behaviour.

You need to initialise the newToken instance to empty string. For example:

struct Token newToken = { "", "" } ;

or to zero initialise the whole structure:

struct Token newToken = { 0 } ;

The point is that C does not initialise non-static objects without an explicit initialiser.

Furthermore using strncat() is very inefficient and has non-deterministic execution time that depends on the length of the destination string (see https://www.joelonsoftware.com/2001/12/11/back-to-basics/).

In this case you would do better to maintain a count of the number of characters added, and write the character and terminator directly to the array. For example:

size_t index ;
int ch = 0 ;

do
{
    ch = fgetc(file);
    if( ch != EOF ) 
    {
        newToken.value[index] = (char)ch ;
        index++ ;
        newToken.value[index] = '\0' ;
    }
} while( ch != EOF && 
         index < size of(newToken.value) - 1 ) ;
Clifford
  • 88,407
  • 13
  • 85
  • 165
  • I agree (everything here was already said before). But if you're really concerned about performance, you should just add the final `'\0'` (to the target string) in the very end of the routine, after the loop. It's no use writing it after each read char, and then writing the next char at the same position. Also, as I said, `strncat()` **does** add a final `'\0'`, but only when the input string is shorter than the specified number of chars. Performance aside, these functions are tricky, ugly and not good for newcomers. – Diego Ferruchelli Feb 16 '23 at 15:58
  • 1
    @DiegoFerruchelli I originally considered placing the null after the loop but decided against it. In this case there is no significant performance advantage to be gained because the overhead is orders of magnitude less than the time spent in file I/O (`fgetc()`), so I considered an unnecessary optimisation. If the loop perhaps got expanded to many lines, it would be easy to forget the termination, and it might no longer have "proximity" making it less clear whay it was there. – Clifford Feb 16 '23 at 16:47
  • 1
    @DiegoFerruchelli The `strncat()` call specifically mentioned. The `strncat()` not null terminating point was nonsense and deleted. I took `Dúthomhas' comment at face value. My defence for not being familiar with `strncat()` behaviour is that I don't need to know because I would semdom use it and never in this manner! ;-) – Clifford Feb 16 '23 at 16:48
  • Yes, of course, the time spent in the extra `'\0'` is not meaningful at all. But I don't think this as an optimisation, but as a matter of clarity. And about the double checking of the condition, both in the `if` and in the `while`, well... that... ;) – Diego Ferruchelli Feb 16 '23 at 20:15