0

I have a notepad file with approximately 150,000 words (representing a dictionary). I'm trying to scan in each word and print it to the console. This setup works fine:

void readDictionary(FILE *ifp, int numWords) {
    fscanf(ifp, "%d", &numWords);
    printf("%d\n", numWords);

    int i;
    char* words = (char*)malloc(20 * sizeof(char));
    for(i = 0; i < numWords; i++) {
        fscanf(ifp, "%s", words);
        printf("%s\n", words);
    }
}

However, this code obviously overwrites "words" each time it loops. I'm trying to get each word to save to a certain array element. I did the following but it instantly crashes (I changed the memory allocation to 2D because I read around here and it seems that is what I am supposed to do):

void readDictionary(FILE *ifp, int numWords) {
    fscanf(ifp, "%d", &numWords);
    printf("%d\n", numWords);

    int i;
    char** words = (char**)malloc(20 * sizeof(char*));
    for(i = 0; i < numWords; i++) {
        fscanf(ifp, "%s", words[i]);
        printf("%s\n", words[i]);
    }
}

Any help is appreciated. I've read around on many posts but haven't figured it out.

Josh
  • 55
  • 3
  • 10

2 Answers2

3

In your second version, you allocate space for 20 pointers, but you leave those pointers uninitialized and without anything to point to. I'm sure you can imagine how that presents a problem when you then try to read from your dictionary into the memory designated by one of those pointers.

It looks like you want to allocate space for numwords pointers

char** words = malloc(numwords * sizeof(*words));

, and for each of them, to allocate space for a word.

for(i = 0; i < numWords; i++) {
    words[i] = malloc(20);  //  by definition, sizeof(char) == 1
    // ...

Additionally, do check the return value of malloc(), which will be NULL in the event of allocation failure.

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
  • Thank you! I was lost with the memory allocation part but this makes sense the way you explained it. – Josh Jan 25 '17 at 20:24
1

The first problem is you only allocated space for a list of words (ie. character pointers) but you didn't allocate space for the words themselves.

char** words = (char**)malloc(20 * sizeof(char*));

This allocates space for 20 character pointers and assigns it to words. Now words[i] has space for a character pointer but not for the characters.

words[i] contains garbage, because malloc does not initialize memory. When you pass it into fscanf, fscanf tries to use the garbage in words[i] as a memory location to write characters to. That's either going to corrupt some memory in the program, or more likely it tries to read a memory location is isn't allowed to and crashes. Either way, it's not good.

You have to allocate memory for the string, pass that to fscanf, and finally put that string into words[i].

char** words = malloc(numWords * sizeof(char*));
for(i = 0; i < numWords; i++) {
    char *word = malloc(40 * sizeof(char));
    fscanf(ifp, "%39s", word);
    words[i] = word;
    printf("%s\n", words[i]);
}

Note that I didn't cast the result of malloc, that's generally considered unnecessary.

Also note that I allocated space for numWords in the list. Your original only allocates space for 20 words, once it goes over that it'll start overwriting allocated memory and probably crash. As a rule of thumb, avoid constant memory allocations. Get used to dynamic memory allocation as quickly as you can.


Also note that I limited how many characters fscanf is allowed to read to the size of my buffer (minus one because of the null byte at the end of strings). Otherwise if your word list contained "Pneumonoultramicroscopicsilicovolcanoconiosis", 45 characters, it would overrun the word buffer and start scribbling on adjacent elements and that would be bad.

This leads to a new problem that are common to fscanf and scanf: partial reads. When the code above encounters "Pneumonoultramicroscopicsilicovolcanoconiosis" fscanf(ifp, "%39s", word); will read in the first 39 characters, "Pneumonoultramicroscopicsilicovolcanoco" and stop. The next call to fscanf will read "niosis". You'll store and print them as if they were two words. That's no good.

You could solve this by making the word buffer bigger, but now most words will be wasting a lot of memory.

scanf and fscanf have a whole lot of problems and are best avoided. Instead, it's best to read entire lines and parse them with sscanf. In this case you don't need to do any parsing, they're just strings, so getting the line will suffice.

fgets is the usual way to read a line, but that also requires that you try and guess how much memory you'll need to read in the line. To mitigate that, have a large line buffer, and copy the words out of it.

void strip_newline( char* string ) {
    size_t len = strlen(string);
    if( string[len-1] == '\n' ) {
        string[len-1] = '\0';
    }
}

...

int i;

/* The word list */
char** words = malloc(numWords * sizeof(char*));

/* The line buffer */
char *line = malloc(1024 * sizeof(char*));

for(i = 0; i < numWords; i++) {
    /* Read into the line buffer */
    fgets(line, 1024, ifp);

    /* Strip the newline off, fgets() doesn't do that */
    strip_newline(line);

    /* Copy the line into words */
    words[i] = strdup(line);

    printf("%s\n", words[i]);
}

strdup won't copy all 1024 bytes, just enough for the word. This will result in using only the memory you need.


Making assumptions about files, like that they'll have a certain number of lines, is a recipe for problems. Even if the file says it contains a certain number of lines you should still verify that. Otherwise you'll get bizarre errors as you try to read past the end of the file. In this case, if the file has less than numWords it'll try to read garbage and probably crash. Instead, you should read the file until there's no more lines.

Normally this is done by checking the return value of fgets in a while loop.

int i;    
for( i = 0; fgets(line, 1024, ifp) != NULL; i++ ) {
    strip_newline(line);
    words[i] = strdup(line);
    printf("%s\n", words[i]);
}

This brings up a new problem, how do we know how big to make words? You don't. This brings us to growing and reallocating memory. This answer is getting very long, so I'll just sketch it.

char **readDictionary(FILE *ifp) {
    /* Allocate a decent initial size for the list */
    size_t list_size = 256;
    char** words = malloc(list_size * sizeof(char*));

    char *line = malloc(1024 * sizeof(char*));

    size_t i;    
    for( i = 0; fgets(line, 1024, ifp) != NULL; i++ ) {
        strip_newline(line);

        /* If we're about to overflow the list, double its size */
        if( i > list_size - 1 ) {
            list_size *= 2;
            words = realloc( words, list_size * sizeof(char*));
        }

        words[i] = strdup(line);
    }

    /* Null terminate the list so readers know when to stop */
    words[i] = NULL;

    return words;
}

int main() {
    FILE *fp = fopen("/usr/share/dict/words", "r");
    char **words = readDictionary(fp);

    for( int i = 0; words[i] != NULL; i++ ) {
        printf("%s\n", words[i]);
    }
}

Now the list will start at size 256 and grow as needed. Doubling grows pretty fast without wasting too much memory. My /usr/share/dict/words has 235886 lines in it. That can be stored in 218 or 262144. 256 is 28 so it only requires 10 expensive calls to realloc to grow to the necessary size.

I've changed it to return the list, because there isn't much good in building the list if you're just going to use it immediately. This allows me to demonstrate another technique in working with dynamically sized lists, null termination. The last element in the list is set to NULL so anyone reading the list knows when to stop. This is safer and simpler than trying to pass the length around with the list.


That was a lot, but that's all the basic stuff you need to do when working with files in C. It's good to do it manually, but fortunately there are libraries out there which make doing this sort of thing a lot easier. For example, Gnome Lib provides a lot of basic functionality including arrays of pointers that automatically grow as needed.

Schwern
  • 153,029
  • 25
  • 195
  • 336
  • Thanks for this, tons of useful info! As for the last part, I do need to do these things manually as my professor has requested, but I assume it is to get us more comfortable with the language itself. – Josh Jan 25 '17 at 21:42
  • @Josh Yeah, it's good to do it yourself a few times. Then when it comes to real code, use a library. – Schwern Jan 26 '17 at 00:46