Sorting files in C

Question

I am trying to write a program which opens up a text file, reads from the file, changes upper case to lower case, and then counts how many times that word has occurred in the file and prints results into a new text file.

My code so far is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <ctype.h>
#include <string.h>

int main()
{

    FILE *fileIN;
    FILE *fileOUT;
    char str[255];
    char c;
    int i = 0;

    fileIN = fopen ("input.txt", "r");
    fileOUT = fopen ("output.txt", "w");

    if (fileIN == NULL || fileOUT == NULL)
    {
        printf("Error opening files\n");
    }

    else
    {
        while(! feof(fileIN)) //reading and writing loop
        {
            fscanf(fileIN, "%s", str); //reading file


            i = 0;
            c = str[i];
            if (isupper(c)) //changing any upper case to lower case
            {
                c =(tolower(c));
                str[i] = putchar(c);
            }

            printf("%s ", str); //printing output

            fprintf(fileOUT, "%s\n", str); //printing into file
        }




        fclose(fileIN);
        fclose(fileOUT);
    }
    getch();
}

The input.txt file contains the following: "The rain in Spain falls mainly in the plane". Don't ask why. Running the program as is produces this output: the rain in spain falls mainly in the plane

I have managed to convert the upper case words to lower case. I am now having trouble understanding how to count the occurrences of each word. E.g. in the output I would want it to say "the 2", meaning "the" had appeared twice; this would also mean that I do not want any further "the" to be stored in that file.

I am thinking of strcmp and strcpy, but I am unsure how to use them the way I want.

Help would be much appreciated

(Sorry if formatting bad)

If you don't care about performance, you could brute-force the count: build a list of char pointers to dynamically allocated copies of the words, searching for each word before adding it to the list; if it is found, update its counter, otherwise add it with an initial count of 1. You could further improve the performance by binary-insertion-sorting the list as you build it, if you had time. But I wouldn't go that extra mile unless (a) you know you will be handling large files (many hundreds into thousands of words), and (b) you get the basic algorithm down *first*. — WhozCraig, Apr 04 '13 at 23:45
You are using `feof` incorrectly: http://stackoverflow.com/questions/5431941/while-feof-file-is-always-wrong — William Pursell, Apr 04 '13 at 23:54
I am using it as demonstrated by my teachers and powerpoints we have been given, and I have not come across the error that the question you have linked is outlining. — TheAngryBr1t, Apr 05 '13 at 00:09
With all due respect, assuming professorial infallibility when it comes to software engineering is going to hurt like salt in an open wound upon exiting academia. You have here, at *your* disposal, literally thousands of engineers, some of which I will *guarantee* you have forgotten more about the language and its standards, the standard libraries, and the behaviors of both than most instructors will ever know. By all means question what you read here and what you hear there. But don't turn a blind eye to either. — WhozCraig, Apr 05 '13 at 01:05
I am not saying that he is not wrong, I am just saying I have not come across this problem yet, and until I do I am not going to worry about it, especially seeing as I'm only 8 weeks into my course. — TheAngryBr1t, Apr 05 '13 at 07:18
Then ask yourself why the `while(!feof())` loop iterates the loop body once for an empty file. — Jens, Apr 05 '13 at 10:58
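To illustrate the feof point raised in the comments, here is a minimal sketch of a read loop that tests the return value of fscanf instead. The helper name lower_words and the 254-character field width are assumptions made for this example, not part of the original program. It also lowercases every character of each word rather than only the first one.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copies each whitespace-separated word from in to
   out, lowercased, one word per line. Returns the number of words written. */
int lower_words(FILE *in, FILE *out)
{
    char str[255];
    int words = 0;

    assert(in != NULL && out != NULL);

    /* fscanf returns 1 only when a word was actually read, so the loop
       body never runs on stale data once the end of the file is reached.
       The width in %254s keeps the read inside the 255-byte buffer. */
    while (fscanf(in, "%254s", str) == 1) {
        for (char *p = str; *p != '\0'; ++p)
            *p = (char)tolower((unsigned char)*p);
        fprintf(out, "%s\n", str);
        ++words;
    }
    return words;
}
```

With the file handles from the question, this could be called as lower_words(fileIN, fileOUT) in place of the while(! feof(fileIN)) loop. For an empty input file the loop body does not run at all.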

score 1 · Accepted Answer · edited May 23 '17 at 12:16

You may want to create a hash table with the words as keys and frequencies as values.

Sketch ideas:

Recognize words, i.e. alphanumeric strings separated by white space (try using strtok()).
For each word:
- search for the word in the hash table based dictionary
  - if found: increment the frequency
  - if not found: insert a new entry in the dictionary as (word, 1)

At the end, print the contents of the dictionary, i.e. for all entries, entry.word and entry.frequency
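
The steps above could be sketched roughly as follows, using a small chained hash table similar in style to the one in K&R Section 6.6. The names Entry, buckets, count_word and print_counts, and the bucket count of 101, are assumptions chosen for this illustration; error handling is kept minimal.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 101            /* assumed number of buckets */

typedef struct Entry {
    struct Entry *next;         /* next entry in the same bucket */
    char *word;                 /* key */
    int freq;                   /* value */
} Entry;

static Entry *buckets[HASHSIZE];

/* Simple multiplicative string hash, reduced to a bucket index. */
unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = (unsigned char)*s++ + 31 * h;
    return h % HASHSIZE;
}

/* Returns the entry for word, or NULL if it has not been seen yet. */
Entry *lookup(const char *word)
{
    for (Entry *e = buckets[hash(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;
    return NULL;
}

/* If found: increment the frequency. If not found: insert (word, 1).
   Returns the updated frequency, or -1 if memory runs out. */
int count_word(const char *word)
{
    assert(word != NULL);
    Entry *e = lookup(word);
    if (e != NULL)
        return ++e->freq;

    size_t len = strlen(word) + 1;
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(len);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->word, word, len);     /* store a private copy of the key */
    e->freq = 1;
    unsigned h = hash(word);
    e->next = buckets[h];           /* push onto the front of the bucket */
    buckets[h] = e;
    return 1;
}

/* Prints every (word, frequency) pair; the order follows the buckets,
   not the order in which the words were first seen. */
void print_counts(FILE *out)
{
    for (int i = 0; i < HASHSIZE; ++i)
        for (Entry *e = buckets[i]; e != NULL; e = e->next)
            fprintf(out, "%s %d\n", e->word, e->freq);
}
```

Calling count_word once per lowercased word and then print_counts(fileOUT) at the end would cover the two halves of the sketch.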

See this question and answer for details: "Quick Way to Implement Dictionary in C". It is based on Section 6.6 of the bible, "The C Programming Language".

UPDATE based on OP's comment:

A hash table is just an efficient table; if you do not want to use one, you can still use plain tables. Here are some ideas.

typedef struct WordFreq {
    char  word[ N ];
    int   freq;
} WordFreq;

WordFreq wordFreqTable[ T ];

(N is the maximum length of a single word, T is the maximum number of unique words)

For searching and inserting, you can do a linear search in the table:

for( int i = 0; i != T; ++i ) {
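
One way the linear search and insert might be written is sketched below. The values chosen for N and T, the counter wordFreqCount, and the function name addWord are assumptions for this example. The loop runs only over the slots already in use rather than all T slots, and it uses strcmp and strcpy as mentioned in the question.

```c
#include <assert.h>
#include <string.h>

#define N 32    /* assumed maximum word length, including the '\0' */
#define T 1000  /* assumed maximum number of unique words */

typedef struct WordFreq {
    char  word[ N ];
    int   freq;
} WordFreq;

WordFreq wordFreqTable[ T ];
int wordFreqCount = 0;      /* hypothetical: number of slots in use */

/* Linear search: increment the count if word is already in the table,
   otherwise append it with a count of 1. Returns the new frequency, or
   -1 if the word is too long or the table is full. */
int addWord(const char *word)
{
    assert(word != NULL);

    for (int i = 0; i != wordFreqCount; ++i) {
        if (strcmp(wordFreqTable[i].word, word) == 0)
            return ++wordFreqTable[i].freq;
    }

    if (wordFreqCount == T || strlen(word) >= N)
        return -1;

    strcpy(wordFreqTable[wordFreqCount].word, word);   /* fits: checked above */
    wordFreqTable[wordFreqCount].freq = 1;
    ++wordFreqCount;
    return 1;
}
```

After the whole file has been read, looping from 0 to wordFreqCount and printing word and freq for each slot would give output such as "the 2".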

I haven't learnt Hash tables in class yet so I don't think my teacher will allow me to use it. — TheAngryBr1t, Apr 04 '13 at 23:46
Thank you for that update, I will try to implement and get back to you and let you know if I have succeeded or failed xD — TheAngryBr1t, Apr 05 '13 at 00:11
@TheAngryBr1t: How did it go? Hope you were able to solve the problem. — Arun, Apr 07 '13 at 17:51

score 0 · Answer 2 · answered Apr 05 '13 at 10:54

A simple sample (it still needs error checking and freeing of memory; for sorting, use qsort; etc.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define BUFFSIZE 1024

typedef struct _wc {
    char *word;
    int count;
} WordCounter;

WordCounter *WordCounters = NULL;
int WordCounters_size = 0;

void WordCount(char *word){
    static int size = 0;
    WordCounter *p=NULL;
    int i;

    if(NULL==WordCounters){
        size = 4;
        WordCounters = (WordCounter*)calloc(size, sizeof(WordCounter));
    }
    for(i=0;i<WordCounters_size;++i){
        if(0==strcmp(WordCounters[i].word, word)){
            p=WordCounters + i;
            break;
        }
    }
    if(p){
        p->count += 1;
    } else {
        if(WordCounters_size == size){
            size += 4;
            WordCounters = (WordCounter*)realloc(WordCounters, sizeof(WordCounter)*size);
        }
        if(WordCounters_size < size){
            p = WordCounters + WordCounters_size++;
            p->word = strdup(word);
            p->count = 1;
        }
    }
}

int main(void){
    char buff[BUFFSIZE];
    char *wordp;
    int i;

    while(fgets(buff, BUFFSIZE, stdin)){
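        /* strlwr() is non-standard, so lowercase the line portably */
        for(char *c=buff; *c; ++c) *c = (char)tolower((unsigned char)*c);
        for(wordp=buff; NULL!=(wordp=strtok(wordp, ".,!?\"'#$%&()=@ \t\n\\;:[]/*-+<>"));wordp=NULL){
            /* count only tokens that start with a letter */
            if(isalpha((unsigned char)*wordp)){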
                WordCount(wordp);
            }
        }
    }
    for(i=0;i<WordCounters_size;++i){
        printf("%s:%d\n", WordCounters[i].word, WordCounters[i].count);
    }

    return 0;
}

demo

>WordCount.exe
The rain in Spain falls mainly in the plane
^Z
the:2
rain:1
in:2
spain:1
falls:1
mainly:1
plane:1
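
The answer leaves sorting with qsort as an exercise. A minimal sketch of how that might look is below, assuming a struct with the same two fields as WordCounter; the names compare_counters and sort_counters are hypothetical, and the ordering (highest count first, ties alphabetical) is just one reasonable choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *word;
    int count;
} WordCounter;              /* same fields as in the answer above */

/* qsort comparator: higher counts first, ties broken alphabetically. */
int compare_counters(const void *a, const void *b)
{
    const WordCounter *x = a;
    const WordCounter *y = b;

    if (x->count != y->count)
        return (x->count < y->count) - (x->count > y->count);
    return strcmp(x->word, y->word);
}

/* Hypothetical helper: sorts the first n entries of table in place. */
void sort_counters(WordCounter *table, size_t n)
{
    assert(table != NULL);
    qsort(table, n, sizeof *table, compare_counters);
}
```

In the answer's main, this could be called as sort_counters(WordCounters, WordCounters_size) just before the printing loop, provided at least one word has been counted.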
