1

Here's my task:

Write a program that asks user to enter name of a file. The program opens the file in binary mode and calculates frequency of all characters [0-255] and prints a list of ten most frequent characters and their frequencies.

Here is most of the code I already wrote:

#pragma warning(disable: 4996)

#include <stdio.h>

int count[26];

int main() {
    FILE *f;
    int i;
    char ch;
    char filename[80];

    printf("Enter File name\n");
    gets(filename);
    f = fopen("file.txt", "rb");

    while (!feof(f)) {
        ch = fgetc(f);
        count[ch - 'a']++;
    }
    for (i = 0; i < 26; i++)
        printf("[%c] = %d times\n", 65 + i, count[i]);

    fclose(f);

    return 0;
} 

I'm able to calculate and print the frequency of all characters. How can I print only the ten most frequent?

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 1
    By the way, your usage of `while (!feof(f))` is [wrong](https://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) and it will lead to our-of-range access of the array `count`. You should check if readings are successful **before** using what are "read". – MikeCAT Oct 08 '20 at 14:22
  • 1
    1. create an array of structure that have character and frequency 2. sort the array according to frequency in decending order 3. print first 10 elements of the array – MikeCAT Oct 08 '20 at 14:23
  • 1
    You'll break the array if the data isn't in the range `'a'` to `'z'`. And it may well be that `'\n'` is more frequent than `'j'`. – Weather Vane Oct 08 '20 at 14:32

4 Answers4

3

Your code contains several flaws. First and foremost, while(!feof(file)) is always wrong

Then, never ever use gets(). It's dangerous

Then we have that count cannot take all values, since itś only 26 big. It's declaration should be char count[256]; and there's no reason to declare it outside main. Well, the only reason here would be that then you don't have to initialize it to zero, but that's not a good reason to use a global.

And since this is an obvious homework, I'm not going to give you a complete solution. But I would do something like this:

struct count_entry { // Yeah, bad name but whatever
    unsigned char c;
    int n;
};

struct count_entry count[256];
for(int i=0; i<256; i++) {
    count[i] = (struct count_entry){.c=i, .n=0};
}

FILE *in = fopen(filename, "rb");

int ch; // Yes, should be int

while ((ch = fgetc(in)) != EOF) 
    count[ch].n++;

What I would do then is to sort count with qsort. Read the documentation to find out how. Looks like you could do something like this, but I have not read it carefully: How to sort struct using qsort

And then you just print the first 10 like this:

for(int i=0; i<10; i++) 
    printf("%u %d\n", count[i].c, count[i].n);
klutt
  • 30,332
  • 17
  • 55
  • 95
0

A simple solution can be to sort the frequencies at the end, though I am sure it can be solved more optimally using advanced data structures. I'll provide a simple and naive solution.

Maintain two arrays of size 256 each. One will hold the characters (a, b, c etc.) and another will hold the corresponding frequencies. The character array will be initialised once, while the frequencies are computed (Similar to what you have already done).

In the second phase, you sort the frequency array using some algorithm like Bubble Sort, Insertion sort etc. As you shift frequencies, also shift the corresponding characters so that the mapping remains intact.

Once sorted, you can take the top N frequencies and the character array will tell you the actual characters.

Nikhil Baliga
  • 137
  • 1
  • 1
  • 8
0

There are multiple problems in the code:

  • gets() is a security flaw, don't use this function. Use scanf() or fgets() instead.
  • you read the filename into filename but use "file.txt"
  • you do not test for fopen() failure.
  • while (!feof(f))... is incorrect.
  • you increment count[ch - 65] without testing if the byte is indeed an uppercase letter.

Her is a modified version:

#include <stdio.h>

int main() {
    int count[256] = { 0 };
    char filename[80];
    FILE *f;
    int ch;

    printf("Enter File name\n");
    if (scanf("%79[^\n]", filename) != 1)
        return 1;

    f = fopen(filename, "rb");
    if (f == NULL) {
        fprintf(stderr, "cannot open %s\n", filename);
        return 1;
    }
    while ((ch = fgetc(f)) != EOF) {
        count[ch]++;
    }
    fclose(f);

    for (int n = 0; n < 10; n++) {
        int maxc = 0;
        for (int i = 1; i < 256; i++) {
            if (count[i] > count[maxc])
                maxc = i;
        }
        if (count[maxc] != 0) {
            printf("[%c] = %d times\n", maxc, count[maxc]);
            count[maxc] = 0;
        } else {
            break;
        }
    }
    return 0;
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Perfect, thanks. You already help me but if I could ask 1 final question please. What does this line do? if (count[i] > count[maxc]) maxc = i; – Marcos Sagretti Oct 09 '20 at 07:17
  • The test checks if the count for the current byte value in the loop `count[i]` is larger than the count of the byte currently holding the largest count. If it is larger, I set `maxc = i`, in other words we have a new record. When the loop is finished, I print the greatest count and the corresponding byte, if it is non 0, which would happen if the file is empty. – chqrlie Oct 09 '20 at 19:38
-1

You can traverse the count array 10 times to obtain the maximum(and their positions so that you don't record the same maximum again and again). If your program is slated to run with '10' most frequent characters then in this approach you will have time complexity as O(10*length of count array) which is ultimately constant.

However, if you were to change '10' to 'n' then you might want to consider something like a map that is built on a self balanced binary search tree.

Prasanjit Rath
  • 166
  • 2
  • 13