I have an assignment where I have to find the frequency of each character in a text file. The problem is that my first language is Spanish, so the text file has accented characters like "á", and I have to count "á" as "a". My code is:

#include <stdio.h>

int main(){
    int c;
    FILE *file;
    file = fopen("prueba.txt", "r");
    int letters[27] = {0}; 
    if (file){
        while ((c=getc(file)) !=EOF  )
        {
            if( ((c-65) >=0 && (c-65) <= 25)){
                letters[c-65]++;
            }
            else if( (c-97) >=0 && (c-97) <= 25){
                letters[c-97]++;
            }
            else if( c ==181 || c== 160){ //a
                letters[0]++;
            }
            else if( c == 130 || c== 144){//e 
                letters[4]++;
            }
            else if(c ==161 || c==214){//i
                letters[8]++;
            }
            else if(c == 162 || c ==224){//o
                letters[14]++;
            }
            else if(c ==163 || c == 233){//u
                letters[20]++;
            }
            else if( c==164 || c== 165){//ñ
                letters[26]++;
            }
        }
        fclose(file);
    }
}

But I found that my code reads "á" as a multibyte character, so c takes the three values 195, 161, 10 rather than 160. What can I do?

Mari
  • Please don't use [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming))! If by e.g. `65` you mean the ASCII encoded value for `'A'` then it's better to explicitly say `'A'` (even though what you're doing is not portable anyway). – Some programmer dude Sep 09 '21 at 06:05
  • Also note that [ASCII](https://en.wikipedia.org/wiki/ASCII) is really a *seven* bit encoding, and "extended" characters (with values above `127`) will depend on the OS and its settings. – Some programmer dude Sep 09 '21 at 06:07
  • The `á` character is [encoded as UTF-8 in two bytes](https://en.wikipedia.org/wiki/UTF-8#Encoding), and has the value 225. The third byte is just a newline character. It's easy enough to convert UTF-8 to a decimal number, but I have no idea how you're supposed to find all of [the unicode code points](https://en.wikipedia.org/wiki/List_of_Unicode_characters) that can be used for variants of `a`. – user3386109 Sep 09 '21 at 06:21
  • Maybe use a Unicode library to convert the text to NFD form, and only look at base characters and ignore combining ones? – Shawn Sep 09 '21 at 06:43
  • Does this answer your question? [How to Read/Write UTF8 text files in C?](https://stackoverflow.com/questions/21737906/how-to-read-write-utf8-text-files-in-c) – JosefZ Sep 09 '21 at 16:21

1 Answer

what can I do?

Commenters have already noted that your text file is encoded in UTF-8 (not extended ASCII) and linked to how to read multibyte characters. To sum up the occurrences of each letter regardless of diacritics, we can take advantage of a locale in which all variants of a letter share the same collating position, e.g. a Spanish locale. Since you say Spanish is your first language, that may already be your environment, but you can also explicitly request es_ES.UTF-8 or similar. This way we can identify which letters belong together without the tedious task of searching through code tables. Here's an accordingly modified version of your program:

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#include <assert.h>
#include <string.h>
int main()
{
    // "": locale from environment; "es_ES.UTF-8": specific locale
    if (!setlocale(LC_ALL, "")) return 1;
    FILE *file = fopen("prueba.txt", "r");
    if (!file) return perror("prueba.txt"), 1;
    // the letters you want to count:
    wchar_t alphabet[] = L"abcdefghijklmnopqrstuvwxyzñ";
    size_t n = sizeof alphabet / sizeof *alphabet;
    int letters[n]; // the letter counters
    memset(letters, 0, sizeof letters);
    wchar_t collate[n];     // representatives for the letters
    // call wcsxfrm outside assert so it still runs when NDEBUG is defined
    size_t len = wcsxfrm(collate, alphabet, n);
    assert(len < n);  // to be sure
    wint_t c;
    while (c = towlower(getwc(file)), c != WEOF)
    {
        wchar_t s[2];  // representative of the current character
        wcsxfrm(s, (wchar_t [2]){c}, 2);
        // find letter, otherwise last element
        ++letters[wcscspn(collate, s)];
    }
    fclose(file);
    long t = 0;
    for (size_t i = 0; i < n; ++i)    // print the counters
        t += letters[i],
        printf("%lc: %d\n", (wint_t)alphabet[i], letters[i]);
    printf("total: %ld\n", t);
}

It prints the count for each letter, then the count of all other characters (kept in the last counter), and finally the total. Note that the total is smaller than the file size in bytes if the file contains multibyte characters.

Armali