I have a question that I'm hoping you can help me with.

I'm trying to read chars from a file so that I can perform a frequency analysis on them. I decided the easiest way is to have an array with indices 0-255 and to increment the corresponding index (the decimal value of the char just read) by one every time that char is read. The problem is that only the 7-bit chars seem to be counted. The code is below.

int frequency(FILE *freqfilep)
{    
    printf("frequency function called!\n");

    int start = 1;
    int *frqarray = calloc(256,sizeof(int));
    unsigned char tecken;

    FILE *fp;
    fp = fopen("freqfile.txt","r");

    if (fp == NULL) 
    {
        perror("Error in opening file");
        free(frqarray);
        return 1;   /* bail out: calling fgetc on a NULL stream is undefined */
    }
    do
    {
        tecken = fgetc(fp);

        if (feof(fp))
        {
            start = 0;
        }
        else
        {
            frqarray[(int)tecken] ++;
        }
    }
    while (start != 0);

    printf("a%d\n", frqarray[97]);
    printf("b%d\n", frqarray[98]);
    printf("c%d\n", frqarray[99]);
    printf("1%d\n", frqarray[49]);
    printf("2%d\n", frqarray[50]);
    printf("3%d\n", frqarray[51]);
    printf("å%d\n", frqarray[134]);
    printf("ä%d\n", frqarray[132]);
    printf("ö%d\n", frqarray[148]);

    fclose(fp);
    free(frqarray);

    return 0;
}

The file I'm reading from contains the following chars:

aaa bbb ccc 111 222 333 ååå äää ööö

So the printf calls at the bottom of my code should print:

a3
b3
c3
13
23
33
å3
ä3
ö3

But the result is:

a3
b3
c3
13
23
33
å0
ä0
ö0

So I'm guessing there is some issue with reading the 8-bit characters. I've looked around the forum and found some fairly similar posts where the answer was that I need to use a buffer, like this: fread(&buffer, 256, 1, file); but I'm not sure how to implement it.

Byfjunarn
  • Are you sure those last 3 sets of characters aren't multibyte characters? – dbush Feb 02 '16 at 16:04
  • Take a look [HERE](http://stackoverflow.com/questions/21737906/how-to-read-write-utf8-text-files-in-c) – LPs Feb 02 '16 at 16:24

2 Answers

Those characters are most likely not single byte characters with the high bit set, but multibyte characters.

In UTF-8, these characters are encoded as the following two-byte sequences:

  • å: 0xc3 0xa5 (decimal 195 165)

  • ä: 0xc3 0xa4 (decimal 195 164)

  • ö: 0xc3 0xb6 (decimal 195 182)

Add the following to your code:

printf("195 %d\n", frqarray[195]);
printf("165 %d\n", frqarray[165]);
printf("164 %d\n", frqarray[164]);
printf("182 %d\n", frqarray[182]);

And you'll probably get this output:

195 9
165 3
164 3
182 3
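To see why the counts land on those indices, here is a minimal sketch that tallies raw bytes from an in-memory buffer instead of a file. The names count_bytes and demo_count_bytes and the hard-coded sample are assumptions made for illustration; the sample spells out the UTF-8 bytes of ååå with escapes so it does not depend on how the source file itself is saved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: tally each byte value in buf[0..len) into freq[256]. */
void count_bytes(const unsigned char *buf, size_t len, int freq[256])
{
    assert(buf != NULL && freq != NULL);
    for (size_t i = 0; i < len; i++)
        freq[buf[i]]++;
}

/* Hypothetical demo: "aaa " followed by å three times, as UTF-8 bytes. */
void demo_count_bytes(int freq[256])
{
    static const unsigned char sample[] = "aaa \xc3\xa5\xc3\xa5\xc3\xa5";
    count_bytes(sample, sizeof sample - 1, freq);  /* skip the trailing NUL */
}
```

With a zeroed table, demo_count_bytes leaves 3 in slots 97, 195 and 165, and nothing in slot 134, which matches the behaviour seen in the question.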

EDIT:

If you need to do frequency analysis of characters rather than bytes, use fgetwc to read in the characters instead, after calling setlocale so that the input is decoded according to your locale's encoding. If you expect all characters to be in the Basic Multilingual Plane (Unicode code points U+0000 to U+FFFF), you can create an array of size 65536 and output that. If you're expecting characters beyond that range, you might want to use a different scheme.
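A rough sketch of the fgetwc approach might look like the following. The function name count_wide_chars and the BMP_SIZE macro are made up for this example, and it assumes that wchar_t values are Unicode code points on your platform (true with glibc, but not guaranteed by the C standard).

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

#define BMP_SIZE 65536  /* hypothetical name: slots for U+0000..U+FFFF */

/* Hypothetical helper: count wide characters read from fp into
 * freq[BMP_SIZE]. Returns the number of characters counted, or -1 if the
 * stream reported an error (for example, a byte sequence that is invalid
 * in the current locale). Characters above U+FFFF are read but not counted. */
long count_wide_chars(FILE *fp, unsigned long *freq)
{
    assert(fp != NULL && freq != NULL);
    long counted = 0;
    wint_t wc;
    while ((wc = fgetwc(fp)) != WEOF) {
        if ((unsigned long)wc < BMP_SIZE) {
            freq[wc]++;
            counted++;
        }
    }
    return ferror(fp) ? -1 : counted;
}
```

To use it on a UTF-8 file, you would call setlocale(LC_ALL, "") from <locale.h> once at program start (assuming your environment's locale uses UTF-8) before the first read. The count for å should then show up at index 0xE5, ä at 0xE4 and ö at 0xF6.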

dbush
  • It works perfectly with the UTF-8 codes you gave me. Turns out I was using the wrong codes, as I was using the extended ASCII code table. I'll keep fgetwc in mind if I run into any problems ahead. Thank you dbush! – Byfjunarn Feb 02 '16 at 19:52
  • @Byfjunarn Glad I could help. Feel free to [accept this answer](http://stackoverflow.com/help/accepted-answer) if you found it useful. – dbush Feb 02 '16 at 19:53

You are likely running into an encoding problem, which you could verify by printing out the whole frequency table. Likely you will find that in addition to not recording any appearances of some of the characters you were expecting, it will have recorded appearances of some characters you were not expecting.
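One way to print the whole table without wading through 256 lines is to show only the non-zero slots. This is just a sketch; dump_nonzero is a name invented here, and it assumes a 256-entry int table like frqarray in the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: print every non-zero entry of a 256-slot byte
 * frequency table to out, and return how many entries were printed. */
int dump_nonzero(const int freq[256], FILE *out)
{
    assert(freq != NULL && out != NULL);
    int printed = 0;
    for (int i = 0; i < 256; i++) {
        if (freq[i] != 0) {
            fprintf(out, "%3d (0x%02x): %d\n", i, (unsigned)i, freq[i]);
            printed++;
        }
    }
    return printed;
}
```

Calling dump_nonzero(frqarray, stdout) just before fclose in the question's function would list every byte value that was actually seen, including any unexpected ones above 127.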

This comes down to the fact that C chars and especially unsigned chars are basically representations of bytes, not of "characters" in, say, Unicode's sense of the term. If the file you are reading is encoded in a multibyte encoding (UTF-8 is fairly likely), then your fgetc() will read the individual bytes of that encoding, and will not decode them into code point values. Moreover, it is not certain that the character encoding used internally by your C program is the same as the encoding of the file.

If you want to read character data then you need to decode it correctly. If you don't want to write decoding logic in your program itself, then you must make sure the input file is encoded as your program expects. A transcoder such as iconv may be able to help with that, but you do need to know both the file's current encoding and the encoding to which you want to transform.
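If you do decide to write the decoding logic yourself and the file is known to be UTF-8, the core of it could look something like this. It is a simplified sketch, not a complete validator: the name utf8_decode is invented here, and overlong forms and surrogate code points are deliberately not rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with len
 * bytes available. On success, stores the sequence length in *used and
 * returns the code point. Returns -1 on a truncated or malformed sequence.
 * Simplification: overlong forms and surrogates are not rejected. */
long utf8_decode(const unsigned char *s, size_t len, size_t *used)
{
    assert(s != NULL && used != NULL);
    if (len == 0)
        return -1;

    long cp;
    size_t n;
    if (s[0] < 0x80)                { cp = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; n = 4; }
    else
        return -1;                  /* stray continuation or invalid lead byte */

    if (n > len)
        return -1;                  /* sequence is cut off */

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *used = n;
    return cp;
}
```

You would read the file into a buffer, call this in a loop while advancing by *used each time, and count on the returned code point instead of the raw byte. The bytes 0xC3 0xA5, for instance, decode to 0xE5, the code point of å.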

John Bollinger