
I'm trying to write a simple C program that counts how many times each byte value is repeated in a file. We tried the code with .txt files and it works wonders (max size tested: 137MB). But when we tried it with an image (even a small one, 2KB) it returned Segmentation fault: 11.

I've done some research and found some libraries specific to images, but I don't want to resort to them since the code isn't meant only for images, but for virtually any type of file. Is there a way to simply read a file byte by byte regardless of anything else (extension, metadata, etc.)?

This is the code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {

    FILE *f;
    char *file;
    long numTotalBytes = 0;
    int bytesCount[256] = {0}; 

    f = fopen ( argv[1], "rb");
    fseek(f, 0L, SEEK_END);
    numTotalBytes = ftell(f);
    rewind(f);

    file = calloc(1, numTotalBytes);    
    fread(file, numTotalBytes, 1, f);
    fclose(f);

    printf("numTotalBytes: %ld", numTotalBytes); //<- this gives the right output even for images

    unsigned int i;
    for (i=0; i<numTotalBytes; ++i) {
        unsigned char pointer = file[i]; //<- This access fails at file[1099]
        int pointer_int = (int)pointer;
        printf("iteration %i with pointer at %i\n", i, pointer_int); //<- pointer_int is never below 0 or above 255
        //++bytesCount[(int)file[i]];
        ++bytesCount[pointer_int];
    }

    free(file);
}

Some extra info:
- Changing the extension of the image to .txt doesn't work.
- The code returns Segmentation fault exactly at iteration 1099 (the file I'm using is approx. 163KB, so file[i] should accept accesses up to approx. file[163000]).
- For .txt files it works perfectly: it reads the bytes one by one and counts them as expected, regardless of file size.
- I'm on Mac (you never know...)

//EDIT: I have edited the code into a more broken-down and explanatory version, because some of you were telling me things I've already tried.

//EDIT_2: Ok guys, never mind. This version should work on any computer that's not mine. I think the problem is with my terminal when passing arguments, but I just switched OS and it works.

p4x
    the extension really doesn't mean anything... – Ryan Jun 17 '16 at 15:35
  • Try changing `(int)` to `(unsigned)`. You don't want negative indexes. – alk Jun 17 '16 at 15:38
  • `fseek()` and `ftell()` is not a good way to calculate size of files. [FIO19-C. Do not use fseek() and ftell() to compute the size of a regular file - CERT C Coding Standard - CERT Secure Coding Standards](https://www.securecoding.cert.org/confluence/display/c/FIO19-C.+Do+not+use+fseek()+and+ftell()+to+compute+the+size+of+a+regular+file) – MikeCAT Jun 17 '16 at 15:44
  • You are mixing `long` with `int` for your byte count and loops, while `calloc` and `fread` take type `size_t`. Best to use `long` as that is what `ftell` returns. – Weather Vane Jun 17 '16 at 15:50
  • Re the edit, you still have `char *file;` which should be `unsigned char *file;` – Weather Vane Jun 17 '16 at 16:09
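Sketching the alternative that the FIO19-C comment points at: on POSIX systems (including the asker's Mac) you can ask stat() for the file size instead of using fseek()/ftell(). file_size() is a hypothetical helper name, not part of the question's code:

```c
#include <stdio.h>
#include <sys/stat.h>

/* POSIX alternative to the fseek()/ftell() size trick: ask the OS for
   the file size via stat(). Returns -1 if the file cannot be stat'ed. */
long file_size(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}
```

Note that st_size is only meaningful for regular files, so for this byte-counting task reading until EOF (as suggested in the answers below) avoids needing the size at all.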

3 Answers

  • Do check if fopen() and calloc() are successful.
  • The format specifier to print long is %ld, not %lu.
  • (int)file[i] is a bad array index: converting char to int preserves its value, but if char is signed in your environment (and setting), bytes >= 0x80 become negative, so you may use a negative index, causing out-of-range access and invoking undefined behavior.

You should change ++bytesCount[(int)file[i]]; to ++bytesCount[(unsigned char)file[i]]; in order to prevent using negative index.

Also note that ftell() with SEEK_END may not be supported for a binary stream (N1570 7.21.9.2 The fseek function), so it is better to read one byte at a time using fgetc() in order to avoid undefined behavior and to use less memory.
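A minimal sketch of that fgetc() approach (count_bytes() is a hypothetical helper, not from the question's code; the stream should be opened in "rb"):

```c
#include <stdio.h>

/* Count byte frequencies one byte at a time with fgetc() -- no need to
   know the file size up front or to buffer the whole file in memory.
   Returns the total number of bytes read. */
long count_bytes(FILE *f, long counts[256]) {
    long total = 0;
    int c;
    /* fgetc() returns each byte as an unsigned char converted to int,
       so c is always in 0..255 until it returns EOF -- no negative
       indexes, regardless of whether plain char is signed. */
    while ((c = fgetc(f)) != EOF) {
        ++counts[c];
        ++total;
    }
    return total;
}
```

Called as `count_bytes(fopen(argv[1], "rb"), bytesCount)` after the usual NULL checks, this sidesteps both the sizing problem and the signed-char indexing problem at once.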

MikeCAT
  • - Original code has checks for fopen and calloc already, I just put a smaller version in here. - Changed the %lu to %ld (thanks!). - Access to bytesCount works well; I've done a lot of debugging, split the code line into two, with a lot of prints, casts to int, unsigned int, unsigned char, etc. They sometimes print negative numbers if I don't use unsigned (as expected), but the accesses work anyway. It's the access to file[i] that's causing the segmentation fault; the other never accesses anything below 0 or over 255. I can post the broken-down code if it helps. – p4x Jun 17 '16 at 15:40

MikeCAT just beat me to it. A bit more explanation follows, in case it helps.

To fix: change file to unsigned char *file and the increment to ++bytesCount[file[i]];.

Explanation: per this answer, a plain char may be signed or unsigned. In this case, I'm guessing it defaults to signed. That means any value >= 0x80 will become a negative number. Such values are unlikely to appear in your English-language text file, but are very likely to appear in an image! The typecast to (int) keeps negatives negative. Therefore, the code indexes bytesCount with a negative number, leading to the segmentation fault.
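The sign extension can be seen in isolation. char_index() and uchar_index() are hypothetical helpers for demonstration, and the negative result assumes plain char is signed on your platform (it is on x86 and ARM macOS):

```c
#include <stdio.h>

/* What the original code effectively does: casting a plain (signed)
   char straight to int sign-extends, so byte 0x80 becomes -128. */
int char_index(char byte) {
    return (int)byte;
}

/* The fix: go through unsigned char first, so byte 0x80 becomes 128
   and the index always lands in 0..255. */
int uchar_index(char byte) {
    return (int)(unsigned char)byte;
}
```

With a text file every byte is below 0x80 and both versions agree, which is exactly why the bug only showed up on an image.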

cxw
  • `0x80` may also be converted to a negative number. – MikeCAT Jun 17 '16 at 15:41
  • Edited the original post for a better explanation. Its not the bytesCount[index] the one failing but the file[index] access causing segmentation fault. – p4x Jun 17 '16 at 15:56

It might be caused by this line

++bytesCount[(int)file[i]];

bytesCount is an array of 256 ints. If (int)file[i] falls outside 0..255 (with a signed char it can be negative), you are accessing invalid memory, and that can cause a segmentation fault.

Rohan
  • Edited the original post for a better explanation. Its not the bytesCount[index] the one failing but the file[index] access causing segmentation fault. – p4x Jun 17 '16 at 15:56