8

How can I make a checksum of a file using C? I don't want to use any third-party library, just the standard C language. Speed is also very important (the files are smaller than 50 MB, but anyway).

Thanks

Checksummmmm
  • Is there a particular checksum/hash algorithm you're interested in? – Michael Burr Aug 12 '10 at 00:58
  • The "fast and simple one", if there is any. I just want to set a bool to "true" if the test is OK – Checksummmmm Aug 12 '10 at 01:01
  • Any checksum is much faster than the disk I/O, so that doesn't really matter. You need to decide what you want here. If you want a cryptographic hash, that's a bit different than CRC32 or Murmur. – Steven Sudit Aug 12 '10 at 01:37

5 Answers

17

I would suggest starting with the simple one and then only worrying about introducing the fast requirement if it turns out to be an issue.

Far too much time is wasted on solving problems that do not exist (see YAGNI).

By simple, I mean starting a checksum character at zero (all characters here are unsigned), reading in every character, and subtracting it from the checksum character until the end of the file is reached. Unsigned arithmetic in C wraps around, so the result always fits in one character.

Something like in the following program:

#include <stdio.h>

unsigned char checksum (unsigned char *ptr, size_t sz) {
    unsigned char chk = 0;
    while (sz-- != 0)
        chk -= *ptr++;
    return chk;
}

int main(int argc, char* argv[])
{
    unsigned char x[] = "Hello_";
    unsigned char y = checksum (x, 5);
    printf ("Checksum is 0x%02x\n", y);
    x[5] = y;
    y = checksum (x, 6);
    printf ("Checksum test is 0x%02x\n", y);
    return 0;
}

which outputs:

Checksum is 0x0c
Checksum test is 0x00

That checksum function actually does both jobs. If you pass it a block of data without a checksum on the end, it will give you the checksum. If you pass it a block with the checksum on the end, it will give you zero for a good checksum, or non-zero if the checksum is bad.
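
Since the question is about files rather than in-memory buffers, here is a rough sketch of the same subtractive checksum applied to a file read in blocks. The function name `file_checksum`, the return convention and the 64 KB block size are my own assumptions for illustration, not part of the program above.

```c
#include <stdio.h>

/* Hypothetical helper: subtractive 8-bit checksum over a whole file.
   Returns 0 on success and stores the result in *out;
   returns -1 if the file cannot be opened or a read fails. */
int file_checksum(const char *path, unsigned char *out) {
    unsigned char buf[65536];   /* block size is an arbitrary choice */
    unsigned char chk = 0;
    size_t n, i;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    /* fread returns the number of bytes actually read; 0 ends the loop */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (i = 0; i < n; i++)
            chk -= buf[i];
    }

    if (ferror(fp)) {           /* tell a read error apart from end of file */
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = chk;
    return 0;
}
```

Reading in blocks keeps the number of library calls low, which matters more for speed than the arithmetic itself.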

This is the simplest approach and will detect most random errors. It won't detect cases such as two swapped characters, so if you need stronger error detection, use something like Fletcher or Adler.

The Wikipedia pages for both of those have sample C code you can either use as-is, or analyse and re-code to avoid IP issues if you're concerned.
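
To give a feel for what a position-dependent checksum looks like, here is a minimal, unoptimised sketch of Fletcher-16. It is my own illustration and is not copied from the Wikipedia page; the function name `fletcher16` is just a label I picked.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16: two running sums, each taken modulo 255.
   sum1 depends only on the byte values; sum2 accumulates sum1,
   so it also depends on the order in which the bytes arrive. */
uint16_t fletcher16(const unsigned char *data, size_t len) {
    uint16_t sum1 = 0, sum2 = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255);
        sum2 = (uint16_t)((sum2 + sum1) % 255);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}
```

Because of the second sum, "Hello" and "Holle" produce different results here, whereas the plain one-byte sum treats them as identical. The cost is a second addition and two modulo operations per byte.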

paxdiablo
  • -1 there are much better hash functions that are still simple. http://www.cse.yorku.ca/~oz/hash.html – u0b34a0f6ae Oct 30 '11 at 16:33
  • 2
    @Kaizer, it has nothing to do with simplicity. Those functions in that link you provide are _hash_ functions and their purpose is totally different to checksumming - their intent is to maximise the balance between buckets for key distribution, not simply get an indication of a file "value" for checking (they can be _used_ for that but they provide no benefit in that case). In addition, they all perform more complex operations than simple addition and, to quote the question, "speed is very important". – paxdiablo Oct 30 '11 at 21:21
  • simply adding all characters is the simplest possible checksum, sure, but it does not protect against any swaps like `"Holle_"`. – u0b34a0f6ae Oct 30 '11 at 21:33
  • 2
    @Kaizer, I'm not sure what "swap" you're talking about there but I'm assuming you're meaning swapped characters somewhere in the file. But _any_ checksum (or hash for that matter) is vulnerable to input value errors that cannot be detected. That is their nature since they involve loss of information. You can improve the likelihood of catching some of those problems if you make the output value more dependent on position (such as with djb2) but this introduces extra calculations, slowing down the process. It was the emphasis on speed that led me to concentrate on the simple solution. – paxdiablo Oct 30 '11 at 21:55
  • However, I'm not here to plead my case, you've made your call, all I can do is explain why I think you're mistaken :-) I don't really want to clog up the comments system with more explanation so I'll leave it there. – paxdiablo Oct 30 '11 at 21:55
9
  1. Determine which algorithm you want to use (CRC32 is one example)
  2. Look up the algorithm on Wikipedia or other source
  3. Write code to implement that algorithm
  4. Post questions here if/when the code doesn't correctly implement the algorithm
  5. Profit?
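
As a rough idea of what step 3 might look like for CRC32, here is a bit-by-bit sketch. It assumes the common reflected polynomial 0xEDB88320 (the variant used by zlib and PNG); the function name `crc32_bitwise` is made up for this example, and other CRC-32 variants use different parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
   Short and easy to check against the algorithm description,
   but it processes one bit at a time, so it is not the fast version. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;     /* conventional initial value */
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;       /* conventional final inversion */
}
```

A handy way to check step 4 is the standard test input "123456789", which should give 0xCBF43926 for this variant.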
Paul Tomblin
2

Simple and fast

FILE *fp = fopen("yourfile","rb");
unsigned char checksum = 0;
while (!feof(fp) && !ferror(fp)) {
   checksum ^= fgetc(fp);
}

fclose(fp);
sizzzzlerz
  • Sooo wrong. First: [**Why is “while( !feof(file) )” always wrong?**](https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong) Second: `fgetc()` returns `int`, not `char` because `EOF` is a negative `int` value that can not be represented as a `char`. This code will include an extra `EOF` returned from `fgetc()` and truncated to a `char` value in the "checksum". – Andrew Henle Dec 03 '22 at 01:44
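
For reference, a sketch of the same XOR checksum with the two problems from the comment addressed: the loop tests the result of `fgetc()` instead of `feof()`, and the result is held in an `int` so `EOF` never gets mixed into the checksum. The wrapper name `xor_checksum` and its return convention are assumptions made for this example.

```c
#include <stdio.h>

/* Hypothetical helper: XOR of every byte in the file.
   Returns 0 on success and stores the result in *out;
   returns -1 if the file cannot be opened or a read fails. */
int xor_checksum(const char *path, unsigned char *out) {
    unsigned char checksum = 0;
    int c;                              /* int, so EOF stays distinct from data */
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    while ((c = fgetc(fp)) != EOF)      /* test the read itself, not feof() */
        checksum ^= (unsigned char)c;

    if (ferror(fp)) {                   /* EOF can also mean a read error */
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = checksum;
    return 0;
}
```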
2

Generally, CRC32 with a good polynomial is probably your best choice for a non-cryptographic-hash checksum. See here for some reasons: http://guru.multimedia.cx/crc32-vs-adler32/ Click on the error correcting category on the right-hand side to get a lot more crc-related posts.
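
If you go this route, a table-driven version is the usual way to make CRC32 fast in plain C. The sketch below assumes the common reflected polynomial 0xEDB88320 (whether that counts as the "good polynomial" for your data is a separate question), and the names `crc32_update` and `crc32_file` are hypothetical, chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t crc_table[256];
static int crc_table_ready = 0;

/* Precompute the CRC of every possible byte value once. */
static void crc32_init_table(void) {
    uint32_t n, c;
    int k;

    for (n = 0; n < 256; n++) {
        c = n;
        for (k = 0; k < 8; k++)
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[n] = c;
    }
    crc_table_ready = 1;
}

/* Feed one block. Start with crc = 0 and pass the previous
   result back in for each following block. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len) {
    size_t i;

    if (!crc_table_ready)
        crc32_init_table();
    crc ^= 0xFFFFFFFFu;
    for (i = 0; i < len; i++)
        crc = crc_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}

/* CRC-32 of a whole file, read in blocks.
   Returns 0 on success, -1 on open or read failure. */
int crc32_file(const char *path, uint32_t *out) {
    unsigned char buf[65536];   /* block size is an arbitrary choice */
    uint32_t crc = 0;
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        crc = crc32_update(crc, buf, n);
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = crc;
    return 0;
}
```

The table costs 1 KB and replaces eight shift-and-test steps per byte with a single lookup. The lazy table initialisation here is not thread-safe, so call `crc32_update` once from a single thread first if that matters.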

R.. GitHub STOP HELPING ICE
1

I would recommend using a BSD implementation. For example, http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/cksum/

Brandon Horsley