140

My understanding is that a hash code and checksum are similar things - a numeric value, computed for a block of data, that is relatively unique.

i.e. The probability of two blocks of data yielding the same numeric hash/checksum value is low enough that it can be ignored for the purposes of the application.

So do we have two words for the same thing, or are there important differences between hash codes and checksums?

Mark Amery
Richard Ev
  • To summarize the answers below: A hash code reduces the input to a small number, in a way that minimizes the chance of collisions. A checksum, on the other hand, reduces the input to a small number, in a way that minimizes the chance of collisions. You can make one sound different from the other by arbitrarily rephrasing that description. – Dan Stahlke Jul 28 '15 at 22:52
  • @DanStahlke - No, that isn't what the answers below say. Yes, they both reduce input to a smaller number. But there are many, many ways to do so, how to choose what algorithm to use? That depends on your goal. To summarize the top two answers: the *goal* of a checksum is "*to detect the most common errors*". Choose an algorithm that yields a different checksum, for whatever errors are "most common" in your scenario. If you are worried about one or two bits being toggled, you can pick an algorithm that *guarantees* detection of that specific error! This is a very specific trade-off. – ToolmakerSteve Mar 01 '18 at 05:07
  • @DanStahlke - on the other hand, *hash code* covers a broad range of possible trade-offs. If we mean a value used in making a hash table, we know that there *will* be collisions, lots of them. This is a very different trade-off (than a checksum). We are trying to reduce collisions *on average*. We don't guarantee anything. There may be some inputs that differ by only one bit, yet yield the same hash. This is perfectly fine, if *on average* we get a good spread of hash values. Yet would be unacceptable for a checksum. – ToolmakerSteve Mar 01 '18 at 05:17
  • [More on what hash values are used for and their focus.](https://security.stackexchange.com/questions/168492/how-does-file-hashing-works) – CGTheLegend Mar 16 '19 at 04:53

13 Answers

90

I would say that a checksum is necessarily a hashcode. However, not all hashcodes make good checksums.

A checksum has a special purpose: it verifies or checks the integrity of data (some go beyond that by allowing for error correction). "Good" checksums are easy to compute, and can detect many types of data corruption (e.g. one, two, or three erroneous bits).

A hashcode simply describes a mathematical function that maps data to some value. When used as a means of indexing in data structures (e.g. a hash table), a low collision probability is desirable.
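
For illustration, here is a minimal Python sketch (the data is made up) using the standard library's CRC-32 as a checksum and the built-in hash() as a hashcode for bucket indexing:

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog"

# As a checksum: CRC-32 is cheap to compute and designed to catch
# common accidental corruptions such as a few flipped bits.
checksum = zlib.crc32(data)

corrupted = bytearray(data)
corrupted[0] ^= 0x01  # flip a single bit
print(zlib.crc32(bytes(corrupted)) != checksum)  # True: corruption detected

# As a hashcode: Python's built-in hash() is meant for indexing into
# hash-based containers (dict, set), where occasional collisions are fine.
bucket = hash(data) % 16  # map into one of 16 toy buckets
print(bucket)
```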

Zach Scrivena
  • Maybe one could be used as the other, but considering that they have different design goals this just confuses the issue. – Wim Coenen Jan 20 '09 at 10:55
  • @gumbo: no, not every hashcode is a checksum. See string example from MSalters below. – MarcH Mar 17 '16 at 16:18
45

There is a different purpose behind each of them:

  • Hash code - designed to be random across its domain (to minimize collisions in hash tables and such). Cryptographic hash codes are also designed to be computationally infeasible to reverse.
  • Checksum - designed to detect the most common errors in the data, and often to be fast to compute (so that fast streams of data can be checksummed effectively).

In practice, the same functions are often good for both purposes. In particular, a cryptographically strong hash code is a good checksum (it is almost impossible for a random error to slip past a strong hash function), if you can afford the computational cost.
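
As a rough sketch of that trade-off (assuming SHA-256 as the cryptographically strong hash, with an invented payload), a published digest can double as a checksum:

```python
import hashlib

payload = b"contents of some downloaded file"

# The publisher computes a digest and publishes it alongside the file.
published_digest = hashlib.sha256(payload).hexdigest()

# The receiver recomputes the digest; any accidental corruption (and
# any deliberate tampering) almost certainly changes it.
received = payload  # imagine these bytes arrived over the network
ok = hashlib.sha256(received).hexdigest() == published_digest
print("integrity verified:", ok)  # True when the bytes match
```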

Rafał Dowgird
  • Also it's good to mention that non-cryptographic versions of hash codes may provide a good tradeoff between computation time (close to CRC) and error detection, whether it's intentional or just communication error/bit rot (CRC cannot be expected to detect intentional tampering because it's relatively easy to intentionally design a collision). – gaborous Mar 04 '15 at 05:34
  • To me, the key phrase in your answer is that a checksum is *designed to detect the most common errors*. Yes, that's it. It is a hash algorithm that has been chosen to yield different values for *likely* corruptions of the data. That is a specific purpose, and leads to specific algorithms, which optimize for that - depending on the types of perturbations one is concerned about. – ToolmakerSteve Mar 01 '18 at 04:59
  • A functional checksum might have lots of collisions across its entire domain, as long as errors you are checking for are unlikely to lead to such collisions. For example, there could be some periodicity in your checksum that you don't care about, because errors never take you there. (A good example is parity, which only protects against single-bit errors.) So the definition of a good checksum inherently depends on the error model. – Jonathan Jeffrey Oct 16 '20 at 14:06
25

There are indeed some differences:

  • Checksums just need to be different when the input is different (as often as possible), but it's almost as important that they're fast to compute.
  • Hash codes (for use in hashtables) have the same requirements, and additionally they should be evenly distributed across the code space, especially for inputs that are similar.
  • Cryptographic hashes have the much more stringent requirement that, given a hash, you cannot construct an input that produces that hash. Computation time comes second, and depending on the application it may even be desirable for the hash to be very slow to compute, in order to combat brute-force attacks; see the sketch below.
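
A small Python sketch of that cost spectrum (the payload, password, and iteration count are illustrative assumptions): a fast CRC-32 checksum versus a deliberately slow key-derivation hash:

```python
import hashlib
import os
import time
import zlib

data = b"some payload" * 10_000

# Fast checksum: fine for catching accidental corruption.
t0 = time.perf_counter()
zlib.crc32(data)
t_crc = time.perf_counter() - t0

# Deliberately slow password hash: the high iteration count is a
# feature, because it makes brute-force guessing expensive.
salt = os.urandom(16)
t0 = time.perf_counter()
hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 600_000)
t_kdf = time.perf_counter() - t0

print(f"crc32: {t_crc:.6f}s, pbkdf2: {t_kdf:.3f}s")
```
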
Michael Borgwardt
  • I don't think checksums being different for different inputs has any benefits. They're just for checking integrity, not for hashing. – user541686 Jul 03 '12 at 22:58
  • @Mehrdad: so how do you propose checking integrity without getting different results for different inputs? – Michael Borgwardt Jul 06 '12 at 07:05
  • Er, maybe I misworded what I said? I was referring to the part where you said "as far as possible" -- I'm just saying there's no reason for them to be unpredictable or "far" like hashes are. As long as there is *some* change in the checksum when the input undergoes a typical change, it's a fine checksum. Contrast that with hashes, which also have the goal of distributing things as evenly/randomly/unpredictably/"far" as possible onto their codomain. – user541686 Jul 06 '12 at 07:27
  • I think you just misinterpreted what I meant with "as far as possible" - I just meant that collisions should be as rare as possible, though of course they're unavoidable. I'll change the wording. – Michael Borgwardt Nov 24 '16 at 09:53
  • @Mehrdad - at first that made no sense to me. If a checksum does *not* have a good distribution over possible checksum values, that means there are some checksum values which are returned for many more input values (than for other checksums). But, that lessens the usefulness of the checksum? [It increases the odds that perturbed data will return the same result, right?] Hmm, I am wrong, you are right: checksum only has to be good at detecting *likely* perturbations. That may not require an even distribution over all values. – ToolmakerSteve Mar 01 '18 at 04:36
  • @ToolmakerSteve: It's been over 5 years, but yeah, I think that's what I was referring to. Checksums aren't meant to protect against adversaries. Even if you can find a 1 KB string that produces the same checksum as a 1 MB string, that's not really a problem for a checksum since the probability of that happening by accident is likely to be practically zero. – user541686 Mar 01 '18 at 05:13
18

Hashcodes and checksums are both used to create short numerical values from a data item. The difference is that a checksum value should change, even if only a small modification is made to the data item. For a hash value, the requirement is merely that real-world data items should have distinct hash values.

A clear example is strings. A checksum for a string should include each and every bit, and order matters. A hashcode, on the other hand, can often be implemented as a checksum of a limited-length prefix. That would mean that "aaaaaaaaaaba" would hash the same as "aaaaaaaaaaab", but hash algorithms can deal with such collisions.
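
A toy Python sketch of such a prefix-based hashcode (the multiplier and the 10-character prefix are arbitrary choices, not any particular library's algorithm):

```python
import zlib

def prefix_hash(s: str, prefix_len: int = 10) -> int:
    # Toy hashcode that only examines a limited-length prefix.
    h = 0
    for ch in s[:prefix_len]:
        h = (h * 31 + ord(ch)) % 2**32
    return h

print(prefix_hash("aaaaaaaaaaba") == prefix_hash("aaaaaaaaaaab"))  # True: collision

# A checksum covers every bit and is order-sensitive, so it differs.
print(zlib.crc32(b"aaaaaaaaaaba") == zlib.crc32(b"aaaaaaaaaaab"))  # False
```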

Matthias Braun
MSalters
  • This answer is the one that rings the bell for me. So data integrity is not the focus of a hash. – daparic Jun 04 '20 at 19:37
9

Although hashing and checksums are similar in that they both create a value based on the contents of a file, hashing is not the same as creating a checksum. A checksum is intended to verify (check) the integrity of data and identify data-transmission errors, while a hash is designed to create a unique digital fingerprint of the data.

Source: CompTIA® Security+ Guide to Network Security Fundamentals, Fifth Edition, Mark Ciampa, page 191

N Randhawa
8

Wikipedia puts it well:

Checksum functions are related to hash functions, fingerprints, randomisation functions, and cryptographic hash functions. However, each of those concepts has different applications and therefore different design goals. Check digits and parity bits are special cases of checksums, appropriate for small blocks of data (such as Social Security numbers, bank account numbers, computer words, single bytes, etc.). Some error-correcting codes are based on special checksums that not only detect common errors but also allow the original data to be recovered in certain cases.
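
For instance, a parity bit is a one-bit checksum; a minimal Python sketch showing that it catches any single-bit error but misses a pair of flips:

```python
def parity_bit(data: bytes) -> int:
    # Even parity: one checksum bit covering every bit of the data.
    return sum(bin(byte).count("1") for byte in data) % 2

msg = bytearray(b"hello")
p = parity_bit(bytes(msg))

msg[0] ^= 0b00000100  # flip one bit
print(parity_bit(bytes(msg)) != p)  # True: single-bit error detected

msg[1] ^= 0b00000100  # flip a second bit
print(parity_bit(bytes(msg)) != p)  # False: the two flips cancel out
```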

Jon Skeet
  • After reading that, I'm still wondering what the difference is. – kirk.burleson Jul 27 '10 at 13:50
  • @kirk.burleson - I would say that they are the same *principle*, but in practice one always makes *tradeoffs*. In different situations, different tradeoffs apply, so different approaches are used. Not really a justification for there being two different words, just saying that if you search for good techniques for checksums, you may find a different set of algorithms than when searching for hash codes. – ToolmakerSteve Mar 01 '18 at 04:45
6

A checksum protects against accidental changes.

A cryptographic hash protects against a very motivated attacker.

When you send bits on the wire, it may accidentally happen that some bits are either flipped, or deleted, or inserted. To allow the receiver to detect (or sometimes correct) accidents like this, the sender uses a checksum.

But if you assume there is someone actively and intelligently modifying the message on the wire and you want to protect against this sort of attacker, then use a cryptographic hash (I am ignoring cryptographically signing the hash, or using a secondary channel or such, since the question does not seem to allude to this).
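
A Python sketch of that distinction (the messages are invented, and CRC-32 stands in for a generic checksum): an attacker who controls the wire simply recomputes the checksum, but cannot match a digest published over a trusted channel:

```python
import hashlib
import zlib

message = b"pay alice 100 dollars"
frame = message + zlib.crc32(message).to_bytes(4, "big")  # checksum rides along

# The attacker rewrites the message *and* recomputes the checksum.
forged = b"pay mallory 100 dollars"
forged_frame = forged + zlib.crc32(forged).to_bytes(4, "big")

body, crc = forged_frame[:-4], forged_frame[-4:]
print(zlib.crc32(body).to_bytes(4, "big") == crc)  # True: CRC check passes

# A SHA-256 digest published over a trusted side channel is different:
# the attacker cannot craft a message that matches it.
trusted_digest = hashlib.sha256(message).hexdigest()
print(hashlib.sha256(forged).hexdigest() == trusted_digest)  # False
```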

user3464863
  • "cryptographic hash" increases the confusion between "hash" and "checksum". "cryptographic checksum" is better because it does not. – MarcH Mar 17 '16 at 16:15
5

The difference between hash code and checksum functions is that they are designed for different purposes.

  • A checksum is used to find out if something in the input has changed.

  • A hash-code is used to find out if something in the input has changed and to have as much "distance" between individual hash-code values as possible.

    Also, there might be further requirements for a hash function, at odds with this rule, like the ability to form trees/clusters/buckets of hash-code values early.

    And if you add some shared initial randomization, you arrive at the concept behind modern encryption and key exchanges.


About Probability:

For example, let's assume that the input data actually always changes (100% of the time), and let's assume you have a "perfect" hash/checksum function that generates a 1-bit hash/checksum value. Then you will get different hash/checksum values 50% of the time for random input data.

  • If exactly 1 bit in your random input data has changed, you will be able to detect that 100% of the time, no matter how large the input data is (this is exactly what a single parity bit guarantees).

  • If 2 bits in your random input data have changed, your probability of detecting "a change" is divided by 2, because both changes could cancel each other out, and no hash/checksum function would detect that 2 bits are actually different in the input data.

    ...

This means that if the number of bits in your input data is many times larger than the number of bits in your hash/checksum value, the probability of actually getting different hash/checksum values for different input values is reduced, and it is not a constant.
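
A small simulation sketch of that effect (all parameters are arbitrary choices): truncating CRC-32 to n bits and flipping two random bits shows the rate of undetected changes shrinking roughly as 2^-n:

```python
import random
import zlib

def tiny_checksum(data: bytes, bits: int) -> int:
    # CRC-32 truncated to an n-bit value (illustrative, not a real design).
    return zlib.crc32(data) & ((1 << bits) - 1)

random.seed(0)
trials = 10_000
for bits in (1, 4, 8, 16):
    undetected = 0
    for _ in range(trials):
        data = bytearray(random.randbytes(32))
        before = tiny_checksum(bytes(data), bits)
        for pos in random.sample(range(len(data) * 8), 2):  # 2 distinct bit flips
            data[pos // 8] ^= 1 << (pos % 8)
        if tiny_checksum(bytes(data), bits) == before:
            undetected += 1
    print(f"{bits:2d}-bit checksum: {undetected / trials:.4f} undetected")
```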

Sascha Wedler
4

These days they are interchangeable, but in days of yore a checksum was a very simple technique where you'd add all the data up (usually in bytes) and tack a byte on the end with that value in it. Then you'd hopefully know if any of the original data had been corrupted. Similar to a check bit, but with bytes.
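
A minimal Python sketch of that old-school technique (the packet contents are invented), including the weakness that reordering the bytes goes unnoticed:

```python
def additive_checksum(data: bytes) -> int:
    # Add all the bytes up and keep the low 8 bits.
    return sum(data) % 256

packet = b"hello world"
framed = packet + bytes([additive_checksum(packet)])  # tack the byte on the end

# The receiver recomputes the sum and compares it with the trailing byte.
body, tail = framed[:-1], framed[-1]
print(additive_checksum(body) == tail)  # True: data arrived intact

# Weakness: reordering bytes leaves the sum (and the checksum) unchanged.
print(additive_checksum(b"dlrow olleh") == tail)  # True: corruption undetected
```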

Steven Robbins
2

I tend to use the word checksum when referring to the code (numeric or otherwise) created for a file or piece of data that can be used to check that the file or data has not been corrupted. The most common usage I come across is to check that files sent across the network have not been altered (deliberately or otherwise).

Ian1971
  • Because checksums are not made to be difficult to reverse, this suggests that they wouldn't be good for checking whether something was deliberately altered. – benblasdell Oct 17 '12 at 18:59
0

In Redis Cluster data sharding, a hash slot decides which node a key goes to. Take for example the modulo operation below:

123 % 9 = 6
122 % 9 = 5
141 % 9 = 6

The 6 comes up twice across differing inputs. The purpose of the hash is simply to map an input value to an output value; uniqueness is not part of the deal. Two different inputs that produce the same output is fine in the world of hashes.
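
A sketch of that slot mapping (Redis Cluster actually uses CRC16 of the key modulo 16384; CRC-32 from the Python standard library stands in here, and the keys are invented):

```python
import zlib

NUM_SLOTS = 16384  # Redis Cluster's fixed number of hash slots

def hash_slot(key: bytes) -> int:
    # Stand-in for Redis's CRC16(key) % 16384: map any key to a slot.
    return zlib.crc32(key) % NUM_SLOTS

for key in (b"user:123", b"user:122", b"user:141"):
    print(key.decode(), "->", hash_slot(key))
```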

A checksum, on the other hand, must produce a different output even if one bit in the input changes, because its purpose is not to map but to detect data corruption. So two different inputs that produce the same output is not acceptable for a checksum.

daparic
-1

Hash code vs Check sum

  • Hash code (SipHash) is usually used for hash-table-based structures (Dictionary, Set, HashMap...) [Swift Dictionary, Set], where basic operations have constant time, O(1)
  • Checksum (MD5, SHA) is used to indicate data integrity. For example, a checksum is calculated when creating a digital signature [About]

The main difference is that a checksum must be (practically) unique, while a hash code can be the same for different objects. For example, in Java or Swift your hash code is limited to an Int. It is usually used in conjunction with an equals function. Two different objects can have the same hash code.

[Java hash code]
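
The same contract appears in Python's __hash__/__eq__; a minimal sketch (the Point class is invented for illustration):

```python
class Point:
    def __init__(self, x: int, y: int):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal objects must have equal hash codes; unequal objects may
        # collide, and hash-based containers resolve that with __eq__.
        return hash((self.x, self.y))

p, q = Point(1, 2), Point(1, 2)
print(p == q, hash(p) == hash(q))  # True True
print(len({p, q}))                 # 1: the set treats them as one element
```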

yoAlex5
-4

A checksum is simply a number generated from the data field by adding it up (by logical addition, hence "sum"). The checksum has the capability to detect corruption of any bit or number of bits within the data field from which it is generated, i.e. it checks for errors, that is all; it cannot correct them. A checksum is a hash, because the size of the checksum is smaller than the original data. Yes, you will have collisions, because the checksum is not at all sensitive to bit position in the data field.

A cyclic redundancy check (CRC) is something quite different, more complex, and is NOT called a checksum. It is the application of a polynomial series which has the capability of correcting any chosen number of individual corrupted bits within the data field from which it was generated. The creation of a CRC results in a number greater in size than the original data field (unlike the checksum), hence the name including the word "redundancy", and the price you pay for the error-correcting capability. A CRC is therefore NOT a hash and must not be confused with or named a checksum, because the redundancy necessarily adds to the size of the original data.

user229044