
We have a bunch of CRC32 hashes whose inputs it would be really nice to know. Some of them are short enough that brute-forcing them is feasible; others are not. The CRC32 algorithm being used is the same as the one in Python's binascii (its implementation is spelled out at https://rosettacode.org/wiki/CRC-32#Python).
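For reference, a minimal bit-at-a-time version in Python (my own restatement of the reflected CRC-32 with polynomial 0xEDB88320 and an init/final XOR of 0xFFFFFFFF; it should agree with binascii.crc32):

import binascii

def crc32_ref(data: bytes, crc: int = 0) -> int:
    # reflected CRC-32: init/final XOR 0xFFFFFFFF, polynomial 0xEDB88320
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

assert crc32_ref(b"turn_around") == binascii.crc32(b"turn_around")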

I've read these pages:

...and it seems to me that there's something hidden somewhere that can reduce the amount of effort needed to reverse these things that I just can't puzzle out.

The main reason I think we can do better than full brute force is that we know two things about the hash inputs:

  1. We know the length of every input. I believe this means that as soon as we find a guess of the correct length whose reversed CRC value comes out to 0, that guess must be correct (this might not help much, though). Maybe the algorithm gives the length some other property that could cut down on effort that I don't see.
  2. We also know that every input has a restricted character set of [A-Za-z_0-9] (only letters, numbers, and underscores). In addition, numbers are rare and A-Z never seem to be mixed with a-z, so in roughly 90% of cases we can get away with just [a-z_].
    • Also, the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around"), so we can also filter out any guesses containing impossible letter combinations like "qxp"; a rough filter along these lines is sketched right after this list.
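To make that concrete, a rough candidate filter might look like this (the charset regex follows point 2 above; the trigram blacklist is only an illustrative guess, not a vetted list):

import re

CHARSET = re.compile(r"^[a-z_]+$")      # the reduced [a-z_] charset from point 2
BAD_TRIGRAMS = ("qxp", "zzq", "wvx")    # illustrative only; would be tuned on real data

def plausible(guess: str) -> bool:
    # reject anything outside the reduced charset or containing unlikely trigrams
    if not CHARSET.match(guess):
        return False
    return not any(t in guess for t in BAD_TRIGRAMS)

print(plausible("turn_around"), plausible("tqxpz"))   # True False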

Since the above links discuss how you can append any 4 chars to an input and leave the hash unchanged, one idea I had is to brute force the last 4 chars (most of which are invalid because they're not in the restricted charset), filter out the ones that are clearly not right (because of illegal 3-letter English combos and such), and come up with a "short" list of potentially valid last-4-char endings. Then we repeat until the whole thing has been whittled down, which should be faster than pure brute force. But I can't find the leap in logic to make this work.
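To sketch what I mean by working backwards from the end (assuming I have the register bookkeeping right; the 0xFFFFFFFF init/final XOR is easy to mix up), the internal CRC register can be rolled back over a guessed suffix like this:

import binascii

POLY = 0xEDB88320

def roll_back(register: int, suffix: bytes) -> int:
    # undo the CRC register updates for the given trailing bytes, last byte first
    for byte in reversed(suffix):
        for _ in range(8):
            if register & 0x80000000:
                register = (((register ^ POLY) << 1) | 1) & 0xFFFFFFFF
            else:
                register = (register << 1) & 0xFFFFFFFF
        register ^= byte
    return register

target = binascii.crc32(b"turn_around")
register = roll_back(target ^ 0xFFFFFFFF, b"turn_around")   # reported CRC -> raw register, then roll back
# prints 0xffffffff: back at the initial register, i.e. the "reversed CRC of 0" check from point 1
print(hex(register))

The hope is that the charset and trigram filters above would do the pruning at each 4-char step.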

If I'm going off in the wrong direction to solve this, that would be good to know as well.

CristiFati
Toomai
    This question is a much better fit for [security.se] than Stack Overflow. – Charles Duffy Aug 06 '19 at 23:08
  • ...that said, this *does* indeed smell like a place where some clever optimizations would do a lot of good. How long are your input lengths in practice? My mind goes to meet-in-the-middle attacks pretty quickly, but I'm not a cryptanalyst, nor do I play one on TV. – Charles Duffy Aug 06 '19 at 23:18
  • There is no such thing as a "potentially valid last 4 chars". ANY final 4 chars could give rise to any particular CRC value, given 4 freely changeable bytes earlier on. – jasonharper Aug 06 '19 at 23:47
  • @CharlesDuffy: Took a sample of values and their lengths are: min 3, max 44, average 18, stdev 7.5 – Toomai Aug 07 '19 at 11:04
  • @jasonharper: Isn't that assuming you have the entire ASCII space to work with? (the premise being I don't) – Toomai Aug 07 '19 at 11:09
  • Yes, you'd need more changeable locations with a limited character set - 6 or 7 perhaps in your case. My point was that the final four locations have no special meaning. The whole point of CRC (or any other hash function) is that *every* input bit has equal weight in the output. – jasonharper Aug 07 '19 at 13:17
  • ***CRC32* returns a 32-bit integer**. Your ***#2.*** bullet is incorrect. *"0x414fa339"* is the *hex* representation of *1095738169* for example. There will be no letters beyond ***F***. Check https://stackoverflow.com/questions/54532010/zipfile-badzipfile-bad-crc-32-when-extracting-a-password-protected-zip-zip/55063500#55063500 for a related weakness. – CristiFati Aug 12 '19 at 12:10
  • @CristiFati: One of us made a mistake if that's how you read the point; it's talking about the valid input space, not the hex of the final output. – Toomai Aug 13 '19 at 10:34
  • @Toomai: after reading the question again, I figured out what you meant. I corrected the paragraph, so there's no room for interpretation. – CristiFati Aug 13 '19 at 10:41

2 Answers


the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around") - these could be brute-forced using a dictionary (e.g. from this question) and standard utilities. Assuming the total number of English words is at most 1M, trying (1M)^2 CRC32 candidates looks feasible and quite fast.

Given a text file with all dictionary words, enumerating every word_word combination and comparing against the CRC hashes could be done with e.g. the Hashcat tool, following the instructions from here, as:

hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt dictionary.txt dictionary.txt

and just testing against each single word in the dictionary as:

hashcat64.bin -m 11500 -a 0 hashes.txt dictionary.txt

For phrases longer than 2 words, each phrase length would be an individual case, as e.g. Hashcat has no option to combine 3 or more dictionaries (ref). For 3-word phrases you need to generate a file with 2-word combinations first (e.g. as here, but in the form {}_{}), then combine it with the 1-word dictionary: hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt dictionary.txt . Next, 4-word phrases could be brute forced as: hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt two_words_dictionary.txt , etc. (Another option would be to pipe combinations to Hashcat as combine_scrip ... | hashcat64.bin -m 11500 -a 0 hashes.txt , but as CRC32 is very fast to check, the pipe would be the bottleneck here; using dictionary files is much faster than piping.)

Of course, combining n words increases the complexity exponentially in n (with a huge base). But as the dictionary is limited to some subset instead of all English words, how deep it is practical to go with brute force depends on the size of that dictionary.
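For completeness, a rough pure-Python equivalent of the 2-word pass (assuming a words.txt file with one lowercase word per line; Hashcat will be far faster in practice, this just shows the idea):

import binascii

targets = {binascii.crc32(b"turn_around")}   # stand-in for the real list of target hashes
words = [w.strip() for w in open("words.txt") if w.strip()]

for w1 in words:
    prefix = binascii.crc32((w1 + "_").encode())   # CRC state after "word1_"
    for w2 in words:
        # continue the CRC from the prefix state instead of re-hashing the whole phrase
        if binascii.crc32(w2.encode(), prefix) in targets:
            print(w1 + "_" + w2)

Reusing the prefix CRC state this way avoids re-hashing the first word on every inner iteration, which matters once the dictionary gets large.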

Renat
  • Hm this has potential. It suggests I was thinking about the problem in the wrong direction, trying to go by-char instead of by-word. Can this be extended to an arbitrary number of words per target, instead of just two? (would not be using the entire English dictionary, but a targeted subset) – Toomai Aug 08 '19 at 00:05
  • @Toomai, I've updated my answer, btw it's good that dictionary is scoped, cause this allows for brute force for some n-words phrases. – Renat Aug 08 '19 at 08:15
  • Giving this answer the bounty because it should be more immediately useful than the other one. – Toomai Aug 13 '19 at 10:36

Let me take another direction on the question and talk about the nature of CRC. As you know, a Cyclic Redundancy Check is calculated by dividing the message (considered as a polynomial over GF(2)) by a fixed generator polynomial, and it is linear by nature (a concept borrowed from coding theory). Let me explain what I mean by linear. Assume we have three messages A, B, C of equal length. Then if we have

CRC(A) = CRC(B) 

then we can say

CRC(A^C)=CRC(B^C)

(meaning that the CRC changes predictably under XOR). Note that CRC is not a cryptographic hash and its behaviour can be predicted, so you don't need a complicated tool like hashcat when your space is too big to enumerate.
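A quick way to check this with Python's binascii (equal-length messages; because the initial value and final XOR only add a length-dependent constant, the XOR of three CRCs collapses to the CRC of the all-zero message):

import binascii

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

crc = binascii.crc32
A, B, C = b"weight", b"wealth", b"turn_a"   # any messages of equal length

# affine property: crc(A) ^ crc(B) ^ crc(A ^ B) equals the CRC of the all-zero message
assert crc(A) ^ crc(B) ^ crc(xor_bytes(A, B)) == crc(b"\x00" * len(A))

# the length constant cancels pairwise, so XOR-ing the same C into both messages is safe
assert crc(xor_bytes(A, C)) ^ crc(xor_bytes(B, C)) == crc(A) ^ crc(B)

In particular, if CRC(A) = CRC(B), the second identity gives CRC(A^C) = CRC(B^C).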

So theoretically you can describe the affine space of all x with CRC(x) = b by setting

x = b0 + nullspace.

where b0 is any particular string that satisfies

CRC(b0) = expectedCRC.

(Another note: because these implementations usually include an initial value and a final XOR, CRC(0) != 0, which is why the solution set is affine rather than strictly linear.)

Then you can re-order the nullspace basis so that the changes are localized. And then, knowing that your space contains only ASCII characters and constrained character sequences, you can search it much more easily.

Knowing that the CRC space is pow(2,32) while your possible inputs number about pow(10,12), I would say there are many texts that map to the same CRC (about pow(10,12)/pow(2,32), i.e. a couple of hundred candidates per hash value).

mnz