
I'm looking for a hash/checksum algorithm that is as fast as possible while still being able to detect changes to a 4096-byte memory page. Because the size is fixed, I hope it's possible to use an algorithm optimized exactly for this purpose, since the size is guaranteed not to change.

What I'm doing is checksumming some memory pages, doing an operation, then checksumming the pages again to see if they've changed. For space reasons it is not feasible to simply compare bytewise against a copy of the old bytes. I don't need to know where within the page the change occurred, just whether a change happened, so comparing the checksums is enough.

I've tried CRC32 in hardware and xxHash, both with good results. Still, as far as I know, neither is tailored to a fixed-size buffer.

Edit: Here is the code I'm using for CRC32 in hardware. For some reason it's slower than xxHash.

#include <nmmintrin.h> // SSE4.2 intrinsic _mm_crc32_u32
#include <stdint.h>

// Warning! No padding, so don't use if length isn't divisible by sizeof(uint32_t).
uint32_t sse42_crc32_32bit(const uint32_t* buffer, const uint32_t length)
{
    uint32_t crc = 0;
    const uint32_t numRounds = length / sizeof(uint32_t);

    for (uint32_t i = 0; i < numRounds; ++i)
    {
        crc = _mm_crc32_u32(crc, buffer[i]);
    }

    return crc;
}
Mikubyte
  • Why would a hash be tailored to a particular size? That's a really unusual requirement. Why not benchmark the ones you can use and see which performs best under those circumstances. – tadman Jul 06 '17 at 00:07
  • @tadman I've already done that. I was just wondering if it's generally a better idea to tailor the hash function towards a fixed size for speed gains if you know the input will always be of this size. – Mikubyte Jul 06 '17 at 00:09
  • Unless you're talking about really tiny values, like 64 bytes or less, you really don't have a lot of options here to optimize. For tiny values you can unroll the loops, but for 4096 bytes that would make a pretty huge function that's going to shove a lot of code out of the CPU cache, hurting performance. – tadman Jul 06 '17 at 00:10
  • @tadman I don't know much about the CPU cache, but if I used a dedicated core to do the checksumming, would that help with code being evicted from the cache? I'd assume the checksumming code would stay cache-hot in that case? – Mikubyte Jul 06 '17 at 00:15
  • The more you unroll the bigger your code will get and at some point it will get too big to even fit in the cache, where performance takes a hit. You can't really dedicate a core, your OS will decide where that thread runs, and it can get shuffled around a lot. A smaller function interferes less with other code, but *may* run marginally slower. You should benchmark to find out the exact characteristics of any code you're intending to use. It's a dark art sometimes: each model of CPU has different cache configurations. – tadman Jul 06 '17 at 00:18
  • 1
    Do you expect mostly hits or mostly misses? If you expect the buffer to change a lot then you could employ something really fast (like a straight sum) followed by something slower and more reliable to weed out false negatives? – Galik Jul 06 '17 at 00:19
  • @tadman I'm only using a single CPU by design, so I can take one of the free cores by changing the thread's affinity if necessary. – Mikubyte Jul 06 '17 at 00:25
  • @Galik I expect the vast majority of pages to not change. Also not sure what you mean by using something slower to weed out false negatives. I can't do a bytewise comparison since I'll have nothing to compare against. If the page did change, I will only have the new page and not the old. – Mikubyte Jul 06 '17 at 00:26
  • Are you inspecting memory the process doesn't own, or is this testing vs. memory inside of the process? Dirty tracking via a proxy function call might be cheaper here, it really depends on your usage patterns. – tadman Jul 06 '17 at 00:28
  • @tadman So basically I'm comparing the address space of the process I'm in before and after a syscall. I'm using `GetWriteWatch()` on all eligible allocations, and checksum the rest. I can't use a driver, so dirty tracking in the page table is not an option. – Mikubyte Jul 06 '17 at 00:33
  • 2
    Is it imperative you track every change or can you afford to miss a few? – Galik Jul 06 '17 at 00:34
  • If you want this to be really performant you might want to shift this code into the kernel as a module, you can do all sorts of wild and crazy things in that context, but that may be a bit too deep for a prototype. Here's a really dumb idea though: Can you force-`fork` the process where the new one is suspended, do your syscall, then compare the pages between the unforked one and the forked one? With copy-on-write done in the OS the only memory consumed will be on modified pages. – tadman Jul 06 '17 at 00:37
  • @Galik It is extremely important. If I miss one bit of change it's game over. – Mikubyte Jul 06 '17 at 00:39
  • @tadman Sadly kernel drivers are out of the question since the premise of the prototype is to showcase how far it is possible to push a user-mode only implementation. I also doubt forking would yield good performance, but it's an interesting idea which I will consider. How would you discern a modified page from an unmodified one in that case, though? I mean, without checksumming, since that's what I'm assuming you meant with this idea, otherwise there'd be little point. – Mikubyte Jul 06 '17 at 00:41
  • 1
    It's worth a shot, `fork` is pretty cheap all things considered, but your CRC idea might be viable as well. Debuggers can do some pretty crazy things because there's OS hooks for that sort of functionality, so I bet you can go pretty far in pure user-mode code. – tadman Jul 06 '17 at 00:43
  • 1
    CRC32c will detect [all changes with Hamming distance 4 or less up to roughly 2³¹ bits](https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt). Most typical hash functions like xxHash do not – and cannot, by design – give you this guarantee. –  Jul 06 '17 at 00:43
  • On my machine, CRC32c runs at 26 GB/s for 4096-byte blocks, i.e. faster than RAM. Are you sure you need something faster than that? – Fanael Jul 06 '17 at 00:48
  • @Fanael That's the conclusion I ended up with too, but for some reason the crc32 I'm using is really slow. I'll update the question and include the code I'm using. – Mikubyte Jul 06 '17 at 00:52
  • 2
    You are using _mm_crc32_u32 in a slow way, purely serially. See eg [this](http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411) for the intended usage. E: in your case you might as well compute separate CRCs and compare them all, which is easier than merging them and actually has fewer false negatives. – harold Jul 06 '17 at 00:56
  • 1
    @Mikubyte: `_mm_crc32_u64` will give you better bandwidth for obvious reasons; and you should really run three or four `crc32c`s in parallel [by using some linear algebra](https://stackoverflow.com/a/17646775), because while this instruction has a single-cycle throughput, its latency is three cycles. –  Jul 06 '17 at 00:57
  • @Fanael Right. I was just showing the 32-bit version to see if I was doing something incorrectly using the "basic" version. I'm using the 64-bit version by doing a far jump to 64-bit (the process is WOW64), then doing the identical 64-bit version of that code and jumping back. – Mikubyte Jul 06 '17 at 00:59
  • Following on from the fork suggestion from @tadman would it be possible to compare the page allocation tables between the parent and child process to see which pages have been written to and thus allocated a new page? Since the kernel is already tracking what has changed it should be almost free to get the information from it. – Harun Jul 06 '17 at 01:07
  • Ah, I beg your pardon, I see you have tagged the question as Windows. I'm thinking in terms of the posixy process forking model, where you very cheaply get two copies of the process in the same state, sharing the same physical pages until they change. – Harun Jul 06 '17 at 01:12
  • @Fanael I'm not entirely sure what you mean with CRC32's Hamming distance. For 4096 bytes, does it mean it will detect every bit change since 4096 < 2^31? – Mikubyte Jul 06 '17 at 01:12
  • @Harun That's okay. It sounds like you'd need a kernel module for what you're suggesting anyway (which I can't do as mentioned earlier)? – Mikubyte Jul 06 '17 at 01:13
  • Hamming distance is the number of positions in which the symbols of two strings differ. Since we're dealing with binary strings here, it means that CRC32c will detect every change of four or fewer bits. – Fanael Jul 06 '17 at 01:14
  • @Fanael So what you're saying is that since 4096 bytes = 32768 bits > 4 bits it is insufficient to detect changes in a buffer of such size? Or are you saying that if 4 or less bits change in the full 4096 bytes, those changes will be detected, so if a 5th bit changed that wouldn't be detected? – Mikubyte Jul 06 '17 at 01:16
  • 1
    @Mikubyte: you cannot detect changes reliably with 32 bit crc. Look at my answer. If you need 100% reliability, than you mustn't use any kind of hash. Even, if that hash is 32767 bit long, there is a little possibility that it doesn't find a change – geza Jul 06 '17 at 01:18
  • 1
    If four or less bits change, the resulting CRC is guaranteed to be different. If more bits change, there's no such guarantee. –  Jul 06 '17 at 01:18
  • 1
    @Mikubyte: you have 2^32768 different inputs, but 2^32 different outputs. There will be 2^(32768-32) different inputs, which will map to the same hash value. – geza Jul 06 '17 at 01:20
  • 1
    "What I'm doing is checksumming some memory pages, doing an operation, then checksumming the pages again to see if they've changed." Is there any reason you can't use [`GetWriteWatch`](https://msdn.microsoft.com/pl-pl/library/windows/desktop/aa366573(v=vs.85).aspx), or a scheme based on catching writes by using `VirtualProtect` to make pages temporarily read-only? –  Jul 06 '17 at 01:20
  • @Fanael The pages I can't use write watching on are mapped section views, e.g. images. I'm already using write watching on the rest. It must be enabled through VirtualAlloc. I've avoided reallocating mapped views with write watching because the WinAPI still thinks they're views, so it'll use "view functions" on them. – Mikubyte Jul 06 '17 at 01:21
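
Following harold's and Fanael's suggestions above, here is a minimal sketch (my own code, not from the thread) of checksumming a 4096-byte page as three interleaved CRC32C streams with `_mm_crc32_u64`, simply keeping all three values as the page's digest instead of merging them. The names `crc32c_page_3way` and `page_digest` are made up for the example; it assumes a 64-bit build with SSE4.2 enabled.

#include <nmmintrin.h> // SSE4.2: _mm_crc32_u64 (64-bit targets only)
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t crc[3]; } page_digest;

// Checksums one 4096-byte page as three independent CRC32C streams so that
// three crc32 instructions can be in flight at once (the instruction has
// 3-cycle latency but 1-cycle throughput).
static page_digest crc32c_page_3way(const void* page)
{
    const uint64_t* p = (const uint64_t*)page; // 4096 bytes = 512 qwords
    uint64_t c0 = 0, c1 = 0, c2 = 0;
    size_t i = 0;

    for (; i + 3 <= 512; i += 3)
    {
        c0 = _mm_crc32_u64(c0, p[i + 0]);
        c1 = _mm_crc32_u64(c1, p[i + 1]);
        c2 = _mm_crc32_u64(c2, p[i + 2]);
    }
    for (; i < 512; ++i) // 512 % 3 == 2 leftover qwords
        c0 = _mm_crc32_u64(c0, p[i]);

    page_digest d = { { (uint32_t)c0, (uint32_t)c1, (uint32_t)c2 } };
    return d;
}

// Usage: digest the page before the operation, then compare afterwards.
static int page_changed(const void* page, const page_digest* before)
{
    page_digest now = crc32c_page_3way(page);
    return memcmp(&now, before, sizeof now) != 0;
}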
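
And for completeness, a minimal sketch (again my own, with illustrative sizes and names) of the `GetWriteWatch` scheme Fanael mentions: it only works on regions allocated with `MEM_WRITE_WATCH`, after which the OS itself reports which pages were written.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const SIZE_T regionSize = 64 * 4096;

    // Write watching only works on regions allocated with MEM_WRITE_WATCH.
    void* region = VirtualAlloc(NULL, regionSize,
                                MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                PAGE_READWRITE);
    if (!region)
        return 1;

    ((char*)region)[5 * 4096] = 42; // dirty one page

    PVOID dirtyPages[64];
    ULONG_PTR count = 64;
    DWORD granularity = 0;

    // Retrieve the addresses of written pages and reset the watch state.
    if (GetWriteWatch(WRITE_WATCH_FLAG_RESET, region, regionSize,
                      dirtyPages, &count, &granularity) == 0)
    {
        for (ULONG_PTR i = 0; i < count; ++i)
            printf("dirty page at %p\n", dirtyPages[i]);
    }

    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}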

1 Answer


There is farmHash128, which is faster than xxHash64. I'm not sure about its quality, though.

If you use xxHash64, I think that you can unroll it a little bit (for example, 8 times), and it will be a little bit faster. And if the page is not in cache, prefetching may help.
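
For illustration, here is a minimal sketch (mine, not part of the original answer) of the prefetching idea combined with xxHash's one-shot `XXH64()` call; `hash_pages` and its parameters are hypothetical, and it assumes the xxHash library is available as `xxhash.h`.

#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include "xxhash.h"     // XXH64()

enum { PAGE_SIZE = 4096 };

// Hashes `count` pages with the one-shot XXH64() while prefetching the start
// of the next page, so its first cache misses overlap with the current hash.
static void hash_pages(const uint8_t* const* pages, size_t count, uint64_t* out)
{
    for (size_t i = 0; i < count; ++i)
    {
        if (i + 1 < count)
        {
            _mm_prefetch((const char*)pages[i + 1], _MM_HINT_T0);
            _mm_prefetch((const char*)pages[i + 1] + 64, _MM_HINT_T0);
        }
        out[i] = XXH64(pages[i], PAGE_SIZE, 0 /* seed */);
    }
}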

Note, however, this

"If I miss one bit of change it's game over."

is a risky game to play. xxHash64's 64 bits of output are surely inadequate for this; you will eventually run into hash collisions. I'd say you'll need at least a 256-bit hash for this purpose. It won't detect changes with 100% certainty, but close. The usual wisdom is to use a hash size whose collision probability is lower than the probability of system failure (multiplied by 10^-X, where X is a "smallish" positive number, like 5).
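
To put rough numbers on this (my own back-of-the-envelope, assuming an ideal n-bit hash): a single comparison misses a real change with probability about 2^-n. With 64 bits that is 2^-64 ≈ 5·10^-20; checking a million pages a thousand times per second for a year is about 3·10^16 comparisons, for an overall miss probability on the order of 10^-3. With a 256-bit hash the per-comparison probability drops to 2^-256 ≈ 10^-77, far below the failure rate of the hardware itself.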

geza
  • I get what you're saying; I had just assumed for whatever reason that people used CRC32 because it was reliable, but what I'm getting is that there's essentially a risk involved. In that case, how can the data be trusted? – Mikubyte Jul 06 '17 at 01:27
  • @Mikubyte Keep in mind CRC32 is intended to detect tiny things like bit-flip errors, not huge rewrites. For something far more collision resistant you'd need to go with a heavyweight like SHA512. – tadman Jul 06 '17 at 01:29
  • @tadman That's what I need too; I just need to know if 1 bit changed. Doesn't matter which, so if more than one changed that makes no difference. I just need to know whether I should take action or not. – Mikubyte Jul 06 '17 at 01:30
  • 1
    @Mikubyte Unless you keep all the bytes, you cannot 100% guarantee that you will not miss a change. So you have two choices: 1) Keep/compare all the bytes. 2) Decide on an acceptable probability of a miss that is less than 0% and tell us what that probability is. For example, if you can't tolerate a 1 in 2^128 chance of missing a change, then you need to use more than 128 bits. – David Schwartz Jul 06 '17 at 01:30
  • @Mikubyte I get what you're saying, but CRC32 is a really old, quick-and-dirty type system meant for tracking bit-level errors in transmissions or memory corruption issues, but it's also easily confused. Finding collisions is trivial, and often accidental, since the 32-bit space is tiny. Try with SHA512 and CRC32 in parallel for testing to see if one catches stuff the other misses. – tadman Jul 06 '17 at 01:32
  • 1
    @Mikubyte And, by the way, if you really mean that "If I miss one bit of change it's game over" then you can't use computers, since they have a non-zero probability of error due to power fluctuations, cosmic rays, or nuclear decay. You *have* *to* decide what probability of error you can tolerate to decide if the devices and algorithms you are using are suitable. An algorithm with a probability of missing a change comparable to the reliability of your hardware has to be suitable or your hardware isn't. – David Schwartz Jul 06 '17 at 01:33
  • @tadman Right, so what you're saying is that if I had 0xAA and it changed to 0xAB and both happened to have the same checksum I'd still think it was 0xAA. I still don't understand how you can use CRC32 if there's a chance for errors, however. I'm sorry, I think I'm confusing different things here. – Mikubyte Jul 06 '17 at 01:36
  • As geza and David are saying the smaller your hash output the larger the chance of a collision. A 512-bit hash is considerably less likely to collide than a 32-bit one. I think if you test it out you'll see first-hand how bad it can be on CRC32. – tadman Jul 06 '17 at 01:40
  • @tadman I'm with you on that part, but if it's that bad, then how can we trust network protocols and such (assuming they'd use CRC32 in this case). You'd compare checksums, see everything is "okay", then go ahead and use corrupt data. That sounds dangerous. – Mikubyte Jul 06 '17 at 01:42
  • 2
    We trust protocols whose security model is appropriate to our threat model. We don't know what your threat model is though. Why do you think modern computers are suitable for your application given that they can make mistakes due to power fluctuations, cosmic rays, radioactive decay, manufacturing defects, and so on? What probability of an error can your application tolerate? That's what we need to know. – David Schwartz Jul 06 '17 at 01:44
  • @David Schwartz Imagine the address space. You do a syscall and some bytes change at unknown locations. There is now a chance that any of those bytes may be used as a branching condition at any point in the future. That's the risk I'm up against. – Mikubyte Jul 06 '17 at 01:46
  • Okay, so given the consequences if you get it wrong, what probability of error can you tolerate? If zero, you're out of luck, no perfect technology exists. – David Schwartz Jul 06 '17 at 01:46
  • @Mikubyte It's because the sorts of errors you see in a network transmission are typically single bit-flip errors, they're easy to spot with CRC32, although there is an acceptably non-zero chance they'll go undetected. The more bits you change, the more likely you are to hit a collision. – tadman Jul 06 '17 at 01:48
  • @Mikubyte: errors are rare, and there is only a small probability that a CRC doesn't detect one. But undetected errors do happen. One time, I copied a file through the local network (university) to a 1.44 MB disk. When I copied the file to my computer at home (no errors reported anywhere yet), I couldn't decompress it (CRC error). Then I took the same file home again and compared it to the one I couldn't decompress. There were some bit errors in it. The network protocols or disk CRC (I dunno which caused the error) didn't detect it, but the CRC algorithm in RAR did. It only happened once in my whole life. – geza Jul 06 '17 at 01:48
  • @David Schwartz The smaller, the better. However, it's possible to implement safety mechanisms that will detect when something went wrong. The problem is that when something does happen to go wrong, the whole result must be discarded and it is not possible to deterministically get the same inputs if you were to start over. – Mikubyte Jul 06 '17 at 01:49
  • As you said, there are 2^(32768-32) possible inputs. How would I convert that to a probability of failure? – Mikubyte Jul 06 '17 at 02:02
  • @Mikubyte Although you might get some useful data with a prototype based on hashes or checksums, what you probably need to get 100% accurate data is a virtual machine you can stub into and intercept the memory writes directly. KVM is open-source, so you can probably modify it to get the output you need if such a thing does not already exist. – tadman Jul 06 '17 at 02:02
  • @tadman I would, but my prototype is meant to be an easily accessible, purely user mode implementation. Basically just run the executable, no setup. Another goal is speed, so this far I've avoided the Debug API also and instead hook in-process to avoid the context switch to the debugger. I'm assuming that this latter part will be faster, but for all I know it may be faster to do the debugging bit. I haven't compared. Just seems reasonable to intercept it directly in user space and continue execution instead of going back and forth. – Mikubyte Jul 06 '17 at 02:05
  • Do something really dumb that works (e.g. clone memory), then something smarter but unlikely to fail (SHA512) and then something faster. Compare each version to the others to see if your performance improvements start missing changes the others catch. You need a baseline here that does exactly what you want even if it's unacceptably slow. – tadman Jul 06 '17 at 02:07
  • Would it make sense to checksum at an even smaller granularity than page size? For example split the page into 16 blocks of 256 bytes, then checksum those? To reduce the failure rate, I mean. – Mikubyte Jul 06 '17 at 02:24
  • @Mikubyte Decide what probability of failure you can tolerate and a scheme can be designed around it. Smaller granularity means more opportunities to fail and bloating of checksum size. – David Schwartz Jul 06 '17 at 03:27
  • @DavidSchwartz Wouldn't it be the other way around? If you give 2^(256-32) bits input and get 32 bits output that should significantly reduce the size, shouldn't it? I meant storing all 16 checksums per page, by the way, not merging them or anything. – Mikubyte Jul 06 '17 at 14:58
  • The most important thing here is **collecting data**. We can debate hypothetical solutions until the heat death of the universe, but as soon as you have a prototype you can get data from and explore options I think the better solutions will become immediately obvious. – tadman Jul 06 '17 at 16:14
  • @Mikubyte Seems I can't edit my comment, but I meant 2^(2048-32). – Mikubyte Jul 06 '17 at 16:20
  • @Mikubyte That's a disaster all around. You have 16 times more checksum storage. Your odds of a false negative in a comparison are still 1 in 2^32 per comparison. And you do 16 times as many comparisons (so more opportunity for collisions). If, instead, you used a single 512-bit hash, you would need the same amount of storage space, but your odds of a false negative would be 1 in 2^512 for each comparison (pretty much it would never happen), and you'd be doing only 1/16th as many comparisons (so far fewer chances for it to happen). – David Schwartz Jul 06 '17 at 17:52
  • @Mikubyte Even in well-designed 32-bit hashes, collisions will happen all the time. No collision has ever been found in any well-designed 512-bit hash. (And there are many of them that are widely used.) There are even well-studied 160-bit hashes for which no collisions are known and if you found one, you'd be a minor celebrity. – David Schwartz Jul 06 '17 at 17:53