5

I'm using the following function to compute the CRC32 of a file in a VS2008, .NET 3.5 project:

public UInt32 ComputeHash(System.IO.Stream stream)
{
    unchecked
    {
        const int BUFFER_SIZE = 1024;

        UInt32 crc32Result = 0xFFFFFFFF;
        byte[] buffer = new byte[BUFFER_SIZE];
        int count = stream.Read(buffer, 0, BUFFER_SIZE);

        while (count > 0)
        {
            for (int i = 0; i < count; i++)
            {
                crc32Result = ((crc32Result) >> 8) ^ _crc32Table[(buffer[i]) ^ (crc32Result) & _LOOKUP_TABLE_MAX_INDEX];
            }
            count = stream.Read(buffer, 0, BUFFER_SIZE);
        }

        return ~crc32Result;
    }
}

For the sake of brevity, I have left out the function that builds the lookup table (_crc32Table). The table is an array of UInt32, is built when the class is instantiated, and contains 256 values (256 is also the value of _LOOKUP_TABLE_MAX_INDEX + 1).
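For reference, the conventional CRC-32 (IEEE 802.3) table construction looks like this. This is a sketch, not the poster's actual code; the class and method names here are assumed, and only the polynomial (0xEDB88320, the reflected form) is standard:

```csharp
using System;

// Standard CRC-32 lookup table builder using the reflected
// polynomial 0xEDB88320. This is the conventional construction,
// offered as a sketch since the question omits the original builder.
static class Crc32Table
{
    public static UInt32[] Build()
    {
        var table = new UInt32[256];
        for (UInt32 n = 0; n < 256; n++)
        {
            UInt32 c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }
}
```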

I have run some benchmarks comparing this to the MD5CryptoServiceProvider and SHA1CryptoServiceProvider ComputeHash functions and they are much faster. The MD5 function is over twice as fast and the SHA1 hash is about 35% faster. I was told CRC32 is fast, but that's not what I'm seeing.

Am I mistaken in my assumptions? Is this to be expected or is there a flaw in this algorithm?

raven
    Profile your implementation, find the hot spot and you'll know where the expense is. Anything else is guessing. – Eric Lippert Jun 03 '09 at 17:13
  • 1
    Good advice in general, but this is a CRC32 algorithm. The hot spot is going to be all the bit manipulation. (Yes, I'm guessing, but willing to bet pretty big that I'm right!) The question is, is there an optimized implementation of the CRC32 algorithm that could work faster? – Cheeso Jun 03 '09 at 17:29
  • @Cheeso "Yes, I'm guessing, but willing to bet pretty big that I'm right!": http://stackoverflow.com/questions/888224/what-is-your-longest-held-programming-assumption-that-turned-out-to-be-incorrect/888766#888766 – lothar Jun 03 '09 at 17:35
  • your block size of 1Kb may be wrong -- but without benchmarking you're pissing in the wind. _measure_ then modify – ShuggyCoUk Jun 03 '09 at 19:27
  • 1
    It _could_ be the bit manipulations. It could also be the array bounds checks. It could also be the call to the stream methods. It could be that the benchmark is only testing the FIRST call to this method and therefore is measuring the jitter cost, even though that cost is only imposed once; sometimes jitting the code is more than half the total cost of the first call, particularly methods which the jitter spends time optimizing. It could be a lot of things. I have no idea which it is. Why make a bet and risk guessing wrong when you can simply use a tool to find out the correct answer? – Eric Lippert Jun 03 '09 at 19:53

4 Answers

6

You are comparing your code to built-in functions and asking why they are faster. What you need to do is find the source for the built-in functions. How do they work? See what's different.

Betcha the built-in functions call out to a native library and cheat by not having to run inside the managed memory framework.

Karl
  • That's what I was thinking. It's the penalty of managed code. – raven Jun 03 '09 at 17:03
  • 3
    @raven, not necessarily. Remember, built in functions like this are engineered over years and heavily optimized before they are shipped. Like Eric Lippert suggested in his comment, profile your code. – Rob Jun 03 '09 at 17:17
  • I suspect it would be due to the file not having to be copied from unmanaged memory into managed memory. – Aron May 08 '16 at 18:22
3

Profiling may help determine how much time is taken by the IO call (Read) versus the CRC calculation. Code is often IO-bound. However, as someone who's implemented a pretty fast CRC function in C#, I can give some pointers as to why it would be slower than MD5.

  • You're reading memory one byte at a time. Implement slicing-by-four so you can read four bytes at a time, or perhaps slicing-by-eight so you can read eight bytes at a time (but only if the code actually runs in 64-bit mode - fall back to slicing-by-four in 32-bit mode, which you can detect with IntPtr.Size < 8; note that sizeof(IntPtr) compiles only in unsafe code).
  • You're processing one byte per loop iteration, and thus paying the loop overhead for every byte. Implement slicing-by-N or else consider loop unrolling. (Doing both may be unnecessary.)
  • You're incurring two array bounds-checks per byte. You can use 'unsafe' code to avoid bounds checks. With unsafe code you should also ensure that you align your read pointers, although since you're only accessing .NET arrays you can probably assume they're already aligned to the machine word size. Note that unsafe code is unsafe, so be careful!
  • MD5 was designed to be a very fast algorithm and doesn't have the problems listed above. It reads multiple bytes at a time and processes them in parallel, and is implemented in unmanaged code.
  • This is minor, but your loop construct isn't optimal. Since you know count != 0, a do/while loop that pre-decrements count (i.e. --count) and compares to zero is better than a for loop that compares two variables. With your code, this would save a couple instructions and perhaps a memory read per byte.
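The last point can be sketched as follows. This is a hypothetical restructuring of the question's inner loop, not the answerer's code; a table builder is included so the sketch is self-contained, and the method name is assumed:

```csharp
using System;

// CRC-32 with the inner loop restructured as a do/while that counts
// down, per the suggestion above: count is known to be nonzero on
// entry, so the loop pre-decrements it and compares against zero.
static class Crc32Sketch
{
    static readonly UInt32[] _crc32Table = BuildTable();

    static UInt32[] BuildTable()
    {
        // Conventional reflected CRC-32 table (polynomial 0xEDB88320).
        var table = new UInt32[256];
        for (UInt32 n = 0; n < 256; n++)
        {
            UInt32 c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    public static UInt32 ComputeHash(byte[] buffer, int count)
    {
        UInt32 crc = 0xFFFFFFFF;
        if (count > 0)
        {
            int i = 0;
            do
            {   // one table lookup per byte, counting down instead of up
                crc = (crc >> 8) ^ _crc32Table[(buffer[i++] ^ crc) & 0xFF];
            } while (--count > 0);
        }
        return ~crc;
    }
}
```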

If you implement slicing-by-N, pack all the lookup tables into a single large table so they can all be accessed through the same pointer. You can also use the same table for slicing-by-4 and slicing-by-8. Also be aware that a typical slicing-by-N implementation assumes a particular machine endianness, so you may need a separate version for big-endian machines, which you can check for using BitConverter.IsLittleEndian.
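A slicing-by-four variant with all four tables packed into one array, as described above, might look like the following. This is a sketch under the assumption of a little-endian machine (the common case); the class name and layout are my own, not from the answer:

```csharp
using System;

// Slicing-by-four CRC-32: four lookup tables packed into a single
// 1024-entry array, processing four input bytes per main-loop
// iteration. Little-endian byte packing only; a big-endian machine
// (detectable via BitConverter.IsLittleEndian) would need a variant.
static class Crc32Slice4
{
    static readonly UInt32[] _table = BuildTables();

    static UInt32[] BuildTables()
    {
        var t = new UInt32[4 * 256];
        for (UInt32 n = 0; n < 256; n++)        // slice 0: the classic table
        {
            UInt32 c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            t[n] = c;
        }
        for (int slice = 1; slice < 4; slice++) // slices 1..3 derived from slice 0
            for (int n = 0; n < 256; n++)
            {
                UInt32 prev = t[(slice - 1) * 256 + n];
                t[slice * 256 + n] = (prev >> 8) ^ t[prev & 0xFF];
            }
        return t;
    }

    public static UInt32 ComputeHash(byte[] buf)
    {
        UInt32 crc = 0xFFFFFFFF;
        int i = 0, count = buf.Length;
        while (count >= 4)                      // main loop: 4 bytes at a time
        {
            crc ^= (UInt32)(buf[i] | buf[i + 1] << 8
                          | buf[i + 2] << 16 | buf[i + 3] << 24);
            crc = _table[3 * 256 + (int)(crc & 0xFF)]
                ^ _table[2 * 256 + (int)((crc >> 8) & 0xFF)]
                ^ _table[1 * 256 + (int)((crc >> 16) & 0xFF)]
                ^ _table[(int)(crc >> 24)];
            i += 4; count -= 4;
        }
        while (count-- > 0)                     // tail: one byte at a time
            crc = (crc >> 8) ^ _table[(buf[i++] ^ crc) & 0xFF];
        return ~crc;
    }
}
```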

Adam M.
0

Are you counting the computation of the lookup table in your measurement of CRC throughput? Normally the lookup table is computed once and cached. If you don't cache it, you will be computing it every time you calculate a CRC. Also, if you only measure one CRC, you may have included the table computation cost in the CRC computation cost. Best to measure many iterations of each hash.


Addendum: When I measured, I saw a factor of 2.6x, comparing your CRC32 to the MD5 hash, when the app was compiled with /debug+ and /optimize-. Using /debug- and /optimize+, I saw a factor of 1.6x. The absolute MD5 performance did not vary when I changed the compile flags. With no debug, CRC was still slower, but it was much closer.
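A hypothetical harness illustrating the measurement advice above: build the hasher (and its table) once, run a warm-up pass so JIT cost is paid up front, then time many iterations so one-time costs disappear into the noise. The helper name and shape are my own:

```csharp
using System;
using System.Diagnostics;

// Times `iterations` runs of a hash function over the same buffer.
// The un-timed warm-up call absorbs JIT compilation and any lazy
// initialization, so the returned elapsed time reflects steady state.
static class HashTiming
{
    public static TimeSpan Time(Func<byte[], UInt32> hash,
                                byte[] data, int iterations)
    {
        hash(data);                      // warm-up: JIT + any lazy init
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            hash(data);
        sw.Stop();
        return sw.Elapsed;
    }
}
```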

Cheeso
  • As I said in my question, "The table is an array of UInt32, is built when the class is instantiated...". The table exists before the hashing begins. – raven Jun 03 '09 at 16:49
  • 2
    of course, but do you re-compute the table for every hash? Is it a static table or an instance table? And do you compute many iterations of the hash and average the response time, or just one? If it is an instance variable and you create a new instance of the Crc32 class for every CRC32 calculated, then you pay the init cost every time. If you measure only one hash, then you pay the init cost. No? – Cheeso Jun 03 '09 at 16:51
  • I see what you're saying. The timing doesn't start until after instantiation of the class, so table creation time is not a factor. Even if it were included in the time calculation, it wouldn't affect the results much. I'm hashing quite large files (hundreds of megabytes) and the CRC32 calculation is taking 7-8 seconds as compared to the milliseconds it takes to create the table. Typical example: 699 MB file, CRC32 7.3 s, MD5 3.7 s, SHA1 4.8 s. – raven Jun 03 '09 at 17:13
  • Ah, got it. If it is 7s of runtime per cycle, the init cost will be ~invisible. Did you try /debug- on the compiler options? – Cheeso Jun 03 '09 at 17:27
  • 1
    All /debug- does is stop emitting debug symbol files. It does not have any effect on code generation. It's /optimize+ that changes code generation. – Eric Lippert Jun 03 '09 at 19:56
0

I'm not too familiar with the optimizations that are automatically performed when this code executes, but you have a couple of options if profiling isn't working for you.

I can suggest trying out unsafe code and using pointer arithmetic for the buffer[i] and _crc32Table lookups in case it isn't already being optimized.

The only other place I can see you running into performance issues is with the Stream.Read calls. Have you experimented with different BUFFER_SIZE values?

Using a larger byte buffer and possibly doing some manual loop unrolling could help you out too if they aren't being optimized automatically.
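The buffer-size experiment can be sketched like this. The checksum below is a deliberate stand-in (a plain byte sum, not a CRC); the point is that the Read loop has the same shape as the question's code, so timing it with 1 KB, 8 KB, 64 KB, and 1 MB buffers over the same file shows quickly whether Stream.Read overhead matters. Names are assumptions:

```csharp
using System;
using System.IO;

// Runs the question's Read-loop pattern with a configurable buffer
// size. Stream.Read may return fewer bytes than requested, so the
// loop keeps reading until it returns 0 (end of stream).
static class BufferSizeProbe
{
    public static UInt32 ByteSum(Stream stream, int bufferSize)
    {
        var buffer = new byte[bufferSize];
        UInt32 sum = 0;
        int count;
        while ((count = stream.Read(buffer, 0, bufferSize)) > 0)
            for (int i = 0; i < count; i++)
                sum += buffer[i];       // stand-in for the CRC update
        return sum;
    }
}
```

Whatever the buffer size, the result must be identical; only the elapsed time should change.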

nvuono