
I have to calculate CRC32 on a lot of files, including huge ones (several GB). I tried several algorithms found on the web, like DamienG's or this one, and they work, but they are slow (more than a minute). Then I found this benchmark of various CRC32 algorithms and saw that SSE 4.2 has a hardware-accelerated CRC32 instruction.

I couldn't find a C# code sample anywhere that uses the SSE CRC32 instruction.

1 - Is it possible?

2 - How can I detect whether the current CPU supports SSE4.2, so I can switch CRC32 methods accordingly?

(A code sample would be appreciated, if possible.)

John Conde
Eric Bole-Feysot
  • 2
    Have you profiled your code? I seriously doubt your problem is due to a slow CRC algorithm, more likely to be the I/O overhead of reading several GB of data. – James Aug 11 '11 at 14:27
  • @James Indeed, either that or you can utilise `unsafe` code areas with pointer optimisations to get some benefit - though this is just wild conjecture :-) – Adam Houldsworth Aug 11 '11 at 14:28
  • 2
    Calculating the CRC32 of a several-GB file needs reading the whole file ... and processing it sequentially. Try to measure how long simply reading the whole file (and adding the bytes or 32-bit blocks, to avoid that your system optimizes the reading away) takes, compare this with your CRC32 calculation. – Paŭlo Ebermann Aug 11 '11 at 15:48
  • The cryptography tag along with CRC32 is downright frightening, I really hope it's an error (in which case please remove the tag), otherwise you have a whole lot more problem than a slow calculation. – Bruno Rohée Aug 12 '11 at 11:47
  • @Bruno You're right, but the crc32 algorithm fits perfectly in the .net System.Security.Cryptography pattern. That's why I added the tag. Anyway, if it annoys someone else, I could remove it. I use it for files identification. – Eric Bole-Feysot Aug 12 '11 at 11:52
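Following Paŭlo Ebermann's suggestion above, here is a minimal sketch for timing a read-only pass over a file, to compare against the CRC32 pass (the `huge.bin` path is a placeholder; the running sum keeps the reads from being optimized away):

```csharp
using System;
using System.Diagnostics;
using System.IO;

class ReadBaseline
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        uint sink = 0; // accumulate the bytes so the reads can't be optimized away

        using (var fs = new FileStream("huge.bin", FileMode.Open, FileAccess.Read))
        {
            byte[] buffer = new byte[1 << 20]; // 1 MiB chunks
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                    sink += buffer[i];
            }
        }

        sw.Stop();
        Console.WriteLine($"Read-only pass: {sw.ElapsedMilliseconds} ms (sum {sink})");
        // If this time is close to the CRC32 time, the workload is I/O-bound
        // and a faster CRC algorithm won't help much.
    }
}
```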

2 Answers


Nowadays we've been spoiled with the System.Runtime.Intrinsics.X86 namespace, available since .NET Core 3.0. Here's a full implementation of the CRC32-C algorithm using SSE 4.2:

using System;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;

/// <summary>
/// The hardware implementation of the CRC32-C polynomial 
/// implemented on Intel CPUs supporting SSE4.2.
/// </summary>
public class Crc32HardwareAlgorithm : HashAlgorithm
{
    /// <summary>
    /// the current CRC value, bit-flipped
    /// </summary>
    private uint _crc;

    /// <summary>
    /// We can further optimize the algorithm when X64 is available.
    /// </summary>
    private bool _x64Available;

    /// <summary>
    /// Default constructor
    /// </summary>
    public Crc32HardwareAlgorithm()
    {
        if (!Sse42.IsSupported)
        {
            throw new NotSupportedException("SSE4.2 is not supported");
        }

        _x64Available = Sse42.X64.IsSupported;

        // The size, in bits, of the computed hash code.
        this.HashSizeValue = 32;
        this.Reset();
    }

    /// <summary>When overridden in a derived class, routes data written to the object into the hash algorithm for computing the hash.</summary>
    /// <param name="array">The input to compute the hash code for.</param>
    /// <param name="ibStart">The offset into the byte array from which to begin using data.</param>
    /// <param name="cbSize">The number of bytes in the byte array to use as data.</param>
    protected override void HashCore(byte[] array, int ibStart, int cbSize)
    {
        if (_x64Available)
        {
            while (cbSize >= 8)
            {
                _crc = (uint)Sse42.X64.Crc32(_crc, BitConverter.ToUInt64(array, ibStart));
                ibStart += 8;
                cbSize -= 8;
            }
        }

        while (cbSize > 0)
        {
            _crc = Sse42.Crc32(_crc, array[ibStart]);
            ibStart++;
            cbSize--;
        }
    }

    /// <summary>When overridden in a derived class, finalizes the hash computation after the last data is processed by the cryptographic stream object.</summary>
    /// <returns>The computed hash code.</returns>
    protected override byte[] HashFinal()
    {
        uint outputCrcValue = ~_crc;

        return BitConverter.GetBytes(outputCrcValue);
    }

    /// <summary>Initializes an implementation of the <see cref="T:System.Security.Cryptography.HashAlgorithm"></see> class.</summary>
    public override void Initialize()
    {
        this.Reset();
    }

    private void Reset()
    {
        _crc = uint.MaxValue;
    }
}
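A minimal usage sketch for the class above (it assumes an SSE4.2-capable CPU, otherwise the constructor throws; `"123456789"` is the standard CRC-32C check vector):

```csharp
using System;
using System.Text;

class Crc32Demo
{
    static void Main()
    {
        // For large files, pass a FileStream to ComputeHash instead of a byte[];
        // HashAlgorithm reads and hashes the stream in chunks.
        using var crc = new Crc32HardwareAlgorithm();
        byte[] data = Encoding.ASCII.GetBytes("123456789");
        byte[] hash = crc.ComputeHash(data);

        // The well-known CRC-32C check value for "123456789" is 0xE3069283.
        Console.WriteLine($"0x{BitConverter.ToUInt32(hash, 0):X8}"); // prints 0xE3069283
    }
}
```

For files, `crc.ComputeHash(File.OpenRead(path))` streams the data through `HashCore` without loading the whole file into memory.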
MoonStom
  • Does this only read the array 1 byte at a time? The [x86 CRC32 instruction](https://www.felixcloutier.com/x86/crc32) is available with 1,2,4, or 8 byte source size. In C++, `_mm_crc32_u64`. This would go 8x faster than using it one byte at a time; the latency is fixed at 3 cycles regardless of input size. (32-bit mode can only do up to 4 bytes at a time) – Peter Cordes Nov 23 '19 at 04:52
  • 1
    You're right. I just updated the answer with the X64 version of it. It's slightly faster now. – MoonStom Nov 23 '19 at 05:11
  • Only "slightly"? Are you testing on tiny inputs? Or is your benchmark dominated by warm-up time? Or does that not compile efficiently for some reason? For medium or larger buffers (like 1kiB should be plenty to hide startup / cleanup overhead), that should be 8x faster. – Peter Cordes Nov 23 '19 at 05:25
  • @PeterCordes I just realized something funny. With 1GB of data and on my I7 CPU, with the polynomial tables I get the hash in 1080ms, X32 SSE in 2660ms and X64 SSE in 353ms. And the results are consistent. You'd think the hardware version is always faster :) – MoonStom Nov 23 '19 at 06:45
  • I'm pretty sure in asm the CRC32 instruction 8 bytes at a time is the fastest possible way, certainly on Intel CPUs. If that's not the case for your C# code then it didn't compile efficiently. Would it help to use a local tmp var instead of the `_crc` class member inside the loop to make sure it can be optimized into a register? Are you sure you're using Release mode and doing enough of a warm-up run? – Peter Cordes Nov 23 '19 at 06:58
  • Wait, when you say "X32 SSE", do you mean running this exact SSE4.2 code in 32-bit mode? It will have to emulate `Sse42.X64.Crc32(..., ..., UInt64)` because like I said, 32-bit only has direct HW support for chunks up to 32-bit. 64-bit operand-size isn't available in 32-bit mode. I'm surprised it can run at all. Your results show a good speedup for 64-bit mode. In absolute terms, 8-byte at a time `crc32 rax, [rdi]` has 3 cycle latency, so your throughput should be 8 bytes per 3 clock cycles. Testing repeated passes over the same 8k buffer would avoid cache miss bottlenecks, BTW. – Peter Cordes Nov 23 '19 at 07:13
  • @PeterCordes I ran the tests in Release mode, 64 bit. When I wrote X32 I meant the uint variant, which appears to be slower than the software version with lookup tables. I double checked my results, made sure the tests were warmed up, running on the same byte array and multiple times in the same go. I also tried with streams, because this code will run against files at the end of the day, and the results weren't far behind. Internally the HashAlgorithm class uses a 4KB buffer for streams. You're free to run your own tests and share your results. – MoonStom Nov 24 '19 at 02:29
  • I don't have Windows or .NET installed. All I can tell you is what's efficient in asm. If a 32-bit operand-size version of this loop wasn't twice the time of 64-bit, then your compiler did a terrible job. IDK why, but that's obviously the case. It's certainly plausible, though. This depends on `BitConverter.ToUInt64` or `32` basically optimizing into a simple load or memory source operand for a `crc32` instruction. If it doesn't then you have a problem. – Peter Cordes Nov 24 '19 at 02:35
  • The reason why 32-bit is so much slower is because if x64 is not available, it runs the `while (cbSize > 0)` portion, which means the entire CRC is calculated 1 byte at a time. To fix this, add the following just before `while (cbSize > 0)` (sorry, can't do multiple lines): `while (cbSize >= 4) { _crc = Sse42.Crc32(_crc, BitConverter.ToUInt32(array, ibStart)); ibStart += 4; cbSize -= 4; }` – XJDHDR Mar 27 '22 at 08:10
  • That said, in my testing, I found that even using the x64 Intrinsics method is only around 25% faster than using a pure C# slice-by-16 algorithm. My fix only increases x86 mode to around 50% slower than the software algorithm. What did make a huge difference was calling the methods in these C++ DLLs, which are around 7.5x faster than the x64 Intrinsics method: https://crc32c.machinezoo.com/#cpp – XJDHDR Mar 27 '22 at 08:25

I believe Mono allows access to CPU instructions via the Mono.Simd namespace:

http://tirania.org/blog/archive/2008/Nov-03.html

Stackoverflow related question:

Using SSE in c# is it possible?

The Mono code is open source. It looks like you can't just add this to a .NET project to get the benefit, though, as it appears to need the Mono runtime:

Calling mono c# code from Microsoft .net?

That said, it will still work without the Mono runtime, but the SIMD operations will be software-emulated.

Adam Houldsworth