9

Why in the world was _mm_crc32_u64(...) defined like this?

unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );

The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored and filled with 0's on completion, so there is NO use in EVER having a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64 bits when I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source with a 32-bit destination. Note the other intrinsics:

unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v ); 

The type of "crc" is not an 8-bit type, nor is the return type; they are 32 bits. Why is there no

unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );

? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
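The zero-extension behavior described above can be sanity-checked directly. This is a hedged sketch, not part of the original post; it assumes an SSE4.2-capable x86-64 machine and a GCC/Clang or MSVC toolchain, and `crc32_upper_bits` is a made-up name:

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>       // MSVC: _mm_crc32_u64
#define SSE42_FN
#else
#include <immintrin.h>    // GCC/Clang: enable SSE4.2 per-function
#define SSE42_FN __attribute__((target("sse4.2")))
#endif

// Seed garbage into bits 63:32 of the accumulator; the CRC32
// instruction reads only bits 31:0 of it and zero-extends its
// 32-bit result, so the upper half of the return value is always 0.
SSE42_FN uint64_t crc32_upper_bits()
{
    uint64_t crc = 0xDEADBEEF00000000ull | 0x0FFFFFFFull;
    uint64_t r = _mm_crc32_u64(crc, 0x1122334455667788ull);
    return r >> 32;
}
```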

Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks. My guess is something like this:

#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))

for GCC, and

#define CRC32(D32,S) __asm { crc32 D32, S }

for Visual Studio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly-level programming.
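Not from the original thread, but here is a hedged sketch of the wrapper the question asks for: 64-bit source, 32-bit accumulator, using GCC/Clang extended asm (AT&T syntax, x86-64 only) with an intrinsic fallback for MSVC. The names `crc32_u64_32` and `crc32_u8_32` are invented for illustration:

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>   // MSVC: _mm_crc32_u64 / _mm_crc32_u8
#endif

// 64-bit source, 32-bit destination: the shape the question asks for.
static inline uint32_t crc32_u64_32(uint32_t crc, uint64_t v)
{
#if defined(__GNUC__)
    uint64_t tmp = crc;   // crc32q needs a 64-bit destination register,
                          // but only bits 31:0 of it matter
    __asm__("crc32q %1, %0" : "+r"(tmp) : "rm"(v));
    return static_cast<uint32_t>(tmp);   // truncation is free: upper bits are 0
#else
    return static_cast<uint32_t>(_mm_crc32_u64(crc, v));
#endif
}

// Byte-at-a-time variant for tails; here the destination really is 32-bit.
static inline uint32_t crc32_u8_32(uint32_t crc, uint8_t v)
{
#if defined(__GNUC__)
    __asm__("crc32b %1, %0" : "+r"(crc) : "rm"(v));
    return crc;
#else
    return _mm_crc32_u8(crc, v);
#endif
}
```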

Small edit: note the macros I've defined:

#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P)  *(reinterpret_cast<const uint8 * &>(P))++


#define DO1_HW(CR,P) CR =  _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR =  _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR =  _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;

Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.

Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.

If I could get the definition of CRC32 above correct, then I would want to change my macros to

#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
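With a wrapper like that, the macros stay uniform. Below is a hedged sketch (not from the original post) of the same idea written as overloaded inline functions rather than macros; it assumes SSE4.2 hardware and GCC/Clang or MSVC, and `crc32c_buffer` is an invented name:

```cpp
#include <cstdint>
#include <cstring>

#if defined(_MSC_VER)
#include <intrin.h>
#define SSE42_FN
#else
#include <immintrin.h>
#define SSE42_FN __attribute__((target("sse4.2")))
#endif

// Uniform wrappers: every overload takes and returns a 32-bit CRC,
// so all four DOx_HW macros could look identical.
SSE42_FN static inline uint32_t CRC32(uint32_t c, uint8_t  v) { return _mm_crc32_u8 (c, v); }
SSE42_FN static inline uint32_t CRC32(uint32_t c, uint16_t v) { return _mm_crc32_u16(c, v); }
SSE42_FN static inline uint32_t CRC32(uint32_t c, uint32_t v) { return _mm_crc32_u32(c, v); }
SSE42_FN static inline uint32_t CRC32(uint32_t c, uint64_t v)
{
    // Truncation is free: the instruction zero-extends its 32-bit result.
    return static_cast<uint32_t>(_mm_crc32_u64(c, v));
}

// Example driver: 8 bytes at a time, then the tail byte by byte.
SSE42_FN uint32_t crc32c_buffer(const uint8_t *p, size_t n, uint32_t crc)
{
    for (; n >= 8; p += 8, n -= 8) {
        uint64_t v;
        memcpy(&v, p, 8);     // safe unaligned load
        crc = CRC32(crc, v);
    }
    for (; n; --n)
        crc = CRC32(crc, *p++);
    return crc;
}
```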
David I. McIntosh
  • `Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.` --> What have you tried??? ... ???????? And note that "8-bits" is not a type. – Sebastian Mach Apr 05 '13 at 18:58
  • And note there is no need to be that rude. If you are more "brilliant" than the "silly" person you are calling out for (as in "Who was the "brilliant" person who defined"): Why don't you try to contact the copyright owner of the code? – Sebastian Mach Apr 05 '13 at 19:03
  • OK, I'll tone it down, but the "owner" of the code is Microsoft, and when was the last time you had success contacting Microsoft? In any event, it's not a question of "trying" something really - the intrinsic works, and the code above works. The issue is that I need maximal performance, and the intrinsic does not allow this, and for no good reason. The question "Why was (it) defined like this?" is rhetorical - it should have been defined differently. The point of my post was to see if anyone had tested code to do it properly, code that has been tested multi-platform. – David I. McIntosh Apr 05 '13 at 19:11
  • While I can write the code, I cannot test it on all the platforms where people may be using my code, hence I was hoping someone who is better at low-level programming than I had some useful code. – David I. McIntosh Apr 05 '13 at 19:13
  • Actually you asked "Who wrote it", not "Why was it written like that". And I never tried to contact Microsoft, because I don't use any microsoft products for work; however, did you? – Sebastian Mach Apr 05 '13 at 19:22
  • Indeed, you did change the meaning of parts of my post with your edit. And are you suggesting that I should attempt to contact Microsoft and suggest that they change a standard intrinsic that they have defined and released, and that every else has now also implemented and defined the same way for conformity? – David I. McIntosh Apr 05 '13 at 19:35
  • And lastly, in my original statement that you changed and then publicly derided ("The type...is not 8-bits,..."), I am not claiming that "8-bits" is a type. The English construction used is one of implicit information, where most readers would understand the statement to mean "The type .. is not an 8-bit type, ..." where the second reference to the concept of "type" is dropped and understood from the first use. I changed it back from your "not composed of 8-bits", which is a bit too pedantic for my liking, and it is my post after all. ;-) – David I. McIntosh Apr 05 '13 at 19:37
  • And also, is there a prohibition against humour on this site? You removed my original line "and in your head you should be hearing the little jingle 'One of these things is not like the others, one of these things just doesn't belong.'", but did not replace it with anything that conveyed the same meaning, which I don't think you should do. If you are going to edit others' posts, at least don't change the important ideas within the posts. – David I. McIntosh Apr 05 '13 at 19:45
  • "And are you suggesting that I should attempt to contact Microsoft and suggest that they change a standard intrinsic" -> Nope, just ask them for the reason. Anyways, seems I am not sensitive to your kind of humor, which consists of exaggeration and insult. Must say I do not understand what you are looking for here: Blaming someone? Asking for reasons? As you say, changing from "who" to "why" changes intended meaning, so I am really confused now. – Sebastian Mach Apr 05 '13 at 21:53
  • Oh wait: Let me refine my question for what you are actually looking for: Blaming someone? Asking for reasons? Or give me teh codez? And btw, what have you tried already? Just asking so we don't put effort into doing something you have done already. – Sebastian Mach Apr 05 '13 at 21:54
  • Public vote to close: `It's difficult to tell what is being asked here.` – Sebastian Mach Apr 05 '13 at 21:56
  • @DavidI.McIntosh - my experience is of few years ago - but at that time, many of the Microsoft Devs used to answer questions put in the Microsoft Forums - so it wasn't that difficult to get answers from MS. Don't know what the situation is like now - but you can try getting answers from MS there - http://social.msdn.microsoft.com/Forums/en-US/categories – user93353 Apr 05 '13 at 22:29
  • Sigh. Ok, first, I never asked "why was it designed thusly?" in my first post - that was added in an edit of my post, over which I had no control. I don't want to know why. The intrinsic was designed incorrectly. I was simply giving background information for my query, which was, and is, does anyone have tested code that will do what I want. Secondly, I did not insult anyone. Microsoft's intrisic is badly designed. I don't know who designed it, I strongly suspect it was not one individual. I don't really care who designed it. I just want sensible, tested code. – David I. McIntosh Apr 06 '13 at 01:28
  • Thirdly, how the heck is "and in your head you should be hearing the little jingle 'One of these things is not like the others, one of these things just doesn't belong.'" insulting??? I am merely using some silliness to point out the lack of homogeneity in the macros _I_ defined. How does that insult anyone? The inconsistency is necessitated by poor design of the Microsoft intrinsic, but how can you _possibly_ interpret my line as insulting to anyone????? – David I. McIntosh Apr 06 '13 at 01:32
  • Fourthly, I haven't tried anything. I'm not an assembly-level programmer. I suspect I'd need about one line of __asm__ code for GCC and one line of _asm for VS, but I have little experience with "constraints" and the like, and so I was hoping that someone who does understand these things better than I would have a suggestion. Just a simple question. Why are you giving me such a hard time about it and trying to shut down my query???? – David I. McIntosh Apr 06 '13 at 01:45
  • Lastly, please point out where the "exaggeration" is in my post? Is it my statement that CRC32 is "always" a 32-bit result? Is it my statement that one of the macros I presented above is not consistent with the others? Just which statement above is not factual? – David I. McIntosh Apr 06 '13 at 01:49

2 Answers

12

The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction's output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial (the CRC-32C, or Castagnoli, polynomial). However, the instruction lets your code feed it input data 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time could simplify your code logic if the input byte count is odd or not a multiple of 4/8.

#include <stdio.h>
#include <stdint.h>
#include <intrin.h>

int main (int argc, char *argv [])
    {
    int index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    total1 = total2 = total3 = total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);

    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);

    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);

    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using 8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    printf ("CRC32 result using 64-bit chunks: %08X\n", (unsigned) total4);
    return 0;
    }
  • Nope. Notice your declaration for total4 differs from the declaration for total1, total2 and total3. If we are to make mixed use of _mm_crc32_u64, _mm_crc32_u32, _mm_crc32_u16 and _mm_crc32_u8, we need to do data type conversions between use of _mm_crc32_u64 and all the others. Admittedly they are trivial, but they are also completely unnecessary - as I said, there is NO point in using a 64-bit destination data type. – David I. McIntosh Apr 03 '13 at 16:37
  • To be more specific, given `const uint8_t *data; unsigned long total = 0xFFFFFFFFUL; int nSize = <size of input data>;`, I can do this: `//Align memory on 4-byte boundary for(; nSize>0 && (data&3)!=0; --nSize) total = _mm_crc32_u8(total, *data++); for( ; nSize>=4; nSize -= 4 ) total = _mm_crc32_u32(total, *(reinterpret_cast<const uint32*&>(data))++); if( nSize>=2 ) { total = _mm_crc32_u16(total, *(reinterpret_cast<const uint16*&>(data))++); nSize -=2; } if( nSize>0 ) total = _mm_crc32_u8(total, *data++);` – David I. McIntosh Apr 03 '13 at 16:57
  • But I can't do this: `for(; nSize>0 && (data&3)!=0; --nSize) total = _mm_crc32_u8 (total, *data++); for( ; nSize>=8; nSize -= 8 ) total = _mm_crc32_u64(total, *(reinterpret_cast<const uint64*&>(data))++); if( nSize>=4 ) { total = _mm_crc32_u32(total, *(reinterpret_cast<const uint32*&>(data))++); nSize -= 4; } if( nSize>=2 ) { total = _mm_crc32_u16(total, *(reinterpret_cast<const uint16*&>(data))++); nSize -=2; } if( nSize>0 ) total = _mm_crc32_u8(total, *data++);` – David I. McIntosh Apr 04 '13 at 15:12
  • without incurring a cost before the first for-loop of transforming my 32-bit "total" to a 64-bit "total64", which is completely unnecessary and silly. I.e. the 64-bit loop needs to be: `for( ; nSize>=8; nSize -= 8 ) total = _mm_crc32_u64(total, *(reinterpret_cast<const uint64*&>(data))++)&0xFFFFFFFF;` and there is also an implicit conversion of the first parameter to _mm_crc32_u64 from 32 bit to 64 bit. – David I. McIntosh Apr 04 '13 at 15:14
  • @DavidI.McIntosh: Why do you think that case would have any cost at all? x86-64 zero-extends for free, so unless your compiler sucks at optimizing there's no real cost to a 64-bit type for the accumulator / retval. (The compiler might not "know" that the high 32 bits are zero, but that only matters if you explicitly wrote `1 + (uint64_t)(uint32_t)retval`; it might spend an instruction zero-extending.) Normally you'd just invert the result to post-process and then store it to memory. – Peter Cordes Jun 18 '18 at 01:20
  • And BTW, this loop is missing pre/post processing, so it's not actually computing a CRC32C. See Mark Adler's [Implementing SSE 4.2's CRC32C in software](https://stackoverflow.com/a/17646775) – Peter Cordes Jun 18 '18 at 01:20
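Picking up on that last comment: a real CRC32C folds in pre- and post-inversion around the instruction-level updates. A minimal sketch, not from the thread (assumes SSE4.2 hardware and GCC/Clang or MSVC; `crc32c` is an invented name):

```cpp
#include <cstdint>
#include <cstring>

#if defined(_MSC_VER)
#include <intrin.h>
#define SSE42_FN
#else
#include <immintrin.h>
#define SSE42_FN __attribute__((target("sse4.2")))
#endif

// CRC32C with the standard pre/post inversion the comment mentions;
// processes 64-bit chunks, then finishes the tail byte by byte.
SSE42_FN uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = static_cast<const uint8_t *>(buf);
    uint32_t crc = ~0u;                  // pre-condition the CRC
    for (; len >= 8; p += 8, len -= 8) {
        uint64_t v;
        memcpy(&v, p, 8);                // avoid unaligned-access UB
        crc = static_cast<uint32_t>(_mm_crc32_u64(crc, v));
    }
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;                         // final inversion
}
```

With this, `crc32c("123456789", 9)` yields the standard CRC-32C check value 0xE3069283.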
4

Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.

My friend and I wrote a C++ SSE intrinsics wrapper which provides the preferred form of the crc32 instruction, with a 64-bit source.

http://code.google.com/p/sse-intrinsics/

See the i_crc32() function. (Sadly there are even more flaws in Intel's SSE intrinsic specifications for other instructions; see this page for more examples of flawed intrinsic design.)

  • Thanks very much. This is exactly the sort of thing I was looking for! I will look and see if it gives me what I need. Thanks again. – David I. McIntosh Apr 16 '13 at 17:39
  • Your header file has the comment "(and yes, the 64-bit CRC32 generates an effective 32-bit result)". Are you saying the declaration `unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );` in the VisualStudio header files is incorrect and/or misleading? Because I notice your USE of the _mm_crc32_u64 intrinsic is as if it had been declared as I was claiming it should have been, i.e. as if it were `unsigned __int32 _mm_crc32_u64( unsigned __int32 crc, unsigned __int64 v );`. thanks. – David I. McIntosh Apr 16 '13 at 20:07
  • Basically, the x64 crc32 instruction, which uses the 64-bit GPRs as operands, leaves the upper 32 bits of the result as 0; only the lower 32 bits contain the legit data. The return type was made "__int64" in the intrinsic because the result is returned in a 64-bit GPR in the real asm instruction. – cottonvibes Jun 26 '13 at 19:13
  • The code is no longer available for casual browsing because Google Code is effectively shutdown. Perhaps you can add the relevant portions to your answer. – jww Apr 24 '16 at 12:59
  • If anyone comes across this, here is the archive for the repository: https://code.google.com/archive/p/sse-intrinsics/ – Joshua Estes Aug 28 '23 at 09:41