31

I'm trying to learn to code using intrinsics, and below is code that does addition.

compiler used: icc

#include<stdio.h>
#include<emmintrin.h>
int main()
{
        __m128i a = _mm_set_epi32(1,2,3,4);
        __m128i b = _mm_set_epi32(1,2,3,4);
        __m128i c;
        c = _mm_add_epi32(a,b);
        printf("%d\n",c[2]);
        return 0;
}

I get the below error:

test.c(9): error: expression must have pointer-to-object type
        printf("%d\n",c[2]);

How do I print the values in the variable c, which is of type __m128i?

Peter Cordes
arunmoezhi
  • 2
    Also note that `__m128i` doesn't have any info on the type that is being stored. It could be 8-bit ints, 16-bit ints, 32-bit, etc... Some compilers support the `.m128i_i32` field extensions. But it's definitely not standard and not in GCC. – Mysticial Nov 06 '12 at 18:59
  • 1
    related to the title: [how to print __uint128_t number using gcc?](http://stackoverflow.com/q/11656241/4279) – jfs Nov 06 '12 at 19:00
  • 1
    Note that some compilers have built-in printf support for SIMD types, e.g. Apple's versions of gcc, clang, etc, all support `%vld` for printing an `__m128i` as 4 x 32 bit ints. – Paul R Nov 06 '12 at 19:12
  • I'm using intel compiler – arunmoezhi Nov 06 '12 at 19:15
  • Is there a way to do masked addition. Say I would like to store only the alternate elements (c[0],c[2])? – arunmoezhi Nov 06 '12 at 19:18
  • `0` is the identity element for addition. So mask one of the input operands, and the corresponding elements of `c = a + (b & mask)` will be `c = a + 0 = a`. – Peter Cordes Apr 17 '16 at 02:19

4 Answers

28

Use this function to print them:

#include <emmintrin.h>  // SSE2: __m128i
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void print128_num(__m128i var)
{
    uint16_t val[8];
    memcpy(val, &var, sizeof(val));
    printf("Numerical: %i %i %i %i %i %i %i %i \n", 
           val[0], val[1], val[2], val[3], val[4], val[5], 
           val[6], val[7]);
}

This splits the 128 bits into 16-bit elements before printing them; use a uint32_t val[4] array instead to print 32-bit elements.

To split into two 64-bit halves instead (if 64-bit integer types are available):

#include <inttypes.h>

void print128_num(__m128i var) 
{
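    uint64_t v64val[2];
    memcpy(v64val, &var, sizeof(v64val));
    // %llx expects unsigned long long; uint64_t is plain unsigned long in some ABIs, so cast
    printf("%.16llx %.16llx\n",
           (unsigned long long)v64val[1], (unsigned long long)v64val[0]);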
}

Note: casting &var directly to an int* or uint16_t* would also work with MSVC, but this violates strict aliasing and is undefined behaviour. Using memcpy is the standard-compliant way to do the same thing, and with minimal optimization the compiler will generate exactly the same binary code.

bcmpinc
askmish
  • 1
    Replace `llx` with `lld` if u want int. – askmish Nov 06 '12 at 18:52
  • it works. I used uint32_t to print the 32-bit integers. But the output is reversed. Instead of `2,4,6,8` i get `8,6,4,2`. Does `_mm_add_epi32` store the values in reverse order? – arunmoezhi Nov 06 '12 at 19:00
  • 3
    @NateEldredge: Probably not. A `_mm_extract_epi32`, or store to a local array are more normal. You could also assign to a `union` of a `__m128i` and an array. This is fine for testing / debug-prints *if* it happens to work when you try it. A debugger will show you what's in your vectors more easily than debug-prints, though. – Peter Cordes Apr 17 '16 at 02:17
  • 1
    also : `__m128i bp = _mm_set_epi32(0xFF, 0xfe,0xfa,0xfb); std::cout << std::setfill('0') << std::hex< – Алексей Неудачин Oct 25 '18 at 14:06
  • How about `int *val = (int*)&var`? Then you wouldn't need the `memcpy`. – Nanashi No Gombe Jun 15 '20 at 21:49
  • @NanashiNoGombe: That's a strict-aliasing violation, pointing an `int*` into an object of a different type (`__m128i`). See the edit history, and my answer. It's weird to use `memcpy` instead of `_mm_storeu_si128`, but either way compilers will hopefully optimize to the same asm, and not actually call a `memcpy` library function (except in a debug build.) – Peter Cordes May 23 '22 at 14:10
20
  • Portable across gcc/clang/ICC/MSVC, C and C++.
  • fully safe with all optimization levels: no strict-aliasing violation UB
  • print in hex as u8, u16, u32, or u64 elements (based on @AG1's answer)
  • Prints in memory order (least-significant element first, like _mm_setr_epiX). Reverse the array indices if you prefer printing in the same order Intel's manuals use, where the most significant element is on the left (like _mm_set_epiX). Related: Convention for displaying vector registers

Using a __m128i* to load from an array of int is safe because the __m128 types are defined to allow aliasing just like ISO C unsigned char*. (e.g. in gcc's headers, the definition includes __attribute__((may_alias)).)

The reverse isn't safe (pointing an int* onto part of a __m128i object). MSVC guarantees that's safe, but GCC/clang don't. (-fstrict-aliasing is on by default). It sometimes works with GCC/clang, but why risk it? It sometimes even interferes with optimization; see this Q&A. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

See GCC AVX _m256i cast to int array leads to wrong values for a real-world example of GCC breaking code which points an int* at a __m256i.


(uint32_t*) &my_vector violates the C and C++ aliasing rules, and is not guaranteed to work the way you'd expect. Storing to a local array and then accessing it is guaranteed to be safe. It even optimizes away with most compilers, so you get movq / pextrq directly from xmm to integer registers instead of an actual store/reload, for example.

Source + asm output on the Godbolt compiler explorer: proof it compiles with MSVC and so on.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#ifndef __cplusplus
#include <stdalign.h>   // C11 defines _Alignas().  This header defines alignas()
#endif

void p128_hex_u8(__m128i in) {
    alignas(16) uint8_t v[16];
    _mm_store_si128((__m128i*)v, in);
    printf("v16_u8: %#x %#x %#x %#x | %#x %#x %#x %#x | %#x %#x %#x %#x | %#x %#x %#x %#x\n",
           v[0], v[1],  v[2],  v[3],  v[4],  v[5],  v[6],  v[7],
           v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}

void p128_hex_u16(__m128i in) {
    alignas(16) uint16_t v[8];
    _mm_store_si128((__m128i*)v, in);
    printf("v8_u16: %#x %#x %#x %#x  | %#x %#x %#x %#x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}

void p128_hex_u32(__m128i in) {
    alignas(16) uint32_t v[4];
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %#x %#x %#x %#x\n", v[0], v[1], v[2], v[3]);
}

void p128_hex_u64(__m128i in) {
    alignas(16) unsigned long long v[2];  // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
    _mm_store_si128((__m128i*)v, in);
    printf("v2_u64: %#llx %#llx\n", v[0], v[1]);
}

If you need portability to C99 or C++03 or earlier (i.e. without C11 / C++11), remove the alignas() and use storeu instead of store. Or use __attribute__((aligned(16))) or __declspec( align(16) ) instead.

(If you're writing code with intrinsics, you should be using a recent compiler version. Newer compilers usually make better asm than older compilers, including for SSE/AVX intrinsics. But maybe you want to use gcc-6.3 with -std=gnu++03 C++03 mode for a codebase that isn't ready for C++11 or something.)


Sample output from calling all 4 functions on the following vector:

// source used:
__m128i vec = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7,
                            8, 9, 10, 11, 12, 13, 14, 15, 16);

// output:

v2_u64: 0x807060504030201 0x100f0e0d0c0b0a09
v4_u32: 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d
v8_u16: 0x201 0x403 0x605 0x807  | 0xa09 0xc0b 0xe0d 0x100f
v16_u8: 0x1 0x2 0x3 0x4 | 0x5 0x6 0x7 0x8 | 0x9 0xa 0xb 0xc | 0xd 0xe 0xf 0x10

Adjust the format strings if you want to pad with leading zeros for consistent output width. See printf(3).

Peter Cordes
5

I know this question is tagged C, but it was the best search result also when looking for a C++ solution to the same problem.

So, this could be a C++ implementation:

#include <emmintrin.h>  // SSE2: __m128i
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

#if defined(__SSE2__)
template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    T values[16/sizeof(T)];
    std::memcpy(values,&var,sizeof(values)); //See discussion below
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) { //C++11: Range for also possible
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) { //C++11: Range for also possible
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}
#endif

Usage:

#include <iostream>
[..]
__m128i x;
[..]
std::cout << __m128i_toString<uint8_t>(x) << std::endl;
std::cout << __m128i_toString<uint16_t>(x) << std::endl;
std::cout << __m128i_toString<uint32_t>(x) << std::endl;
std::cout << __m128i_toString<uint64_t>(x) << std::endl;

Result:

141 114 0 0 0 0 0 0 151 104 0 0 0 0 0 0
29325 0 0 0 26775 0 0 0
29325 0 26775 0
29325 26775

Note: there is a simple way to avoid the if (sizeof(T) == 1) special case; see https://stackoverflow.com/a/28414758/2436175

Antonio
  • You should use `alignas(16) T values[16/sizeof(T)];` and `_mm_storeu_si128( (__m128i*)values, var);` All the rest of the code works fine then. And simplifies, because you can use a range-for like `for(T v : values)`, I think. – Peter Cordes Oct 15 '17 at 07:13
  • @PeterCordes I see your point. I wonder if one could simply use a memcpy instead, that would spare the necessity of requiring an aligned buffer. – Antonio Oct 15 '17 at 22:19
  • See my answer. Use `storeu` instead of `store` if you don't have C++11 for `alignas`, or compiler-specific directives. It will probably still optimize away. (And BTW, modern Windows / Linux already align the stack by 16B, so it doesn't cost the compiler anything to align the buffer if it does actually store/reload.) – Peter Cordes Oct 15 '17 at 22:24
  • @PeterCordes Yet, isn't memcpy a valid alternative? – Antonio Oct 16 '17 at 02:02
  • Yes, that would work too. I expect that compilers will generally do better with a `_mm_storeu_si128`, though. For memcpy, I wouldn't be surprised if at least one compiler would actually spill the `__m128` to the stack and then copy from there to an array. (Although maybe not, inlining trivial memcpy is something they're normally pretty good at.) Still, `_mm_storeu` is definitely the more idiomatic way to go about it, IMO. It does actually optimize away completely to `movq` / `pextrq` or whatever instructions with most compilers (even when store/reload would be better: many tiny elements) – Peter Cordes Oct 16 '17 at 02:06
  • @PeterCordes Thank you for the insight! One point is that perfect optimization is not really important in a streaming function like this. I will adapt my answer to use memcpy, which sticks more to my coding style, and I believe is easier to understand. – Antonio Oct 16 '17 at 14:33
  • I think `storeu` is easier to understand, since code with intrinsics will definitely use that. But ok, if you like `memcpy` for type-punning, then go ahead. – Peter Cordes Oct 17 '17 at 01:46
  • Is there a way to avoid needing to special-case `sizeof(T) == 1` to still print as integers, not `char` or `unsigned char`? That's the most clunky thing about this function :/ – Peter Cordes Oct 17 '17 at 01:47
  • Possible UB if `sizeof(T) == 3` or something. (A `class` with three `char` members, e.g. RGB24 pixels and an overloaded `operator<<`). Maybe round up with `values[ (16+sizeof(T)-1) / sizeof(T) ]`. (Probably worse alternative: `memcpy(..., sizeof(values))` to only copy the right number of bytes, but that's more likely to optimize poorly.) – Peter Cordes Oct 17 '17 at 01:53
  • Let's do sizeof values, the compiler will have the information to figure things out and optimize – Antonio Oct 17 '17 at 07:22
  • Yeah, but then you're not copying all the bytes, and the array isn't big enough to hold a whole `__m128i`. It's a harder optimization. BTW, `std::memcpy` is only defined in `<cstring>`. gcc defines regular `memcpy` in `` (I guess because C defines it in `string.h`). I guess it's best to include `<cstring>` if you want memcpy, since [C++ doesn't mention a non `std::` version of it](http://en.cppreference.com/w/cpp/string/byte/memcpy), at least on cppreference. – Peter Cordes Oct 17 '17 at 07:23
  • Probably the optimizer will use 16 instead of 15 bytes, and probably not memcpy at all. I prefer to keep it easy to understand – Antonio Oct 17 '17 at 07:28
  • As I suspected, your optimism was misplaced with `gcc7.2 -O3` and `clang5.0 -O3`. They both actually do 15-byte copies. Right-click on the memcpy line in https://godbolt.org/g/vRtzs3, and "scroll to assembly". (gcc with 8+4+2+1 byte stores, clang with 2 overlapping 8-byte stores). Not a big deal compared to how much code it takes to create a `std::string` (and feed it in turn to another `ostream` and destroy it, if you want to print it to `cout`), but still. Maybe there's a nicer way to round up the size? – Peter Cordes Oct 17 '17 at 07:54
  • I prefer to keep easy to understand. I added a note on how the if could be removed. – Antonio Oct 17 '17 at 07:58
  • Also, your function doesn't compile if `(int)values` is ill-formed, even if it's not used. Do you know of a better way to do that which doesn't require both sides of a compile-time-constant branch to compile? I guess maybe `enable_if` to make separate templates for sizeof(T)=1 or not. (There's still the issue of 1-byte classes that overload `operator<<`...). (In the Godbolt link, I just `#ifdef`ed out that branch so it would compile with my `RGB24` class.) – Peter Cordes Oct 17 '17 at 07:58
  • 1
    Yeah, it's only a performance problem if you do use it with an non-power-of-2 class, not `uint*_t`. It makes sense to keep it as-is for readability. (Especially since there's nothing high-performance about using `std::string` and a string-stream to print a vector.) If you were putting this in a library for people to use without looking at it, instead of an SO answer, you'd make different choices. – Peter Cordes Oct 17 '17 at 08:00
2
#include <stdio.h>
#include <stdint.h>     // int32_t
#include <emmintrin.h>
int main()
{
    __m128i a = _mm_set_epi32(1,2,3,4);
    __m128i b = _mm_set_epi32(1,2,3,4);
    __m128i c;

    const int32_t* q; 
    //add a pointer 
    c = _mm_add_epi32(a,b);

    q = (const int32_t*) &c;
    printf("%d\n",q[2]);
    //printf("%d\n",c[2]);
    return 0;
}

Try this code.

ismail
Lucien
  • @NateEldredge: I'm sure this is *not* strictly legal (unless you use `-fno-strict-aliasing` or something). I posted an answer that is safe. – Peter Cordes Oct 15 '17 at 07:01
  • @PeterCordes, regarding your comment that "this is not strictly legal", is there some way to get a compiler warning? I tried using `-Wstrict-aliasing` and don't receive any warning. I also tried `-fsanitize=undefined` to check for a runtime warning or error, but received neither. – dannyadam Nov 25 '20 at 21:08
  • 1
    @dannyadam: Interesting, but it seems those checks don't catch things that are clearly strict-aliasing violations: https://godbolt.org/z/qo4vre e.g. `return *(5 + (int*)arr);` for an `unsigned long long arr[10];` array. – Peter Cordes Nov 26 '20 at 03:45