0

Need to know whether Intel Xeon Phi scatter / gather can implement this use case

int *ptr[3];
int a , b , c ;
ptr[0] = &a;
ptr[1] = &b;
ptr[2] = &c;

Want to add 1 in a, b and c using vector instruction and ptr Need vector operation not on immediate value but on value at address,

Paul R
  • 208,748
  • 37
  • 389
  • 560
anshkun
  • 105
  • 1
  • 12
  • Can you make a,b, and c contiguous, by putting them in a struct for example? – Peter Cordes Mar 15 '18 at 07:21
  • gather load + scatter store is not worth it for only 3 scalars. Scatter has a high fixed cost (not cheap even with most elements masked off). http://agner.org/optimize/ has instruction tables, including for KNL. – Peter Cordes Mar 15 '18 at 07:32
  • No, Actually this array of pointer will hold address scattered which will be updated dynamically during program run so we want to store same value at all the addresses pointed by pointer – anshkun Mar 15 '18 at 08:26
  • Is it really only 3 pointers? Or do you have 8 or 16 pointers to scatter to? – Peter Cordes Mar 15 '18 at 08:28
  • Its not for three scalars we have array of pointer which will be pointing to addresses of 8 or 16 scalars. So we want function "func(array of pointer, val)" which will store/load value at address – anshkun Mar 15 '18 at 08:29
  • 1
    Then yes, use an AVX512 scatter of 8 or 16 elements (use [`VPSCATTERDD`](https://github.com/HJLebbink/asm-dude/wiki/VPSCATTERDD_VPSCATTERDQ_VPSCATTERQD_VPSCATTERQQ) via intrinsics), especially if you're doing a write-only or you know that none of the pointers overlap. Otherwise you need conflict-detection to make sure one scalar gets incremented twice with the gather / increment / scatter. – Peter Cordes Mar 15 '18 at 08:33
  • It means if I assign random addresses to all the 8 elements of array then I can update value of all the address in single instruction. Actually we have use case where 8/16 threads are reading value from specific address which will be uniquely allocated and we will maintain these 8/16 scalar variable address in array of pointers. Then we want to write on all the addresses with single instruction instead of looping each array index and write value at that address [i.e while (i >=0) {*arr[i] = val; i--;}. There is no fixed gap between address present in arr[0] and arr[1]. – anshkun Mar 15 '18 at 09:07
  • 1
    Gather and scatter aren't atomic. It's one asm instruction, but internally it runs as 16 separate store operations. (related: [Per-element atomicity of vector load/store and gather/scatter?](https://stackoverflow.com/questions/46012574/per-element-atomicity-of-vector-load-store-and-gather-scatter): it's likely that each separate load/store is atomic, but that's all). Gather/scatter may be somewhat faster than a scalar loop, **but it doesn't gain you anything in terms of correctness for lock-free code**. – Peter Cordes Mar 15 '18 at 09:10
  • Thank you Peter cordes, Locking is not a concern here, only requirement is to remove loop operation and all addresses are aligned to 8 byte. So it means even all the addresses are aligned but we cannot say stores to all locations will be atomic. We maintained bitmap representing each array index so it is possible vector store will only do store operation as per bit set in bitmask using vmask. – anshkun Mar 15 '18 at 09:38
  • But *why* do you want to remove the loop? If it's *only* for performance reasons, then that's fine. If it's because you think that will make them all happen simultaneously for observers in other threads, that's *not* what you get. A scatter isn't an atomic transaction to multiple locations (and neither is a gather). If you want that, use TSX (transactional memory). https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions and https://www.realworldtech.com/haswell-tm/ – Peter Cordes Mar 15 '18 at 09:42
  • Need to confirm whether it is feasible, sorry for re-posting 1) vector store on scattered addresses without any fixed displacement, All the addresses stored in array of 8/16 pointers (uint64_t *ptr[8/16]; ) . here addresses at ptr[0 ..8 ] will be updated dynamically 2) We need store on specific index only on basis of bitmap 3 )If all the addresses in ptr[0...8] are 8 byte aligned still operation is not atomic 4) if addresses present in ptr[0..8] is mapped will be mapped to same physical page but different offset (not overlapping) then overlap problem will occur or not. – anshkun Mar 15 '18 at 09:49
  • yes for performance reason we are looking for actually I am confused that vector operation works with immediate values. Please correct whether we can use vector store similarly like [*ptr[0...8] = val] with single instruction, Here we are referring value at address. – anshkun Mar 15 '18 at 09:54
  • 1
    Oh right, you'll need to use `VPSCATTERQQ` with a base address of `0` (`NULL`) to use a vector of 64-bit pointers as "indices". But if they're all in the same page, you could instead store offsets with in the page and do 16 at once with `VPSCATTERDD` using 32-bit offsets relative to that page, or relative to the start of some less-than-4GiB shared memory region. 2) yes of course. 3) Each 4 or 8 byte store will be separately atomic (AFAIK), so the effect as far as other threads is equivalent to a loop doing atomic `memory_order_relaxed` stores. 4) if they don't overlap, then they don't overlap – Peter Cordes Mar 15 '18 at 09:55
  • Thanks Peter, Will go through VPSCATTERQQ/DD as well as transnational memory in more detail , Whether we can use transaction memory design for addresses mapped with device ? – anshkun Mar 15 '18 at 10:02
  • I don't think TSX works for MMIO; it works only in the cache coherency protocol between CPU cores. – Peter Cordes Mar 15 '18 at 10:09
  • 1
    @PeterCordes IIRC, there's no TSX on Knights Landing. – Mysticial Mar 15 '18 at 17:06

0 Answers0