Questions tagged [amd-gcn]

16 questions
4
votes
1 answer

Performance drop in matrix multiplication for certain sizes on AMD Polaris

I have OpenCL code that multiplies two matrices (GEMM) with M=4096, N=4096 and K=16 (i.e. matrices of 4096 x 16 floats). I run it on a Polaris 560 GPU with 16 CUs. Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl I noticed very strange…
Artyom
  • 31,019
  • 21
  • 127
  • 215
3
votes
0 answers

Is It Possible to Access the Entire 64KB of the Global Data Share (GDS) on AMD's GCN2+ GPUs?

I am trying to use the Global Data Share (GDS) on an AMD RX 480 for my application, on either Linux or Windows. Although the GCN3 Specification Manual states that you can access the GDS without restrictions by setting an appropriate value to the m0…
meriken2ch
  • 409
  • 5
  • 15
2
votes
1 answer

SIMD-16 and SIMD-32 advantage/disadvantage?

Recently, AMD launched its new GPU architecture, RDNA, in the new Navi GPU lineup. After reading some architecture deep-dive articles and watching a video, my understanding is this (feel free to correct me if I am wrong): Small workloads that need…
CSDD
  • 339
  • 2
  • 14
2
votes
0 answers

How to read and write to Global Data Share in AMD GCN?

I'm trying to use GDS on an AMD GPU, but I cannot make it work. My GPU is an AMD RX580. I used this OpenCL kernel: __kernel __attribute__((reqd_work_group_size(64, 1, 1))) void localVarExample(__global int *res) { int i = get_global_id(0); __local…
Michael Lukin
  • 829
  • 3
  • 9
  • 19
1
vote
0 answers

In OpenCL, can one take an array containing GCN Assembly and execute it (JIT)?

I'm relatively new to OpenCL and was wondering about this. I'd heard that it was possible to JIT on some AMD GPUs via OpenCL. Now, if this were to work syntactically as it does in C++, I would just write something like: uint…
MNagy
  • 423
  • 7
  • 20
1
vote
1 answer

V_SUB_F64 in AMD's GCN and VEGA instruction set

Why is there no "V_SUB_F64" instruction in AMD's GCN and VEGA instruction sets? How do they implement double-precision subtraction?
1
vote
2 answers

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I want to improve OpenCL kernel performance and would like to clarify how memory transactions work and which memory access pattern is really better (and why). The kernel is fed with vectors of 8 integers, which are defined as an array: int v[8], that means,…
qpdb
  • 41
  • 4
1
vote
0 answers

Do optimized kernels running on AMD GCN OpenCL only work with ~1024 bytes at a time?

I'm beginning to architect my first serious OpenCL program, and I want to make sure I understand how my AMD R9 290X (GCN 2.0 architecture) is set up. So I'll just say what I understand, and hopefully someone out there can tell me where I'm right or…
Dragontamer5788
  • 1,957
  • 1
  • 12
  • 20
0
votes
1 answer

How to resolve _pickle.UnpicklingError

I was trying to download and run the 2s-AGCN code from this GitHub link: I have generated the data successfully, but when trying to train the model by running main.py, I get this error: [ Mon Feb 20 21:32:20 2023 ] Training epoch: 1 0%| …
0
votes
0 answers

Instruction execution in GPGPU

I am learning about GPU hardware (the AMD GCN architecture). I am a little confused about instruction execution. Let me take an example: for(i=0;i<64;i++) c[i] = a[i] + b[i]. For the above code, assume the warp/wavefront has 64 threads. Now a…
MGS
  • 9
  • 1
0
votes
1 answer

What is the best practice for memory access in this N-body problem solved on AMD Radeon RX580?

I compute the trajectories of N particles moving in their own gravitational field. I wrote the following OpenCL kernel: #define G 100.0f #define EPS 1.0f float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n) { …
0
votes
1 answer

Are uint2 operations faster than ulong in OpenCL on AMD GCN cards?

Which of these "+" calculations is faster? 1) uint2 a, b, c; c = a + b; 2) ulong a, b, c; c = a + b;
0
votes
0 answers

How to run two work groups per one compute unit on AMD GCN cards

Usually one compute unit can only run one work group. But AMD's documentation says more than one wavefront can run on the same compute unit. How can I do that? Is there an OpenCL function for that? Or do I need to use assembly instructions? I want to…
0
votes
0 answers

GCM not receiving on ColorOS based devices

I have an Android app which sends and receives FCM messages (Google Firebase messages). It works perfectly: I can receive messages when my app is not in the foreground or not in use, as I have a service running. Unfortunately it's not…
0
votes
0 answers

Avoid L1 cache pollution on GCN device

I have a kernel that writes results to a global buffer; these results are never read back into the kernel (they are processed by another kernel at a later time). So, I don't want this data sitting in the L1 cache if I can help it. Is there a way of…
Jacko
  • 12,665
  • 18
  • 75
  • 126