ARM Neon Matrix multiplication example

Question

I wanted to learn Neon. I took the example on the ARM website at:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0425/ch04s06s05.html

I thought I would get this to run and then start to experiment with it (KISS). The program compiles fine with GCC, but when I run it I get a 'Segmentation fault' at the first VST instruction. If I remove the VST instructions, the program runs to completion. In GDB everything looks correct (register values etc.); the error only appears at the store to memory.

Appreciate any guidance or help...

.global main
.func main
main:

.macro mul_col_f32 res_q, col0_d, col1_d
vmul.f32    \res_q, q8,  \col0_d[0] @ multiply col element 0 by matrix col 0
vmla.f32    \res_q, q9,  \col0_d[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32    \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32    \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
.endm

LDR R0, =result0a
LDR R1, =result1a
LDR R2, =result2a

vld1.32  {d16-d19}, [r1]!   @ load the first eight elements of matrix 0
vld1.32  {d20-d23}, [r1]!   @ load the second eight elements of matrix 0
vld1.32  {d0-d3}, [r2]!         @ load the first eight elements of matrix 1
vld1.32  {d4-d7}, [r2]!         @ load the second eight elements of matrix 1

mul_col_f32 q12, d0, d1     @ matrix 0 * matrix 1 col 0
mul_col_f32 q13, d2, d3     @ matrix 0 * matrix 1 col 1
mul_col_f32 q14, d4, d5     @ matrix 0 * matrix 1 col 2
mul_col_f32 q15, d6, d7     @ matrix 0 * matrix 1 col 3

vst1.32  {d24-d27}, [r0]!   @ store first eight elements of result.
vst1.32  {d28-d31}, [r0]!   @ store second eight elements of result.

MOV R7, #1
SWI 0

result1a:   .word 0xFFFFFFFF    @ d16
result1b:   .word 0xEEEEEEEE    @ d16
result1c:   .word 0xDDDDDDDD    @ d17
result1d:   .word 0xCCCCCCCC    @ d17
result1e:   .word 0xBBBBBBBB    @ d18
result1f:   .word 0xAAAAAAAA    @ d18
result1g:   .word 0x99999999    @ d19
result1h:   .word 0x88888888    @ d19

result2a:   .word 0x77777777    @ d0
result2b:   .word 0x66666666    @ d0
result2c:   .word 0x55555555    @ d1
result2d:   .word 0x44444444    @ d1
result2e:   .word 0x33333333    @ d2
result2f:   .word 0x22222222    @ d2
result2g:   .word 0x11111111    @ d3
result2h:   .word 0x0F0F0F0F    @ d3

result0a:   .word 0x0       @ R0
result0b:   .word 0x0       @ R0
result0c:   .word 0x0       @ R0
result0d:   .word 0x0       @ R0
result0e:   .word 0x0       @ R0
result0f:   .word 0x0       @ R0
result0g:   .word 0x0       @ R0
result0h:   .word 0x0       @ R0

For all of your load and store operations you're loading and storing 16 words (64 bytes), but only defining room for 8 (32 bytes). This 'works' on reading the input, but when storing the output you're writing 32 bytes past the end of the data area you've defined. — BitBank, Feb 14 '17 at 12:57
Firstly, you should strongly consider using C/C++ and intrinsics rather than assembly. Assembly these days is mostly a 'read-only' language useful for debugging, so you can just look at the disassembly of the intrinsics code. Second, [DirectXMath](https://github.com/Microsoft/DirectXMath) includes intrinsics codepaths for ARM32 and ARM64. — Chuck Walbourn, Feb 14 '17 at 16:45
I have tried extending the data area, and I have reworked the source so that the result is placed at the start of free memory and would therefore, at worst, overwrite the original data. The segmentation fault still occurs. — Smithy, Feb 17 '17 at 06:08
Using 'sudo' to run the file as superuser does not cause the segmentation fault. However, the result is not computed: registers q12 to q15 all remain at zero. Initial values are loaded correctly. — Smithy, Feb 17 '17 at 09:51
@ChuckWalbourn I think you've never met a capable assembly programmer. Intrinsics generated machine codes are VERY lackluster, and intrinsics are way more confusing than the instructions themselves. It's so pointless. — Jake 'Alquimista' LEE, Oct 26 '17 at 05:20
It's probably true that many ARM architectures are still within the realm of 'hand optimization' to good advantage, but modern Intel architectures have so much out-of-order stuff going on that it is very difficult to make a general solution across all target systems with varying cache sizes and microarchitecture. The primary value of intrinsics is that they play well with C/C++ code optimization, leaving the register and instruction scheduling in the hands of the compiler rather than making a lot of assumptions about the context of the code. — Chuck Walbourn, Oct 26 '17 at 05:53
@Jake'Alquimista'LEE and Chuck: For x86 you can usually get the asm you want by coding with intrinsics, except sometimes when some compilers will pessimize something by using a different instruction. (clang has a good shuffle optimizer, but sometimes its choices aren't the best choices. See for example my [SSE horizontal sums](https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86) answer, where careful use of intrinsics gives portable / future-proof code that compiles to what I want for current CPUs / compilers.) — Peter Cordes, Oct 26 '17 at 06:04
@ChuckWalbourn: out-of-order execution makes the CPU *less* sensitive to instruction ordering. Optimal instruction choice (of which instruction, not just which order to put the instructions) can vary with microarchitecture, and sometimes compilers can choose different instructions for `-march=sandybridge` vs. `-march=skylake` vs. `-march=bdver2` (AMD Piledriver). Or especially `-march=knl` (Knight's Landing). Anyway, coding in asm isn't usually needed, but it's not really harder for x86. I agree with Jake that intrinsics are harder to write when you know what asm would be optimal. — Peter Cordes, Oct 26 '17 at 06:35

Jake 'Alquimista' LEE · Answer 1 · 2017-10-26T06:30:41.707

You are trying to write 8 * 8 = 64 bytes (eight D registers of 8 bytes each), while you allocated only 4 * 8 = 32 bytes in total (eight 4-byte words).
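
The arithmetic behind those two figures can be checked with a small C sketch. The function names here are hypothetical and exist only for illustration; the sketch assumes the posted layout of eight `.word` entries for the result and two `vst1.32` stores of four D registers each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { D_REG_BYTES = 8 };  /* one NEON D register holds 64 bits */

/* Bytes moved by one vld1.32 / vst1.32 whose register list names n_dregs
 * D registers (hypothetical helper, not part of any library). */
size_t neon_list_bytes(unsigned n_dregs)
{
    assert(n_dregs >= 1 && n_dregs <= 4);  /* vld1/vst1 lists hold 1 to 4 D registers */
    return (size_t)n_dregs * D_REG_BYTES;
}

/* Bytes reserved by n_words consecutive .word directives (32 bits each). */
size_t words_bytes(unsigned n_words)
{
    return (size_t)n_words * sizeof(uint32_t);
}

/* The posted code stores through r0 twice: {d24-d27} and then {d28-d31}. */
size_t bytes_stored(void)
{
    return 2 * neon_list_bytes(4);
}

/* The posted code reserves result0a..result0h, i.e. eight words. */
size_t bytes_reserved(void)
{
    return words_bytes(8);
}
```

Under these assumptions `bytes_stored()` exceeds `bytes_reserved()` by 32 bytes, which is the overrun described in the comments on the question. The same count applies to each pair of loads through r1 and r2.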

In addition, you are most probably trying to write into a read-only area: the data is assembled into the .text section, which is normally mapped without write permission.

Why don't you call your function from C/C++, passing the addresses you allocated via malloc?
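
A minimal sketch of such a C caller follows. It is an illustration under assumptions, not the ARM example itself: `matmul4x4_ref` and `alloc_mat4` are hypothetical names, and the scalar function stands in for the NEON routine so that the sketch compiles on any machine. It uses the same column-major layout and the same result/matrix 0/matrix 1 argument order as the posted code.

```c
#include <assert.h>
#include <stdlib.h>

enum { MAT_ELEMS = 16 };  /* 4x4 matrix of float: 16 * 4 = 64 bytes */

/* Scalar stand-in for the NEON routine (hypothetical name).
 * Column-major storage: element (row r, column c) is at index c*4 + r.
 * Each result column is the sum over k of (column k of m0) scaled by
 * element k of the matching m1 column, as in the mul_col_f32 macro. */
void matmul4x4_ref(float *result, const float *m0, const float *m1)
{
    assert(result != NULL && m0 != NULL && m1 != NULL);
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += m0[k * 4 + r] * m1[c * 4 + k];
            result[c * 4 + r] = acc;
        }
    }
}

/* Allocate one writable, zero-filled 4x4 matrix on the heap.
 * Returns NULL on failure; the caller releases it with free(). */
float *alloc_mat4(void)
{
    return calloc(MAT_ELEMS, sizeof(float));
}
```

A caller would obtain three buffers from `alloc_mat4()`, fill the two inputs, call the routine, and `free()` all three afterwards. To use the assembly version instead, the stand-in would be replaced by an `extern` declaration with the same three pointer parameters; under the ARM procedure call standard those arrive in r0, r1 and r2, so the `LDR Rn, =label` lines would no longer be needed.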

Alternatively, you could simply use the stack.

edited Oct 26 '17 at 06:30

answered Oct 26 '17 at 05:24

