I'm trying to implement RGB-to-grayscale image conversion in C++ with ARM NEON, specifically using inline assembly.
I wrote a version with inline assembly that compiles and runs correctly, but it is not that fast. The processing of each pixel line can be described as Loop[AB].
As an optimization trick, I tried to turn Loop[AB] into A Loop[BA] B. However, that version fails to compile with:
error: invalid symbol redefinition
"L4End: \n"
^
<inline asm>:28:1: note: instantiated into assembly here
L4End:
^
1 error generated.
template<int b_idx>
void rgb2gray_opt6(const unsigned char* rgb, int rgb_height, int rgb_width, int rgb_linebytes, unsigned char* gray, int gray_linebytes, const Option& opt)
{
for (int i=0; i<rgb_height; i++)
{
const uchar* rgb_line = rgb + i*rgb_linebytes;
uchar* gray_line = gray + i*gray_linebytes;
#if __ARM_NEON
size_t nn = rgb_width >> 4;
int remain = rgb_width - (nn << 4);
#else
int remain = rgb_width;
#endif // __ARM_NEON
#if __ARM_NEON
#if 0 // this compiles OK, but may not be that fast
__asm__ volatile(
"movi v0.8b, #77 \n"
"movi v1.8b, #151 \n"
"movi v2.8b, #28 \n"
"0: \n"
"prfm pldl1keep, [%[src], #512] \n"
"ld3 {v3.16b-v5.16b}, [%[src]], #48 \n"
"ext v6.16b, v3.16b, v3.16b, #8 \n"
"ext v7.16b, v4.16b, v4.16b, #8 \n"
"ext v8.16b, v5.16b, v5.16b, #8 \n"
"umull v9.8h, v3.8b, v0.8b \n"
"umull v10.8h, v6.8b, v0.8b \n"
"umull v11.8h, v4.8b, v1.8b \n"
"umlal v9.8h, v5.8b, v2.8b \n"
"umull v12.8h, v7.8b, v1.8b \n"
"umlal v10.8h, v8.8b, v2.8b \n"
"addhn v13.8b, v9.8h, v11.8h \n"
"addhn2 v13.16b, v10.8h, v12.8h \n"
"subs %[neonLen], %[neonLen], #1 \n"
"st1 {v13.16b}, [%[dst]], #16 \n"
"bne 0b \n"
:[src] "+r"(rgb_line),
[dst] "+r"(gray_line),
[neonLen] "+r"(nn)
:
:"cc", "memory", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13"
);
#else // intended to be faster: turns Loop[AB] into A Loop[BA] B, but fails to compile
asm volatile(
"movi v0.8b, #77 \n"
"movi v1.8b, #151 \n"
"movi v2.8b, #28 \n"
"prfm pldl1keep, [%0, #512] \n"
"ld3 {v3.16b-v5.16b}, [%0], #48 \n"
"ext v6.16b, v3.16b, v3.16b, #8 \n"
"ext v7.16b, v4.16b, v4.16b, #8 \n"
"ext v8.16b, v5.16b, v5.16b, #8 \n"
"subs %w2, %w2, #1 \n"
"ble L4End \n" //!!! This line compile error
"0: \n"
"prfm pldl1keep, [%0, #512] \n"
"ld3 {v3.16b-v5.16b}, [%0], #48 \n"
"ext v6.16b, v3.16b, v3.16b, #8 \n"
"ext v7.16b, v4.16b, v4.16b, #8 \n"
"ext v8.16b, v5.16b, v5.16b, #8 \n"
"umull v9.8h, v3.8b, v0.8b \n"
"umull v10.8h, v6.8b, v0.8b \n"
"umull v11.8h, v4.8b, v1.8b \n"
"umlal v9.8h, v5.8b, v2.8b \n"
"umull v12.8h, v7.8b, v1.8b \n"
"umlal v10.8h, v8.8b, v2.8b \n"
"addhn v13.8b, v9.8h, v11.8h \n"
"addhn2 v13.16b, v10.8h, v12.8h \n"
"subs %w2, %w2, #1 \n"
"st1 {v13.16b}, [%1], #16 \n"
"bne L4Loop \n"
"L4End: \n"
"umull v9.8h, v3.8b, v0.8b \n"
"umull v10.8h, v6.8b, v0.8b \n"
"umull v11.8h, v4.8b, v1.8b \n"
"umlal v9.8h, v5.8b, v2.8b \n"
"umull v12.8h, v7.8b, v1.8b \n"
"umlal v10.8h, v8.8b, v2.8b \n"
"addhn v13.8b, v9.8h, v11.8h \n"
"addhn2 v13.16b, v10.8h, v12.8h \n"
"st1 {v13.16b}, [%1], #16 \n"
: "=r"(rgb_line), // %0
"=r"(gray_line), // %1
"=r"(nn) // %2
: "0"(rgb_line),
"1"(gray_line),
"2"(nn)
: "cc", "memory", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13"
);
#endif // if 0
#endif // __ARM_NEON
for (int j=rgb_width-remain; j<rgb_width; j++) // process the pixels left over after the NEON loop
{
int src_idx = i*rgb_linebytes + j*3;
int r = rgb[src_idx + 2 - b_idx];
int g = rgb[src_idx + 1];
int b = rgb[src_idx + b_idx];
int dst_idx = i*gray_linebytes + j;
gray[dst_idx] = (77*r + 151*g + 28*b) >> 8;
}
}
}
Environment: I build this program with Android NDK r21b, which bundles a fairly recent version of clang.
Note: This question is about how to resolve a compile error related to inline assembly, not about how to optimize the function. Comments about this being a "bad optimization idea" are unrelated to the purpose of this question.
Update
Since this question has been marked as a duplicate, I am putting my answer here, based on the comments. (The content of the only answer is still unverified; I will try it when I have time.)
- Can named (string literal) labels be used in inline assembly?
In some functions yes, in others no. In the code I pasted, template<int b_idx> causes the function to be inlined, so the asm block and its label end up duplicated. Removing the template parameter resolves it.
- Are numeric (local) labels OK?
Actually, I tried numeric labels first but did not know the difference between the b and f suffixes. They mean backward and forward. Knowing that, it is easy to correct the error and compile successfully.
- Still want to use named labels?
Use some_label%= instead of some_label. The %= expands to a number that is unique for each instance of the asm statement, which makes the label unique.
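The two label styles from the list above can be sketched on a much smaller loop than the NEON kernel. This is only an illustration, not the fixed version of the function in the question: the helper names `add_n_times_named` and `add_n_times_numeric` are made up here, and the non-AArch64 fallback exists only so the snippet builds on other targets. The functions are templates on purpose, so that several instantiations of the same asm body land in one translation unit.

```cpp
#include <cstdint>

// Hypothetical helper (not from the question): adds Step to an accumulator
// n times. Named labels carry "%=", so each emitted copy of the asm block
// gets its own label names.
template <int Step>
static inline std::uint64_t add_n_times_named(std::uint64_t n)
{
    std::uint64_t acc = 0;
#if defined(__aarch64__)
    __asm__ volatile(
        "cbz  %[n], done%=              \n" // skip the loop when n == 0
        "loop%=:                        \n"
        "add  %[acc], %[acc], %[step]   \n"
        "subs %[n], %[n], #1            \n"
        "bne  loop%=                    \n"
        "done%=:                        \n"
        : [acc] "+&r"(acc), [n] "+&r"(n)
        : [step] "r"(static_cast<std::uint64_t>(Step))
        : "cc");
#else
    for (; n != 0; --n) acc += Step;    // portable fallback, same result
#endif
    return acc;
}

// Same loop with numeric local labels: "1f" refers to the next "1:" going
// forward, "0b" refers to the previous "0:" going backward.
template <int Step>
static inline std::uint64_t add_n_times_numeric(std::uint64_t n)
{
    std::uint64_t acc = 0;
#if defined(__aarch64__)
    __asm__ volatile(
        "cbz  %[n], 1f                  \n" // forward reference to "1:"
        "0:                             \n"
        "add  %[acc], %[acc], %[step]   \n"
        "subs %[n], %[n], #1            \n"
        "bne  0b                        \n" // backward reference to "0:"
        "1:                             \n"
        : [acc] "+&r"(acc), [n] "+&r"(n)
        : [step] "r"(static_cast<std::uint64_t>(Step))
        : "cc");
#else
    for (; n != 0; --n) acc += Step;    // portable fallback, same result
#endif
    return acc;
}
```

Calling, for example, `add_n_times_named<3>(4)` and `add_n_times_named<5>(4)` from the same function should assemble without a redefinition error even when both calls are inlined, because neither variant defines a fixed symbol name such as `L4End`.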