Android NDK: vector.resize() is too slow, related to allocation?

Question

There is very odd thing that I faced in Android NDK.

I have a loop

#include <chrono>
#include <android/log.h>
#include <vector>

while (true)
    {
        const int sz = 2048*2048*3;
        std::vector<unsigned char> v;
        {
            auto startTime = std::chrono::system_clock::now();
            v.resize(sz);
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - startTime);
            __android_log_print(ANDROID_LOG_ERROR, "READFILE 1", "v.resize(%d) time : %lld\n", sz, duration.count());
        }
        {
            auto startTime = std::chrono::system_clock::now();
            v.resize(0);
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - startTime);
            __android_log_print(ANDROID_LOG_ERROR, "READFILE 2", "v.resize(0) time : %lld\n", duration.count());
        }
        {
            auto startTime = std::chrono::system_clock::now();
            v.resize(sz);
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - startTime);
            __android_log_print(ANDROID_LOG_ERROR, "READFILE 3", "v.resize(%d) time : %lld\n", sz, duration.count());
        }
    }

And there is a log that I get:

34.4171: v.resize(12582912) time : 845977
34.9682: v.resize(0) time : 550995
35.5293: v.resize(12582912) time : 561165
36.6121: v.resize(12582912) time : 530845
37.1612: v.resize(0) time : 548528
37.7183: v.resize(12582912) time : 556559
38.7811: v.resize(12582912) time : 515162
39.3312: v.resize(0) time : 550630
39.8883: v.resize(12582912) time : 556319
40.9711: v.resize(12582912) time : 530739
41.5182: v.resize(0) time : 546654
42.0733: v.resize(12582912) time : 554924
43.1321: v.resize(12582912) time : 511659
43.6802: v.resize(0) time : 547084
44.2373: v.resize(12582912) time : 557001
45.3201: v.resize(12582912) time : 530313

So, firstly

as you can see I get 550 milliseconds just for resize(0)... It should be maximum 1 MICRO second not MILLI
and secondly why it get again 550 millisecond for resize(size) if capacity of vector wasn't changed?

It is 2 very odd behavior.

You are welcome to take this snip of code and check for yourself if you don't believe me:) But just check in on Android NDK, not Visual Studio project, because there it is works like it should.

It is really looks like bug...

Or what am I doing wrong?

EDIT

I checked that if go down to resize() method I came to such loop

template <class _Tp, class _Allocator>
inline _LIBCPP_INLINE_VISIBILITY
void
__vector_base<_Tp, _Allocator>::__destruct_at_end(pointer __new_last) _NOEXCEPT
{
    pointer __soon_to_be_end = __end_;
    while (__new_last != __soon_to_be_end)
        __alloc_traits::destroy(__alloc(), _VSTD::__to_raw_pointer(--__soon_to_be_end));
    __end_ = __new_last;
}

So, it is means that there is a loop that goes over every element that in resize range and call destroy

And there is no problem IF you hold not trivial objects that has a destructor, BUT if you hold in vector(like in my case) int objects which are trivial and they don't have a destructor, so... it is very strange behaviour, how you can call destructor from object that actually don't have a destructor?

Is it looks like compiler bug?

_"It should be maximum 1 MACRO second"_ What is a m**a**crosecond? — Michael, Nov 07 '19 at 09:44
Check if `resize(0)` changes the capacity to 0. Technically it can as `resize(0)` would invalidate all iterators. — Richard Critten, Nov 07 '19 at 10:49
By the way, when measuring time you should probably be using `steady_clock` (or `high_resolution_clock` if `is_steady` is `true` for it). — Michael, Nov 07 '19 at 10:55
What level of optimization are you using? This seems like the sort of thing that you would see at -O0 but that might go away at -O2. — Andy Jewell, Nov 13 '19 at 13:22
@AndyJewell how can I check actual optimization level? According to my gradle file it is standard ` getDefaultProguardFile('proguard-android.txt')` — Sirop4ik, Nov 13 '19 at 14:04
@AlekseyTimoshchenko Sorry, I know nothing about android development specifically. Somewhere there is a command line that compiles your source code. That command line ought to have something like "-O2" for best results. — Andy Jewell, Nov 13 '19 at 19:23
@AlekseyTimoshchenko the proguard is used for the java code to find the c++ optimization check the CMakelist.txt or the gradle build script, most probable you are using the default but I am not able to find what this is. But it's different if you are running in release or debug. You can also check this https://stackoverflow.com/questions/41321496/android-studio-c-optimization-parameters-performance — madlymad, Nov 16 '19 at 16:38

score 1 · Answer 1 · edited Nov 17 '19 at 07:10

First and foremost, implementation for many library functionalities strongly rely on compiler optimizations. Deleting objects in container can call destroy which in turn for trivially destructible objects will do nothing. If it does nothing, then all logic will be optimized out by compiler. There's a lot of logic involved in destructing objects in STL, just take a look. Essentially destroy is called to ensure that it handles all cases including custom allocators. It has to compile, so for trivial types it has to resolve to something defined and doing nothing is still something defined. It's just to have code as clean as possible. Single responsibility, deallocator decides how and if objects needs to be destructed.

As for your main question, do you use optimizations? That's the first and most important question. Any code without optimizations is just guaranteed to work. Even complexity provided by reference can be different for not optimized code. You can clearly see that first reallocation took almost twice as much time, rest of them is quite stable.

Do you have much better times with other operations of this type? Did you try to compare to plain array performance?

Technically, the Android NDK uses LLVM libc++, not GNU stdlibc++ -- but the implementations are very similar. Here's the corresponding source for [resize](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/vector#L2016) and [__destruct_at_end](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/vector#L419). — Snild Dolkow, Nov 17 '19 at 07:13

score 1 · Answer 2 · answered Nov 18 '19 at 11:02

Thanks to @Snild Dolkow, @Maciej Załucki and @Andy Jewell

Eventually problem was in optimization level

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

If you use CMake, so use this code

https://stackoverflow.com/a/45333618/5709159

target_compile_options(opende PRIVATE
"$<$<CONFIG:RELEASE>:-O3>"
"$<$<CONFIG:DEBUG>:-O3>"
)

But choose optimization level that you need

if you use Application.mk use this code

https://stackoverflow.com/a/18433696/5709159

score 0 · Accepted Answer · answered Nov 17 '19 at 08:18

Adding to Maciej's answer and Andy's comment, let's check the code that is generated.

Using this Makefile:

CXX = $(NDKPATH)/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++
CC = $(NDKPATH)/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++
INC = -I$(NDKPATH)/cxx-stl/llvm-libc++/include/
LIB = -L$(NDKPATH)/cxx-stl/llvm-libc++/lib/
CXXFLAGS = -ggdb -O$(OPTLEVEL)

.PHONY: all clean dump

all: dump

dump: test
    $(NDKPATH)/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin/objdump -d -C test | gawk '/<big|<small|::resize/ {p=1} /^$$/ {p=0} {if (p) print $0}'

clean:
    $(RM) test.o test

test: test.o

...and a very simple test.cpp:

#include <vector>

using std::vector;

void big(vector<int>& v) {
    v.resize(10000000);
}

void small(vector<int>& v) {
    v.resize(0);
}

int main() {
    return 0;
}

Compiling without optimization (-O0), note how both big() and small() call resize(), which does a whole bunch of stuff in a loop (as you've also found in the source code).

ndk-vector-speed$ export NDKPATH=~/.androidsdk/ndk-bundle
ndk-vector-speed$ make clean && OPTLEVEL=0 make dump
rm -f test.o test
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++ -ggdb -O0   -c -o test.o test.cpp
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++   test.o   -o test
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin/objdump -d -C test | gawk '/<big|<small|::resize/ {p=1} /^$/ {p=0} {if (p) print }'
0000000000000f04 <big(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)>:
     f04:   d10083ff    sub sp, sp, #0x20
     f08:   a9017bfd    stp x29, x30, [sp,#16]
     f0c:   910043fd    add x29, sp, #0x10
     f10:   d292d001    mov x1, #0x9680                 // #38528
     f14:   f2a01301    movk    x1, #0x98, lsl #16
     f18:   f90007e0    str x0, [sp,#8]
     f1c:   f94007e0    ldr x0, [sp,#8]
     f20:   94000013    bl  f6c <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)>
     f24:   a9417bfd    ldp x29, x30, [sp,#16]
     f28:   910083ff    add sp, sp, #0x20
     f2c:   d65f03c0    ret
0000000000000f30 <small(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)>:
     f30:   d10083ff    sub sp, sp, #0x20
     f34:   a9017bfd    stp x29, x30, [sp,#16]
     f38:   910043fd    add x29, sp, #0x10
     f3c:   d2800001    mov x1, #0x0                    // #0
     f40:   f90007e0    str x0, [sp,#8]
     f44:   f94007e0    ldr x0, [sp,#8]
     f48:   94000009    bl  f6c <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)>
     f4c:   a9417bfd    ldp x29, x30, [sp,#16]
     f50:   910083ff    add sp, sp, #0x20
     f54:   d65f03c0    ret
0000000000000f6c <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)>:
     f6c:   d100c3ff    sub sp, sp, #0x30
     f70:   a9027bfd    stp x29, x30, [sp,#32]
     f74:   910083fd    add x29, sp, #0x20
     f78:   f81f83a0    stur    x0, [x29,#-8]
     f7c:   f9000be1    str x1, [sp,#16]
     f80:   f85f83a0    ldur    x0, [x29,#-8]
     f84:   f90003e0    str x0, [sp]
     f88:   94000020    bl  1008 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::size() const>
     f8c:   f90007e0    str x0, [sp,#8]
     f90:   f94007e0    ldr x0, [sp,#8]
     f94:   f9400be1    ldr x1, [sp,#16]
     f98:   eb01001f    cmp x0, x1
     f9c:   1a9f27e8    cset    w8, cc
     fa0:   37000048    tbnz    w8, #0, fa8 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)+0x3c>
     fa4:   14000007    b   fc0 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)+0x54>
     fa8:   f9400be8    ldr x8, [sp,#16]
     fac:   f94007e9    ldr x9, [sp,#8]
     fb0:   eb090101    subs    x1, x8, x9
     fb4:   f94003e0    ldr x0, [sp]
     fb8:   9400001e    bl  1030 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::__append(unsigned long)>
     fbc:   14000010    b   ffc <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)+0x90>
     fc0:   f94007e8    ldr x8, [sp,#8]
     fc4:   f9400be9    ldr x9, [sp,#16]
     fc8:   eb09011f    cmp x8, x9
     fcc:   1a9f97ea    cset    w10, hi
     fd0:   3700004a    tbnz    w10, #0, fd8 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)+0x6c>
     fd4:   1400000a    b   ffc <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::resize(unsigned long)+0x90>
     fd8:   b27e03e8    orr x8, xzr, #0x4
     fdc:   f94003e9    ldr x9, [sp]
     fe0:   f9400129    ldr x9, [x9]
     fe4:   f9400bea    ldr x10, [sp,#16]
     fe8:   9b0a7d08    mul x8, x8, x10
     fec:   8b080128    add x8, x9, x8
     ff0:   f94003e0    ldr x0, [sp]
     ff4:   aa0803e1    mov x1, x8
     ff8:   94000054    bl  1148 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::__destruct_at_end(int*)>
     ffc:   a9427bfd    ldp x29, x30, [sp,#32]
    1000:   9100c3ff    add sp, sp, #0x30
    1004:   d65f03c0    ret

With -O2, the compiler can do lots of optimization for us.

First of all, resize() is completely gone; it has been removed because no one needs it anymore.

big() has inlined what it needs from resize(), calling __append() directly instead, and looks generally simpler than the full resize() function we called before. Since I haven't run this code, I can't make any claims regarding how much this helps with speed.

small() now has no function calls, no loops, and only five instructions (which I've annotated manually below). It has essentially become if (v.begin != v.end) v.end = v.begin. This will of course be very fast.

ndk-vector-speed$ make clean && OPTLEVEL=2 make dump
rm -f test.o test
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++ -ggdb -O2   -c -o test.o test.cpp
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++   test.o   -o test
/home/snild/.androidsdk/ndk-bundle/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin/objdump -d -C test | gawk '/<big|<small|::resize/ {p=1} /^$/ {p=0} {if (p) print }'
0000000000000e64 <big(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)>:
     e64:   a9402408    ldp x8, x9, [x0]
     e68:   5292d00a    mov w10, #0x9680                    // #38528
     e6c:   72a0130a    movk    w10, #0x98, lsl #16
     e70:   cb080129    sub x9, x9, x8
     e74:   9342fd2b    asr x11, x9, #2
     e78:   eb0a017f    cmp x11, x10
     e7c:   54000062    b.cs    e88 <big(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)+0x24>
     e80:   cb0b0141    sub x1, x10, x11
     e84:   14000011    b   ec8 <std::__ndk1::vector<int, std::__ndk1::allocator<int> >::__append(unsigned long)>
     e88:   528b400a    mov w10, #0x5a00                    // #23040
     e8c:   72a04c4a    movk    w10, #0x262, lsl #16
     e90:   eb0a013f    cmp x9, x10
     e94:   540000a0    b.eq    ea8 <big(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)+0x44>
     e98:   528b4009    mov w9, #0x5a00                 // #23040
     e9c:   72a04c49    movk    w9, #0x262, lsl #16
     ea0:   8b090108    add x8, x8, x9
     ea4:   f9000408    str x8, [x0,#8]
     ea8:   d65f03c0    ret
0000000000000eac <small(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)>:
     eac:   a9402408    ldp x8, x9, [x0]  // load the first two values (begin and end) from v
     eb0:   eb08013f    cmp x9, x8        // compare them
     eb4:   54000040    b.eq    ebc <small(std::__ndk1::vector<int, std::__ndk1::allocator<int> >&)+0x10>
                                          // skip to 'ret' if they were equal
     eb8:   f9000408    str x8, [x0,#8]   // write v.begin to v.end
     ebc:   d65f03c0    ret               // return.

Conclusion: Maciej and Andy are correct; you're not building with optimizations enabled.

I have posted this question again, but with a small difference, could you take a look https://stackoverflow.com/q/59377604/5709159 — Sirop4ik, Dec 17 '19 at 15:42

Android NDK: vector.resize() is too slow, related to allocation?

3 Answers3

Linked