Code translated to C/NDK/JNI less efficient than Java original

Question

this is my first not-so-deep-dive into NDK.

I wanted to rewrite this code to NDK for performance purposes. My c file looks like this:

#include <jni.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <android/log.h>

JNIEXPORT jbyteArray JNICALL
Java_com_company_app_tools_NV21FrameRotator_rotateNV21(JNIEnv *env, jclass thiz,
                                                           jbyteArray data, jbyteArray output,
                                                           jint width, jint height, jint rotation) {
    clock_t start, end;
    double cpu_time_used;
    start = clock();

    jbyte *dataPtr = (*env)->GetByteArrayElements(env, data, NULL);
    jbyte *outputPtr = (*env)->GetByteArrayElements(env, output, NULL);

    unsigned int frameSize = width * height;
    bool swap = rotation % 180 != 0;
    bool xflip = rotation % 270 != 0;
    bool yflip = rotation >= 180;

    for (unsigned int j = 0; j < height; j++) {
        for (unsigned int i = 0; i < width; i++) {
            unsigned int yIn = j * width + i;
            unsigned int uIn = frameSize + (j >> 1u) * width + (i & ~1u);
            unsigned int vIn = uIn + 1;

            unsigned int wOut = swap ? height : width;
            unsigned int hOut = swap ? width : height;
            unsigned int iSwapped = swap ? j : i;
            unsigned int jSwapped = swap ? i : j;
            unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
            unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;

            unsigned int yOut = jOut * wOut + iOut;
            unsigned int uOut = frameSize + (jOut >> 1u) * wOut + (iOut & ~1u);
            unsigned int vOut = uOut + 1;

            outputPtr[yOut] = (jbyte) (0xff & dataPtr[yIn]);
            outputPtr[uOut] = (jbyte) (0xff & dataPtr[uIn]);
            outputPtr[vOut] = (jbyte) (0xff & dataPtr[vIn]);
        }
    }

    (*env)->ReleaseByteArrayElements(env, data, dataPtr, 0);
    (*env)->ReleaseByteArrayElements(env, output, outputPtr, 0);

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    char str[10];
    sprintf(str, "%f", cpu_time_used * 1000);
    __android_log_write(ANDROID_LOG_ERROR, "NV21FrameRotator", str);

    return output;
}

both snippets (linked Java and above) works well, but when I measure processing duration it looks like on same device Java version takes about 7 ms (Log.i( Java side log) and C 12-13 ms... Shouldn't be faster, why it is't? Where is the catch?

long micros = System.nanoTime() / 1000;
// ~7ms, Java
//data = rotateNV21(inputData, width, height, rotateCameraDegrees);
// ~12-13ms, C
NV21FrameRotator.rotateNV21(inputData, data, width, height, rotateCameraDegrees);
Log.d(TAG, "Last frame processing duration: " + (System.nanoTime() / 1000 - micros) + "µs");

PS. Java log sometimes is showing shorter duration than native clock() measurement in c file... sample log:

NV21FrameRotator: 7.942000
NV21RotatorJava: Last frame processing duration: 7403µs
NV21FrameRotator: 7.229000
NV21RotatorJava: Last frame processing duration: 7166µs
NV21FrameRotator: 16.918000
NV21RotatorJava: Last frame processing duration: 20644µs
NV21FrameRotator: 19.594000
NV21RotatorJava: Last frame processing duration: 20479µs
NV21FrameRotator: 9.484000
NV21RotatorJava: Last frame processing duration: 7274µs

edit: compile_commands.json for armeabi-v7a (old device, I'm building only this one)

[
{
  "directory": "...app/.cxx/cmake/basicRelease/armeabi-v7a",
  "command": "...sdk\\ndk\\21.0.6113669\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\clang.exe --target=armv7-none-linux-androideabi21 --gcc-toolchain=...sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/windows-x86_64 --sysroot=...sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/windows-x86_64/sysroot -DNV21FrameRotator_EXPORTS  -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -march=armv7-a -mthumb -Wformat -Werror=format-security  -Oz -DNDEBUG  -fPIC   -o CMakeFiles\\NV21FrameRotator.dir\\NV21FrameRotator.c.o   -c ...app\\src\\main\\cpp\\NV21FrameRotator.c",
  "file": "...app\\src\\main\\cpp\\NV21FrameRotator.c"
}
]

CMakeFile:

cmake_minimum_required(VERSION 3.4.1)
add_library(NV21FrameRotator SHARED
    NV21FrameRotator.c)
find_library(log-lib
    log )
target_link_libraries(NV21FrameRotator
    ${log-lib} )

You shouldn't be leaving out `ReleaseByteArrayElements`. Pass `JNI_ABORT` as the mode if you wish, but you should still call it for both of the pointers that you obtained with `GetByteArrayElements`. — Michael, Sep 21 '20 at 15:43
I've thought that releasing output arrays on native side will release also on Java side. Turns out it doesn't, so I've edited code in my question: added two releasing lines after `for` loop. Everything still works, still less efficient than Java version. But thanks for improvement — snachmsm, Sep 21 '20 at 16:16
What is a typical value for `frameSize`? Did you verify that [you are not getting hit with a copy going into JNI and a copy going out](https://stackoverflow.com/questions/21691356/ndk-does-getbytearrayelements-copy-data-from-java-to-c/21693632) ? — Botje, Sep 21 '20 at 23:46
Another guess: doing that `swap` comparison for every pixel you touch might not be hoisted out of the loop in c++, while the Android AOT compilation *might* do it. — Botje, Sep 21 '20 at 23:49
`frameSize` value is 76800, data array (`byte[]` on Java side) length is 115200. I've added `jboolean isCopyData;` and passed `&isCopyData` to `GetByteArrayElements`. Printing this value using `sprintf(str, "%hhu", isCopyData);` shows `0`, for both arrays... — snachmsm, Sep 22 '20 at 06:30
Next step is then to profile the inner loop or at least look at the generated code. I think those ternaries are slowing you down, while the Java AOT compilation something smarter. — Botje, Sep 23 '20 at 10:43
I was hoping for some obvious mistake trivial to fix... I'm very begginer in this topic and deep-dive into profiling, memory analysis etc. will take some time, which I can spend on some other cases in which I'm fluent (and this won't be my decision). Maybe some bounty will convince someone to analyze my case... — snachmsm, Sep 23 '20 at 12:48
I would strongly recommend to simply use `libyuv`. On Android, it will automatically choose a NEON implementation which delivers a huge improvement. — Alex Cohn, Sep 24 '20 at 06:42
First of all, make sure that your C build is on **release**. Remove all multiplications from the inner loop, all pointers can be incremented by fixed step. Note that if your input comes from the camera, you have three byte buffers instead of a single NV21 array, and you will see a significant performance gain if you skip the intermediate array. — Alex Cohn, Sep 24 '20 at 07:02
Another recipe to improve the performance: split copy of **y** from **uv**. Now, you perform conversion for each **u** byte four times. BTW, in NV21 **u** comes *after* **v** (but this mistake compensates itself because it's done both on input and output). — Alex Cohn, Sep 24 '20 at 09:00
right, in fact this method rotates NV12 (var naming), but both NV12 and NV21 have U and V planes interleaved, so it would work for both formats. looks like `libyuv` is working on I420 and I'm working on single NV21 `byte[]` (old camera API, old device, no option for `Camera2` or `CameraX`), conversion in both directions in Java will take some time, conversion on NDK side will be unefficient also. why I think so? note the **real** question of my post: **why almost exacly same code written in Java is almost twice faster than C?** — snachmsm, Sep 25 '20 at 09:33
`libyuv` [does handle NV12](https://chromium.googlesource.com/libyuv/libyuv/+/refs/heads/master/include/libyuv/rotate.h#72). — Alex Cohn, Sep 25 '20 at 15:42
As for comparing speed of JNI to Java, please post your C compile flags. Note that `320x240` is very small; the variance of times that you show in your question is big (7.1 to 20.6 ms). You definitely need more statistics to make some conclusions. — Alex Cohn, Sep 25 '20 at 15:50
Thats good point, I will do that, but I think that will just make more iterations through width/height, so diff will be even bigger... About flags: could you please tell me where can I find them? I've just added simple `CMakeFile` (+ point on it in gradle, without determining flags in there) and also simple Java wrapper, thats just started working... About logs: Java-side log is after line from NDK (wrapped), every pair is from one frame (1&2, 3&4, lines etc.), I can see such big differences even with `libyuv`, this probably may depend on scene (light or dark, more or less colors) — snachmsm, Sep 26 '20 at 19:26
@AlexCohn I've tried different resolutions (up to HD), but problem remains - Java implementation is way faster than NDK-side code — snachmsm, Sep 28 '20 at 06:48
You can simply post the `.cxx/cmake/release/arm64-v8a/compile_commands.json` file for your module. But you didn't answer the basic question: is it a debug or release build? And which device did you test? Emulator will not show you reliable times. — Alex Cohn, Sep 28 '20 at 20:01
All above is from debug, I've just made release version and it takes a little less, at most 6 ms... I'm testing only on physical devices, above logs/timings comes from Nokia 8. I've added `compile_commands.json` content to my question — snachmsm, Sep 29 '20 at 05:56
Have you added the optimize for speed build flag in your CMakeFile? "-O3" — PerracoLabs, Sep 29 '20 at 11:24
I've just added `CMakeFile` to question, very basic, no flags... — snachmsm, Sep 29 '20 at 12:23
OK, so with **release** you have 6ms vs. 7ms in Java, twice as fast as **debug**? That's a reasonable improvement, IMHO. The advantage may better be seen on higher resolutions. As @PerracoLabs comments above, you may find that `-O3` is a bit better. — Alex Cohn, Sep 29 '20 at 12:42
Now, speaking about code improvements, you should give a try to separate loops for **Y** and **UV**, as in this [gist](https://gist.github.com/alexcohn/7697892f78f960e823ab8a6e019ce4bb). Note that I did not optimize out the calculations *(that will be the next iteration)*, only took **UV** out. — Alex Cohn, Sep 29 '20 at 12:52
Note that your phone is supposed to support arm64-v8a ABI, and this, too, can make the Java code faster than expected compared to 32-bit C code. — Alex Cohn, Sep 29 '20 at 13:03
@AlexCohn put this gist as answer to this question, your comments and engagement deserve bounty :) I currently have a lot of work, so I will try your gist tomorrow at least... about: 32-bit is sadly a must be, there are some other native snippets (made by previous coders) in this project. I'm using custom distribution system, not through Google Play — snachmsm, Sep 29 '20 at 13:08

Jari Vetoniemi · Answer 1 · 2020-09-27T13:25:34.453

1

JNI has a very high overhead especially when passing non POD types or buffers. So calling a JNI function often may very well be much slower than java version.

Consider passing java.nio.ByteBuffer instead to avoid potential copies for the byte array.

edited Sep 27 '20 at 13:25

answered Sep 27 '20 at 10:48

Jari Vetoniemi

49
1
2

I will shurely do that (`ByteBuffer`) as a test, thanks for suggestion. about: "JNI has a very high overhead" can you refer to any source? If this is true then creating such "small" methods in C/JNI isn't worth... but still: I've measured duration of `for` loops and it is larger that Java's `Log` call which wraps whole Java function (same `for` loops + some initializations)... so even way more complicated methods which would work on "simple math" (some `int`s, `bool`s, adding, multiplying etc.) will also take more time, making JNI "idea" not worth trying (and I know this isn't true...) – snachmsm Sep 27 '20 at 18:09
1

Here's quote from google about JNI use for micro-optimization: https://developer.android.com/training/articles/perf-tips#NativeMethods – Jari Vetoniemi Sep 27 '20 at 19:27
sadly `ByteBuffer` seems to perform same as `byte[]`. In fact after few tests with different resolutions `ByteBuffer` had 5-10% worse performance, but this may depend on scene composition and quality – snachmsm Sep 28 '20 at 07:07

Alex Cohn · Answer 2 · 2020-10-01T13:16:43.403

Compare performance of Java to C on a real device, the emulator will not produce reliable results.
Compare performance of Java to C on release build, the debug in C is slow, while Java still gets full JIT (and AOT) optimization.
You may look for the best optimization choice for your scenario. By default, the release will be built with -Oz. To prefer speed over size, you can add to your build.gradle:
```
android {
  buildTypes {
    release {
      externalNativeBuild.cmake.cFlags "-O3"
    }
  }
}
```
Your C code (and actually the original Java code) begs some optimization. The main inefficiency (as far as I can tell) is that you recalculate each U and V value four times. The easy fix is to split the loops.
Further optimization can avoid multiplication in inner loop (in the outer loop, it can be eliminated, too, but the effect will be negligible):

#include <jni.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <android/log.h>

JNIEXPORT jbyteArray JNICALL
Java_com_company_app_tools_NV21FrameRotator_rotateNV21(JNIEnv *env, jclass thiz,
                                                       jbyteArray data, jbyteArray output,
                                                       jint width, jint height, jint rotation) {
    clock_t start, end;
    double cpu_time_used;
    start = clock();

    jbyte *dataPtr = (*env)->GetByteArrayElements(env, data, NULL);
    jbyte *outputPtr = (*env)->GetByteArrayElements(env, output, NULL);

    unsigned int frameSize = width * height;
    bool swap = rotation % 180 != 0;
    bool xflip = rotation % 270 != 0;
    bool yflip = rotation >= 180;

    unsigned int wOut = swap ? height : width;
    unsigned int hOut = swap ? width : height;
    unsigned int yIn = 0;

    for (unsigned int j = 0; j < height; j++) {

        unsigned int iSwapped = swap ? j : 0;
        unsigned int jSwapped = swap ? 0 : j;
        unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
        unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;
        unsigned int yOut = jOut * wOut + iOut;

        for (unsigned int i = 0; i < width; i++) {
            outputPtr[yOut] = dataPtr[yIn];
            if (swap) {
                yOut += yflip ? -wOut : wOut;
            } else {
                yOut += xflip ? -1 : 1;
            }
            yIn++;
        }
    }

    unsigned int uIn = frameSize;

    for (unsigned int j = 0; j < height; j+=2) {

        unsigned int iSwapped = swap ? j : 0;
        unsigned int jSwapped = swap ? 0 : j;
        unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
        unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;
        unsigned int uOut = frameSize + (jOut / 2) * wOut + (iOut & ~1u);

        for (unsigned int i = 0; i < width; i+=2) {
            unsigned int vIn = uIn + 1;
            unsigned int vOut = uOut + 1;

            outputPtr[uOut] = dataPtr[uIn];
            outputPtr[vOut] = dataPtr[vIn];

            if (swap) {
                uOut += yflip ? -wOut : wOut;
            } else {
                uOut += xflip ? -2 : 2;
            }
            uIn += 2;
        }
    }

    (*env)->ReleaseByteArrayElements(env, data, dataPtr, JNI_ABORT);
    (*env)->ReleaseByteArrayElements(env, output, outputPtr, 0);

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    __android_log_print(ANDROID_LOG_ERROR, "NV21FrameRotator", "%.1f ms", cpu_time_used * 1000);

    return output;
}

FYI: your improved method is reducing duration by half (at most 7.5 ms on debug), thats nice! `-O3` flag doesn't make any significant difference in my case. production device is already using probably best solution - `libyuv`, but thanks to you I've learned a lot about NDK/JNI. Thank you for all your suggestions and support, you deserve this bounty for shure! cheers! — snachmsm, Oct 01 '20 at 08:22
Yes, `libyuv` adds vectored operations on NEON, done in hand-tuned assembly. Hard to beat that. — Alex Cohn, Oct 01 '20 at 13:15

Code translated to C/NDK/JNI less efficient than Java original

2 Answers2