15

I have an app that streams video using Kickflip and ButterflyTV libRTMP

Now for 99% percent of the time the app is working ok, but from time to time I get a native segmentation fault that I am not able to debug, since messages are too cryptic:

01-24 10:52:25.576 199-199/? A/DEBUG: *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
01-24 10:52:25.576 199-199/? A/DEBUG: Build fingerprint: 'google/hammerhead/hammerhead:6.0.1/M4B30Z/3437181:user/release-keys'
01-24 10:52:25.576 199-199/? A/DEBUG: Revision: '11'
01-24 10:52:25.576 199-199/? A/DEBUG: ABI: 'arm'
01-24 10:52:25.576 199-199/? A/DEBUG: pid: 14302, tid: 14382, name: MuxerThread  >>> tv.myapp.broadcast.dev <<<
01-24 10:52:25.576 199-199/? A/DEBUG: signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0x9fef1000
01-24 10:52:25.636 199-199/? A/DEBUG: Abort message: 'Setting to ready!'
01-24 10:52:25.636 199-199/? A/DEBUG:     r0 9c6f9500  r1 9c6f94fc  r2 9fee900c  r3 00007ff4
01-24 10:52:25.636 199-199/? A/DEBUG:     r4 9fee9010  r5 9fef0ffd  r6 00007ff1  r7 9fef0d88
01-24 10:52:25.636 199-199/? A/DEBUG:     r8 cfe40980  r9 9e0a6900  sl 00007ff4  fp 9c6f94fc
01-24 10:52:25.636 199-199/? A/DEBUG:     ip 9c6f9058  sp 9c6f94dc  lr 000000e9  pc b3a33cb6  cpsr 800f0030
01-24 10:52:25.650 199-199/? A/DEBUG: backtrace:
01-24 10:52:25.651 199-199/? A/DEBUG:     #00 pc 00004cb6  /data/app/tv.myapp.broadcast.dev-2/lib/arm/librtmp-jni.so
01-24 10:52:25.651 199-199/? A/DEBUG:     #01 pc 00005189  /data/app/tv.myapp.broadcast.dev-2/lib/arm/librtmp-jni.so (rtmp_sender_write_video_frame+28)
01-24 10:52:25.651 199-199/? A/DEBUG:     #02 pc 00005599  /data/app/tv.myapp.broadcast.dev-2/lib/arm/librtmp-jni.so (Java_net_butterflytv_rtmp_1client_RTMPMuxer_writeVideo+60)
01-24 10:52:25.651 199-199/? A/DEBUG:     #03 pc 014e84e7  /data/app/tv.myapp.broadcast.dev-2/oat/arm/base.odex (offset 0xa66000) (int net.butterflytv.rtmp_client.RTMPMuxer.writeVideo(byte[], int, int, int)+122)
01-24 10:52:25.651 199-199/? A/DEBUG:     #04 pc 014dbd55  /data/app/tv.myapp.broadcast.dev-2/oat/arm/base.odex (offset 0xa66000) (void io.kickflip.sdk.av.muxer.RtmpMuxerMix.writeThread()+2240)
01-24 10:52:25.651 199-199/? A/DEBUG:     #05 pc 014d8c41  /data/app/tv.myapp.broadcast.dev-2/oat/arm/base.odex (offset 0xa66000) (void io.kickflip.sdk.av.muxer.RtmpMuxerMix.access$000(io.kickflip.sdk.av.muxer.RtmpMuxerMix)+60)
01-24 10:52:25.651 199-199/? A/DEBUG:     #06 pc 014d819f  /data/app/tv.myapp.broadcast.dev-2/oat/arm/base.odex (offset 0xa66000) (void io.kickflip.sdk.av.muxer.RtmpMuxerMix$1.run()+98)
01-24 10:52:25.651 199-199/? A/DEBUG:     #07 pc 721e78d1  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x1ed6000)

Again, in a 2 hour stream this might not ever happen or it might happen 10 minutes into the stream. It is super hard to debug because I cannot force the bug to happen.

Is there any way to improve the debugging information I get? What exactly does SEGV_ACCER mean? I've read that this "means you tried to access an address that you don't have permission to access." but I am unsure as what that means, as I can stream for hours without the bug happening.

Is there any way to catch the signal and just continue?

EDIT: to add more information, this is the part of the native library where the app crashes (found using ndk-stack):

JNIEXPORT jint JNICALL
Java_net_butterflytv_rtmp_1client_RTMPMuxer_writeVideo(JNIEnv *env, jobject instance,
                                                       jbyteArray data_, jint offset, jint length,
                                                       jint timestamp) {
    jbyte *data = (*env)->GetByteArrayElements(env, data_, NULL);
    jint result = rtmp_sender_write_video_frame(data, length, timestamp, 0, 0);
    (*env)->ReleaseByteArrayElements(env, data_, data, 0);

    return result;
}


int rtmp_sender_write_video_frame(uint8_t *data,
                                  int size,
                                  uint64_t dts_us,
                                  int key,
                                  uint32_t abs_ts)
{


    uint8_t * buf;
    uint8_t * buf_offset;
    int val = 0;
    int total;
    uint32_t ts;
    uint32_t nal_len;
    uint32_t nal_len_n;
    uint8_t *nal;
    uint8_t *nal_n;
    char *output ;
    uint32_t offset = 0;
    uint32_t body_len;
    uint32_t output_len;

    buf = data;
    buf_offset = data;
    total = size;
    ts = (uint32_t)dts_us;

    //ts = RTMP_GetTime() - start_time;
    offset = 0;

    nal = get_nal(&nal_len, &buf_offset, buf, total);

(...)


}



static uint8_t * get_nal(uint32_t *len, uint8_t **offset, uint8_t *start, uint32_t total)
{
    uint32_t info;
    uint8_t *q ;
    uint8_t *p  =  *offset;
    *len = 0;




    if ((p - start) >= total)
        return NULL;

    while(1) {
        info =  find_start_code(p, 3);

        if (info == 1)
            break;
        p++;
        if ((p - start) >= total)
            return NULL;
    }
    q = p + 4;
    p = q;

    while(1) {
        info =  find_start_code(p, 3);

        if (info == 1)
            break;
        p++;
        if ((p - start) >= total)
            //return NULL;
            break;
    }


    *len = (p - q);
    *offset = p;
    return q;
}


static uint32_t find_start_code(uint8_t *buf, uint32_t zeros_in_startcode)
{
    uint32_t info;
    uint32_t i;

    info = 1;
    if ((info = (buf[zeros_in_startcode] != 1)? 0: 1) == 0)
        return 0;

    for (i = 0; i < zeros_in_startcode; i++)
        if (buf[i] != 0)
        {
            info = 0;
            break;
        };

    return info;
}

Crash happens at buf[zeros_in_startcode] in find_start_code. I have removed a few android_log lines as well (dont think this matters?).

To my understanding, this buffer should be accessible, it makes no sense that it crashes only "sometimes".

PS. this is where I call the native code from Java:

private void writeThread() {

       while (true) {

           Frame frame = null;
           synchronized (mBufferLock) {
              if (!mConfigBuffer.isEmpty()) {
                   frame = mConfigBuffer.peek();
               } else if (!mBuffer.isEmpty()) {
                   frame = mBuffer.remove();
               }
               if (frame == null) {
                   try {
                       mBufferLock.wait();
                   } catch (InterruptedException e) {
                   }
               }
           }

           if (frame == null) {
               continue;
           } else if (frame instanceof Sentinel) {
               break;
           }


           int writeResult = 0;

           synchronized (mWriteFence) {
               if (!mConnected) {
                   debug(WARN, "Skipping frame due to disconnection");
                   continue;
               }

               if (frame.getFrameType() == Frame.VIDEO_FRAME) {              
                   writeResult = mRTMPMuxer.writeVideo(frame.getData(), frame.getOffset(), frame.getSize(), frame.getTime());
               } else if (frame.getFrameType() == Frame.AUDIO_FRAME) {
                   writeResult = mRTMPMuxer.writeAudio(frame.getData(), frame.getOffset(), frame.getSize(), frame.getTime());

               }

               if (writeResult < 0) {
                       mRtmpListener.onDisconnected();
                       mConnected = false;
               } else {
                   //Now we remove the config frame, only if sending was successful!
                   if (frame.isConfig()) {
                       synchronized (mBufferLock) {
                           mConfigBuffer.remove();
                       }
                   }
               }
           }

       }

   }

Note that the crash happens even when I dont send audio at all.

M Rajoy
  • 4,028
  • 14
  • 54
  • 111
  • I'm guessing you're overrunning your array. Are you certain you're passing the correct number of array elements? Does `length` match the number of elements returned from `GetArrayLength()`? What is the actual value of `data`? Can you modify your JNI code to somehow log or otherwise show that information? – Andrew Henle Jan 26 '17 at 13:23
  • @AndrewHenle all of this is very diffcult to figure out because this is happening 30 times per second and the crash only happens randomly (and it might take hours to happen), I can try and print this information for every frame to find out the values when the crash happens, but then the code is executing fewer times (due to logging slowing it down) and thus it might take hours to actually happen... In any case data is never less than 3 bytes which is the index being accessed at the sensitive point. – M Rajoy Jan 26 '17 at 18:05
  • *all of this is very diffcult to figure out...* Yes, it is very difficult to figure out memory corruption problems. The easiest solution is probably to write a simple C driver for your JNI library and make calls to your `rtmp_sender_write_video_frame()` function from a simple C program. Then you can easily do things like port your code to Linux and run under Valgrind, for example. Note that even doing that might not find your memory corruption error - JNI is *very* unforgiving and memory corruption anywhere can produce failures in seemingly completely unrelated code. – Andrew Henle Jan 27 '17 at 13:11
  • @AndrewHenle can you check my comment on the other answer? Is it possible that hte GC collects my buffers when they "get out" of the java memory space? – M Rajoy Jan 27 '17 at 13:12
  • *Is it possible that hte GC collects my buffers when they "get out" of the java memory space?* I've never seen that, and it would be a huge bug for the JVM to GC an array that's passed to a JNI function. The odds of you being the only one to observe such a bug are infinitesimally small. – Andrew Henle Jan 27 '17 at 13:23
  • @AndrewHenle ok "new" piece of information -- I removed all the logging of my app and I havent run into the bug again so far. Is is possible that the logging mechanisms (from the java code and also android__log from jni) are causing this? Seems farfetched but I dont know what to make up from this anymore – M Rajoy Jan 27 '17 at 13:25
  • You're guessing. Is this your only code that uses that logging functionality? Is the logging code a widely-used library? Your seeing a textbook example of memory corruption. The most likely source of that corruption is code that isn't successfully used by many other people. And just because you only observe crashes in the code you posted, that doesn't mean that's where the actual bug is. Instead of continuing on your [Easter Egg hunt](http://www.dictionary.com/browse/easter-egging) you need to use actual tools like Valgrind, or even an Android equivalent of Rational Purify. – Andrew Henle Jan 27 '17 at 13:44
  • @AndrewHenle you are right, you know any good source where I can start learning about that? – M Rajoy Jan 27 '17 at 13:45
  • "You can store the data in a byte[]. This allows very fast access from managed code. On the native side, however, you're not guaranteed to be able to access the data without having to copy it." See https://developer.android.com/training/articles/perf-jni.html – Jon Goodwin Jan 27 '17 at 17:04
  • @JonGoodwin that sounds like it might actually be it. I will give it a try. – M Rajoy Jan 27 '17 at 17:58
  • Yes, seems to tick all the boxes. Good hunting @GabrielSanmartin – Jon Goodwin Jan 27 '17 at 20:40

2 Answers2

6

"You can store the data in a byte[]. This allows very fast access from managed code. On the native side, however, you're not guaranteed to be able to access the data without having to copy it."

See https://developer.android.com/training/articles/perf-jni.html

Analysis

Some musings and things to try:

  • The code where it falls over is very generic, so probably no bug there
  • It must be the frame data has been removed/damaged/locked/moved
  • Has the Java garbage collector removed OR relocated the data ?
  • You could write detailed debug to a file, overwriting it on every frame, so you only have a small log with the last debug info.
  • send a local copy of the frame variable info (using ByteBuffer) to mRTMPMuxer.writeVideo
    Unlike regular byte buffers,in ByteBuffer the storage is not allocated on the managed heap, and can always be accessed directly from native code.

Implementation

//allocates memory from the native heap
ByteBuffer data = ByteBuffer.allocateDirect(frame.getData().length);
data.clear();
//System.gc();
//copy data
data.get(frame.getData(), 0, frame.getData().length);
//data = (frame.getData() == null) ? null : frame.getData().clone();
int offset  = frame.getOffset();
int size    = frame.getSize();
int time    = frame.getTime();
writeResult = mRTMPMuxer.writeVideo(data , offset, size, time);

JNIEXPORT jint JNICALL
Java_net_butterflytv_rtmp_1client_RTMPMuxer_writeVideo(
    JNIEnv *env,
    jobject instance,
    jobject data_, //NOT jbyteArray data_,
    jint offset,
    jint length,
    jint timestamp) 
{
    jbyte *data = env->GetDirectBufferAddress(env, data);//GetDirectBufferAddress NOT GetByteArrayElements
    jint result = rtmp_sender_write_video_frame(data, length, timestamp, 0, 0);
    //(*env)->ReleaseByteArrayElements(env, data_, data, 0);//????
    return result;
}

Debugging

Some code from SO Catching exceptions thrown from native code:

    static uint32_t find_start_code(uint8_t *buf, uint32_t zeros_in_startcode){
    //...
    try {
        if ((info = (buf[zeros_in_startcode] != 1)? 0: 1) == 0) return 0;//your code
    }
    // You can catch std::exception for more generic error handling
    catch (std::exception e){
        throwJavaException (env, e.what());//see method below
    }
    //...

Then a new method:

    void throwJavaException(JNIEnv *env, const char *msg)
    {
     // You can put your own exception here
     jclass c = env->FindClass("java/lang/RuntimeException");
     if (NULL == c)
     {
         //B plan: null pointer ...
         c = env->FindClass("java/lang/NullPointerException");
     }
     env->ThrowNew(c, msg);
    }
}

Don't get too hung up on SEGV_ACCERR, you have a segmentation fault,SIGSEGV (caused by a program trying to read or write an illegal memory location, read in your case).
From siginfo.h:

SEGV_MAPERR means you tried to access an address that doesn't map to anything. SEGV_ACCERR means you tried to access an address that you don't have permission to access.

Other

This may be of interest:

Q: I noticed that there was RTMP support. But a patch which remove RTMP had been merged.
Q: Could you tell me why ?
A: We don't think RTMP serves the mobile broadcasting use case as well as HLS,
A: and so we don't want to dedicate our limited resources towards supporting it.

see: https://github.com/Kickflip/kickflip-android-sdk/issues/33

I suggest you register an issue with:
https://github.com/Kickflip/kickflip-android-sdk/issues
https://github.com/ButterflyTV/LibRtmp-Client-for-Android/issues

Jon Goodwin
  • 9,053
  • 5
  • 35
  • 54
  • Thanks for your answer. How does the memory management work between java and JNI? Can a buffer be GC'd if theres a reference in JNI still? Ive read those issues about RTMP in Kickflip, thats whay I didnt use their (flawed) RTMP implementation from back in the day and switched to lightweight ButterflyTV for that. – M Rajoy Jan 27 '17 at 11:40
  • I'd sure hope the JVM doesn't GC an array that's been passed to a JNI call while that call is still running. I'd agree though that whatever memory corruption might not be in the posted code - the posted code just might be the victim, and the failure happens in the posted code because it's called often *and* makes a lot of calls dependent on dynamic memory. In other words, it's the code most likely to run over a land mine planted by some other bit of code. – Andrew Henle Jan 27 '17 at 13:16
  • @JonGoodwin I need to do more testing but it appears that this might have solved the issue. I'm skeptic though, it feels weird that just using a byte array is prone to the memory being reused/collected... I'll do more testing and mark the answer (wth the bounty) if everything is positive :) – M Rajoy Jan 31 '17 at 20:26
  • For the record, the ByteBuffers I am using come directly from the MediaCodec api as encoded frames, so it does make more sense that the native codes receive a DirectByteBuffer instead of having to provide a byte array, so in any case I think the code is cleaner this way. – M Rajoy Jan 31 '17 at 20:27
  • Cool. The way it NORMALLY works is that pointers passed to JNI are stored in a special table, if the data is relocated by GC, this table pointer to data is changed but keeps the pointer address unchanged to JNI, so JNI does not see the data move, and references the data succesfully. GC can happen while your IN the JNI thread. The JVM decides if raw (hopefully indirect) pointer's or data copies are passed to JNI. Yes I also find it very strange your SEEM to have a absolute raw pointer, which is then subject to GC. – Jon Goodwin Jan 31 '17 at 21:31
  • I did some extensive tests with three devices working at the same time for 3+ hours and got no crash... I am hoping this solved the issue and granting you the answer and the bounty. In any case thanks very much! – M Rajoy Feb 01 '17 at 10:54
2

By symptom/description of the problem, your program is most likely experiencing some sort of invalid memory access/corruption which is somehow related with multi-thread race condition scenario. From my past experience, debugging memory corruption itself is very difficult and if it is linked to multi-thread environment it becomes very very difficult. Some of my previous post might be helpful and provide some general guidelines on these topics. Please note that these posts are related to Windows/Linux and not for Android platform.

cpp - valgrind - Invalid read of size 8

A segmentation fault sometimes occurs when the function cvCreateFileCapture is invoked on network URL

While reading further about similar issue and your code sinppet, I came across one post which is mentioned below:

What does SEGV_ACCERR mean?

Client code snippet of your application

synchronized (mWriteFence) {
                if (!mConnected) {
                    continue;
                }
                if (frame.getFrameType() == Frame.VIDEO_FRAME) {
                    writeResult = mRTMPMuxer.writeVideo(frame.getData(), frame.getOffset(), frame.getSize(), frame.getTime());
                    calcVideoFpsAndBitrate(frame.getSize());

                } else if (frame.getFrameType() == Frame.AUDIO_FRAME) {
                    writeResult = mRTMPMuxer.writeAudio(frame.getData(), frame.getOffset(), frame.getSize(), frame.getTime());
                    calcAudioBitrate(frame.getSize());
                }

}

From above code, it appears to me that if your application receives Frame.VIDEO_FRAME & Frame.AUDIO_FRAME in certain order it might be leading to some sort of race condition(may be async model implementation) while using the frame variable within RtmpMuxerMix.writeThread module.

To conclude such issues:

  • we should try to read the documentation about library and its best practices and get your code reviewed. Sometime it helps to find out obvious problems in our logic.
  • We should try to reproduce this issue while running application under dynamics tools. I am not aware about such tools on Android platform. Please not that once we start running application under dynamics tools, sequence of execution gets changed and after that it is possible that we either be able to reproduce such issues very frequently or almost not be able to reproduce it.

.

Community
  • 1
  • 1
Mantosh Kumar
  • 5,659
  • 3
  • 24
  • 48
  • Thanks for your answer. I updated the code to reflect the whole write loop. To my understanding, theres no way the frame variable gets replaced while its being used in the native code, as the while loop will never continue until its done with the muxer. Also, I tried disabling audio to reduce the problem to the minimum and the crash happens still. I will take a look at your posts. – M Rajoy Jan 30 '17 at 08:24