
In the TensorFlow Lite Android demo code for image classification, the images are first converted to ByteBuffer format for better performance. This conversion from bitmap to floating point format and the subsequent conversion to byte buffer seems to be an expensive operation (loops, bitwise operators, float mem-copy etc.). We were trying to implement the same logic with OpenCV to gain some speed advantage. The following code works without error; but due to some logical error in this conversion, the output of the model (to which this data is fed) seems to be incorrect. The input of the model is supposed to be RGB with data type float[1,197,197,3].

How can we speed up this process of bitmap to byte buffer conversion using opencv (or any other means)?

Standard Bitmap to ByteBuffer conversion:

/** Writes Image data into a {@code ByteBuffer}. */
private void convertBitmapToByteBuffer(Bitmap bitmap) {
  if (imgData == null) {
    return;
  }
  imgData.rewind();

  bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0, bitmap.getWidth(), bitmap.getHeight());

  long startTime = SystemClock.uptimeMillis();

  // Convert the image to floating point, one packed ARGB int pixel at a time.
  int pixel = 0;
  for (int i = 0; i < getImageSizeX(); ++i) {
    for (int j = 0; j < getImageSizeY(); ++j) {
      final int val = intValues[pixel++];
      imgData.putFloat(((val >> 16) & 0xFF) / 255.f);  // R
      imgData.putFloat(((val >> 8) & 0xFF) / 255.f);   // G
      imgData.putFloat((val & 0xFF) / 255.f);          // B
    }
  }

  long endTime = SystemClock.uptimeMillis();
  Log.d(TAG, "Timecost to put values into ByteBuffer: " + Long.toString(endTime - startTime));
}

OpenCV Bitmap to ByteBuffer conversion:

    /** Writes Image data into a {@code ByteBuffer}. */
    private void convertBitmapToByteBuffer(Bitmap bitmap) {
      if (imgData == null) {
        return;
      }
      imgData.rewind();

      bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0, bitmap.getWidth(), bitmap.getHeight());

      long startTime = SystemClock.uptimeMillis();

      Mat bufmat = new Mat(197, 197, CV_8UC3);
      Mat newmat = new Mat(197, 197, CV_32FC3);

      Utils.bitmapToMat(bitmap, bufmat);
      Imgproc.cvtColor(bufmat, bufmat, Imgproc.COLOR_RGBA2RGB);

      List<Mat> sp_im = new ArrayList<Mat>(3);
      Core.split(bufmat, sp_im);

      sp_im.get(0).convertTo(sp_im.get(0), CV_32F, 1.0/255/0);
      sp_im.get(1).convertTo(sp_im.get(1), CV_32F, 1.0/255.0);
      sp_im.get(2).convertTo(sp_im.get(2), CV_32F, 1.0/255.0);

      Core.merge(sp_im, newmat);

      // bufmat.convertTo(newmat, CV_32FC3, 1.0/255.0);
      float buf[] = new float[197 * 197 * 3];
      newmat.get(0, 0, buf);

      // imgData.wrap(buf).order(ByteOrder.nativeOrder()).getFloat();
      imgData.order(ByteOrder.nativeOrder()).asFloatBuffer().put(buf);

      long endTime = SystemClock.uptimeMillis();
      Log.d(TAG, "Timecost to put values into ByteBuffer: " + Long.toString(endTime - startTime));
    }
anilsathyan7
  • Now the TensorFlow Android demo has included a data type 'TensorImage' for loading bitmaps into the model, in its recent 'support library'. – anilsathyan7 Dec 14 '19 at 10:25
  • You can use this class to ease your work now https://www.tensorflow.org/lite/inference_with_metadata/lite_support – Farmaker Nov 01 '21 at 04:54
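
For reference, a minimal sketch of the support-library path mentioned in the comments above, assuming the tensorflow-lite-support artifact and the question's 197×197 float input; the class and method names follow the library's documented API:

    import org.tensorflow.lite.DataType;
    import org.tensorflow.lite.support.common.ops.NormalizeOp;
    import org.tensorflow.lite.support.image.ImageProcessor;
    import org.tensorflow.lite.support.image.TensorImage;
    import org.tensorflow.lite.support.image.ops.ResizeOp;

    // Resize to the model's input size and map [0,255] bytes to [0,1] floats.
    ImageProcessor imageProcessor =
        new ImageProcessor.Builder()
            .add(new ResizeOp(197, 197, ResizeOp.ResizeMethod.BILINEAR))
            .add(new NormalizeOp(0f, 255f))
            .build();

    TensorImage tensorImage = new TensorImage(DataType.FLOAT32);
    tensorImage.load(bitmap);
    tensorImage = imageProcessor.process(tensorImage);
    ByteBuffer modelInput = tensorImage.getBuffer();  // feed this to the Interpreter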

3 Answers

  1. I believe that 255/0 in your code is a copy/paste mistake, not real code.
  2. I wonder what the timecost of the pure Java solution is, especially when you weigh it against the timecost of inference. For me, with a slightly larger bitmap for Google's mobilenet_v1_1.0_224, the naïve float buffer preparation was less than 5% of inference time.
  3. I could quantize the tflite model (with the same tflite_convert utility that generated the .tflite file from .h5). There could actually be three quantization operations, but I only used two: --inference_input_type=QUANTIZED_UINT8 and --post_training_quantize.
    • The resulting model is about 25% of the size of the float32 one, which is an achievement on its own.
    • The resulting model runs about twice as fast (at least on some devices).
    • And the resulting model consumes uint8 inputs. This means that instead of imgData.putFloat(((val >> 16) & 0xFF) / 255.f) we write imgData.put((val >> 16) & 0xFF), and so on; see the sketch after this list.
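
A minimal sketch of that uint8 input path, assuming the question's imgData and intValues fields; note that ByteBuffer.put() takes a byte, so the masked channel value must be cast, and the buffer shrinks to width × height × 3 bytes (no 4-byte floats):

    /** Hypothetical uint8 variant of convertBitmapToByteBuffer for a quantized model. */
    private void convertBitmapToByteBufferQuantized(Bitmap bitmap) {
      imgData.rewind();
      bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0,
          bitmap.getWidth(), bitmap.getHeight());
      for (final int val : intValues) {
        imgData.put((byte) ((val >> 16) & 0xFF));  // R
        imgData.put((byte) ((val >> 8) & 0xFF));   // G
        imgData.put((byte) (val & 0xFF));          // B
      }
    }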

By the way, I don't think that your formulae are correct. To achieve best accuracy when float32 buffers are involved, we use

putFloat(byteval / 256f)

where byteval is int in range [0:255].

Alex Cohn
  • Thanks, it was a silly mistake. I changed 255/0 to 255.0, and now it works perfectly and is twice as fast. Actually I was too ambitious in optimizing this part of the TF code; I was trying to remove the costly nested loops. We were running other code in parallel to this conversion, and as a result the time taken increased 4x. We tried those two quantization methods, but they did not work properly. Also, we would lose accuracy (can't afford this) in this scenario, so we need to try quantization-aware training. We are using the GPU delegate, and tflite would run only a float model on the GPU. – anilsathyan7 May 01 '19 at 10:06
  • You don't need nested loops `for i / for j` in the simple Java implementation. Switching to a single loop with a precalculated limit could give you a significant boost. – Alex Cohn May 01 '19 at 13:56
  • It's faster, but not as fast as the current OpenCV method. Looks like OpenCV has NEON-optimized low-level vectorized operations... – anilsathyan7 May 01 '19 at 14:27
  • Nice to know. I hope we can live with a quantized model; it's faster for us than GPU and also significantly smaller. BTW, on the face of it, quantizing *input only* should not lose accuracy (this means, conversion of bytes to floats cannot improve accuracy), and the resulting model may still work with GpuDelegate. – Alex Cohn May 01 '19 at 14:42
  • It depends on the scenario or use case; for our case of semantic segmentation, a small accuracy loss really shows in the output. Accuracy loss depends on the level of quantization. GPU supports only float32 (float16 is still experimental). Weights-only quantization reduced the size, but some operators were not supported in tflite GPU (dequantize). You may see a difference between quantized CPU and GPU if you use bigger images/models; GPU will be faster. By the way, what do you mean by input-only quantization? Is it quantized UINT8 input? This won't run on GPU. We tried the two methods you mentioned earlier, but it didn't work. – anilsathyan7 May 02 '19 at 05:20
  • In my experiment, `--inference_input_type=QUANTIZED_UINT8` does not fail with the GPU delegate. Unfortunately, it does not run faster, probably simply falling back to CPU. – Alex Cohn May 02 '19 at 11:18
  • Yes, maybe... once it switches to CPU, it will run only on CPU from there on till the end. Also, GPU can handle only float values (specialized). The node implementations, I suppose, are done using OpenGL compute shaders, which mostly rely on float values in TF. If you try out the Android tflite GPU classification demo app on a phone with a high-end GPU, you can see the speed difference. GPU outperforms even quantized models in terms of speed! – anilsathyan7 May 02 '19 at 12:14
  • *GPU outperforms even quantized models in terms of speed* – also in my experiments on quite modest devices, without setting the thread count. For quantized, the best inference time (on that same demo app) is when it is set up with 3 threads, and this is faster than GPU on float. – Alex Cohn May 02 '19 at 12:21
  • It depends on the model and device; for me it was two threads that gave the best speed, but GPU was still faster. On a low-end device (maybe also for smaller inputs) quantized would give better performance, since GPU inference involves some memory-copy overhead. Also, GPU is power efficient! – anilsathyan7 May 02 '19 at 12:25
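
A minimal sketch of the single-loop variant suggested in the comments above, assuming the same imgData and intValues fields from the question; the loop bound is precalculated instead of nesting over x and y:

    private void convertBitmapToByteBufferSingleLoop(Bitmap bitmap) {
      imgData.rewind();
      bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0,
          bitmap.getWidth(), bitmap.getHeight());
      final int pixelCount = getImageSizeX() * getImageSizeY();  // precalculated limit
      for (int p = 0; p < pixelCount; ++p) {
        final int val = intValues[p];
        imgData.putFloat(((val >> 16) & 0xFF) / 255.f);  // R
        imgData.putFloat(((val >> 8) & 0xFF) / 255.f);   // G
        imgData.putFloat((val & 0xFF) / 255.f);          // B
      }
    }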

As mentioned here, use the code below (from here) for converting a Bitmap to a ByteBuffer (float32):

private fun convertBitmapToByteBuffer(bitmap: Bitmap): ByteBuffer? {
    // 4 bytes per float; BATCH_SIZE, inputSize, PIXEL_SIZE, IMAGE_MEAN and
    // IMAGE_STD are constants defined elsewhere in the project.
    val byteBuffer =
        ByteBuffer.allocateDirect(4 * BATCH_SIZE * inputSize * inputSize * PIXEL_SIZE)
    byteBuffer.order(ByteOrder.nativeOrder())
    val intValues = IntArray(inputSize * inputSize)
    bitmap.getPixels(intValues, 0, bitmap.width, 0, 0, bitmap.width, bitmap.height)
    var pixel = 0
    for (i in 0 until inputSize) {
        for (j in 0 until inputSize) {
            val `val` = intValues[pixel++]
            byteBuffer.putFloat(((`val` shr 16 and 0xFF) - IMAGE_MEAN) / IMAGE_STD) // R
            byteBuffer.putFloat(((`val` shr 8 and 0xFF) - IMAGE_MEAN) / IMAGE_STD)  // G
            byteBuffer.putFloat(((`val` and 0xFF) - IMAGE_MEAN) / IMAGE_STD)        // B
        }
    }
    return byteBuffer
}
Sunit Roy

For floats, with mean = 0 and std = 255.0 (matching the plain /255 scaling in the code), the function would be:

fun bitmapToBytebufferWithOpenCV(bitmap: Bitmap): ByteBuffer {
    val startTime = SystemClock.uptimeMillis()
    val imgData = ByteBuffer.allocateDirect(1 * 257 * 257 * 3 * 4)
    imgData.order(ByteOrder.nativeOrder())

    // Convert the bitmap to an RGB Mat.
    val bufmat = Mat()
    val newmat = Mat()
    Utils.bitmapToMat(bitmap, bufmat)
    Imgproc.cvtColor(bufmat, bufmat, Imgproc.COLOR_RGBA2RGB)

    // Split into channels, scale each to [0, 1] floats, and merge back.
    val splitImage: List<Mat> = ArrayList(3)
    Core.split(bufmat, splitImage)
    splitImage[0].convertTo(splitImage[0], CV_32F, 1.0 / 255.0)
    splitImage[1].convertTo(splitImage[1], CV_32F, 1.0 / 255.0)
    splitImage[2].convertTo(splitImage[2], CV_32F, 1.0 / 255.0)
    Core.merge(splitImage, newmat)

    // Copy the interleaved float pixels into the direct buffer.
    val buf = FloatArray(257 * 257 * 3)
    newmat.get(0, 0, buf)
    for (i in buf.indices) {
        imgData.putFloat(buf[i])
    }
    imgData.rewind()

    val endTime = SystemClock.uptimeMillis()
    Log.v("Bitwise", (endTime - startTime).toString())
    return imgData
}

Unfortunately this one is slightly slower (10ms) than what Sunit wrote above with for loops and bitwise operations (8ms).
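
A likely cause of the gap is the final per-element putFloat loop. One hedged option, mirroring the asFloatBuffer().put(buf) trick from the question's Java version, is to bulk-copy the float array in a single call (Java sketch; buf and imgData as above):

    // Hypothetical replacement for the per-element loop: copy the whole float[]
    // into the direct buffer in one call. The FloatBuffer view shares the
    // buffer's memory but keeps its own position, so imgData itself stays at 0.
    imgData.rewind();
    imgData.asFloatBuffer().put(buf);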

Farmaker