
Exactly as the title says.

I have a parallelized image creation/processing algorithm that I would like to use. It is a kind of Perlin noise implementation.

// Logging is never used here
#pragma version(1)
#pragma rs java_package_name(my.package.name)
#pragma rs_fp_full

float sizeX, sizeY;
float ratio;

static float fbm(float2 coord)
{ ... }

uchar4 RS_KERNEL root(uint32_t x, uint32_t y)
{
    float u = x / sizeX * ratio;
    float v = y / sizeY;

    float2 p = {u, v};

    float res = fbm(p) * 2.0f;   // rs.: 8245 ms, fs: 8307 ms; fs 9842 ms on tablet

    float4 color = {res, res, res, 1.0f};
    //float4 color = {p.x, p.y, 0.0, 1.0};  // rs.: 96 ms

    return rsPackColorTo8888(color);
}

As a comparison, this exact algorithm runs at 30 fps or more when I implement it on the GPU via a fragment shader on a textured quad.

The overhead of running the RenderScript should be at most 100 ms, which I estimated by producing a simple bitmap that just returns the normalized x and y coordinates.

That means that, if it were using the GPU, it would certainly not take 10 seconds.

The code I am using to run the RenderScript:

// The non-support version gives at least an extra 25% performance boost
import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.RenderScript;

public class RSNoise {

    private RenderScript renderScript;
    private ScriptC_noise noiseScript;

    private Allocation allOut;

    private Bitmap outBitmap;

    final int sizeX = 1536;
    final int sizeY = 2048;

    public RSNoise(Context context) {
        renderScript = RenderScript.create(context);

        outBitmap = Bitmap.createBitmap(sizeX, sizeY, Bitmap.Config.ARGB_8888);
        allOut = Allocation.createFromBitmap(renderScript, outBitmap, Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_GRAPHICS_TEXTURE);

        noiseScript = new ScriptC_noise(renderScript);
    }

    // Only the render() function is benchmarked
    public Bitmap render() {
        noiseScript.set_sizeX((float) sizeX);
        noiseScript.set_sizeY((float) sizeY);
        noiseScript.set_ratio((float) sizeX / (float) sizeY);

        noiseScript.forEach_root(allOut);

        allOut.copyTo(outBitmap);

        return outBitmap;
    }
}
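
For reference, the timings above come from simply measuring the render() call with wall-clock time, roughly like the sketch below (the log tag and the place the context comes from are placeholders, not my actual benchmarking code):

// Sketch of the measurement only; "context" comes from the calling component.
RSNoise noise = new RSNoise(context);

long start = android.os.SystemClock.elapsedRealtime();
Bitmap result = noise.render();
long elapsedMs = android.os.SystemClock.elapsedRealtime() - start;

android.util.Log.d("RSNoise", "render() took " + elapsedMs + " ms");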

If I change it to FilterScript, following this answer (https://stackoverflow.com/a/14942723/4420543), the result is several hundred milliseconds worse with the support library and about twice as slow with the non-support one. The precision setting did not influence the results.

I have also checked every related question on Stack Overflow, but most of them are outdated. I have also tried it on a Nexus 5 (OS version 7.1.1), among several other new devices, but the problem still remains.

So, when does RenderScript run on the GPU? It would be enough if someone could give me an example of a RenderScript that runs on the GPU.

andras
  • This was a long time ago. I could not solve this issue, and it was not related to `#pragma rs_fp_relaxed` as one might think. The key lies somewhere in the specifics of the `AsyncTask` that I could not replicate with custom threads. Only the official example runs stably on the GPU. – andras May 28 '18 at 17:31

2 Answers


Can you try to run it with rs_fp_relaxed instead of rs_fp_full?

#pragma rs_fp_relaxed

rs_fp_full will force your script to run on the CPU, since most GPUs don't support full-precision floating point operations.

Miao Wang
  • A problem with disabling this line is that the noise implementation requires high (float) precision, and it would be better if the device used the CPU only when it does not support it on the GPU, rather than render a 'bad' picture on the GPU. – andras Mar 29 '17 at 07:52
  • "rs_fp_relaxed" does not mean it is of lower precision of OpenGL "high precision". – Miao Wang Mar 30 '17 at 00:01
  • Here we are talking about compute precision. "rs_fp_relaxed" is still 32bit float precision, it just relaxes certain math opertions like denorm. Actually GL shader will do similar things on mobile. I suggest you try it out on your test devices to see if that works for you. – Miao Wang Mar 30 '17 at 00:07

I can agree with your guess.

On a Nexus 7 (2013, Jelly Bean 4.3) I wrote a RenderScript and a FilterScript, respectively, to calculate the famous Mandelbrot set. Compared to an OpenGL fragment shader doing the same thing (all with 32-bit floats), the scripts were about 3 times slower. I assume OpenGL uses the GPU where RenderScript (and FilterScript!) does not.

Then I compared camera preview conversion (NV21 format -> RGB) with a RenderScript, a FilterScript and the ScriptIntrinsicYuvToRGB, respectively. Here the intrinsic is about 4 times faster than the self-written scripts. Again I see no difference in performance between RenderScript and FilterScript. In this case I assume the self-written scripts again use the CPU only, where the intrinsic makes use of the GPU (too?).
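
For reference, the intrinsic path in this comparison was wired up roughly like the sketch below (class and variable names are simplified placeholders, not the exact benchmark code):

import android.content.Context;
import android.graphics.Bitmap;
import android.graphics.ImageFormat;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicYuvToRGB;
import android.renderscript.Type;

public class Nv21Converter {

    // Converts one NV21 preview frame to an ARGB_8888 bitmap using the built-in intrinsic.
    public static Bitmap convert(Context context, byte[] nv21, int width, int height) {
        RenderScript rs = RenderScript.create(context);
        ScriptIntrinsicYuvToRGB yuvToRgb = ScriptIntrinsicYuvToRGB.create(rs, Element.U8_4(rs));

        // Input allocation holding the raw NV21 bytes of the frame.
        Type yuvType = new Type.Builder(rs, Element.U8(rs))
                .setX(width)
                .setY(height)
                .setYuvFormat(ImageFormat.NV21)
                .create();
        Allocation in = Allocation.createTyped(rs, yuvType, Allocation.USAGE_SCRIPT);
        in.copyFrom(nv21);

        // Output allocation backed by the result bitmap.
        Bitmap out = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888);
        Allocation outAlloc = Allocation.createFromBitmap(rs, out);

        yuvToRgb.setInput(in);
        yuvToRgb.forEach(outAlloc);
        outAlloc.copyTo(out);

        return out;
    }
}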

Matti81