The bulk operation on the GPU blocks the rendering of new frames. This problem is hard to solve, and it actually has nothing to do with SNPE: we can reproduce the same issue with a non-SNPE implementation (our in-house OpenCL-based framework). You can mitigate it by changing where the tensor operations are placed. For example, you can run the computation on the CPU (e.g., with TensorFlow Mobile). The UI then renders properly, but inference is much slower and far more CPU-hungry.
You can see this for yourself with the on-device developer option "Profile GPU rendering". For more information, see https://developer.android.com/studio/profile/inspect-gpu-rendering#profile_rendering. You'll notice that several "Swap Buffers" stages take unusually long.
The best solution is to run the computation on the DSP with a quantized network, but the DSP imposes many limitations on the supported operators and on available memory.
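To illustrate what a "quantized network" means here, below is a minimal sketch of generic asymmetric 8-bit (uint8) quantization: each float tensor is stored as byte codes plus a `scale` and a `zero_point`. This shows the general technique only; it is an assumption that SNPE's own quantizer works this way, and `quantize_uint8` / `dequantize_uint8` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def quantize_uint8(values: Sequence[float]) -> Tuple[List[int], float, int]:
    """Map a non-empty list of floats to uint8 codes with an affine (scale, zero_point) encoding.

    Generic illustration only; not necessarily the scheme SNPE uses.
    """
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    # Round to the nearest step, shift by the zero point, clamp to [0, 255].
    codes = [max(0, min(255, int(round(v / scale)) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_uint8(codes: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from uint8 codes."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.3, 0.0, 0.25, 1.0]
    codes, scale, zp = quantize_uint8(weights)
    restored = dequantize_uint8(codes, scale, zp)
    print(codes, zp)
    print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale)  # True
```

Storing weights and activations as 8-bit codes is what makes fixed-point execution on the DSP possible; the trade-off is a small per-value rounding error (within one `scale` step in this sketch).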
It's possible that Android 8.1 will address these issues through the NNAPI abstraction and OS-level scheduling of GPU resources, but I would not expect too much from Google.
BTW: I have a hypothetical scheme to mitigate this issue by fragmenting the bulk operations. In theory, if the worker thread splits the work into sub-50 ms operations and sleeps for 20 ms between them so that the UI thread can render, the user experience should be tolerable, since the UI would get roughly one frame per cycle of up to about 70 ms, i.e. around 15 FPS. We'll try this scheme, because even this throttled version should still be much faster than CPU-based alternatives.
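A minimal sketch of the fragmenting idea, written in Python for readability and assuming the bulk work can be split into independent pieces that each finish in under 50 ms. `split_into_chunks`, `run_throttled` and `estimated_fps` are hypothetical helpers, not part of SNPE or any other framework. `estimated_fps` simply encodes the back-of-the-envelope assumption that the UI gets about one frame per chunk-plus-pause cycle.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_chunks(items: Sequence[T], chunk_size: int,
                      work: Callable[[Sequence[T]], R]) -> List[Callable[[], R]]:
    """Wrap each slice of `items` in a zero-argument callable (hypothetical helper)."""
    # Bind each slice through a default argument so every closure keeps its own slice.
    return [lambda part=items[i:i + chunk_size]: work(part)
            for i in range(0, len(items), chunk_size)]


def run_throttled(chunks: Sequence[Callable[[], R]], pause_s: float = 0.020) -> List[R]:
    """Run chunks one after another, sleeping `pause_s` seconds between them.

    The sleep is the window in which the UI thread is assumed to get the GPU back.
    """
    results: List[R] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            time.sleep(pause_s)  # pause between chunks, not after the last one
        results.append(chunk())
    return results


def estimated_fps(chunk_ms: float, pause_ms: float) -> float:
    """Rough frame-rate estimate assuming one UI frame per chunk+pause cycle."""
    return 1000.0 / (chunk_ms + pause_ms)


if __name__ == "__main__":
    chunks = split_into_chunks(list(range(10)), 4, sum)
    print(run_throttled(chunks))             # [6, 22, 17]
    print(round(estimated_fps(50, 20), 1))   # 14.3
    print(round(estimated_fps(45, 20), 1))   # 15.4
```

On a real GPU backend, the pause only helps if each chunk has actually completed on the device before the thread sleeps. OpenCL kernel launches are asynchronous, so the worker would need to wait for completion (for example with `clFinish` on the command queue) before sleeping. The estimate also shows the budget is tight: 50 ms chunks with 20 ms pauses give about 14.3 FPS, while chunks of about 45 ms push it just above 15 FPS.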