Yes, this is very confusing; I ran into the same problem. What I noticed is that on my device at least (Snapdragon 835), ResizeBilinear_2 and ArgMax take an insane amount of time. If you disable CPU fallback you will see that ResizeBilinear_2 is actually not supported, since the DeepLab implementation uses align_corners=true.
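For reference, here's roughly how I disable the fallback with the Java API so unsupported layers show up as load errors instead of quietly running on the CPU. Treat it as a sketch: the .dlc path is made up, and setCpuFallbackEnabled has changed between SNPE releases, so check the docs for your SDK version:

```java
import java.io.File;
import java.io.IOException;

import android.app.Application;

import com.qualcomm.qti.snpe.NeuralNetwork;
import com.qualcomm.qti.snpe.SNPE;

final class SnpeLoader {
    // Request the GPU runtime only, with CPU fallback off, so a layer SNPE
    // can't run on the GPU (like ResizeBilinear_2 with align_corners=true)
    // fails loudly at build time instead of silently degrading performance.
    static NeuralNetwork loadGpuOnly(Application app) throws IOException {
        return new SNPE.NeuralNetworkBuilder(app)
                .setModel(new File("/sdcard/deeplab.dlc")) // illustrative path
                .setRuntimeOrder(NeuralNetwork.Runtime.GPU)
                .setCpuFallbackEnabled(false)
                .build();
    }
}
```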
If you pick ResizeBilinear_1 as the output layer instead, inference time improves significantly; the trade-off is that you lose the final bilinear resize and ArgMax layers, which you will have to implement yourself (sketch below).
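To give you an idea, here's a sketch of that post-processing in plain Java, nothing SNPE-specific. It assumes the ResizeBilinear_1 output arrives as a flat float[] in HWC order, which is what I get on my setup; verify the layout on yours. You can cut the graph at that layer with the builder's setOutputLayers. One caveat: a proper bilinear resize should happen on the logits before the argmax; class IDs themselves can only be resized nearest-neighbour, which is what I do here:

```java
final class DeeplabPost {
    // Per-pixel argmax over a [height x width x numClasses] logits buffer in
    // HWC order; returns a height*width label map of class indices.
    static int[] argmaxHWC(float[] logits, int height, int width, int numClasses) {
        final int[] labels = new int[height * width];
        for (int p = 0; p < height * width; p++) {
            int best = 0;
            float bestVal = logits[p * numClasses];
            for (int c = 1; c < numClasses; c++) {
                final float v = logits[p * numClasses + c];
                if (v > bestVal) {
                    bestVal = v;
                    best = c;
                }
            }
            labels[p] = best;
        }
        return labels;
    }

    // Nearest-neighbour upscale of the label map back to the input resolution,
    // standing in for the ResizeBilinear_2 layer we dropped from the graph.
    static int[] resizeNearest(int[] labels, int srcH, int srcW, int dstH, int dstW) {
        final int[] out = new int[dstH * dstW];
        for (int y = 0; y < dstH; y++) {
            final int sy = y * srcH / dstH;
            for (int x = 0; x < dstW; x++) {
                final int sx = x * srcW / dstW;
                out[y * dstW + x] = labels[sy * srcW + sx];
            }
        }
        return out;
    }
}
```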
But even then, using the GPU I was only able to reach around 200 ms per inference. With the DSP I did manage to get around 100 ms.
Also be sure your device has OpenCL support; otherwise the GPU runtime won't work, AFAIK.
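I don't know of an official check for this, but since the GPU runtime sits on top of OpenCL, you can at least sanity-check that the vendor library is on the device. The paths below are just common locations, not guaranteed on every device:

```java
import java.io.File;

final class OpenClCheck {
    // Heuristic only: looks for the vendor OpenCL library in common locations.
    // If none exists, the SNPE GPU runtime almost certainly won't initialize.
    static boolean looksPresent() {
        final String[] candidates = {
                "/vendor/lib64/libOpenCL.so",
                "/vendor/lib/libOpenCL.so",
                "/system/vendor/lib64/libOpenCL.so",
                "/system/vendor/lib/libOpenCL.so",
        };
        for (String path : candidates) {
            if (new File(path).exists()) {
                return true;
            }
        }
        return false;
    }
}
```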
Side note: I'm currently still testing DeepLab + SNPE myself. Comparing it with the TFLite GPU delegate, I noticed some differences in the output: while SNPE is in general about twice as fast, there are a lot of segmentation artifacts, which can result in an unusable model. Check this out: https://developer.qualcomm.com/forum/qdn-forums/software/snapdragon-neural-processing-engine-sdk/34844
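If you want to quantify those differences rather than eyeball them, something as simple as the fraction of disagreeing pixels between the two label maps works. This is my own helper, nothing from either SDK:

```java
final class MaskDiff {
    // Fraction of pixels where the SNPE and TFLite label maps disagree;
    // both arrays are flat per-pixel class IDs at the same resolution.
    static double disagreementRatio(int[] snpeLabels, int[] tfliteLabels) {
        if (snpeLabels.length != tfliteLabels.length) {
            throw new IllegalArgumentException("label maps must be the same size");
        }
        int diff = 0;
        for (int i = 0; i < snpeLabels.length; i++) {
            if (snpeLabels[i] != tfliteLabels[i]) {
                diff++;
            }
        }
        return (double) diff / snpeLabels.length;
    }
}
```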
What I've found so far is that if you drop the output stride to 16, not only do you get double the inference speed, the artifacts also seem to be less visible. Of course, you lose some accuracy doing so. Good luck!