
Here's the link to the paper regarding MobileNet V3.

MobileNet V3

According to the paper, h-swish and the Squeeze-and-Excitation (SE) module are used in MobileNet V3, but they aim to enhance accuracy and don't help boost speed.

h-swish is faster than swish and helps enhance accuracy, but if I'm not mistaken it is still much slower than ReLU.
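For reference, the two activations can be sketched in a few lines. The h-swish formula below is the one given in the MobileNetV3 paper, x · ReLU6(x + 3) / 6, while swish is x · sigmoid(x); the point is that h-swish is piecewise-linear and avoids the per-element exponential:

```python
import numpy as np

def swish(x):
    # swish: x * sigmoid(x) -- needs an exponential per element
    return x / (1.0 + np.exp(-x))

def relu6(x):
    # ReLU capped at 6, cheap on mobile hardware
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # h-swish: x * ReLU6(x + 3) / 6 -- piecewise-linear approximation
    # of swish, no exponential involved
    return x * relu6(x + 3.0) / 6.0
```

For large positive inputs h-swish reduces to the identity (like ReLU), and for inputs below −3 it is exactly zero, which is why it approximates swish so cheaply.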

SE also helps enhance accuracy, but it increases the number of parameters in the network.
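To make the parameter cost concrete, here is a minimal sketch of a squeeze-and-excitation block (assuming a reduction ratio r; the channel count C = 128 and r = 4 below are illustrative, not from the paper). The two fully-connected layers are where the extra parameters come from, roughly 2·C²/r weights:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a feature map x of shape (H, W, C).

    w1: (C, C//r) weights of the reduction FC layer
    w2: (C//r, C) weights of the expansion FC layer
    """
    # squeeze: global average pool over the spatial dims -> (C,)
    s = x.mean(axis=(0, 1))
    # excitation: FC -> ReLU -> FC -> hard-sigmoid (as in MobileNetV3)
    z = np.maximum(s @ w1, 0.0)
    gate = np.clip((z @ w2 + 3.0) / 6.0, 0.0, 1.0)  # hard-sigmoid
    # reweight each channel by its gate value
    return x * gate

# Extra parameters for C=128 channels, reduction r=4 (ignoring biases):
C, r = 128, 4
extra_params = C * (C // r) + (C // r) * C  # 2*C^2/r = 8192 weights
```

The squeeze-and-reweight structure also explains why SE adds relatively few FLOPs: the FC layers operate on a pooled C-vector, not on the full H×W map.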

Am I missing something? I still have no idea how MobileNet V3 can be faster than V2 with what's said above implemented in V3.

I didn't mention that they also modify the last part of the network, because I plan to use MobileNet V3 as the backbone and combine it with SSD layers for detection, so the last part of the network won't be used.

The following table, taken from the paper mentioned above, shows that V3 is still faster than V2.

Object detection results for comparison

Jefferson Chiu

1 Answer


MobileNetV3 is faster and more accurate than MobileNetV2 on the classification task, but this is not necessarily true on a different task, such as object detection. As you mention yourself, the optimizations they made at the deepest end of the network are mostly relevant to the classification variant, and as can be seen in the table you referenced, the mAP is no better.

A few things to consider, though:

  • It's true that SE and h-swish both slow down the network a bit. SE adds some FLOPs and parameters, h-swish adds complexity, and both cause some latency. However, both are added such that the accuracy-latency trade-off improves: either the added latency is worth the accuracy gain, or you can maintain the same accuracy while trimming the network elsewhere, thus reducing overall latency. Specifically regarding h-swish, note that they mostly use it in the deeper layers, where the tensors are smaller. Those tensors are thicker (more channels), but due to the quadratic drop in resolution (height × width) they are smaller overall, so h-swish incurs less latency there.
  • The architecture itself (without h-swish, and even without considering SE) is searched. That means it is better suited to the task than "vanilla" MobileNetV2: it is less hand-engineered and actually optimized for the task. You can see, for example, that as in MnasNet, some kernels grew to 5x5 (rather than 3x3), not all expansion rates are x6, etc.
  • One change they made to the deepest end of the network is also relevant to object detection. Oddly, for SSDLite-MobileNetV2, the original authors chose to keep the last 1x1 convolution, which expands the depth from 320 to 1280. While that many features makes sense for 1000-class classification, for 80-class detection it's probably redundant, as the authors of MNv3 say themselves in the middle of page 7 (bottom of the first column, top of the second).
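The saving from the last-stage change is easy to quantify with a back-of-the-envelope count. Assuming a 7x7 final feature map (as for a 224x224 input; the spatial size here is an illustrative assumption), moving the 320→1280 1x1 convolution to after the global average pool, as the paper describes, shrinks its multiply-accumulate count by the full spatial factor:

```python
# Back-of-the-envelope MAC count for the final 1x1 conv (320 -> 1280 channels).
# Assumes a 7x7 final feature map (224x224 input); numbers are illustrative.
H = W = 7
C_in, C_out = 320, 1280

# conv applied on the full 7x7 feature map
macs_before = H * W * C_in * C_out
# conv applied on the pooled 1x1 feature vector
macs_after = 1 * 1 * C_in * C_out

print(macs_before, macs_after, macs_before // macs_after)
```

The reduction factor is exactly H·W (49 here), which is why this single rearrangement is worth mentioning even though it only touches one layer.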
netanel-sam
  • Yep, the mAP isn't better, and is even worse in MNv3-small's case, but the speed is much higher according to the table, so I'm interested in the reason. I'm currently experimenting with MNv3-small and would like to connect SSD layers to it and compare it with MNv1-SSD for my own detection purpose. – Jefferson Chiu Jul 11 '19 at 07:31
  • I'm currently using MNv1-SSD for my task and would like to replace the backbone network with a new one. I chose to start with MNv3-small over MNv3-large as it's much faster, despite the accuracy drop. – Jefferson Chiu Jul 11 '19 at 07:33