
Does SageMaker Neo (the SageMaker compilation job) use any techniques for model optimization? Are any compression techniques (distillation, quantization, etc.) used to reduce the model size?

I found some description of quantization in the docs (https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html), but it's not clear how it could be used.

Thanks very much for any insight.

ryfeus

1 Answer


Neo optimizes inference through compilation, which is different from, and often orthogonal to, compression:

  • compilation makes inference faster and lighter by specializing the prediction application, notably by (1) changing the environment in which the model runs, in particular replacing training frameworks with the minimal set of math libraries actually needed; (2) optimizing the model graph to be prediction-only and fusing operators where possible; (3) specializing the runtime to make the best use of the specific hardware and instruction sets available on a given target machine. Compilation is not supposed to change the model math, so it doesn't change the model's footprint on disk.

  • compression makes inference faster by removing model weights (pruning) or making them smaller (quantization). Weights can be removed by pruning, which drops weights that have little influence on the results, or by distillation, which trains a small model to mimic a big one. A sketch of pruning and quantization follows this list.
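
For illustration, here is a minimal PyTorch sketch of two of the compression techniques mentioned above (distillation is omitted since it requires a full training loop). The model architecture and the 30% pruning ratio are arbitrary placeholders, not anything Neo requires:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder model standing in for your real network.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Pruning: zero out the 30% of weights with the smallest L1 magnitude
# in each Linear layer, then make the pruning permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights

# Dynamic quantization: store Linear weights as int8 instead of float32,
# shrinking the model and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```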

At the time of this writing, SageMaker Neo is a managed compilation service. That being said, compilation and compression can be combined: you can prune or distill your network before feeding it to Neo, then launch a compilation job as sketched below.
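
Concretely, once the (optionally compressed) model artifact is packaged and uploaded to S3, you launch a Neo job with the CreateCompilationJob API. A minimal boto3 sketch, assuming a PyTorch model archive already in S3 (the job name, bucket, role ARN, and input shape below are placeholders to replace with your own):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="my-neo-job",  # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/model/model.tar.gz",  # placeholder artifact
        # Input name and shape expected by the model, as a JSON string.
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",  # placeholder output
        "TargetDevice": "ml_c5",  # hardware target to compile for
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```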

SageMaker Neo covers a large grid of hardware targets and model architectures, and consequently leverages numerous backends and optimizations. Neo internals are publicly documented in many places.

Olivier Cruchant