43 MIN READ

DNN Model Optimization Series: Part III - Achieve up to 30x DNN Model Compression (while maintaining accuracy!)

Welcome back to the DNN model optimization series! Our recent paper on Deeplite’s unique optimization software, Neutrino™, won the “Best Deployed System” award at Innovative Applications of Artificial Intelligence 2021. For this occasion, we decided to dedicate this issue of the DNN model optimization series to a deep-dive into the champion research paper.

AI and deep learning, particularly for edge devices, has been gaining momentum and promising to empower the digitalization of many vertical markets. Despite all the excitement, deploying deep learning is easier said than done! We already looked at why you need to optimize your DNN models in one of our previous blog posts. Let’s look at ‘how’ Deeplite’s Neutrino™ framework addresses the ever-larger AI models’ optimization and paves the way for easy deployments into increasingly constrained edge devices.

Neutrino™ is an end-to-end black-box platform for fully automated DNN model compression that maintains the model accuracy. It can be seamlessly and smoothly integrated into any development and deployment pipeline, as shown in the figure below. Guided by the end-user constraints and requirements, Neutrino™ produces the optimized model that can be further used for inference, either directly deployed on the edge device or in a cloud environment.

Model Version Update

 

Deeplite’s Unique Way to Compress DNN Models

What will you need to get started?

  1. Your pre-trained DNN model, M,
  2. The actual train-test data split you used to train the model, DLtrain and DLtest, and
  3. The following set of requirements to guide the optimization:
  • Delta: The acceptable tolerance of accuracy drop with respect to the original model, for example, 1%.
  • Stage: The two different compression stages, while stage 1 is less intensive compression requiring fewer computational resources, stage 2 provides more aggressive compression using more resources and time.
  • Device: Perform the entire optimization and model inference in either CPU, GPU, or multi-GPU (distributed GPU environment).
  • Modularity: The end-user can customize multiple parts of the optimization process for Neutrino™ to adapt to more complex scenarios. This customization includes specialized data-loaders, custom backpropagation optimizers, and intricate loss functions that their native library implementation allows.

The algorithm that we used to understand, analyze, and compress the DNN model automatically for a given dataset is explained below.

Neutrino algorithm workflow

 

The Four Components of Neutrino™

  1. Neutrino™ Model Zoo: To ease the use for the end-user, a collection of popular DNN architectures with trained weights on different benchmark datasets are provided as Neutrino™ Zoo.
  2. Conductor: The purpose of the conductor is to collect all the provided inputs, understand the given requirements, and orchestrate the entire optimization pipeline, accordingly.
  3. Exploration: It’s a high-level coarse compression stage, where the composer selects different transformation functions for the different layers in the DNN model.
  4. Annealing: It’s a fine-grained aggressive compression stage to obtain the maximum possible compression in the required tolerance of accuracy.

The Process

In one of our blog posts, we already answered the top 10 questions about the model compression process. To briefly explain, at Deeplite, we dynamically combine different optimization methods for different layers in a model to create a beautiful result, achieving maximum compression with minimal accuracy loss (and when we say minimal, we mean it! Look at the outcomes below).

When the model compression is just right

Let’s look at an example here. Our pre-trained model has N optimizable layers: {L1, L2, ..., LN}. In a typical CNN model, the convolutional layers and the fully connected layers are optimizable, while the rest of the layers are ignored from the optimization process. The conductor analyzes the training data size, the number of output classes, model architecture, and optimization criteria, delta, and produces a composed list, CL = {c1, c2, ..., cN} of different transformation functions for different layers in the model. For any layer, Li, in the model, the composed function ci looks like this:

composed function ci for any layer Li in the pretrained DNN model

 

The challenge is to find an ideal size r that produces good compression retaining the model’s robustness. When r is equal to the actual size of the weight tensor of the layer, r = size(Wi), there is an over-approximation of the transformation with very low compression. However, a very small size, r → 0, produces a high compression with a lossy re-construction of the transformation.

Stage 2 optimization aims to perform aggressive compression and obtain the maximum possible compression in the required accuracy tolerance. For example, if the delta of accuracy is 1%, and stage 1 produces a 4x compression with an accuracy drop of 0.6%, stage 2 aims to advance the compression with the delta going as close as possible to 1%.

 

How Do We Measure the Optimization and Compression Performance?

Here are the metrics we use to measure the amount of optimization and performance of Neutrino™.

Accuracy: We measure the top-1 accuracy (%) of the model. Successful optimization retains the accuracy of the original model.

Model Size: We measure the disk size (MB) occupied by the trainable parameters of the model. Smaller model size enables models to be deployed into devices with memory constraints.

MACs: We measure the model’s computational complexity by the number (billions) of Multiply-Accumulate Operation (MAC) computed across the layers of the model. The lower the number of MACs, the better optimized the model is.

Number of Parameters: We measure the total number (millions) of trainable parameters (weights and biases) in the model. Optimization aims to reduce the number of parameters.

Memory Footprint: We measure the total memory (MB) required to perform the inference on a batch of data, including the memory required by the trainable parameters and the layer activations. A lower memory footprint is achieved by better optimization.

Execution Time: We measure the time (ms) required to perform a forward pass on a batch of data. Optimized models have a lower execution time.

 

DNN Model Compression Results Using Neutrino™

All the optimization experiments are run with an end-user requirement of accuracy delta of 1%. The experiments are executed with a mini-batch size of 1024, while the metrics are normalized for a mini-batch size of 1. All the experiments are run on four parallel GPU, using Horovod, and each GPU is a Tesla V100 SXM2 with 32GB memory.

Arch

Model

Accuracy (%)

Size (MB)

FLOPs (Billions)

#Params (Millions)

Memory (MB)

Execution Time (ms)

Resnet 18

Original

76.8295

42.8014

0.5567

11.2201

48.4389

0.0594

Stage 1

76.7871

7.5261

0.1824

1.9729

15.3928

0.0494

Stage 2

75.8008

3.4695

0.0790

0.9095

10.3965

0.0376

Enhance

-0.9300x

12.34x

7.05x

12.34x

4.66x

1.58x

Resnet50

Original

78.0657

90.4284

1.3049

23.7053

123.5033

3.9926

Stage 1

78.7402

25.5877

0.6852

6.7077

65.2365

0.2444

Stage 2

77.1680

8.4982

0.2067

2.2278

43.7232

0.1772

Enhance

-0.9400

10.64x

6.31x

10.64x

2.82x

1.49x

VGG19

Original

72.3794

76.6246

0.3995

20.0867

80.2270

1.4238

Stage 1

71.5918

3.3216

0.0631

0.8707

7.5440

0.0278

Stage 2

71.6602

2.6226

0.0479<

0.6875

6.7399

0.0263

Enhance

-0.8300

29.22x

8.34x

29.22x

11.90x

1.67x

 

Arch

Model

Accuracy (%)

Size (MB)

FLOPs (Billions)

#Params (Millions)

Memory (MB)

Execution Time (ms)

DenseNet121

Original

78.4612

26.8881

0.8982

7.0485

66.1506

10.7240

Stage 1

79.0348

15.7624

0.5477

4.132

61.8052

0.2814

Stage 2

77.8085

6.4246

0.1917

1.6842

48.3280

0.2372

Enhance

-0.6500

4.19x

4.69x

4.19x

1.37x

1.17x

GoogleNet

Original

79.3513

23.8743

1.5341

6.2585

64.5977

5.7186

Stage 1

79.4922

12.6389

0.8606

3.3132

62.1568

0.2856

Stage 2

78.8086

6.1083

0.3860

1.6013

51.3652

0.2188

Enhance

-0.4900

3.91x

3.97x

3.91x

1.26x

1.28x

MobileNet v1

Original

66.8414

12.6246

0.0473

3.3095

16.6215

1.8147

Stage 1

66.4355

6.4211

0.0286

1.6833

10.5500

0.0306

Stage 2

66.6211

3.2878

0.0170

0.8619

7.3447

0.0286

Enhance

-0.4000

3.84x

2.78x

3.84x

2.26x

1.13x

 

Arch: Resnet18

Dataset

Model

Accuracy (%)

Size (MB)

FLOPs (Billions)

#Params (Millions)

Memory (MB)

Execution Time (ms)

Imagenet16

Original

94.4970

42.6663

1.8217

11.1847

74.6332

0.2158

Stage 1

93.8179

3.3724

0.5155

0.8840

41.0819

0.1606

Stage 2

93.6220

1.8220

0.3206

0.4776

37.4608

0.1341

Enhance

-0.8800

23.42x

5.68x

23.42x

1.99x

1.61x

VWW (Visual Wake Words)

Original

93.5995

42.6389

1.8217

11.1775

74.6057

0.2149

Stage 1

93.8179

3.3524

0.4014

0.8788

39.8382

0.1445

Stage 2

92.6220

1.8309

0.2672

0.4800

36.6682

0.1296

Enhance

-0.9800

23.29x

6.82x

23.29x

2.03x

1.66x

 

Depending on the architecture of the original model, it can be observed that the model size could be compressed anywhere between ∼3x to ∼30x. VGG19 is known to be one of the highly over parameterized CNN models. As expected, it achieved a 29.22x reduction in the number of parameters with almost ∼12x compression in the overall memory footprint and ∼an 8.3x reduction in computation complexity. The resulting VGG19 model occupies only 2.6MB as compared to the original model requiring 76.6MB.

The performance of Neutrino™ on large scale vision datasets produces around ∼23.5x compression of ResNet18 on Imagenet16 and VWW datasets. The optimized model requires only 1.8MB as compared to the 42.6MB needed for the original model.

 

Stage 1 vs. Stage 2 Compression

Crucially, it can be observed that Stage 2 compresses the model at least 2x more than Stage 1 compression. The overall time taken for optimization by Neutrino™, including Stage 1 and Stage 2, is shown in the figure below. It can be observed that most of the models could be optimized in less than ∼2 hours. Complex architectures with longer training times, such as Resnet50 and DenseNet121, take around ∼6 hours and ∼13 hours for optimization.

Table comparison - time taken for DNN model optimization

 

The comparison between the time taken for Stage 1 and Stage 2 compression is visually shown in the below figure. It can be observed that almost 60−70% of the overall optimization is achieved in Stage 2, while Stage 1 consumes less than ∼40% of the overall time required. This differentiation acts as a key feature of Neutrino™, where end-users who need quick optimization with less resource consumption can choose Stage 1. In contrast, those needing aggressive optimization can choose Stage 2 optimization.

Table comparison - time vs compression in DNN model optimization

 

Production Results of Neutrino™

Deeplite’s Neutrino™ is deployed in various production environments, and the results of the various use-cases obtained are summarized in the table below. To showcase the efficiency of Neutrino™’s performance, the optimization results are compared with other popular frameworks in the market, such as (i) Microsoft’s Neural Network Interface (NNI), Intel’s Neural Network Distiller, and (iii) Tensorflow Lite Micro. It can be observed that Neutrino™ consistently outperforms the competitors by achieving higher compression with better accuracy.

 

Use Case

Arch

Model

Accuracy (%)

Size (Bytes)

FLOPs (Millions)

#Params (Millions)

Andes Technology Inc.

MobileNetv1(VWW)

Original

88.1

12,836,104

105.7

3.2085

Neutrino™

87.6

188,000

24.6

0.1900

TensorFlow LM

84.0

860,000

-

0.2134

Prod #1

MobilNetV2-0.35x (Imagenet Small)

Original

80.9

1,637,076

66.50

0.4094

Neutrino™

80.4

675,200

50.90

0.1688

Intel Distiller

80.4

1,637,076

66.50

0.2562

Microsoft NNI

77.4

1,140,208

52.80

0.2851

Prod #2

MobileNet v2-1.0x (Imagenet Small)

Original

90.9

8,951,804

312.8

2.2367

Neutrino™

82.0

1,701,864

134.0

0.4254

Intel Distiller

82.0

8,951,804

312.86

0.2983

 

How to Use Neutrino™?

The Neutrino™ framework is completely automated and can optimize any convolutional neural network (CNN) based architecture with no human intervention. Neutrino™ framework is distributed as Python PyPI library or Docker container, with production support for PyTorch and early developmental support for TensorFlow/Keras.

That’s it for now, folks! Remember, compression is a critical first step in making deep learning more accessible for engineers and end-users on edge devices. Unlocking highly compact, highly accurate intelligence on devices we use every day, from phones to cars and coffee makers, will have an unprecedented impact on how we use and shape new technologies.

Do you have a question or want us to look into other AI model optimization topics? Leave a comment below. We’ll be happy to follow up!

Interested in signing up for a demo of Neutrino™? Please refer to this link.

 

Meet the blog authors:


Anush Sankaran 150px

Anush Sankaran
Senior Research Scientist at Deeplite
Medium.com: @anush.sankaran
Twitter: @goodboyanush


Anastasia Hamel-150px

Anastasia Hamel
Digital Marketing Manager at Deeplite
Medium.com: @anastasia.hamel
Twitter: @anastasia_hamel

 

Read On