The proliferation of edge devices has unlocked unprecedented opportunities for deploying deep learning models in computer vision applications. However, these complex models require considerable power, memory, and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by reducing the precision of model weights and activations from 32-bit floating-point to fewer than 8 bits.
DeepliteRT is an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. The work was accepted at the BMVC 2023 conference; you can read the full paper on arXiv.
Computer vision is one of the most exciting and impactful applications of artificial intelligence (AI), enabling machines to see and understand the world around them. However, deploying computer vision models on edge devices such as smartphones, cameras, drones, and robots poses many challenges due to the limited compute, memory, and power budgets of these platforms. How can we make computer vision models faster, smaller, and more energy-efficient without compromising their accuracy?
One promising solution is to use ultra low-bit quantization, a technique that reduces the precision of model weights and activations from 32-bit floating-point (FP32) to less than 8-bit, such as 4-bit, 2-bit, or even 1-bit. This can significantly compress the model size, reduce the memory bandwidth, and improve the computational efficiency of the model inference. However, implementing and deploying ultra low-bit models on edge devices is not trivial, as it requires specialized hardware support, optimized software libraries, and careful tuning of the quantization parameters.
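To make the idea concrete, here is a minimal sketch of symmetric uniform quantization in NumPy. This is an illustrative toy, not DeepliteRT's quantizer: the function name `fake_quantize` and the max-based scale are assumptions chosen for clarity, and real quantization-aware training learns or calibrates these parameters.

```python
import numpy as np

def fake_quantize(w, bits=2):
    # Symmetric uniform quantization to `bits` bits (illustrative sketch,
    # not DeepliteRT's actual scheme).
    qmax = 2 ** (bits - 1) - 1               # e.g. 1 for 2-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax           # map the largest weight to qmax
    codes = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes
    return codes * scale, codes.astype(np.int8), scale # dequantized, codes, scale

w = np.array([0.9, -0.3, 0.05, -0.8], dtype=np.float32)
w_hat, codes, scale = fake_quantize(w, bits=2)
# The 2-bit codes are [1, 0, 0, -1]; small weights collapse to zero,
# which is exactly the accuracy/compression trade-off quantization manages.
```

During quantization-aware training these "fake-quantized" layers still compute in floating point; the compression and speed benefits only materialize once a runtime stores and executes the integer codes directly.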
This is where DeepliteRT comes in. DeepliteRT is a compiler and runtime package for ultra low-bit inference on ARM CPUs, developed by Deeplite, a provider of AI optimization software. DeepliteRT automates the process of converting fake-quantized convolution layers from different machine learning frameworks used for quantization-aware training into ultra low-bit convolution kernels. DeepliteRT also provides an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices, supporting various computer vision tasks such as image classification and object detection.
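DeepliteRT's production kernels are highly optimized for ARM CPUs, but the core idea behind ultra low-bit storage, packing several low-bit weight codes into each machine word, can be sketched in a few lines of Python. The helper names below are hypothetical and do not reflect DeepliteRT's API; the sketch only shows why 2-bit packing cuts memory traffic by 4x versus 8-bit and 16x versus FP32.

```python
import numpy as np

def pack_2bit(codes):
    # Pack unsigned 2-bit codes (values 0..3) four-per-byte, low bits first.
    # Hypothetical helper for illustration, not DeepliteRT's layout.
    assert codes.min() >= 0 and codes.max() <= 3
    padded = np.pad(codes, (0, (-len(codes)) % 4))
    packed = np.zeros(len(padded) // 4, dtype=np.uint8)
    for i in range(4):
        packed |= (padded[i::4] & 0b11).astype(np.uint8) << (2 * i)
    return packed

def unpack_2bit(packed, n):
    # Recover the first n 2-bit codes from the packed bytes.
    out = np.zeros(len(packed) * 4, dtype=np.uint8)
    for i in range(4):
        out[i::4] = (packed >> (2 * i)) & 0b11
    return out[:n]

codes = np.array([3, 1, 0, 2, 1, 3], dtype=np.uint8)
packed = pack_2bit(codes)  # 6 codes fit in 2 bytes instead of 6
assert np.array_equal(unpack_2bit(packed, len(codes)), codes)
```

A real ultra low-bit convolution never unpacks back to bytes like this; instead, optimized kernels operate on the packed words directly with bitwise and popcount-style instructions, which is where the reported speedups come from.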
DeepliteRT achieves impressive performance improvements over existing ultra low-bit methods, outperforming them by up to 4.34x. It also delivers significant end-to-end speedups over optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, achieving up to 2.20x, 2.33x, and 2.17x, respectively. DeepliteRT enables customers to use existing ARM CPUs for computer vision at the edge while delivering GPU-level performance.
To see DeepliteRT in action, check out the E-SMART case study: E-SMART powers speed management for truck fleets with DeepliteRT speed sign detection on an Arm Cortex-A53 CPU, achieving an execution time of 109 ms, 3.77x faster than ONNX-RT.