DeepliteRT: Enable Edge Computer Vision with Ultra low-bit Inference

The proliferation of edge devices has unlocked unprecedented opportunities for deep learning model deployment in computer vision applications. However, these complex models require considerable power, memory and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by reducing the precision of model weights and activations from 32 bits to fewer than 8 bits.

DeepliteRT is an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. The paper was accepted at the BMVC 2023 conference, and you can read it in full on arXiv.

Check out our poster presentation video here as our very own Saad Ashfaq walks you through DeepliteRT!

Computer vision is one of the most exciting and impactful applications of artificial intelligence (AI), enabling machines to see and understand the world around them. However, deploying computer vision models on edge devices such as smartphones, cameras, drones, and robots poses many challenges due to the limited resources and constraints of these platforms. How can we make computer vision models faster, smaller, and more energy-efficient without compromising their accuracy?

One promising solution is ultra low-bit quantization, a technique that reduces the precision of model weights and activations from 32-bit floating-point (FP32) to fewer than 8 bits, such as 4-bit, 2-bit, or even 1-bit. This can significantly compress the model, reduce memory bandwidth requirements, and improve the computational efficiency of inference. However, implementing and deploying ultra low-bit models on edge devices is not trivial, as it requires specialized hardware support, optimized software libraries, and careful tuning of the quantization parameters.
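To make the idea concrete, here is a minimal sketch of uniform quantization in plain Python. It is only an illustration of the general technique, not DeepliteRT's quantizer: the function names (`quantize_uniform`, `dequantize`, `compression_ratio`) and the sample weights are made up for this example, and the scheme shown (per-tensor, asymmetric, min/max based) is just one simple assumption among many possible choices.

```python
def quantize_uniform(values, num_bits):
    """Map floats to unsigned integer codes in [0, 2**num_bits - 1].

    Hypothetical helper: per-tensor, asymmetric, min/max-based quantization.
    """
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float values from the integer codes."""
    return [c * scale + lo for c in codes]


def compression_ratio(num_bits, baseline_bits=32):
    """Ideal storage reduction versus the baseline, ignoring scale/offset overhead."""
    return baseline_bits / num_bits


weights = [-1.0, -0.4, 0.1, 0.3, 1.0]  # made-up FP32 weights
codes, scale, lo = quantize_uniform(weights, num_bits=2)
approx = dequantize(codes, scale, lo)

print("2-bit codes:", codes)
print("reconstructed:", approx)
print("ideal compression vs FP32:", compression_ratio(2))
```

With 2 bits, every weight is stored as one of only four codes, so the ideal storage saving over FP32 is 16x. The price is rounding error of up to half a quantization step per value, which is why the quantization parameters need careful tuning.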

This is where DeepliteRT comes in. DeepliteRT is a compiler and runtime package for ultra low-bit inference on ARM CPUs, developed by Deeplite, a provider of AI optimization software. DeepliteRT automates the process of converting fake-quantized convolution layers from different machine learning frameworks used for quantization-aware training into ultra low-bit convolution kernels. DeepliteRT also provides an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices, supporting various computer vision tasks such as image classification and object detection.
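For a rough intuition of what an ultra low-bit kernel computes, the sketch below shows a bit-serial dot product, one common way to evaluate low-bit arithmetic with bitwise operations. This is an assumption chosen for illustration, not a description of DeepliteRT's actual kernels, which are optimized for ARM targets. The helper names (`to_bitplanes`, `bitserial_dot`) and the sample codes are hypothetical, and the codes are assumed to be unsigned integers such as those a quantizer would emit.

```python
def to_bitplanes(codes, num_bits):
    """Pack unsigned codes into one Python int per bit position.

    Bit i of plane b holds bit b of codes[i].
    """
    planes = []
    for b in range(num_bits):
        plane = 0
        for i, c in enumerate(codes):
            plane |= ((c >> b) & 1) << i
        planes.append(plane)
    return planes


def bitserial_dot(w_codes, a_codes, w_bits, a_bits):
    """Dot product of two low-bit vectors using only AND, popcount, and shifts."""
    w_planes = to_bitplanes(w_codes, w_bits)
    a_planes = to_bitplanes(a_codes, a_bits)
    total = 0
    for i, wp in enumerate(w_planes):
        for j, ap in enumerate(a_planes):
            # popcount(wp & ap) counts positions where both bits are set;
            # the shift restores the weight 2**(i + j) of that bit pair.
            total += bin(wp & ap).count("1") << (i + j)
    return total


w = [3, 0, 1, 2]  # made-up 2-bit weight codes
a = [1, 2, 3, 0]  # made-up 2-bit activation codes

reference = sum(x * y for x, y in zip(w, a))
print(bitserial_dot(w, a, 2, 2), reference)
```

The bit-serial result matches the ordinary integer dot product, but each AND plus popcount handles many elements at once when the bit planes are packed into machine words. That packing is where much of the efficiency of low-bit inference on CPUs comes from.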

DeepliteRT outperforms existing ultra low-bit methods by up to 4.34x. It also delivers significant end-to-end speedups of up to 2.20x, 2.33x, and 2.17x over optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, respectively. DeepliteRT enables customers to utilize existing ARM CPUs for computer vision at the edge while delivering GPU-level performance.

To see DeepliteRT in action, check out the E-SMART case study here. E-SMART powers speed management for truck fleets with DeepliteRT speed sign detection on an Arm Cortex-A53 CPU, with an execution time of 109 ms, which is 3.77x faster than ONNX-RT!

DeepliteRT is based on the research paper DeepliteRT: Computer Vision at the Edge, accepted at the British Machine Vision Conference (BMVC) 2023. If you are interested in learning more about DeepliteRT or Deeplite’s AI optimization platform, you can visit us at www.deeplite.ai or email us at info@deeplite.ai!

I hope you enjoyed this blog! Please let me know if you have any questions or feedback. 😊
