This blog is a summary of the research paper Sah, Sudhakar, et al., "Token Pruning using a Lightweight Background Aware Vision Transformer," accepted at the NeurIPS 2024 FITML Workshop.
In computer vision, Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs), matching or outperforming them on tasks such as object detection, segmentation, and classification. However, their high computational requirements pose significant challenges, especially for deployment on edge devices with limited memory and processing power. To address this, we introduced the Background Aware Vision Transformer (BAViT).
The Challenge of High Computational Demand
ViTs process images by dividing them into smaller patches, known as tokens. As image resolution increases, so does the number of tokens, and because self-attention compares every pair of tokens, the compute cost grows roughly quadratically with the token count, reducing throughput. This becomes particularly problematic for edge devices, which are often constrained by limited resources. To mitigate this, token pruning has been explored: less important tokens are removed so that the remaining ones can be processed more efficiently.
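To make that scaling concrete, here is a quick back-of-the-envelope calculation in Python (the 16x16 patch size and the image resolutions are illustrative defaults, not values from the paper):

```python
# Back-of-the-envelope token count for a ViT with 16x16 patches.
# Patch size and resolutions are illustrative, not specific to BAViT.
patch = 16
for side in (224, 640, 1280):
    tokens = (side // patch) ** 2   # one token per non-overlapping patch
    pairs = tokens ** 2             # self-attention compares every token pair
    print(f"{side}x{side} px -> {tokens:5d} tokens, {pairs:>12,} attention pairs")

# 224x224 -> 196 tokens; 640x640 -> 1600 tokens,
# i.e. roughly 66x the attention cost of the 224x224 case.
```

This is why pruning even a modest fraction of tokens at high resolution translates into a noticeable throughput gain.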
Introducing BAViT
The BAViT model is designed to identify and prune background tokens, thereby reducing the number of tokens processed by the ViT. This is achieved through a novel pre-processing block that classifies tokens as either foreground (FG) or background (BG) using semantic information from segmentation maps or bounding box annotations. By training a few layers of the ViT to distinguish between FG and BG tokens, BAViT can effectively prune unnecessary background tokens before they are fed into the object detection model.
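The paper describes this block at a high level; the PyTorch sketch below is our own illustration of how such a foreground/background token filter could look. The class name, layer sizes, and `keep_threshold` parameter are assumptions for the sketch, not the released BAViT code.

```python
import torch
import torch.nn as nn

class BackgroundTokenFilter(nn.Module):
    """Illustrative sketch of a BAViT-style pre-processing block: a couple of
    transformer layers score each patch token as foreground (FG) or background
    (BG), and BG tokens are dropped before the detector ever sees them.
    Names and hyperparameters here are assumptions, not the paper's code."""

    def __init__(self, dim=192, num_layers=2, num_heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fg_head = nn.Linear(dim, 1)  # per-token foreground logit

    def forward(self, tokens, keep_threshold=0.5):
        # tokens: (batch, num_tokens, dim) patch embeddings
        scores = torch.sigmoid(self.fg_head(self.encoder(tokens))).squeeze(-1)
        keep = scores > keep_threshold   # boolean mask of "foreground" tokens
        # Batch size 1 shown for simplicity; each image keeps a different count.
        pruned = tokens[:, keep[0], :]
        return pruned, keep

# Usage: a 640x640 image with 16x16 patches yields 1600 tokens of width 192.
tokens = torch.randn(1, 1600, 192)
pruned, mask = BackgroundTokenFilter()(tokens)
print(tokens.shape, "->", pruned.shape)
```

The key design point is that the filter itself is tiny (two transformer layers in this sketch), so the cost of scoring tokens is far smaller than the cost the detector would have spent attending over the background tokens it discards.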
Figure: Comparison of background token identification results between Sparse DETR, Focus-DETR, and BAViT.
Methodology & Results
We trained BAViT on the PASCAL VOC and COCO datasets: a two-layer BAViT model classified tokens with 75% accuracy on VOC and 71% on COCO. When integrated with YOLOS, a ViT-based object detection model, BAViT increased throughput by up to 40% with only a small drop in mean Average Precision (mAP).
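Training such a token classifier requires per-token FG/BG labels, which can be derived from the dataset's existing annotations. Below is a minimal sketch for the bounding-box case; the function name, grid size, and labeling rule (a patch is foreground if it overlaps any box) are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def token_labels_from_boxes(boxes, image_size=640, patch_size=16):
    """Mark a patch token as foreground (1) if it overlaps any ground-truth
    box, otherwise background (0). Boxes are (x1, y1, x2, y2) in pixels."""
    grid = image_size // patch_size                  # e.g. 640 / 16 = 40
    labels = np.zeros((grid, grid), dtype=np.int64)
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1) // patch_size, int(y1) // patch_size
        c2 = min(int(x2) // patch_size, grid - 1)
        r2 = min(int(y2) // patch_size, grid - 1)
        labels[r1:r2 + 1, c1:c2 + 1] = 1             # rows = y, cols = x
    return labels.reshape(-1)                        # one label per token

# Example: one object roughly in the image centre -> ~10% foreground tokens
labels = token_labels_from_boxes([(200, 240, 420, 400)])
print(labels.sum(), "of", labels.size, "tokens are foreground")
```

Segmentation maps can be used the same way, simply by checking whether any labeled foreground pixel falls inside a patch.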
One of the key innovations of BAViT is its lightweight design, making it suitable for edge AI applications. Unlike other token pruning methods that rely on heavy CNN backbones, BAViT uses a small ViT model, ensuring it remains computationally efficient. This approach not only enhances the performance of ViTs on edge devices but also maintains a good balance between latency and accuracy.
Practical Implications
The practical implications of BAViT are substantial. By reducing the computational load, BAViT enables the deployment of advanced Vision Transformer models on edge devices, opening up new possibilities for real-time applications in areas such as autonomous vehicles, robotics, and mobile devices. The ability to process high-resolution images efficiently without compromising on accuracy is a significant step forward in the field of computer vision.
Conclusion
The introduction of BAViT marks a promising advancement in the optimization of Vision Transformers for edge devices. By leveraging background token pruning, BAViT addresses the critical challenge of high computational demand, paving the way for more efficient and practical applications of ViTs in resource-constrained environments. As the field of computer vision continues to evolve, innovations like BAViT will play a crucial role in making advanced technologies more accessible and effective across various domains.
Interested in finding out more about BAViT for your edge AI application?
Contact us at info@deeplite.com!