Distilling Grounding DINO for an Edge-Cloud Collaborative Advanced Driver Assistance System

Grounding DINO (GDINO) has strong potential for zero-shot detection and data annotation, but its use is limited by high computational cost. Conversely, YOLOX supports real-time detection but struggles in complex scenes. To address this trade-off, we propose an edge-cloud collaborative framework for an Advanced Driver Assistance System (ADAS) that enhances real-time detector performance on edge devices by leveraging the robust capabilities of cloud-based multimodal detectors to improve perception in complex environments.

Our framework consists of cloud and edge components. On the cloud side, we propose a distillation method for multimodal object detectors, referred to as MMKD, to optimize the performance of GDINO. Specifically, we use a two-stage distillation strategy comprising Cross-modal Listwise Distillation (CLD) and Risk-focused Pseudo-label Distillation (RPLD). With MMKD, we successfully deploy the GDINO model to the cloud, achieving a 1.4% improvement in average precision (AP) and a 1.7× increase in inference speed.
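As a rough illustration of what a listwise distillation objective can look like, the sketch below computes a ListNet-style loss: teacher and student scores over the same candidate list are turned into distributions with a softmax, and the student is penalized by its KL divergence from the teacher. This is a generic formulation and not the CLD loss itself, whose exact definition is not given here; the function names and the `temperature` parameter are assumptions made for illustration.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Return log-probabilities of a softmax over `scores` (numerically stable)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def listwise_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between the softmax distributions of two score lists.

    Hypothetical helper: a generic listwise objective, not the paper's CLD.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student score lists must have equal length")
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    # Sum over candidates of p_i * (log p_i - log q_i).
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # e.g. teacher similarity scores for three candidates
    student = [1.5, 1.2, 0.3]   # student scores for the same candidates
    print(listwise_distillation_loss(teacher, student))
```

Because the loss compares whole distributions rather than individual scores, it is unchanged when a constant is added to every student score; only the relative ordering and spacing of the candidates matter.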

On the edge side, leveraging this streamlined version of GDINO, we propose an ADAS data engine to construct a 1.5-million-scale GDINO-based Dataset for ADAS, named GDDA1.5M. Building on YOLOX-Lite, we develop a lightweight object detector that is optimized for ADAS applications on edge devices through pruning and architectural refinements. With the GDDA1.5M dataset and the RPLD training strategy, the model achieves a 7.5% improvement in AP, substantially surpassing its counterparts trained on 300K manually labeled images. When the YOLOX-Lite detector is deployed on edge devices within our proposed edge-cloud collaborative framework, it achieves an inference latency of 18 milliseconds on the Horizon X3E chip, while the cloud-based distilled model operates efficiently in complex environments.
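A data engine of this kind typically turns teacher detections into training labels by filtering on confidence. The sketch below shows one minimal way to do that, assuming each detection carries a class label, a score, and a box. The `Detection` type, the `select_pseudo_labels` function, and the threshold values are all hypothetical; they are not taken from the GDDA1.5M pipeline or from RPLD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One teacher prediction (hypothetical structure)."""
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def select_pseudo_labels(detections, default_threshold=0.5, class_thresholds=None):
    """Keep detections whose score meets the threshold for their class.

    `class_thresholds` optionally overrides the default per class, for example
    to retain more candidates for a safety-critical class (an assumption here).
    """
    thresholds = class_thresholds or {}
    return [
        d for d in detections
        if d.score >= thresholds.get(d.label, default_threshold)
    ]


if __name__ == "__main__":
    teacher_output = [
        Detection("car", 0.92, (10, 20, 110, 90)),
        Detection("pedestrian", 0.41, (200, 40, 230, 120)),
        Detection("car", 0.30, (300, 60, 380, 110)),
    ]
    kept = select_pseudo_labels(teacher_output, 0.5, {"pedestrian": 0.35})
    for det in kept:
        print(det.label, det.score)
```

A per-class threshold is only one possible design; a real pipeline might also weight the retained labels by confidence instead of applying a hard cut.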

Visualization of the distilled GDINO model. The performance is notably stable, with accurate object detection and clear visual representation of key features.

The edge-cloud collaborative framework.

Examples from the GDDA1.5M dataset.