MITracker: Multi-View Integration for Visual Object Tracking

Mengjie Xu1*, Yitao Zhu1*, Haotian Jiang1, Jiaming Li1,
Zhenrong Shen2, Sheng Wang1,2, Haolin Huang1, Xinyu Wang1, Qing Yang1,3, Han Zhang1,3, Qian Wang1,3†
1School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, 2School of Biomedical Engineering, Shanghai Jiao Tong University, 3Shanghai Clinical Research and Trial Center
* Equal contribution     † Corresponding author
MITracker demo on multi-view videos from the MVTrack dataset.

Abstract

Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in videos of arbitrary length captured from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it onto a bird's eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from the fused 3D feature volume to refine the tracking results in each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance.
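To make the first mechanism concrete, below is a minimal PyTorch sketch of lifting per-view 2D features into a shared 3D volume and collapsing it onto a BEV plane. This is our illustrative reading of the description above, not the released implementation; the grid size, scene extent, max-pooling over height, and projection convention are all assumptions.

```python
import torch
import torch.nn.functional as F

def lift_to_bev(feats, proj_mats, grid_size=(64, 64, 16), extent=4.0):
    """feats: (V, C, H, W) per-view 2D features; proj_mats: (V, 3, 4)
    camera projection matrices (intrinsics @ extrinsics).
    Returns a (C, X, Y) bird's eye view feature map."""
    V, C, H, W = feats.shape
    X, Y, Z = grid_size
    dev = feats.device
    # Regular 3D grid of world-space points covering the tracked scene.
    xs = torch.linspace(-extent, extent, X, device=dev)
    ys = torch.linspace(-extent, extent, Y, device=dev)
    zs = torch.linspace(0.0, extent, Z, device=dev)
    gx, gy, gz = torch.meshgrid(xs, ys, zs, indexing="ij")
    pts = torch.stack([gx, gy, gz, torch.ones_like(gx)], dim=-1)  # (X, Y, Z, 4)
    pts = pts.reshape(-1, 4).T                                    # (4, N)

    volume = feats.new_zeros(C, X * Y * Z)
    weight = feats.new_zeros(1, X * Y * Z)
    for v in range(V):
        uvz = proj_mats[v] @ pts                      # (3, N) homogeneous pixels
        z = uvz[2].clamp(min=1e-6)
        u, w = uvz[0] / z, uvz[1] / z
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([u / (W - 1) * 2 - 1, w / (H - 1) * 2 - 1], dim=-1)
        sampled = F.grid_sample(feats[v:v + 1], grid.view(1, 1, -1, 2),
                                align_corners=True)   # (1, C, 1, N)
        # Keep only points that land inside the image and in front of the camera.
        valid = ((grid.abs() <= 1).all(dim=-1) & (uvz[2] > 0)).float()
        volume += sampled[0, :, 0] * valid
        weight += valid
    volume = (volume / weight.clamp(min=1)).reshape(C, X, Y, Z)
    return volume.max(dim=-1).values  # collapse the height axis onto the BEV plane
```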
Architecture
Overview of the MITracker architecture.
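The second mechanism, geometry-guided refinement, could be sketched as a cross-attention in which each view's search-region tokens query the fused BEV representation. Again, this is a hedged approximation rather than the paper's exact design; the module name, token layout, and the use of nn.MultiheadAttention are our assumptions.

```python
import torch
import torch.nn as nn

class BEVGuidedRefiner(nn.Module):
    """Each view's search-region tokens attend to tokens from the fused
    BEV feature map, injecting cross-view geometric evidence."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens, bev_tokens):
        # view_tokens: (B, Nv, C) flattened features of one camera view
        # bev_tokens:  (B, Nb, C) flattened fused BEV features
        refined, _ = self.attn(query=view_tokens, key=bev_tokens, value=bev_tokens)
        return self.norm(view_tokens + refined)  # residual connection + norm

# Toy usage with random tensors:
refiner = BEVGuidedRefiner()
out = refiner(torch.randn(2, 1024, 256), torch.randn(2, 4096, 256))
```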

MVTrack Dataset

The MVTrack dataset is designed to fill gaps in the field of MVOT. Compared to single-view datasets, it maintains competitive class diversity while adding multi-view capability. Compared to existing MVOT datasets, it provides significantly richer object categories and more videos with practical camera setups. MVTrack is the only dataset that combines multi-view tracking, rich object categories, target-absence annotations, and camera calibration information.
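For illustration, a per-frame annotation in such a dataset might bundle the box, an absence flag, and the calibration, as sketched below. This schema is hypothetical; the actual MVTrack file format is not specified on this page, and all field names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FrameAnnotation:
    view_id: int                                    # which synchronized camera
    bbox_xywh: Optional[Tuple[int, int, int, int]]  # None when the target is absent
    absent: bool                                    # absence label (out of view / fully occluded)
    intrinsics: List[List[float]]                   # 3x3 camera matrix K (calibration)
    extrinsics: List[List[float]]                   # 3x4 [R|t] world-to-camera pose
```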


(a) umbrella1-1: Deformation, Aspect Ratio Change and Scale Variation.


(b) phone3-3: Low Resolution, Full Occlusion and Partial Occlusion.


(c) tenis5-1: Out of View, Motion Blur and Background Clutter.

Example sequences and their corresponding tracking attributes in the MVTrack dataset.

Quantitative Results

Below are the results on the MVTrack dataset. MITracker produces multi-view tracking results, whereas the other methods are single-view trackers evaluated on each view individually.

Comparison with SOTA methods. Methods marked with * are zero-shot; the others are supervised.

Model              AUC (%)    PNorm (%)   P (%)
DiMP               43.14      59.52       53.13
PrDiMP             48.61      66.09       58.93
MixFormer          57.59      75.44       67.72
OSTrack            60.04      77.72       70.06
GRM                52.53      69.91       62.31
SeqTrack           58.37      76.63       69.03
ARTrack            53.23      70.25       62.49
HIPTrack           60.45      78.92       70.53
EVPTrack           61.37      79.76       71.97
AQATrack           61.93 🥉   80.00 🥉    72.69 🥉
ODTrack            63.36 🥈   82.25 🥈    74.46 🥈
SAM2*              46.49      63.12       56.82
SAM2Long*          55.30      72.84       67.40
MITracker (ours)   71.13 🥇   91.87 🥇    83.95 🥇
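For reference, the three metrics reported above can be computed as in the following sketch, assuming the standard one-pass-evaluation conventions (success AUC over IoU thresholds, precision at a 20-pixel center error, and normalized precision over thresholds up to 0.5). Frames where the target is absent would need to be filtered out beforehand.

```python
import numpy as np

def iou(a, b):
    """a, b: (N, 4) boxes as (x, y, w, h)."""
    x1, y1 = np.maximum(a[:, 0], b[:, 0]), np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter)

def evaluate(pred, gt):
    """pred, gt: (N, 4) per-frame boxes. Returns (AUC, PNorm, P) in percent."""
    centers_p = pred[:, :2] + pred[:, 2:] / 2
    centers_g = gt[:, :2] + gt[:, 2:] / 2
    dist = np.linalg.norm(centers_p - centers_g, axis=1)
    # Normalize the center error by the ground-truth box size for PNorm.
    norm_dist = np.linalg.norm((centers_p - centers_g) / gt[:, 2:], axis=1)
    overlaps = iou(pred, gt)
    auc = np.mean([np.mean(overlaps > t) for t in np.linspace(0, 1, 21)])
    p = np.mean(dist <= 20)  # conventional 20-pixel threshold
    p_norm = np.mean([np.mean(norm_dist <= t) for t in np.linspace(0, 0.5, 51)])
    return auc * 100, p_norm * 100, p * 100
```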

The numbers in the legend indicate each method's recovery rate within 10 frames after the target disappears.
Maximum tracking duration (in frames) and the number of restarts triggered when the target is lost for more than 10 frames.
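As a rough illustration of the recovery-rate statistic in the first figure, the sketch below counts the fraction of disappearances after which the tracker re-acquires the target within 10 frames of its reappearance. This is one plausible reading; the paper's exact definition may differ, and the visible/tracked_ok inputs are assumptions.

```python
import numpy as np

def recovery_rate(visible, tracked_ok, window=10):
    """visible: (T,) bool, target in view; tracked_ok: (T,) bool, tracker on target."""
    visible = np.asarray(visible, dtype=bool)
    tracked_ok = np.asarray(tracked_ok, dtype=bool)
    events, recovered = 0, 0
    t, T = 0, len(visible)
    while t < T:
        if not visible[t]:                 # target disappears
            while t < T and not visible[t]:
                t += 1                     # skip the absent interval
            if t < T:                      # target reappeared
                events += 1
                if tracked_ok[t:t + window].any():
                    recovered += 1         # re-acquired within the window
        else:
            t += 1
    return recovered / max(events, 1)
```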