TrackVLA: Embodied Visual Tracking in the Wild
Shaoan Wang*, Jiazhao Zhang*, Minghan Li, Jiahang Liu,
Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu,
Zhizheng Zhang†, He Wang†
CoRL 2025
TrackVLA is a vision-language-action model that performs object recognition and visual tracking simultaneously. Trained on a dataset of 1.7 million samples, it demonstrates robust tracking, long-horizon tracking, and cross-domain generalization across diverse, challenging environments.