Srivastava, Prashant Anand (2025) Innovations in visual language models for robotic interaction and contextual awareness: Progress, pitfalls and perspectives. World Journal of Advanced Engineering Technology and Sciences, 15 (1). pp. 1145-1152. ISSN 2582-8266
WJAETS-2025-0311.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial Share Alike.
Abstract
Vision‑Language Models (VLMs) promise to bridge visual perception and natural language for truly intuitive robotic interaction, yet their real‑world robustness remains underexplored. In this paper, we quantitatively evaluate state‑of‑the‑art VLM performance—showing VLM‑RT achieves 96.8% reasoning accuracy at 18.2 FPS but suffers dramatic degradation (94.3% → 37.8% accuracy) under variable lighting and a 48.4‑point recognition gap between Western and East Asian objects. We introduce a concise failure‑mode analysis that links these deficits to core root causes (environmental variability, distributional bias, multimodal misalignment) and map each to practical mitigation strategies. Building on this foundation, we propose a prioritized research roadmap—human‑in‑the‑loop systems, continual learning, and embodied intelligence—and define standardized metrics for fairness, privacy containment, and safety verification. Together, these contributions offer actionable benchmarks to guide the development of robust, trustworthy VLM‑powered robots.
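The abstract reports robustness and fairness figures as simple accuracy deltas (e.g., the lighting-induced drop from 94.3% to 37.8% and the 48.4-point cross-cultural recognition gap). Below is a minimal, hypothetical sketch of how such metrics could be computed from per-condition evaluation counts; all names and the toy numbers are illustrative assumptions, not code or data from the paper.

```python
# Hypothetical sketch of the robustness/fairness metrics described in the abstract.
# All identifiers and numbers are illustrative, not taken from the paper.
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Per-condition recognition results (count of correct predictions out of total)."""
    condition: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def degradation_points(baseline: EvalResult, stressed: EvalResult) -> float:
    """Accuracy drop, in percentage points, between a baseline condition
    (e.g., controlled lighting) and a stressed condition (e.g., variable lighting)."""
    return 100.0 * (baseline.accuracy - stressed.accuracy)


def recognition_gap(group_a: EvalResult, group_b: EvalResult) -> float:
    """Absolute accuracy gap, in percentage points, between two object groups
    (e.g., Western vs. East Asian household objects)."""
    return abs(100.0 * (group_a.accuracy - group_b.accuracy))


if __name__ == "__main__":
    # Toy counts chosen only to mirror the magnitudes reported in the abstract.
    controlled = EvalResult("controlled lighting", correct=943, total=1000)
    variable = EvalResult("variable lighting", correct=378, total=1000)
    western = EvalResult("Western objects", correct=912, total=1000)
    east_asian = EvalResult("East Asian objects", correct=428, total=1000)

    print(f"Lighting degradation: {degradation_points(controlled, variable):.1f} points")
    print(f"Cross-cultural recognition gap: {recognition_gap(western, east_asian):.1f} points")
```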
| Field | Value |
|---|---|
| Item Type | Article |
| Official URL | https://doi.org/10.30574/wjaets.2025.15.1.0311 |
| Uncontrolled Keywords | Multimodal Representation; Zero-Shot Generalization; Embodied Cognition; Distributional Bias; Human-Robot Collaboration |
| Depositing User | Editor Engineering Section |
| Date Deposited | 04 Aug 2025 16:09 |
| URI | https://eprint.scholarsrepository.com/id/eprint/2890 |