Unified AI Multi-modal Chatbot

Chiranjeevi, P and Kalluri, Nagalaxmi and Gurubhagavatula, Sai Saket and Kuncham, Abhishek and Sami, Mohammed (2025) Unified AI Multi-modal Chatbot. World Journal of Advanced Engineering Technology and Sciences, 15 (2). 089-097. ISSN 2582-8266

Abstract

In today’s digital age, we are surrounded by a massive amount of information in different formats: documents, images, and videos. However, making sense of all this data in a meaningful way is still a challenge. This project proposes a unified chatbot system that can understand and interact with content from multiple sources using a multi-modal Retrieval-Augmented Generation (RAG) approach powered by Google’s Gemini-1.5 model. The chatbot allows users to upload PDFs, Word documents, CSV files, images containing text, and even YouTube links. It then extracts key information using techniques such as OCR and video transcription, and lets users ask questions directly about the content. The system's strength is its ability to merge different types of inputs and generate accurate, context-aware answers. The interface is built with Streamlit, offering an easy and interactive user experience with features such as real-time previews, downloadable notes, chat history, and multilingual support. The project reflects the growing need for AI systems that are intelligent, flexible, and capable of understanding information the way humans do, from all angles and in all forms.
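
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes text has already been extracted from each source (PDF parsing, OCR, video transcription), and it swaps in a simple bag-of-words cosine similarity where a real system would likely use learned embeddings and then send the prompt to a model such as Gemini-1.5. The names `Chunk`, `retrieve`, and `build_prompt` are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of extracted text tagged with the kind of source it came from."""
    source: str  # hypothetical labels, e.g. "pdf", "image_ocr", "youtube_transcript"
    text: str


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    # Score every chunk against the question, regardless of its source type,
    # and keep the k best chunks that share at least one term with it.
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(c.text))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(question, chunks, k=2):
    # Merge the retrieved chunks into one context block for the language model.
    context = "\n".join(f"[{c.source}] {c.text}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In this sketch, chunks from a PDF, an OCR'd image, and a video transcript are ranked in a single pool, which is one simple way to merge inputs of different types before generation.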

Item Type:	Article
Official URL:	https://doi.org/10.30574/wjaets.2025.15.2.0513
Uncontrolled Keywords:	Multi-modal Retrieval-Augmented Generation; Gemini-1.5 Language Model; Document and Image Processing; YouTube Transcript Summarization
Date Deposited:	04 Aug 2025 16:20
Related URLs:	https://journalwjaets.com/node/633 https://journalwjaets.com/sites/default/... https://doi.org/10.30574/wjaets.2025.15....
URI:	https://eprint.scholarsrepository.com/id/eprint/3382
