Attri, Ashwani and Gudeboyena, Priyanka and Chigurla, Vaishnavi and Moluguri, Soumika and Kasoju, Nithin (2025) Multimodal AI framework for image captioning, story generation and natural speech narration. World Journal of Advanced Research and Reviews, 26 (2). pp. 1037-1044. ISSN 2581-9615
WJARR-2025-1685.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial-ShareAlike.
Abstract
With the increasing ubiquity of digital imagery, there is a growing need for intelligent systems capable of understanding visual content and expressing that understanding in human-like language. This paper presents a comprehensive AI-based pipeline that not only generates captions from images but also constructs vivid stories based on those captions and finally delivers them in a human voice. The proposed system integrates multiple components: a Convolutional Neural Network (VGG16) for extracting visual features, an LSTM-based sequence model for caption generation, GPT-2 for creative story generation, and Google Text-to-Speech (gTTS) for voice synthesis. The result is a multi-modal AI framework capable of transforming static images into rich, spoken narratives. This approach has applications in assistive technologies, interactive storytelling, content automation, and education. The proposed model is trained and evaluated on the Flickr8k dataset, demonstrating a viable path for automated visual storytelling.
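The caption-generation stage described above pairs VGG16 visual features with an LSTM decoder that emits one word at a time. A minimal sketch of the greedy decoding loop at the heart of such a CNN-LSTM captioner is shown below; the `predict_next` callable stands in for the trained LSTM decoder, and its name and interface are illustrative assumptions, not the authors' code:

```python
def generate_caption(predict_next, features,
                     start="<start>", end="<end>", max_len=20):
    """Greedy decoding: repeatedly ask the model for the most likely
    next word given the image features and the words emitted so far,
    stopping at the end token or the length cap."""
    words = [start]
    for _ in range(max_len):
        next_word = predict_next(features, words)
        if next_word == end:
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the start token


# Toy stand-in for the trained CNN-LSTM decoder: ignores the image
# features and emits a fixed word sequence, for demonstration only.
def toy_predict_next(features, words):
    sequence = ["a", "dog", "runs", "<end>"]
    return sequence[len(words) - 1]


caption = generate_caption(toy_predict_next, features=None)
print(caption)  # → a dog runs
```

In the full pipeline, the resulting caption would then be passed as a prompt to GPT-2 for story generation and the story text handed to gTTS for speech synthesis.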
| Field | Value |
|---|---|
| Item Type | Article |
| Official URL | https://doi.org/10.30574/wjarr.2025.26.2.1685 |
| Uncontrolled Keywords | Image Captioning; CNN-LSTM; VGG16; GPT-2; Text-to-Speech (gTTS); Image-to-Story Generation; Natural Language Processing (NLP) |
| Depositing User | Editor WJARR |
| Date Deposited | 20 Aug 2025 10:46 |
| URI | https://eprint.scholarsrepository.com/id/eprint/2742 |