GeoLLMs in action: A systematic review of multimodal models for satellite image captioning and geospatial understanding

Eyinade, John Adeyemi and Ademusire, Adebisi Joseph (2025) GeoLLMs in action: A systematic review of multimodal models for satellite image captioning and geospatial understanding. Open Access Research Journal of Science and Technology, 14 (2). 049-064. ISSN 2782-9960

Abstract

The convergence of satellite imagery and large language models (LLMs) has sparked a new generation of geospatial artificial intelligence systems that can convert complex Earth observation (EO) data into natural language insights. This paper systematically reviews recent developments in multimodal large language models (MLLMs) applied to geospatial tasks, with particular attention to semantic segmentation, satellite image captioning, and spatial question answering. Using a structured review protocol, it examines 42 peer-reviewed studies and preprints published between 2020 and 2025, covering models such as BLIP-2, Kosmos-2, Earth-GPT, GeoPix, and GeoRSMLLM. The review identifies three main architectural patterns in GeoLLMs: retrieval-augmented or symbolic hybrid systems, end-to-end tuned multimodal transformers, and frozen vision encoders with language adapters. Although domain-specific models show encouraging results in topological reasoning and pixel-level instruction following, difficulties remain in geographic generalization, temporal reasoning, and spatial grounding. The field is also still fragmented in its evaluation metrics and benchmark datasets, with little representation of non-Western regions and multilingual contexts. To fill these gaps, the review summarizes emerging research directions, including geometry-aware embeddings, multilingual fine-tuning, and standardized spatial benchmarks. It offers a thorough basis for developing GeoLLMs as interpretable, scalable, and internationally inclusive geospatial intelligence tools.

Item Type: Article
Official URL: https://doi.org/10.53022/oarjst.2025.14.2.0093
Uncontrolled Keywords: GeoLLMs; Multimodal Large Language Models; Satellite Image Captioning; Geospatial Artificial Intelligence (GeoAI); Spatial Reasoning; Remote Sensing and Vision-Language Models
Date Deposited: 01 Sep 2025 14:01
URI: https://eprint.scholarsrepository.com/id/eprint/5400