ICL CHARACTERIZATION OF MULTI-MODAL GEO-FOUNDATION MODELS: WHEN CAN VISION-LANGUAGE TRANSFORMERS LEARN GEOSPATIAL TASKS?
Mosab Hawarey
Director, Geospatial Research
Abstract
Multi-modal geo-foundation models combining vision and language have emerged as powerful tools for Earth observation, yet a fundamental question remains unanswered: which vision-language geospatial tasks can be learned in-context, and which cannot? Models such as GeoChat, EarthGPT, and RSGPT demonstrate impressive performance on scene description and attribute queries, but struggle with counting and multi-object localization, a pattern that lacks theoretical explanation. We address this gap by extending the in-context learning (ICL) characterization framework to vision-language geospatial tasks. We introduce the concept of language specificity s(ℓ) ∈ [0,1], measuring how uniquely a language query identifies target objects, and derive the effective object count J_eff(ℓ) = J·(1 − s(ℓ)), which captures the language-induced compression of the localization problem. We prove that language affects ICL complexity through three mechanisms (constraint specification, sufficient-statistic compression, and task decomposition) but cannot overcome fundamental combinatorial hardness when the effective object count exceeds the threshold J* ≈ 2–4. Our main result is the Multi-Modal GeoAI Dichotomy Theorem: every natural vision-language geospatial function class falls into exactly one of three categories. Type A (unconditionally ICL-Easy) tasks, including scene captioning, existence VQA, and change description, admit additive sufficient statistics with sample complexity n_ICL = Θ(CB²/ε). Type C (unconditionally ICL-Hard) tasks, including counting VQA and universal object localization, require combinatorial statistics regardless of language specification. Type A|ℓ (conditionally Easy) tasks, including attribute VQA, referring expression comprehension, and text-guided detection, transition from Hard to Easy when language specificity satisfies s(ℓ) ≥ 1 − J*/J. We provide a complete classification of vision-language geospatial tasks across all major categories (16 representative task types spanning scene-level, pixel-level, VQA, referring expression, and detection categories), derive seven testable predictions about model behavior (including threshold effects at J* ≈ 3–4 and specificity-accuracy correlations ρ > 0.8), and establish five prompt engineering guidelines for practitioners. The dichotomy explains observed performance patterns in existing models, which are strong on descriptive tasks but weak on quantitative localization, and provides principled guidance on when few-shot ICL suffices and when fine-tuning is required. Our framework bridges vision-language AI and geospatial analysis, offering the first theoretical foundation for multi-modal GeoAI deployment.
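Worked illustration of the conditional threshold (the numbers below are hypothetical and chosen only to show how the quantities in the abstract interact): consider a text-guided detection query over a scene containing J = 10 candidate objects. A vague query with specificity s(ℓ) = 0.5 gives an effective object count J_eff(ℓ) = 10·(1 − 0.5) = 5, which exceeds J* ≈ 3–4, so the task remains ICL-Hard. A more specific query with s(ℓ) = 0.8 gives J_eff(ℓ) = 10·(1 − 0.8) = 2 ≤ J*, satisfying s(ℓ) ≥ 1 − J*/J ≈ 0.6–0.7 and placing the task in the conditionally ICL-Easy regime.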
Keywords
How to Cite
APA:
Hawarey, M. (2026). ICL Characterization of Multi-Modal Geo-Foundation Models: When Can Vision-Language Transformers Learn Geospatial Tasks? AIR Journal of Mathematics & Computational Sciences, Vol. 2026, AIRMCS2026446. DOI: 10.65737/AIRMCS2026446
Copyright & Open Access
© 2026 Mosab Hawarey. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are credited. Authors retain full copyright to their work.