ICL CHARACTERIZATION OF MULTI-MODAL GEO-FOUNDATION MODELS: WHEN CAN VISION-LANGUAGE TRANSFORMERS LEARN GEOSPATIAL TASKS?
Mosab Hawarey
Director, Geospatial Research
Abstract
Multi-modal geo-foundation models combining vision and language have emerged as powerful tools for Earth observation, yet a fundamental question remains unanswered: which vision-language geospatial tasks can be learned in-context, and which cannot? Models such as GeoChat, EarthGPT, and RSGPT demonstrate impressive performance on scene description and attribute queries, but struggle with counting and multi-object localization, a pattern that lacks theoretical explanation. We address this gap by extending the in-context learning (ICL) characterization framework to vision-language geospatial tasks. We introduce the concept of language specificity s(ℓ) ∈ [0,1], measuring how uniquely a language query identifies target objects, and derive the effective object count J_eff(ℓ) = J·(1 − s(ℓ)), which captures the language-induced compression of the localization problem. We prove that language affects ICL complexity through three mechanisms (constraint specification, sufficient-statistic compression, and task decomposition) but cannot overcome fundamental combinatorial hardness when the effective object count exceeds the threshold J* ≈ 2–4. Our main result is the Multi-Modal GeoAI Dichotomy Theorem: every natural vision-language geospatial function class falls into exactly one of three categories. Type A (unconditionally ICL-Easy) tasks, including scene captioning, existence VQA, and change description, admit additive sufficient statistics with sample complexity n_ICL = Θ(CB²/ε). Type C (unconditionally ICL-Hard) tasks, including counting VQA and universal object localization, require combinatorial statistics regardless of language specification. Type A|ℓ (conditionally Easy) tasks, including attribute VQA, referring expression comprehension, and text-guided detection, transition from Hard to Easy when language specificity satisfies s(ℓ) ≥ 1 − J*/J. We provide a complete classification of vision-language geospatial tasks across all major categories (16 representative task types spanning scene-level, pixel-level, VQA, referring expression, and detection categories), derive seven testable predictions about model behavior (including threshold effects at J* ≈ 3–4 and specificity-accuracy correlations ρ > 0.8), and establish five prompt engineering guidelines for practitioners. The dichotomy explains observed performance patterns in existing models, which are strong on descriptive tasks but weak on quantitative localization, and provides principled guidance on when few-shot ICL suffices and when fine-tuning is required. Our framework bridges vision-language AI and geospatial analysis, offering the first theoretical foundation for multi-modal GeoAI deployment.
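Worked illustration of the conditional threshold (the numbers below are hypothetical and chosen only to show how the quantities in the abstract interact): consider a text-guided detection query over a scene containing J = 10 candidate objects. A vague query with specificity s(ℓ) = 0.5 gives an effective object count J_eff(ℓ) = 10·(1 − 0.5) = 5, which exceeds J* ≈ 3–4, so the task remains ICL-Hard. A more specific query with s(ℓ) = 0.8 gives J_eff(ℓ) = 10·(1 − 0.8) = 2 ≤ J*, satisfying s(ℓ) ≥ 1 − J*/J ≈ 0.6–0.7 and placing the task in the conditionally ICL-Easy regime.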
Keywords
How to Cite
APA:
Hawarey, M. (2026). ICL Characterization of Multi-Modal Geo-Foundation Models: When Can Vision-Language Transformers Learn Geospatial Tasks? AIR Journal of Mathematics & Computational Sciences, Vol. 2026, AIRMCS2026446. DOI: 10.65737/AIRMCS2026446
Copyright & Open Access
© 2026 Mosab Hawarey. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are credited. Authors retain full copyright to their work.