Multi-Modal Vision Language Models

Overview

This post takes a look at vision-language models (VLMs). Common VLM tasks include image-text retrieval, visual question answering (VQA), visual reasoning, captioning, visual entailment, and weakly-supervised grounding. The post concentrates on models that understand images and interact with language, and in particular on models that fuse the vision and language modalities with cross-attention; a minimal code sketch of such a fusion block is included at the end of this section. Common fusion designs are:

- Dual encoder: CLIP
- Encoder-decoder: models that explicitly encode vision and then decode text, e.g. SimVLM
- Hybrid dual-encoder and encoder-decoder: ALBEF, CoCa
- Unified transformer: BLIP
- Multi-modal LLM with cross-attention adapters: Flamingo

Models

ALBEF (Li et al., 2021)

This paper summarizes the related work in two categories of multi-modal modelling. ...
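
To make the cross-attention fusion idea from the overview concrete, here is a minimal PyTorch sketch. It is illustrative only: the class name `CrossAttentionFusion`, the 768-dimensional embeddings, and the single pre-norm block are assumptions of this sketch, not the exact layer used by ALBEF, BLIP, or Flamingo, which wrap similar blocks inside larger encoders or decoders.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text tokens (queries) attend over image
    patch features (keys/values). Names and dimensions are assumptions,
    not any specific model's implementation."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        self.norm_ffn = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (batch, num_text_tokens, dim)  -- queries
        # image_patches: (batch, num_patches, dim)      -- keys and values
        q = self.norm_q(text_tokens)
        kv = self.norm_kv(image_patches)
        fused, _ = self.cross_attn(q, kv, kv)
        x = text_tokens + fused              # residual around cross-attention
        x = x + self.ffn(self.norm_ffn(x))   # residual around feed-forward
        return x

# Example: fuse 32 text tokens with 196 image patch embeddings (a 14x14 ViT grid).
fusion = CrossAttentionFusion(dim=768, num_heads=8)
text = torch.randn(2, 32, 768)
image = torch.randn(2, 196, 768)
out = fusion(text, image)  # (2, 32, 768): text tokens now conditioned on the image
```

By contrast, a pure dual encoder such as CLIP has no fusion block at all; image and text only meet through a similarity score, which is part of why cross-attention fusion tends to help on tasks like VQA that need fine-grained image-text interaction.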
