Multi-Modal Vision Language Models

Overview: This post takes a look at vision-language models (VLMs). Common VLM tasks include image-text retrieval, visual question answering (VQA), visual reasoning, captioning, visual entailment, and weakly-supervised grounding. The post surveys models that understand images and interact with language, focusing on those that fuse the vision and language modalities with cross-attention. Common fusion architectures are: dual-encoder (CLIP); encoder-decoder, which explicitly encodes vision and then decodes text (SimVLM); hybrid dual-encoder and encoder-decoder (ALBEF, CoCa); unified transformer (BLIP); and multi-modal LLM with cross-attention adapters (Flamingo). Models: ALBEF (Li et al., 2021). This paper summarizes the related work in two categories of multi-modal modelling. ...
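The cross-attention fusion these architectures share can be sketched in a few lines: text tokens act as queries attending over image-patch keys and values. This is a minimal single-head NumPy sketch; all shapes, weight names, and dimensions here are illustrative, not taken from any of the cited models.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: text queries attend over image keys/values."""
    Q = text_tokens @ Wq                       # (T, d) queries from the text side
    K = image_tokens @ Wk                      # (I, d) keys from image patches
    V = image_tokens @ Wv                      # (I, d) values from image patches
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, I) scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each text token's weights over patches
    return attn @ V                            # (T, d) vision-conditioned text features

# Toy example: 4 text tokens, 9 image patches, model dim 8.
rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((4, d))
image = rng.standard_normal((9, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_attention(text, image, Wq, Wk, Wv)
print(fused.shape)  # (4, 8)
```

In the real models this block sits inside a transformer layer (with multiple heads, residual connections, and layer norm), and the direction of attention differs: ALBEF/BLIP let text attend to image features, while Flamingo interleaves such layers as adapters inside a frozen LLM.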

February 25, 2026 · 10 min

Lessons and experiences of continual learning in deep learning for foundation models and LLMs

When we talk about continual learning, we very likely mean different things, much as with AGI. It can be the post-training of LLMs; self-evolving agents; experience replay in DQN (Mnih et al., 2013); sequential learning of MNIST digits; the optimal order of learning tasks; … For a better conversation, this post aims to bring back the context of continual learning in deep learning: positioning it among other learning regimes and presenting its definition, evaluation, and solutions without technical details. It then discusses connections to, and inspiration for, LLMs and foundation models. This post is inspired by (Mundt et al., 2023). ...
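Of the examples listed above, experience replay is the most mechanical: a bounded buffer of past transitions sampled i.i.d. to break the correlation of sequential experience. A minimal sketch, assuming nothing beyond the standard library (the class and its interface are illustrative, not the DQN paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample i.i.d. minibatches."""
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement decorrelates the minibatch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy usage: push 150 transitions into a buffer of capacity 100.
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(t, 0, 1.0, t + 1, False)
print(len(buf))        # 100: the capacity bounds the buffer
batch = buf.sample(4)
print(len(batch))      # 4
```

The same replay idea reappears in continual learning as rehearsal: keeping a small memory of old-task examples and mixing them into training on new tasks to mitigate forgetting.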

January 28, 2026 · 9 min