Lessons and experiences of continual learning in deep learning for foundation models and LLMs
When we talk about continual learning, we may well mean different things, much as with AGI. It can refer to the post-training of LLMs; self-evolving agents; experience replay in DQN (Mnih et al., 2013); sequential learning of MNIST digits; the optimal order in which to learn tasks, … For a better conversation, this post brings back the context of continual learning in deep learning, positions it alongside other learning regimes, and presents its definition, evaluation, and solutions without technical details. It also discusses the connections to, and inspirations for, LLMs and foundation models. This post is inspired by (Mundt et al., 2023). ...
CLIP
Overview Pursuing an efficient way to learn visual representations with natural language supervision, CLIP uses contrastive learning to train a vision encoder and a text encoder. Unlike previous work, the pre-training dataset is collected from web image-text pairs and is significantly larger than the existing crowd-labeled datasets. Moreover, the dataset does not require gold labels, which enables scalability. As a result, the image representations can be used for zero-shot transfer tasks, such as ImageNet classification, and the shared multi-modal embedding space enables analysis across the image and text modalities. ...
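As a rough illustration of the idea, here is a minimal sketch of a CLIP-style symmetric contrastive objective over a batch of paired image/text embeddings. The function name, dimensions, and temperature value are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired (B, D) image/text embeddings.

    The i-th image and i-th text are the only positive pair in the batch.
    """
    # L2-normalize so the dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) matrix of scaled pairwise similarities
    logits = image_features @ text_features.t() / temperature

    # Positive pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```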
BERT
BERT takes the encoder of the Transformer (Vaswani et al., 2017), pre-trained with a masked language model (MLM) task and next sentence prediction. It can be finetuned on downstream tasks and achieves state-of-the-art performance. MLM plays an important role in self-supervised learning and inspired MAE-ViT (He et al., 2022). Another important self-supervised learning task is contrastive learning, used for example by DINO (Caron et al., 2021). While representations learned with MLM generally require finetuning for downstream tasks, contrastive learning leads to better zero-shot, few-shot, or in-context learning performance. Still, the simplicity and efficiency of MLM make it a compelling pretraining method. ...
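To make the MLM objective concrete, below is a hedged sketch of the standard 15% / 80-10-10 masking recipe described in the BERT paper. The function name and arguments are placeholders for illustration, not part of any library API.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Prepare inputs and labels for masked language modeling (BERT-style recipe).

    Of the ~15% of positions selected for prediction: 80% become [MASK],
    10% become a random token, 10% keep the original token.
    Unselected positions get label -100 so the loss ignores them.
    """
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions the model must predict
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100

    # 80% of selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining selected positions (10% overall) -> random token
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # The final 10% keep their original token
    return input_ids, labels
```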
MHSA of BERT
Multi-Head Self-Attention The Transformer blocks remove the causal mask but keep the attention mask. The attention mask, which indicates the padding tokens, has shape (B, T) for batch size B and sequence length T. It is applied before the softmax and typically masks the keys rather than the queries in the attention computation. Moreover, an interesting implementation detail for multi-head attention is to initialize one projection matrix for all heads and then reshape and reorder it for the attention score computation. ...
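A minimal sketch of this pattern follows: bidirectional self-attention with a single fused QKV projection and a (B, T) padding mask broadcast over the keys before the softmax. Class and variable names are mine, not from any particular BERT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Bidirectional (no causal mask) self-attention with a key padding mask."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One matrix projects to Q, K, V for all heads at once
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, attention_mask):
        # x: (B, T, d_model); attention_mask: (B, T), 1 for real tokens, 0 for padding
        B, T, _ = x.shape
        qkv = self.qkv(x)                                    # (B, T, 3*d_model)
        qkv = qkv.view(B, T, 3, self.n_heads, self.d_head)   # split into heads
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each (B, n_heads, T, d_head)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, n_heads, T, T)
        # Mask padded *keys*: broadcast (B, T) -> (B, 1, 1, T), applied before softmax
        scores = scores.masked_fill(attention_mask[:, None, None, :] == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = attn @ v                                        # (B, n_heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, -1)           # merge heads back
        return self.out(out)
```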
AURORA 1.3B: A foundation model for the Earth system
With the major impact of climate change, we face increasingly challenging and extreme weather. During the New Year period of 2026, the named winter event Anna hit Sweden with strong winds and heavy snowfall, leading to power outages, disrupted railway traffic, dangerous road conditions, and more. In Gävleborg in particular, it caused severe disruption, including widespread outages and major travel and transport paralysis, prompting authorities to issue a red warning. Can we know better where, when, and how such events will arrive, so that we can prepare for the future? We expect an answer from AI for science, and Microsoft Research AI for Science provides a foundation model, AURORA, for this. ...
Hello
My first post.