Unleashing the Power of Meta-tuning for Few-shot Generalization
Through Sparse Interpolated Experts

Shengzhuang Chen1,   Jihoon Tack2,   Yunqiao Yang1,   Yee Whye Teh3,   Jonathan Richard Schwarz4°,  Ying Wei5°  

1 CityU Hong Kong     2 KAIST     3 University of Oxford  4 Harvard University   5 NTU  

°Joint senior authorship, equal contribution    

[arXiv]            [Code]

Abstract

Conventional wisdom suggests parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-distribution (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient fine-tuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-domain and out-of-domain generalization.


Method

SMAT meta-learns a shared knowledge pool consisting of sparse interpolated experts, characterized by a common, learnable set of dense parameters and distinct, learnable sets of gating masks with sparsity constraints. To construct each task-specific model for both meta-training and inference, (1) SMAT first combines experts via a weighted sum, with merging weights generated by a meta-learned hypernetwork from the task's support set. (2) The merged experts are then interpolated with the frozen pre-trained model to enhance both in-distribution (ID) and out-of-distribution (OOD) generalization performance. Alongside (3) the query prediction loss, (4) knowledge distillation from task-specific dense teachers is introduced during meta-training to promote specialization and cooperation among the sparse interpolated experts, ensuring optimization success.
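To make steps (1)-(4) concrete, below is a minimal PyTorch-style sketch of the task-specific model construction and the meta-training loss. The function names, the exact interpolation form, and hyperparameters such as kd_weight and tau are illustrative assumptions for exposition, not the released implementation.

```python
# Minimal sketch of SMAT's construction, under the assumptions stated above.
import torch
import torch.nn.functional as F

def build_task_params(pretrained, shared, masks, merge_weights):
    """(1)-(2): merge sparse experts, then interpolate with the frozen pre-trained model.

    pretrained:    {name: frozen pre-trained tensor theta_0}
    shared:        {name: meta-learned dense tensor theta, shared by all experts}
    masks:         list (one per expert) of {name: sparse gating mask}
    merge_weights: per-expert merging weights from the hypernetwork (support-set dependent)
    """
    task_params = {}
    for name, theta0 in pretrained.items():
        # (1) weighted sum of sparse experts: each expert is the shared dense
        #     parameter tensor gated by its own sparse mask
        merged_mask = sum(w * m[name] for w, m in zip(merge_weights, masks))
        # (2) interpolation: gated coordinates move toward the meta-learned values,
        #     ungated coordinates remain at the pre-trained values
        task_params[name] = theta0 + merged_mask * (shared[name] - theta0)
    return task_params

def meta_training_loss(student_logits, teacher_logits, query_labels, kd_weight=1.0, tau=1.0):
    """(3)-(4): query prediction loss plus distillation from a task-specific dense teacher."""
    ce = F.cross_entropy(student_logits, query_labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    return ce + kd_weight * kd
```

Note that under this form, fully zero masks recover the pre-trained model exactly, which is what ties the sparsity level to the strength of interpolation discussed in the results below.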


Experimental Results

Few-shot testing performance: We report the average in-distribution (ID) and out-of-distribution (OOD) few-shot testing performance of the meta-tuned models on the Meta-Dataset benchmark augmented with additional OOD few-shot learning tasks. The results highlight that SMAT consistently achieves the best generalization performance among all methods across all evaluation settings, including direct inference without fine-tuning, gradient-based fine-tuning of the full model, and parameter-efficient fine-tuning with LoRA.



Learning speedup (Left): SMAT yields better ID results with an attractive learning speedup, while achieving and maintaining high OOD generalization performance.

Meta-tuning task diversity (Right): SMAT achieves improved ID and OOD generalization performance over the baselines across all evaluated meta-tuning settings with varying training task diversity.



Sparsity finds optimal ID vs OOD trade-offs (Left): The sparsity level of the experts essentially controls the relative strength of the interpolation between the pre-trained model and the meta-trained experts: at full sparsity the model reverts to the pre-trained weights, while at zero sparsity it is fully meta-trained. This establishes a trade-off between ID and OOD performance, with an optimal point usually lying between the two extremes.

Sparsity encourages specialization (Right): Higher sparsity in SMAT potentially induces better meta-gradient alignment during meta-tuning, indicating that each expert develops into a highly specialized region of parameters.



Meta-learned expert sparsity patterns (a-b, d): (a-b) Expert capacity (i.e., the number of non-zero parameters remaining after meta-tuning) grouped by (a) layer type and (b) layer depth. (d) Overlap (of non-zero regions) between expert masks. The results indicate that the meta-learned sparsity patterns differ noticeably across experts.

Implied task relationships (c): A dendrogram built from the expert selection scores shows a clear hierarchical clustering of tasks according to their visual similarity.
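As an illustration of how such a dendrogram could be produced from per-task expert selection scores, here is a short hedged sketch; the score matrix, task names, and linkage settings are placeholders, not the paper's data or plotting code.

```python
# Hedged sketch: hierarchical clustering of tasks by their expert selection scores.
# The scores and task names below are synthetic placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

task_names = ["Aircraft", "Birds", "Flowers", "Textures"]           # hypothetical tasks
scores = np.random.dirichlet(np.ones(8), size=len(task_names))      # (tasks, experts) merging weights

Z = linkage(scores, method="average", metric="cosine")              # cluster tasks by score similarity
dendrogram(Z, labels=task_names)
plt.ylabel("cosine distance between expert selection scores")
plt.tight_layout()
plt.show()
```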

SMAT Explained


Citation

@article{chen2024unleashing,
     title={Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts},
     author={Shengzhuang Chen and Jihoon Tack and Yunqiao Yang and Yee Whye Teh and Jonathan Richard Schwarz and Ying Wei},
     journal={arXiv preprint arXiv:2403.08477},
     year={2024},
}