Unleashing the Power of Meta-tuning for Few-shot Generalization
Through Sparse Interpolated Experts

Shengzhuang Chen1,   Jihoon Tack2,   Yunqiao Yang1,   Yee Whye Teh3,   Jonathan Richard Schwarz4°,  Ying Wei5°  

1 CityU Hong Kong     2 KAIST     3 University of Oxford  4 Harvard University   5 NTU  

°Joint senior authorship, equal contribution    

[arXiv]            [Code]

Abstract

Conventional wisdom suggests parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-distribution (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient fine-tuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-domain and out-of-domain generalization.


Method

SMAT meta-learns a shared knowledge pool consisting of sparse interpolated experts, characterized by a common, learnable set of dense parameters and distinct, learnable sets of gating masks with sparsity constraints. To construct each task-specific model for both meta-training and inference, (1) SMAT first combines experts via a weighted sum, with merging weights generated by a meta-learned hypernetwork from the task's support set. (2) The merged experts are then interpolated with the frozen pre-trained model to enhance both in-distribution (ID) and out-of-distribution (OOD) generalization performance. Alongside (3) the query prediction loss, (4) knowledge distillation from task-specific dense teachers is introduced during meta-training to promote specialization and cooperation among the sparse interpolated experts, ensuring optimization success.
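To make steps (1)-(4) concrete, below is a minimal PyTorch-style sketch of the task-specific model construction and the meta-training loss. The function names, the exact interpolation form, and hyperparameters such as kd_weight and tau are illustrative assumptions for exposition, not the released implementation.

```python
# Minimal sketch of SMAT's construction, under the assumptions stated above.
import torch
import torch.nn.functional as F

def build_task_params(pretrained, shared, masks, merge_weights):
    """(1)-(2): merge sparse experts, then interpolate with the frozen pre-trained model.

    pretrained:    {name: frozen pre-trained tensor theta_0}
    shared:        {name: meta-learned dense tensor theta, shared by all experts}
    masks:         list (one per expert) of {name: sparse gating mask}
    merge_weights: per-expert merging weights from the hypernetwork (support-set dependent)
    """
    task_params = {}
    for name, theta0 in pretrained.items():
        # (1) weighted sum of sparse experts: each expert is the shared dense
        #     parameter tensor gated by its own sparse mask
        merged_mask = sum(w * m[name] for w, m in zip(merge_weights, masks))
        # (2) interpolation: gated coordinates move toward the meta-learned values,
        #     ungated coordinates remain at the pre-trained values
        task_params[name] = theta0 + merged_mask * (shared[name] - theta0)
    return task_params

def meta_training_loss(student_logits, teacher_logits, query_labels, kd_weight=1.0, tau=1.0):
    """(3)-(4): query prediction loss plus distillation from a task-specific dense teacher."""
    ce = F.cross_entropy(student_logits, query_labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    return ce + kd_weight * kd
```

Note that under this form, fully zero masks recover the pre-trained model exactly, which is what ties the sparsity level to the strength of interpolation discussed in the results below.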


Experimental Results

Few-shot testing performance: We report the average in-distribution (ID) and out-of-distribution (OOD) few-shot testing performance of the meta-tuned models on the Meta-Dataset benchmark augmented with additional OOD few-shot learning tasks. The results highlight that SMAT consistently achieves the best generalization performance among all methods across all evaluation settings, including direct inference without fine-tuning, gradient-based fine-tuning of the full model, and parameter-efficient fine-tuning with LoRA.



Learning speedup (Left): SMAT yields better ID results with an attractive learning speedup, while achieving and maintaining high OOD generalization performance.

Meta-tuning task diversity (Right): SMAT achieves improved ID and OOD generalization performance over the baselines across all evaluated meta-tuning settings with varying training task diversity.



Sparsity finds optimal ID vs OOD trade-offs (Left): The sparsity level of the experts essentially controls the relative strength of the interpolation between the pre-trained model and the meta-trained experts: at full sparsity the model reverts to the pre-trained weights, while at zero sparsity it is fully meta-trained. This establishes a trade-off between ID and OOD performance, with an optimal point usually lying between the two extremes.

Sparsity encourages specialization (Right): Higher sparsity in SMAT potentially induces better meta-gradient alignment during meta-tuning, indicating that each expert develops into a highly specialized region of parameters.



Meta-learned expert sparsity patterns (a-b, d): (a-b) Expert capacity (i.e., the number of non-zero parameters remaining after meta-tuning) grouped by (a) layer type and (b) layer depth. (d) Overlap (of non-zero regions) between expert masks. The results indicate that the meta-learned sparsity patterns differ noticeably across experts.

Implied task relationships (c): A dendrogram built from the expert selection scores shows a clear hierarchical clustering of tasks according to their visual similarity.
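As an illustration of how such a dendrogram could be produced from per-task expert selection scores, here is a short hedged sketch; the score matrix, task names, and linkage settings are placeholders, not the paper's data or plotting code.

```python
# Hedged sketch: hierarchical clustering of tasks by their expert selection scores.
# The scores and task names below are synthetic placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

task_names = ["Aircraft", "Birds", "Flowers", "Textures"]           # hypothetical tasks
scores = np.random.dirichlet(np.ones(8), size=len(task_names))      # (tasks, experts) merging weights

Z = linkage(scores, method="average", metric="cosine")              # cluster tasks by score similarity
dendrogram(Z, labels=task_names)
plt.ylabel("cosine distance between expert selection scores")
plt.tight_layout()
plt.show()
```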

SMAT Explained


Citation

@article{chen2024unleashing,
     title={Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts},
     author={Shengzhuang Chen and Jihoon Tack and Yunqiao Yang and Yee Whye Teh and Jonathan Richard Schwarz and Ying Wei},
     journal={arXiv preprint arXiv:2403.08477},
     year={2024},
}