We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm designed to fine-tune a dense pre-trained Large Language Model (LLM) into a MoE-style model with capabilities in multiple specialized domains. During instruction-tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint, with each expert representing a structurally sparse subset of the seed LLM's parameters that corresponds to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off compared to all baselines.
SIMoE conceptually follows the MoE principle of routing and combining specialized parameter components through soft merging, but differs in implementation from conventional MoE architectures by defining each expert as a specific subset of sparse parameters within a shared network. Specifically, SIMoE upcycles a pre-trained LLM into a MoE-style model characterized by M experts, consisting of a shared, trainable set of expert parameters and M distinct, trainable sets of expert masks. In the forward computation, SIMoE merges experts via a weighted sum whose coefficients are generated by a router network from the input prompt, and then combines the merged expert with the frozen, pre-trained LLM. During instruction-tuning, we enforce structured sparsity and orthogonality on the trainable masks in addition to the usual NLL loss, determining where-to-upcycle and encouraging expert specialization in a fully automatic manner.
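To make the merging concrete, below is a minimal PyTorch sketch of a single SIMoE-style linear layer together with the two regularizers, written purely from the description above. The names (SIMoELinear, the sigmoid mask relaxation, and the exact penalty forms) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SIMoELinear(nn.Module):
    """Frozen pre-trained linear layer plus a shared expert delta and M expert masks."""

    def __init__(self, base_linear: nn.Linear, num_experts: int = 8):
        super().__init__()
        # Frozen pre-trained weight (bias omitted for brevity).
        self.weight = nn.Parameter(base_linear.weight.detach().clone(), requires_grad=False)
        out_f, in_f = self.weight.shape
        # Shared, trainable set of expert parameters (a delta on the frozen weight).
        self.expert_delta = nn.Parameter(torch.zeros(out_f, in_f))
        # M distinct, trainable masks; here one relaxed gate per output row,
        # giving structured (neuron-level) sparsity. This granularity is an assumption.
        self.mask_logits = nn.Parameter(torch.zeros(num_experts, out_f))

    def forward(self, x: torch.Tensor, routing_weights: torch.Tensor) -> torch.Tensor:
        # routing_weights: (batch, num_experts), produced by a router network from
        # the input prompt, e.g. a softmax over prompt-level logits.
        masks = torch.sigmoid(self.mask_logits)          # (M, out_f), relaxed 0/1 masks
        merged_gate = routing_weights @ masks            # soft-merge experts per example
        base = F.linear(x, self.weight)                  # frozen pre-trained path
        delta = F.linear(x, self.expert_delta)           # shared expert path
        return base + merged_gate.unsqueeze(1) * delta   # broadcast over the sequence dim


def sparsity_penalty(layer: SIMoELinear) -> torch.Tensor:
    # Structured sparsity: push whole mask groups toward zero (assumed L1-style form).
    return torch.sigmoid(layer.mask_logits).mean()


def orthogonality_penalty(layer: SIMoELinear) -> torch.Tensor:
    # Encourage distinct experts to occupy disjoint parameter subsets (assumed form).
    m = torch.sigmoid(layer.mask_logits)
    gram = m @ m.t()                                     # (M, M) pairwise mask overlaps
    return (gram - torch.diag(torch.diag(gram))).abs().mean()


# Training objective (sketch): loss = nll + lambda_s * sparsity + lambda_o * orthogonality.
```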
Cross-task generalization performance: SIMoE consistently achieves the strongest generalization across all of our experiments. On the Super-NaturalInstructions benchmark, it outperforms baselines in at least 7 of the 12 unseen task categories, yielding overall average gains of 2.5% and 1.6% over Full FT for the 3B and 8B pre-trained models, respectively.
SIMoE demonstrates strong generalization when transferring to a larger pre-trained model and a relatively larger instruction fine-tuning dataset, i.e., Tülu-v3. SIMoE maintains its competitive edge, surpassing all baseline methods on average over 12 common LLM evaluation benchmarks, with a noticeable improvement of 0.6% over the official Tülu-v3-8B-SFT model, the recent open-source state of the art.
Specialized experts and orthogonality: In the left panel, we visualize the average expert activation for different tasks and observe that all experts exhibit some utilization across datasets; hierarchical clustering of activation similarities reveals a clear dendrogram structure aligned with task and domain relationships. In the right panel, we assess expert specialization through pairwise mask overlap ratios. Experts generally have low overlaps, sharing only a small number of parameters, though domain-similar experts (according to the dendrogram grouping) exhibit marginally higher overlaps, for instance the math- and code-domain experts {2, 6, 7} and the general- and safety-domain experts {3, 4}. These results demonstrate that SIMoE identifies a balanced partition into shared and expert-specific parameters, enabling nuanced specialization while maintaining strong synergies between distinct experts.
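As a concrete reference for the overlap statistic, the sketch below computes pairwise mask overlap ratios from binary expert masks. The Jaccard-style normalization is an assumption; the paper's exact definition may differ.

```python
import torch


def pairwise_mask_overlap(masks: torch.Tensor) -> torch.Tensor:
    """masks: (num_experts, num_params) binary tensor of retained parameters."""
    masks = masks.float()
    intersection = masks @ masks.t()                 # |A ∩ B| for every expert pair
    sizes = masks.sum(dim=1, keepdim=True)           # |A| per expert
    union = sizes + sizes.t() - intersection         # |A ∪ B|
    return intersection / union.clamp(min=1.0)       # overlap ratio in [0, 1]


# Example: 8 experts over 1M parameters, ~25% density each (i.e. 75% sparsity).
masks = (torch.rand(8, 1_000_000) < 0.25).float()
overlap = pairwise_mask_overlap(masks)               # near-zero values indicate specialization
```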
Training and inference cost:
We compare the model capacity of (1) SIMoE with M = 8 upcycled sparse interpolated experts at each linear layer, and (2) Sparse upcycling with 4 experts at each FFN block and top-2 expert routing. Thanks to the proposed learnable, structured sparsity masks combined with expert parameter sharing, our method significantly reduces model size during training, immediately providing a substantial reduction in peak GPU memory usage. Furthermore, by targeting a final sparsity of 75% in the upcycled experts, our model achieves a smaller inference size, with approximately 30% fewer parameters than the number of active parameters per forward pass in an upcycled SMoE model.
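For intuition, here is a back-of-the-envelope sketch of this kind of parameter accounting. All sizes below are assumptions for a generic 8B-scale Transformer, not the paper's configuration, so the printed ratio is illustrative only; the paper's roughly 30% figure comes from its specific setup.

```python
# Illustrative parameter accounting only; every size below is an assumption.
linear_params = 7.0e9          # assumed parameters in upcycle-able linear layers
other_params = 1.0e9           # embeddings, norms, etc. (assumed, kept dense)
ffn_fraction = 2.0 / 3.0       # assumed share of linear params inside FFN blocks

# SIMoE: frozen base + one shared expert delta kept at 75% structured sparsity
# (the M = 8 masks themselves add negligible storage).
simoe_inference = other_params + linear_params * (1.0 + 0.25)

# Sparse upcycling SMoE: 4 FFN expert copies with top-2 routing, so two full FFN
# copies are active per token on top of the non-FFN parameters.
ffn_params = linear_params * ffn_fraction
smoe_active = other_params + (linear_params - ffn_params) + 2.0 * ffn_params

reduction = 1.0 - simoe_inference / smoe_active
print(f"SIMoE inference params   ≈ {simoe_inference / 1e9:.2f}B")
print(f"SMoE active params/token ≈ {smoe_active / 1e9:.2f}B")
print(f"Relative reduction       ≈ {reduction:.0%}")   # magnitude depends on the assumptions above
```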
Learned sparse upcycling patterns: We visualize the distribution of non-zero experts in the upcycled LLM learned by SIMoE from instruction-tuning. Several key observations emerge. First, as shown in the right panel, upcycling primarily occurs in the shallow and intermediate Transformer layers, with significantly reduced activity in deeper layers. Second, the left panel reveals that non-negligible upcycling manifests across all layer types, though with distinct intensity: layer normalization parameters exhibit the highest proportion of upcycled (non-zero) expert parameters, while the gate layer in the FFN exhibits the lowest. Key, value, and output matrices in the attention block maintain a noticeably higher fraction of non-zero parameters than query weights, aligning with prior work that identified these matrices as crucial for knowledge injection and model editing. Notably, the upcycling pattern learned by SIMoE, which achieves the best empirical performance, diverges substantially from manually prescribed strategies (e.g., upcycle FFN only), underscoring the critical advantage of data-driven approaches for determining where-to-upcycle.
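For reference, a small sketch of how such per-layer-type statistics could be aggregated from learned binary masks. The module-name suffixes (e.g. q_proj, gate_proj) are an assumed naming convention, not necessarily the one used in the paper.

```python
from collections import defaultdict

import torch


def nonzero_fraction_by_layer_type(named_masks):
    """named_masks: iterable of (module_name, binary_mask_tensor) pairs."""
    totals = defaultdict(lambda: [0, 0])                 # layer type -> [non-zero, total]
    for name, mask in named_masks:
        layer_type = name.split(".")[-1]                 # e.g. 'k_proj', 'gate_proj', 'norm'
        totals[layer_type][0] += int(mask.count_nonzero())
        totals[layer_type][1] += mask.numel()
    return {t: nz / total for t, (nz, total) in totals.items()}
```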
@inproceedings{chen-etal-2025-automatic,
  title     = {Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts},
  author    = {Shengzhuang Chen and Ying Wei and Jonathan Richard Schwarz},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.12597}
}