A training-free method to transplant tokenizers in pre-trained language models
At Arcee AI, we're constantly pushing the boundaries of what's possible with small language models (SLMs). Today, we're excited to share groundbreaking research that solves one of the most persistent challenges in AI model development: making different language models work together, even when they have different vocabularies.
Imagine you have two brilliant translators—one specializes in medical terminology, the other in legal jargon. Both are experts, but they use completely different vocabularies. If you want them to collaborate on a document that requires both medical and legal expertise, they'd struggle to communicate effectively.
This is exactly what happens with language models. Each model is trained with its own tokenizer, essentially a dictionary that breaks text into digestible pieces. One model might split "unhappiness" into "un," "happi," and "ness," while another might treat it as a single token. These differences create massive barriers that, until now, could only be solved with expensive retraining, often costing thousands of dollars and weeks of compute time.
In a newly published research paper, “Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit”, Arcee AI researchers Charles Goddard and Fernando Fernandes Neto introduce a revolutionary approach, "tokenizer transplantation," built on a classic sparse-approximation technique called Orthogonal Matching Pursuit (OMP).
Think of it as a sophisticated translation system that can convert between different model vocabularies without any retraining. Here's the key insight: even though different models use different vocabularies, the concepts they represent often align in predictable ways. In particular, the tokens the two vocabularies share act as anchor points: a token that exists only in the incoming vocabulary can be written as a sparse combination of shared tokens, and that same combination can be reused to build its embedding in the target model. Our method uses OMP to find these alignments and transplant one model's vocabulary into another.
The result? You can take a model trained with one tokenizer and seamlessly switch it to use a completely different vocabulary—all without losing performance or requiring expensive retraining.
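To make the mechanics concrete, here is a minimal NumPy sketch of the kind of sparse reconstruction involved. It is an illustration rather than the MergeKit implementation: the function names, the fixed sparsity level k, and the assumption that you can slice raw embedding matrices for the shared tokens are all simplifications.

```python
import numpy as np

def omp(target, atoms, k=8):
    """Orthogonal Matching Pursuit: approximate `target` (shape [d]) as a
    sparse linear combination of at most k rows of `atoms` (shape [n, d]).
    Returns the chosen row indices and their coefficients."""
    residual = target.astype(np.float64)
    selected, coeffs = [], np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        corr = atoms @ residual
        corr[selected] = 0.0                    # never re-pick an atom
        selected.append(int(np.argmax(np.abs(corr))))
        # Re-fit all coefficients jointly (the "orthogonal" step),
        # then recompute the residual.
        A = atoms[selected].T                   # [d, |selected|]
        coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coeffs
    return selected, coeffs

def transplant_embedding(token_emb_donor, shared_donor, shared_base, k=8):
    """Build an embedding for a token the base model has never seen.

    1. Express the token's embedding in the *donor* model's space as a
       sparse combination of tokens shared by both vocabularies.
    2. Reuse those coefficients on the *base* model's embeddings of the
       same shared tokens to place the new token in the base model's space.
    """
    idx, c = omp(token_emb_donor, shared_donor, k=k)
    return shared_base[idx].T @ c
```

Tokens present in both vocabularies are simply copied over; only the tokens unique to the incoming vocabulary need this reconstruction, which is what keeps the procedure training-free.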
Our experiments demonstrate remarkable results: our method consistently outperformed all existing zero-shot approaches, often by significant margins.
This breakthrough opens up entirely new possibilities for building and deploying language models: knowledge distillation across vocabularies, universal speculative decoding, model merging and ensembling, domain adaptation, and multilingual expansion.
Problem: You want to compress a powerful 70B parameter model into a nimble 7B version, but they use different vocabularies.
Solution: Transplant the large model's tokenizer onto the small one, then directly transfer knowledge with perfect vocabulary alignment.
Impact: Build specialized, efficient models that retain most of their teacher's capabilities.
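To see why vocabulary alignment matters for distillation, consider token-level knowledge transfer: the student is trained to match the teacher's next-token distribution, which is only well-defined when both models score the same vocabulary in the same order. Below is a minimal PyTorch sketch; the temperature, the tensor shapes, and the assumption that the logits come from a teacher and student that now share a tokenizer are all illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions.

    Both tensors have shape [batch, seq_len, vocab_size]; comparing them
    position by position only makes sense because transplantation gives both
    models the same vocab_size, with token i meaning the same thing in each.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional in distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```

Without a shared vocabulary, the two softmax distributions live over misaligned token sets and this loss cannot be computed directly, which is why distillation across tokenizers normally requires approximations or retraining.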
Problem: You want to speed up inference by having a small draft model propose completions that a larger model verifies (speculative decoding), but the two models need matching vocabularies.
Solution: Use tokenizer transplantation to align any model pair, regardless of their original training.
Impact: 2-3x faster inference with any combination of models—mix and match for optimal speed/quality tradeoffs.
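Matching vocabularies are non-negotiable here because the large model must score the exact token IDs the draft model proposed. The sketch below shows a deliberately simplified greedy verify step for a batch of one (production speculative sampling accepts tokens probabilistically using probability ratios); draft_model and target_model are hypothetical callables that map a token-ID tensor to logits of shape [batch, seq_len, vocab_size].

```python
import torch

@torch.no_grad()
def greedy_speculative_step(target_model, draft_model, prompt_ids, n_draft=4):
    """Draft-then-verify, greedy flavor, for prompt_ids of shape [1, len].

    The draft model proposes n_draft tokens one by one; the target model then
    scores the whole proposed sequence in a single forward pass, and we keep
    the longest prefix it agrees with. Comparing token IDs across the two
    models is only meaningful because they share one vocabulary.
    """
    # 1. Cheap model drafts a short continuation.
    ids = prompt_ids.clone()
    for _ in range(n_draft):
        next_id = draft_model(ids)[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    drafted = ids[:, prompt_ids.shape[1]:]                        # [1, n_draft]

    # 2. Expensive model verifies all drafted positions at once.
    logits = target_model(ids)                                    # [1, len + n_draft, vocab]
    verify_from = prompt_ids.shape[1] - 1                         # logits predicting drafted tokens
    target_choice = logits[:, verify_from:-1, :].argmax(dim=-1)   # [1, n_draft]

    # 3. Accept the longest agreeing prefix, then append the target's own
    #    token at the first disagreement (or one bonus token if all agree).
    agree = (target_choice == drafted)[0]
    n_accept = int(agree.int().cumprod(dim=0).sum())
    accepted = drafted[:, :n_accept]
    next_from_target = logits[:, verify_from + n_accept, :].argmax(dim=-1, keepdim=True)
    return torch.cat([prompt_ids, accepted, next_from_target], dim=-1)
```

Because verification is a single forward pass of the large model, every accepted draft token saves one expensive decode step, and with transplantation any small model can play the draft role for any large one.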
Problem: You want to combine multiple specialized models (one for coding, one for math, one for writing), but their outputs can't be combined directly because of vocabulary mismatches.
Solution: Harmonize all vocabularies through transplantation, enabling direct output combination.
Impact: Create specialized models that excel across multiple domains simultaneously.
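Once every expert scores the same vocabulary, combining them can be as simple as mixing their next-token distributions. The snippet below is a hedged sketch: the weighted average, the greedy pick, and the model callables (again assumed to return logits of shape [batch, seq_len, vocab_size]) are illustrative choices, and real ensembles often gate or route between experts instead.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_next_token(models, weights, input_ids):
    """Weighted mixture of next-token distributions from several models.

    Every model must map the same token IDs to logits over the same
    vocabulary, which is exactly what tokenizer transplantation guarantees.
    """
    mixed = None
    for model, w in zip(models, weights):
        probs = F.softmax(model(input_ids)[:, -1, :], dim=-1)   # [batch, vocab]
        mixed = w * probs if mixed is None else mixed + w * probs
    return mixed.argmax(dim=-1)   # greedy pick from the blended distribution
```

A coding, a math, and a writing expert could be blended this way with weights chosen per request, since their harmonized vocabularies make the probabilities directly comparable.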
Problem: Your general-purpose model struggles with medical terminology, legal documents, or code because it wasn't trained on domain-specific vocabularies.
Solution: Transplant a domain-specific tokenizer that better handles specialized terminology.
Impact: Dramatically improve performance in niche domains without costly retraining.
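A quick way to gauge whether a domain tokenizer is worth transplanting is to measure how many tokens it spends on your domain text; fewer tokens means longer effective context and cheaper inference on that material. The sketch below uses the Hugging Face AutoTokenizer API, and the model paths are placeholders for whichever general-purpose and domain-specific tokenizers you are comparing.

```python
from transformers import AutoTokenizer

def tokens_per_word(tokenizer_name: str, text: str) -> float:
    """Rough 'fertility' metric: average tokens per whitespace word.
    Lower is better for the text you care about."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    return len(tok.tokenize(text)) / max(1, len(text.split()))

# Placeholder model IDs: compare your current general-purpose tokenizer
# against the domain-specific one you are considering transplanting.
sample = "Patient presented with paroxysmal supraventricular tachycardia."
print(tokens_per_word("path/to/general-model", sample))
print(tokens_per_word("path/to/domain-model", sample))
```

If the domain tokenizer is markedly more efficient on your corpus, transplantation lets the model inherit that efficiency without retraining its weights.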
Problem: Adapting English models for other languages typically requires extensive retraining.
Solution: Transplant tokenizers optimized for target languages, preserving learned capabilities while improving linguistic coverage.
Impact: Rapidly expand model capabilities to new languages and regions.
This research exemplifies Arcee AI's commitment to making advanced AI accessible and practical. We're not just building models—we're creating the tools and techniques that make AI development faster, cheaper, and more effective for everyone.
Our tokenizer transplantation method is available in the mergekit-tokensurgeon tool, which is part of our open-source MergeKit library. Whether you're building specialized domain models, experimenting with model merging or model distillation, or optimizing inference pipelines, these techniques can accelerate your development and improve your results.
The future of AI isn't just about building bigger models—it's about building smarter, more collaborative systems where different models can seamlessly work together. Tokenizer transplantation is a crucial step toward that vision.