Open Models and Libraries
Open-Weight Models
Permissive Apache 2.0 license.

AFM 4.5B
Overview
AFM-4.5B is a 4.5 billion parameter instruction-tuned model developed by Arcee.ai, designed for enterprise-grade performance across diverse deployment environments from cloud to edge. The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pretraining data followed by 1.5 trillion tokens of midtraining data with an enhanced focus on mathematical reasoning and code generation. Following pretraining, the model underwent supervised fine-tuning on high-quality instruction datasets. The instruction-tuned model was further refined with reinforcement learning, both against verifiable rewards and for human preference alignment. We used a modified version of TorchTitan for pretraining, Axolotl for supervised fine-tuning, and a modified version of Verifiers for reinforcement learning.
The development of AFM-4.5B prioritized data quality as a fundamental requirement for achieving robust model performance. We collaborated with DatologyAI, a company specializing in large-scale data curation. DatologyAI's curation pipeline integrates a suite of proprietary algorithms—model-based quality filtering, embedding-based curation, target distribution-matching, source mixing, and synthetic data. Their expertise enabled the creation of a curated dataset tailored to support strong real-world performance.
The model architecture follows a standard transformer decoder-only design based on Vaswani et al., incorporating several key modifications for enhanced performance and efficiency. Notable architectural features include grouped query attention for improved inference efficiency and ReLU^2 activation functions instead of SwiGLU to enable sparsification while maintaining or exceeding performance benchmarks.
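For illustration, here is a minimal sketch of a feed-forward block using the ReLU^2 activation described above (layer names and sizes are placeholders, not the actual AFM-4.5B implementation):
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    # Feed-forward block with ReLU^2 activation: act(x) = relu(x) ** 2.
    # Unlike SwiGLU, it needs only two projections, and the exact zeros it
    # produces are what make activation sparsification possible.
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(torch.relu(self.up_proj(x)) ** 2)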
The model available in this repo is the instruct model following supervised fine-tuning and reinforcement learning.
View our documentation here for more details: https://docs.arcee.ai/arcee-foundation-models/introduction-to-arcee-foundation-models

Model Details
- Model Architecture: ArceeForCausalLM
- Parameters: 4.5B
- Training Tokens: 8T
- License: Apache 2.0
- Recommended settings:
- temperature: 0.5
- top_k: 50
- top_p: 0.95
- repeat_penalty: 1.1
Benchmarks

*Qwen3 and SmolLM's reasoning approaches cause their scores to vary widely from suite to suite, but the numbers here are all from our internal harness with the same hyperparameters. Be sure to also reference their officially reported scores; SmolLM has just released its own benchmark suite.
How to use with transformers
You can use the model directly with the transformers library.
We recommend a lower temperature, around 0.5, for optimal performance.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "arcee-ai/AFM-4.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
messages = [
{"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.5,
top_k=50,
top_p=0.95
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
How to use with vllm
Ensure you are on version 0.10.1 or newer:
pip install vllm>=0.10.1
You can then serve the model natively:
vllm serve arcee-ai/AFM-4.5B
How to use with Together API
You can access this model directly via the Together Playground.
Python (Official Together SDK)
from together import Together
client = Together()
response = client.chat.completions.create(
model="arcee-ai/AFM-4.5B",
messages=[
{
"role": "user",
"content": "What are some fun things to do in New York?"
}
]
)
print(response.choices[0].message.content)
cURL
curl -X POST "https://api.together.xyz/v1/chat/completions" \
-H "Authorization: Bearer $TOGETHER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "arcee-ai/AFM-4.5B",
"messages": [
{
"role": "user",
"content": "What are some fun things to do in New York?"
}
]
}'
Quantization support
Support for llama.cpp and Intel OpenVINO is available:
https://huggingface.co/arcee-ai/AFM-4.5B-GGUF
https://huggingface.co/arcee-ai/AFM-4.5B-ov
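As a quick illustration (assuming a recent llama.cpp build, and that the GGUF repository includes a Q4_K_M quant), the GGUF model can be run directly from the Hugging Face repo with the recommended sampling settings:
llama-cli -hf arcee-ai/AFM-4.5B-GGUF:Q4_K_M -p "Who are you?" --temp 0.5 --top-k 50 --top-p 0.95 --repeat-penalty 1.1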
License
AFM-4.5B is released under the Apache-2.0 license.
Arcee-SuperNova-Medius
Overview
Arcee-SuperNova-Medius is a 14B parameter language model developed by Arcee.ai, built on the Qwen2.5-14B-Instruct architecture. This unique model is the result of a cross-architecture distillation pipeline, combining knowledge from both the Qwen2.5-72B-Instruct model and the Llama-3.1-405B-Instruct model. By leveraging the strengths of these two distinct architectures, SuperNova-Medius achieves high-quality instruction-following and complex reasoning capabilities in a mid-sized, resource-efficient form.
SuperNova-Medius is designed to excel in a variety of business use cases, including customer support, content creation, and technical assistance, while maintaining compatibility with smaller hardware configurations. It’s an ideal solution for organizations looking for advanced capabilities without the high resource requirements of larger models like our SuperNova-70B.
Distillation Overview
The development of SuperNova-Medius involved a sophisticated multi-teacher, cross-architecture distillation process, with the following key steps:
- Logit Distillation from Llama 3.1 405B:
- We distilled the logits of Llama 3.1 405B using an offline approach.
- The top K logits for each token were stored to capture most of the probability mass while managing storage requirements.
- Cross-Architecture Adaptation:
- Using mergekit-tokensurgeon, we created a version of Qwen2.5-14B that uses the vocabulary of Llama 3.1 405B.
- This allowed for the use of Llama 3.1 405B logits in training the Qwen-based model.
- Distillation to Qwen Architecture:
- The adapted Qwen2.5-14B model was trained using the stored 405B logits as the target (a sketch of this kind of logit-distillation loss follows this list).
- Parallel Qwen Distillation:
- In a separate process, Qwen2.5-72B-Instruct was distilled into a 14B model.
- Final Fusion and Fine-Tuning:
- The Llama-distilled Qwen model's vocabulary was reverted to Qwen vocabulary.
- After re-aligning the vocabularies, a final fusion and fine-tuning step was conducted, using a specialized dataset from EvolKit to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
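To make the logit-distillation step concrete, here is a minimal sketch of a loss over stored top-K teacher logits (shapes, names, and the restriction of the student distribution to the stored top-K positions are illustrative simplifications, not Arcee's training code):
import torch
import torch.nn.functional as F

def topk_logit_distillation_loss(student_logits, teacher_topk_logits, teacher_topk_indices, temperature=1.0):
    # student_logits: [batch, seq, vocab]; teacher_topk_*: [batch, seq, k] (indices are int64)
    # Gather the student's logits at the vocabulary positions the teacher stored offline.
    student_at_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_indices)
    # Soft targets from the stored teacher logits; KL pushes the student toward the teacher.
    teacher_probs = F.softmax(teacher_topk_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_at_topk / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2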
Performance Evaluation
Below are the benchmark results of SuperNova-Medius compared to similar models in its class:
SuperNova-Medius performs exceptionally well in instruction-following (IFEval) and complex reasoning tasks (BBH), demonstrating its capability to handle a variety of real-world scenarios. It outperforms Qwen2.5-14B and SuperNova-Lite in multiple benchmarks, making it a powerful yet efficient choice for high-quality generative AI applications.
Model Use Cases
Arcee-SuperNova-Medius is suitable for a range of applications, including:
- Customer Support: With its robust instruction-following and dialogue management capabilities, SuperNova-Medius can handle complex customer interactions, reducing the need for human intervention.
- Content Creation: The model’s advanced language understanding and generation abilities make it ideal for creating high-quality, coherent content across diverse domains.
- Technical Assistance: SuperNova-Medius has a deep reservoir of technical knowledge, making it an excellent assistant for programming, technical documentation, and other expert-level content creation.
Deployment Options
SuperNova-Medius is available for use under the Apache-2.0 license. For those who need even higher performance, the full-size 70B SuperNova model can be accessed via an Arcee-hosted API or for local deployment. To learn more or explore deployment options, please reach out to sales@arcee.ai.
Technical Specifications
- Model Architecture: Qwen2.5-14B-Instruct
- Distillation Sources: Qwen2.5-72B-Instruct, Llama-3.1-405B-Instruct
- Parameter Count: 14 billion
- Training Dataset: Custom instruction dataset generated with EvolKit
- Distillation Technique: Multi-architecture offline logit distillation with cross-architecture vocabulary alignment.
Summary
Arcee-SuperNova-Medius provides a unique balance of power, efficiency, and versatility. By distilling knowledge from two top-performing teacher models into a single 14B parameter model, SuperNova-Medius achieves results that rival larger models while maintaining a compact size ideal for practical deployment. Whether for customer support, content creation, or technical assistance, SuperNova-Medius is the perfect choice for organizations looking to leverage advanced language model capabilities in a cost-effective and accessible form.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Arcee-SuperNova-Lite
Overview
Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture. It is a distilled version of the larger Llama-3.1-405B-Instruct model, leveraging offline logits extracted from the 405B parameter variant. This 8B variation of Llama-3.1-SuperNova maintains high performance while offering exceptional instruction-following capabilities and domain-specific adaptability.
The model was trained using a state-of-the-art distillation pipeline and an instruction dataset generated with EvolKit, ensuring accuracy and efficiency across a wide range of tasks. For more information on its training, visit blog.arcee.ai.
Llama-3.1-SuperNova-Lite excels in both benchmark performance and real-world applications, providing the power of large-scale models in a more compact, efficient form ideal for organizations seeking high performance with reduced resource requirements.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Arcee Spark
Overview
Arcee Spark is a powerful 7B parameter language model that punches well above its weight class. Initialized from Qwen2, this model underwent a sophisticated training process:
- Fine-tuned on 1.8 million samples
- Merged with Qwen2-7B-Instruct using Arcee's mergekit
- Further refined using Direct Preference Optimization (DPO)
This meticulous process results in exceptional performance, with Arcee Spark achieving the highest score on MT-Bench for models of its size, outperforming even GPT-3.5 on many tasks.
Key Features
- 7B parameters
- State-of-the-art performance for its size
- Initialized from Qwen2
- Advanced training process including fine-tuning, merging, and DPO
- Highest MT-Bench score in the 7B class
- Outperforms GPT-3.5 on many tasks
- Has a context length of 128k tokens, making it ideal for tasks requiring many conversation turns or working with large amounts of text.
Business Use Cases
Arcee Spark offers a compelling solution for businesses looking to leverage advanced AI capabilities without the hefty computational requirements of larger models. Its unique combination of small size and high performance makes it ideal for:
- Real-time applications: Deploy Arcee Spark for chatbots, customer service automation, and interactive systems where low latency is crucial.
- Edge computing: Run sophisticated AI tasks on edge devices or in resource-constrained environments.
- Cost-effective scaling: Implement advanced language AI across your organization without breaking the bank on infrastructure or API costs.
- Rapid prototyping: Quickly develop and iterate on AI-powered features and products.
- On-premise deployment: Easily host Arcee Spark on local infrastructure for enhanced data privacy and security.
Performance and Efficiency
Arcee Spark demonstrates that bigger isn't always better in the world of language models. By leveraging advanced training techniques and architectural optimizations, it delivers:
- Speed: Blazing fast inference times, often 10-100x faster than larger models.
- Efficiency: Significantly lower computational requirements, reducing both costs and environmental impact.
- Flexibility: Easy to fine-tune or adapt for specific domains or tasks.
Despite its compact size, Arcee Spark offers deep reasoning capabilities, making it suitable for a wide range of complex tasks including:
- Advanced text generation
- Detailed question answering
- Nuanced sentiment analysis
- Complex problem-solving
- Code generation and analysis
Model Availability
- Quants: Arcee Spark GGUF
- FP32: For those looking to squeeze every bit of performance out of the model, we offer an FP32 version that scores slightly higher on all benchmarks.
Benchmarks and Evaluations



MT-Bench
- First turn: 8.777778
- Second turn: 8.164634
- Average: 8.469325
EQ-Bench
EQ-Bench: 71.4
TruthfulQA
AGI-Eval
AGI-Eval average: 51.11
GPT4All Evaluation
GPT4All average: 69.37
Big Bench Hard
Big Bench Hard average: 45.78
License
Arcee Spark is released under the Apache 2.0 license.
Acknowledgments
- The Qwen2 team for their foundational work
- The open-source AI community for their invaluable tools and datasets
- Our dedicated team of researchers and engineers who push the boundaries of what's possible with compact language models
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Arcee-Meraj-Mini
Overview
Following the release of Arcee Meraj, our globally top-performing enterprise Arabic LLM, we are thrilled to unveil Arcee Meraj Mini. This open-source model, meticulously fine-tuned from Qwen2.5-7B-Instruct, is expertly designed for both Arabic and English. It has undergone rigorous evaluation across multiple benchmarks in both languages, demonstrating top-tier performance in Arabic and competitive results in English. Arcee Meraj Mini's primary objective is to enhance Arabic capabilities while maintaining robust English language proficiency. Benchmark results confirm that Arcee Meraj Mini excels in Arabic, with English performance comparable to leading models, perfectly aligning with our vision for balanced bilingual strength.
Technical Details
Below is an overview of the key stages in Meraj Mini’s development:
- Data Preparation: We filter candidate samples from diverse English and Arabic sources to ensure high-quality data. Some of the selected English datasets are translated into Arabic to increase the quantity of Arabic samples and improve the model’s quality in bilingual performance. Then, new Direct Preference Optimization (DPO) datasets are continuously prepared, filtered, and translated to maintain a fresh and diverse dataset that supports better generalization across domains.
- Initial Training: We train the Qwen2.5 model with 7 billion parameters using these high-quality datasets in both languages. This allows the model to handle diverse linguistic patterns from over 500 million tokens, ensuring strong performance in Arabic and English tasks.
- Iterative Training and Post-Training: Iterative training and post-training iterations refine the model, enhancing its accuracy and adaptability to ensure it can perform well across varied tasks and language contexts.
- Evaluation: We trained and evaluated 15 different variants of Arcee Meraj Mini to explore optimal configurations, assessing each on both Arabic and English benchmarks and leaderboards. This step ensures the model is robust in handling both general and domain-specific tasks.
- Final Model Creation: We select the best-performing variant and use the MergeKit library to merge the configurations, resulting in the final Arcee Meraj Mini model. This model is not only optimized for language understanding but also serves as a starting point for domain adaptation in different areas.
With this process, Arcee Meraj Mini is crafted to be more than just a general-purpose language model—it’s an adaptable tool, ready to be fine-tuned for specific industries and applications, empowering users to extend its capabilities for domain-specific tasks.
Capabilities and Use Cases
Arcee Meraj Mini is capable of solving a wide range of language tasks, including the following:
- Arabic Language Understanding: Arcee Meraj Mini excels in general language comprehension, reading comprehension, and common-sense reasoning, all tailored to the Arabic language, providing strong performance in a variety of linguistic tasks.
- Cultural Adaptation: The model ensures content creation that goes beyond linguistic accuracy, incorporating cultural nuances to align with Arabic norms and values, making it suitable for culturally relevant applications.
- Education: It enables personalized, adaptive learning experiences for Arabic speakers by generating high-quality educational content across diverse subjects, enhancing the overall learning journey.
- Mathematics and Coding: With robust support for mathematical reasoning and problem-solving, as well as code generation in Arabic, Arcee Meraj Mini serves as a valuable tool for developers and professionals in technical fields.
- Customer Service: The model facilitates the development of advanced Arabic-speaking chatbots and virtual assistants, capable of managing customer queries with a high degree of natural language understanding and precision.
- Content Creation: Arcee Meraj Mini generates high-quality Arabic content for various needs, from marketing materials and technical documentation to creative writing, ensuring impactful communication and engagement in the Arabic-speaking world.
Quantized GGUF
Here are GGUF models:
How to
This model uses ChatML prompt template:
<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
# Use a pipeline as a high-level helper
from transformers import pipeline
messages = [
{"role": "user", "content": "مرحبا، كيف حالك؟"},
]
pipe = pipeline("text-generation", model="arcee-ai/Meraj-Mini")
pipe(messages)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Meraj-Mini")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Meraj-Mini")
Evaluations
Open Arabic LLM Leaderboard (OALL) Benchmarks
The Arcee Meraj Mini model consistently outperforms state-of-the-art models on most of the Open Arabic LLM Leaderboard (OALL) benchmarks, highlighting its effectiveness on Arabic-language content and securing the top average score among the compared models.
Translated MMLU
We focused on the multilingual MMLU dataset, as distributed through the LM Evaluation Harness repository, to compare the multilingual strength of different models on this benchmark. Arcee Meraj Mini outperforms the other state-of-the-art models, showcasing its superior multilingual performance.
English Benchmarks:
Arcee Meraj Mini performs comparably to state-of-the-art models, demonstrating how the model retains its English language knowledge and capabilities while learning Arabic.
Model Usage
For a detailed explanation of the model's capabilities, architecture, and applications, please refer to our blog post: https://blog.arcee.ai/arcee-meraj-mini-2/
To test the model directly, you can try it out using this Google Colab notebook: https://colab.research.google.com/drive/1hXXyNM-X0eKwlZ5OwqhZfO0U8CBq8pFO?usp=sharing
Acknowledgements
We are grateful to the open-source AI community for their continuous contributions and to the Qwen team for their foundational efforts on the Qwen2.5 model series.
Future Directions
As we release the Arcee Meraj Mini to the public, we invite researchers, developers, and businesses to engage with the Arcee Meraj Mini model, particularly in enhancing support for the Arabic language and fostering domain adaptation. We are committed to advancing open-source AI technology and invite the community to explore, contribute, and build upon Arcee Meraj Mini.
Arcee-Agent
Overview
Arcee Agent is a cutting-edge 7B parameter language model from arcee.ai specifically designed for function calling and tool use. Initialized from Qwen2-7B, it rivals the performance of much larger models while maintaining efficiency and speed. This model is particularly suited for developers, researchers, and businesses looking to implement sophisticated AI-driven solutions without the computational overhead of larger language models. Compute for training Arcee-Agent was provided by CrusoeAI. Arcee-Agent was trained using Spectrum.
Full model is available here: Arcee-Agent.
Key Features
- Advanced Function Calling: Arcee Agent excels at interpreting, executing, and chaining function calls. This capability allows it to interact seamlessly with a wide range of external tools, APIs, and services.
- Multiple Format Support: The model is compatible with various tool use formats, including:
- Glaive FC v2
- Salesforce
- Agent-FLAN
Arcee-Agent performs best when using the VLLM OpenAI FC format, but it also excels with prompt-based solutions (see the example after this list), and it can accommodate any specific use case or infrastructure needs you may have.
- Dual-Mode Functionality:
- Tool Router: Arcee Agent can serve as intelligent middleware, analyzing requests and efficiently routing them to appropriate tools or larger language models for processing.
- Standalone Chat Agent: Despite its focus on function calling, Arcee Agent is capable of engaging in human-like conversations and completing a wide range of tasks independently.
- Unparalleled Speed and Efficiency: With its 7B parameter architecture, Arcee Agent delivers rapid response times and efficient processing, making it suitable for real-time applications and resource-constrained environments.
- Competitive Performance: In function calling and tool use tasks, Arcee Agent competes with the capabilities of models many times its size, offering a cost-effective solution for businesses and developers.
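As a hedged illustration of this OpenAI-style function-calling flow (assuming the model is served behind an OpenAI-compatible endpoint such as vLLM with tool support enabled; the endpoint URL and the weather tool below are placeholders):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="arcee-ai/Arcee-Agent",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# The model is expected to return a tool call rather than a plain text answer.
print(response.choices[0].message.tool_calls)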
Detailed Function Calling and Tool Use Capabilities
Arcee Agent's function calling and tool use capabilities open up a world of possibilities for AI-driven applications. Here's a deeper look at what you can achieve:
- API Integration: Seamlessly interact with external APIs, allowing your applications to:
- Fetch real-time data (e.g., stock prices, weather information)
- Post updates to social media platforms
- Send emails or SMS messages
- Interact with IoT devices
- Database Operations: Execute complex database queries and operations through natural language commands, enabling:
- Data retrieval and analysis
- Record updates and insertions
- Schema modifications
- Code Generation and Execution: Generate and run code snippets in various programming languages, facilitating:
- Quick prototyping
- Automated code review
- Dynamic script generation for data processing
- Multi-step Task Execution: Chain multiple functions together to complete complex tasks, such as:
- Booking travel arrangements (flights, hotels, car rentals)
- Generating comprehensive reports from multiple data sources
- Automating multi-stage business processes
Business Use Cases
Arcee Agent's unique capabilities make it an invaluable asset for businesses across various industries. Here are some specific use cases:
- Customer Support Automation:
- Implement AI-driven chatbots that handle complex customer inquiries and support tickets.
- Automate routine support tasks such as password resets, order tracking, and FAQ responses.
- Integrate with CRM systems to provide personalized customer interactions based on user history.
- Sales and Marketing Automation:
- Automate lead qualification and follow-up using personalized outreach based on user behavior.
- Generate dynamic marketing content tailored to specific audiences and platforms.
- Analyze customer feedback from various sources to inform marketing strategies.
- Operational Efficiency:
- Automate administrative tasks such as scheduling, data entry, and report generation.
- Implement intelligent assistants for real-time data retrieval and analysis from internal databases.
- Streamline project management with automated task assignment and progress tracking.
- Financial Services Automation:
- Automate financial reporting and compliance checks.
- Implement AI-driven financial advisors for personalized investment recommendations.
- Integrate with financial APIs to provide real-time market analysis and alerts.
- Healthcare Solutions:
- Automate patient record management and data retrieval for healthcare providers.
- E-commerce Enhancements:
- Create intelligent product recommendation systems based on user preferences and behavior.
- Automate inventory management and supply chain logistics.
- Implement AI-driven pricing strategies and promotional campaigns.
- Human Resources Automation:
- Automate candidate screening and ranking based on resume analysis and job requirements.
- Implement virtual onboarding assistants to guide new employees through the onboarding process.
- Analyze employee feedback and sentiment to inform HR policies and practices.
- Legal Services Automation:
- Automate contract analysis and extraction of key legal terms and conditions.
- Implement AI-driven tools for legal research and case law summarization.
- Develop virtual legal assistants to provide preliminary legal advice and document drafting.
- Educational Tools:
- Create personalized learning plans and content recommendations for students.
- Automate grading and feedback for assignments and assessments.
- Manufacturing and Supply Chain Automation:
- Optimize production schedules and inventory levels using real-time data analysis.
- Implement predictive maintenance for machinery and equipment.
- Automate quality control processes through data-driven insights.
Benchmarking
Intended Uses
Arcee Agent is designed for a wide range of applications where efficient function calling and tool use are crucial. Some potential use cases include:
- Developing sophisticated chatbots and virtual assistants with advanced tool integration
- Creating efficient middleware for routing and preprocessing requests to larger language models
- Implementing AI-driven process automation in resource-constrained environments
- Prototyping and testing complex tool-use scenarios without the need for more computationally expensive models
- Building interactive documentation systems that can execute code examples in real-time
- Developing intelligent agents for IoT device management and home automation
- Creating AI-powered research assistants for various scientific disciplines
Limitations
While Arcee Agent excels in its specialized areas, users should be aware of its limitations:
- The model's general knowledge and capabilities outside of function calling and tool use may be more limited compared to larger, general-purpose language models.
- Performance in tasks unrelated to its core functionalities may not match that of models with more diverse training.
- As with all language models, outputs should be validated and used responsibly, especially in critical applications.
- The model's knowledge cutoff date may limit its awareness of recent events or technological advancements.
Arcee-Lite
Overview
Arcee-Lite is a compact yet powerful 1.5B parameter language model developed as part of the DistillKit open-source project. Despite its small size, Arcee-Lite demonstrates impressive performance, particularly in the MMLU (Massive Multitask Language Understanding) benchmark.
GGUFs available here
Key Features
- Model Size: 1.5 billion parameters
- MMLU Score: 55.93
- Distillation Source: Phi-3-Medium
- Enhanced Performance: Merged with high-performing distillations
About DistillKit
DistillKit is our new open-source project focused on creating efficient, smaller models that maintain high performance. Arcee-Lite is one of the first models to emerge from this initiative.
Performance
Arcee-Lite showcases remarkable capabilities for its size:
- Achieves a 55.93 score on the MMLU benchmark
- Demonstrates exceptional performance across various tasks
Use Cases
Arcee-Lite is suitable for a wide range of applications where a balance between model size and performance is crucial:
- Embedded systems
- Mobile applications
- Edge computing
- Resource-constrained environments

Please note that our internal evaluations were consistently higher than their counterparts on the OpenLLM Leaderboard; these numbers should only be used to compare relative performance between the models shown, not weighed against the leaderboard.
Arcee-Nova
Overview
Arcee-Nova is our highest-performing open source model. Evaluated on the same stack as the OpenLLM Leaderboard 2.0, it is the top-performing open source model tested on that stack. Its performance approaches that of GPT-4 from May 2023, marking a significant milestone.
Nova is a merge of Qwen2-72B-Instruct with a custom model tuned on a generalist dataset mixture.
GGUFs available here
Chat with Arcee-Nova here
Capabilities and Use Cases
Arcee-Nova excels across a wide range of language tasks, demonstrating particular strength in:
- Reasoning: Solving complex problems and drawing logical conclusions.
- Creative Writing: Generating engaging and original content across various genres.
- Coding: Assisting with programming tasks, from code generation to debugging.
- General Language Understanding: Comprehending and generating human-like text in diverse contexts.
Business Applications
Arcee-Nova can be applied to various business tasks:
- Customer Service: Implement sophisticated chatbots and virtual assistants.
- Content Creation: Generate high-quality written content for marketing and documentation.
- Software Development: Accelerate coding processes and improve code quality.
- Data Analysis: Enhance data interpretation and generate insightful reports.
- Research and Development: Assist in literature reviews and hypothesis generation.
- Legal and Compliance: Automate contract analysis and regulatory compliance checks.
- Education and Training: Create adaptive learning systems and intelligent tutoring programs.
Evaluations

Acknowledgments
We extend our gratitude to the open source AI community, whose collective efforts have paved the way for Arcee-Nova. Their commitment to transparency and collaboration continues to drive innovation. We also would like to extend our thanks to the Qwen team - without Qwen2-72B this would not be possible.
Future Directions
As we release Arcee-Nova to the public, we look forward to seeing how researchers, developers, and businesses will leverage its capabilities. We remain committed to advancing open source AI technology and invite the community to explore, contribute, and build upon Arcee-Nova.
Note: This README was written with assistance from Arcee-Nova.
Arcee-Scribe
Overview
Arcee-Scribe is a versatile chat model that excels across a wide range of benchmarks. While it didn't quite meet our exacting standards for Arcee-Spark V2, we found it too delightful to not release. It's our strongest model to date in terms of reasoning on the Big-Bench Hard (BBH) dataset, and we discovered it to be particularly adept at creative writing tasks. The model is a merge of InternLM-2.5-chat with a custom InternLM finetune, incorporating both general and writing-specific data.
Capabilities and Use Cases
Arcee-Scribe functions as a writing assistant and can handle a variety of tasks. Here are some examples of its capabilities:
1. Challenging Dialogue: Crafting nuanced conversations with subtext and complex emotions.
Example:
Character A (an optimist who is usually the first one to bring up solutions, even in difficult situations):
"Maybe it's time we should figure this out, y'know? Like, how we both really feel about what we want from our future. It's been... quite a ride, hasn't it?"
Character B (a reserved person who prefers to let things unfold, but is deeply affected by what's said):
"Yeah, it's... it's been interesting. You've got such a way of making everything sound like it's gonna be all right. But I... I'm not sure it's that simple."
Character A (trying to see the lighter side, attempting to lighten the mood without making things awkward):
"You mean, I'm like a magician's rabbit? Poof, and everything's gonna be just fine? But hey, I like the sound of that. We could use a bit of magic, right?"
Character B (acknowledging the tension, but avoiding the conversation directly):
"I think what you mean is that I'm trying to see the silver lining in every cloud. And you're right, it's... it's a bit like that."
Character A (feeling the need to reiterate, hoping to gently steer the conversation toward a positive direction):
"Well, I mean, we both know it's been challenging. But what I'm trying to say is, maybe we could find a way to make things better. Because, y'know, that's what we've always done. We've always found a way to navigate through the rough waters, haven't we?"
Character B (feeling the weight of their unspoken emotions, hinting at a resignation to their fate):
"That's... that's true. But sometimes, no matter how hard we try, we can't make the wind change. Maybe it's time to look at what's right in front of us, instead of always looking towards what's beyond the horizon."
Character A (realizing the gravity of the unspoken words, attempting to bridge the gap without acknowledging the breakup directly):
"You know, life's been a series of chapters, and I think we've had our fair share of adventures. Maybe it's time for us to start writing a new story—one where we find a way to walk side by side, even if it's on different paths."
Character B (acknowledging the poignant conversation, but still avoiding the explicit mention of a breakup):
"Yeah, chapters and stories. You always have such a way with words, making everything sound so... so beautiful. But sometimes, the most beautiful stories are the ones we write by ourselves."2. Integrating Technical Concepts: Explaining complex ideas using creative analogies.
Example:
In the quaint heart of a bustling city, there was a bakery renowned not only for its delectable pastries but for its mysterious, enigmatic nature. The bakery, run by the eccentric Mr. Quantum, was a place where the ordinary merged with the extraordinary in a dance of flours and creams.
In this bakery, there were two magical eggs, one white and one blue. These eggs were intertwined in a bond so strong and peculiar that when one was cracked open, the contents inside mirrored the other's contents in an instant. Just as a baker might watch with wonder as dough miraculously doubled in size, the bakers at the Quantum Bakery would stand in awe as the contents of these eggs shared the same fate, no matter the distance separating them.
These eggs represented quantum entanglement, a phenomenon where two or more particles become interconnected in such a way that the state of one instantly influences the state of the other, no matter how far apart they may be. Just as the contents of the magical eggs were indistinguishably alike, the particles in a quantum entanglement remain intertwined, their states dependent on each other's, transcending the limitations of space and time.
The Quantum Bakery was a place where bakers and patrons alike could witness this extraordinary connection firsthand. It was a testament to the boundless possibilities of the quantum world, where the unseen forces at work could create wonders that defied the logic of the everyday.
Through the metaphor of the magical eggs, we see that quantum entanglement is not merely a scientific concept, but a dance of particles that defies the conventions of the physical world, much like the artistry of baking at the Quantum Bakery. It is a reminder that the universe, with its infinite mysteries, continues to astound and inspire, challenging us to look beyond the surface and discover the marvels that lie within.
3. Complex Narrative Structures: Creating stories with non-linear narratives or unique perspectives.
Example:
Arcee-Scribe: "The Unraveling"
The old man closed his eyes, a peaceful smile on his face. The tattered photograph slipped from his fingers.
Moments earlier, he had gazed at the image - a young couple, laughing by a lakeside. His wrinkled hand traced her face.
That morning, he'd found the photo while clearing out his attic, buried in a dusty box he hadn't opened in decades.
Fifty years before, he'd shoved that box into the attic's darkest corner, determined to forget.
Fifty years and one day earlier, he'd received the telegram. "We regret to inform you..."
A week before that, she'd waved goodbye, boarding a train in her crisp nurse's uniform. "I'll be back before you know it," she'd said.
And at the very beginning, on a sunny day by the lake, a young man asked a question. She said yes, and they laughed, unaware of the camera capturing their joy.
Business Applications
Arcee-Scribe can be applied to various business tasks:
- Content Creation: Develop blog posts, articles, and social media content.
- Product Descriptions: Generate clear and engaging product descriptions.
- Customer Communication: Draft responses to customer inquiries or complaints.
- Training Materials: Create scenarios for employee training programs.
- Brainstorming: Generate ideas for new products or marketing campaigns.
Evaluations



Acknowledgments
We owe a great deal of credit to the wonderful pre/post-training work done by the InternLM team. Their incredible work has enabled us to deliver models of this caliber. If you wish to use Arcee-Scribe for commercial purposes, InternLM allows free commercial usage - simply fill out this form: https://wj.qq.com/s2/12727483/5dba/
Future Directions
We look forward to seeing how users will utilize Arcee-Scribe in their creative and professional endeavors. The model aims to assist businesses in their content creation processes and provide a tool for exploring new ideas in writing and communication.
Note: This README was written in large part by Arcee-Scribe.
Arcee-SEC
Overview
Llama-3-SEC-Base: A Domain-Specific Chat Agent for SEC Data Analysis
Llama-3-SEC-Base is a state-of-the-art domain-specific large language model trained on a vast corpus of SEC (Securities and Exchange Commission) data. Built upon the powerful Meta-Llama-3-70B-Instruct model, Llama-3-SEC-Base has been developed to provide unparalleled insights and analysis capabilities for financial professionals, investors, researchers, and anyone working with SEC filings and related financial data. This checkpoint does not include supervised fine-tuning (SFT) and is strictly our CPT model merged with Meta-Llama-3-70B-Instruct. For a variant that has been fine-tuned for chat-related purposes, please see Llama-3-SEC-Chat.
Model Details
- Base Model: Meta-Llama-3-70B-Instruct
- Training Data: This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. The final model is being trained with 72B tokens of SEC filings data, carefully mixed with 1B tokens of general data from Together AI's RedPajama dataset: RedPajama-Data-1T to maintain a balance between domain-specific knowledge and general language understanding
- Training Method: Continual Pre-Training (CPT) using the Megatron-Core framework, followed by model merging with the base model using the state-of-the-art TIES merging technique in the Arcee Mergekit toolkit
- Training Infrastructure: AWS SageMaker HyperPod cluster with 4 nodes, each equipped with 32 H100 GPUs, ensuring efficient and scalable training of this massive language model
Use Cases
Llama-3-SEC-Base is designed to assist with a wide range of tasks related to SEC data analysis, including but not limited to:
- In-depth investment analysis and decision support
- Comprehensive risk management and assessment
- Ensuring regulatory compliance and identifying potential violations
- Studying corporate governance practices and promoting transparency
- Conducting market research and tracking industry trends
The model's deep understanding of SEC filings and related financial data makes it an invaluable tool for anyone working in the financial sector, providing powerful natural language processing capabilities tailored to the specific needs of this domain.
Evaluation
To ensure the robustness and effectiveness of Llama-3-SEC-Base, the model has undergone rigorous evaluation on both domain-specific and general benchmarks. Key evaluation metrics include:
- Domain-specific perplexity, measuring the model's performance on SEC-related data
- Extractive numerical reasoning tasks, using subsets of TAT-QA and ConvFinQA datasets
- General evaluation metrics, such as BIG-bench, AGIEval, GPT4all, and TruthfulQA, to assess the model's performance on a wide range of tasks
These results demonstrate significant improvements in domain-specific performance while maintaining strong general capabilities, thanks to the use of advanced CPT and model merging techniques.
Training and Inference
Llama-3-SEC-Base uses the llama3 chat template, which allows for efficient and effective fine-tuning of the model on SEC data. This template ensures that the model maintains its strong conversational abilities while incorporating the domain-specific knowledge acquired during the CPT process.
To run inference with the Llama-3-SEC-Base model using the llama3 chat template, use the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_name = "arcee-ai/Llama-3-SEC-Base"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "What are the key regulatory considerations for a company planning to conduct an initial public offering (IPO) in the United States?"
messages = [
{"role": "system", "content": "You are an expert financial assistant - specializing in governance and regulatory domains."},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Limitations and Future Work
This release represents the initial checkpoint of the Llama-3-SEC-Base model, trained on 20B tokens of SEC data. Additional checkpoints will be released in the future as training on the full 72B token dataset is completed. Future work will focus on further improvements to the CPT data processing layer, exploration of advanced model merging techniques, and alignment of CPT models with SFT, DPO, and other cutting-edge alignment methods to further enhance the model's performance and reliability.
Usage
The model is available for both commercial and non-commercial use under the Llama-3 license. We encourage users to explore the model's capabilities and provide feedback to help us continuously improve its performance and usability. For more information - please see our detailed blog on Llama-3-SEC-Base.
Citation
If you use this model in your research or applications, please cite:
@misc{Introducing_SEC_Data_Chat_Agent,
title={Introducing the Ultimate SEC Data Chat Agent: Revolutionizing Financial Insights},
author={Shamane Siriwardhana and Luke Mayers and Thomas Gauthier and Jacob Solawetz and Tyler Odenthal and Anneketh Vij and Lucas Atkins and Charles Goddard and Mary MacCarthy and Mark McQuade},
year={2024},
note={Available at: \url{firstname@arcee.ai}},
url={URL after published}
}
For further information or inquiries, please contact the authors at their respective email addresses (firstname@arcee.ai). We look forward to seeing the exciting applications and research that will emerge from the use of Llama-3-SEC-Base in the financial domain.
Virtuoso-Lite
Overview
Virtuoso-Lite (10B) is our next-generation, 10-billion-parameter language model based on the Llama-3 architecture. It is distilled from Deepseek-v3 using ~1.1B tokens/logits, allowing it to achieve robust performance at a significantly reduced parameter count compared to larger models. Despite its compact size, Virtuoso-Lite excels in a variety of tasks, demonstrating advanced reasoning, code generation, and mathematical problem-solving capabilities.
GGUF
Quantizations available here
Model Details
- Architecture Base: Falcon-10B (based on Llama-3)
- Parameter Count: 10B
- Tokenizer:
- Initially integrated with Deepseek-v3 tokenizer for logit extraction.
- Final alignment uses the Llama-3 tokenizer, with specialized “tokenizer surgery” for cross-architecture compatibility.
- Distillation Data:
- ~1.1B tokens/logits from Deepseek-v3’s training data.
- Logit-level distillation using a proprietary “fusion merging” approach for maximum fidelity.
- License: falcon-llm-license
Background on Deepseek Distillation
Deepseek-v3 serves as the teacher model, from which we capture logits across billions of tokens. Rather than standard supervised fine-tuning, Virtuoso-Lite applies a full logit-level replication to preserve the most crucial insights from the teacher. This approach enables:
- Strong performance on technical/scientific queries
- Enhanced code generation and debugging
- Improved consistency in math-intensive tasks
Intended Use Cases
- Chatbots & Virtual Assistants
- Lightweight Enterprise Data Analysis
- Research Prototypes & Proofs of Concept
- STEM Educational Tools (where smaller footprint is advantageous)
Evaluations
How to Use
Below is a sample code snippet using transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "arcee-ai/virtuoso-lite"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "Provide a concise summary of quantum entanglement."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training & Fine-Tuning
- Initial Training: Began with Falcon-10B, optimized for large-scale text ingestion.
- Distillation & Merging:
- Trained on ~1.1B tokens/logits from Deepseek-v3.
- Employed “fusion merging” to capture detailed teacher insights.
- Final step included DPO to enhance alignment and mitigate hallucinations.
- Future Developments: We plan to incorporate additional R1 distillations to further improve specialized performance and reduce model footprint.
Performance
Virtuoso-Lite demonstrates strong results across multiple benchmarks (e.g., BBH, MMLU-PRO, MATH), often standing its ground against models with higher parameter counts. This efficiency is largely credited to logit-level distillation, which compresses the teacher model’s capabilities into a more parameter-friendly package.
Limitations
- Context Length: 32k Tokens (may vary depending on the final tokenizer settings and system resources).
- Knowledge Cut-off: Training data may not reflect the latest events or developments beyond June 2024.
Ethical Considerations
- Content Generation Risks: Like any language model, Virtuoso-Lite can generate potentially harmful or biased content if prompted in certain ways.
License
Virtuoso-Lite (10B) is released under the falcon-llm-license License. You are free to use, modify, and distribute this model in both commercial and non-commercial applications, subject to the terms and conditions of the license.
If you have questions or would like to share your experiences using Virtuoso-Lite (10B), please connect with us on social media. We’re excited to see what you build—and how this model helps you innovate!
Virtuoso-Medium-v2
Overview
Virtuoso-Medium-v2 (32B) is our next-generation, 32-billion-parameter language model that builds upon the original Virtuoso-Medium architecture. This version is distilled from Deepseek-v3, leveraging an expanded dataset of 5B+ tokens worth of logits. It achieves higher benchmark scores than our previous release (including surpassing Arcee-Nova 2024 in certain tasks).
GGUF
Quantizations available here
Model Details
- Architecture Base: Qwen-2.5-32B
- Parameter Count: 32B
- Tokenizer:
- Initially integrated with Deepseek-v3 tokenizer for logit extraction.
- Final alignment uses the Qwen tokenizer, using specialized “tokenizer surgery” for cross-architecture compatibility.
- Distillation Data:
- ~1.1B tokens/logits from Deepseek-v3’s training data.
- Logit-level distillation using a proprietary “fusion merging” approach afterwards for maximum fidelity.
- License: Apache-2.0
Background on Deepseek Distillation
Deepseek-v3 serves as the teacher model, from which we capture logits across billions of tokens. Rather than standard supervised fine-tuning, we apply a full logit-level replication. This ensures more precise transference of knowledge, including advanced reasoning in:
- Technical and scientific queries
- Complex code generation
- Mathematical problem-solving
Intended Use Cases
- Advanced Chatbots & Virtual Assistants
- Enterprise Data Analysis & Workflow Automation
- Research Simulations & Natural Language Understanding
- Educational Tools for STEM Fields
Evaluations
How to Use
Below is a sample code snippet using transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "arcee-ai/Virtuoso-Medium-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "Provide a concise summary of quantum entanglement."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training & Fine-Tuning
- Initial Training: Began with Qwen-32B, calibrated for large-scale text ingestion.
- Distillation & Merging:
- Trained on ~1.1B tokens worth of Deepseek-v3 logits.
- Employed “fusion merging” to retain as much teacher expertise as possible.
- Final step included DPO to improve alignment and reduce model hallucinations.
- Continuous Development: Additional R1 distillations are in progress to further enhance performance and specialization.
Performance
Thanks to a larger parameter count and a richer training corpus, Virtuoso-Medium-v2 delivers high scores across multiple benchmarks (BBH, MMLU-PRO, MATH, etc.). It frequently surpasses other 30B+ models and even some 70B+ architectures in specific tasks.
Limitations
- Context Length: 128k Tokens
- Knowledge Cut-off: Training data may not reflect the latest events or developments, leading to gaps in current knowledge beyond June 2024.
Ethical Considerations
- Content Generation Risks: Like any language model, Virtuoso-Medium-v2 can potentially generate harmful or biased content if prompted in certain ways.
License
Virtuoso-Medium-v2 (32B) is released under the Apache-2.0 License. You are free to use, modify, and distribute this model in both commercial and non-commercial applications, subject to the terms and conditions of the license.
If you have questions or would like to share your experiences using these models, please connect with us on social media. We’re excited to see what you build—and how these models help you innovate!
Open-Source Libraries
Built by our researchers for you.

MergeKit
Overview
Expanded Model Support: Merge Anything, Merge Faster
Meet MergeKit v0.1, which dramatically expands the range of models you can merge. No longer are you limited to specific architectures explicitly supported by MergeKit. This release introduces two game-changing improvements:
- Arbitrary transformers models: MergeKit now seamlessly handles any model architecture supported by the popular transformers library. This means you can merge cutting-edge models as soon as they're released, including vision-language models like LLaVa or QwenVL, alongside the diverse collection of decoder-only models already supported. No more waiting for MergeKit to "catch up"–you're empowered to merge immediately.
- Raw PyTorch Models: Beyond transformers models, MergeKit now supports merging raw PyTorch models. This opens up a world of possibilities, allowing you to merge models serialized with either torch.save (pickle) or safetensors. The new mergekit-pytorch entrypoint lets you merge diffusion models (like Stable Diffusion or FLUX), audio models (like Whisper), computer vision models, and virtually any other PyTorch model you can imagine. As with transformers models, the models being merged must be of the same architecture and size.
Arcee Fusion: The Art of Selective Merging
This release also introduces the public availability of Arcee Fusion, a sophisticated merging method previously used internally to develop our Supernova, Medius, and Virtuoso series models. Arcee Fusion takes a more intelligent approach to merging, focusing on the importance of differences between models rather than simply merging everything indiscriminately. Arcee Fusion works in three key stages:
- Importance Scoring: Instead of blindly merging all parameters, Arcee Fusion calculates an importance score for each parameter, combining the absolute difference between model parameters with a divergence measure based on softmax distributions and KL divergence. This ensures that only meaningful changes are considered.
- Dynamic Thresholding: The algorithm analyzes the distribution of importance scores, calculating key quantiles (median, Q1, and Q3) and setting a dynamic threshold using median + 1.5 × IQR (a standard technique for outlier detection). This intelligently filters out less significant changes.
- Selective Integration: A fusion mask is created based on the importance scores and the threshold. Only the most significant elements are incorporated into the base model, ensuring that the merge process is adaptive and selective. This preserves the base model's stability while integrating the most valuable updates from the other model.
Arcee Fusion avoids the pitfalls of over-updating that can occur with simple averaging, providing a more refined and controlled merging experience. You can activate this powerful new method by specifying merge_method: arcee_fusion in your merge configuration file.
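To make the three stages concrete, here is a simplified, self-contained sketch of the scoring, thresholding, and masking logic described above (illustrative only: how the difference and divergence terms are combined here is an assumption, and this is not MergeKit's actual implementation):
import torch
import torch.nn.functional as F

def arcee_fusion_sketch(base: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # 1. Importance scoring: per-parameter absolute differences, weighted by a
    #    KL-based divergence between softmax distributions of the two tensors.
    diff = (other - base).abs()
    p = F.softmax(base.flatten().float(), dim=0)
    q = F.softmax(other.flatten().float(), dim=0)
    kl = (p * (p / q).log()).sum()            # scalar divergence between the two tensors
    importance = diff * (1.0 + kl)

    # 2. Dynamic thresholding: median + 1.5 * IQR over the importance scores.
    q1, med, q3 = torch.quantile(importance.flatten().float(), torch.tensor([0.25, 0.5, 0.75]))
    threshold = med + 1.5 * (q3 - q1)

    # 3. Selective integration: build a fusion mask and take only the most
    #    significant elements from the other model; everything else stays at base.
    mask = importance > threshold
    return torch.where(mask, other, base)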
Multi-GPU Execution
MergeKit v0.1 introduces a new --parallel flag for multi-GPU execution. If you have access to a multi-GPU environment, this flag will unlock a near-linear speedup for your merge operations. The --parallel flag is compatible with all merge methods and model types, significantly reducing merge times and boosting your productivity.
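For example (the configuration path and output directory below are placeholders):
mergekit-yaml merge_config.yml ./merged-model --parallel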
Licensing: Balancing Open Access with Continued Development
Finally, we want to address a change to our licensing model. While we are committed to open access, we also need to ensure the long-term sustainability of MergeKit's development. Therefore, we are transitioning to a Business Source License (BSL).
Why are we doing this?
The techniques and methods we've developed are valuable and unique. We believe in the power of open source and want the community to benefit from our work. However, unrestricted commercial use by large entities could jeopardize our ability to continue developing and improving MergeKit. The BSL is a balanced approach that allows us to share our innovations while protecting our long-term viability.
What does this mean for you?
For the vast majority of users (personal, research, and non-commercial), nothing changes. You retain unrestricted access to MergeKit. Even most commercial users will likely be unaffected. The BSL primarily applies to large corporations and highly successful startups using MergeKit in a production setting. If this applies to you, we'll simply need to discuss a commercial license (which includes direct access to Charles Goddard and the MergeKit development team).
We believe this approach strikes the right balance between fostering open innovation and ensuring the continued growth and development of MergeKit. We want to make it clear: we want you to use MergeKit! This licensing change is about ensuring we can continue to provide you with the best possible tool for model merging.
DistillKit
Overview
DistillKit is an open-source research effort in model distillation by Arcee.AI. Our goal is to provide the community with easy-to-use tools for researching, exploring, and enhancing the adoption of open-source Large Language Model (LLM) distillation methods. This release focuses on practical, effective techniques for improving model performance and efficiency.
Features
- Logit-based Distillation (models must be the same architecture)
- Hidden States-based Distillation (models can be different architectures; see the sketch after this list)
- Support for Supervised Fine-Tuning (SFT) - DPO and CPT to come at a later date.
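As a rough illustration of the hidden-states variant (a minimal sketch only; the learned projection and MSE loss shown here are common choices, not necessarily DistillKit's exact implementation):
import torch
import torch.nn as nn
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hidden, teacher_hidden, projector: nn.Linear):
    # student_hidden: [batch, seq, d_student]; teacher_hidden: [batch, seq, d_teacher]
    # A learned linear projection maps student states into the teacher's hidden size,
    # which is what allows distillation across different architectures.
    projected = projector(student_hidden)
    return F.mse_loss(projected, teacher_hidden)

# Example with illustrative hidden sizes for a small student and a larger teacher.
projector = nn.Linear(1536, 3584)
loss = hidden_state_distillation_loss(torch.randn(2, 8, 1536), torch.randn(2, 8, 3584), projector)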
Installation
Quick Install
For a quick and easy installation, you can use our setup script:
./setup.sh
Manual Installation
If you prefer to install dependencies manually, follow these steps:
1. Install basic requirements:
pip install torch wheel ninja packaging
2. Install Flash Attention:
pip install flash-attn
3. Install DeepSpeed:
pip install deepspeed
4. Install remaining requirements:
pip install -r requirements.txt
Configuration
For simplicity, we've set the config settings directly within the training script. You can customize the configuration as follows:
config = {
"project_name": "distil-logits",
"dataset": {
"name": "mlabonne/FineTome-100k", # Only sharegpt format is currently supported.
"split": "train",
# "num_samples": , # You can pass a number here to limit the number of samples to use.
"seed": 42
},
"models": {
"teacher": "arcee-ai/Arcee-Spark",
"student": "Qwen/Qwen2-1.5B"
},
"tokenizer": {
"max_length": 4096,
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
},
"training": {
"output_dir": "./results",
"num_train_epochs": 3,
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"save_steps": 1000,
"logging_steps": 1,
"learning_rate": 2e-5,
"weight_decay": 0.05,
"warmup_ratio": 0.1,
"lr_scheduler_type": "cosine",
"resume_from_checkpoint": None, # Set to a path or True to resume from the latest checkpoint
"fp16": False,
"bf16": True
},
"distillation": {
"temperature": 2.0,
"alpha": 0.5
},
"model_config": {
"use_flash_attention": True
}
# "spectrum": {
# "layers_to_unfreeze": "/workspace/spectrum/snr_results_Qwen-Qwen2-1.5B_unfrozenparameters_50percent.yaml" # You can pass a spectrum yaml file here to freeze layers identified by spectrum.
# }
}
Chat Template
If you want to use a chat template other than chatml, copy it from the model's tokenizer_config.json, and replace the current chat_template entry in the configuration.
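For example, a minimal way to pull a template from a model's tokenizer_config.json and drop it into the config dict shown above (a sketch assuming the file defines a chat_template field, as Hugging Face tokenizer configs conventionally do):
import json

# Read the chat template shipped with the model's tokenizer
# (path and key follow the standard Hugging Face conventions).
with open("tokenizer_config.json") as f:
    chat_template = json.load(f)["chat_template"]

# Replace the chat_template entry in the DistillKit config from the training script.
config["tokenizer"]["chat_template"] = chat_template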
Spectrum Integration
You can use Spectrum to speed up training (note that it does not reduce memory overhead). To enable Spectrum, uncomment the "spectrum" section in the configuration and provide the path to your Spectrum YAML file. Please note that further evaluations with Spectrum are TBD.
Usage
To launch DistillKit, use the following command:
accelerate launch distil_logits.py
You can replace distil_logits.py with whichever script you want to use.
Advanced Configurations
If you wish to use DeepSpeed, Fully Sharded Data Parallel (FSDP), or Megatron sharding, you can set up your configuration using:
accelerate config
Follow the prompts to configure your desired setup.
DeepSpeed Configurations
We provide sample DeepSpeed configuration files in the ./deepspeed_configs directory. These configurations are shamelessly stolen from the Axolotl project (thanks to Wing Lian and the Axolotl team for their excellent work!).
To use a specific DeepSpeed configuration, you can specify it in your accelerate config.
Distillation Methods
DistillKit supports two primary distillation methods:
- Logit-based Distillation: This method transfers knowledge from a larger teacher model to a smaller student model by using both hard targets (actual labels) and soft targets (teacher logits). The soft target loss, computed using Kullback-Leibler (KL) divergence, encourages the student to mimic the teacher's output distribution. This method enhances the student model's generalization and efficiency while maintaining performance closer to the teacher model (see the loss sketch after this list).
- Hidden States-based Distillation: This method involves transferring knowledge by aligning the intermediate layer representations of the student model with those of the teacher model. This process enhances the student's learning by providing richer, layer-wise guidance, improving its performance and generalization. This method allows for cross-architecture distillation, providing flexibility in model architecture choices.
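For concreteness, the following is a minimal sketch of the logit-based objective, using the same temperature and alpha knobs as the configuration shown earlier; it is an illustration rather than DistillKit's exact code. The hidden states-based method would instead align intermediate representations, typically adding a learned projection when teacher and student dimensions differ.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-scaled distributions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (temperature ** 2)
    # Hard-target loss: ordinary cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    # Blend the two objectives; alpha weights the distillation term.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss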
Performance and Memory Requirements
While the implementation of DistillKit is relatively straightforward, the memory requirements for distillation are higher compared to standard SFT. We are actively working on scaling DistillKit to support models larger than 70B parameters, which will involve advanced techniques and efficiency improvements.
Experimental Results
Our experiments have shown promising results in both general-purpose and domain-specific tasks. Key findings include:
- Both logit-based and hidden states-based distillation methods show improvements over standard SFT across most benchmarks.
- Significant performance gains were observed when distilling models for domain-specific tasks.
- Using the same training dataset for distillation as was used for the teacher model can lead to higher performance gains.
For detailed results and analysis, please refer to our case studies and experimental results here.
Arcee-Labs
This release marks the debut of Arcee-Labs, a division of Arcee.ai dedicated to accelerating open-source research. Our mission is to rapidly deploy resources, models, and research findings to empower both Arcee and the wider community. In an era of increasingly frequent breakthroughs in LLM research, models, and techniques, we recognize the need for agility and adaptability. Through our efforts, we strive to significantly contribute to the advancement of open-source AI technology and support the community in keeping pace with these rapid developments.
Future Directions
We are excited to see how the community will use and improve DistillKit. Future releases will include Continued Pre-Training (CPT) and Direct Preference Optimization (DPO) distillation methods. We welcome community contributions in the form of new distillation methods, training routine improvements, and memory optimizations.
Contributing
We welcome contributions from the community! If you have ideas for improvements, new features, or bug fixes, please feel free to open an issue or submit a pull request.
Contact
For more information about Arcee.AI and our training platform, visit our website at https://arcee.ai.
For technical questions or support, please open an issue in this repository.
Acknowledgments
While our work is ultimately quite different, this project was inspired by Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs, and we thank the authors for their efforts and contributions. We would also like to thank the open-source community and everyone at Arcee.ai who has helped make DistillKit possible. We're just getting started.
EvolKit
Overview
EvolKit is a framework for automatically enhancing the complexity of instructions used in fine-tuning Large Language Models (LLMs). Our project aims to revolutionize the evolution process by leveraging open-source LLMs, moving away from closed-source alternatives.
Key Features
- Automatic instruction complexity enhancement
- Integration with open-source LLMs
- Streamlined fine-tuning process
- Support for various datasets from Hugging Face
- Flexible configuration options for optimization
Installation
To set up EvolKit, follow these steps:
1. Clone the repository:
git clone https://github.com/arcee-ai/EvolKit.git
cd EvolKit
2. Install the required dependencies:
pip install -r requirements.txt
Usage
To run the AutoEvol script, use the following command structure:
python run_evol.py --dataset <dataset_name> [options]
Required Parameters:
- --dataset <dataset_name>: The name of the dataset on Hugging Face to use.
- --model <model_name>: Model to use for evolving instructions.
- --generator <generator_type>: Type of generator to use ('openrouter' or 'vllm').
- --batch_size <int>: Number of instructions to process in each batch.
- --num_methods <int>: Number of evolution methods to use.
- --max_concurrent_batches <int>: Maximum number of batches to process concurrently (in our experiments, a cluster of 8xH100s hosting Qwen2-72B-Instruct-GPTQ-Int8 can handle a batch size of 50 concurrently).
- --evolve_epoch <int>: Maximum number of epochs for evolving each instruction.
- --output_file <filename>: Name of the output file to save results.
Optional Parameters:
- --dev_set_size <int>: Number of samples to use in the development set. Use -1 for no dev set. Default is -1. (We do not recommend using a dev set since it will take much more time to finish each round.)
- --use_reward_model: Flag to use a reward model for evaluation. No value required.
Models
We found 2 models that work very well with this pipeline:
- Qwen2-72B-Instruct and DeepSeek-V2.5 (GPTQ and AWQ versions are fine too).
- Other models might work, but they need to be very good at generating structured content so that their outputs can be parsed reliably.
VLLM Support
To use VLLM as the backend, set the VLLM_BACKEND environment variable:
export VLLM_BACKEND=http://your-vllm-backend-url:port/v1
If not set, it will default to 'http://localhost:8000/v1'.
Example Usage:
To run AutoEvol on the 'small_tomb' dataset with custom parameters:
python run_evol.py --dataset qnguyen3/small_tomb --model Qwen/Qwen2-72B-Instruct-GPTQ-Int8 --generator vllm --batch_size 100 --num_methods 3 --max_concurrent_batches 10 --evolve_epoch 3 --output_file the_tomb_evolved-3e-batch100.json --dev_set_size 5 --use_reward_model
This command will:
- Load the 'qnguyen3/small_tomb' dataset from Hugging Face.
- Use the Qwen2-72B-Instruct model with VLLM as the generator.
- Process samples in batches of 100.
- Apply 3 evolution methods for each instruction.
- Process 10 batches concurrently.
- Evolve each instruction for up to 3 epochs.
- Use 5 samples for the development set.
- Use the reward model for evaluation.
- Output the final evolved instructions to the_tomb_evolved-3e-batch100.json.
After evolving the instructions, you can generate answers using:
python gen_answers.py --model Qwen/Qwen2-72B-Instruct-GPTQ-Int8 --generator vllm --data_path the_tomb_evolved-3e-batch100.json --batch_size 50 --output completed_evol_data.json
The final dataset will be saved to completed_evol_data.json in ShareGPT format.
Components
EvolKit consists of several key components (a rough sketch of how they fit together follows this list):
- Generator: Uses an LLM for generating initial instructions (OpenRouter or VLLM).
- Evolver: Employs a recurrent evolution strategy.
- Analyzer: Utilizes trajectory analysis.
- Evaluator: Offers two options:
- Reward Model Evaluator
- Failure Detector Evaluator (originally from WizardLM's paper)
- Optimizer: Optimizes the evolution method for the next round.
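To show how these components might interact, here is a purely hypothetical sketch of a single evolution round; every name in it is illustrative and does not correspond to EvolKit's actual classes or functions.
# Hypothetical data flow for one round of instruction evolution.
def evolution_round(batch, generator, evolver, analyzer, evaluator, optimizer, method):
    evolved = [evolver.evolve(instruction, method, generator) for instruction in batch]
    trajectories = [analyzer.analyze(before, after) for before, after in zip(batch, evolved)]
    scores = [evaluator.score(instruction) for instruction in evolved]
    # The optimizer refines the evolution method for the next round
    # based on the observed trajectories and evaluation scores.
    next_method = optimizer.improve(method, trajectories, scores)
    return evolved, next_method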
Output
The script saves the results in JSON format to the specified output file. Each entry in the JSON file represents an evolved instruction along with relevant metadata.
You can find a 20k subset of a dataset generated using EvolKit here.
Acknowledgement
- Microsoft's WizardLM team for the inspiration from the AutoEvol paper.
Spectrum
Overview
This repository contains the implementation of Spectrum, as detailed in the paper Spectrum: Targeted Training on Signal to Noise Ratio.
Spectrum is a tool for scanning and evaluating the Signal-to-Noise Ratio (SNR) of layers in large language models. By identifying the top n% of layers with the highest SNR, you can optimize training efficiency.
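As a toy illustration of the layer-selection step (the SNR values themselves come from Spectrum's scan, following the method in the paper; the function below only shows the ranking and top-n% cut, with hypothetical names):
def top_percent_layers(snr_by_layer: dict, top_percent: float) -> list:
    # Rank layers from highest to lowest SNR and keep the top n%.
    ranked = sorted(snr_by_layer.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * top_percent / 100))
    return [name for name, _ in ranked[:keep]]

# Example: with --top-percent 50, the half of the scanned layers with the
# highest SNR would be returned for unfreezing.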
Features
- Model Scanning: Scan models to determine the SNR of each layer.
- Top n% Layer Identification: Identify and sort the top n% of layers based on their SNR.
- Unfrozen Parameters Configuration: Generate configuration files for unfreezing specific layers in Axolotl or other libraries.
Installation
To use Spectrum, you need to have Python installed. Clone this repository and install the necessary dependencies:
git clone https://github.com/cognitivecomputations/spectrum.git
cd spectrum
pip install -r requirements.txt
Usage
To use Spectrum, run the following command:
python spectrum.py --model-name <insert local or HF repo here> --top-percent <top % of snr ratios to target>
- --model-name: Specify the local model path or the Hugging Face repository.
- --top-percent: Specify the top percentage of SNR layers you want to retrieve.
Spectrum will check the model_snr_results folder to see if the model has already been scanned. If not, it will prompt you for the batch size to use for scanning. Once the scan is complete, it will output the SNR ratios in the model_snr_results folder and provide a sorted list of the highest to lowest SNR ratios along with an unfrozen parameters YAML file.
Example Command
python spectrum.py --model-name meta-llama/Meta-Llama-3-8B-Instruct --top-percent 50
Scanning New Models
Spectrum will check the model_snr_results folder to see if we've already scanned the model (we invite you to add your own scans for models we don't have via PR); if we have, it will give you the top n% of those ratios. Otherwise, it will ask you what batch size you want to scan at. We've been able to use a batch size of 4 for 70B models on an 8xH100 node. It will then load the model, and you will be presented with all available modules to scan. We typically only select the MLP/attention layers, but if you're doing continued pretraining or language tasks it wouldn't hurt to include all available modules.
It will then scan the model and output the SNR ratios in the model_snr_results folder. It will also output a list sorted from highest to lowest SNR ratio, along with an unfrozen-parameters YAML file. This matches the Axolotl config format, and you can copy and paste it directly into your Axolotl YAML. That's it!
Integration with Axolotl
If you're using Axolotl, the generated YAML file can be directly integrated.
Integration with Other Libraries
For integration with other libraries, we provide a simple script to freeze and unfreeze parameters:
def _freeze_and_unfreeze_parameters(self):
    # Freeze all parameters
    for param in self.model.parameters():
        param.requires_grad = False
    # Unfreeze the parameters selected by Spectrum
    for name, param in self.model.named_parameters():
        if any(unfrozen_param in name for unfrozen_param in self.unfrozen_parameters):
            param.requires_grad = True
unfrozen_parameters = [
'model.layers.62.mlp.down_proj', 'model.layers.63.mlp.down_proj', 'model.layers.66.mlp.down_proj',
'model.layers.65.mlp.down_proj', 'model.layers.64.mlp.down_proj', 'model.layers.67.mlp.down_proj',
'model.layers.68.mlp.down_proj', 'model.layers.60.mlp.down_proj', 'model.layers.31.mlp.down_proj',
'model.layers.69.mlp.down_proj', 'model.layers.61.mlp.down_proj', 'model.layers.59.mlp.down_proj',
'model.layers.70.mlp.down_proj', 'model.layers.30.mlp.down_proj', 'model.layers.76.mlp.down_proj',
'model.layers.72.mlp.down_proj', 'model.layers.77.mlp.down_proj', 'model.layers.71.mlp.down_proj',
'model.layers.29.mlp.down_proj', 'model.layers.58.mlp.down_proj', 'model.layers.78.mlp.gate_proj',
'model.layers.77.mlp.gate_proj', 'model.layers.76.mlp.gate_proj', 'model.layers.79.mlp.gate_proj',
'model.layers.75.mlp.gate_proj', 'model.layers.74.mlp.gate_proj', 'model.layers.73.mlp.gate_proj',
'model.layers.70.mlp.gate_proj', 'model.layers.72.mlp.gate_proj', 'model.layers.71.mlp.gate_proj',
'model.layers.69.mlp.gate_proj', 'model.layers.54.mlp.gate_proj', 'model.layers.68.mlp.gate_proj',
'model.layers.57.mlp.gate_proj', 'model.layers.63.mlp.gate_proj', 'model.layers.49.mlp.gate_proj',
'model.layers.55.mlp.gate_proj', 'model.layers.53.mlp.gate_proj', 'model.layers.44.mlp.gate_proj',
'model.layers.46.mlp.gate_proj', 'model.layers.69.mlp.up_proj', 'model.layers.70.mlp.up_proj',
'model.layers.71.mlp.up_proj', 'model.layers.68.mlp.up_proj', 'model.layers.67.mlp.up_proj',
'model.layers.66.mlp.up_proj', 'model.layers.46.mlp.up_proj', 'model.layers.63.mlp.up_proj',
'model.layers.72.mlp.up_proj', 'model.layers.64.mlp.up_proj', 'model.layers.62.mlp.up_proj',
'model.layers.45.mlp.up_proj', 'model.layers.65.mlp.up_proj', 'model.layers.73.mlp.up_proj',
'model.layers.47.mlp.up_proj', 'model.layers.44.mlp.up_proj', 'model.layers.49.mlp.up_proj',
'model.layers.48.mlp.up_proj', 'model.layers.53.mlp.up_proj', 'model.layers.74.mlp.up_proj',
'model.layers.79.self_attn.k_proj', 'model.layers.36.self_attn.k_proj', 'model.layers.35.self_attn.k_proj',
'model.layers.74.self_attn.k_proj', 'model.layers.34.self_attn.k_proj', 'model.layers.78.self_attn.k_proj',
'model.layers.77.self_attn.k_proj', 'model.layers.37.self_attn.k_proj', 'model.layers.39.self_attn.k_proj',
'model.layers.41.self_attn.k_proj', 'model.layers.38.self_attn.k_proj', 'model.layers.33.self_attn.k_proj',
'model.layers.69.self_attn.k_proj', 'model.layers.42.self_attn.k_proj', 'model.layers.32.self_attn.k_proj',
'model.layers.25.self_attn.k_proj', 'model.layers.70.self_attn.k_proj', 'model.layers.22.self_attn.k_proj',
'model.layers.63.self_attn.k_proj', 'model.layers.29.self_attn.k_proj', 'model.layers.14.self_attn.o_proj',
'model.layers.39.self_attn.o_proj', 'model.layers.19.self_attn.o_proj', 'model.layers.16.self_attn.o_proj',
'model.layers.17.self_attn.o_proj', 'model.layers.15.self_attn.o_proj', 'model.layers.69.self_attn.o_proj',
'model.layers.12.self_attn.o_proj', 'model.layers.42.self_attn.o_proj', 'model.layers.23.self_attn.o_proj',
'model.layers.22.self_attn.o_proj', 'model.layers.29.self_attn.o_proj', 'model.layers.13.self_attn.o_proj',
'model.layers.46.self_attn.o_proj', 'model.layers.52.self_attn.o_proj', 'model.layers.26.self_attn.o_proj',
'model.layers.38.self_attn.o_proj', 'model.layers.41.self_attn.o_proj', 'model.layers.18.self_attn.o_proj',
'model.layers.49.self_attn.o_proj', 'model.layers.1.self_attn.q_proj', 'model.layers.2.self_attn.q_proj',
'model.layers.3.self_attn.q_proj', 'model.layers.5.self_attn.q_proj', 'model.layers.4.self_attn.q_proj',
'model.layers.0.self_attn.q_proj', 'model.layers.6.self_attn.q_proj', 'model.layers.8.self_attn.q_proj',
'model.layers.7.self_attn.q_proj', 'model.layers.9.self_attn.q_proj', 'model.layers.10.self_attn.q_proj',
'model.layers.12.self_attn.q_proj', 'model.layers.19.self_attn.q_proj', 'model.layers.18.self_attn.q_proj',
'model.layers.25.self_attn.q_proj', 'model.layers.11.self_attn.q_proj', 'model.layers.15.self_attn.q_proj',
'model.layers.61.self_attn.q_proj', 'model.layers.17.self_attn.q_proj', 'model.layers.55.self_attn.q_proj',
'model.layers.15.self_attn.v_proj', 'model.layers.16.self_attn.v_proj', 'model.layers.23.self_attn.v_proj',
'model.layers.24.self_attn.v_proj', 'model.layers.25.self_attn.v_proj', 'model.layers.26.self_attn.v_proj',
'model.layers.27.self_attn.v_proj', 'model.layers.28.self_attn.v_proj', 'model.layers.29.self_attn.v_proj',
'model.layers.30.self_attn.v_proj', 'model.layers.31.self_attn.v_proj', 'model.layers.32.self_attn.v_proj',
'model.layers.33.self_attn.v_proj', 'model.layers.34.self_attn.v_proj', 'model.layers.35.self_attn.v_proj',
'model.layers.36.self_attn.v_proj', 'model.layers.37.self_attn.v_proj', 'model.layers.38.self_attn.v_proj',
'model.layers.39.self_attn.v_proj', 'model.layers.41.self_attn.v_proj'
]
Replace unfrozen_parameters with the layers specific to your model.
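If you are not using Axolotl, a minimal way to wire the Spectrum output into your own training setup could look like the following sketch; the YAML filename is hypothetical, and we assume the generated file exposes an unfrozen_parameters list as in the Axolotl example below. Note the simple substring matching here; adapt it if your list contains regex patterns such as ^lm_head.weight$.
import yaml
from transformers import AutoModelForCausalLM

# Hypothetical path to the YAML produced by a Spectrum scan.
with open("snr_results_unfrozenparameters_50percent.yaml") as f:
    unfrozen_parameters = yaml.safe_load(f)["unfrozen_parameters"]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Freeze everything, then unfreeze only the layers Spectrum selected.
for param in model.parameters():
    param.requires_grad = False
for name, param in model.named_parameters():
    if any(key in name for key in unfrozen_parameters):
        param.requires_grad = True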
Example Configuration for Axolotl
base_model: Qwen/Qwen2-72B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true
# load_in_8bit: true
# load_in_4bit: false
# strict: false
datasets:
- path: /workspace/datasets/dolphin-2.9.2/dolphin201-sharegpt2.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/dolphin-coder-codegen-sharegpt2.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/dolphin-coder-translate-sharegpt2.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/m-a-p_Code-Feedback-sharegpt-unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/m-a-p_CodeFeedback-Filtered-Instruction-sharegpt-unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/not_samantha_norefusals.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/openhermes200k_unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/Orca-Math-resort-unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/SystemChat_sharegpt.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/toolbench_instruct_j1s1_3k_unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/toolbench_negative_unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/toolbench_react_10p_unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/toolbench_tflan_cot_30p_unfiltered.jsonl
type: sharegpt
conversation: chatml
- path: /workspace/datasets/dolphin-2.9.2/agent_instruct_react_unfiltered.jsonl
type: sharegpt
conversation: chatml
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.62.mlp.down_proj
- model.layers.63.mlp.down_proj
- model.layers.66.mlp.down_proj
- model.layers.65.mlp.down_proj
- model.layers.64.mlp.down_proj
- model.layers.67.mlp.down_proj
- model.layers.68.mlp.down_proj
- model.layers.60.mlp.down_proj
- model.layers.31.mlp.down_proj
- model.layers.69.mlp.down_proj
- model.layers.61.mlp.down_proj
- model.layers.59.mlp.down_proj
- model.layers.70.mlp.down_proj
- model.layers.30.mlp.down_proj
- model.layers.76.mlp.down_proj
- model.layers.72.mlp.down_proj
- model.layers.77.mlp.down_proj
- model.layers.71.mlp.down_proj
- model.layers.29.mlp.down_proj
- model.layers.58.mlp.down_proj
- model.layers.75.mlp.down_proj
- model.layers.32.mlp.down_proj
- model.layers.56.mlp.down_proj
- model.layers.28.mlp.down_proj
- model.layers.26.mlp.down_proj
- model.layers.33.mlp.down_proj
- model.layers.34.mlp.down_proj
- model.layers.57.mlp.down_proj
- model.layers.27.mlp.down_proj
- model.layers.25.mlp.down_proj
- model.layers.35.mlp.down_proj
- model.layers.73.mlp.down_proj
- model.layers.24.mlp.down_proj
- model.layers.78.mlp.down_proj
- model.layers.74.mlp.down_proj
- model.layers.54.mlp.down_proj
# mlp.gate_proj layers
- model.layers.78.mlp.gate_proj
- model.layers.77.mlp.gate_proj
- model.layers.76.mlp.gate_proj
- model.layers.79.mlp.gate_proj
- model.layers.75.mlp.gate_proj
- model.layers.74.mlp.gate_proj
- model.layers.73.mlp.gate_proj
- model.layers.70.mlp.gate_proj
- model.layers.72.mlp.gate_proj
- model.layers.71.mlp.gate_proj
- model.layers.69.mlp.gate_proj
- model.layers.54.mlp.gate_proj
- model.layers.68.mlp.gate_proj
- model.layers.57.mlp.gate_proj
- model.layers.63.mlp.gate_proj
- model.layers.49.mlp.gate_proj
- model.layers.55.mlp.gate_proj
- model.layers.53.mlp.gate_proj
- model.layers.44.mlp.gate_proj
- model.layers.46.mlp.gate_proj
- model.layers.67.mlp.gate_proj
- model.layers.58.mlp.gate_proj
- model.layers.56.mlp.gate_proj
- model.layers.45.mlp.gate_proj
- model.layers.50.mlp.gate_proj
- model.layers.62.mlp.gate_proj
- model.layers.64.mlp.gate_proj
- model.layers.48.mlp.gate_proj
- model.layers.66.mlp.gate_proj
- model.layers.52.mlp.gate_proj
- model.layers.40.mlp.gate_proj
- model.layers.47.mlp.gate_proj
- model.layers.43.mlp.gate_proj
- model.layers.65.mlp.gate_proj
- model.layers.61.mlp.gate_proj
- model.layers.59.mlp.gate_proj
# mlp.up_proj layers
- model.layers.69.mlp.up_proj
- model.layers.70.mlp.up_proj
- model.layers.71.mlp.up_proj
- model.layers.68.mlp.up_proj
- model.layers.67.mlp.up_proj
- model.layers.66.mlp.up_proj
- model.layers.46.mlp.up_proj
- model.layers.63.mlp.up_proj
- model.layers.72.mlp.up_proj
- model.layers.64.mlp.up_proj
- model.layers.62.mlp.up_proj
- model.layers.45.mlp.up_proj
- model.layers.65.mlp.up_proj
- model.layers.73.mlp.up_proj
- model.layers.47.mlp.up_proj
- model.layers.44.mlp.up_proj
- model.layers.49.mlp.up_proj
- model.layers.48.mlp.up_proj
- model.layers.53.mlp.up_proj
- model.layers.74.mlp.up_proj
- model.layers.75.mlp.up_proj
- model.layers.57.mlp.up_proj
- model.layers.76.mlp.up_proj
- model.layers.43.mlp.up_proj
- model.layers.42.mlp.up_proj
- model.layers.61.mlp.up_proj
- model.layers.40.mlp.up_proj
- model.layers.56.mlp.up_proj
- model.layers.60.mlp.up_proj
- model.layers.31.mlp.up_proj
- model.layers.54.mlp.up_proj
- model.layers.55.mlp.up_proj
- model.layers.32.mlp.up_proj
- model.layers.41.mlp.up_proj
- model.layers.33.mlp.up_proj
- model.layers.58.mlp.up_proj
# self_attn.k_proj layers
- model.layers.79.self_attn.k_proj
- model.layers.36.self_attn.k_proj
- model.layers.35.self_attn.k_proj
- model.layers.74.self_attn.k_proj
- model.layers.34.self_attn.k_proj
- model.layers.78.self_attn.k_proj
- model.layers.77.self_attn.k_proj
- model.layers.37.self_attn.k_proj
- model.layers.39.self_attn.k_proj
- model.layers.41.self_attn.k_proj
- model.layers.38.self_attn.k_proj
- model.layers.33.self_attn.k_proj
- model.layers.69.self_attn.k_proj
- model.layers.42.self_attn.k_proj
- model.layers.32.self_attn.k_proj
- model.layers.25.self_attn.k_proj
- model.layers.70.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.63.self_attn.k_proj
- model.layers.29.self_attn.k_proj
- model.layers.68.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.30.self_attn.k_proj
- model.layers.66.self_attn.k_proj
- model.layers.31.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.65.self_attn.k_proj
- model.layers.57.self_attn.k_proj
- model.layers.28.self_attn.k_proj
- model.layers.64.self_attn.k_proj
- model.layers.44.self_attn.k_proj
- model.layers.27.self_attn.k_proj
- model.layers.75.self_attn.k_proj
- model.layers.40.self_attn.k_proj
- model.layers.26.self_attn.k_proj
- model.layers.61.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.14.self_attn.o_proj
- model.layers.39.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.16.self_attn.o_proj
- model.layers.17.self_attn.o_proj
- model.layers.15.self_attn.o_proj
- model.layers.69.self_attn.o_proj
- model.layers.12.self_attn.o_proj
- model.layers.42.self_attn.o_proj
- model.layers.23.self_attn.o_proj
- model.layers.22.self_attn.o_proj
- model.layers.29.self_attn.o_proj
- model.layers.13.self_attn.o_proj
- model.layers.46.self_attn.o_proj
- model.layers.52.self_attn.o_proj
- model.layers.26.self_attn.o_proj
- model.layers.38.self_attn.o_proj
- model.layers.41.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.49.self_attn.o_proj
- model.layers.11.self_attn.o_proj
- model.layers.28.self_attn.o_proj
- model.layers.25.self_attn.o_proj
- model.layers.47.self_attn.o_proj
- model.layers.53.self_attn.o_proj
- model.layers.27.self_attn.o_proj
- model.layers.37.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.43.self_attn.o_proj
- model.layers.44.self_attn.o_proj
- model.layers.45.self_attn.o_proj
- model.layers.30.self_attn.o_proj
- model.layers.24.self_attn.o_proj
- model.layers.21.self_attn.o_proj
- model.layers.10.self_attn.o_proj
- model.layers.3.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.1.self_attn.q_proj
- model.layers.2.self_attn.q_proj
- model.layers.3.self_attn.q_proj
- model.layers.5.self_attn.q_proj
- model.layers.4.self_attn.q_proj
- model.layers.0.self_attn.q_proj
- model.layers.6.self_attn.q_proj
- model.layers.8.self_attn.q_proj
- model.layers.7.self_attn.q_proj
- model.layers.9.self_attn.q_proj
- model.layers.10.self_attn.q_proj
- model.layers.12.self_attn.q_proj
- model.layers.19.self_attn.q_proj
- model.layers.18.self_attn.q_proj
- model.layers.25.self_attn.q_proj
- model.layers.11.self_attn.q_proj
- model.layers.15.self_attn.q_proj
- model.layers.61.self_attn.q_proj
- model.layers.17.self_attn.q_proj
- model.layers.55.self_attn.q_proj
- model.layers.54.self_attn.q_proj
- model.layers.16.self_attn.q_proj
- model.layers.68.self_attn.q_proj
- model.layers.49.self_attn.q_proj
- model.layers.48.self_attn.q_proj
- model.layers.52.self_attn.q_proj
- model.layers.13.self_attn.q_proj
- model.layers.42.self_attn.q_proj
- model.layers.57.self_attn.q_proj
- model.layers.60.self_attn.q_proj
- model.layers.53.self_attn.q_proj
- model.layers.64.self_attn.q_proj
- model.layers.66.self_attn.q_proj
- model.layers.62.self_attn.q_proj
- model.layers.59.self_attn.q_proj
- model.layers.50.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.15.self_attn.v_proj
- model.layers.16.self_attn.v_proj
- model.layers.23.self_attn.v_proj
- model.layers.24.self_attn.v_proj
- model.layers.25.self_attn.v_proj
- model.layers.26.self_attn.v_proj
- model.layers.27.self_attn.v_proj
- model.layers.28.self_attn.v_proj
- model.layers.29.self_attn.v_proj
- model.layers.30.self_attn.v_proj
- model.layers.31.self_attn.v_proj
- model.layers.32.self_attn.v_proj
- model.layers.33.self_attn.v_proj
- model.layers.34.self_attn.v_proj
- model.layers.35.self_attn.v_proj
- model.layers.36.self_attn.v_proj
- model.layers.37.self_attn.v_proj
- model.layers.38.self_attn.v_proj
- model.layers.39.self_attn.v_proj
- model.layers.41.self_attn.v_proj
- model.layers.42.self_attn.v_proj
- model.layers.48.self_attn.v_proj
- model.layers.53.self_attn.v_proj
- model.layers.57.self_attn.v_proj
- model.layers.58.self_attn.v_proj
- model.layers.59.self_attn.v_proj
- model.layers.61.self_attn.v_proj
- model.layers.63.self_attn.v_proj
- model.layers.64.self_attn.v_proj
- model.layers.65.self_attn.v_proj
- model.layers.66.self_attn.v_proj
- model.layers.69.self_attn.v_proj
- model.layers.74.self_attn.v_proj
- model.layers.75.self_attn.v_proj
- model.layers.76.self_attn.v_proj
- model.layers.72.self_attn.v_proj
chat_template: chatml
dataset_prepared_path: qwen2-72b-data
val_set_size: 0.01
output_dir: qwen2-72b
sequence_len: 8192 # supports up to 8192
sample_packing: true
pad_to_sequence_len: true
# adapter: lora
# lora_model_dir:
# lora_r: 32
# lora_alpha: 16
# lora_dropout: 0.05
# lora_target_linear: true
# lora_fan_in_fan_out:
wandb_project: qwen2-72b
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 1e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 4
save_total_limit: 2
debug:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_params.json
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
pad_token: "<|endoftext|>"
eos_token: "<|im_end|>"
Contribution
We invite you to contribute to the Spectrum project by adding new model scans. Please fork the repository, add your scans to the model_snr_results folder, and submit a pull request.
License
This project is licensed under the Apache-2.0 License.
Acknowledgments
Thank you for using Spectrum! For any questions or issues, please open an issue on this repository.