How Arcee AI helped Madeline build a reasoning model from first principles.
Madeline & Co. is an end-to-end AI-powered strategy, design, and innovation platform that helps anyone, from in-house teams and founders to marketers and creatives, navigate complex decisions with clarity and confidence. At its core is Madeline-s1, a powerful language model trained in design, strategy, systems thinking, UX, and storytelling, delivering real-time insights and intelligent recommendations as you build.
When initially building out their product suite, Madeline & Co. tried off-the-shelf large language models (LLMs), but repeatedly ran into high inference costs, poor performance at scale, and inconsistent accuracy in their specific domains. The accuracy issues stemmed primarily from a lack of context-specific reasoning, cross-disciplinary synthesis, and brand-safe output. Madeline & Co. founder Prince Rumi described one specific example:
“When exploring a brand strategy for a sustainability startup, general models like Claude 3.7 Sonnet and GPT-4 would describe common channels or run-of-the-mill SWOTs. However, we required a model that would draw from a cross-section of startup decks, ethnographic research, brand campaigns, and founder memos to suggest an unexpected but contextually valid launch path—say, a limited-release collaboration with a fashion designer in the climate space. That leap requires nuance, not just knowledge.”
This dissatisfaction with off-the-shelf LLMs led Madeline & Co. to partner with Arcee AI to build a custom model that could reason with depth, align with internal frameworks, and operate with the flexibility needed for creative and strategic exploration. Through the partnership, Madeline & Co. gained access to Arcee's research lab, which guided them through each stage of the model development cycle.
The initial phase involved building a 60-million-token dataset, curated from Madeline & Co.'s deep expertise in their field.
With this dataset, we applied Continuous Pre-Training (CPT), targeting specific domain knowledge gaps and behavioral patterns identified in the base model.
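CPT of this kind amounts to continued causal-language-model training over the curated corpus. As a minimal illustration of the data-preparation side only (the function, sequence length, and chunking strategy are our assumptions, not Madeline & Co.'s actual pipeline), a tokenized corpus is typically packed into fixed-length training sequences:

```python
def pack_corpus(token_ids, seq_len=2048):
    """Chunk one long token stream into fixed-length sequences for
    continued pre-training, dropping the trailing partial chunk
    (standard causal-LM packing; illustrative sketch only)."""
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len]
            for i in range(n_full)]

# e.g. a 60M-token corpus at seq_len=2048 packs into 29,296 full sequences
print(60_000_000 // 2048)  # 29296
```

In practice each packed sequence is fed to the trainer with a next-token prediction objective, which is how the model absorbs the domain knowledge the base model lacked.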
Following CPT completion, we utilized Arcee's MergeKit library to combine our newly trained model with complementary models in the ecosystem. We determined the merging ratios through systematic experimentation and testing various interpolation weights to achieve optimal performance across our evaluation benchmarks.
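MergeKit merges are usually driven by a YAML config, but the heart of a linear merge is a weighted interpolation over matching parameter tensors. A minimal sketch of that interpolation, with plain Python lists standing in for tensors (the 0.5 weight below is an arbitrary example, not the ratio chosen for Madeline-s1):

```python
def linear_merge(state_a, state_b, t=0.5):
    """Interpolate two model states parameter-by-parameter:
    merged = t * A + (1 - t) * B. MergeKit performs the equivalent
    over real model tensors, with the weight t (and related
    hyperparameters) chosen through systematic experimentation."""
    return {
        name: [t * a + (1 - t) * b
               for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }

merged = linear_merge({"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, t=0.5)
print(merged)  # {'w': [2.0, 3.0]}
```

Sweeping t across a grid and scoring each merged checkpoint on the evaluation benchmarks is one straightforward way to arrive at the "merging ratios" described above.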
Following the merge, we conducted a comprehensive behavioral analysis of the resulting model. We deployed the model in controlled environments and collected extensive feedback on its responses across diverse query types. We analyzed patterns in reasoning quality, factual accuracy, instruction adherence, and handling of edge cases.
Based on the behavioral analysis, we then curated a high-quality dataset of question-answer pairs specifically designed to address the identified weaknesses. With 100 golden question-answer pairs provided by Madeline & Co., we used a proprietary synthetic data generation technique to create 350k pairs for the SFT run. We meticulously crafted this Supervised Fine-Tuning (SFT) dataset to include examples that demonstrated desired reasoning patterns, correct factual information, and appropriate response styles.
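The synthetic generation technique itself is proprietary, but a common pattern for this kind of 100-to-350k expansion is to template many variant-generation prompts from each golden pair and send them to a teacher model. A hypothetical sketch of the prompt-templating step only (the style labels, field names, and wording are illustrative assumptions, not Arcee's method):

```python
def build_expansion_prompts(golden_pairs,
                            styles=("paraphrase", "harder variant",
                                    "adjacent scenario")):
    """Turn each golden QA pair into several generation prompts for a
    teacher model (hypothetical sketch of one expansion step; the real
    pipeline and its fan-out ratio are not disclosed)."""
    prompts = []
    for pair in golden_pairs:
        for style in styles:
            prompts.append(
                f"Rewrite this expert Q&A as a {style}, preserving the "
                f"reasoning style and factual grounding.\n"
                f"Q: {pair['question']}\nA: {pair['answer']}"
            )
    return prompts

seed = [{"question": "How should a climate startup launch?",
         "answer": "Ground the launch in customer research ..."}]
print(len(build_expansion_prompts(seed)))  # 3
```

The teacher model's outputs would then be filtered for quality and deduplicated before joining the SFT dataset.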
This iterative approach, combining CPT for knowledge acquisition, merging for capability integration, behavioral analysis for weakness identification, and targeted SFT for behavioral refinement, forms a model development pipeline that maximizes the strengths of each training paradigm while addressing their limitations.
After 2 months of training, we presented Madeline-s1, a 32-billion-parameter reasoning model explicitly trained to reflect how Madeline & Co.'s expert strategists and designers think: mapping tradeoffs, surfacing multiple valid options, and grounding decisions in strategic insight and customer research. Rather than producing templated or generic answers, we engineered Madeline-s1 to generate meaningful interpretations and actionable insights across disciplines.
To evaluate Madeline-s1, we conducted a blind human preference test and also evaluated the model on industry-standard benchmarks. The results are as follows.
Human Preference Evaluation (Blind Test)
Standardized Evaluations
The model's performance exceeded expectations in internal evaluations, particularly in domains where general models struggled. Madeline & Co. deployed Madeline-s1 into production and integrated the model across their core product offerings.
This collaboration demonstrates what’s possible when AI is purpose-built utilizing a company’s proprietary data and insights along with Arcee AI’s post-training techniques.
To get in touch with Arcee AI to discuss potential collaboration, please reach out here.
To get early access to Madeline-s1 and the Madeline & Co. platform, please sign up here. Listen to Madeline introduce herself here!