Case Studies
How Arcee AI helped Madeline & Co. build a world-class reasoning model from first principles.
Madeline & Co. is an end-to-end AI-powered strategy, design, and innovation platform that helps anyone, from in-house teams and founders to marketers and creatives, navigate complex decisions with clarity and confidence. At the core is Madeline-s1, a powerful language model trained in design, strategy, systems thinking, UX, and storytelling, delivering real-time insights and intelligent recommendations as you build.
When initially building out their product suite, Madeline & Co. tried off-the-shelf large language models (LLMs) but repeatedly ran into high inference costs, poor performance at scale, and inconsistent accuracy in their specific domains. The accuracy problems primarily stemmed from a lack of context-specific reasoning, cross-disciplinary synthesis, and brand-safe output. Madeline & Co. founder Prince Rumi described one specific example:
“When exploring a brand strategy for a sustainability startup, general models like Claude Sonnet 3.7 and GPT-4 would describe common channels or run-of-the-mill SWOTs. However, we required a model that would draw from a cross-section of startup decks, ethnographic research, brand campaigns, and founder memos to suggest an unexpected but contextually valid launch path—say, a limited-release collaboration with a fashion designer in the climate space. That leap requires nuance, not just knowledge.”
This dissatisfaction with off-the-shelf LLMs led Madeline & Co. to partner with Arcee AI to build a custom model that could reason with depth, align with internal frameworks, and operate with the flexibility needed for creative and strategic exploration. The partnership gave Madeline & Co. access to Arcee's research lab, which guided them through each stage of the model development cycle.
The initial phase involved building a substantial 60-million-token dataset curated from Madeline & Co.'s deep domain expertise. Madeline's internal R&D team sourced and cleaned the material, yielding 60M high-quality, proprietary tokens to train on.
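The actual curation pipeline is proprietary, but a pass like this typically involves at least deduplication and token accounting. Below is a minimal sketch of those two steps; the file paths and the tokenizer choice are assumptions for illustration, not Madeline's or Arcee's actual tooling.

```python
import hashlib
import json
from pathlib import Path

from transformers import AutoTokenizer  # tokenizer choice is an assumption


def clean_and_count(input_dir: str, output_path: str, tokenizer_name: str = "gpt2") -> int:
    """Deduplicate raw text files and report the total token count of the kept documents."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    seen_hashes = set()
    total_tokens = 0

    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                continue
            # Exact-duplicate filtering via content hashing; a production pipeline
            # would typically add near-duplicate and quality filters as well.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            total_tokens += len(tokenizer.encode(text))
            out.write(json.dumps({"text": text}) + "\n")

    return total_tokens


if __name__ == "__main__":
    # Paths are placeholders for wherever the proprietary documents live.
    n = clean_and_count("raw_docs", "cpt_corpus.jsonl")
    print(f"kept ~{n / 1e6:.1f}M tokens")
```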
With this dataset, we applied Continuous Pre-Training (CPT), targeting specific domain knowledge gaps and behavioral patterns identified in the base model. Madeline built a custom evaluation harness, which was used to evaluate each CPT training run.
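For readers unfamiliar with CPT: it is ordinary causal-language-model training continued on a domain corpus from an existing checkpoint. The sketch below shows what such a run can look like with the Hugging Face Trainer; the base model identifier, hyperparameters, and file names are assumptions for illustration and do not reflect Arcee's actual configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "Qwen/Qwen2.5-7B"  # placeholder; the actual base model is not disclosed

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# The curated domain corpus: one JSON object with a "text" field per line.
raw = load_dataset("json", data_files="cpt_corpus.jsonl", split="train")


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)


tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cpt-checkpoints",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # low learning rate to extend, not overwrite, base knowledge
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
        save_steps=500,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```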
Once the CPT model had been evaluated successfully, we used Arcee's MergeKit library to combine the newly trained model with complementary models in the ecosystem. We determined the merging ratios through systematic experimentation, testing various interpolation weights to achieve optimal performance across our evaluation benchmarks.
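MergeKit itself is driven by declarative configs and supports more sophisticated methods than plain averaging (such as SLERP and TIES), and the recipes used here are not public. Purely for intuition about what "testing interpolation weights" means, here is a hand-rolled linear weight-space merge of two checkpoints; the alpha value and model identifiers are assumptions, and both checkpoints must share the same architecture.

```python
import torch
from transformers import AutoModelForCausalLM


def linear_merge(model_a_id: str, model_b_id: str, alpha: float = 0.6):
    """Return model A with every parameter replaced by alpha*A + (1 - alpha)*B."""
    model_a = AutoModelForCausalLM.from_pretrained(model_a_id, torch_dtype=torch.float32)
    model_b = AutoModelForCausalLM.from_pretrained(model_b_id, torch_dtype=torch.float32)

    state_a = model_a.state_dict()
    state_b = model_b.state_dict()
    merged = {}
    for name, tensor_a in state_a.items():
        # Simple weight-space interpolation; sweeping alpha against the
        # evaluation harness is how a merging ratio would be chosen.
        merged[name] = alpha * tensor_a + (1.0 - alpha) * state_b[name]

    model_a.load_state_dict(merged)
    return model_a


# Example: blend the CPT checkpoint with a complementary instruction-tuned model
# (both identifiers are placeholders).
merged_model = linear_merge("cpt-checkpoints/checkpoint-final", "some-org/complementary-model")
merged_model.save_pretrained("merged-model")
```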
Following the merge, we conducted a comprehensive behavioral analysis of the resulting model. We deployed the model in controlled environments and collected extensive feedback on its responses across diverse query types. We analyzed patterns in reasoning quality, factual accuracy, instruction adherence, and handling of edge cases.
Based on the behavioral analysis, Madeline and Arcee curated a high-quality dataset of question-answer pairs specifically designed to address the identified weaknesses. Starting from 100 golden question-answer pairs and 6.6k "reasoning boosters" provided by Madeline & Co., the joint team used a proprietary synthetic data generation technique to expand the set to 350k pairs for the SFT run. We meticulously crafted this Supervised Fine-Tuning (SFT) dataset to include examples that demonstrated desired reasoning patterns, correct factual information, and appropriate response styles.
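The synthetic generation technique is proprietary, but the output of such a pipeline is usually a chat-formatted SFT dataset built up from seed pairs like the 100 golden examples. A minimal sketch of that formatting step follows; the field names, file paths, and system prompt are assumptions.

```python
import json


def to_sft_records(qa_pairs, system_prompt):
    """Convert question/answer pairs into chat-style SFT examples."""
    for pair in qa_pairs:
        yield {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": pair["question"]},
                # In a reasoning-focused SFT set, the target would typically
                # include the worked reasoning as well as the final answer.
                {"role": "assistant", "content": pair["answer"]},
            ]
        }


if __name__ == "__main__":
    # "golden_pairs.jsonl" is a placeholder for the curated seed pairs;
    # synthetically expanded examples would use the same format.
    with open("golden_pairs.jsonl") as f:
        pairs = [json.loads(line) for line in f]

    with open("sft_dataset.jsonl", "w") as out:
        for record in to_sft_records(pairs, "You are a strategy and design assistant."):
            out.write(json.dumps(record) + "\n")
```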
This iterative approach (CPT for knowledge acquisition, merging for capability integration, analysis for weakness identification, and targeted SFT for behavioral refinement) represents a sophisticated model development pipeline that maximizes the strengths of each training paradigm while addressing their limitations.
After two months of training, the joint team presented Madeline-s1, a 32-billion-parameter reasoning model explicitly trained to reflect how Madeline & Co.'s expert strategists and designers think: mapping tradeoffs, surfacing multiple valid options, and grounding decisions in strategic insights and customer research. Rather than producing templated or generic answers, we engineered Madeline-s1 to generate meaningful interpretations and actionable insights across disciplines.
To evaluate Madeline-s1, we conducted blind human preference tests and LLM-as-a-Judge comparisons using a 350k Q/A evaluation harness provided by Madeline, and we also evaluated the model on industry-standard benchmarks. The results are summarized below.
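The harness itself belongs to Madeline, so the sketch below only illustrates the general LLM-as-a-Judge pattern it relies on: pairwise comparison with position randomization and a win-rate tally. The judge prompt is illustrative and the judge call is left as a stub, since the actual judge model and prompt are not disclosed.

```python
import random

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly "A" or "B"."""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to whichever judge model is used; must return "A" or "B"."""
    raise NotImplementedError("wire this up to your judge model's API")


def pairwise_win_rate(eval_set, generate_candidate, generate_baseline) -> float:
    """Fraction of prompts where the candidate model's answer is preferred."""
    wins = 0
    for item in eval_set:
        answer_cand = generate_candidate(item["question"])
        answer_base = generate_baseline(item["question"])
        # Randomize which answer appears as "A" to control for position bias.
        if random.random() < 0.5:
            a, b, cand_is_a = answer_cand, answer_base, True
        else:
            a, b, cand_is_a = answer_base, answer_cand, False
        verdict = call_judge(
            JUDGE_PROMPT.format(question=item["question"], answer_a=a, answer_b=b)
        ).strip().upper()
        if (verdict == "A") == cand_is_a:
            wins += 1
    return wins / len(eval_set)
```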
The model's performance exceeded expectations in internal evaluations, particularly in domains where general models struggled. Madeline & Co. deployed Madeline-s1 into production and integrated the model across their core product offerings.
This collaboration demonstrates what’s possible when AI is purpose-built utilizing a company’s proprietary data and methodologies along with Arcee AI’s post-training techniques and research expertise.
To get in touch with Arcee AI to discuss potential collaboration, please reach out here.
To get early access to Madeline-s1 and the Madeline & Co. platform, please sign up here. Listen to Madeline introduce herself here!