Building an AI Retail Assistant at the Edge with SLMs on Intel CPUs

Julien Simon

•

June 7, 2025

Thanks to a chatbot interface powered by open-source small language models and real-time data analytics, store associates can interact naturally through voice or text.

We ran a live stream on June 10 from the floor of Cisco Live. You can watch it on YouTube.

The Power of AI at the Retail Edge

The retail landscape is rapidly evolving, with AI-driven solutions reshaping how stores operate and engage with customers. Today's consumers expect personalized, efficient service, while staff need immediate access to accurate information on customer traffic, inventory, sales, and other key metrics. Meeting these demands requires powerful edge computing solutions that can process and deliver insights in real-time.

Powered by Intel Xeon 6 CPUs running in a Cisco UCS server, the Edge IQ Retail Assistant exemplifies this potential. This technical demonstrator will be featured in the Intel Showcase (#3035) at Cisco Live 2025, taking place in San Diego, CA, from June 8 to 12, 2025. Attendees will get a firsthand experience of how generative AI can transform retail operations without relying on GPUs.

Thanks to a chatbot interface powered by open-source small language models and real-time data analytics, store associates can interact naturally through voice or text, receiving immediate information about product availability from Chooch's inventory system or crowd density from WaitTime's analytics platform. The assistant seamlessly translates these inquiries into actionable insights, helping staff make informed decisions that enhance customer experience while optimizing store operations, all powered by CPU processing.

‍The Edge Advantage: Why Small Language Models on CPUs Make Sense for Retail

The Edge IQ Retail Assistant represents a fundamental shift in AI deployment strategy for retail environments. While cloud-based AI services dominated early generative AI implementations, this solution demonstrates why running optimized language models directly on CPU servers at the edge delivers superior results for retail operations.

Low Latency: In the fast-paced retail environment, latency is a critical factor. Edge deployment eliminates network round-trips that typically add hundreds of milliseconds to each interaction, thereby enhancing overall performance. When a customer is waiting for information about product availability, this speed difference creates a noticeably more responsive experience. The assistant delivers consistent responses regardless of internet conditions—something cloud alternatives simply cannot match.
Resilience: Continuous operation remains essential for retail technology. Edge-deployed models continue functioning during internet outages, ensuring the assistant remains available during critical business hours. Store operations cannot pause when connectivity issues arise, making local processing a necessity rather than a luxury. A cost-effective CPU-based deployment provides this reliability without requiring specialized hardware.‍

Data Privacy: Data privacy concerns have grown increasingly important for retailers. By processing queries locally on CPU servers, sensitive business information, including inventory levels, sales data, and staffing details, never leaves the store's infrastructure. This dramatically reduces potential data exposure while maintaining full functionality—a compelling advantage over cloud alternatives that must transmit this information externally.

Bandwidth Efficiency: The bandwidth efficiency of local processing proves particularly valuable when working with high-resolution video streams from inventory and crowd monitoring systems. These data-intensive feeds remain within the local network, eliminating costly and bandwidth-intensive cloud transfers that would otherwise constrain system performance.‍

Cost Predictability: Perhaps most compelling for retail operations, edge deployment on standard CPU servers provides consistent and predictable operational costs. Unlike cloud services with usage-based pricing that can spike during busy periods, the fixed infrastructure cost provides budget certainty, simplifying financial planning across store networks.

Real-Time Data Integration: Crowd Analytics and Inventory Management

When a store associate inquires about current customer traffic, the assistant queries WaitTime's API, interpreting the crowd analytics data coming from on-premise cameras to provide meaningful insights about congestion points, checkout wait times, and optimal staffing distribution. This real-time information enables managers to direct employees where they're most needed, thereby enhancing both operational efficiency and customer satisfaction.

Similarly, inventory queries trigger connections to Chooch's Vision AI platform, which continuously monitors store shelves through computer vision. The assistant translates Chooch's detailed inventory data into actionable insights, informing staff about current stock levels, identifying low-stock items, and helping prevent potential stockouts before they impact customers.

By presenting this information conversationally, the assistant makes sophisticated inventory intelligence accessible to every employee, regardless of technical expertise, all processed locally on the CPU without requiring GPU acceleration.

Harnessing Open Source AI on CPU-Only Infrastructure

The Edge IQ Retail Assistant demonstrates the impressive capabilities of modern CPU-based AI inference. At its core, the application runs three sophisticated small language models entirely on Intel Xeon processors —no GPUs are present in the system architecture. This CPU-only approach highlights the significant progress in AI optimization and the remarkable capabilities of modern server processors for AI workloads.

Arcee AI SuperNova Lite: Serving as the central intelligence of the system, this 8-billion parameter open-source conversational model is a high-performance distilled version of the larger Llama-3.1-405B-Instruct model. We quantized the model to 4 bits using the Hugging Face Optimum Intel library, dramatically reducing its computational requirements while maintaining impressive language understanding capabilities.

Distil-Whisper Small: Handling speech recognition with 166 million parameters, we compiled this open-source model using PyTorch's torch.compile functionality to generate highly optimized CPU code that leverages Intel's specialized instruction sets.

Hexgrad Kokoro: An 82-million parameter text-to-speech open-source model also optimized for CPU execution, producing natural-sounding responses with inflection and appropriate pacing, all generated in real-time on the same Xeon processors handling the other AI workloads.

Last but not least, we built the user interface with Hugging Face Gradio and containerized the entire solution with Docker for easy deployment and management.

OpenVINO: Unlocking Intel CPU Acceleration for AI Inference

The secret behind the Edge IQ Retail Assistant's impressive CPU-only performance lies in Intel's OpenVINO toolkit. This powerful optimization framework transforms resource-intensive models into highly efficient versions specifically tailored for execution on Intel Xeon 6 processors, applying numerous optimizations in the process.

Through model quantization, OpenVINO reduces SuperNova Lite's precision from 32-bit to just 4-bit representations, dramatically decreasing memory requirements while maintaining accuracy. Layer fusion combines multiple operations into a single optimized kernel, minimizing memory transfers and accelerating execution. Hardware-aware optimization automatically maps neural network operations to the most efficient instruction sets available on Intel Xeon processors, while memory layout transformations restructure tensors for optimal CPU execution.

The toolkit specifically targets Intel's specialized CPU instruction sets found in Intel Xeon processors. Advanced Vector Extensions (AVX-512) provide Single Instruction, Multiple Data capabilities that significantly accelerate the matrix operations underpinning neural networks. The newer Advanced Matrix Extensions (AMX) instruction set delivers remarkable performance for the matrix multiply-accumulate operations that dominate language model inference.

Our quantized SuperNova-Lite model runs on the OpenVINO Model Server, which provides dynamic batching of incoming requests, automatic model management, OpenAI API compatibility, and efficient parallel execution across available CPU cores.

Conclusion

The convergence of modern CPU architectures and robust server platforms marks a new era in retail operations. By harnessing the power of AI at the edge, retailers can achieve:

Enhanced Customer Experience: Immediate access to information and personalized interactions.
Operational Efficiency: Real-time insights into inventory and crowd analytics.
Cost Savings: Reduced reliance on cloud services and GPUs, leading to lower operational costs.

As the retail landscape continues to evolve, embracing edge AI solutions like the Edge IQ Retail Assistant will be crucial in staying competitive and meeting customer expectations.

If you’d like to know more about Arcee AI and our solutions, please visit us at www.arcee.ai or book a demo. We also recommend following us on LinkedIn or X to stay in touch with the latest news on small language models.

Technical resources

Hugging Face: Arcee AI models, the Optimum Intel library
Intel: Xeon 6, the OpenVINO toolkit, and model server ‍
Cisco UCS

Related Blogs

Partnerships

•

June 9, 2026

Why we made Hugging Face the home for everything we build

Arcee partners with Hugging Face to make the Hub the exclusive home for all models, datasets, and agent traces.

Partnerships

•

May 27, 2025

Arcee AI SLMs Now Available on Together.ai and OpenRouter

Arcee AI is excited to announce the availability of its small language models (SLMs) on Together.ai and OpenRouter, two leading managed inference platforms. Start building today and leverage Arcee AI’s specialized models to enhance your AI applications.

Partnerships

•

April 17, 2025

The Case for Small Language Model Inference on Arm CPUs

Our Chief Evangelist, Julien Simon, explores the advantages and practical applications of running SLM inference on Arm CPUs.