Cohere North Mini Code Runs Agentic Coding on One H100
Cohere North Mini Code is a 30B total parameter MoE model with 3B active parameters, built for agentic software engineering and released under Apache 2.0. It runs on a single H100 GPU at FP8 precision with a 256K context window and 64K max output. On Artificial Analysis' Coding Index it scores 33.4, outperforming Qwen3.5 35B-A3B and Gemma 4 26B-A4B in its weight class.
Primary Intelligence Summary: This analysis explores the architectural evolution of cohere north mini code runs agentic coding on one h100, focusing on the implementation of agentic AI frameworks and autonomous orchestration. By understanding these 2026 intelligence patterns, agencies and startups can build more resilient, self-correcting systems that scale beyond traditional automation limits.
Written By
SaaSNext CEO
Cohere North Mini Code Runs Agentic Coding on One H100
Cohere North Mini Code is a 30B total parameter MoE model with 3B active parameters, built for agentic software engineering and released under Apache 2.0. It runs on a single H100 GPU at FP8 precision with a 256K context window and 64K max output. On Artificial Analysis' Coding Index it scores 33.4, outperforming Qwen3.5 35B-A3B and Gemma 4 26B-A4B in its weight class. (Source: Artificial Analysis Coding Index, June 2026)
The Real Problem
Most open-source coding models fall into two categories: small enough to run locally but not capable enough for real agentic work, or large enough to be useful but requiring expensive multi-GPU setups that rule out local deployment. A team evaluating open-source coding agents has to choose between a 7B model that cannot follow multi-step instructions reliably and a 70B model that needs $30,000+ in GPU hardware. Neither fits the sovereign AI deployment model that Cohere has been building toward.
[ STAT ] 63% of enterprises consider model deployment cost the primary barrier to adopting agentic coding tools in production. — Cohere internal survey, 2026
North Mini Code splits this difference at exactly the point where open-source coding hits its practical ceiling. Thirty billion total parameters with only 3 billion active per token means the model runs on a single H100 at FP8, or on a Mac Studio via MLX at roughly 20GB RAM. A solo developer or a small team can run it locally without cloud GPU rental.
What This Actually Does
North Mini Code is not a general-purpose chat model adapted for coding. Cohere trained it specifically for agentic software engineering workflows. The training pipeline used a two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). In the first stage, 70% of trainable tokens were code data, with 43% dedicated to agentic tool-use and 27% to single-turn competitive and scientific programming. The second stage used 4.5 billion tokens from agentic and reasoning-driven samples only.
[TOOL: North Mini Code] Handles agentic coding tasks inside harnesses like SWE-Agent and OpenCode. It drives terminal-based agents across multi-turn shell interactions, maps system architecture, performs code reviews, and orchestrates sub-agents.
The model integrates interleaved thinking with tool-use capabilities. When given a task, it can pause generation, produce a reasoning trace, make tool calls (read file, search codebase, run tests), and continue based on results. This interleaved reasoning is what separates agentic models from completion models.
Who This Is Built For
For engineering teams building open-source agentic coding pipelines on their own infrastructure: North Mini Code runs on a single H100 at FP8, eliminating the need to rent cloud GPU clusters or pay per-token API fees. The model is free under Apache 2.0, so your only cost is hardware.
For sovereign AI deployments in regulated industries: healthcare, defense, and finance organizations that cannot send source code to third-party APIs now have a capable agentic coding model they can deploy on-premises. Cohere's entire business model centers on this use case.
For solo developers and indie hackers who want local agentic coding without a GPU budget: the model runs on a Mac Studio with MLX at around 20GB RAM. Nick Frosst, Cohere co-founder, demonstrated exactly this during the launch.
How It Runs: Step by Step
-
Model download: Weights are available on Hugging Face in BF16 and FP8 formats. Download north-mini-code-1.0 at huggingface.co/CohereLabs/North-Mini-Code-1.0. The FP8 weights are 6GB, the BF16 weights are 15GB.
-
Load in your agent harness: North Mini Code is tested with SWE-Agent and OpenCode. Load the model using vLLM or Hugging Face Transformers. Set temperature to 1.0 and top_p to 0.95 for agentic tasks. The model expects a chat template with system, user, and tool-call roles.
-
Define tools: The model's tool-use is built for a ReAct-style loop. Define a set of terminal tools (read, write, search, run) and expose them through the harness. The model outputs structured tool call JSON, which the harness executes and returns results.
-
Execute multi-turn tasks: The model operates in a loop — receive a task, reason, call tools, receive results, continue. Its 64K max generation window allows long reasoning chains without truncation.
Setup and Tools
One H100 GPU at FP8 precision is the minimum hardware requirement. The model also runs on Mac Studio via MLX at roughly 20GB RAM, per Cohere's launch demo. API access is available via Cohere API, OpenRouter, and Model Vault.
One gotcha: North Mini Code generates approximately 3x more output tokens than comparable models, according to Artificial Analysis testing (Source: VentureBeat, June 2026). This verbosity compounds inference cost and latency in high-volume pipelines. Benchmark scores do not surface this — only throughput testing against your actual workload will reveal whether the added token count matters for your use case.
The Numbers
[ STAT ] North Mini Code scores 33.4 on the Artificial Analysis Coding Index, outperforming Qwen3.5 35B-A3B, Gemma 4 26B-A4B, and Mistral Small 4. (Source: Artificial Analysis, June 2026)
[ STAT ] It runs on a single H100 at FP8, or a Mac Studio via MLX at 20GB RAM — no multi-GPU setup required. (Source: Cohere launch blog, June 2026)
[ STAT ] The model achieves competitive SWE-Bench Verified scores using the SWE-Agent harness with default settings. Cohere used temperature 1.0 and top_p 0.95 across all benchmarks. (Source: Cohere model card, June 2026)
What It Cannot Do
-
North Mini Code is not a general-purpose assistant. It was trained for agentic coding tasks. Using it for creative writing, summarization, or general Q&A will produce poor results compared to models trained for those domains.
-
The model's 64K max output window, while large, means that extremely long agentic sessions (100+ tool calls) may hit the generation limit. The interleaved reasoning approach produces verbose traces that consume output tokens faster than expected.
-
Performance varies by harness. The model was trained against multiple agent scaffolds, but a harness with a different tool schema or prompt template than those used in training will produce degraded results. Test with your harness before committing.
Start in 10 Minutes
-
(2 min) Download the FP8 weights from huggingface.co/CohereLabs/North-Mini-Code-1.0. The file is 6GB. Start this download first.
-
(5 min) Install vLLM or use the Hugging Face Transformers library. Run pip install vllm if you have a CUDA-capable GPU. For Mac users, install mlx and mlx-lm.
-
(3 min) Set up your agent harness. Clone SWE-Agent from GitHub and configure it to use North Mini Code as the model backend. Set temperature=1.0, top_p=0.95.
-
(Test) Run a single-task test: ask the agent to find all unhandled async errors in a small directory. This will verify the model-harness integration before scaling up.
Frequently Asked Questions
Q: How much does it cost to run North Mini Code on a single H100? A: The model weights are free under Apache 2.0. Your cost is the H100 hardware — approximately $2-3 per hour on cloud GPU rental (Lambda Labs, Vast.ai, RunPod), or $25,000-30,000 one-time if purchasing the GPU. The FP8 weights use 6GB of VRAM, leaving room for context.
Q: Can North Mini Code replace Claude Code or GPT-4o for coding tasks? A: Not directly. North Mini Code is designed as an agentic coding model for sovereign deployment, not as a replacement for frontier models. It competes with other open-source models in its size class (Qwen3.5, Gemma 4) rather than with GPT-4o or Claude Opus 4.8. Its strength is local deployment and data control, not absolute benchmark performance.
Q: Does North Mini Code support function calling or tool use? A: Yes. The model is trained for interleaved reasoning and tool use inside agent harnesses like SWE-Agent and OpenCode. It outputs structured JSON for tool calls and expects tool results to continue its reasoning loop. This is not an API-based function calling system — it works within a ReAct agent loop.
Q: What benchmarks does North Mini Code perform best on? A: The model achieves its strongest scores on Terminal-Bench v2 (terminal-based agent tasks) and SWE-Bench Verified (repository-level code changes). Cohere also benchmarked it on SciCode (scientific coding) and LiveCodeBench v6 (algorithmic reasoning). It scores 33.4 on the Artificial Analysis Coding Index.
Q: Is North Mini Code suitable for production deployment? A: Yes, with caveats. The Apache 2.0 license allows commercial use. The model has been tested across multiple agent harnesses and benchmarks. However, the higher token generation rate compared to similar models (3x more output tokens per task) means production costs should be modeled against your actual workload, not benchmark scores alone. (Source: VentureBeat, June 2026)