NVIDIA Nemotron 3 Ultra Powers Long-Running Agent Workflows

NVIDIA released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with only 55 billion active parameters per token, optimized specifically for long-running agent orchestration workloads. The model uses hybrid Mamba-Transformer layers to handle extended context windows efficiently and NVFP4 quantization for 5x higher throughput compared to FP8 inference. Weights, data, and training recipes are open.

[ DIRECT ANSWER ] NVIDIA released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with only 55 billion active parameters per token, optimized specifically for long-running agent orchestration workloads. The model uses hybrid Mamba-Transformer layers to handle extended context windows efficiently and NVFP4 quantization for 5x higher throughput compared to FP8 inference. Weights, data, and training recipes are open, making this one of the largest fully open AI models available.

The Real Problem

Long-running agent workflows have a memory problem. When an AI agent needs to maintain context across 50-100 tool calls, several hours of execution, or multi-turn conversations with complex state, most models degrade. Attention-based architectures scale quadratically with sequence length, making long contexts expensive. Agent frameworks like LangGraph, CrewAI, and n8n's AI Agent Tool all face the same bottleneck: the model forgets what it decided 30 steps ago.

[ STAT ] Agent workflow failure rates increase by 12% for every 10 additional tool calls in a single session, primarily due to context degradation in transformer-based models. — NVIDIA AgentBench, 2026

Nemotron 3 Ultra attacks this at the architecture level. By replacing a portion of the attention layers with Mamba state-space layers, the model maintains coherent reasoning across sequences that would cause standard transformers to degrade.

What This Actually Does

Nemotron 3 Ultra is not a general-purpose chatbot model. It is designed for workloads where the model runs continuously for minutes to hours, making decisions, calling tools, reading results, and updating its internal state. The model architecture makes specific tradeoffs for this use case.

[TOOL: Mamba-Transformer Hybrid Layers] Each layer in the 550B model is either a Mamba state-space layer or a standard attention layer. The Mamba layers handle long-range dependencies with linear scaling instead of quadratic attention scaling. For a 128K token sequence, Mamba layers are approximately 8x more efficient than attention layers.

[TOOL: NVFP4 Quantization] NVIDIA's new 4-bit floating point format achieves 5x higher throughput than FP8 inference on H100 and B200 GPUs. The key innovation is that NVFP4 preserves dynamic range around zero better than integer quantization, which matters for agent reasoning where small probability differences determine tool selection decisions.

The agentic reasoning optimization is the routing within the MoE architecture. Nemotron 3 Ultra has 550 billion total parameters but only activates 55 billion per token. The router learns to dispatch agent-related reasoning tokens to expert modules specialized in tool use, planning, and state tracking, while routing simple text generation tokens to different experts.

Who This Is Built For

For teams building production agent systems on n8n, LangGraph, or custom frameworks: You are currently paying per-token costs for models that were not designed for your use case. A 128K token agent session on GPT-4o costs approximately $3-6 in API fees. Nemotron 3 Ultra self-hosted at 5x throughput with NVFP4 quantization changes the economics of long-running agents.

For infrastructure teams deploying on-premise AI workflows: Data residency requirements prevent you from using cloud APIs for agent systems. Nemotron 3 Ultra's open weights let you deploy on your own H100 or B200 clusters with full data control. The model's efficiency on long contexts means you need fewer GPUs per concurrent agent session.

For researchers working on agent architectures: The open training data and recipes are as valuable as the model weights. NVIDIA published the full data mix, curation pipeline, and training configuration, allowing teams to study what training data produces strong agent reasoning and to fine-tune for domain-specific agent tasks.

How It Runs: Step by Step

Model Download and Deployment. Download the Nemotron 3 Ultra weights from Hugging Face (opensource.nvidia.com). The model requires approximately 1.1 TB of storage for the full 550B parameter set. Deployment on a single H100 node with 8 GPUs fits the model with NVFP4 quantization.
Agent Framework Integration. Connect Nemotron 3 Ultra to your agent framework via the OpenAI-compatible API endpoint served by NVIDIA TensorRT-LLM or vLLM with NVFP4 kernel support. The API accepts the same chat completion format.
Long-Running Session Start. The agent system starts a session with an initial system prompt and user input. Nemotron 3 Ultra processes the input through its hybrid layers. Mamba layers handle the early context encoding. Attention layers handle the precision-critical reasoning steps.
Tool Call Loop. The model generates a tool call in JSON format. The agent framework executes the tool and returns the result. Nemotron 3 Ultra reads the tool result and updates its internal state. This loop continues for the duration of the session. This is the core reasoning step repeated across potentially hundreds of iterations.
Session Continuity. After each tool call, the Mamba layers maintain the state-space representation of the conversation history without reprocessing the full token sequence. This is where the efficiency gain over pure attention models is most significant.
Session Termination and Logging. The session ends when the agent task completes or a maximum step limit is reached. The full conversation log is saved for audit and fine-tuning data collection.

Setup and Tools

Nemotron 3 Ultra → 550B MoE model with 55B active parameters (download from Hugging Face) NVIDIA TensorRT-LLM → Inference server with NVFP4 kernel support (required for 5x throughput) n8n / LangGraph → Agent framework with OpenAI-compatible API integration H100 or B200 GPU cluster → Minimum 8x H100 80GB for single-node deployment NVFP4 Quantization → 4-bit floating point format for 5x throughput improvement

The gotcha: NVFP4 quantization requires a specific CUDA kernel that is only available in NVIDIA TensorRT-LLM version 0.12+. The model runs at FP8 precision without it, which means 5x lower throughput. Verify your inference stack supports NVFP4 before deploying. vLLM added support in version 0.8.0, but the throughput gain is 3.5x instead of 5x.

The Numbers

▸ Parameters total vs. active 550B total, 55B active per token (10:1 MoE sparsity ratio) ▸ Throughput vs. FP8 inference baseline FP8 → 5x higher with NVFP4 quantization (Source: NVIDIA Developer Blog, 2026) ▸ Context window efficiency Mamba layers scale linearly with sequence length vs. quadratic for attention-only models ▸ Openness Weights, training data, and recipes published under open license (Source: opensource.nvidia.com) ▸ Deployment cost at 5x throughput approximately $2-4 per 100K agent tokens on self-hosted H100 cluster, vs. $12-20 on cloud API

These numbers matter most for teams running 500+ agent sessions per day. At that scale, the throughput difference between Nemotron 3 Ultra and cloud API models translates to 70-80% infrastructure cost reduction.

What It Cannot Do

Multimodal inputs: Nemotron 3 Ultra is text-only. It does not process images, audio, or video. For agent workflows that need vision capabilities, you must route image inputs to a separate vision model.
Real-time latency: The model is optimized for throughput, not latency. First-token latency on an 8xH100 node is 800-1200ms. For real-time chat applications requiring sub-200ms responses, a smaller model like Nemotron 3 Mini is a better fit.
Agent framework compatibility: The OpenAI-compatible API covers chat completions and tool calls, but the model may not support every feature in your agent framework. Test tool call formatting and streaming behavior before production deployment.
No built-in tool execution: Nemotron 3 Ultra generates tool call JSON. It does not execute tools. You must provide the tool execution environment through your agent framework.

Start in 10 Minutes

(5 min) Read the technical blog post at developer.nvidia.com/blog/nvidia-nemotron-3-ultra for architecture details and benchmark results.
(10 min) Visit opensource.nvidia.com and download the model card. Review the hardware requirements and supported inference frameworks.
(30 min) Check your inference stack version. If using TensorRT-LLM, confirm version 0.12+. If using vLLM, confirm version 0.8.0+.
(2 hours) Deploy the model on a single GPU node using the NVFP4 quantization config from the NVIDIA examples repository. Run the included test agent session to verify tool call formatting.

Frequently Asked Questions

Q: What makes Nemotron 3 Ultra different from other open models like Llama 3 or DeepSeek? A: Nemotron 3 Ultra is the first open model designed specifically for long-running agent workloads. The hybrid Mamba-Transformer architecture maintains coherence across extended sessions better than pure attention models. The NVFP4 quantization is also unique to NVIDIA hardware and provides 5x throughput advantage over standard quantization.

Q: Can I run Nemotron 3 Ultra on consumer GPUs like the RTX 5090? A: No. The full 550B model requires enterprise GPUs with high VRAM. Minimum deployment is 8x H100 80GB. NVIDIA released smaller variants (Nemotron 3 Mini at 8B and Nemotron 3 Mid at 70B) for consumer hardware, but they lack the hybrid Mamba-Transformer architecture.

Q: How much does it cost to self-host Nemotron 3 Ultra? A: Self-hosting on an 8xH100 cluster costs approximately $15-25 per hour in cloud GPU rental. At 5x NVFP4 throughput, processing 100K agent tokens costs approximately $2-4. This compares favorably to cloud API pricing of $12-20 per 100K tokens for equivalent-quality models.

Q: What agent frameworks support Nemotron 3 Ultra? A: Any framework with an OpenAI-compatible API integration can use Nemotron 3 Ultra. This includes LangChain, LangGraph, CrewAI, AutoGen, n8n, and custom frameworks. The model supports standard chat completions, tool calls, and streaming.

Q: Is the training data available for fine-tuning? A: Yes. NVIDIA published the full training data mix and curation pipeline under an open license. This allows teams to study what data produces strong agent reasoning and to create domain-specific fine-tuning datasets without starting from scratch.