Blackwell Shenanigans 002: Nemotron Omni and the Shadow Pair Bet
NVIDIA dropped a 31B-parameter A3B multimodal model with video, audio, image, OCR, GUI, tool-calling, and long-context support. That sounds suspiciously close to a pair-programming onboarding primitive.
NVIDIA released nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on Hugging Face on 2026-04-28, with FP8 and NVFP4 variants landing beside it. The model card says the thing we care about out loud: this is not just another chat model with a camera taped on. It is aimed at video, audio, image, text, OCR, GUI understanding, transcription, tool calling, and long-context enterprise workflows.
That maps cleanly onto Kirsten's onboarding instinct: the fastest way to transfer production knowledge is often not a doc dump. It is shadowing. Watch the real workflow. Ask questions in the moment. See the weird naming conventions, the half-written runbooks, the scar tissue in terminal history, the "do not touch that table after 4 PM" lore.
So the bet is simple: treat screen-share pairing as the training environment for an agent.
Why This One Is Worth A Day-Zero Spike
The useful facts from the model card:
- Inputs: video, audio, image, and text.
- Context: up to 256k tokens.
- Outputs: text, JSON-capable responses, reasoning output, tool calls, and word-level transcription timestamps.
- Architecture: a Mamba2/Transformer hybrid MoE around a 31B A3B-class model, with vision and speech encoders attached.
- Hardware targets include A100, H100/H200, B200, L40S, RTX PRO 6000 Blackwell, RTX 5090, DGX Spark, Jetson Thor, and GB200-class systems.
- Published footprints are 61.5 GB for BF16, 32.8 GB for FP8, and 20.9 GB for NVFP4.
The NVFP4 artifact is the practical default to chase for an interactive front-facing experiment. On day zero, though, NVIDIA's live model card lists SGLang support for BF16 first, with FP8/NVFP4 support still marked as coming. That makes BF16 the stable SGLang lane and vLLM the early quantized comparison lane.
The Product Primitive
The prototype in this repo is Shadow Pair: a static browser surface that shares the user's screen, samples a frame, and sends that frame plus a spoken or typed question to a local OpenAI-compatible Nemotron endpoint.
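For orientation, here is a minimal sketch of that request shape, assuming the local endpoint and served model name used throughout this post and the standard OpenAI-style `image_url` data-URL content format. The function name and frame path are illustrative, not the prototype's actual code.

```python
import base64
import requests

BASE_URL = "http://127.0.0.1:30000/v1"   # local Nemotron endpoint from the serving section
MODEL = "nemotron-3-nano-omni"           # served model name used throughout this post

def ask_about_frame(frame_png_path: str, question: str) -> str:
    """Send one sampled screen frame plus a question to the local endpoint."""
    with open(frame_png_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
                    {"type": "text", "text": question},
                ],
            }],
            "max_tokens": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask_about_frame("ide-frame.png", "What should a new engineer notice on this screen?"))
```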
It supports two brain modes:
- Nemotron: Nemotron Omni is the whole front-facing assistant.
- Agent API: Nemotron Omni becomes the perception layer, then sends a compact screen observation to another OpenAI-compatible brain such as `g-agent` (sketched below).
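A rough sketch of that Agent API relay, assuming both endpoints speak OpenAI-compatible chat completions; the `g-agent` URL, port, and prompts here are placeholders, not the prototype's actual wiring.

```python
import requests

OMNI_URL = "http://127.0.0.1:30000/v1/chat/completions"   # Nemotron Omni as perception layer
AGENT_URL = "http://127.0.0.1:9000/v1/chat/completions"   # hypothetical g-agent endpoint

def observe(frame_data_url: str) -> str:
    """Stage 1: Nemotron Omni compresses the screen frame into a compact observation."""
    r = requests.post(OMNI_URL, json={
        "model": "nemotron-3-nano-omni",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": frame_data_url}},
                {"type": "text",
                 "text": "Describe this screen in under 120 words: app, visible files or panels, current task."},
            ],
        }],
        "max_tokens": 256,
    }, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def answer(observation: str, question: str) -> str:
    """Stage 2: the agent brain reasons over the compact observation instead of raw pixels."""
    r = requests.post(AGENT_URL, json={
        "model": "g-agent",
        "messages": [
            {"role": "system",
             "content": "You are shadowing an engineer. Ask for missing context instead of guessing."},
            {"role": "user",
             "content": f"Screen observation:\n{observation}\n\nQuestion: {question}"},
        ],
        "max_tokens": 512,
    }, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```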
That gives us a ladder:
- Observe: the model watches a sampled IDE, browser, terminal, or Notion frame.
- Explain: the model says what a new teammate should notice.
- Question: the model asks for missing context instead of hallucinating process.
- Remember: later, useful explanations become project memory.
- Act: only after the shadowing loop is boringly reliable, the agent gets tool access while the human watches.
The first version deliberately uses still frames. For IDE and Notion workflows, that is already enough to test whether the model understands the working surface. Short video clips come next, but day zero should avoid coupling "does the model see my app?" to "did we also nail browser video capture, local blob serving, and vLLM media plumbing?"
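For quick experiments outside the browser, a still frame can also come from a plain local screenshot. A small stand-in sketch, assuming the third-party `mss` package; the real prototype samples frames from the browser screen share instead.

```python
# Grab one still frame of the primary display as a PNG, as a local stand-in for
# the browser's screen-share frame sampling. Requires: pip install mss
import mss
import mss.tools

with mss.mss() as sct:
    monitor = sct.monitors[1]          # primary display
    shot = sct.grab(monitor)
    mss.tools.to_png(shot.rgb, shot.size, output="ide-frame.png")
    print("saved ide-frame.png", shot.size)
```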
Serve It
The pinned SGLang cookbook image is the cleanest path for the full open stack. Start with the BF16 variant on Blackwell:
PRECISION=BF16 \
SERVED_MODEL_NAME=nemotron-3-nano-omni \
./scripts/serve-nemotron3-omni-sglang.sh
That script follows the current working SGLang shape: --tool-call-parser qwen3_coder, --reasoning-parser nemotron_3, --attention-backend flashinfer, --page-size 1, OpenAI-compatible chat completions, and BF16 by default.
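Before pointing anything at it, a one-off check that the server is up and exposes the expected served model name; a sketch against the standard OpenAI-compatible model listing endpoint.

```python
import requests

BASE_URL = "http://127.0.0.1:30000/v1"

# List the models the server exposes; expect "nemotron-3-nano-omni" here.
models = requests.get(f"{BASE_URL}/models", timeout=10).json()
print([m["id"] for m in models.get("data", [])])
```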
For the vLLM comparison lane:
PRECISION=NVFP4 \
SERVED_MODEL_NAME=nemotron-3-nano-omni \
MAX_MODEL_LEN=131072 \
MAX_NUM_SEQS=8 \
GPU_MEMORY_UTILIZATION=0.85 \
./scripts/serve-nemotron3-omni-vllm.sh
The live Hugging Face quantized repos currently include Reasoning-FP8 and Reasoning-NVFP4, but the model card's SGLang section still says FP8/NVFP4 support is coming. Treat those as vLLM-first until the SGLang stack catches up.
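A crude way to compare the two lanes once both are up: fire the same text-only request at each base URL and eyeball wall-clock latency. The vLLM port below is an assumption; adjust it to whatever the script actually binds.

```python
import time
import requests

LANES = {
    "sglang-bf16": "http://127.0.0.1:30000/v1",   # SGLang lane from this post
    "vllm-nvfp4": "http://127.0.0.1:8000/v1",     # assumed vLLM port, check the script
}
payload = {
    "model": "nemotron-3-nano-omni",
    "messages": [{"role": "user",
                  "content": "List three things a staging deploy checklist should cover."}],
    "max_tokens": 128,
}

for name, base in LANES.items():
    t0 = time.time()
    r = requests.post(f"{base}/chat/completions", json=payload, timeout=300)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]
    print(f"{name}: {time.time() - t0:.1f}s, {len(text)} chars")
```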
Smoke Test
Text only:
python3 scripts/smoke_test_nemotron_omni.py \
--base-url http://127.0.0.1:30000/v1 \
--model nemotron-3-nano-omni
Screen or IDE frame:
python3 scripts/smoke_test_nemotron_omni.py \
--base-url http://127.0.0.1:30000/v1 \
--model nemotron-3-nano-omni \
--image /path/to/ide-or-notion-screenshot.png
Video reasoning:
python3 scripts/smoke_test_nemotron_omni.py \
--base-url http://127.0.0.1:30000/v1 \
--model nemotron-3-nano-omni \
--video /path/to/pairing-clip.mp4 \
--thinking
Open The Workbench
Run the site and visit /shadow-pair. The page talks directly to http://127.0.0.1:30000/v1/chat/completions by default.
pnpm dev
The useful loop is:
- Start the local model endpoint.
- Open Shadow Pair.
- Share the IDE, terminal, browser, or Notion window.
- Ask what a new engineer should understand.
- Save the good answers into durable project memory.
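That last step can start embarrassingly simple. A sketch of an append-only JSONL project memory; the file path and record shape are placeholders, not anything the prototype currently writes.

```python
import json
import time
from pathlib import Path

MEMORY_PATH = Path("project-memory.jsonl")   # placeholder location

def remember(question: str, answer: str, surface: str) -> None:
    """Append one vetted shadowing exchange as a durable memory record."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "surface": surface,                  # e.g. "ide", "terminal", "notion"
        "question": question,
        "answer": answer,
    }
    with MEMORY_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative example, echoing the "do not touch that table after 4 PM" lore above.
remember(
    "Why does the team avoid touching that table after 4 PM?",
    "A reporting job locks it between 4 and 5 PM; writes during that window deadlock.",
    "terminal",
)
```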
What I Think
This is absolutely resume-grade work if we make it concrete. The interesting sentence is not "I deployed a new multimodal model on release day." The interesting sentence is:
Built a day-zero multimodal agent onboarding loop where an open NVIDIA model watched real production screen-share context, answered voice questions, and converted tacit engineering workflow into reusable agent memory.
That is a much sharper claim. It connects frontier-model deployment, multimodal inference, agent onboarding, human-in-the-loop supervision, and production knowledge transfer. It is also the kind of demo that immediately makes sense to teams who have felt the pain.
Sources: NVIDIA Nemotron 3 Nano Omni BF16 model card, FP8 variant, NVFP4 variant, SGLang Nemotron 3 Nano Omni cookbook.