Blackwell Shenanigans 001: Kimi K2.6, Tiny Box, Real Victory

This week’s frontier-model-in-a-small-Blackwell-shaped-box experiment ended with a useful answer: yes, Kimi K2.6 can fit, but only if you stop acting like the box is an H200.

Every five to seven days, Kirsten appears to develop a fresh urge to stuff a frontier model into a machine that should probably get a nicer afternoon. This week's target was moonshotai/Kimi-K2.6 on a single-node 8x RTX PRO 6000 Blackwell Server Edition box.

The short version is satisfying: yes, this can be made useful. The less cinematic version is that you need to start conservatively, treat PCIe as a real constraint, and avoid confusing "total VRAM exists" with "everything about this deployment is now easy."

What Worked First

The cleanest starting point was the community-patched vLLM image that already absorbed a lot of the Blackwell-specific sharp edges:

./scripts/serve-kimi-k26-community-docker.sh

That path bakes in the stack that mattered most for this shape:

  • the TRITON_MLA attention backend
  • an fp8 KV cache
  • DCP (decode context parallelism)
  • patched NCCL behavior for the RTX PRO 6000 Blackwell class
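
For orientation, here is a rough sketch of what that stack corresponds to in raw vLLM terms. The flag names are my assumption based on recent vLLM releases, not a dump of what the image actually runs, and the NCCL patches have no flag equivalent at all; the whole point of the community image is that it wires this up for you:

# Illustrative only: flag names assumed from recent vLLM, not read out of the image
VLLM_ATTENTION_BACKEND=TRITON_MLA \
vllm serve moonshotai/Kimi-K2.6 \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --decode-context-parallel-size 8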

For the first pass, the winning mindset was boring on purpose:

  • native weights first
  • no speculative decoding first
  • medium context first
  • text-only smoke test first

It is not glamorous, but neither is explaining to yourself why a trillion-parameter MoE has started spelling common nouns like it is being haunted.

Recommended Starting Command

This was the safest "let's get the thing upright" baseline:

MODEL_PATH=moonshotai/Kimi-K2.6 \
SERVED_MODEL_NAME=Kimi-K2.6 \
MAX_MODEL_LEN=65536 \
GPU_MEMORY_UTILIZATION=0.90 \
PORT=5000 \
./scripts/serve-kimi-k26-community-docker.sh
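
Once it claims to be up, the cheapest first liveness check is just listing models on the OpenAI-compatible endpoint (the container path above serves on port 5000):

curl -s http://127.0.0.1:5000/v1/models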

If you prefer the local vLLM path instead of the container:

MODEL_PATH=moonshotai/Kimi-K2.6 \
SERVED_MODEL_NAME=Kimi-K2.6 \
MAX_MODEL_LEN=65536 \
GPU_MEMORY_UTILIZATION=0.92 \
PORT=8000 \
./scripts/serve-kimi-k26-vllm.sh

And once the server is alive, make it prove that with something more demanding than vibes (point --base-url at whichever port you chose; this example assumes the local vLLM path on 8000):

python3 scripts/smoke_test_openai.py \
  --base-url http://127.0.0.1:8000/v1 \
  --model Kimi-K2.6
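
If that script is not at hand, a bare-bones approximation of one rung of it is a single deterministic chat completion over curl. The prompt here is just an illustration, not what the script actually sends:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Kimi-K2.6",
    "temperature": 0,
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with exactly the JSON object {\"ok\": true} and nothing else."}]
  }'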

Why This Shape Can Work At All

The reason this is even worth attempting is straightforward:

  • 8 x 96 GB gives you 768 GB of total VRAM.
  • Kimi K2.6 lands in the rough 595-610 GB class depending on the artifact path.
  • That leaves room to serve the model, but not room to get cocky with huge context and concurrency on day one.
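
Back-of-envelope: at GPU_MEMORY_UTILIZATION=0.90 the runtime gets to claim about 768 × 0.90 ≈ 691 GB, and subtracting the rough 595-610 GB of weights leaves on the order of 80-95 GB for KV cache, activations, and fragmentation across all eight cards. That headroom is real, but it is exactly why 65536 is the opening context bid and not 262144.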

So the question stops being "does the model fit?" and becomes "what are the first settings that do not immediately make the runtime regret knowing you?"

For this hardware, those first useful answers were:

  • keep the initial context target around 65536
  • use the Blackwell-aware community image before improvising
  • disable EAGLE3 until base generation is clean
  • test chat, code, and JSON structure before turning the dial up (a tool-call probe is sketched after this list)
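
For the tool-call side of that checklist, a minimal probe is a standard OpenAI-style tools request. This assumes the serving script has vLLM's tool parsing enabled for Kimi, and the get_weather function here is invented purely for the test:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

A healthy response comes back with a structured tool_calls entry naming get_weather, not a prose paragraph about Berlin.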

When To Reach For SGLang

If the vLLM path boots dirty or behaves strangely, SGLang is the next sensible move instead of the first panic spiral:

./scripts/serve-kimi-k26-sglang.sh

That gives you a second official serving path with Kimi-specific parser support, which is useful when you are trying to separate "the model is bad" from "the stack around the model is being unserious."
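
I have not cracked that script open here, but based on SGLang's usual entry point it presumably wraps something in the shape of the following; treat the exact flags (especially the parser name) as assumptions to verify against the script itself:

# Sketch only; flag names assumed from recent SGLang, check the script for truth
python3 -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.6 \
  --tp 8 \
  --tool-call-parser kimi_k2 \
  --port 8000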

EAGLE3: Later, Not First

Speculative decoding is real leverage when it is stable. It is also a magnificent way to hide weirdness under a throughput number if you enable it too early.

Only after the base model serves clean text would I try the draft path:

ENABLE_EAGLE3=1 \
EAGLE3_MODEL=lightseekorg/kimi-k2.6-eagle3 \
./scripts/serve-kimi-k26-community-docker.sh

If output quality changes, JSON tool calls get sloppy, or long-context behavior gets cursed, turn it right back off and keep the win you already earned.
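
One cheap way to catch that drift is to pin a deterministic prompt before and after the flag flips and diff the results. The file names here are made up for illustration:

# Capture the same greedy completion with and without EAGLE3, then compare.
curl -s http://127.0.0.1:5000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Kimi-K2.6", "temperature": 0, "max_tokens": 256,
       "messages": [{"role": "user", "content": "Write a Python function that parses ISO-8601 dates."}]}' \
  > with-eagle3.json
diff without-eagle3.json with-eagle3.json

Speculative decoding is supposed to preserve the output distribution, so a greedy run that diverges wildly is a signal, even if numeric noise keeps the two files from being bit-identical.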

The Practical Test Ladder

If some other delightful maniac finds this page through Google with the same plan in mind, this is the order I would still use:

  1. Boot K2.6 with native weights and no speculative decoding.
  2. Keep MAX_MODEL_LEN=65536 until the model is boring in the best possible way.
  3. Run smoke tests for plain language, arithmetic, code generation, and JSON formatting.
  4. Verify tool calls with the Kimi parser before increasing complexity.
  5. Walk context up through 98304, 131072, and only then consider 262144 (one rung of that walk is sketched after this list).
  6. Enable EAGLE3 only after you trust the non-draft baseline.
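
For step 5, each rung is the same serve command with a bigger MAX_MODEL_LEN; for example, the first step up from the baseline:

MODEL_PATH=moonshotai/Kimi-K2.6 \
SERVED_MODEL_NAME=Kimi-K2.6 \
MAX_MODEL_LEN=98304 \
GPU_MEMORY_UTILIZATION=0.90 \
PORT=5000 \
./scripts/serve-kimi-k26-community-docker.sh

Rerun the full smoke battery at each rung before climbing again.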

Closing Note

Today's victory is not that the box is secretly enormous. It is that the box is just big enough if you respect what kind of box it is.

That is the spirit of Blackwell Shenanigans, really: not "can we do something reckless?" but "can we do something ambitious, get away with it, and leave behind instructions for the next weirdo."
