DgxSparkTuner
cluster_admin
② Admin & Control Plane
cluster_gateway
③ Inference Gateway
cluster_router
④ Process Manager (Parent)
cluster_child
Inference Engine (Child Process)
User
① USER
Operator / SRE / Developer
- Issues CRUD on tuned models
- Sends direct inference requests
- Reads benchmark & recipe reports
Tuner
dgx-spark-tuner
Rust · CLI + REST API · Port :8080
Responsibilities:
- Resolve models via hf-hub crate
- Ask Oracle LLM for 3 recipe candidates
- Drive llama-benchy tuning runs
- Persist winning recipe + hash
- Expose idempotent CRUD endpoints
ID = model + runtime + recipe-hash
User->Tuner
A: CRUD Models
POST /models — tune new recipe
GET /models — list / filter
PUT /models/:id — re-tune if better
DEL /models/:id — unload + purge
LiteLLM
liteLLM
OpenAI-compatible proxy · Port :14000
Responsibilities:
- Unified /v1/chat/completions surface
- Auth, rate limits, request logging
- Routes Oracle calls for recipe synthesis
- Forwards benchmark traffic during tuning
- Hides per-engine quirks from clients
User->LiteLLM
B: Direct Inference
OpenAI-compatible chat / completions
Streaming SSE supported
API-key auth, per-user quotas
Tuner->LiteLLM
1: Ask Oracle for recipes
Prompt: model card + HW profile
Returns: 3 candidate recipes (JSON)
Includes engine choice + flags
Tuner->LiteLLM
2: Benchmarking traffic
Drives llama-benchy workloads
Measures TTFT, TPS, VRAM, p95
Iterates over the 3 candidates
Swap
llama-swap router
Process supervisor · Port :28080
Responsibilities:
- Cold-start engines on demand
- Enforce single-tenant GPU occupancy
- Idle-evict engines to free VRAM
- Stream stdout/stderr for tuner logs
- Health-check & restart on crash
LiteLLM->Swap
3: Forward traffic
Routes by tuned-model ID
Triggers cold-start if engine idle
Backpressure on VRAM contention
Engine
Engine Backends
Exactly one active at a time
-
vLLM
:18000 — PagedAttention, FP8/AWQ, high throughput
-
llama.cpp
:19000 — GGUF, CPU+GPU offload, broad model coverage
-
Atlas
:18888 — NVIDIA-optimised, Spark-native kernels
-
Ollama
:11434 — friendly UX, model library, dev workflows
Recipe selects engine + flags + quant + ctx-len
Swap->Engine
4: Spawns / Kills
fork+exec with recipe-derived argv
SIGTERM → SIGKILL on idle timeout
Captures stdout/stderr for tuner
Reports ready-state via /health
Overview
0 / 0
→
Space
next ·
←
prev ·
Home
overview ·
End
last ·
1
–
9
jump ·
H
hide chrome ·
F
fullscreen