DgxSparkTuner

① USER
Operator / SRE / Developer
- Issues CRUD on tuned models
- Sends direct inference requests
- Reads benchmark & recipe reports

② Admin & Control Plane
dgx-spark-tuner
Rust · CLI + REST API · Port :8080
Responsibilities:
- Resolve models via hf-hub crate
- Ask Oracle LLM for 3 recipe candidates
- Drive llama-benchy tuning runs
- Persist winning recipe + hash
- Expose idempotent CRUD endpoints
ID = model + runtime + recipe-hash

User -> Tuner
A: CRUD Models
- POST /models: tune new recipe
- GET /models: list / filter
- PUT /models/:id: re-tune if better
- DEL /models/:id: unload + purge

③ Inference Gateway
liteLLM
OpenAI-compatible proxy · Port :14000
Responsibilities:
- Unified /v1/chat/completions surface
- Auth, rate limits, request logging
- Routes Oracle calls for recipe synthesis
- Forwards benchmark traffic during tuning
- Hides per-engine quirks from clients

User -> LiteLLM
B: Direct Inference
- OpenAI-compatible chat / completions
- Streaming SSE supported
- API-key auth, per-user quotas

Tuner -> LiteLLM
1: Ask Oracle for recipes
- Prompt: model card + HW profile
- Returns: 3 candidate recipes (JSON)
- Includes engine choice + flags

Tuner -> LiteLLM
2: Benchmarking traffic
- Drives llama-benchy workloads
- Measures TTFT, TPS, VRAM, p95
- Iterates over the 3 candidates

④ Process Manager (Parent)
llama-swap router
Process supervisor · Port :28080
Responsibilities:
- Cold-start engines on demand
- Enforce single-tenant GPU occupancy
- Idle-evict engines to free VRAM
- Stream stdout/stderr for tuner logs
- Health-check & restart on crash

LiteLLM -> Swap
3: Forward traffic
- Routes by tuned-model ID
- Triggers cold-start if engine idle
- Backpressure on VRAM contention

Inference Engine (Child Process)
Engine Backends
Exactly one active at a time
- vLLM (:18000): PagedAttention, FP8/AWQ, high throughput
- llama.cpp (:19000): GGUF, CPU+GPU offload, broad model coverage
- Atlas (:18888): NVIDIA-optimised, Spark-native kernels
- Ollama (:11434): friendly UX, model library, dev workflows
Recipe selects engine + flags + quant + ctx-len

Swap -> Engine
4: Spawns / Kills
- fork+exec with recipe-derived argv
- SIGTERM → SIGKILL on idle timeout
- Captures stdout/stderr for tuner
- Reports ready-state via /health
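
The diagram states that a tuned-model ID is built as model + runtime + recipe-hash, and that the tuner benchmarks 3 candidates on TTFT, TPS, VRAM and p95 before persisting the winner. A minimal Rust sketch of both ideas follows. Everything concrete in it is an assumption made for illustration and not taken from dgx-spark-tuner: the `Recipe` and `Bench` field names, the `+` separator, the use of FNV-1a as the hash, and the selection rule (highest TPS within a VRAM budget, ties broken by lower TTFT and then lower p95).

```rust
use std::cmp::Ordering;

/// FNV-1a (64-bit): a small deterministic hash used here as a stand-in.
/// Assumption: the real tuner may use a different hash entirely.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical recipe shape: engine + flags + quant + ctx-len.
#[derive(Debug, Clone, PartialEq)]
struct Recipe {
    engine: String,
    flags: Vec<String>,
    quant: String,
    ctx_len: u32,
}

impl Recipe {
    /// Canonical text form that gets hashed; flag order is kept as given.
    fn canonical(&self) -> String {
        format!(
            "engine={};flags={};quant={};ctx={}",
            self.engine,
            self.flags.join(" "),
            self.quant,
            self.ctx_len
        )
    }

    /// 16 hex digits of the FNV-1a hash of the canonical form.
    fn hash_hex(&self) -> String {
        format!("{:016x}", fnv1a64(self.canonical().as_bytes()))
    }
}

/// ID = model + runtime + recipe-hash (the "+" separator is an assumption).
fn tuned_model_id(model: &str, runtime: &str, recipe: &Recipe) -> String {
    format!("{}+{}+{}", model, runtime, recipe.hash_hex())
}

/// Hypothetical benchmark result holding the four metrics from the diagram.
#[derive(Debug, Clone, Copy)]
struct Bench {
    ttft_ms: f64,
    tps: f64,
    vram_gb: f64,
    p95_ms: f64,
}

/// Returns the index of the winning candidate, or None if nothing fits.
/// Assumed rule: highest TPS among candidates within the VRAM budget;
/// ties go to lower TTFT, then lower p95 latency.
fn pick_winner(results: &[Bench], vram_budget_gb: f64) -> Option<usize> {
    results
        .iter()
        .enumerate()
        .filter(|(_, b)| b.vram_gb <= vram_budget_gb)
        .max_by(|(_, a), (_, b)| {
            a.tps
                .partial_cmp(&b.tps)
                .unwrap_or(Ordering::Equal)
                .then(b.ttft_ms.partial_cmp(&a.ttft_ms).unwrap_or(Ordering::Equal))
                .then(b.p95_ms.partial_cmp(&a.p95_ms).unwrap_or(Ordering::Equal))
        })
        .map(|(i, _)| i)
}
```

Because the ID is a pure function of model, runtime and recipe contents, repeating a POST /models with the same winning recipe would yield the same ID, which is one way the CRUD endpoints could stay idempotent. A PUT /models/:id could likewise call `pick_winner` over the stored result plus the new candidates and replace the recipe only when the index changes.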