Quickstart
Get a model serving in five minutes. This page covers the fast path: deploy the pack, apply one LLMModel, and call the API.
For a full production-grade setup (air-gapped clusters, HuggingFace token secrets, monitoring, and all values), see Installation and Configuration.
Prerequisites #
- Kubernetes 1.28+ cluster with Nebari Infrastructure Core deployed (Installation §1)
- nebari-operator running (Installation §2.7)
- NVIDIA GPU Operator installed (auto-discovers GPU nodes). Note: nebari-infrastructure-core does not install this automatically yet; see nebari-dev/nebari-infrastructure-core#232. Until then, install it manually via ArgoCD (see examples/nvidia-gpu-operator.yaml). (Installation §3)
- Envoy Gateway configured for AI Gateway integration: extensionApis.enableBackend, extensionManager pointing at the AI Gateway controller service, and backendResources allowing inference.networking.k8s.io/InferencePool. Ready-to-apply example: examples/envoy-gateway.yaml. (Installation §6 for full wiring details)
- Envoy AI Gateway v0.5.0+ installed. Note: the envoyAIGateway.install chart flag is not yet implemented (#44). Install manually via ArgoCD (see examples/envoy-ai-gateway.yaml). (Installation §5)
- Gateway API Inference Extension (InferencePool / InferenceModel CRDs) (Installation §4)
- A cert-manager ClusterIssuer for the shared TLS certificate (default name: letsencrypt-production; override with platform.tls.clusterIssuer) (Installation §2.4)
- DNS for llm.<baseDomain> and llm-internal.<baseDomain> pointing at the shared Gateway load balancer (Installation §2.3)
- A StorageClass that can provision PVCs large enough for your models (EFS, EBS gp3, or equivalent) (Installation §1; sizing guidance)
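The DNS prerequisite can be checked before deploying anything. The following is a minimal sketch, not part of the pack: the helper names `gateway_hostnames` and `check_dns` are hypothetical, and the base domain is a placeholder. It derives the two hostnames named in the list above and reports what each resolves to.

```python
import socket


def gateway_hostnames(base_domain: str) -> list[str]:
    """Return the external and internal hostnames for a given baseDomain."""
    return [f"llm.{base_domain}", f"llm-internal.{base_domain}"]


def resolve(hostname: str) -> list[str]:
    """Return the sorted unique addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_dns(base_domain: str) -> dict[str, list[str]]:
    """Map each gateway hostname to its resolved addresses (performs DNS lookups)."""
    return {host: resolve(host) for host in gateway_hostnames(base_domain)}
```

Calling `check_dns("your-cluster.example.com")` with your own base domain should return a non-empty address list for both hostnames. An empty list means the record is missing or has not propagated yet. Whether the addresses match your Gateway load balancer still has to be confirmed by hand.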
Deploy the pack #
The pack is deployed as an ArgoCD Application. A multi-source setup lets you keep model definitions in a separate Git repo:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nebari-llm-serving
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "7"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: foundational # adjust to your ArgoCD project
  sources:
    # Source 1: LLM serving pack Helm chart
    - repoURL: https://github.com/nebari-dev/nebari-llm-serving-pack.git
      targetRevision: v0.1.0-alpha.9
      path: charts/nebari-llm-serving
      helm:
        releaseName: nebari-llm-serving
        values: |
          platform:
            baseDomain: "your-cluster.example.com"
            # Gateway names below must match the Envoy Gateways in your cluster.
            # This example points both endpoints at one shared gateway; the chart
            # default for the internal gateway is "nebari-internal-gateway".
            gateway:
              external:
                name: nebari-gateway
                namespace: envoy-gateway-system
              internal:
                name: nebari-gateway
                namespace: envoy-gateway-system
              manageSharedListeners: true
            tls:
              clusterIssuer: letsencrypt-production
          defaults:
            storage:
              storageClassName: efs-sc # or gp3, longhorn, etc.
          auth:
            oidc:
              issuerURL: "https://keycloak.your-cluster.example.com/realms/nebari"
              groupsClaim: groups
          keyManager:
            enabled: true
    # Source 2: LLMModel CRs from your cluster config repo
    - repoURL: https://github.com/your-org/your-cluster-config.git
      targetRevision: main
      path: clusters/your-cluster/manifests/llm-models
  destination:
    server: https://kubernetes.default.svc
    namespace: nebari-llm-serving-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
      - SkipDryRunOnMissingResource=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
For all available Helm values, see Configuration.
Deploy a model #
Add an LLMModel resource to your cluster config repo (the path referenced by Source 2 above):
apiVersion: llm.nebari.dev/v1alpha1
kind: LLMModel
metadata:
  name: qwen3-5-35b-a3b-gptq-int4
  namespace: nebari-llm-serving-system
spec:
  model:
    name: "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4"
    source: huggingface
    storage:
      type: pvc
      size: "30Gi"
      # storageClassName: efs-sc # optional, overrides the pack default
  resources:
    gpu:
      count: 1
      type: nvidia
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "12Gi"
  serving:
    replicas: 1
    tensorParallelism: 1
    vllmArgs:
      - "--quantization"
      - "gptq_marlin"
      - "--max-model-len"
      - "8192"
  access:
    public: false
    groups:
      - "llm"
  endpoints:
    external:
      enabled: true
    internal:
      enabled: true
For gated HuggingFace models, create a Secret with your HuggingFace token and reference it:
spec:
  model:
    authSecretName: hf-token # Secret with key "HF_TOKEN"
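One way to produce that Secret is to generate its manifest programmatically. The sketch below is illustrative only: the helper `hf_token_secret` is hypothetical, the token is a placeholder, and it assumes the Secret belongs in the same namespace as the LLMModel (nebari-llm-serving-system). The Secret name hf-token and the key HF_TOKEN come from the snippet above.

```python
import base64
import json


def hf_token_secret(
    token: str,
    name: str = "hf-token",
    namespace: str = "nebari-llm-serving-system",  # assumption: same namespace as the LLMModel
) -> dict:
    """Build a Kubernetes Secret manifest that stores a HuggingFace token under HF_TOKEN."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under `data` must be base64-encoded strings.
        "data": {"HF_TOKEN": base64.b64encode(token.encode("utf-8")).decode("ascii")},
    }


if __name__ == "__main__":
    # "hf_example" is a placeholder, not a real token.
    print(json.dumps(hf_token_secret("hf_example"), indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped to `kubectl apply -f -`. Avoid committing the generated manifest to Git, since base64 is an encoding and not encryption.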
The operator handles the rest: model download, vLLM pods, InferencePool, routing, and auth. Watch progress with:
kubectl -n nebari-llm-serving-system get llmmodels -w
For the full CRD reference including all spec fields, see Configuration.
Use the model #
All models on the cluster share one hostname pair. Clients select a model via the model field in the request body, matching the OpenAI API convention.
External access (API key) #
Generate a key via the key manager UI, served at the hostname you set in keyManager.nebariApp.hostname (e.g. https://keys.llm.<baseDomain>/). Then:
curl https://llm.your-cluster.example.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4", "messages": [{"role": "user", "content": "Hello"}]}'
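The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send` are hypothetical, and `send` assumes the response body follows the OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first reply's text.

    Assumes an OpenAI-style response body; raises urllib.error.HTTPError
    on non-2xx responses such as 401 for an invalid key.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `send(build_chat_request("https://llm.your-cluster.example.com", "sk-your-api-key", "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4", "Hello"))` issues the same call as the curl command. Keeping request construction separate from sending makes the payload easy to inspect before anything goes over the network.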
Internal access (JWT from JupyterLab or in-cluster service) #
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://llm-internal.your-cluster.example.com/v1",
    api_key=os.environ["JUPYTERHUB_API_TOKEN"],  # JWT from Nebari
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Hello"}],
)

# Print the text of the first reply.
print(response.choices[0].message.content)