vLLM Production Stack

vLLM Production Stack 研究

Posted by ZBX on April 2, 2025

vLLM 生产栈 的可观测性

Kube-Prometheus-Stack 和 Prometheus Adapter 部署说明

kube-prometheus-stack:

  • 包含 Prometheus、Grafana 和 Alertmanager
  • 提供 Kubernetes 集群的全面监控
  • 包含预定义的仪表盘和告警规则

prometheus-adapter:

  • 将 Prometheus 指标转换为 Kubernetes 自定义指标
  • 允许使用 Prometheus 指标进行 Horizontal Pod Autoscaling (HPA)

kube-prom-stack.yaml

## Create default rules for monitoring the cluster
#
# Disable `etcd` and `kubeScheduler` rules (managed by DOKS, so metrics are not accessible)
defaultRules:
  create: true
  rules:
    etcd: false
    kubeScheduler: false

## Component scraping kube scheduler
##
# Disabled because it's being managed by DOKS, so it's not accessible
kubeScheduler:
  enabled: false

## Component scraping etcd
##
# Disabled because it's being managed by DOKS, so it's not accessible
kubeEtcd:
  enabled: false

alertmanager:
  ## Deploy alertmanager
  ##
  enabled: true
  # config:
  #   global:
  #     resolve_timeout: 5m
  #     slack_api_url: "<YOUR_SLACK_APP_INCOMING_WEBHOOK_URL_HERE>"
  #   route:
  #     receiver: "slack-notifications"
  #     repeat_interval: 12h
  #     routes:
  #       - receiver: "slack-notifications"
  #         # matchers:
  #         #   - alertname="EmojivotoInstanceDown"
  #         # continue: false
  #   receivers:
  #     - name: "slack-notifications"
  #       slack_configs:
  #         - channel: "#<YOUR_SLACK_CHANNEL_NAME_HERE>"
  #           send_resolved: true
  #           title: "\n"
  #           text: "\n"

# additionalPrometheusRulesMap:
#   rule-name:
#     groups:
#     - name: emojivoto-instance-down
#       rules:
#         - alert: EmojivotoInstanceDown
#           expr: sum(kube_pod_owner{namespace="emojivoto"}) by (namespace) < 4
#           for: 1m
#           labels:
#             severity: 'critical'
#             alert_type: 'infrastructure'
#           annotations:
#             description: ' The Number of pods from the namespace  is lower than the expected 4. '
#             summary: 'Pod in  namespace down'

## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
##
grafana:
  enabled: true
  adminPassword: prom-operator # Please change the default password in production !!!
#   affinity:
#     nodeAffinity:
#       preferredDuringSchedulingIgnoredDuringExecution:
#       - weight: 1
#         preference:
#           matchExpressions:
#           - key: preferred
#             operator: In
#             values:
#             - observability

  # # Starter Kit setup for DigitalOcean Block Storage
  # persistence:
  #   enabled: true
  #   storageClassName: do-block-storage
  #   accessModes: ["ReadWriteOnce"]
  #   size: 5Gi

## Manages Prometheus and Alertmanager components
##
prometheusOperator:
  enabled: true

## Deploy a Prometheus instance
##
prometheus:
  enabled: true

  # Monitor vLLM pods using ServiceMonitor
  additionalServiceMonitors:
    - name: "vllm-monitor"
      selector:
        matchLabels:
          app.kubernetes.io/managed-by: Helm
          environment: test
          release: test
      namespaceSelector:
        matchNames:
          - default
      endpoints:
        - port: "service-port"

  1. 默认告警规则:
    • 禁用了 etcd 和 kube-scheduler 的监控规则(因为这些组件由 DigitalOcean Kubernetes 管理)
  2. 组件监控:
    • 禁用了 kube-scheduler 和 etcd 的监控(不可访问)
  3. Alertmanager:
    • 已启用,但默认没有配置通知(注释部分展示了如何配置 Slack 通知)
  4. Grafana:
    • 已启用,默认管理员密码为 “prom-operator”
    • 注释部分展示了如何配置持久化存储
  5. Prometheus:
    • 已启用
    • 配置了额外的 ServiceMonitor 来监控 vLLM 服务

prom-adapter.yaml

loglevel: 1

prometheus:
  url: http://kube-prom-stack-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  default: true
  custom:

  # Example metric to export for HPA
  - seriesQuery: '{__name__=~"^vllm:num_requests_waiting$"}'
    resources:
      overrides:
        namespace:
          resource: "namespace"
    name:
      matches: ""
      as: "vllm_num_requests_waiting"
    metricsQuery: sum by(namespace) (vllm:num_requests_waiting)

  1. Prometheus 连接配置:
    • 指定了 Prometheus 服务的 URL 和端口
  2. 自定义规则:
    • 定义了一个示例规则来暴露 vLLM 的请求等待数指标
    • vllm:num_requests_waiting 指标转换为 vllm_num_requests_waiting 自定义指标

整体类似如下

image-20250403002737207