Monitoring Guide¶

kardinal-promoter exposes Prometheus metrics at :8080/metrics (the default metricsBindAddress). The endpoint is scraped by standard Prometheus installations.

Scrape Configuration¶

Add kardinal-promoter to your Prometheus scrape config:

# prometheus.yml
scrape_configs:
  - job_name: kardinal-promoter
    static_configs:
      - targets: ['kardinal-promoter-controller.kardinal-system.svc.cluster.local:8080']
    metrics_path: /metrics

If you use the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kardinal-promoter
  namespace: kardinal-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kardinal-promoter
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

The Helm chart creates the Service with the metrics port (8080) automatically.

Controller-Runtime Standard Metrics¶

kardinal-promoter uses controller-runtime, which automatically exposes the following metrics:

Reconciler metrics¶

Metric	Type	Description
`controller_runtime_reconcile_total`	Counter	Total reconcile operations, labelled by `controller` and `result` (`success`, `error`, `requeue`)
`controller_runtime_reconcile_errors_total`	Counter	Total reconcile errors, labelled by `controller`
`controller_runtime_reconcile_time_seconds`	Histogram	Time spent per reconcile loop, labelled by `controller`
`controller_runtime_max_concurrent_reconciles`	Gauge	Configured max concurrent reconciles per controller
`controller_runtime_active_workers`	Gauge	Active reconcile goroutines per controller

Controller labels: bundle, promotionstep, policygate, metriccheck

Work queue metrics¶

Metric	Type	Description
`workqueue_adds_total`	Counter	Items added to the work queue per controller
`workqueue_depth`	Gauge	Current work queue depth per controller
`workqueue_queue_duration_seconds`	Histogram	Time items spend in the queue before processing
`workqueue_work_duration_seconds`	Histogram	Time spent processing each item
`workqueue_retries_total`	Counter	Total retries per controller

Webhook metrics¶

Metric	Type	Description
`controller_runtime_webhook_requests_total`	Counter	Bundle creation webhook calls by `result`
`controller_runtime_webhook_request_duration_seconds`	Histogram	Webhook latency

Go Runtime Metrics¶

Standard Go runtime metrics are also exposed:

Metric	Description
`go_goroutines`	Current goroutine count
`go_memstats_alloc_bytes`	Heap memory in use
`go_gc_duration_seconds`	GC pause duration
`process_cpu_seconds_total`	CPU time consumed

Sample PromQL Queries¶

Active bundles (Promoting or Available)¶

# Total bundle reconcile operations in the last 5 minutes
rate(controller_runtime_reconcile_total{controller="bundle"}[5m])

Promotion step error rate¶

# Fraction of PromotionStep reconciles that error
rate(controller_runtime_reconcile_errors_total{controller="promotionstep"}[5m])
/
rate(controller_runtime_reconcile_total{controller="promotionstep"}[5m])

PolicyGate evaluation rate¶

# PolicyGate evaluations per second
rate(controller_runtime_reconcile_total{controller="policygate"}[5m])

Controller health: reconcile latency P99¶

histogram_quantile(0.99,
  rate(controller_runtime_reconcile_time_seconds_bucket{controller="promotionstep"}[5m])
)

Work queue depth (are we falling behind?)¶

workqueue_depth{name=~"bundle|promotionstep|policygate"}

Webhook latency P95¶

histogram_quantile(0.95,
  rate(controller_runtime_webhook_request_duration_seconds_bucket[5m])
)

DORA Metrics via CLI¶

kardinal-promoter also exposes DORA-style promotion metrics via the kardinal metrics command (not Prometheus — these are computed from CRD history):

kardinal metrics --pipeline my-app --env prod --days 30

Output:

PIPELINE   ENV    PERIOD   DEPLOY_FREQ   LEAD_TIME     FAIL_RATE   ROLLBACKS
my-app     prod   30d      2.1/day       45m avg       3.2%        1

Metric	Description
`DEPLOY_FREQ`	Bundles successfully promoted to the target environment per day
`LEAD_TIME`	Average time from Bundle creation to target environment verification
`FAIL_RATE`	Percentage of Bundles that reached `Failed` state
`ROLLBACKS`	Number of rollback Bundles in the period

Alerting Rules — PrometheusRule (Helm)¶

kardinal-promoter ships a PrometheusRule resource that works with the Prometheus Operator (kube-prometheus-stack, Victoria Metrics Operator, etc.).

Enable via Helm¶

# values.yaml
prometheusRule:
  enabled: true
  # Match your Prometheus Operator's selector labels:
  additionalLabels:
    release: kube-prometheus-stack

Install or upgrade:

helm upgrade --install kardinal-promoter oci://ghcr.io/pnz1990/kardinal-promoter/chart/kardinal-promoter \
  --set prometheusRule.enabled=true \
  --set 'prometheusRule.additionalLabels.release=kube-prometheus-stack'

Included alerts¶

Alert	Expression	Severity	Description
`KardinalControllerDown`	`up{job="kardinal-promoter"} == 0` for 5m	critical	Controller unreachable; promotions stalled
`KardinalHighReconcileErrors`	reconcile error rate > 0.1/s for 5m	warning	Reconciler failing; promotions may be stuck
`KardinalBundleReconcilerStalled`	no bundle reconciles for 10m	warning	Controller idle; possible leader election issue
`KardinalWorkQueueBacklog`	work queue depth > 100 for 5m	warning	Controller overloaded or starved of CPU
`KardinalPolicyGateReconcileSlow`	PolicyGate P99 latency > 10s	warning	Gate evaluations stale; promotions blocked late
`KardinalWebhookErrors`	webhook 5xx rate > 0 for 5m	warning	Bundle creation API failing; CI pipelines cannot promote

Every alert includes a runbook_url annotation pointing to the relevant section in troubleshooting.md.

Manual alerting (without Prometheus Operator)¶

If you don't use the Prometheus Operator, copy the rules into your prometheus.yml:

groups:
  - name: kardinal-promoter
    rules:
      - alert: KardinalControllerDown
        expr: up{job="kardinal-promoter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "kardinal-promoter controller is not scraping"
          runbook_url: "https://pnz1990.github.io/kardinal-promoter/troubleshooting/#controller-not-running"

      - alert: KardinalHighReconcileErrors
        expr: |
          sum by (controller) (
            rate(controller_runtime_reconcile_errors_total[5m])
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "kardinal {{ $labels.controller }} reconcile error rate elevated"
          runbook_url: "https://pnz1990.github.io/kardinal-promoter/troubleshooting/#reconcile-errors"

      - alert: KardinalWorkQueueBacklog
        expr: workqueue_depth{name=~"bundle|promotionstep|policygate"} > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "kardinal work queue depth > 100 for {{ $labels.name }}"
          runbook_url: "https://pnz1990.github.io/kardinal-promoter/troubleshooting/#work-queue-backlog"

Grafana Dashboard¶

A community Grafana dashboard is available at:

Dashboard ID: pending submission to grafana.com

In the meantime, import the following JSON panels manually:

Panel: Reconcile rate by controller

{
  "targets": [{
    "expr": "sum by (controller) (rate(controller_runtime_reconcile_total[5m]))",
    "legendFormat": "{{controller}}"
  }],
  "type": "timeseries",
  "title": "Reconcile rate (ops/s)"
}

Panel: Reconcile error rate

{
  "targets": [{
    "expr": "sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))",
    "legendFormat": "{{controller}} errors"
  }],
  "type": "timeseries",
  "title": "Reconcile errors (errors/s)"
}

Changing the Metrics Port¶

Set metricsBindAddress in Helm values to use a different port:

# values.yaml
metricsBindAddress: ":9090"
service:
  metricsPort: 9090