kardinal-promoter: Technical Design Document¶
Version: 2.1 Date: 2026-04-09 License: Apache 2.0
Overview¶
kardinal-promoter is a Kubernetes-native promotion controller. It moves versioned artifact bundles through environment pipelines using Git pull requests as the approval mechanism, with policy gates expressed as CEL and represented as visible nodes in the promotion DAG.
The execution engine is kro's Graph primitive, a general-purpose Kubernetes DAG reconciler. Graph handles dependency ordering, parallel execution, conditional inclusion, and teardown. kardinal-promoter handles the promotion-specific logic: Git writes, PR lifecycle, policy evaluation, and health verification.
All state lives in Kubernetes CRDs. There is no external database, no dedicated API server, and no state outside of etcd. The CLI, UI, and webhook endpoints create and read CRDs. A user can operate the entire system with kubectl.
Relationship to kro and Graph¶
kro's Graph primitive (experimental.kro.run/v1alpha1/Graph) is a namespace-scoped CRD that defines a set of nodes with Kubernetes resource templates, CEL expressions for data flow, and dependency inference from those expressions. It supports readyWhen, includeWhen, forEach, propagateWhen, and finalizes on each node. The Graph controller reconciles the DAG using scoped walks and hash-based change detection. Reference implementation: ellistarn/kro/tree/krocodile/experimental.
API group note (krocodile commit
48224264, 2026-04-10): Graph CRD moved fromkro.runtoexperimental.kro.runto eliminate CRD conflicts with upstream kro. Useexperimental.kro.runin all GVK/GVR references.
Within the kro ecosystem, the relationship is:
- Graph is the core DAG primitive. It creates, reconciles, and tears down sets of Kubernetes resources in dependency order.
- RGD (ResourceGraphDefinition) is a higher-level concept built on Graph for resource composition and CRD generation.
- kardinal-promoter is a separate concept built on Graph for promotion orchestration and policy gating.
The advantage of building on Graph directly (rather than on RGD) is that there is no translation shim. With RGD, the controller would need to generate an RGD spec from the Pipeline CRD, which introduces an abstraction layer between the user's intent and the execution engine. The RGD layer also carries resource-composition semantics (CRD generation, schema validation, instance controllers) that are irrelevant to promotion. With Graph, the controller generates a Graph spec whose nodes are exactly the PromotionStep and PolicyGate CRDs that kardinal-promoter needs. The Graph controller creates those CRDs in DAG order. No shim, no intermediate abstraction, no unused semantics.
1. Design Principles¶
-
The PR is the approval surface. Every promotion (forward or rollback) produces a Git pull request. The PR body contains promotion evidence: artifact provenance, upstream verification results, and policy gate compliance. Human approval for gated environments happens by merging the PR.
-
Artifacts, not diffs. Promotions track versioned, immutable Bundles with build provenance (commit SHA, CI run URL, author, image digest), not opaque Git diffs.
-
Works with existing GitOps tools. No dependency on Argo CD or Flux. Integrates with both via pluggable health adapters that auto-detect installed CRDs. Also works without a GitOps tool (falls back to Deployment condition checks).
-
Rollback is a forward promotion. Rolling back to a previous version creates a new Bundle targeting the rollback version and runs it through the same pipeline, same policy gates, same PR flow.
-
Pipelines are graphs. Every pipeline, including linear ones, runs as a kro Graph internally. Parallel fan-out, conditional steps, and multi-service dependencies are native to Graph. Power users can write Graph specs directly.
-
Policies are DAG nodes. PolicyGates are nodes in the Graph between environment steps. They are visible in the UI, inspectable via CLI, and block downstream steps until their CEL expression evaluates to true. Org-level gates are injected as mandatory dependencies that team-level Pipelines cannot remove.
-
Kubernetes is the control plane. Every object is a CRD. etcd is the database. The API server is the API. CLI, UI, and webhook endpoints are convenience layers that create and read CRDs.
-
Never mutate workload resources. The controller never creates, updates, or deletes Deployments, Services, or HTTPRoutes. All cluster state changes flow through Git. In-cluster progressive delivery is delegated to Argo Rollouts or Flagger.
-
Pluggable by default. Every integration point (SCM providers, manifest update strategies, health verification, delivery delegation, metric providers, artifact sources) is a Go interface. Phase 1 ships one implementation per interface. Adding a provider means implementing the interface and registering it.
2. Concepts¶
2.1 Bundle¶
An immutable, versioned snapshot of what to deploy. Contains one or more container image references (tag + digest), optionally a Helm chart version or Git commit SHA. Carries build provenance.
Creation paths (all produce the same Bundle CRD in etcd): - CI webhook: POST /api/v1/bundles - CLI: kardinal create bundle my-app --image ghcr.io/myorg/my-app:1.29.0 - kubectl: kubectl apply -f bundle.yaml - Subscription CRD: auto-created from registry (Phase 3)
Phases: Available, Promoting, Verified, Failed, Superseded.
The spec.intent.target field declares how far to promote (default: all environments). The spec.intent.skip field excludes environments from the promotion, subject to SkipPermission validation (Section 5.6).
2.2 Pipeline¶
A CRD listing environments with Git configuration. The controller translates this into a Graph, injecting PolicyGate nodes based on org and team policies. Users do not need to know Graph exists unless they want non-linear topologies (parallel fan-out, conditional steps), in which case they write Graphs directly.
2.3 PromotionStep¶
A CRD representing one environment promotion. Created by the Graph controller as a node in the promotion Graph. The kardinal-controller watches PromotionStep CRs and executes: Git write, PR creation, merge detection, health verification, status update.
The Graph's readyWhen on the PromotionStep node is ${dev.status.state == "Verified"}. When the step reaches Verified, Graph advances to dependent nodes.
2.4 PolicyGate¶
A CRD representing a policy check. PolicyGates are nodes in the Graph between environment steps. They block downstream promotion until their CEL expression evaluates to true.
Graph creates PolicyGate CRs. The kardinal-controller evaluates the CEL expression, writes status.ready and status.lastEvaluatedAt, and Graph reads the result via readyWhen.
Org-level PolicyGates (from the platform-policies namespace) are injected as mandatory dependencies during Pipeline-to-Graph translation. Teams cannot remove them. Teams can add their own gates from their own namespace, which are wired alongside org gates.
2.5 Subscription (Phase 3)¶
A declarative CRD that watches an OCI registry and auto-creates Bundles when new tags matching a semver constraint are discovered.
3. Architecture¶
3.1 Component Overview¶
Standalone mode (Phase 1, single binary):
kardinal-controller handles everything:
Pipeline reconciler, PromotionStep reconciler, PolicyGate reconciler
Distributed mode (Phase 2+, control plane + agents):
Control plane cluster:
Graph controller: DAG orchestration
kardinal-controller: Pipeline reconciler, PolicyGate reconciler
kardinal-ui: Embedded UI
Agent clusters (one per shard):
kardinal-agent: PromotionStep reconciler only
Labeled with --shard=<name>
Reconciles PromotionSteps matching its shard
Writes status back to control plane CRDs
UX Surfaces (convenience, not required):
CLI: kardinal promote --> patches Bundle CR
UI: kardinal-ui --> reads CRDs, renders DAG
Webhook: /api/v1/bundles --> creates Bundle CR
CI: GitHub Action --> creates Bundle CR
Kubernetes API Server (sole source of truth):
kardinal CRDs: Pipeline, Bundle, PromotionStep, PolicyGate,
MetricCheck (Phase 2), Subscription (Phase 3)
Graph CRDs: Graph, GraphRevision
Controllers:
Graph controller:
Reconciles Graph CRs
Creates child resources (PromotionStep, PolicyGate CRs) in DAG order
Handles: dependency ordering, parallel execution, includeWhen,
forEach, readyWhen, propagateWhen, scoped walks,
change detection, teardown
kardinal-controller (control plane):
Pipeline reconciler:
Watches Pipeline + Bundle CRDs
Generates per-Bundle Graph spec with PolicyGate injection
Validates intent.skip against SkipPermission gates
PolicyGate reconciler:
CEL evaluation, status.ready + status.lastEvaluatedAt
Timer-based re-evaluation at recheckInterval (see Section 3.5)
kardinal-agent (per shard, or combined into kardinal-controller in standalone mode):
PromotionStep reconciler:
Executes promotion steps (built-in and custom)
Git write, PR creation, merge detection, health check, status update
Filtered by: --shard label on PromotionStep CRs
kardinal-ui:
Embedded in controller binary (React via go:embed)
Served at /ui (separate port via --ui-listen-address)
Read-only: reads Graph, PromotionStep, PolicyGate, Bundle CRDs
3.2 Why Graph Instead of RGD¶
RGD is a resource-composition abstraction. It generates CRDs from schema definitions, watches instances of those CRDs, and manages child resources per instance. This machinery is useful for its intended purpose but is unnecessary overhead for promotion orchestration.
Building on RGD requires a shim layer: the controller generates an RGD spec from the Pipeline CRD, then the RGD controller creates a CRD, then an instance of that CRD is created, which triggers the RGD instance controller to create the actual resources. That is three levels of indirection.
Building on Graph is direct: the controller generates a Graph spec whose nodes are PromotionStep and PolicyGate CRDs. The Graph controller creates those CRDs in dependency order. Two levels: Pipeline to Graph spec, Graph spec to resources. The generated Graph looks exactly like what runs. There is no intermediate CRD generation or instance indirection.
| Aspect | With RGD | With Graph |
|---|---|---|
| Indirection levels | Pipeline to RGD to CRD to instance to resources | Pipeline to Graph to resources |
| Unused features | CRD generation, schema validation, instance controllers | None |
| Dependency ordering | Available (same CEL-based inference) | Available (same CEL-based inference) |
| readyWhen / includeWhen / forEach | Available | Available |
| Scoped walks + change detection | Available (in kro) | Available (native to Graph) |
| Status aggregation | RGD Accepted/Ready | Graph Accepted/Ready |
The feature set is equivalent. The difference is that Graph does not carry resource-composition abstractions that promotion does not use.
3.3 Promotion Flow¶
- Bundle CR created (webhook, CLI, kubectl, or Subscription).
- kardinal-controller detects new Bundle. Validates
intent.skipif present (see Section 5.6). Generates a Graph spec from the Pipeline CRD, tailored to the Bundle's intent. Injects org/team PolicyGate nodes. Creates a Graph CR owned by the Bundle viaownerReferences. - Graph controller reconciles the Graph. Resolves DAG dependencies. Creates the first PromotionStep CR (e.g., dev).
- kardinal-agent (or kardinal-controller in standalone mode) sees the PromotionStep. Executes the promotion steps defined on the environment (or the default step sequence). Steps include: Git clone, manifest update (kustomize/helm), Git commit, Git push or PR creation, merge detection, health verification. Custom steps can call external webhooks between any built-in steps.
- For PR-gated environments: waits for merge via webhook (
/webhooksendpoint). On controller startup, performs a one-time reconciliation by listing open PRs with thekardinallabel to catch any merges missed during downtime. - After merge: monitors health via the appropriate adapter (Deployment, Argo CD, Flux).
- When healthy: writes
status.state = "Verified". Copies promotion evidence (metrics, gate results, approver, timing) into Bundlestatus.environmentsfor durable audit storage. - Graph controller sees
readyWhensatisfied. Creates next nodes (PolicyGates, then PromotionStep for the next environment). - kardinal-controller evaluates PolicyGates: CEL expression against promotion context. Writes
status.readyandstatus.lastEvaluatedAt. - Graph controller sees all PolicyGates ready. Creates next PromotionStep.
- Repeat until all environments are Verified or a step fails.
- On failure:
status.state = "Failed". Graph stops all downstream nodes. kardinal-controller opens rollback PRs for affected environments.
3.4 Per-Bundle Graph Lifecycle¶
Each Bundle gets its own Graph CR. The Graph spec is generated from the Pipeline CRD and tailored to the Bundle's intent (target, skip). The Graph is owned by the Bundle via ownerReferences.
| Event | Behavior |
|---|---|
| Bundle created | Controller validates intent, generates Graph, creates Graph CR |
| Promotion completes | Graph remains (audit record). Bundle phase set to Verified. |
| Promotion fails | Graph remains (failed nodes visible in UI). Bundle phase set to Failed. |
| Superseded by newer Bundle | Old Graph deleted via Bundle ownerRef cascade. Old Bundle set to Superseded. Pinned Bundles (kardinal.io/pin: "true") are not superseded. |
| History limit exceeded | Oldest Bundles and their Graphs are garbage-collected. |
Evidence (metrics, gate results, approver, timing) is copied into Bundle status.environments at verification time (Step 7 above). When the Graph and its PromotionStep CRs are garbage-collected, the evidence survives on the Bundle. When the Bundle itself is garbage-collected (history limit), the Git PRs remain as the permanent audit trail.
3.5 PolicyGate Blocking and re-evaluation — propagateWhen, not readyWhen¶
Critical note (2026-04-09, verified against krocodile/experimental): The Graph primitive's
readyWhenis a health signal only. It feeds the Graph's aggregatedReadycondition but does not gate downstream execution. Dependents proceed as soon as a node's data is in scope, regardless ofreadyWhen. The field that gates downstream data flow ispropagateWhen.kardinal-promoter's PolicyGate blocking model therefore uses
propagateWhenon the upstream PromotionStep (or on the PolicyGate node itself) to hold back the downstream PromotionStep until the gate passes. This is the correct mechanism.readyWhenis only used for the UI health signal and the Graph'sReadycondition rollup.
How PolicyGate blocking works:
The PolicyGate node for environment prod is created between the staging PromotionStep and the prod PromotionStep. The staging PromotionStep node carries:
This means: once staging is Verified, its data flows to downstream nodes (PolicyGates and prod PromotionStep). Until then, downstream nodes wait.
Each PolicyGate node additionally carries propagateWhen on itself:
propagateWhen:
- ${noWeekendDeploys.status.ready == true}
- ${timestamp(noWeekendDeploys.status.lastEvaluatedAt) > now() - duration("10m")} # freshness
The prod PromotionStep depends on the PolicyGate nodes via CEL references. Because the PolicyGate node's propagateWhen is unsatisfied, its data does not flow to the prod PromotionStep. The prod PromotionStep retains its previous (Pending) state until all gates propagate.
readyWhen on the PolicyGate node still exists as a health signal:
This feeds the Graph's Ready condition — visible in kubectl get graph and the kardinal-ui.
Re-evaluation for time-based gates:
Graph reconciles on Kubernetes watch events, not timers. A no-weekend-deploys gate that evaluates !schedule.isWeekend will have status.ready = false on Friday evening. On Monday morning, no cluster state changes, so no watch event fires.
The Phase 1 workaround: the PolicyGate reconciler writes status.lastEvaluatedAt at every recheckInterval. This status write triggers a Graph watch event. Graph re-evaluates the propagateWhen expression on the PolicyGate node (which includes the freshness check). When the gate passes on Monday, status.ready becomes true and propagateWhen is satisfied, so the PolicyGate's data propagates to the prod PromotionStep.
At 50 pipelines × 2 gates × recheckInterval: 5m = 20 writes/minute. Acceptable at Phase 1 scale.
Proposed Graph contribution: Native recheckAfter hint on a node so Graph re-evaluates without requiring the PolicyGate reconciler to write a status update purely to trigger a watch event. This would reduce write overhead at scale. Tracked as a contribution (Section 17).
Summary of readyWhen vs propagateWhen in kardinal-promoter:
| Field | Where used | Purpose |
|---|---|---|
readyWhen | PolicyGate nodes, PromotionStep nodes | Health signal for UI and Graph Ready condition. Does NOT gate downstream execution. |
propagateWhen | PromotionStep nodes (upstream data-flow gate), PolicyGate nodes (gate data flow to prod step) | Gates when a node's data flows to dependents. This is what blocks the downstream PromotionStep. |
3.6 Dependency Edges in Generated Graphs¶
Graph infers dependency edges from CEL ${} references between nodes. If node B's template contains ${A.status.state}, B depends on A.
The current Graph implementation does not support an explicit dependsOn field on nodes. To express ordering between PromotionSteps and PolicyGates, the controller uses two mechanisms:
For PromotionSteps: Each step's template includes a spec.upstreamVerified field that references the upstream step's status: ${dev.status.state}. This field is consumed by the PromotionStep reconciler (it checks that the upstream is Verified before proceeding), so it serves both a semantic purpose and creates the dependency edge.
For PolicyGates: Each gate's template includes a spec.upstreamEnvironment field referencing the upstream PromotionStep: ${staging.status.state}. This tells the PolicyGate reconciler which environment's verification to check for soak time calculations, and it creates the dependency edge.
For downstream PromotionSteps after PolicyGates: The prod PromotionStep's template includes spec.requiredGates referencing the gate statuses: ["${noWeekendDeploys.metadata.name}", "${stagingSoak.metadata.name}"]. This creates fan-in edges from multiple gates to the step.
These fields are not synthetic placeholders. They carry information the reconcilers use. The dependency edge inference is a secondary effect of data that is already needed.
A native dependsOn on Graph nodes would be cleaner. This is tracked as a proposed contribution (Section 17).
3.7 UX Surface Roles¶
The system has three surfaces that show promotion state. Each serves a different purpose:
| Surface | Role | What it shows | When to use |
|---|---|---|---|
| GitHub PR | Approval surface | Snapshot at PR creation + periodic comment updates. Artifact diff, provenance, policy gate table, upstream verification. | Reviewing and approving a promotion. |
| kardinal-ui | Monitoring surface | Live Graph state from etcd. DAG visualization with per-node status. Bundle list, PR links, policy evaluation details. | Watching promotion progress across all environments. |
| kubectl / CLI | Debug surface | Raw CRD status. kardinal explain shows PolicyGate evaluation trace. | Diagnosing why a promotion is stuck. |
The PR comment is a snapshot. It is updated when gate states change (via webhook), but between updates it can be stale relative to the live CRD state. The UI and kubectl always show live state. Authorization for PR approval is governed by Git CODEOWNERS. Authorization for CRD operations is governed by Kubernetes RBAC. These are independent; the controller does not attempt to reconcile them.
3.8 GitOps Tool Integration¶
Health verification adapters (pluggable, auto-detected):
| Adapter | What it watches | Healthy when | Phase |
|---|---|---|---|
resource (default) | Deployment.status.conditions | Available=True, not stalled | Phase 1 |
argocd | Application.status.health + .sync | Healthy AND Synced | Phase 1 |
flux | Kustomization.status.conditions | Ready=True, generation match | Phase 1 |
argoRollouts | Rollout.status.phase | Healthy | Phase 2 |
flagger | Canary.status.phase | Succeeded | Phase 2 |
On startup, the controller calls Available() on each registered adapter to check for installed CRDs. Per-environment health type can be set explicitly or auto-detected (priority: argocd, flux, resource).
Kargo's promotion steps operate on Git directly and can technically work without Argo CD for the promotion mechanics (git-clone, git-push, kustomize-set-image are Argo CD-independent). However, Kargo's health verification and Freight tracking are deeply integrated with Argo CD Application status. kardinal-promoter's health adapters provide equivalent depth of integration for both Argo CD and Flux users without requiring either.
3.9 Multi-Cluster Promotion¶
Promotion is a Git write. Git has no cluster boundary. Each cluster has its own GitOps tool syncing its own directory from the shared Git repository.
For health verification across clusters:
- Argo CD hub-spoke model: Argo CD Applications for all clusters live in the hub. The controller reads Application health from the hub cluster. No cross-cluster API calls needed.
- Flux per-cluster model: Each cluster runs its own Flux. Health verification uses a
clusterfield on the environment config that references a kubeconfig Secret for the remote cluster. - Bare Kubernetes: Same as Flux. Remote kubeconfig Secret.
Parallel fan-out to multiple regions is expressed via dependsOn on the Pipeline CRD:
environments:
- name: staging
- name: prod-us-east
dependsOn: [staging]
approval: pr-review
- name: prod-eu-west
dependsOn: [staging]
approval: pr-review
Both prod environments depend on staging. The generated Graph creates them as parallel nodes. Graph executes them concurrently.
3.10 Bundle Creation¶
Four paths, one result. All create a Bundle CRD in etcd.
| Path | Mechanism | When to use |
|---|---|---|
| CI webhook | POST /api/v1/bundles with JSON body | CI pipelines (GitHub Actions, GitLab CI) |
| CLI | kardinal create bundle my-app --image ... | Manual or scripted |
| kubectl | kubectl apply -f bundle.yaml | Fully declarative GitOps |
| Subscription | Controller auto-creates from registry | Passive trigger, no CI changes (Phase 3) |
3.11 kardinal-ui¶
Embedded in the controller binary. React frontend bundled via go:embed. Served at /ui. Read-only; all mutations go through CRDs.
For environments with network segmentation requirements, --ui-listen-address separates the UI onto a different port from the API endpoints (/api/v1/bundles, /webhooks, /metrics).
Features: promotion DAG with per-node state (PromotionStep: green/amber/red; PolicyGate: pass/fail/pending), Bundle list with provenance, PR links, policy evaluation details, pipeline overview.
3.12 Git Caching¶
Shared work trees per repo URL at /var/cache/kardinal/<repo-hash>/. Initial access: shallow clone (--depth 1). Subsequent: git fetch --depth 1 && git reset --hard. Mutex-serialized per work tree. Cache invalidated on repo URL change, 24h age, or Git operation failure.
3.13 Controller Metrics¶
The controller exposes a /metrics endpoint in Prometheus format.
| Metric | Type | Description |
|---|---|---|
kardinal_bundles_created_total | Counter | Bundles created (by pipeline, source) |
kardinal_promotions_opened_total | Counter | PRs opened (by env, approval type) |
kardinal_promotions_merged_total | Counter | PRs merged (by env) |
kardinal_promotions_failed_total | Counter | Promotions failed (by env, reason) |
kardinal_rollbacks_total | Counter | Rollbacks triggered (by env, trigger type) |
kardinal_policy_gates_evaluated_total | Counter | PolicyGate evaluations (by result) |
kardinal_policy_gates_blocked_total | Counter | Promotions blocked by gate (by gate name) |
kardinal_health_check_duration_seconds | Histogram | Health check latency |
kardinal_git_operation_duration_seconds | Histogram | Git operation latency |
kardinal_scm_api_requests_total | Counter | SCM API calls (by endpoint, status) |
kardinal_promotion_lead_time_seconds | Histogram | Time from Bundle creation to environment verification |
3.14 Distributed Controller Architecture¶
The architecture separates orchestration (what to promote where) from execution (how to promote). This separation is the natural sharding boundary for distributed deployments.
Standalone mode (Phase 1): A single kardinal-controller binary handles everything. The PromotionStep reconciler runs in the same process as the Pipeline reconciler and PolicyGate reconciler. Suitable for single-cluster or small multi-cluster deployments.
Distributed mode (Phase 2+): The controller splits into two binaries:
- kardinal-controller runs in the control plane cluster. Handles Pipeline reconciliation, Graph generation, PolicyGate evaluation, and Bundle lifecycle. Does not execute promotion steps.
- kardinal-agent runs in each agent cluster (one per shard). Handles PromotionStep reconciliation only. Executes Git writes, PRs, health checks. Connects to the control plane's API server to watch and update PromotionStep CRDs.
Sharding: Each Pipeline environment has an optional shard field. Each agent is started with --shard=<name>. The agent only reconciles PromotionSteps whose kardinal.io/shard label matches its shard name. PromotionSteps without a shard label are reconciled by the control plane controller (standalone behavior).
environments:
- name: prod-eu
shard: eu-cluster # reconciled by the agent in the EU cluster
path: env/prod-eu
approval: pr-review
Why this design:
| Concern | How it's addressed |
|---|---|
| Firewalls | Agents run behind firewalls. They connect outbound to the control plane API server. The control plane has no privileged access to agent clusters. |
| Scalability | Each agent handles a subset of PromotionSteps. The control plane handles decisions (PolicyGates, Graph generation) which are lightweight. |
| Security | Agents only need Git and SCM credentials for their shard's repositories. The control plane does not hold credentials for workload clusters. |
| Observability | All CRDs (including agent-updated PromotionStep status) live in the control plane. The UI shows all shards from one place. |
| Upgrade path | Phase 1 single binary works unchanged. Splitting into controller + agent is a deployment change, not a code change. The PromotionStep reconciler is the same code in both modes. |
Agent configuration:
kardinal-agent \
--shard=eu-cluster \
--control-plane-kubeconfig=/etc/kardinal/kubeconfig \
--git-cache-dir=/var/cache/kardinal
The agent uses the control plane kubeconfig to watch and update PromotionStep CRDs. It uses local cluster credentials (or kubeconfig Secrets) for health checks.
3.15 Pluggable Promotion Steps¶
The PromotionStep reconciler executes a sequence of steps for each promotion. Steps are configurable per environment.
Default step sequence (when steps is not specified): The controller infers the steps from update.strategy and approval:
approval: auto+strategy: kustomize: git-clone, kustomize-set-image, git-commit, git-push, health-checkapproval: pr-review+strategy: kustomize: git-clone, kustomize-set-image, git-commit, git-push, open-pr, wait-for-merge, health-checkapproval: pr-review+strategy: helm: git-clone, helm-set-image, git-commit, git-push, open-pr, wait-for-merge, health-check
Explicit step sequence (when steps is specified): Users define the exact steps to run, including custom steps.
environments:
- name: prod
approval: pr-review
steps:
- uses: git-clone
- uses: kustomize-set-image
- uses: run-tests
config:
url: https://test-runner.internal/validate
timeout: 5m
- uses: git-commit
- uses: git-push
- uses: open-pr
- uses: wait-for-merge
- uses: health-check
Built-in steps:
| Step | What it does | Phase |
|---|---|---|
git-clone | Clone the GitOps repo from the Git cache | Phase 1 |
kustomize-set-image | Run kustomize edit set-image in the environment directory | Phase 1 |
kustomize-build | Run kustomize build and write rendered output (Rendered Manifests pattern) | Phase 1 |
config-merge | Apply config-only Bundle's Git commit via cherry-pick or overlay (see Section 3.16) | Phase 2 |
helm-set-image | Patch image tag in Helm values.yaml | Phase 2 |
git-commit | Commit changes with structured message | Phase 1 |
git-push | Push to the target branch (for auto approval) | Phase 1 |
open-pr | Open a PR with promotion evidence | Phase 1 |
wait-for-merge | Wait for PR merge via webhook | Phase 1 |
health-check | Verify health via the configured adapter | Phase 1 |
Custom steps (Phase 2): Any step with a uses value that doesn't match a built-in step is treated as a custom step. The controller sends an HTTP POST to the configured url with a JSON payload containing the Bundle, environment, and promotion context. The endpoint returns a pass/fail result.
// Custom step webhook request
type StepRequest struct {
Pipeline string `json:"pipeline"`
Environment string `json:"environment"`
Bundle BundleSpec `json:"bundle"`
Context map[string]any `json:"context"`
Config map[string]any `json:"config"`
}
// Custom step webhook response
type StepResponse struct {
Success bool `json:"success"`
Message string `json:"message"`
}
If a custom step returns success: false, the PromotionStep is marked as Failed and the promotion halts.
Backwards compatibility: When steps is omitted from an environment, the default step sequence is used. Existing Pipeline definitions continue to work unchanged.
3.16 Config-Only Promotions¶
Bundles support two types of artifacts: images (default) and Git commits (config-only).
Image Bundles (existing behavior): Reference container image tags. The promotion updates image references in manifests.
Config Bundles (new): Reference a Git commit SHA from a configuration repository. The promotion merges that commit's changes into each environment directory. This supports promoting configuration changes (resource limits, env vars, feature flags) independently from image changes.
apiVersion: kardinal.io/v1alpha1
kind: Bundle
metadata:
name: my-app-config-update-1712567890
labels:
kardinal.io/pipeline: my-app
spec:
type: config # "image" (default) or "config"
artifacts:
gitCommit:
repository: https://github.com/myorg/app-config
sha: "abc123def456"
message: "Update resource limits for all environments"
provenance:
commitSHA: "abc123def456"
ciRunURL: "https://github.com/myorg/app-config/actions/runs/67890"
author: "platform-team"
When the PromotionStep reconciler processes a config Bundle, the step sequence is: 1. git-clone: Clone the GitOps repo and the config source repo 2. config-merge: Cherry-pick or overlay the config commit's changes into the environment directory 3. git-commit, git-push / open-pr, wait-for-merge, health-check (same as image Bundles)
Config-merge strategies: - cherry-pick (default): Apply the referenced commit as a cherry-pick into the environment directory - overlay: Copy changed files from the source repo path into the environment directory
Config Bundles go through the same Pipeline, same PolicyGates, same PR flow as image Bundles. The only difference is the update step.
Mixed Bundles (Phase 3): A single Bundle can reference both images and a Git commit, allowing coordinated promotion of an image update alongside a config change.
Subscription CRD for Git (Phase 3): The Subscription CRD supports type: gitCommit to watch a Git repository for new commits and auto-create config Bundles:
apiVersion: kardinal.io/v1alpha1
kind: Subscription
metadata:
name: my-app-config-watch
spec:
pipeline: my-app
source:
type: gitCommit
gitCommit:
repository: https://github.com/myorg/app-config
branch: main
path: configs/my-app/ # only watch changes in this path
interval: 5m
4. Pluggable Architecture¶
Every integration point is a Go interface. Phase 1 ships one implementation per interface. Adding a provider means implementing the interface and registering it in the provider registry.
4.1 Extension Points¶
SCM Provider (pkg/scm/provider.go): Handles Git operations and PR lifecycle.
type GitClient interface {
Clone(ctx context.Context, url string, opts CloneOptions) (*Repository, error)
Push(ctx context.Context, repo *Repository, branch string) error
}
type SCMProvider interface {
CreatePR(ctx context.Context, opts PROptions) (*PullRequest, error)
GetPRStatus(ctx context.Context, repo string, prNumber int) (PRStatus, error)
MergePR(ctx context.Context, repo string, prNumber int) error
UpdatePRComment(ctx context.Context, repo string, prNumber int, body string) error
ValidateWebhook(ctx context.Context, headers http.Header, body []byte, secret string) (bool, error)
}
Git operations are provider-agnostic (clone/push work the same for all providers). PR operations are provider-specific. Splitting the interface allows reusing one Git implementation across providers.
Behavioral contracts: CreatePR must be idempotent (if a PR already exists for the same branch, return it rather than creating a duplicate). GetPRStatus should retry on transient 404s (GitHub has a known ~500ms indexing delay after PR creation). ValidateWebhook must be constant-time to prevent timing attacks on HMAC verification.
| Implementation | Phase |
|---|---|
github | Phase 1 |
gitlab | Phase 2 |
bitbucket | Phase 3+ |
Manifest Update Strategy (pkg/update/strategy.go): Rewrites artifact references in manifests.
type Strategy interface {
Update(ctx context.Context, dir string, artifacts []Artifact) error
Name() string
}
Behavioral contract: Update must be idempotent. Re-running after a crash must produce the same output. Implementations must not create new files or delete existing ones, only modify in place.
| Implementation | Phase | Mechanism |
|---|---|---|
kustomize | Phase 1 | kustomize edit set-image |
helm | Phase 2 | Patch configurable path in values.yaml |
replace | Phase 3 | Regex replacement in raw YAML |
oci | Phase 4+ | Pull OCI artifact, unpack (for kro OCI artifacts when shipped) |
Health Verification Adapter (pkg/health/adapter.go): Verifies the GitOps tool applied the change.
type Adapter interface {
Check(ctx context.Context, opts CheckOptions) (HealthStatus, error)
Name() string
Available(ctx context.Context, client dynamic.Interface) (bool, error)
}
Behavioral contract: Check must not cache results across calls; each call must reflect current cluster state. Available must only check for CRD existence, not for specific resource instances.
Delivery Delegate (pkg/delivery/delegate.go): Watches progressive delivery status.
type Delegate interface {
Watch(ctx context.Context, opts WatchOptions) (<-chan RolloutStatus, error)
Name() string
Available(ctx context.Context, client dynamic.Interface) (bool, error)
}
| Implementation | Phase |
|---|---|
none (default) | Phase 1 |
argoRollouts | Phase 2 |
flagger | Phase 2 |
Metric Provider (pkg/metrics/provider.go): Queries observability systems.
type Provider interface {
Query(ctx context.Context, query string, vars map[string]string) (float64, error)
Name() string
}
Artifact Source (pkg/source/watcher.go): Watches registries for new artifacts.
type Watcher interface {
Discover(ctx context.Context, constraint string) ([]ArtifactVersion, error)
Name() string
}
Promotion Step (pkg/steps/step.go): A single promotion step in the step sequence.
type Step interface {
Execute(ctx context.Context, state *StepState) (StepResult, error)
Name() string
}
type StepState struct {
Pipeline PipelineSpec
Environment EnvironmentSpec
Bundle BundleSpec
WorkDir string // local Git work tree path
Outputs map[string]any // outputs from previous steps
}
type StepResult struct {
Success bool
Message string
Outputs map[string]any // passed to subsequent steps
}
Built-in implementations: git-clone, kustomize-set-image, helm-set-image, kustomize-build, config-merge, git-commit, git-push, open-pr, wait-for-merge, health-check.
Custom steps: any uses value not matching a built-in step is dispatched as an HTTP POST to the configured config.url. The webhook receives a StepRequest and returns a StepResponse.
4.2 Provider Selection¶
The PromotionStep reconciler selects providers from CRD spec fields:
scm.GitClient + scm.SCMProvider <-- pipeline.spec.git.provider ("github")
update.Strategy <-- env.update.strategy ("kustomize") [used by default steps]
health.Adapter <-- env.health.type ("argocd", or auto-detected)
delivery.Delegate <-- env.delivery.delegate ("argoRollouts")
steps.Step <-- env.steps[].uses ("kustomize-set-image", "open-pr", etc.)
The provider registry is a map from string identifier to constructor. Adding a provider: implement the interface, add registry["gitea"] = gitea.New to the registry initialization.
5. CRD API¶
5.1 Pipeline¶
apiVersion: kardinal.io/v1alpha1
kind: Pipeline
metadata:
name: my-app
namespace: default
spec:
git:
url: https://github.com/myorg/gitops-repo
branch: main
layout: directory
provider: github
secretRef: { name: github-token }
environments:
- name: dev
path: environments/dev
update: { strategy: kustomize }
approval: auto
- name: staging
path: environments/staging
update: { strategy: kustomize }
approval: auto
- name: prod
path: environments/prod
update: { strategy: kustomize }
approval: pr-review
shard: prod-cluster # optional: agent shard for distributed mode
health:
timeout: 15m
delivery:
delegate: argoRollouts
# steps: [...] # optional: override default promotion steps
historyLimit: 20
Health check defaults: kind: Deployment, name: <Pipeline.metadata.name>, namespace: <environment.name>, condition: Available, timeout: 10m. Auto-detects Argo CD or Flux.
Health check types:
health:
type: argocd
argocd: { name: my-app-prod, namespace: argocd }
# or
health:
type: flux
flux: { name: my-app-prod, namespace: flux-system }
# or for remote clusters:
health:
type: argocd
argocd: { name: my-app-prod-us-east }
cluster: prod-us-east # references a kubeconfig Secret
Environment ordering defaults to sequential. For parallel fan-out, use dependsOn:
environments:
- name: staging
- name: prod-us-east
dependsOn: [staging]
approval: pr-review
- name: prod-eu-west
dependsOn: [staging]
approval: pr-review
5.2 PolicyGate¶
Users define PolicyGate templates. The controller creates per-Bundle instances as Graph nodes.
apiVersion: kardinal.io/v1alpha1
kind: PolicyGate
metadata:
name: no-weekend-deploys
namespace: platform-policies
labels:
kardinal.io/scope: org
kardinal.io/applies-to: prod
kardinal.io/type: gate
spec:
expression: "!schedule.isWeekend"
message: "Production deployments are blocked on weekends"
recheckInterval: 5m
Per-Bundle instance status (written by kardinal-controller):
CEL context attributes available by phase:
| Phase | Attributes |
|---|---|
| Phase 1 | bundle.version, bundle.labels.*, bundle.provenance.*, bundle.intent.*, schedule.isWeekend, schedule.hour, schedule.dayOfWeek, environment.name, environment.approval |
| Phase 2 | Phase 1 + metrics.*, bundle.upstreamSoakMinutes, previousBundle.version |
| Phase 3 | Phase 2 + delegation.status, externalApproval.* |
| Phase 4+ | Phase 3 + contracts.*, targetDrift.unreconciled |
Referencing attributes from a later phase causes a CEL evaluation error. The gate fails closed.
5.3 Generated Graph¶
For a Pipeline [dev, staging, prod] with two org prod gates, the controller generates:
apiVersion: experimental.kro.run/v1alpha1
kind: Graph
metadata:
name: my-app-v1-29-0
ownerReferences:
- apiVersion: kardinal.io/v1alpha1
kind: Bundle
name: my-app-v1-29-0-1712567890
spec:
nodes:
- id: bundle
template:
apiVersion: kardinal.io/v1alpha1
kind: Bundle
metadata:
name: my-app-v1-29-0-1712567890
- id: dev
readyWhen:
- ${dev.status.state == "Verified"}
template:
apiVersion: kardinal.io/v1alpha1
kind: PromotionStep
metadata:
name: my-app-v1-29-0-dev
spec:
pipeline: my-app
environment: dev
bundleRef: ${bundle.metadata.name}
path: environments/dev
git:
url: https://github.com/myorg/gitops-repo
provider: github
secretRef: github-token
update: { strategy: kustomize }
approval: auto
- id: staging
readyWhen:
- ${staging.status.state == "Verified"}
template:
apiVersion: kardinal.io/v1alpha1
kind: PromotionStep
metadata:
name: my-app-v1-29-0-staging
spec:
pipeline: my-app
environment: staging
bundleRef: ${bundle.metadata.name}
path: environments/staging
git:
url: https://github.com/myorg/gitops-repo
provider: github
secretRef: github-token
update: { strategy: kustomize }
approval: auto
upstreamVerified: ${dev.status.state}
- id: noWeekendDeploys
readyWhen:
- ${noWeekendDeploys.status.ready == true}
- ${timestamp(noWeekendDeploys.status.lastEvaluatedAt) > now() - duration("10m")}
template:
apiVersion: kardinal.io/v1alpha1
kind: PolicyGate
metadata:
name: my-app-v1-29-0-no-weekend-deploys
spec:
expression: "!schedule.isWeekend"
message: "Production deployments are blocked on weekends"
recheckInterval: 5m
upstreamEnvironment: ${staging.status.state}
- id: stagingSoak
readyWhen:
- ${stagingSoak.status.ready == true}
- ${timestamp(stagingSoak.status.lastEvaluatedAt) > now() - duration("2m")}
template:
apiVersion: kardinal.io/v1alpha1
kind: PolicyGate
metadata:
name: my-app-v1-29-0-staging-soak
spec:
expression: "bundle.upstreamSoakMinutes >= 30"
message: "Bundle must soak in staging for at least 30 minutes"
recheckInterval: 1m
upstreamEnvironment: ${staging.status.state}
- id: prod
readyWhen:
- ${prod.status.state == "Verified"}
template:
apiVersion: kardinal.io/v1alpha1
kind: PromotionStep
metadata:
name: my-app-v1-29-0-prod
spec:
pipeline: my-app
environment: prod
bundleRef: ${bundle.metadata.name}
path: environments/prod
git:
url: https://github.com/myorg/gitops-repo
provider: github
secretRef: github-token
update: { strategy: kustomize }
approval: pr-review
health:
type: argocd
argocd: { name: my-app-prod }
timeout: 15m
delivery:
delegate: argoRollouts
requiredGates:
- ${noWeekendDeploys.metadata.name}
- ${stagingSoak.metadata.name}
The upstreamVerified, upstreamEnvironment, and requiredGates fields serve dual purposes: they carry data the reconcilers need, and they create the CEL references that Graph uses for dependency inference.
5.4 PromotionStep¶
apiVersion: kardinal.io/v1alpha1
kind: PromotionStep
metadata:
name: my-app-v1-29-0-prod
spec:
pipeline: my-app
environment: prod
path: environments/prod
bundleRef: my-app-v1-29-0-1712567890
git:
url: https://github.com/myorg/gitops-repo
provider: github
secretRef: github-token
update: { strategy: kustomize }
approval: pr-review
health:
type: argocd
argocd: { name: my-app-prod, namespace: argocd }
timeout: 15m
delivery:
delegate: argoRollouts
upstreamVerified: "Verified"
requiredGates: ["my-app-v1-29-0-no-weekend-deploys", "my-app-v1-29-0-staging-soak"]
status:
state: Verified
prURL: "https://github.com/myorg/gitops-repo/pull/144"
mergedAt: "2026-04-09T11:15:00Z"
verifiedAt: "2026-04-09T11:20:00Z"
evidence:
metrics: { success-rate: 0.997 }
gateDuration: "4m12s"
approvedBy: ["alice"]
policyGates:
- name: no-weekend-deploys
result: pass
- name: staging-soak
result: pass
delegatedTo: argoRollouts
delegatedStatus: Healthy
5.5 Bundle¶
apiVersion: kardinal.io/v1alpha1
kind: Bundle
metadata:
name: my-app-v1-29-0-1712567890
namespace: default
labels:
kardinal.io/pipeline: my-app
annotations:
kardinal.io/pin: "false"
spec:
type: image # "image" (default) or "config"
artifacts:
images:
- name: my-app
reference: ghcr.io/myorg/my-app:1.29.0
digest: sha256:a1b2c3d4e5f6...
provenance:
commitSHA: "abc123def456"
ciRunURL: "https://github.com/myorg/my-app/actions/runs/12345"
author: "dependabot[bot]"
buildTimestamp: "2026-04-09T10:00:00Z"
intent:
target: prod
status:
phase: Verified
graphRef: my-app-v1-29-0
environments:
dev:
state: Verified
promotedAt: "2026-04-09T10:05:00Z"
verifiedAt: "2026-04-09T10:08:00Z"
staging:
state: Verified
promotedAt: "2026-04-09T10:10:00Z"
verifiedAt: "2026-04-09T10:18:00Z"
evidence:
metrics: { success-rate: 0.997 }
gateDuration: "8m"
approvedBy: []
policyGates:
- name: no-weekend-deploys
result: pass
- name: staging-soak
result: pass
prod:
state: Verified
promotedAt: "2026-04-09T10:20:00Z"
verifiedAt: "2026-04-09T10:35:00Z"
prURL: "https://github.com/myorg/gitops-repo/pull/144"
evidence:
metrics: { success-rate: 0.998 }
gateDuration: "15m"
approvedBy: ["alice"]
policyGates:
- name: no-weekend-deploys
result: pass
- name: staging-soak
result: pass
Evidence is copied from PromotionStep status.evidence into Bundle status.environments[].evidence at verification time. This ensures the audit record survives PromotionStep garbage collection.
5.6 Bundle Intent: Skip Enforcement¶
Before generating the Graph, the controller validates intent.skip:
- For each skipped environment, collect org-level PolicyGates with
kardinal.io/applies-tomatching that environment. - If any org gate applies to a skipped environment, the skip is denied unless a SkipPermission PolicyGate exists and evaluates to true:
apiVersion: kardinal.io/v1alpha1
kind: PolicyGate
metadata:
name: allow-staging-skip-for-hotfix
namespace: platform-policies
labels:
kardinal.io/scope: org
kardinal.io/type: skip-permission
kardinal.io/applies-to: staging
spec:
expression: "bundle.labels.hotfix == true"
message: "Hotfix bundles may skip staging"
SkipPermission is evaluated synchronously before Graph creation. If denied, Bundle status is set to SkipDenied with a reason message.
5.7 MetricCheck (Phase 2)¶
apiVersion: kardinal.io/v1alpha1
kind: MetricCheck
metadata:
name: success-rate
spec:
provider: { type: prometheus, address: "http://prometheus:9090" }
query: |
sum(rate(http_requests_total{app="{{ .AppName }}",status!~"5.."}[5m]))
/ sum(rate(http_requests_total{app="{{ .AppName }}"}[5m]))
successCondition: ">= 0.99"
failureCondition: "< 0.95"
noDataBehavior: hold
5.8 kardinal init¶
Input:
app: my-app
image: ghcr.io/myorg/my-app
git:
url: https://github.com/myorg/gitops-repo
secret: github-token
environments: [dev, staging, prod]
prodApproval: pr-review
Output:
$ kardinal init -f kardinal.yaml
Pipeline "my-app" created (3 environments: dev, staging, prod)
2 org PolicyGates will be injected: no-weekend-deploys, staging-soak
Next: Add to your CI:
kardinal create bundle my-app --image ghcr.io/myorg/my-app:$TAG
6. CLI¶
Phase 1¶
| Command | Description |
|---|---|
kardinal init [-f config.yaml] | Generate Pipeline from simple config |
kardinal get pipelines | List Pipelines with current Bundle per environment |
kardinal get steps <pipeline> | Show PromotionSteps and PolicyGates with state |
kardinal get bundles <pipeline> | List Bundles with provenance |
kardinal create bundle <pipeline> | Create Bundle CRD |
kardinal promote <pipeline> --env <env> | Manually trigger promotion |
kardinal explain <pipeline> --env <env> | PolicyGate evaluation trace with current values |
kardinal explain <pipeline> --env <env> --watch | Continuous output, re-evaluates on change |
kardinal version | CLI and controller versions |
Phase 2¶
| Command | Description |
|---|---|
kardinal rollback <pipeline> --env <env> | Promote previous Bundle |
kardinal rollback ... --emergency | Emergency rollback with priority label |
kardinal pause <pipeline> | Inject PolicyGate with expression: "false" |
kardinal resume <pipeline> | Remove freeze gate |
kardinal history <pipeline> | Promotion history with evidence |
kardinal policy list | List PolicyGate CRDs |
kardinal policy test <file> | Validate CEL against context schema |
kardinal policy simulate --env prod --time "Saturday 3pm" | Simulate gate evaluation against hypothetical context |
kardinal diff <bundle-a> <bundle-b> | Artifact diff |
kardinal explain example output:
PROMOTION: my-app / prod
Bundle: v1.29.0
POLICY GATES:
no-weekend-deploys [org] PASS schedule.isWeekend = false
staging-soak [org] FAIL bundle.upstreamSoakMinutes = 12 (threshold: >= 30)
ETA: ~18 minutes (based on staging verifiedAt)
RESULT: BLOCKED by staging-soak
Phase 2 adds ETA estimation for time-based gates by computing the delta between the threshold and the current value against the known staging verification timestamp.
7. PR Approval Flow¶
For approval: pr-review environments, the controller opens a PR with a structured body:
## Promotion: my-app v1.29.0 to prod
### Policy Gates
| Gate | Scope | Status | Detail |
|---|---|---|---|
| no-weekend-deploys | org | PASS | Tuesday 14:00 UTC |
| staging-soak | org | PASS | Soak: 45m (min: 30m) |
### Artifact
| Field | Value |
|---|---|
| Image | ghcr.io/myorg/my-app:1.29.0 |
| Digest | sha256:a1b2c3d4 |
| Source Commit | abc123d |
| CI Run | Build #12345 |
### Upstream Verification
| Environment | Verified | Soak |
|---|---|---|
| dev | 2h ago | n/a |
| staging | 45m ago | 45m |
### Changes
ghcr.io/myorg/my-app: 1.28.0 to 1.29.0
For approval: auto environments: direct push to the target branch, no PR. Add pr: true on the environment config if an audit trail PR is desired.
Merge detection: webhook-based (/webhooks endpoint). On controller startup, the controller lists all open PRs with the kardinal label and reconciles any that were merged during downtime. There is no periodic polling.
Mid-flight policy changes: new PolicyGates added after a Graph is created do not apply to that Graph. They apply to all subsequent Bundles. Use kardinal pause to block in-flight promotions.
8. Rollback¶
Rollback creates a new Bundle whose spec.artifacts point to the previous verified version. This Bundle runs through the same pipeline, same Graph generation, same PolicyGate evaluation, same PR flow. There is no separate rollback mechanism.
kardinal rollback <pipeline> --env prod creates a Bundle with the prior version and intent.target: prod. The kardinal/rollback label is added to the PR for visibility.
On in-flight failure (PromotionStep status.state = "Failed"), Graph stops all downstream nodes. The controller opens rollback PRs for environments that received the failed Bundle.
9. Comparison¶
| Dimension | kardinal-promoter | Kargo (v1.9.5) | GitOps Promoter (v0.26.3) |
|---|---|---|---|
| Pipeline model | Graph DAG (fan-out, conditional, forEach) | Stage dependencies (configurable topology) | Linear branch promotion |
| Policy governance | PolicyGate Graph nodes (CEL), visible in UI | None shipped; approval policies planned (#3440) | Commit statuses |
| GitOps integration | Auto-detects Argo CD, Flux, bare K8s | Argo CD for health verification; Git steps are provider-agnostic | Any (commit statuses) |
| PR approval | Promotion evidence (provenance, upstream metrics, policy table) | git-open-pr / wait-for-pr steps (since v1.8.0) | PR with branch diff |
| Artifact bundling | Bundle CRD (images, Helm, Git ref, provenance) | Freight (images, charts, Git refs, digests) | None (raw Git diff) |
| Build provenance | Commit SHA, CI run, author, digest | Image digests | None |
| Rollback | Forward promotion of prior Bundle through same pipeline | Re-promote prior Freight; AnalysisTemplate verification | Revert PR manually |
| Multi-cluster | Argo CD hub-spoke, Flux via kubeconfig, bare K8s via kubeconfig | Argo CD hub-spoke | Branch structure |
| Artifact source watching | Phase 3 (Subscription CRD). Phase 1 uses CI webhook. | Day 1 (Warehouse auto-discovers images, charts, Git) | Commit on source branch |
| Git providers | GitHub (Phase 1), GitLab (Phase 2) | GitHub, GitLab, Gitea, BitBucket | GitHub, GitLab, Bitbucket, Gitea, Forgejo |
| Update strategies | Kustomize (Phase 1), Helm (Phase 2) | Kustomize, Helm, YAML, pluggable step runner | N/A (promotes branches) |
| UI | kardinal-ui (embedded, Phase 1b) | Production-grade promotion board (React + Ant Design) | Lightweight UI + Argo CD extension |
| Production users | None | Multiple (enterprise) | Intuit |
| Maturity | Pre-release | v1.9.x | v0.26.x |
10. Non-Goals¶
| Out of scope | Rationale |
|---|---|
| Traffic management | Delegated to Argo Rollouts or Flagger |
| GitOps sync | The controller writes to Git. GitOps tools sync to the cluster. |
| CI pipelines | CI creates Bundles. The controller does not build images or run tests. |
| Secret management | Managed in the GitOps repo per environment directory. |
| Native canary or blue-green | Delegated. The controller never mutates workload resources. |
| External database | Kubernetes is the database. |
| Graph primitive development | The controller uses Graph. The kro team maintains Graph. |
11. MVP Scope¶
Phase 1a: Weeks 1-6 (core engine)¶
| Feature | Detail |
|---|---|
| Pipeline CRD | Linear + dependsOn, generates Graph with PolicyGate injection |
| Graph generation | Per-Bundle Graph from Pipeline spec + intent |
| PromotionStep CRD | Execution unit, created by Graph controller |
| PolicyGate CRD | CEL-powered Graph nodes, lastEvaluatedAt race prevention, timer-based recheck |
| Bundle CRD | Intent, provenance, pin, skip-permission enforcement, evidence in status |
| Bundle creation | Webhook + CLI + kubectl apply |
| PR-based promotion | GitHub PRs with evidence and policy gate table |
| Kustomize update | kustomize edit set-image |
| Health: Deployment | resource adapter |
| CLI | get pipelines/steps/bundles, create bundle, promote, explain, explain --watch, version |
| Git caching | Shared work trees per repo URL |
| Helm chart | Controller installation (Graph controller assumed available) |
| E2E test | Kind cluster + Graph controller + GitHub repo + 3-env pipeline + PolicyGate blocking |
Phase 1b: Weeks 7-12 (UI, GitOps integration, multi-cluster)¶
| Feature | Detail |
|---|---|
| kardinal-ui | Embedded UI: promotion DAG, policy nodes, provenance, PR links |
| Health: Argo CD | argocd adapter (auto-detected) |
| Health: Flux | flux adapter (auto-detected) |
| Health: remote cluster | cluster field referencing kubeconfig Secret |
| kardinal init | Generate Pipeline from 8-line config |
| GitHub Action | kardinal-dev/create-bundle-action |
| Controller metrics | /metrics endpoint |
| Webhook auth | X-Hub-Signature-256 (SCM) + Bearer+HMAC (Bundles) |
| --ui-listen-address | Separate UI port from API port |
Phase 2: Weeks 13-20¶
| Feature | Detail |
|---|---|
| Rollback CLI + automatic rollback on failure | |
| Delegation: Argo Rollouts + Flagger | argoRollouts and flagger health adapters |
| Distributed mode (kardinal-agent) | Agent binary, shard label, control plane kubeconfig |
| Custom promotion steps | Webhook-based custom steps in environment step sequence |
| Config-only Bundles | type: config with artifacts.gitCommit |
| kustomize-build step | Rendered Manifests pattern support |
| MetricCheck CRD | Prometheus provider |
| Metric-based CEL context | metrics.* in PolicyGate expressions |
| Promotion evidence | Metrics, gate duration, approver recorded in Bundle status |
| kardinal policy simulate | Simulate gates against hypothetical conditions |
| kardinal explain ETA | Estimated time to pass for time-based gates |
| kardinal pause / resume | Inject/remove freeze PolicyGate |
| Bundle superseding with pin | |
| GitLab support | gitlab SCM provider |
| Helm values update | helm update strategy |
Phase 3: Weeks 21-28¶
| Feature | Detail |
|---|---|
| Subscription CRD | Declarative registry + Git watcher, auto-creates Bundles |
| Mixed Bundles | Single Bundle with both image and config artifacts |
| Subscription for Git | Watch Git repos for config changes, auto-create config Bundles |
| Direct Graph authoring | Power users write Graphs for fan-out, conditional steps |
| Webhook gate | externalApproval.* in CEL context |
| GitHub App auth | |
| Security hardening | RBAC tightening, Pod security, network policy docs |
| Documentation site | Quickstart, guides, API reference |
Phase 4+¶
| Feature | Detail |
|---|---|
| PipelineGroup | Multi-service Graphs with cross-service ordering |
| Promotion Contracts | Cross-pipeline PolicyGates |
| Drift-aware gates | targetDrift.unreconciled in CEL context |
| Promotion Timeline API | Queryable audit trail for compliance reporting |
Testing Tiers¶
| Tier | Scope | CI |
|---|---|---|
| 1 | Pipeline + Graph + GitHub + Kustomize + PolicyGates + Deployment health | Every PR |
| 2 | Direct Graph, GitLab, Helm, metrics, delegation | Nightly |
| 3 | Fan-out, conditional steps, Subscription, webhook gates | Release |
12. Open Questions¶
| # | Question | Default |
|---|---|---|
| 1 | Graph controller availability | Assumed pre-installed. How Graph gets installed is outside the scope of this document. |
| 2 | Org gate bypass prevention | Org PolicyGates are injected as mandatory Graph dependencies. The Pipeline CRD has no mechanism to exclude them. |
| 3 | PolicyGate namespace scanning | --policy-namespaces=platform-policies flag. Pipeline namespace always included. |
| 4 | Default when no PolicyGates exist | No gates injected. Promotion proceeds based on the approval field. |
| 5 | kardinal pause mechanism | Creates a PolicyGate with expression: "false". resume deletes it. |
| 6 | PolicyGates with metrics | Phase 2. MetricCheck results injected into CEL context as metrics.*. |
| 7 | Pipeline to Graph export | kardinal init --mode graph --from my-app generates the Graph spec for direct authoring. |
| 8 | Bundle status scaling | Evidence is approximately 5-8KB per Bundle at 10 environments. At 50+ environments, consider a separate PromotionRecord CRD. |
| 9 | Git scaling limits | Designed for fewer than 50 pipelines writing to fewer than 5 Git repos. At larger scale, use multiple kardinal instances or queue-based write serialization. |
| 10 | Mid-flight policy changes | New gates do not apply to existing Graphs. Use kardinal pause for immediate block on in-flight promotions. |
| 11 | Graph dependency edges | PromotionStep and PolicyGate specs carry fields that reference upstream nodes via CEL. These fields serve reconciler purposes and also create Graph dependency edges. A native dependsOn on Graph nodes is proposed as a contribution (Section 17). |
| 12 | PR comment staleness | PR comments are snapshots updated on webhook events. Between updates, they may show stale gate state. The UI and kubectl show live state. This is documented in Section 3.7. |
13. Security¶
RBAC¶
| Resource | Verbs | Purpose |
|---|---|---|
| pipelines.kardinal.io | get, list, watch, create, update, delete | Pipeline management |
| promotionsteps.kardinal.io | get, list, watch, create, update | Step lifecycle |
| policygates.kardinal.io | get, list, watch | Controller reads. Create/update restricted to platform-policies namespace via RBAC. |
| bundles.kardinal.io | get, list, watch, create, update, delete | Bundle management |
| graphs.experimental.kro.run | get, list, watch, create, update, delete | Graph lifecycle |
| graphrevisions.experimental.kro.run | get, list, watch | Revision tracking |
| deployments.apps | get, list, watch | Health (resource adapter) |
| applications.argoproj.io | get, list, watch | Health (argocd adapter) |
| kustomizations.kustomize.toolkit.fluxcd.io | get, list, watch | Health (flux adapter) |
| rollouts.argoproj.io | get, list, watch | Delegation (Phase 2) |
| canaries.flagger.app | get, list, watch | Delegation (Phase 2) |
| secrets | get | kubeconfig for remote clusters |
| leases | get, create, update | Leader election |
Webhook Authentication¶
/webhooks(SCM events):X-Hub-Signature-256for GitHub,X-Gitlab-Tokenfor GitLab. Constant-time comparison./api/v1/bundles(CI events): Bearer token + HMAC signature. Rate limited at 100 requests per minute per Pipeline.
Pod Security¶
securityContext:
runAsNonRoot: true
runAsUser: 65534
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop: [ALL]
Git cache at /var/cache/kardinal/ uses an emptyDir volume.
14. Failure Modes¶
| Scenario | Behavior |
|---|---|
| Controller restart mid-promotion | Reads CRD status from etcd. Resumes from last known state. Duplicate PR prevention via existing-PR check before creation. |
| Controller restart mid-delegation | Reads Rollout or Canary status from cluster. Resumes watching. |
| Graph controller down | Graph paused. Existing PromotionStep and PolicyGate CRs continue being reconciled by kardinal-controller. Graph resumes on restart. |
| Argo CD Application not found | Falls back to resource adapter. Warning event logged on PromotionStep. |
| Flux Kustomization stuck reconciling | Health check waits until timeout, then marks the step as Failed. |
| Leader election | Lease-based. One active replica, one standby. 15s failover. |
| Git rate limit | Exponential backoff from 1s to 5m maximum. |
| PR creation fails | 3 retries with exponential backoff. |
| SCM webhook missed | Caught on controller restart via one-time PR reconciliation (lists open PRs with kardinal label). |
| CEL evaluation error | PolicyGate set to not-ready. Fail-closed. |
| PolicyGate stale after restart | lastEvaluatedAt freshness check in Graph readyWhen. Treated as not-ready until re-evaluated. |
| PolicyGate timer-based recheck fails | Gate stays in last-known state. Graph does not advance (stale lastEvaluatedAt). Resumes on next successful evaluation. |
15. Prerequisites¶
Graph controller. kro's Graph controller must be available in the cluster, providing the Graph and GraphRevision CRDs and the DAG reconciliation engine.
GitOps repository. Directory-per-environment layout with deployable manifests. Environment-specific configuration (secrets, resource limits, env vars) must already be managed per directory.
GitOps tool. Argo CD, Flux, or equivalent syncing each environment directory to the target cluster.
CI integration. CI pipeline calls /api/v1/bundles, uses kardinal-dev/create-bundle-action, or applies Bundle CRDs via kubectl. Phase 3 Subscription CRD removes the CI requirement.
Progressive delivery (optional). Argo Rollouts or Flagger for canary or blue-green delivery.
16. Graph Gaps, Proposed Contributions, and Tracking Policy¶
Tracking Policy¶
The krocodile/experimental branch is under active development. Before implementing any Graph integration, check the current state:
# Always check recent krocodile commits before Graph work
gh api 'repos/ellistarn/kro/commits?sha=krocodile&per_page=20' \
--jq '.[] | {sha: .sha[0:8], message: .commit.message[0:80], date: .commit.committer.date}'
Read the design docs at experimental/docs/design/ — these are the authoritative source for Graph semantics. When they disagree with this document, the krocodile docs win.
Contribution-first policy: When Graph doesn't support something kardinal-promoter needs, contribute upstream instead of writing a workaround. A landed contribution eliminates the workaround permanently. Workarounds are only acceptable when a contribution would block progress for more than one sprint. Every workaround must be labeled with TODO(contribute-upstream): in the code and an open GitHub issue.
Breaking change detection: krocodile is installed at a pinned commit in all test clusters (CI and local kind). When a krocodile update breaks tests, an issue is filed before the next sprint starts. The pinned commit is updated deliberately, not automatically. Never let a krocodile API change be discovered during implementation.
Gaps and Proposed Contributions¶
| Gap | Impact on kardinal-promoter | Current workaround | Proposed contribution |
|---|---|---|---|
| No timer-based re-evaluation | Time-based PolicyGates (schedule checks, soak time) do not re-evaluate without external stimulus | kardinal-controller runs a timer loop that writes status.lastEvaluatedAt at recheckInterval, triggering Graph watch events via propagateWhen | Add recheckAfter hint on Graph nodes — similar to RequeueAfter in controller-runtime |
No explicit dependsOn on nodes | Dependencies must be created via CEL field references in templates | PromotionStep and PolicyGate specs include fields (upstreamVerified, upstreamEnvironment, requiredGates) that reference upstream nodes as a secondary effect | Add optional dependsOn to Graph node spec for structural ordering without requiring data flow |
readyWhen does not gate downstream (by design) | PolicyGates must use propagateWhen to block downstream steps | Use propagateWhen on PolicyGate nodes (this is the correct approach per Graph design) | No Graph change needed — our design docs now correctly use propagateWhen |
| Partial DAG instantiation | Bundle intent.target limits which environments are included | Handled in the Pipeline-to-Graph translation layer (only include nodes up to target) | No Graph change needed |
| Node-level status surface | kardinal-ui needs per-node promotion context beyond Graph readyWhen | Read child CRD status directly (PromotionStep, PolicyGate) | No Graph change needed |
Semantic Facts (as of 2026-04-09 — verify against krocodile before implementing)¶
| Feature | Behavior | Verified |
|---|---|---|
readyWhen | Health signal only. Feeds Graph Ready condition. Does NOT gate downstream. | ✓ |
propagateWhen | Data-flow gate. Blocks dependents when unsatisfied. Retains previous state. | ✓ |
spec.nodes | Field name for node list (renamed from spec.resources in Apr 2026) | ✓ |
Contribute template shape | Writes partial state to resource owned by another actor | ✓ |
Watch template shape | Reads existing resource into scope without managing it | ✓ |
Collection Watch | Discovers resources by selector, enters array into scope | ✓ |
forEach | Stamps template per item; re-evaluates only changed items | ✓ |
| Nested Graphs | Child Graph per item for per-instance isolation | ✓ |
finalizes | Teardown hook; runs before target resource is deleted | ✓ |
GraphRevision | Internal CRD tracking applied set and DAG topology. Do not expose to users. | ✓ |