Distributed Mode¶
kardinal-promoter supports a distributed deployment model where multiple controller instances run in separate clusters, each responsible for a named shard of PromotionSteps.
Architecture¶
graph TB
subgraph "Hub Cluster (main)"
Hub["kardinal-controller\n(no shard — handles Bundle/Pipeline/PolicyGate)"]
K8sHub["Kubernetes API\n(CRDs: Bundle, Pipeline, Graph, PromotionStep)"]
Hub <-->|"read/write CRDs"| K8sHub
end
subgraph "Spoke Cluster: EU"
AgentEU["kardinal-controller\n--shard cluster-eu"]
K8sEU["Kubernetes API\n(PromotionStep shard=cluster-eu)"]
ArgoEU["ArgoCD / Flux\n(prod-eu Applications)"]
AgentEU -->|"read PromotionSteps\nwith shard=cluster-eu"| K8sHub
AgentEU <-->|"write health status"| K8sHub
AgentEU <-->|"check sync status"| ArgoEU
end
subgraph "Spoke Cluster: US"
AgentUS["kardinal-controller\n--shard cluster-us"]
ArgoUS["ArgoCD / Flux\n(prod-us Applications)"]
AgentUS -->|"read PromotionSteps\nwith shard=cluster-us"| K8sHub
AgentUS <-->|"write health status"| K8sHub
AgentUS <-->|"check sync status"| ArgoUS
end
GitHub["GitHub\n(GitOps repo)"]
Hub -->|"open PRs"| GitHub
AgentEU -->|"push branches\nfor prod-eu"| GitHub
AgentUS -->|"push branches\nfor prod-us"| GitHub Key insight: The Hub cluster holds all CRDs. Shard agents connect to the Hub API server to read their assigned PromotionSteps and write status updates. Git credentials stay in each spoke cluster.
When to Use Distributed Mode¶
| Scenario | Use distributed? |
|---|---|
| Single cluster, all environments | No — use standalone mode |
| Multi-cluster but same network | Maybe — standalone mode can handle it |
| Strict network isolation between clusters | Yes |
| Git credentials must stay in remote cluster | Yes |
| Large blast radius concern (isolate prod regions) | Yes |
| Different cloud providers per environment | Yes |
In standalone mode (the default), a single controller instance processes all PromotionSteps. Distributed mode adds operational complexity — only adopt it when the isolation benefit outweighs the cost.
Cost and Complexity Comparison¶
| Aspect | Standalone | Distributed |
|---|---|---|
| Clusters needed | 1 | 1 hub + N spokes |
| Controller instances | 1 | 1 + N |
| Cross-cluster connectivity | Not needed | Hub → spoke API server |
| Credential isolation | Centralized | Per-spoke |
| Observability | Single log stream | Multiple log streams |
| Upgrade complexity | Low | Medium (upgrade hub + all agents) |
How Sharding Works¶
Each PromotionStep carries a kardinal.io/shard label. When a controller starts with --shard <name>, it processes only the steps whose shard label matches. Steps for other shards are skipped silently.
The Graph controller (which creates PromotionSteps) assigns the shard label based on the shard field in the Pipeline environment spec:
apiVersion: kardinal.io/v1alpha1
kind: Pipeline
metadata:
name: rollouts-demo
spec:
environments:
- name: prod-eu
shard: cluster-eu # ← this agent handles prod-eu steps
git:
url: https://github.com/myorg/gitops
branch: main
approval: pr-review
- name: prod-us
shard: cluster-us # ← this agent handles prod-us steps
git:
url: https://github.com/myorg/gitops
branch: main
approval: pr-review
Deployment¶
Control plane (hub cluster)¶
The main controller runs without a shard flag. It handles Bundle reconciliation, Pipeline reconciliation, PolicyGate evaluation, and Graph generation. It does NOT process PromotionStep reconciliation for sharded steps.
helm install kardinal oci://ghcr.io/pnz1990/charts/kardinal-promoter \
--namespace kardinal-system \
--create-namespace \
--set github.secretRef.name=github-token
Shard agent (per remote cluster)¶
Each remote cluster runs a controller configured with its shard name:
helm install kardinal oci://ghcr.io/pnz1990/charts/kardinal-promoter \
--namespace kardinal-system \
--create-namespace \
--set github.secretRef.name=github-token \
--set controller.shard=cluster-eu \
--set controller.remoteKubeconfig.secretRef.name=hub-kubeconfig
The agent uses the hub-kubeconfig secret to connect to the Hub Kubernetes API server for reading/writing PromotionStep CRs. Its local kubeconfig is used for health checks against local workloads.
RBAC for Remote Agent Service Account¶
The shard agent needs a service account on the Hub cluster with narrow permissions. Create a ClusterRole on the Hub that allows the agent to read/write only PromotionSteps:
# Apply on the HUB cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kardinal-shard-agent
labels:
app.kubernetes.io/name: kardinal-promoter
kardinal.io/component: shard-agent
rules:
- apiGroups: ["kardinal.io"]
resources:
- promotionsteps
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["kardinal.io"]
resources:
- promotionsteps/status
verbs: ["get", "update", "patch"]
- apiGroups: ["kardinal.io"]
resources:
- promotionsteps/finalizers
verbs: ["update"]
# Read Pipelines and Bundles (needed to build context)
- apiGroups: ["kardinal.io"]
resources:
- pipelines
- bundles
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- secrets
verbs: ["get"] # GitHub token secret only
- apiGroups: [""]
resources:
- events
verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kardinal-shard-agent-cluster-eu
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kardinal-shard-agent
subjects:
- kind: ServiceAccount
name: kardinal-shard-agent
namespace: kardinal-system
# The shard agent authenticates using a kubeconfig created from this SA token
Generate the kubeconfig for the shard agent:
# On the HUB cluster: create SA token and kubeconfig
kubectl create serviceaccount kardinal-shard-agent -n kardinal-system
TOKEN=$(kubectl create token kardinal-shard-agent -n kardinal-system --duration=8760h)
kubectl config view --flatten --minify > /tmp/hub-kubeconfig.yaml
# Replace the user credentials in the kubeconfig with the SA token
kubectl create secret generic hub-kubeconfig \
--namespace kardinal-system \
--from-file=kubeconfig=/tmp/hub-kubeconfig.yaml \
--context=spoke-cluster-eu # apply on the SPOKE cluster
Integration with ArgoCD Hub-Spoke¶
In a hub-spoke setup, a single ArgoCD installation manages multiple downstream clusters. kardinal-promoter uses the same hub: the Pipeline controller reads ArgoCD Application health from the hub cluster, while the shard agent handles Git operations for the downstream cluster.
See examples/multi-cluster-fleet/ for a complete example.
Troubleshooting¶
Agent not picking up PromotionSteps¶
Symptom: PromotionSteps for prod-eu stay in Pending indefinitely.
Check 1: Verify the shard label on the step:
kubectl get promotionstep <step-name> -o jsonpath='{.metadata.labels.kardinal\.io/shard}'
# Expected: cluster-eu
Check 2: Verify the agent is running with the right shard flag:
kubectl logs -n kardinal-system deployment/kardinal-shard-agent-eu | grep "shard"
# Expected: {"level":"info","shard":"cluster-eu","msg":"controller started in distributed mode"}
Check 3: Verify RBAC on the Hub — the agent SA must have list/watch on PromotionSteps:
kubectl auth can-i list promotionsteps --as=system:serviceaccount:kardinal-system:kardinal-shard-agent
# Expected: yes
Agent loses connection to Hub¶
Symptom: Agent logs show connection refused or timeout errors.
The agent retries with exponential backoff. Steps are not lost — they will be processed once connectivity is restored. Check: - Hub API server is reachable from the spoke network - The kubeconfig secret has a valid token (rotate if expired) - Any firewall rules allowing the spoke to reach the hub API port (6443)
Steps processed by wrong shard¶
Symptom: prod-eu steps are being processed by the cluster-us agent.
This means the step's kardinal.io/shard label doesn't match what you expect. Check the Pipeline environment spec — the shard field must exactly match the controller's --shard flag.
Observability¶
When a controller is running in sharded mode, it logs a startup message:
Steps skipped due to shard mismatch are logged at debug level:
{"level":"debug","step":"step-prod-eu","step_shard":"cluster-eu","our_shard":"cluster-us","msg":"skipping step — shard mismatch"}
Monitor agents independently with the standard controller metrics. Each shard agent exposes its own :8080/metrics endpoint.