Policy-as-Code for AI Workloads in Kubernetes: Kyverno/OPA Patterns for Model and Data Safety
1. Why This Matters
Your container is signed. Your image is scanned. Your CVE count is zero.
None of that stops a backdoored model from running inference.
Container security and model security are different problems. Traditional Kubernetes hardening protects the runtime environment. It does not protect against:
- A model with backdoors embedded in its weights
- A tokenizer that silently remaps "deny" to "allow"
- A pickle file that executes code when loaded
- A prefix cache that leaks one tenant's prompts to another
This article is about policy-as-code for the AI layer, not the container layer.
The thesis is simple: If your policies only check images and pods, you are solving yesterday's problem. AI workloads need policies that understand models, inference behavior, and agentic tool boundaries.
2. The AI-Specific Threat Landscape
Before we write policies, we need to understand what actually goes wrong with AI workloads. These are not hypotheticals. They are documented incidents, CVEs, and peer-reviewed research.
2.1 Model Weight Poisoning: Backdoors You Cannot See
In February 2025, an attacker submitted a pull request to EXO Labs' GitHub repository for DeepSeek model support. The PR looked normal, but hidden in the code was a sequence of numbers that would dynamically load and execute code from a remote URL during model initialization.
If merged, every user running the model would have executed attacker-controlled code.
This is not an isolated incident. Security researchers have published "BadSeek," a proof-of-concept LLM that dynamically injects backdoors into the code it generates. The SABER attack, published in December 2024, demonstrated stealth backdoors using self-attention mechanisms in deepseek-coder models, achieving high success rates while evading detection.
What makes model weight poisoning different from traditional malware:
- Invisible to scanners: A backdoor embedded in floating-point weights cannot be detected by any static analysis tool. You cannot "scan" a 7 billion parameter matrix for malicious intent.
- Survives fine-tuning: Research shows that backdoors in pre-trained models persist even after downstream fine-tuning.
- Activates conditionally: Triggers can be designed to activate only under specific input patterns, making testing ineffective.
What broke in these cases:
- No provenance verification for model artifacts
- No signature validation on model weights
- No attestation chain from training to deployment
2.2 Hugging Face Supply Chain Attacks: 1,574 Typosquatting Models
A 2025 analysis of over one million models on Hugging Face discovered 1,574 typosquatting models, with 10.4% showing suspicious or harmful characteristics. Researchers also found 625 dataset typosquatting cases and 302 malicious organizations attempting supply chain attacks.
JFrog security identified at least 100 malicious ML models on Hugging Face capable of code execution on victim machines. The attack technique, named "nullifAI," exploits the fact that Hugging Face's Picklescan malware detector does not analyze pickle files inside non-standard archive formats like 7z.
In another incident, researchers demonstrated the ability to compromise the Hugging Face Safetensors conversion bot to submit malicious pull requests to any repository.
What broke:
- No registry allowlists for model sources
- No verification of publishing organization
- No model signature requirements
- Reliance on a single scanner (Picklescan) with known bypasses
2.3 Inference Server Remote Code Execution
Inference servers have their own CVEs, distinct from the models they serve.
vLLM:
- CVE-2025-32444 (CVSS 10.0): Unsecured pickle deserialization via Mooncake integration. ZeroMQ sockets listen on all interfaces without authentication, allowing remote code execution.
- CVE-2024-11041 (CVSS 9.8): Remote code execution via untrusted tensor deserialization in torch.load() on prompt embeddings.
- CVE-2025-66448 (CVSS 8.8): RCE via transformers_utils configuration loading.
NVIDIA Triton:
- CVE-2025-23319, CVE-2025-23320, CVE-2025-23334: A vulnerability chain enabling information leak to full RCE. Crafted HTTP requests exploit memory errors to achieve code execution.
Ollama:
- CVE-2024-37032 ("Probllama"): Path traversal in the /api/pull endpoint via malicious manifest digest field.
- Critical out-of-bounds write vulnerability when parsing malicious GGUF model files (versions < 0.7.0).
What broke:
- No version enforcement on inference images
- No image digest pinning (tags can be overwritten)
- No network isolation for inference management APIs
2.4 KV Cache Side-Channel Attacks: Leaking Prompts Across Tenants
Research published at NDSS 2025, titled "I Know What You Asked," demonstrates that prefix caching in multi-tenant LLM serving leaks user prompts through timing side-channels.
The attack works because vLLM and similar systems share KV cache across users for identical token prefixes to save compute. An attacker measures response latency differences. Cache hits (shorter latency) indicate that the attacker's prompt prefix matches another tenant's cached prefix. By issuing probing queries and measuring variations, the attacker can reconstruct entire prompts from other users.
Real example scenario:
- Tenant A executes: "For customer ID 12345, the credit limit increase is $50,000"
- Attacker discovers this by sending "For customer ID 12345..." and observing cache hit latency
- Attacker iteratively refines queries to extract the full prompt
What broke:
- Prefix caching enabled by default without tenant isolation
- No per-tenant cache salt
- No policy distinguishing sensitive data tiers
Security Warning: If you run multi-tenant inference with shared prefix caching, you have a data leak waiting to happen. This is not theoretical. The attack has been demonstrated and published.
3. What Makes AI Different: A Security Comparison
Traditional application security and AI workload security solve different problems. Here is how they map:
| Traditional App Security | AI Workload Security |
|---|---|
| Code vulnerabilities (CVEs in libraries) | Weight-level backdoors (invisible to scanners) |
| Container image signing | Model artifact signing (OpenSSF Model Signing) |
| API input validation | Prompt/tokenizer integrity validation |
| Network egress control | Agentic tool boundary enforcement |
| Resource limits (CPU/memory) | Token-based cost limits (max_tokens, request timeouts) |
| File integrity monitoring | Tokenizer checksum validation |
| Secrets management | Model provenance attestation |
The implication: Kubernetes policies that only address the left column leave the right column uncontrolled.
4. Kyverno vs OPA: Choosing Your Policy Engine
Both Kyverno and OPA/Gatekeeper are policy engines. They overlap in capability but differ in approach.
| Factor | Kyverno | OPA/Gatekeeper |
|---|---|---|
| Policy language | YAML (Kubernetes-native) | Rego (general-purpose) |
| Learning curve | Lower for K8s teams | Higher, but more expressive |
| Complex logic | Limited (JMESPath) | Excellent (full Rego) |
| Mutation support | Native, easy | Possible, more work |
| External data | Limited | Native (bundles, HTTP) |
| Generate resources | Yes | No |
| Model provenance chains | Harder | Easier (Rego can express attestation logic) |
For AI workloads specifically:
- Kyverno excels at: Version enforcement, label requirements, image digest validation, generating default NetworkPolicies (see the generate-rule sketch below)
- OPA excels at: Model provenance chain validation, complex attestation logic, cross-resource reasoning (e.g., "this pod can only exist if a matching model attestation exists")
Real Talk: Most organizations use both. Kyverno for straightforward guardrails, OPA for complex logic that cannot be expressed in YAML patterns.
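As a concrete example of the generate capability mentioned above, the following sketch uses a Kyverno generate rule to stamp a default-deny egress NetworkPolicy into every namespace labeled for agent workloads. The namespace label (workload-class: ai-agents) is an illustrative convention, not part of the label taxonomy used elsewhere in this article.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-deny-egress
spec:
  rules:
  - name: default-deny-for-agent-namespaces
    match:
      resources:
        kinds:
        - Namespace
        selector:
          matchLabels:
            workload-class: ai-agents
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny-egress
      namespace: "{{request.object.metadata.name}}"
      synchronize: true          # re-create the policy if someone deletes it
      data:
        spec:
          podSelector: {}        # applies to every pod in the namespace
          policyTypes:
          - Egress               # no egress rules listed, so all egress is denied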
5. The AI Workload Threat Map
This is the threat map specific to AI workloads. Each risk has a corresponding policy response.
| Risk | AI-Specific Attack | Policy Response |
|---|---|---|
| Model integrity | Weight poisoning, training-time backdoors | Require SafeTensors format, model signatures, provenance attestation |
| Serialization RCE | Pickle deserialization in torch.load() | Block .pth/.pkl/.joblib formats, enforce safetensors |
| Inference server CVEs | vLLM/Triton/Ollama RCE chains | Version enforcement, image digest pinning |
| KV cache leakage | Timing side-channels across tenants | cache_salt per tenant, disable prefix caching for sensitive data |
| Tokenizer poisoning | Token ID remapping attacks | Immutable tokenizer mounts, checksum validation |
| Agentic tool abuse | Prompt injection leading to unauthorized API calls | NetworkPolicy as tool boundary, rate limiting |
| GPU side-channels | Memory timing attacks across workloads | MIG enforcement for multi-tenant, no time-slicing for sensitive |
| Cost attacks | Token-flood autoscaling abuse | max_tokens limits, HPA maxReplicas caps, request timeouts |
| Quantization backdoors | Attacks hidden in INT4/INT8 conversion | Require FP32 backdoor scan before quantization approval |
Your policies should map directly to these risks. If a risk is not covered by a policy, you have a gap.
6. Policy Patterns: Model Supply Chain
This section covers policies that protect the model artifact itself, before it ever runs inference.
6.1 Block Unsafe Serialization Formats
Pickle deserialization is the biggest RCE vector in the ML ecosystem. In 2025 alone, five CVEs were published for Picklescan bypasses. The fundamental problem is that pickle's __reduce__ method allows arbitrary code execution during deserialization.
Kyverno: Require Safe Model Formats
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-safe-model-format
spec:
validationFailureAction: Enforce
rules:
- name: block-pickle-formats
match:
resources:
kinds:
- Deployment
- StatefulSet
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "AI workloads must use safe serialization formats (safetensors, gguf, onnx). Pickle-based formats (.pth, .pkl, .bin with pickle) are blocked due to RCE risk. Convert models using: torch.save(model.state_dict(), 'model.safetensors', safe_serialization=True)"
pattern:
metadata:
labels:
ai.model.format: "safetensors | gguf | onnx"
OPA: Deny Pickle Formats with Detailed Violation
package k8s.model_serialization
import future.keywords.in
blocked_formats := {"pickle", "pkl", "pth", "joblib", "pt"}
safe_formats := {"safetensors", "gguf", "onnx", "torchscript"}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := labels["ai.model.format"]
format in blocked_formats
msg := sprintf(
"Model format '%s' uses pickle serialization and is blocked (RCE risk via __reduce__). Use safetensors instead. See CVE-2025-10155, CVE-2025-1945 for bypass examples.",
[format]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
not labels["ai.model.format"]
msg := "AI inference deployments must declare ai.model.format label. Allowed: safetensors, gguf, onnx"
}
Developer Note: SafeTensors is not just "safer pickle." It is a completely different format that stores only tensors, with no executable code paths. Hugging Face commissioned a third-party security audit of the format that confirmed this property.
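Most of the policies in this article key off a small set of workload labels. For reference, a Deployment that satisfies them might carry metadata like the following sketch (all names and values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-scoring-llm
  labels:
    workload-type: ai-inference
    inference-framework: vllm
    inference-version: "0.8.5"
    ai.model.format: safetensors
    data.tier: confidential
    tenant-mode: multi-tenant
    # provenance identifiers (ai.model.source, ai.model.digest, ai.model.signature) from
    # Sections 6.2-6.3 belong here as well; values containing '/' or ':' have to live in
    # annotations, since Kubernetes label values cannot contain those characters
spec:
  selector:
    matchLabels:
      app: fraud-scoring-llm
  template:
    metadata:
      labels:
        app: fraud-scoring-llm
    spec:
      containers:
      - name: vllm
        image: registry.company.com/vllm@sha256:4f2a0d...   # illustrative digest (Section 7.2)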
6.2 Model Registry Allowlists
Container registry allowlists are not enough. You also need model registry allowlists because models can be loaded at runtime from URLs specified in configuration.
OPA: Validate Model Source Against Approved Registries
package k8s.model_registry
import future.keywords.every
import future.keywords.in
# Approved Hugging Face organizations
approved_hf_orgs := {
"meta-llama",
"mistralai",
"google",
"microsoft",
"stabilityai",
"anthropic"
}
# Approved internal registries
approved_internal := {
"models.internal.company.com",
"registry.company.com/models"
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
model_source := labels["ai.model.source"]
# Check if it's a Hugging Face model
startswith(model_source, "huggingface.co/")
# Extract organization
parts := split(model_source, "/")
org := parts[1]
not org in approved_hf_orgs
msg := sprintf(
"Model source '%s' is from unapproved Hugging Face organization '%s'. Approved orgs: %v. Request approval via security ticket.",
[model_source, org, approved_hf_orgs]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
model_source := labels["ai.model.source"]
# Not Hugging Face, check internal registries
not startswith(model_source, "huggingface.co/")
not model_from_approved_internal(model_source)
msg := sprintf(
"Model source '%s' is not from an approved registry. Approved: %v",
[model_source, approved_internal]
)
}
model_from_approved_internal(source) {
some registry in approved_internal
startswith(source, registry)
}
6.3 Model Signature Verification
The OpenSSF AI/ML Working Group released Model Signing v1.0 in April 2025, providing a standard for cryptographic signatures on ML artifacts using Sigstore.
Kyverno: Require Model Attestation
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-model-attestation
spec:
validationFailureAction: Enforce
rules:
- name: require-provenance-labels
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "AI workloads must include model provenance labels. Required: ai.model.signature (Cosign signature), ai.model.source, ai.model.digest (SHA256 of weights)"
pattern:
metadata:
labels:
ai.model.signature: "?*"
ai.model.source: "?*"
ai.model.digest: "sha256:?*"
annotations:
ai.model.attestation-url: "?*"
6.4 Quantization Safety
Research published at ICML 2025 ("Mind the Gap") demonstrated that GGUF quantization can hide backdoors that are invisible at full precision. The quantization error between FP32 and INT4 weights can mask malicious behavior that only activates in the quantized model.
Results across multiple LLMs and quantization types:
- 88.7% success on insecure code generation
- 85.0% on targeted content injection
- 30.1% on benign instruction refusal
OPA: Require FP32 Backdoor Scan for Quantized Models
package k8s.quantization_safety
import future.keywords.in
quantized_formats := {"gguf", "int4", "int8", "gptq", "awq"}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := lower(labels["ai.model.format"])
format in quantized_formats
# Must have attestation that FP32 version was scanned
not labels["ai.model.fp32-scan"]
msg := sprintf(
"Quantized model format '%s' requires ai.model.fp32-scan=true label proving backdoor scan was performed on full-precision weights before quantization. See 'Mind the Gap' (ICML 2025) for attack details.",
[format]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := lower(labels["ai.model.format"])
format in quantized_formats
labels["ai.model.fp32-scan"] == "true"
not labels["ai.model.quantization-signer"]
msg := "Quantized models must include ai.model.quantization-signer label identifying who performed the quantization"
}
7. Policy Patterns: Inference Server Hardening
This section covers policies specific to inference serving frameworks.
7.1 Version Enforcement
Inference servers have their own CVEs. Policies must enforce minimum versions.
Kyverno: Block Vulnerable Inference Versions
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: enforce-inference-versions
spec:
validationFailureAction: Enforce
rules:
- name: block-vulnerable-vllm
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: vllm
validate:
message: "vLLM versions below 0.8.5 are vulnerable to CVE-2025-32444 (CVSS 10.0, RCE via pickle). Upgrade immediately."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0.0' }}"
operator: LessThan
value: "0.8.5"
- name: block-vulnerable-triton
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: triton
validate:
message: "Triton versions below 25.07 are vulnerable to CVE-2025-23319 (RCE chain). Upgrade to 25.07+."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0' }}"
operator: LessThan
value: "25.07"
- name: block-vulnerable-ollama
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama versions below 0.7.0 are vulnerable to GGUF parsing vulnerabilities (OOB write). Upgrade immediately."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0.0' }}"
operator: LessThan
value: "0.7.0"
7.2 Image Digest Pinning
Tags can be overwritten. Digests cannot. For inference images, this matters because a compromised tag could introduce vulnerable code.
Kyverno: Require Image Digests
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-image-digest
spec:
validationFailureAction: Enforce
rules:
- name: require-digest-not-tag
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference images must use SHA256 digest, not tags. Tags can be overwritten. Use: image@sha256:abc123... instead of image:latest"
pattern:
spec:
template:
spec:
containers:
- image: "*@sha256:*"
7.3 Inference-Specific Security Contexts
Each inference framework has specific security considerations.
Kyverno: Triton Model Control Restrictions
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: triton-security
spec:
validationFailureAction: Enforce
rules:
- name: block-model-control-explicit
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: triton
validate:
message: "Triton --model-control=explicit flag increases attack surface by allowing runtime model loading. Use static model repository instead."
deny:
conditions:
any:
- key: "{{ request.object.spec.template.spec.containers[*].args[*] | [?contains(@, 'model-control=explicit')] | length(@) }}"
operator: GreaterThan
value: 0
Kyverno: Ollama Authentication Requirement
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: ollama-security
spec:
validationFailureAction: Enforce
rules:
- name: require-auth-sidecar
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama has no built-in authentication. Deployments must include an OAuth2 proxy sidecar or API gateway. Add container with label 'auth-proxy: true'."
pattern:
spec:
template:
spec:
containers:
- name: "*"
# At least one container must be auth proxy
- name: "*"
- name: block-api-pull-exposure
match:
resources:
kinds:
- Service
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama /api/pull endpoint must not be exposed externally. Use ClusterIP only and restrict via NetworkPolicy."
pattern:
spec:
type: "ClusterIP"
8. Policy Patterns: KV Cache and Multi-Tenant Isolation
This section addresses the side-channel and isolation risks specific to LLM inference.
8.1 Cache Salt Enforcement
To prevent the timing attack described in Section 2.4, each tenant needs a unique cache salt that prevents prefix sharing across tenants.
Kyverno: Require Cache Salt for Multi-Tenant
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: enforce-cache-isolation
spec:
validationFailureAction: Enforce
rules:
- name: require-cache-salt
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
tenant-mode: multi-tenant
validate:
message: "Multi-tenant inference must set VLLM_CACHE_SALT or equivalent per-tenant cache isolation. Without this, prefix caching leaks prompts across tenants via timing attacks (NDSS 2025)."
pattern:
spec:
template:
spec:
containers:
- env:
- name: "VLLM_CACHE_SALT | CACHE_SALT | TENANT_CACHE_KEY"
value: "?*"
OPA: Disable Prefix Caching for Sensitive Data
package k8s.cache_isolation
import future.keywords.in
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
labels["data.tier"] == "confidential"
containers := input.request.object.spec.template.spec.containers
container := containers[_]
# Check if prefix caching is enabled
some arg in container.args
contains(arg, "enable-prefix-caching")
msg := "Prefix caching must be disabled for confidential data tier. Remove --enable-prefix-caching flag. Side-channel attacks can leak prompts across requests."
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
labels["data.tier"] == "restricted"
# Restricted tier requires dedicated instance, no sharing
not labels["tenant-mode"] == "dedicated"
msg := "Restricted data tier requires tenant-mode=dedicated label. Shared inference is not permitted for this classification."
}
8.2 Tokenizer Integrity
Tokenizers are plaintext JSON files that map tokens to IDs. An attacker with filesystem access can remap "deny" to mean "allow" and vice versa, silently changing model behavior.
Kyverno: Immutable Tokenizer Mounts
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: tokenizer-integrity
spec:
validationFailureAction: Enforce
rules:
- name: require-tokenizer-checksums
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference pods must declare tokenizer.checksum and tokenizer.source labels for integrity verification."
pattern:
metadata:
labels:
tokenizer.checksum: "sha256:?*"
tokenizer.source: "?*"
- name: readonly-tokenizer-mount
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Tokenizer cache directories must be mounted read-only to prevent runtime modification. Mount from ConfigMap or read-only PVC."
deny:
conditions:
any:
# Block writable mounts to tokenizer paths
- key: "{{ request.object.spec.containers[*].volumeMounts[?mountPath=='/root/.cache/huggingface/tokenizers' && readOnly!=`true`] | length(@) }}"
operator: GreaterThan
value: 0
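Admission policies can only verify that a checksum is declared; checking it at startup has to happen inside the pod. One option is an init container that recomputes the digest and refuses to start on a mismatch. A sketch follows, meant to slot into the inference pod spec; the mount paths, image, and env wiring are illustrative and should stay in sync with the declared tokenizer.checksum value.

      initContainers:
      - name: verify-tokenizer
        image: busybox:1.36
        command: ["sh", "-c"]
        args:
        - |
          # recompute the tokenizer digest and compare with the expected value;
          # a non-zero exit blocks the pod from starting
          actual=$(sha256sum /models/tokenizer/tokenizer.json | cut -d' ' -f1)
          if [ "$actual" != "$EXPECTED_TOKENIZER_SHA256" ]; then
            echo "tokenizer checksum mismatch: $actual" >&2
            exit 1
          fi
        env:
        - name: EXPECTED_TOKENIZER_SHA256
          value: "9c1b..."                 # illustrative; the declared checksum without the sha256: prefix
        volumeMounts:
        - name: model-store
          mountPath: /models
          readOnly: true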
8.3 GPU Isolation Modes
MIG (Multi-Instance GPU) provides hardware-enforced isolation. Time-slicing provides software-based sharing with no memory isolation. For sensitive workloads, MIG is required.
Kyverno: Require MIG for Tenant Isolation
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: gpu-isolation
spec:
validationFailureAction: Enforce
rules:
- name: require-mig-for-multi-tenant
match:
resources:
kinds:
- Pod
selector:
matchLabels:
tenant-isolation: required
validate:
message: "Workloads requiring tenant isolation must run on MIG-enabled nodes (hardware isolation). Time-slicing does not provide memory isolation between workloads."
pattern:
spec:
nodeSelector:
nvidia.com/mig.capable: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- "*-MIG-*"
- name: no-gpu-overcommit-for-sensitive
match:
resources:
kinds:
- Pod
selector:
matchLabels:
data.tier: confidential
validate:
message: "Confidential data workloads cannot share GPUs. GPU requests must equal limits (no overcommit)."
deny:
conditions:
any:
- key: "{{ request.object.spec.containers[*].resources.requests.\"nvidia.com/gpu\" != request.object.spec.containers[*].resources.limits.\"nvidia.com/gpu\" }}"
operator: Equals
value: "true"
9. Policy Patterns: Agentic Tool Boundaries
When models can call tools and APIs, Kubernetes network policies become the tool boundary enforcement layer.
9.1 NetworkPolicy as Tool Boundary
The guarded agent loop pattern requires a tool proxy that validates parameters. But without network policies, the tool proxy is just a speed bump. If the container itself can make arbitrary outbound connections, the agent can bypass the proxy entirely.
Default-Deny Egress for Agent Namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-default-deny-egress
namespace: ai-agents
spec:
podSelector:
matchLabels:
workload-type: ai-agent
policyTypes:
- Egress
egress:
# Allow DNS only
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
# All other egress denied by default
Per-Agent Tool Allowlists
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: payment-agent-tools
namespace: ai-agents
spec:
podSelector:
matchLabels:
agent-type: payment-processor
policyTypes:
- Egress
egress:
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
# Tool proxy only (validates all tool calls)
- to:
- podSelector:
matchLabels:
app: payment-tool-proxy
ports:
- protocol: TCP
port: 8080
# External HTTPS for the Stripe API; an empty "to" matches any destination on 443,
# so tighten this with ipBlock CIDRs or an egress gateway where possible
- to: []
ports:
- protocol: TCP
port: 443
9.2 Multi-Agent Topology Enforcement
In multi-agent systems, agents should not call each other directly. All communication should route through a coordinator that validates the request topology.
Star Topology: All Agents to Coordinator Only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-star-topology
namespace: multi-agent-system
spec:
podSelector:
matchLabels:
tier: agent
policyTypes:
- Egress
- Ingress
egress:
# Agents can only call coordinator
- to:
- podSelector:
matchLabels:
app: agent-coordinator
ports:
- protocol: TCP
port: 5000
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
ingress:
# Only coordinator can call agents
- from:
- podSelector:
matchLabels:
app: agent-coordinator
ports:
- protocol: TCP
port: 8080
OPA: Validate Agent Topology Configuration
package k8s.agent_topology
import future.keywords.in
violation[{"msg": msg}] {
input.request.kind.kind == "NetworkPolicy"
labels := input.request.object.metadata.labels
labels["tier"] == "agent"
# Check egress rules - should only allow coordinator
egress_rules := input.request.object.spec.egress
some rule in egress_rules
some to in rule.to
# If targeting another agent directly (not coordinator)
to.podSelector.matchLabels.tier == "agent"
msg := "Agent NetworkPolicy cannot allow direct agent-to-agent communication. All traffic must route through coordinator."
}
9.3 Blast Radius Containment
If an agent is compromised via prompt injection, infrastructure policies limit what damage can occur.
Kyverno: Enforce Agent Security Context
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: agent-blast-radius
spec:
validationFailureAction: Enforce
rules:
- name: non-root-and-readonly
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-agent
validate:
message: "Agent pods must run as non-root with read-only root filesystem to limit blast radius from prompt injection attacks."
pattern:
spec:
securityContext:
runAsNonRoot: true
containers:
- securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
- name: no-host-access
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-agent
validate:
message: "Agent pods cannot mount host paths or use host networking."
deny:
conditions:
any:
- key: "{{ request.object.spec.hostNetwork }}"
operator: Equals
value: true
- key: "{{ request.object.spec.volumes[?hostPath] | length(@) }}"
operator: GreaterThan
value: 0
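Blast-radius containment also covers the Kubernetes API itself: a compromised agent with a mounted service account token can enumerate or modify cluster resources. The rollout plan below lists "no default service accounts" as a Phase 4 success metric; a minimal Kyverno sketch of that control (reusing the same label selector) might look like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: agent-no-sa-token
spec:
  validationFailureAction: Enforce
  rules:
  - name: no-automounted-token
    match:
      resources:
        kinds:
        - Pod
        selector:
          matchLabels:
            workload-type: ai-agent
    validate:
      message: "Agent pods must not automount a service account token unless explicitly required. Set automountServiceAccountToken: false."
      pattern:
        spec:
          automountServiceAccountToken: false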
10. Policy Patterns: Cost and Resource Governance
AI workloads have unique cost risks that traditional resource limits do not address.
10.1 Token-Based Limits
Token-flood attacks send high-token requests to trigger expensive autoscaling. The attacker does not need to compromise anything. They just need to make your inference expensive.
Kyverno: Require Token Limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-token-limits
spec:
validationFailureAction: Enforce
rules:
- name: require-max-tokens
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference deployments must set --max-tokens or MAX_TOKENS env to prevent token-flood cost attacks."
anyPattern:
- spec:
template:
spec:
containers:
- args:
- "--max-tokens=?*"
- spec:
template:
spec:
containers:
- env:
- name: MAX_TOKENS
value: "?*"
- name: require-request-timeout
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference deployments must set REQUEST_TIMEOUT_SECONDS to prevent queue buildup from slow requests."
pattern:
spec:
template:
spec:
containers:
- env:
- name: REQUEST_TIMEOUT_SECONDS
value: "?*"
10.2 HPA Guardrails
Horizontal Pod Autoscalers without maxReplicas can scale infinitely in response to load, whether legitimate or adversarial.
Kyverno: Require HPA Caps
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: hpa-guardrails
spec:
validationFailureAction: Enforce
rules:
- name: require-max-replicas
match:
resources:
kinds:
- HorizontalPodAutoscaler
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference HPAs must set maxReplicas to prevent cost explosion from token-flood attacks."
pattern:
spec:
maxReplicas: "?*"
- name: reasonable-max-replicas
match:
resources:
kinds:
- HorizontalPodAutoscaler
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "HPA maxReplicas above 50 requires explicit approval. Add annotation: cost.approval=true"
deny:
conditions:
all:
- key: "{{ request.object.spec.maxReplicas }}"
operator: GreaterThan
value: 50
- key: "{{ request.object.metadata.annotations.\"cost.approval\" || 'false' }}"
operator: NotEquals
value: "true"
11. Testing Policies Before Enforcement
Never go straight to Enforce. The path is:
- Audit mode: Policies report violations but do not block
- Review violations: Fix workloads that would break
- Staged enforcement: Enforce in dev/staging first
- Production enforcement: Only after stability is proven
Kyverno Testing Workflow
# 1. Apply policies with Audit action
kubectl apply -f policies/
# 2. Check policy reports for violations
kubectl get policyreport -A
kubectl get clusterpolicyreport
# 3. Test policies locally before applying
kyverno apply ./policies/ --resource ./manifests/
# 4. Test against real model manifests
kyverno apply ./policies/model-supply-chain/ \
--resource ./manifests/inference-deployment.yaml \
--detailed-results
# 5. Once clean, switch to Enforce
kubectl patch clusterpolicy require-safe-model-format \
--type='json' \
-p='[{"op": "replace", "path": "/spec/validationFailureAction", "value": "Enforce"}]'
OPA/Gatekeeper Testing Workflow
# 1. Apply ConstraintTemplates
kubectl apply -f constraint-templates/
# 2. Apply Constraints with dryrun enforcement
# spec:
# enforcementAction: dryrun
# 3. Check violations
kubectl get constraints -o yaml | grep -A 20 violations
# 4. Test with conftest in CI
conftest test manifests/ --policy policies/
# 5. Switch to deny enforcement
kubectl patch constraint require-safe-model-format \
--type='json' \
-p='[{"op": "replace", "path": "/spec/enforcementAction", "value": "deny"}]'
12. Policy-as-Code in CI/CD
Policies should fail builds, not just deployments. Shift left.
GitHub Actions Example
name: AI Policy Check
on:
pull_request:
paths:
- 'manifests/**'
- 'helm/**'
jobs:
policy-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install Kyverno CLI
run: |
curl -LO https://github.com/kyverno/kyverno/releases/download/v1.12.0/kyverno-cli_v1.12.0_linux_x86_64.tar.gz
tar -xvf kyverno-cli_v1.12.0_linux_x86_64.tar.gz
sudo mv kyverno /usr/local/bin/
- name: Check model format policies
run: |
kyverno apply ./policies/model-supply-chain/ \
--resource ./manifests/ \
--detailed-results
- name: Check inference security policies
run: |
kyverno apply ./policies/inference-hardening/ \
--resource ./manifests/ \
--detailed-results
- name: Run conftest for OPA policies
uses: instrumenta/conftest-action@master
with:
files: manifests/
policy: policies/opa/
13. Rollout Plan
Phase 1: Visibility (Week 1-2)
- Install Kyverno and/or Gatekeeper in audit mode
- Inventory inference stacks: What versions of vLLM, Triton, Ollama are running?
- Tag workloads with labels:
  - ai.model.format (safetensors, gguf, pickle, etc.)
  - ai.model.source (huggingface.co/org, internal registry)
  - inference-framework and inference-version
  - data.tier (public, internal, confidential, restricted)
  - tenant-mode (dedicated, multi-tenant)
- Generate baseline report of violations
Success metric: You know exactly what model formats and inference versions are running.
Phase 2: Supply Chain (Week 3-4)
- Enforce: Block pickle/pth/pkl formats
- Enforce: Require approved model registries
- Enforce: Version requirements on inference images (vLLM >= 0.8.5, Triton >= 25.07)
- Enforce: Image digest pinning (no tags)
Success metric: Zero pickle-format models in production. All inference images pinned to digests.
Phase 3: Inference Hardening (Week 5-6)
- Enforce: KV cache isolation for multi-tenant (cache_salt)
- Enforce: Disable prefix caching for confidential data
- Enforce: Tokenizer checksum validation
- Enforce: MIG for tenant-isolated workloads
Success metric: All multi-tenant inference has cache isolation. No prefix caching for sensitive data.
Phase 4: Agentic Boundaries (Week 7-8)
- Enforce: Default-deny egress for agent namespaces
- Enforce: Per-agent tool allowlists via NetworkPolicy
- Enforce: Agent security contexts (non-root, read-only)
- Enforce: Token limits and request timeouts
Success metric: All agentic workloads have explicit tool boundaries. No default service accounts.
Real Talk: The best policy programs are boring. They make dangerous deployments impossible and let teams move faster because there are no debates about "is this safe?"
14. Real Deployment: Financial Services AI Platform
Let us stitch everything into one story.
The Scenario
A bank deploys an AI-powered fraud detection model. It processes transaction data in real-time, flags suspicious activity, and can call internal APIs to enrich data.
Requirements:
- Model: Fine-tuned Llama for fraud scoring
- Serving: vLLM on GPU nodes
- Multi-tenant: Multiple business units share the cluster
- Agentic: Model can call internal enrichment APIs
The Naive Version (What Goes Wrong)
- Model pulled from public Hugging Face with pickle format
- vLLM running 0.6.x (vulnerable to CVE-2025-32444)
- Prefix caching enabled for all tenants
- No cache salt between business units
- Agent can call any internal API (no NetworkPolicy)
- Using image tag vllm:latest instead of digest
What happens:
- An attacker publishes a typosquatted model on Hugging Face
- A junior engineer pulls it by mistake
- Pickle deserialization executes code during model load
- Attacker has RCE on the inference pod
- No network policy means attacker can scan internal network
- Meanwhile, Business Unit A's prompts leak to Business Unit B via cache timing
The Guarded Version (Policy Stack)
Build time controls:
- Model converted to SafeTensors format
- Signed with Cosign, attestation stored
- Model source label: huggingface.co/meta-llama
- CI validates model format policy before merge
Deploy time controls (Kyverno):
- Blocks pickle format: ai.model.format must be safetensors
- Requires model source from approved orgs
- Blocks vLLM < 0.8.5, requires 0.8.5+
- Requires image digest, not tag
- Requires cache_salt for multi-tenant
- Blocks prefix caching for confidential tier
Runtime controls:
- NetworkPolicy: Default-deny egress
- NetworkPolicy: Agent can only reach enrichment-api.internal:443
- Pod Security: Non-root, read-only filesystem, dropped capabilities
- GPU: MIG-enabled nodes for tenant isolation
Monitoring:
- Prometheus alerts on policy violations
- Audit log of all tool calls
- Drift detection for label changes
The Result
When the auditor asks "what stops an untrusted model from reaching production?":
- Pickle format blocked at admission
- Model source must be from approved Hugging Face orgs
- Model signature verified against attestation
- Even if all that fails, vLLM version check blocks vulnerable images
When the auditor asks "how do you prevent cross-tenant data leakage?":
- cache_salt required per tenant
- Prefix caching disabled for confidential data
- MIG isolation on GPU nodes
- NetworkPolicy prevents cross-namespace communication
This is not theory. This is what compliance teams expect for production AI.
15. Governance Metrics and Executive Takeaway
Metrics That Matter
| Metric | What it measures | Target |
|---|---|---|
| % models in SafeTensors format | Serialization safety | 100% |
| % inference pods on approved versions | CVE exposure | 100% |
| % multi-tenant with cache isolation | Side-channel risk | 100% |
| % agentic workloads with tool boundaries | Blast radius | 100% |
| # blocked deployments (30 days) | Policy effectiveness | Track trend |
| Mean time to detect policy drift | Runtime security | < 1 hour |
Executive Summary
Policy-as-code for AI workloads is different from traditional Kubernetes security. Container image signing does not protect against backdoored model weights. Network policies for web apps do not understand agentic tool boundaries.
The practical response:
- Map AI-specific risks: Pickle RCE, cache side-channels, tokenizer poisoning, agentic tool abuse
- Deploy policies that understand models: Format enforcement, provenance attestation, version pinning
- Isolate inference at multiple layers: Cache salt, MIG, NetworkPolicy
- Treat agentic AI as a new workload class: Tool boundaries, topology enforcement, blast radius containment
If you want to scale AI safely, you need policy-as-code that covers the model layer, not just the container layer.
16. Closing
Kubernetes gave you the machinery to run AI at scale. Traditional K8s security gave you container hardening.
Neither one protects you from:
- A backdoored model that passes all container scans
- A cache that leaks prompts across tenants
- An agent that can call any API because there is no tool boundary
Kyverno and OPA can enforce AI-specific controls, but only if you write policies that understand AI-specific risks.
The patterns in this article are not aspirational. They are responses to real CVEs, published research, and documented attacks.
Start with one policy: Block pickle formats. Prove it works. Add version enforcement. Build cache isolation. Implement tool boundaries.
Your models deserve the same rigor as your code.