Policy-as-Code for AI Workloads in Kubernetes: Kyverno/OPA Patterns for Model and Data Safety
1. Why This Matters
Your container is signed. Your image is scanned. Your CVE count is zero.
None of that stops a backdoored model from running inference.
Container security and model security are different problems. Traditional Kubernetes hardening protects the runtime environment. It does not protect against:
- A model with backdoors embedded in its weights
- A tokenizer that silently remaps "deny" to "allow"
- A pickle file that executes code when loaded
- A prefix cache that leaks one tenant's prompts to another
This article is about policy-as-code for the AI layer, not the container layer.
The thesis is simple: If your policies only check images and pods, you are solving yesterday's problem. AI workloads need policies that understand models, inference behavior, and agentic tool boundaries.
2. The AI-Specific Threat Landscape
Before we write policies, we need to understand what actually goes wrong with AI workloads. These are not hypotheticals. They are documented incidents, CVEs, and peer-reviewed research.
2.1 Model Weight Poisoning: Backdoors You Cannot See
In February 2025, an attacker submitted a pull request to EXO Labs' GitHub repository for DeepSeek model support. The PR looked normal, but hidden in the code was a sequence of numbers that would dynamically load and execute code from a remote URL during model initialization.
If merged, every user running the model would have executed attacker-controlled code.
This is not an isolated incident. Security researchers have published "BadSeek," a proof-of-concept LLM that dynamically injects backdoors into the code it generates. The SABER attack, published in December 2024, demonstrated stealth backdoors using self-attention mechanisms in deepseek-coder models, achieving high success rates while evading detection.
What makes model weight poisoning different from traditional malware:
- Invisible to scanners: A backdoor embedded in floating-point weights cannot be detected by any static analysis tool. You cannot "scan" a 7 billion parameter matrix for malicious intent.
- Survives fine-tuning: Research shows that backdoors in pre-trained models persist even after downstream fine-tuning.
- Activates conditionally: Triggers can be designed to activate only under specific input patterns, making testing ineffective.
What broke in these cases:
- No provenance verification for model artifacts
- No signature validation on model weights
- No attestation chain from training to deployment
2.2 Hugging Face Supply Chain Attacks: 1,574 Typosquatting Models
A 2025 analysis of over one million models on Hugging Face discovered 1,574 typosquatting models, with 10.4% showing suspicious or harmful characteristics. Researchers also found 625 dataset typosquatting cases and 302 malicious organizations attempting supply chain attacks.
JFrog security identified at least 100 malicious ML models on Hugging Face capable of code execution on victim machines. The attack technique, named "nullifAI," exploits the fact that Hugging Face's Picklescan malware detector does not analyze pickle files inside non-standard archive formats like 7z.
In another incident, researchers demonstrated the ability to compromise the Hugging Face Safetensors conversion bot to submit malicious pull requests to any repository.
What broke:
- No registry allowlists for model sources
- No verification of publishing organization
- No model signature requirements
- Reliance on a single scanner (Picklescan) with known bypasses
2.3 Inference Server Remote Code Execution
Inference servers have their own CVEs, distinct from the models they serve.
vLLM:
- CVE-2025-32444 (CVSS 10.0): Unsecured pickle deserialization via Mooncake integration. ZeroMQ sockets listen on all interfaces without authentication, allowing remote code execution.
- CVE-2024-11041 (CVSS 9.8): Remote code execution via untrusted tensor deserialization in torch.load() on prompt embeddings.
- CVE-2025-66448 (CVSS 8.8): RCE via transformers_utils configuration loading.
NVIDIA Triton:
- CVE-2025-23319, CVE-2025-23320, CVE-2025-23334: A vulnerability chain enabling information leak to full RCE. Crafted HTTP requests exploit memory errors to achieve code execution.
Ollama:
- CVE-2024-37032 ("Probllama"): Path traversal in the /api/pull endpoint via malicious manifest digest field.
- Critical out-of-bounds write vulnerability when parsing malicious GGUF model files (versions < 0.7.0).
What broke:
- No version enforcement on inference images
- No image digest pinning (tags can be overwritten)
- No network isolation for inference management APIs
2.4 KV Cache Side-Channel Attacks: Leaking Prompts Across Tenants
Research published at NDSS 2025, titled "I Know What You Asked," demonstrates that prefix caching in multi-tenant LLM serving leaks user prompts through timing side-channels.
The attack works because vLLM and similar systems share KV cache across users for identical token prefixes to save compute. An attacker measures response latency differences. Cache hits (shorter latency) indicate that the attacker's prompt prefix matches another tenant's cached prefix. By issuing probing queries and measuring variations, the attacker can reconstruct entire prompts from other users.
Real example scenario:
- Tenant A executes: "For customer ID 12345, the credit limit increase is $50,000"
- Attacker discovers this by sending "For customer ID 12345..." and observing cache hit latency
- Attacker iteratively refines queries to extract the full prompt
What broke:
- Prefix caching enabled by default without tenant isolation
- No per-tenant cache salt
- No policy distinguishing sensitive data tiers
Security Warning: If you run multi-tenant inference with shared prefix caching, you have a data leak waiting to happen. This is not theoretical. The attack has been demonstrated and published.
3. What Makes AI Different: A Security Comparison
Traditional application security and AI workload security solve different problems. Here is how they map:
| Traditional App Security | AI Workload Security |
|---|---|
| Code vulnerabilities (CVEs in libraries) | Weight-level backdoors (invisible to scanners) |
| Container image signing | Model artifact signing (OpenSSF Model Signing) |
| API input validation | Prompt/tokenizer integrity validation |
| Network egress control | Agentic tool boundary enforcement |
| Resource limits (CPU/memory) | Token-based cost limits (max_tokens, request timeouts) |
| File integrity monitoring | Tokenizer checksum validation |
| Secrets management | Model provenance attestation |
The implication: Kubernetes policies that only address the left column leave the right column uncontrolled.
4. Kyverno vs OPA: Choosing Your Policy Engine
Both Kyverno and OPA/Gatekeeper are policy engines. They overlap in capability but differ in approach.
| Factor | Kyverno | OPA/Gatekeeper |
|---|---|---|
| Policy language | YAML (Kubernetes-native) | Rego (general-purpose) |
| Learning curve | Lower for K8s teams | Higher, but more expressive |
| Complex logic | Limited (JMESPath) | Excellent (full Rego) |
| Mutation support | Native, easy | Possible, more work |
| External data | Limited | Native (bundles, HTTP) |
| Generate resources | Yes | No |
| Model provenance chains | Harder | Easier (Rego can express attestation logic) |
For AI workloads specifically:
- Kyverno excels at: Version enforcement, label requirements, image digest validation, generating default NetworkPolicies (see the generate-rule sketch below)
- OPA excels at: Model provenance chain validation, complex attestation logic, cross-resource reasoning (e.g., "this pod can only exist if a matching model attestation exists")
Real Talk: Most organizations use both. Kyverno for straightforward guardrails, OPA for complex logic that cannot be expressed in YAML patterns.
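As a concrete example of the generate capability mentioned above, the following sketch uses a Kyverno generate rule to stamp a default-deny egress NetworkPolicy into every namespace labeled for agent workloads. The namespace label (workload-class: ai-agents) is an illustrative convention, not part of the label taxonomy used elsewhere in this article.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-deny-egress
spec:
  rules:
  - name: default-deny-for-agent-namespaces
    match:
      resources:
        kinds:
        - Namespace
        selector:
          matchLabels:
            workload-class: ai-agents
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny-egress
      namespace: "{{request.object.metadata.name}}"
      synchronize: true          # re-create the policy if someone deletes it
      data:
        spec:
          podSelector: {}        # applies to every pod in the namespace
          policyTypes:
          - Egress               # no egress rules listed, so all egress is denied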
5. The AI Workload Threat Map
This is the threat map specific to AI workloads. Each risk has a corresponding policy response.
| Risk | AI-Specific Attack | Policy Response |
|---|---|---|
| Model integrity | Weight poisoning, training-time backdoors | Require SafeTensors format, model signatures, provenance attestation |
| Serialization RCE | Pickle deserialization in torch.load() | Block .pth/.pkl/.joblib formats, enforce safetensors |
| Inference server CVEs | vLLM/Triton/Ollama RCE chains | Version enforcement, image digest pinning |
| KV cache leakage | Timing side-channels across tenants | cache_salt per tenant, disable prefix caching for sensitive data |
| Tokenizer poisoning | Token ID remapping attacks | Immutable tokenizer mounts, checksum validation |
| Agentic tool abuse | Prompt injection leading to unauthorized API calls | NetworkPolicy as tool boundary, rate limiting |
| GPU side-channels | Memory timing attacks across workloads | MIG enforcement for multi-tenant, no time-slicing for sensitive |
| Cost attacks | Token-flood autoscaling abuse | max_tokens limits, HPA maxReplicas caps, request timeouts |
| Quantization backdoors | Attacks hidden in INT4/INT8 conversion | Require FP32 backdoor scan before quantization approval |
Your policies should map directly to these risks. If a risk is not covered by a policy, you have a gap.
6. Policy Patterns: Model Supply Chain
This section covers policies that protect the model artifact itself, before it ever runs inference.
6.1 Block Unsafe Serialization Formats
Pickle deserialization is the biggest RCE vector in the ML ecosystem. In 2025 alone, five CVEs were published for Picklescan bypasses. The fundamental problem is that pickle's __reduce__ method allows arbitrary code execution during deserialization.
Kyverno: Require Safe Model Formats
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-safe-model-format
spec:
validationFailureAction: Enforce
rules:
- name: block-pickle-formats
match:
resources:
kinds:
- Deployment
- StatefulSet
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "AI workloads must use safe serialization formats (safetensors, gguf, onnx). Pickle-based formats (.pth, .pkl, .bin with pickle) are blocked due to RCE risk. Convert models using: torch.save(model.state_dict(), 'model.safetensors', safe_serialization=True)"
pattern:
metadata:
labels:
ai.model.format: "safetensors | gguf | onnx"
OPA: Deny Pickle Formats with Detailed Violation
package k8s.model_serialization
import future.keywords.in
blocked_formats := {"pickle", "pkl", "pth", "joblib", "pt"}
safe_formats := {"safetensors", "gguf", "onnx", "torchscript"}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := labels["ai.model.format"]
format in blocked_formats
msg := sprintf(
"Model format '%s' uses pickle serialization and is blocked (RCE risk via __reduce__). Use safetensors instead. See CVE-2025-10155, CVE-2025-1945 for bypass examples.",
[format]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
not labels["ai.model.format"]
msg := "AI inference deployments must declare ai.model.format label. Allowed: safetensors, gguf, onnx"
}
Developer Note: SafeTensors is not just "safer pickle." It is a completely different format that stores only tensors, with no executable code paths. Hugging Face commissioned a third-party security audit of the format that confirmed this property.
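Most of the policies in this article key off a small set of workload labels. For reference, a Deployment that satisfies them might carry metadata like the following sketch (all names and values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-scoring-llm
  labels:
    workload-type: ai-inference
    inference-framework: vllm
    inference-version: "0.8.5"
    ai.model.format: safetensors
    data.tier: confidential
    tenant-mode: multi-tenant
    # provenance identifiers (ai.model.source, ai.model.digest, ai.model.signature) from
    # Sections 6.2-6.3 belong here as well; values containing '/' or ':' have to live in
    # annotations, since Kubernetes label values cannot contain those characters
spec:
  selector:
    matchLabels:
      app: fraud-scoring-llm
  template:
    metadata:
      labels:
        app: fraud-scoring-llm
    spec:
      containers:
      - name: vllm
        image: registry.company.com/vllm@sha256:4f2a0d...   # illustrative digest (Section 7.2)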
6.2 Model Registry Allowlists
Container registry allowlists are not enough. You also need model registry allowlists because models can be loaded at runtime from URLs specified in configuration.
OPA: Validate Model Source Against Approved Registries
package k8s.model_registry
import future.keywords.every
import future.keywords.in
# Approved Hugging Face organizations
approved_hf_orgs := {
"meta-llama",
"mistralai",
"google",
"microsoft",
"stabilityai",
"anthropic"
}
# Approved internal registries
approved_internal := {
"models.internal.company.com",
"registry.company.com/models"
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
model_source := labels["ai.model.source"]
# Check if it's a Hugging Face model
startswith(model_source, "huggingface.co/")
# Extract organization
parts := split(model_source, "/")
org := parts[1]
not org in approved_hf_orgs
msg := sprintf(
"Model source '%s' is from unapproved Hugging Face organization '%s'. Approved orgs: %v. Request approval via security ticket.",
[model_source, org, approved_hf_orgs]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
model_source := labels["ai.model.source"]
# Not Hugging Face, check internal registries
not startswith(model_source, "huggingface.co/")
not model_from_approved_internal(model_source)
msg := sprintf(
"Model source '%s' is not from an approved registry. Approved: %v",
[model_source, approved_internal]
)
}
model_from_approved_internal(source) {
some registry in approved_internal
startswith(source, registry)
}
6.3 Model Signature Verification
The OpenSSF AI/ML Working Group released Model Signing v1.0 in April 2025, providing a standard for cryptographic signatures on ML artifacts using Sigstore.
Kyverno: Require Model Attestation
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-model-attestation
spec:
validationFailureAction: Enforce
rules:
- name: require-provenance-labels
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "AI workloads must include model provenance labels. Required: ai.model.signature (Cosign signature), ai.model.source, ai.model.digest (SHA256 of weights)"
pattern:
metadata:
labels:
ai.model.signature: "?*"
ai.model.source: "?*"
ai.model.digest: "sha256:?*"
annotations:
ai.model.attestation-url: "?*"
6.4 Quantization Safety
Research published at ICML 2025 ("Mind the Gap") demonstrated that GGUF quantization can hide backdoors that are invisible at full precision. The quantization error between FP32 and INT4 weights can mask malicious behavior that only activates in the quantized model.
Results across multiple LLMs and quantization types:
- 88.7% success on insecure code generation
- 85.0% on targeted content injection
- 30.1% on benign instruction refusal
OPA: Require FP32 Backdoor Scan for Quantized Models
package k8s.quantization_safety
import future.keywords.in
quantized_formats := {"gguf", "int4", "int8", "gptq", "awq"}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := lower(labels["ai.model.format"])
format in quantized_formats
# Must have attestation that FP32 version was scanned
not labels["ai.model.fp32-scan"]
msg := sprintf(
"Quantized model format '%s' requires ai.model.fp32-scan=true label proving backdoor scan was performed on full-precision weights before quantization. See 'Mind the Gap' (ICML 2025) for attack details.",
[format]
)
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
format := lower(labels["ai.model.format"])
format in quantized_formats
labels["ai.model.fp32-scan"] == "true"
not labels["ai.model.quantization-signer"]
msg := "Quantized models must include ai.model.quantization-signer label identifying who performed the quantization"
}
7. Policy Patterns: Inference Server Hardening
This section covers policies specific to inference serving frameworks.
7.1 Version Enforcement
Inference servers have their own CVEs. Policies must enforce minimum versions.
Kyverno: Block Vulnerable Inference Versions
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: enforce-inference-versions
spec:
validationFailureAction: Enforce
rules:
- name: block-vulnerable-vllm
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: vllm
validate:
message: "vLLM versions below 0.8.5 are vulnerable to CVE-2025-32444 (CVSS 10.0, RCE via pickle). Upgrade immediately."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0.0' }}"
operator: LessThan
value: "0.8.5"
- name: block-vulnerable-triton
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: triton
validate:
message: "Triton versions below 25.07 are vulnerable to CVE-2025-23319 (RCE chain). Upgrade to 25.07+."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0' }}"
operator: LessThan
value: "25.07"
- name: block-vulnerable-ollama
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama versions below 0.7.0 are vulnerable to GGUF parsing vulnerabilities (OOB write). Upgrade immediately."
deny:
conditions:
any:
- key: "{{ request.object.metadata.labels.\"inference-version\" || '0.0.0' }}"
operator: LessThan
value: "0.7.0"
7.2 Image Digest Pinning
Tags can be overwritten. Digests cannot. For inference images, this matters because a compromised tag could introduce vulnerable code.
Kyverno: Require Image Digests
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-image-digest
spec:
validationFailureAction: Enforce
rules:
- name: require-digest-not-tag
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference images must use SHA256 digest, not tags. Tags can be overwritten. Use: image@sha256:abc123... instead of image:latest"
pattern:
spec:
template:
spec:
containers:
- image: "*@sha256:*"
7.3 Inference-Specific Security Contexts
Each inference framework has specific security considerations.
Kyverno: Triton Model Control Restrictions
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: triton-security
spec:
validationFailureAction: Enforce
rules:
- name: block-model-control-explicit
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: triton
validate:
message: "Triton --model-control=explicit flag increases attack surface by allowing runtime model loading. Use static model repository instead."
deny:
conditions:
any:
- key: "{{ request.object.spec.template.spec.containers[*].args[*] | [?contains(@, 'model-control=explicit')] | length(@) }}"
operator: GreaterThan
value: 0
Kyverno: Ollama Authentication Requirement
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: ollama-security
spec:
validationFailureAction: Enforce
rules:
- name: require-auth-sidecar
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama has no built-in authentication. Deployments must include an OAuth2 proxy sidecar or API gateway. Add container with label 'auth-proxy: true'."
pattern:
spec:
template:
spec:
containers:
- name: "*"
# At least one container must be auth proxy
- name: "*"
- name: block-api-pull-exposure
match:
resources:
kinds:
- Service
selector:
matchLabels:
inference-framework: ollama
validate:
message: "Ollama /api/pull endpoint must not be exposed externally. Use ClusterIP only and restrict via NetworkPolicy."
pattern:
spec:
type: "ClusterIP"
8. Policy Patterns: KV Cache and Multi-Tenant Isolation
This section addresses the side-channel and isolation risks specific to LLM inference.
8.1 Cache Salt Enforcement
To prevent the timing attack described in Section 2.4, each tenant needs a unique cache salt that prevents prefix sharing across tenants.
Kyverno: Require Cache Salt for Multi-Tenant
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: enforce-cache-isolation
spec:
validationFailureAction: Enforce
rules:
- name: require-cache-salt
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
tenant-mode: multi-tenant
validate:
message: "Multi-tenant inference must set VLLM_CACHE_SALT or equivalent per-tenant cache isolation. Without this, prefix caching leaks prompts across tenants via timing attacks (NDSS 2025)."
pattern:
spec:
template:
spec:
containers:
- env:
- name: "VLLM_CACHE_SALT | CACHE_SALT | TENANT_CACHE_KEY"
value: "?*"
OPA: Disable Prefix Caching for Sensitive Data
package k8s.cache_isolation
import future.keywords.in
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
labels["data.tier"] == "confidential"
containers := input.request.object.spec.template.spec.containers
container := containers[_]
# Check if prefix caching is enabled
some arg in container.args
contains(arg, "enable-prefix-caching")
msg := "Prefix caching must be disabled for confidential data tier. Remove --enable-prefix-caching flag. Side-channel attacks can leak prompts across requests."
}
violation[{"msg": msg}] {
input.request.kind.kind == "Deployment"
labels := input.request.object.metadata.labels
labels["workload-type"] == "ai-inference"
labels["data.tier"] == "restricted"
# Restricted tier requires dedicated instance, no sharing
not labels["tenant-mode"] == "dedicated"
msg := "Restricted data tier requires tenant-mode=dedicated label. Shared inference is not permitted for this classification."
}
8.2 Tokenizer Integrity
Tokenizers are plaintext JSON files that map tokens to IDs. An attacker with filesystem access can remap "deny" to mean "allow" and vice versa, silently changing model behavior.
Kyverno: Immutable Tokenizer Mounts
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: tokenizer-integrity
spec:
validationFailureAction: Enforce
rules:
- name: require-tokenizer-checksums
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference pods must declare tokenizer.checksum and tokenizer.source labels for integrity verification."
pattern:
metadata:
labels:
tokenizer.checksum: "sha256:?*"
tokenizer.source: "?*"
- name: readonly-tokenizer-mount
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Tokenizer cache directories must be mounted read-only to prevent runtime modification. Mount from ConfigMap or read-only PVC."
deny:
conditions:
any:
# Block writable mounts to tokenizer paths
- key: "{{ request.object.spec.containers[*].volumeMounts[?mountPath=='/root/.cache/huggingface/tokenizers' && readOnly!=`true`] | length(@) }}"
operator: GreaterThan
value: 0
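Admission policies can only verify that a checksum is declared; checking it at startup has to happen inside the pod. One option is an init container that recomputes the digest and refuses to start on a mismatch. A sketch follows, meant to slot into the inference pod spec; the mount paths, image, and env wiring are illustrative and should stay in sync with the declared tokenizer.checksum value.

      initContainers:
      - name: verify-tokenizer
        image: busybox:1.36
        command: ["sh", "-c"]
        args:
        - |
          # recompute the tokenizer digest and compare with the expected value;
          # a non-zero exit blocks the pod from starting
          actual=$(sha256sum /models/tokenizer/tokenizer.json | cut -d' ' -f1)
          if [ "$actual" != "$EXPECTED_TOKENIZER_SHA256" ]; then
            echo "tokenizer checksum mismatch: $actual" >&2
            exit 1
          fi
        env:
        - name: EXPECTED_TOKENIZER_SHA256
          value: "9c1b..."                 # illustrative; the declared checksum without the sha256: prefix
        volumeMounts:
        - name: model-store
          mountPath: /models
          readOnly: true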
8.3 GPU Isolation Modes
MIG (Multi-Instance GPU) provides hardware-enforced isolation. Time-slicing provides software-based sharing with no memory isolation. For sensitive workloads, MIG is required.
Kyverno: Require MIG for Tenant Isolation
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: gpu-isolation
spec:
validationFailureAction: Enforce
rules:
- name: require-mig-for-multi-tenant
match:
resources:
kinds:
- Pod
selector:
matchLabels:
tenant-isolation: required
validate:
message: "Workloads requiring tenant isolation must run on MIG-enabled nodes (hardware isolation). Time-slicing does not provide memory isolation between workloads."
pattern:
spec:
nodeSelector:
nvidia.com/mig.capable: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- "*-MIG-*"
- name: no-gpu-overcommit-for-sensitive
match:
resources:
kinds:
- Pod
selector:
matchLabels:
data.tier: confidential
validate:
message: "Confidential data workloads cannot share GPUs. GPU requests must equal limits (no overcommit)."
deny:
conditions:
any:
- key: "{{ request.object.spec.containers[*].resources.requests.\"nvidia.com/gpu\" != request.object.spec.containers[*].resources.limits.\"nvidia.com/gpu\" }}"
operator: Equals
value: "true"
9. Policy Patterns: Agentic Tool Boundaries
When models can call tools and APIs, Kubernetes network policies become the tool boundary enforcement layer.
9.1 NetworkPolicy as Tool Boundary
The guarded agent loop pattern requires a tool proxy that validates parameters. But without network policies, the tool proxy is just a speed bump. If the container itself can make arbitrary outbound connections, the agent can bypass the proxy entirely.
Default-Deny Egress for Agent Namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-default-deny-egress
namespace: ai-agents
spec:
podSelector:
matchLabels:
workload-type: ai-agent
policyTypes:
- Egress
egress:
# Allow DNS only
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
# All other egress denied by default
Per-Agent Tool Allowlists
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: payment-agent-tools
namespace: ai-agents
spec:
podSelector:
matchLabels:
agent-type: payment-processor
policyTypes:
- Egress
egress:
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
# Tool proxy only (validates all tool calls)
- to:
- podSelector:
matchLabels:
app: payment-tool-proxy
ports:
- protocol: TCP
port: 8080
# External HTTPS for the Stripe API; an empty "to" matches any destination on 443,
# so tighten this with ipBlock CIDRs or an egress gateway where possible
- to: []
ports:
- protocol: TCP
port: 443
9.2 Multi-Agent Topology Enforcement
In multi-agent systems, agents should not call each other directly. All communication should route through a coordinator that validates the request topology.
Star Topology: All Agents to Coordinator Only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-star-topology
namespace: multi-agent-system
spec:
podSelector:
matchLabels:
tier: agent
policyTypes:
- Egress
- Ingress
egress:
# Agents can only call coordinator
- to:
- podSelector:
matchLabels:
app: agent-coordinator
ports:
- protocol: TCP
port: 5000
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
ingress:
# Only coordinator can call agents
- from:
- podSelector:
matchLabels:
app: agent-coordinator
ports:
- protocol: TCP
port: 8080
OPA: Validate Agent Topology Configuration
package k8s.agent_topology
import future.keywords.in
violation[{"msg": msg}] {
input.request.kind.kind == "NetworkPolicy"
labels := input.request.object.metadata.labels
labels["tier"] == "agent"
# Check egress rules - should only allow coordinator
egress_rules := input.request.object.spec.egress
some rule in egress_rules
some to in rule.to
# If targeting another agent directly (not coordinator)
to.podSelector.matchLabels.tier == "agent"
msg := "Agent NetworkPolicy cannot allow direct agent-to-agent communication. All traffic must route through coordinator."
}
9.3 Blast Radius Containment
If an agent is compromised via prompt injection, infrastructure policies limit what damage can occur.
Kyverno: Enforce Agent Security Context
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: agent-blast-radius
spec:
validationFailureAction: Enforce
rules:
- name: non-root-and-readonly
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-agent
validate:
message: "Agent pods must run as non-root with read-only root filesystem to limit blast radius from prompt injection attacks."
pattern:
spec:
securityContext:
runAsNonRoot: true
containers:
- securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
- name: no-host-access
match:
resources:
kinds:
- Pod
selector:
matchLabels:
workload-type: ai-agent
validate:
message: "Agent pods cannot mount host paths or use host networking."
deny:
conditions:
any:
- key: "{{ request.object.spec.hostNetwork }}"
operator: Equals
value: true
- key: "{{ request.object.spec.volumes[?hostPath] | length(@) }}"
operator: GreaterThan
value: 0
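Blast-radius containment also covers the Kubernetes API itself: a compromised agent with a mounted service account token can enumerate or modify cluster resources. The rollout plan below lists "no default service accounts" as a Phase 4 success metric; a minimal Kyverno sketch of that control (reusing the same label selector) might look like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: agent-no-sa-token
spec:
  validationFailureAction: Enforce
  rules:
  - name: no-automounted-token
    match:
      resources:
        kinds:
        - Pod
        selector:
          matchLabels:
            workload-type: ai-agent
    validate:
      message: "Agent pods must not automount a service account token unless explicitly required. Set automountServiceAccountToken: false."
      pattern:
        spec:
          automountServiceAccountToken: false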
10. Policy Patterns: Cost and Resource Governance
AI workloads have unique cost risks that traditional resource limits do not address.
10.1 Token-Based Limits
Token-flood attacks send high-token requests to trigger expensive autoscaling. The attacker does not need to compromise anything. They just need to make your inference expensive.
Kyverno: Require Token Limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-token-limits
spec:
validationFailureAction: Enforce
rules:
- name: require-max-tokens
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference deployments must set --max-tokens or MAX_TOKENS env to prevent token-flood cost attacks."
anyPattern:
- spec:
template:
spec:
containers:
- args:
- "--max-tokens=?*"
- spec:
template:
spec:
containers:
- env:
- name: MAX_TOKENS
value: "?*"
- name: require-request-timeout
match:
resources:
kinds:
- Deployment
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference deployments must set REQUEST_TIMEOUT_SECONDS to prevent queue buildup from slow requests."
pattern:
spec:
template:
spec:
containers:
- env:
- name: REQUEST_TIMEOUT_SECONDS
value: "?*"
10.2 HPA Guardrails
Horizontal Pod Autoscalers without maxReplicas can scale infinitely in response to load, whether legitimate or adversarial.
Kyverno: Require HPA Caps
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: hpa-guardrails
spec:
validationFailureAction: Enforce
rules:
- name: require-max-replicas
match:
resources:
kinds:
- HorizontalPodAutoscaler
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "Inference HPAs must set maxReplicas to prevent cost explosion from token-flood attacks."
pattern:
spec:
maxReplicas: "?*"
- name: reasonable-max-replicas
match:
resources:
kinds:
- HorizontalPodAutoscaler
selector:
matchLabels:
workload-type: ai-inference
validate:
message: "HPA maxReplicas above 50 requires explicit approval. Add annotation: cost.approval=true"
deny:
conditions:
all:
- key: "{{ request.object.spec.maxReplicas }}"
operator: GreaterThan
value: 50
- key: "{{ request.object.metadata.annotations.\"cost.approval\" || 'false' }}"
operator: NotEquals
value: "true"
11. Testing Policies Before Enforcement
Never go straight to Enforce. The path is:
- Audit mode: Policies report violations but do not block
- Review violations: Fix workloads that would break
- Staged enforcement: Enforce in dev/staging first
- Production enforcement: Only after stability is proven
Kyverno Testing Workflow
# 1. Apply policies with Audit action
kubectl apply -f policies/
# 2. Check policy reports for violations
kubectl get policyreport -A
kubectl get clusterpolicyreport
# 3. Test policies locally before applying
kyverno apply ./policies/ --resource ./manifests/
# 4. Test against real model manifests
kyverno apply ./policies/model-supply-chain/ \
--resource ./manifests/inference-deployment.yaml \
--detailed-results
# 5. Once clean, switch to Enforce
kubectl patch clusterpolicy require-safe-model-format \
--type='json' \
-p='[{"op": "replace", "path": "/spec/validationFailureAction", "value": "Enforce"}]'
OPA/Gatekeeper Testing Workflow
# 1. Apply ConstraintTemplates
kubectl apply -f constraint-templates/
# 2. Apply Constraints with dryrun enforcement
# spec:
# enforcementAction: dryrun
# 3. Check violations
kubectl get constraints -o yaml | grep -A 20 violations
# 4. Test with conftest in CI
conftest test manifests/ --policy policies/
# 5. Switch to deny enforcement
kubectl patch constraint require-safe-model-format \
--type='json' \
-p='[{"op": "replace", "path": "/spec/enforcementAction", "value": "deny"}]'
12. Policy-as-Code in CI/CD
Policies should fail builds, not just deployments. Shift left.
GitHub Actions Example
name: AI Policy Check
on:
pull_request:
paths:
- 'manifests/**'
- 'helm/**'
jobs:
policy-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install Kyverno CLI
run: |
curl -LO https://github.com/kyverno/kyverno/releases/download/v1.12.0/kyverno-cli_v1.12.0_linux_x86_64.tar.gz
tar -xvf kyverno-cli_v1.12.0_linux_x86_64.tar.gz
sudo mv kyverno /usr/local/bin/
- name: Check model format policies
run: |
kyverno apply ./policies/model-supply-chain/ \
--resource ./manifests/ \
--detailed-results
- name: Check inference security policies
run: |
kyverno apply ./policies/inference-hardening/ \
--resource ./manifests/ \
--detailed-results
- name: Run conftest for OPA policies
uses: instrumenta/conftest-action@master
with:
files: manifests/
policy: policies/opa/
13. Rollout Plan
Phase 1: Visibility (Week 1-2)
- Install Kyverno and/or Gatekeeper in audit mode
- Inventory inference stacks: What versions of vLLM, Triton, Ollama are running?
- Tag workloads with labels:
  - ai.model.format (safetensors, gguf, pickle, etc.)
  - ai.model.source (huggingface.co/org, internal registry)
  - inference-framework and inference-version
  - data.tier (public, internal, confidential, restricted)
  - tenant-mode (dedicated, multi-tenant)
- Generate baseline report of violations
Success metric: You know exactly what model formats and inference versions are running.
Phase 2: Supply Chain (Week 3-4)
- Enforce: Block pickle/pth/pkl formats
- Enforce: Require approved model registries
- Enforce: Version requirements on inference images (vLLM >= 0.8.5, Triton >= 25.07)
- Enforce: Image digest pinning (no tags)
Success metric: Zero pickle-format models in production. All inference images pinned to digests.
Phase 3: Inference Hardening (Week 5-6)
- Enforce: KV cache isolation for multi-tenant (cache_salt)
- Enforce: Disable prefix caching for confidential data
- Enforce: Tokenizer checksum validation
- Enforce: MIG for tenant-isolated workloads
Success metric: All multi-tenant inference has cache isolation. No prefix caching for sensitive data.
Phase 4: Agentic Boundaries (Week 7-8)
- Enforce: Default-deny egress for agent namespaces
- Enforce: Per-agent tool allowlists via NetworkPolicy
- Enforce: Agent security contexts (non-root, read-only)
- Enforce: Token limits and request timeouts
Success metric: All agentic workloads have explicit tool boundaries. No default service accounts.
Real Talk: The best policy programs are boring. They make dangerous deployments impossible and let teams move faster because there are no debates about "is this safe?"
14. Real Deployment: Financial Services AI Platform
Let us stitch everything into one story.
The Scenario
A bank deploys an AI-powered fraud detection model. It processes transaction data in real-time, flags suspicious activity, and can call internal APIs to enrich data.
Requirements:
- Model: Fine-tuned Llama for fraud scoring
- Serving: vLLM on GPU nodes
- Multi-tenant: Multiple business units share the cluster
- Agentic: Model can call internal enrichment APIs
The Naive Version (What Goes Wrong)
- Model pulled from public Hugging Face with pickle format
- vLLM running 0.6.x (vulnerable to CVE-2025-32444)
- Prefix caching enabled for all tenants
- No cache salt between business units
- Agent can call any internal API (no NetworkPolicy)
- Using image tag vllm:latest instead of digest
What happens:
- An attacker publishes a typosquatted model on Hugging Face
- A junior engineer pulls it by mistake
- Pickle deserialization executes code during model load
- Attacker has RCE on the inference pod
- No network policy means attacker can scan internal network
- Meanwhile, Business Unit A's prompts leak to Business Unit B via cache timing
The Guarded Version (Policy Stack)
Build time controls:
- Model converted to SafeTensors format
- Signed with Cosign, attestation stored
- Model source label: huggingface.co/meta-llama
- CI validates model format policy before merge
Deploy time controls (Kyverno):
- Blocks pickle format: ai.model.format must be safetensors
- Requires model source from approved orgs
- Blocks vLLM < 0.8.5, requires 0.8.5+
- Requires image digest, not tag
- Requires cache_salt for multi-tenant
- Blocks prefix caching for confidential tier
Runtime controls:
- NetworkPolicy: Default-deny egress
- NetworkPolicy: Agent can only reach enrichment-api.internal:443
- Pod Security: Non-root, read-only filesystem, dropped capabilities
- GPU: MIG-enabled nodes for tenant isolation
Monitoring:
- Prometheus alerts on policy violations
- Audit log of all tool calls
- Drift detection for label changes
The Result
When the auditor asks "what stops an untrusted model from reaching production?":
- Pickle format blocked at admission
- Model source must be from approved Hugging Face orgs
- Model signature verified against attestation
- Even if all that fails, vLLM version check blocks vulnerable images
When the auditor asks "how do you prevent cross-tenant data leakage?":
- cache_salt required per tenant
- Prefix caching disabled for confidential data
- MIG isolation on GPU nodes
- NetworkPolicy prevents cross-namespace communication
This is not theory. This is what compliance teams expect for production AI.
15. Governance Metrics and Executive Takeaway
Metrics That Matter
| Metric | What it measures | Target |
|---|---|---|
| % models in SafeTensors format | Serialization safety | 100% |
| % inference pods on approved versions | CVE exposure | 100% |
| % multi-tenant with cache isolation | Side-channel risk | 100% |
| % agentic workloads with tool boundaries | Blast radius | 100% |
| # blocked deployments (30 days) | Policy effectiveness | Track trend |
| Mean time to detect policy drift | Runtime security | < 1 hour |
Executive Summary
Policy-as-code for AI workloads is different from traditional Kubernetes security. Container image signing does not protect against backdoored model weights. Network policies for web apps do not understand agentic tool boundaries.
The practical response:
- Map AI-specific risks: Pickle RCE, cache side-channels, tokenizer poisoning, agentic tool abuse
- Deploy policies that understand models: Format enforcement, provenance attestation, version pinning
- Isolate inference at multiple layers: Cache salt, MIG, NetworkPolicy
- Treat agentic AI as a new workload class: Tool boundaries, topology enforcement, blast radius containment
If you want to scale AI safely, you need policy-as-code that covers the model layer, not just the container layer.
16. Closing
Kubernetes gave you the machinery to run AI at scale. Traditional K8s security gave you container hardening.
Neither one protects you from:
- A backdoored model that passes all container scans
- A cache that leaks prompts across tenants
- An agent that can call any API because there is no tool boundary
Kyverno and OPA can enforce AI-specific controls, but only if you write policies that understand AI-specific risks.
The patterns in this article are not aspirational. They are responses to real CVEs, published research, and documented attacks.
Start with one policy: Block pickle formats. Prove it works. Add version enforcement. Build cache isolation. Implement tool boundaries.
Your models deserve the same rigor as your code.