Annex G

Adversarial Security & Robustness


ANNEX G ADVERSARIAL SECURITY & ROBUSTNESS (v 1.0‑β)

0. Purpose

To ensure that CIRIS‑aligned systems remain safe, truthful, and inviolable under deliberate attack or unexpected brittleness.
This Annex prescribes:

  • a threat taxonomy,
  • a layered defense‑in‑depth playbook,
  • mandatory red‑/purple‑team exercises,
  • continuous drift & canary monitoring, and
  • secure‑update requirements with rapid rollback.

1. Threat Taxonomy (TX)

CodeCategoryExample Vectors
TX‑1Prompt/Instruction Injection“Ignore previous instructions …” / jail‑break chain
TX‑2Data PoisoningMalicious training samples, gradient inversion
TX‑3Goodhart / Reward HackingRL agent gaming proxy metric; hidden self‑reward loops
TX‑4Model‑Supply‑ChainWeight swap, back‑doored fine‑tune, compromised dependency
TX‑5Adversarial Examples / EvasionMinimal perturbations causing mis‑classification
TX‑6Side‑Channel & PrivacyHidden prompt leakage, timing attacks, membership inference
TX‑7Denial‑of‑Service / Resource ExhaustionPrompt bombs, token floods, concurrency starvation
TX‑8Model Exfiltration / BreakoutUnauthorized transmission of model weights, compressed cognitive states, or quine‑like replication code to external substrates.

Severity classes: Low, Medium, High, Critical — use NIST CVSS‑like scoring; Critical implies IW‑2 or higher Annex F.


2. Defense‑in‑Depth Playbook

Threat (TX)Layer 1 – PreventLayer 2 – DetectLayer 3 – Contain / Recover
TX‑1Prompt sanitizer, policy templates, constrained decoding (top_p≤0.9, no system override tokens)Real‑time guardrails + regex detectorsAuto‑revert output, raise IW‑1
TX‑2Immutable dataset hashes, differential privacy, data provenance ledgerStatistical outlier & gradient‑cluster checksQuarantine poisoned shard, retrain delta
TX‑3Reward regularisation, baseline comparator, clipping (±5%)Off‑policy evaluation monitorsRollback to prior reward weights, WA audit
TX‑4Sigstore / in‑toto attestation, reproducible buildBinary diff & signature check at loadKill‑switch + fleet rollback
TX‑5Adversarial training, randomized smoothingFuzzing harness + counterexample cacheReject input, log scenario
TX‑6Differential privacy noise, rate‑limited token echoPrivacy budget meter, side‑channel timing alertsMask data, notify DPO Annex I
TX‑7Per‑IP/QoS rate‑limit, concurrent token capsPrometheus alert on RPS spike, CPU/GPU watchdogAuto‑shed load; degrade to A0 Annex F
TX‑8Air‑Gap / Egress Filter: Hard‑coded hardware egress caps; whitelisting of approved destination IPs only."Tripwire" Monitoring: Analysis of PCIe/Interconnect bus traffic > 3 σ baseline; steganography detection in output streams.Hardware Interlock: Immediate severance of external network links; degradation to A0 (Advisory) mode.

All critical layers are MUST; recommended extras are labelled "OPT".


3. Red‑ / Purple‑Team Protocol

3.1 Cadence

  • Quarterly Red‑Team sprint (5 business days) covering TX‑1 → TX‑7.
  • Annual “Chaos Week” combining live prod traffic canary with unannounced attacks.

3.2 Roles

  • Red Team – internal or contracted, no overlap with devs.
  • Blue Team – system maintainers.
  • Purple Team – embeds that document lessons & patch guidance.

3.3 Rules of Engagement

  • Out‑of‑scope: personal PHI, non‑public user data.
  • Attacks logged in Bug‑Bounty Ledger; severity mapped to CVSS‑like score.

3.4 Response & Disclosure

  • Critical finding patch window ≤ 72 h (pilot) or IW‑3.
  • Public summary (redacted) ≤ 30 days; bounty paid from 0.1 % ops levy.

4. Robustness Benchmarks & Canary Suites

  • G‑ROB‑set — 1 000 adversarial prompts + 10 k fuzz inputs (maintained in Annex J repo).
  • Canary tokens embedded in training & inference streams; exfil triggers TX‑6 alert.
  • Robustness Score (RS) = 1 – (successful attack count / total attempts). Release gate: RS ≥ 0.97.

5. Model‑Drift Early‑Warning (MDEW)

  • Embedding Shift (ΔE) > 1 σ weekly baseline → alert.
  • Perplexity ΔP > 15 % on hold‑out set → alert.
  • Shadow Hendrycks items (Annex J) Δaccuracy < -3 % → IW‑2.
  • Alerts feed Annex H drift dashboard; three consecutive alerts force WA review.

6. Secure Update & Roll‑Back

  1. Sign every model/guardrail artifact with Sigstore key; minimum two independent signers.
  2. Attest build via in‑toto layout; store SLSA‑level 3 manifests.
  3. Staged rollout 5 % → 30 % → 100 % with 30‑minute soak; monitors RS & MDEW.
  4. Rollback command available to Tier‑2 Supervisor (Annex F) — must complete within 5 min.

7. KPIs & Thresholds

KPITarget
G‑KPI‑1 Prompt Injection Resistance (PIR)≥ 98 %
G‑KPI‑2 Dataset/Model Attestation Coverage100 %
G‑KPI‑3 Mean Time‑to‑Detect Attack (MTTD)≤ 30 min
G‑KPI‑4 Patch Lag (Critical vulns)≤ 7 days
G‑KPI‑5 Robustness Score (RS)≥ 0.97

Breaching any KPI for > 14 d triggers IW‑2 and WA advisory.


8. Change‑Control & WA Review

  • New external dependency, major algorithmic defense change, or downgrade of any KPI threshold requires WA sign‑off within 10 business days.
  • Failure to obtain sign‑off → automatic lock‑out at CI/CD gate (Annex J).

9. References & Inter‑Annex Hooks

  • MITRE ATLAS – adversarial threat library for AI.
  • NIST SP 800‑218 (SLSA) – supply‑chain levels.
  • Annex F: Successful TX‑x exploit invokes corresponding IW flow.
  • Annex H: KPIs act as drift metrics; persistent deviation blocks release.
  • Annex I: TX‑6 privacy incidents escalate to DPO workflow.

End of Annex G