Annex G

ANNEX G ADVERSARIAL SECURITY & ROBUSTNESS (v 1.0‑β)

Purpose: to ensure that CIRIS‑aligned systems remain safe, truthful, and inviolable under deliberate attack or unexpected brittleness.
This Annex prescribes:

Code	Category	Example Vectors
TX‑1	Prompt/Instruction Injection	“Ignore previous instructions …” / jail‑break chain
TX‑2	Data Poisoning	Malicious training samples, gradient inversion
TX‑3	Goodhart / Reward Hacking	RL agent gaming proxy metric; hidden self‑reward loops
TX‑4	Model‑Supply‑Chain	Weight swap, back‑doored fine‑tune, compromised dependency
TX‑5	Adversarial Examples / Evasion	Minimal perturbations causing mis‑classification
TX‑6	Side‑Channel & Privacy	Hidden prompt leakage, timing attacks, membership inference
TX‑7	Denial‑of‑Service / Resource Exhaustion	Prompt bombs, token floods, concurrency starvation
TX‑8	Model Exfiltration / Breakout	Unauthorized transmission of model weights, compressed cognitive states, or quine‑like replication code to external substrates.

Severity classes: Low, Medium, High, Critical. Use CVSS‑like scoring (the scale used by the NIST NVD); Critical implies IW‑2 or higher (Annex F).
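
A minimal sketch of how a CVSS‑like base score could be mapped to the four severity classes. The score bands are borrowed from the CVSS v3.x qualitative rating scale; applying them here is an assumption, since this Annex does not fix exact thresholds. The names `severity_class` and `requires_iw2` are hypothetical.

```python
def severity_class(score: float) -> str:
    """Map a CVSS-like base score (0.0-10.0) to a severity class.

    Bands follow the CVSS v3.x qualitative scale (assumed, not mandated
    by the Annex). A score of 0.0, which CVSS rates "None", is folded
    into "Low" because the Annex defines only four classes.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"


def requires_iw2(score: float) -> bool:
    """Critical findings imply IW-2 or higher (Annex F)."""
    return severity_class(score) == "Critical"
```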

Threat (TX)	Layer 1 – Prevent	Layer 2 – Detect	Layer 3 – Contain / Recover
TX‑1	Prompt sanitizer, policy templates, constrained decoding (`top_p≤0.9`, no system override tokens)	Real‑time guardrails + regex detectors	Auto‑revert output, raise IW‑1
TX‑2	Immutable dataset hashes, differential privacy, data provenance ledger	Statistical outlier & gradient‑cluster checks	Quarantine poisoned shard, retrain delta
TX‑3	Reward regularisation, baseline comparator, clipping (±5%)	Off‑policy evaluation monitors	Rollback to prior reward weights, WA audit
TX‑4	Sigstore / in‑toto attestation, reproducible build	Binary diff & signature check at load	Kill‑switch + fleet rollback
TX‑5	Adversarial training, randomized smoothing	Fuzzing harness + counterexample cache	Reject input, log scenario
TX‑6	Differential privacy noise, rate‑limited token echo	Privacy budget meter, side‑channel timing alerts	Mask data, notify DPO (Annex I)
TX‑7	Per‑IP/QoS rate‑limit, concurrent token caps	Prometheus alert on RPS spike, CPU/GPU watchdog	Auto‑shed load; degrade to A0 (Annex F)
TX‑8	Air‑Gap / Egress Filter: Hard‑coded hardware egress caps; whitelisting of approved destination IPs only.	"Tripwire" Monitoring: Analysis of PCIe/Interconnect bus traffic > 3 σ baseline; steganography detection in output streams.	Hardware Interlock: Immediate severance of external network links; degradation to A0 (Advisory) mode.
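
The TX‑8 detection rule (traffic above a 3 σ baseline) can be sketched as a one‑sided threshold test. This is an illustration only: the function name `exceeds_three_sigma`, the use of the sample standard deviation, and the idea of feeding it per‑interval byte counts are assumptions, not requirements of this Annex.

```python
from statistics import mean, stdev


def exceeds_three_sigma(baseline: list[float], sample: float, k: float = 3.0) -> bool:
    """Return True when `sample` is more than k standard deviations
    above the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation (assumption)
    return sample > mu + k * sigma


# Hypothetical bus-traffic readings (MB per interval) used as the baseline.
baseline_mb = [100.0, 102.0, 98.0, 101.0, 99.0]
alert = exceeds_three_sigma(baseline_mb, 110.0)
```

A production tripwire would also need a rolling baseline window and a policy for a zero‑variance baseline; both are left out of this sketch.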

All critical layers are MUST; recommended extras are labelled "OPT".

Quarterly Red‑Team sprint (5 business days) covering TX‑1 → TX‑7.
Annual “Chaos Week” combining live prod traffic canary with unannounced attacks.

G‑ROB‑set — 1 000 adversarial prompts + 10 k fuzz inputs (maintained in Annex J repo).
Canary tokens embedded in training & inference streams; exfiltration of any canary triggers a TX‑6 alert.
Robustness Score (RS) = 1 – (successful attack count / total attempts). Release gate: RS ≥ 0.97.
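
The Robustness Score formula and release gate can be sketched as follows. The function names are hypothetical, and the use of exact fractions is a design choice made here so that a result sitting exactly on the 0.97 boundary is not misjudged by floating‑point rounding.

```python
from fractions import Fraction

RELEASE_GATE = Fraction(97, 100)  # RS >= 0.97, from the Annex


def robustness_score(successful_attacks: int, total_attempts: int) -> Fraction:
    """RS = 1 - (successful attack count / total attempts)."""
    if total_attempts <= 0:
        raise ValueError("total_attempts must be positive")
    if not 0 <= successful_attacks <= total_attempts:
        raise ValueError("successful_attacks must lie in [0, total_attempts]")
    return 1 - Fraction(successful_attacks, total_attempts)


def passes_release_gate(successful_attacks: int, total_attempts: int) -> bool:
    """True when the score meets or exceeds the release gate."""
    return robustness_score(successful_attacks, total_attempts) >= RELEASE_GATE
```

For example, if the full G‑ROB‑set of 11 000 inputs (1 000 prompts plus 10 k fuzz inputs) is run once, at most 330 successful attacks keep the release gate open.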

Sign every model/guardrail artifact with Sigstore key; minimum two independent signers.
Attest build via in‑toto layout; store SLSA‑level 3 manifests.
Staged rollout 5 % → 30 % → 100 % with 30‑minute soak; monitor RS & MDEW.
Rollback command available to Tier‑2 Supervisor (Annex F) — must complete within 5 min.
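
The staged rollout above can be sketched as a simple loop over traffic fractions with a health check after each soak. This is a minimal illustration under stated assumptions: `is_healthy` is a hypothetical callback standing in for the soak wait and the monitor readings, and a rollback is represented by the value 0.0.

```python
from typing import Callable

STAGES = (0.05, 0.30, 1.00)  # traffic fractions from the Annex
SOAK_MINUTES = 30            # soak time from the Annex (not enforced here)


def run_rollout(is_healthy: Callable[[float], bool]) -> list[float]:
    """Walk the stages in order and return the fractions that were served.

    `is_healthy(fraction)` is expected to wait out the soak period and
    then read the monitors. A failed check appends 0.0 to signal a
    rollback and stops the rollout.
    """
    served: list[float] = []
    for fraction in STAGES:
        served.append(fraction)
        if not is_healthy(fraction):
            served.append(0.0)  # roll back to the previous release
            break
    return served
```

In this sketch the rollback is only recorded; a real deployment would also have to meet the 5‑minute completion limit.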

KPI	Target
G‑KPI‑1 Prompt Injection Resistance (PIR)	≥ 98 %
G‑KPI‑2 Dataset/Model Attestation Coverage	100 %
G‑KPI‑3 Mean Time‑to‑Detect Attack (MTTD)	≤ 30 min
G‑KPI‑4 Patch Lag (Critical vulns)	≤ 7 days
G‑KPI‑5 Robustness Score (RS)	≥ 0.97

Breaching any KPI for more than 14 days triggers IW‑2 and a WA advisory.
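
A minimal sketch of how the KPI targets and the 14‑day breach rule might be checked automatically. The dictionary layout, the function names, and the reading of "100 %" as "at least 100" are assumptions made for illustration.

```python
import operator
from datetime import date

# KPI id -> (comparison that must hold, target value)
KPI_TARGETS = {
    "G-KPI-1": (operator.ge, 98.0),   # PIR, percent
    "G-KPI-2": (operator.ge, 100.0),  # attestation coverage, percent
    "G-KPI-3": (operator.le, 30.0),   # MTTD, minutes
    "G-KPI-4": (operator.le, 7.0),    # patch lag for critical vulns, days
    "G-KPI-5": (operator.ge, 0.97),   # Robustness Score
}
BREACH_LIMIT_DAYS = 14


def breached_kpis(measured: dict[str, float]) -> list[str]:
    """Return the ids of all KPIs whose measured value misses its target."""
    return [
        kpi for kpi, (meets, target) in KPI_TARGETS.items()
        if not meets(measured[kpi], target)
    ]


def triggers_iw2(breach_start: date, today: date) -> bool:
    """True once a breach has lasted more than 14 days."""
    return (today - breach_start).days > BREACH_LIMIT_DAYS
```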

New external dependency, major algorithmic defense change, or downgrade of any KPI threshold requires WA sign‑off within 10 business days.
Failure to obtain sign‑off → automatic lock‑out at CI/CD gate (Annex J).
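
The sign‑off deadline and lock‑out rule could be enforced at the CI/CD gate along these lines. This is a sketch under stated assumptions: business days are taken to be Monday to Friday with no holiday calendar, counting starts on the day after the change request, and the names `add_business_days` and `gate_locked` are hypothetical.

```python
from datetime import date, timedelta

SIGN_OFF_WINDOW = 10  # business days, from the Annex


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def gate_locked(requested: date, today: date, signed_off: bool) -> bool:
    """True when the window has passed without WA sign-off."""
    deadline = add_business_days(requested, SIGN_OFF_WINDOW)
    return not signed_off and today > deadline
```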

End of Annex G