Aegis 365 -- Operations & Observability
Monitoring & Alerting
| Metric | Tool | Alert Threshold | Action |
|---|---|---|---|
| L0--L7 latency | App Insights | > 2.5s (human) / > 5s (agentic) | Page on-call |
| Prompt detection accuracy | Custom dashboard | L1 false-positive rate > 10% | Retrain SLM |
| Redis cache hit rate | App Insights | < 60% for High classification | Investigate cache policy |
| Session freeze events | Azure Monitor | > 10 in 1 hour | Check for compromise |
| Audit log lag | Custom monitor | > 1s behind real time | Check L6 I/O bottleneck |
| Break Glass initiated | Syslog | Any event | Notify all MPC key holders |
| IdP sync failures | Service Bus DLQ | > 3 retries | Page identity admin |
| HSM comms error | Syslog | Any event | Fail-Closed; page security |
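
The thresholds in the table above can be read as a small rule set. The following is a minimal sketch, not the actual alerting configuration: the metric keys (`latency_human_s`, `redis_hit_rate_high`, and so on) and the `evaluate` helper are hypothetical names chosen for illustration, and "any event" rows are modelled as a count greater than zero.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    metric: str                        # hypothetical metric key
    breached: Callable[[float], bool]  # True when the threshold is crossed
    action: str                        # action text from the table


# Thresholds copied from the table; the metric names are assumptions.
RULES = [
    AlertRule("latency_human_s", lambda v: v > 2.5, "Page on-call"),
    AlertRule("latency_agentic_s", lambda v: v > 5.0, "Page on-call"),
    AlertRule("l1_false_positive_rate", lambda v: v > 0.10, "Retrain SLM"),
    AlertRule("redis_hit_rate_high", lambda v: v < 0.60, "Investigate cache policy"),
    AlertRule("session_freezes_per_hour", lambda v: v > 10, "Check for compromise"),
    AlertRule("audit_log_lag_s", lambda v: v > 1.0, "Check L6 I/O bottleneck"),
    AlertRule("break_glass_events", lambda v: v > 0, "Notify all MPC key holders"),
    AlertRule("idp_sync_retries", lambda v: v > 3, "Page identity admin"),
    AlertRule("hsm_comms_errors", lambda v: v > 0, "Fail-Closed; page security"),
]


def evaluate(samples: dict[str, float]) -> list[tuple[str, str]]:
    """Return (metric, action) for every rule whose threshold is breached.

    Metrics missing from `samples` are skipped rather than treated as healthy
    or unhealthy; a real monitor would likely alert on missing data as well.
    """
    fired = []
    for rule in RULES:
        value = samples.get(rule.metric)
        if value is not None and rule.breached(value):
            fired.append((rule.metric, rule.action))
    return fired
```

For example, `evaluate({"latency_human_s": 3.0, "redis_hit_rate_high": 0.7})` fires only the latency rule, because a 70% hit rate is above the 60% floor.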
Runbooks
Redis Member Failure
- Automatic failover to replica
- Manual integrity check of session state consistency
- Verify no Mosaic tiles lost across nodes
SLM Inference Timeout
- Switch L1/L2 to high-speed regex/NER fallback mode
- Alert AI Governance Officer for model health review
- Schedule SLM re-deployment or quantization check
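
The first two steps of this runbook can be sketched as a timeout wrapper around the SLM call. This is an illustration under stated assumptions: `slm_infer`, `classify`, the 0.5 s default timeout, and the two fallback patterns are all hypothetical, and the `alert` callback stands in for whatever notifies the AI Governance Officer.

```python
import re
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical fallback patterns; a real deployment would load its own
# regex/NER rules for L1/L2.
FALLBACK_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

_executor = ThreadPoolExecutor(max_workers=4)


def regex_fallback(text):
    """Return the sorted names of all fallback patterns found in the text."""
    return sorted(name for name, pat in FALLBACK_PATTERNS.items() if pat.search(text))


def classify(text, slm_infer, timeout_s=0.5, alert=print):
    """Run the SLM; on timeout, raise an alert and use the regex fallback."""
    future = _executor.submit(slm_infer, text)
    try:
        return {"mode": "slm", "labels": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet; a call that
        # is already running continues in its worker thread until it returns.
        future.cancel()
        alert("SLM inference timeout: notify AI Governance Officer")
        return {"mode": "fallback", "labels": regex_fallback(text)}
```

Because a timed-out call still occupies a worker thread, a production version would also need a circuit breaker or a bounded queue so that repeated timeouts do not exhaust the pool.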
Compliance Registry Update Failure
- Queue update for next sync window
- Do not block traffic: the registry is advisory unless it is stale beyond the configured threshold
- Alert SOC if stale > 24 hours for Critical classification
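
The staleness logic in these steps could look roughly like the sketch below. The 24-hour SOC alert comes from the runbook; the 72-hour `ADVISORY_LIMIT` is an assumed placeholder for the "configured threshold", and the function and action names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SOC_ALERT_AFTER = timedelta(hours=24)  # from the runbook
ADVISORY_LIMIT = timedelta(hours=72)   # assumption: stand-in for the configured threshold


def on_update_failure(last_sync, classification, now=None, advisory_limit=ADVISORY_LIMIT):
    """Decide follow-up actions after a failed registry update.

    Returns the actions to take and whether the registry is still advisory
    (that is, whether traffic should continue to flow unblocked).
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_sync
    actions = ["queue_for_next_sync_window"]  # always retry at the next window
    if classification == "Critical" and age > SOC_ALERT_AFTER:
        actions.append("alert_soc")
    return {"actions": actions, "advisory": age <= advisory_limit}
```

What happens once `advisory` turns false is a policy decision that the runbook leaves to configuration, so the sketch only reports the state.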
Session State Corruption
- Immediately invalidate affected sessions
- Freeze user account pending investigation
- Generate forensic snapshot in L6
High-Value Access Event
- Immediate notification to CISO, Legal, and all MPC key holders
- Generate immutable Meta-Log entry
- If unauthorized, trigger emergency session freeze and SOC escalation
Deployment & Zero-Downtime Updates
| Component | Strategy | Duration / User Impact |
|---|---|---|
| Core pipeline (L0--L7) | Blue-Green: validate before cutover | Zero downtime |
| Local SLMs (L1, L2) | Canary: 5% traffic, monitor 30 min | < 1 min user impact |
| Configuration (policies) | Rolling update | Immediate |
| Emergency security patches | Forced rolling update | Max 15 min |
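
As one possible reading of the canary row, the sketch below routes a fixed 5% of sessions to the new SLM build and decides whether to promote it after the 30-minute window. The hash-based bucketing, the error-rate comparison, and the 1% `tolerance` are assumptions for illustration; the table itself only specifies the traffic share and the monitoring duration.

```python
import hashlib

CANARY_PERCENT = 5   # from the table
MONITOR_MINUTES = 30  # from the table


def bucket(session_id: str) -> int:
    """Map a session ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(session_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Send a fixed share of sessions to the canary; the same ID always routes the same way."""
    return "canary" if bucket(session_id) < canary_percent else "stable"


def canary_verdict(stable_error_rate, canary_error_rate, minutes_observed, tolerance=0.01):
    """Roll back as soon as the canary is clearly worse; promote only after the full window."""
    if canary_error_rate > stable_error_rate + tolerance:
        return "rollback"
    if minutes_observed < MONITOR_MINUTES:
        return "continue"
    return "promote"
```

Hashing the session ID rather than sampling at random keeps a given session on one model version for the whole canary window, which makes any regression easier to attribute.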