KSI-CNA-OFA: Optimizing for Availability
Formerly KSI-CNA-06
>Control Description
>Trust Center Components
Ways to express your implementation of this indicator — approaches vary by organization size, complexity, and data sensitivity.
From the field: Mature implementations express disaster recovery capability through test results — actual recovery times versus targets from periodic failover exercises, documented as measurable evidence. DR plans are validated through automation (chaos engineering, game days) rather than just tabletop exercises, and failover metrics are tracked as dashboard indicators.
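The automated validation described above can be sketched as a minimal game-day harness. Everything here is an assumed, hypothetical interface — a real exercise would terminate actual resources through your cloud provider's API and poll a real health endpoint:

```python
import time

def run_failover_exercise(kill_resource, is_healthy, rto_seconds, poll_interval=0.01):
    """Trigger a failure, then measure how long the system takes to recover.

    kill_resource: callable that causes the failure (hypothetical hook).
    is_healthy: callable returning True once service is restored.
    rto_seconds: recovery time objective to compare against.
    Returns an evidence record suitable for a DR dashboard.
    """
    started = time.monotonic()
    kill_resource()
    while not is_healthy():
        time.sleep(poll_interval)
    actual = time.monotonic() - started
    return {"actual_recovery_s": round(actual, 3),
            "rto_s": rto_seconds,
            "met_rto": actual <= rto_seconds}

# Stub "system" that reports healthy after three health polls,
# standing in for a real service recovering behind a load balancer.
polls = {"n": 0}
def stub_healthy():
    polls["n"] += 1
    return polls["n"] >= 3

record = run_failover_exercise(lambda: None, stub_healthy, rto_seconds=5.0)
```

Recording the result as structured data (rather than prose in a test report) is what makes failover metrics trackable as dashboard indicators.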
High Availability Architecture
Architecture expressing failover mechanisms, redundancy layers, and recovery paths — shows actual configuration, not just design intent
Failover Success Metrics
Dashboard expressing DR readiness — failover test results, recovery times, and resilience metrics from automated testing
DR Test Results
Disaster recovery test results including actual vs. target recovery times — evidence that DR plans translate into operational capability
Disaster Recovery Plan Summary
DR plan summary including RTO/RPO targets and recovery procedures
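The "actual vs. target" evidence the components above call for can be produced with a small comparison over test results. The scenario names and data shape below are illustrative assumptions, not a prescribed schema:

```python
def evaluate_dr_test(results, targets):
    """Compare measured recovery times (seconds) against RTO targets.

    results, targets: {scenario_name: seconds}.
    Returns per-scenario evidence rows showing whether each RTO was met.
    """
    rows = []
    for scenario, actual in results.items():
        target = targets[scenario]
        rows.append({"scenario": scenario,
                     "actual_s": actual,
                     "target_s": target,
                     "met": actual <= target})
    return rows

# Hypothetical results from one failover exercise:
evidence = evaluate_dr_test(
    {"az-failover": 95, "region-failover": 1900},
    {"az-failover": 300, "region-failover": 1800},
)
```

A row where `met` is false (here, the region failover exceeding its target) is itself useful evidence: it shows the DR plan is being exercised and gaps are being measured.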
>Programmatic Queries
CLI Commands
aws elbv2 describe-load-balancers --query "LoadBalancers[].{Name:LoadBalancerName,Scheme:Scheme,AZs:AvailabilityZones[].ZoneName,State:State.Code}" --output table

aws elbv2 describe-target-health --target-group-arn <tg-arn> --output table

>20x Assessment Focus Areas
Aligned with FedRAMP 20x Phase Two assessment methodology
Completeness & Coverage:
- Are all critical machine-based resources optimized for high availability, or are there single points of failure in the architecture that are documented as accepted risks?
- Does your high-availability design cover all dependencies — databases, message queues, DNS, identity providers, and third-party APIs — not just compute resources?
- How do you ensure rapid recovery capabilities extend to all data stores and stateful services, not just stateless application tiers?
- Are non-production environments (staging, DR sites) also designed for rapid recovery, or only production?
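The first question above — finding undocumented single points of failure — can be answered mechanically from a resource inventory. The inventory schema here (`azs`, `risk_accepted`) is an assumption for illustration:

```python
def find_single_points_of_failure(inventory):
    """Flag resources deployed in fewer than two availability zones
    unless they carry a documented risk acceptance."""
    return [r["name"] for r in inventory
            if len(r.get("azs", [])) < 2 and not r.get("risk_accepted", False)]

# Hypothetical inventory entries:
inventory = [
    {"name": "api-lb", "azs": ["us-east-1a", "us-east-1b"]},
    {"name": "legacy-batch", "azs": ["us-east-1a"], "risk_accepted": True},
    {"name": "report-db", "azs": ["us-east-1a"]},
]
spofs = find_single_points_of_failure(inventory)
```

Running a check like this in CI against IaC-derived inventory turns "are there single points of failure?" from an interview question into a continuously verified assertion.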
Automation & Validation:
- What automated health checks detect resource failures, and what is the maximum time between failure detection and automated failover?
- How do you test failover mechanisms — do you run chaos engineering experiments (e.g., Chaos Monkey, Gremlin) that kill resources in production?
- What happens if an availability zone or region goes down — does automated failover activate without manual intervention, and what evidence proves this?
- How do you validate that auto-scaling and self-healing mechanisms can handle load spikes and resource failures simultaneously?
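The "maximum time between failure detection and automated failover" asked about above follows directly from health check settings: a failure can occur just after a passing check, so detection can take up to the full run of consecutive failed checks. A sketch of that bound (parameter names are illustrative):

```python
def worst_case_failover_seconds(check_interval_s, unhealthy_threshold, failover_s):
    """Upper bound on time from failure to completed failover.

    Detection takes up to unhealthy_threshold consecutive failed checks,
    spaced check_interval_s apart; failover_s covers the switchover itself
    (e.g., DNS or load-balancer target change).
    """
    return check_interval_s * unhealthy_threshold + failover_s

# e.g., 30 s checks, 3 consecutive failures required, 60 s failover:
bound = worst_case_failover_seconds(30, 3, 60)  # 150 seconds
```

Publishing this computed bound alongside measured failover times makes it easy to spot when a loosened health-check setting silently degrades recovery guarantees.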
Inventory & Integration:
- How do you maintain an inventory of all resources and their availability tier (single-AZ, multi-AZ, multi-region), and is this mapping kept in a centralized system?
- What tools monitor availability and failover status across your entire stack, and how do they integrate with your incident management system?
- How do load balancers, auto-scaling groups, and failover DNS integrate to provide seamless recovery?
- Are availability configurations (replica counts, failover policies, health check settings) defined in IaC and version-controlled?
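The availability-tier mapping mentioned above (single-AZ, multi-AZ, multi-region) can be derived from deployment data rather than maintained by hand. This sketch assumes AWS-style zone names where stripping the trailing letter yields the region:

```python
def availability_tier(resource):
    """Classify a resource by deployment footprint (assumed inventory schema)."""
    regions = {az[:-1] for az in resource["azs"]}  # 'us-east-1a' -> 'us-east-1'
    if len(regions) > 1:
        return "multi-region"
    return "multi-az" if len(resource["azs"]) > 1 else "single-az"

tiers = [availability_tier(r) for r in [
    {"azs": ["us-east-1a"]},
    {"azs": ["us-east-1a", "us-east-1b"]},
    {"azs": ["us-east-1a", "us-west-2a"]},
]]
```

Deriving the tier from live or IaC-declared zone assignments keeps the centralized mapping from drifting out of date.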
Continuous Evidence & Schedules:
- What availability SLA do you commit to, and what evidence demonstrates you have met it over the past 12 months?
- How frequently do you test failover and rapid recovery capabilities, and what evidence shows each test was completed?
- Is real-time availability and recovery performance data accessible via API or dashboard?
- How do you detect degradation in availability posture — for example, when a replica count drops or a health check is disabled?
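Demonstrating the SLA question above reduces to simple arithmetic over recorded downtime. A minimal sketch (the downtime figure is a made-up example):

```python
def achieved_availability(downtime_minutes, period_days=365):
    """Availability percentage over a period, given total recorded downtime."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# About 52.6 minutes of downtime in a year corresponds to 99.99% availability.
pct = achieved_availability(52.6)
```

Computing this from incident records (rather than asserting it) is the kind of measurable, API-accessible evidence the 20x methodology looks for.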