KSI-CNA-OFA: Optimizing for Availability

LOW
MODERATE

Formerly KSI-CNA-06

>Control Description

Appropriately optimize machine-based information resources for high availability and rapid recovery.
Defined terms:
Information Resource
Machine-Based (information resources)

>Trust Center Components
4

Ways to express your implementation of this indicator — approaches vary by organization size, complexity, and data sensitivity.

From the field: Mature implementations express disaster recovery capability through test results — actual recovery times versus targets from periodic failover exercises, documented as measurable evidence. DR plans are validated through automation (chaos engineering, game days) rather than just tabletop exercises, and failover metrics are tracked as dashboard indicators.

High Availability Architecture

Architecture & Diagrams

Architecture expressing failover mechanisms, redundancy layers, and recovery paths — shows actual configuration, not just design intent

Failover Success Metrics

Dashboards

Dashboard expressing DR readiness — failover test results, recovery times, and resilience metrics from automated testing

DR Test Results

Evidence Artifacts

Disaster recovery test results including actual vs. target recovery times — evidence that DR plans translate into operational capability

Manual: Review DR test reports for actual vs. target recovery times
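
As an illustration of the actual-versus-target comparison described above, the following is a minimal sketch, not a prescribed method. The `DrTestResult` type, the service names, and all recovery figures are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DrTestResult:
    """One failover exercise outcome (all field names are hypothetical)."""
    service: str
    target_rto_minutes: float
    actual_recovery_minutes: float

    @property
    def met_target(self) -> bool:
        # A test passes when recovery finished within the RTO target.
        return self.actual_recovery_minutes <= self.target_rto_minutes


def summarize(results: List[DrTestResult]) -> Tuple[List[str], List[str]]:
    """Split services into those that met their RTO and those that missed it."""
    passed = [r.service for r in results if r.met_target]
    failed = [r.service for r in results if not r.met_target]
    return passed, failed


# Hypothetical figures for illustration only.
results = [
    DrTestResult("orders-db", target_rto_minutes=60, actual_recovery_minutes=42),
    DrTestResult("auth-api", target_rto_minutes=15, actual_recovery_minutes=23),
]
passed, failed = summarize(results)
print("met RTO:", passed)
print("missed RTO:", failed)
```

A summary like this could feed the dashboard or evidence artifact; the missed-target list is the part a reviewer is likely to ask about first.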

Disaster Recovery Plan Summary

Documents & Reports

DR plan summary including RTO/RPO targets and recovery procedures

>Programmatic Queries

Beta
Cloud

CLI Commands

List load balancers and their AZs:
aws elbv2 describe-load-balancers --query "LoadBalancers[].{Name:LoadBalancerName,Scheme:Scheme,AZs:AvailabilityZones[].ZoneName,State:State.Code}" --output table
Check target group health (replace <tg-arn> with the ARN of the target group to inspect):
aws elbv2 describe-target-health --target-group-arn <tg-arn> --output table
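
The output of the two commands above can also be checked programmatically. The sketch below assumes the commands are run with `--output json` and without the `--query` projection, so that the raw `LoadBalancers` and `TargetHealthDescriptions` structures are available. The function names and the sample data are hypothetical.

```python
import json
from typing import List


def single_az_load_balancers(describe_lb_json: str) -> List[str]:
    """Return names of load balancers that span fewer than two AZs."""
    data = json.loads(describe_lb_json)
    flagged = []
    for lb in data.get("LoadBalancers", []):
        zones = {az["ZoneName"] for az in lb.get("AvailabilityZones", [])}
        if len(zones) < 2:
            flagged.append(lb["LoadBalancerName"])
    return flagged


def unhealthy_targets(describe_health_json: str) -> List[str]:
    """Return IDs of targets whose reported state is anything but 'healthy'."""
    data = json.loads(describe_health_json)
    return [
        d["Target"]["Id"]
        for d in data.get("TargetHealthDescriptions", [])
        if d["TargetHealth"]["State"] != "healthy"
    ]


# Hypothetical sample data shaped like the CLI's JSON output.
sample_lbs = json.dumps({"LoadBalancers": [
    {"LoadBalancerName": "public-web",
     "AvailabilityZones": [{"ZoneName": "us-east-1a"}, {"ZoneName": "us-east-1b"}]},
    {"LoadBalancerName": "internal-batch",
     "AvailabilityZones": [{"ZoneName": "us-east-1a"}]},
]})
sample_health = json.dumps({"TargetHealthDescriptions": [
    {"Target": {"Id": "i-0aaa"}, "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0bbb"}, "TargetHealth": {"State": "unhealthy"}},
]})

print(single_az_load_balancers(sample_lbs))
print(unhealthy_targets(sample_health))
```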

>20x Assessment Focus Areas

Aligned with FedRAMP 20x Phase Two assessment methodology

Completeness & Coverage:

  • Are all critical machine-based resources optimized for high availability, or are there single points of failure in the architecture that are documented as accepted risks?
  • Does your high-availability design cover all dependencies — databases, message queues, DNS, identity providers, and third-party APIs — not just compute resources?
  • How do you ensure rapid recovery capabilities extend to all data stores and stateful services, not just stateless application tiers?
  • Are non-production environments (staging, DR sites) also designed for rapid recovery, or only production?

Automation & Validation:

  • What automated health checks detect resource failures, and what is the maximum time between failure detection and automated failover?
  • How do you test failover mechanisms — do you run chaos engineering experiments (e.g., Chaos Monkey, Gremlin) that kill resources in production?
  • What happens if an availability zone or region goes down — does automated failover activate without manual intervention, and what evidence proves this?
  • How do you validate that auto-scaling and self-healing mechanisms can handle load spikes and resource failures simultaneously?
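
To reason about the maximum time between failure and automated failover, a rough estimate can be derived from health check settings. The sketch below is a simplified model and an assumption, not a vendor formula: it treats detection time as the check interval multiplied by the unhealthy threshold, plus one check timeout, and then adds a failover action time. All parameter names and values are hypothetical.

```python
def worst_case_detection_seconds(interval_s: float,
                                 unhealthy_threshold: int,
                                 timeout_s: float) -> float:
    """Simplified upper bound on the time to mark a resource unhealthy.

    Assumes the failure happens just after a successful check, so a full
    interval elapses before each of the consecutive failed checks, and the
    last failed check takes up to the full timeout to report.
    """
    return interval_s * unhealthy_threshold + timeout_s


def worst_case_failover_seconds(detection_s: float,
                                failover_action_s: float) -> float:
    """Detection time plus the time the failover action itself needs."""
    return detection_s + failover_action_s


# Hypothetical settings: 30 s interval, 3 failed checks, 5 s timeout,
# and a failover action (e.g. DNS or replica promotion) taking 60 s.
detection = worst_case_detection_seconds(30, 3, 5)
total = worst_case_failover_seconds(detection, 60)
rto_budget_s = 300  # hypothetical recovery time objective

print(f"detection: {detection}s, total: {total}s, "
      f"within budget: {total <= rto_budget_s}")
```

An estimate like this is only a starting point; measured failover times from actual exercises remain the stronger evidence.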

Inventory & Integration:

  • How do you maintain an inventory of all resources and their availability tier (single-AZ, multi-AZ, multi-region), and is this mapping kept in a centralized system?
  • What tools monitor availability and failover status across your entire stack, and how do they integrate with your incident management system?
  • How do load balancers, auto-scaling groups, and failover DNS integrate to provide seamless recovery?
  • Are availability configurations (replica counts, failover policies, health check settings) defined in IaC and version-controlled?
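
One way to keep the availability-tier mapping consistent is to derive it from where each resource is actually placed. The following is a minimal sketch under that assumption; the inventory structure, resource names, and placements are hypothetical.

```python
from typing import Dict, List, Tuple

Placement = Tuple[str, str]  # (region, availability zone)


def availability_tier(placements: List[Placement]) -> str:
    """Classify a resource as single-AZ, multi-AZ, or multi-region."""
    regions = {region for region, _ in placements}
    zones = set(placements)
    if len(regions) > 1:
        return "multi-region"
    if len(zones) > 1:
        return "multi-AZ"
    return "single-AZ"


# Hypothetical inventory for illustration only.
inventory: Dict[str, List[Placement]] = {
    "orders-db": [("us-east-1", "us-east-1a"), ("us-east-1", "us-east-1b")],
    "auth-api": [("us-east-1", "us-east-1a"), ("us-west-2", "us-west-2a")],
    "report-worker": [("us-east-1", "us-east-1a")],
}

tiers = {name: availability_tier(p) for name, p in inventory.items()}
for name, tier in tiers.items():
    print(f"{name}: {tier}")
```

Resources classified as single-AZ are the candidates to either remediate or document as accepted risks.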

Continuous Evidence & Schedules:

  • What availability SLA do you commit to, and what evidence demonstrates you have met it over the past 12 months?
  • How frequently do you test failover and rapid recovery capabilities, and what evidence shows each test was completed?
  • Is real-time availability and recovery performance data accessible via API or dashboard?
  • How do you detect degradation in availability posture — for example, when a replica count drops or a health check is disabled?
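
To make the SLA question concrete, measured availability over a period can be compared with the committed figure. The sketch below assumes downtime is already recorded in minutes; the 99.95% SLA and the downtime total are hypothetical values, not figures from this indicator.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Share of the period during which the service was available."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(total_minutes: float, sla_percent: float) -> float:
    """Downtime budget implied by an SLA over the given period."""
    return total_minutes * (1.0 - sla_percent / 100.0)


# Hypothetical commitment and measurement for illustration only.
sla_percent = 99.95
downtime_minutes = 200.0

measured = availability_percent(MINUTES_PER_YEAR, downtime_minutes)
budget = allowed_downtime_minutes(MINUTES_PER_YEAR, sla_percent)

print(f"measured availability: {measured:.3f}%")
print(f"downtime budget: {budget:.1f} min, used: {downtime_minutes:.1f} min")
print("SLA met:", measured >= sla_percent)
```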

Update History

2026-02-04: Removed italics and changed the ID as part of new standardization in v0.9.0-beta; no material changes.
