
Why are Companies Missing RTO Targets? Key Insights and Solutions

  • Ralph Labarta

Recent conversations with clients and partners reveal a common cyber resiliency challenge: recovery efforts often fail to meet the Recovery Time Objective (RTO) during actual cyber events. RTO is a critical recovery performance criterion; it can be the difference between a client service inconvenience and a major business service failure.


The Road Here


RTO failures are not limited to organizations that have underinvested in business continuity planning or technologies. The focus here is on companies that have invested in cloud deployments, enterprise backup solutions, and alternate availability zones, yet still fail to meet RTO objectives during real-world recovery events. Let us assume that the building blocks of a modern, cloud-based high-availability solution are in place.


Where are the failure points that can turn a 24-hour RTO into a 72+ hour outage?


Set the Stage


A characteristic of an effective business continuity plan is that it addresses a broad range of interruption scenarios. However, the unique challenges presented by cyber events must be taken into consideration to ensure the plan will perform as expected under cyber duress. The scenario below sets out a typical cyber interruption response and the related technology recovery activities.


Monday, April 7, 2025


5:52 AM ET: System monitoring reports a "DOWN" condition across various IPs and URLs associated with client-facing and internal web applications.


6:07 AM ET: Staff report an inability to access the login page of the internal CRM system.

6:14 AM ET: IT staff confirm broad outage of resources hosted in their East cloud deployment.


6:18 AM ET: IT opens support tickets with cloud and support partners.


6:25 AM ET: IT management confers with staff to review status and prepare communication to management and internal staff. Staff continues to investigate outage scope and potential causes.


6:49 AM ET: Client phone calls and emails increase as awareness of the outage expands. Internal staff are able to access phone and email, but the CRM and other client service systems are unavailable. Operations management is aware of the outage and is inquiring as to when system access will be restored.


7:05 AM ET: IT has determined that the scope of the outage is limited to the company's technology resources hosted in their East cloud deployment and is not part of a broader provider or internet interruption. All systems within the East deployment, including administrative consoles, are unresponsive.


7:30 AM ET: Vendor resources and internal IT staff have been unable to identify a root cause of the outage. More senior technical resources are expected online shortly, but access to logs, SIEM data, vCenter, etc. is limited or completely unavailable. No additional information indicating ransomware or other malicious intrusion has been received or identified.


8:10 AM ET: Client calls are overwhelming the support center as the business day ramps up. Senior management has been updated but the lack of actionable information has limited their role to simply approving client communications and making key customer calls. Management decides to formally initiate the disaster response plan including cyber incident response elements. The CTO directs two IT engineers to follow the incident response plan playbook and initiate efforts to bring up their alternate availability deployment in the West region, while other technical staff focus on the primary environment.


8:27 AM ET: No significant progress has been made on recovering services in the primary environment. Senior management is pushing for a timeline on system restoration in the West region. Based on the documented disaster recovery plan, an RTO of 6 hours is communicated to management; it is assumed that IT services will be back online by 3:00 PM ET, which would allow a significant portion of the client base to complete transactions due today. In coordination with the CTO, resources are directed toward restoration efforts, and communications to clients and internal staff indicate that an active effort to restore services by end of day is underway.


Fast Forward...


2:39 PM ET: Senior management has been informed by the CTO that numerous issues will prevent restoration of services from being completed by 3:00 PM ET. Due to the complexity of the technical issues, an updated timeline for restoration could not be provided.


What Went Wrong...and What Could Go Wrong


Real-world scenarios that derail recovery efforts:

Issue: Access to DR Environment
Detail: Teams working on restoration efforts identified that their credentials mirrored those in the primary environment and were potentially compromised.
Impact: Ensuring secure access to the DR environment required additional engineering resources to address. Delay: 1-2 hrs.
Mitigation: Add cyber considerations to the DR credentials and authentication plan.

Issue: Securing DR Environment
Detail: Engineers raised additional concerns about the integrity of the DR environment based on the growing suspicion that the outage in the primary environment was due to a cyberattack.
Impact: Additional efforts to ensure the DR environment had not been compromised required vendor resources that were working on the primary environment. Delay: 1-2 hrs.
Mitigation: Ensure safeguards are in place and tested to prevent corruption of the DR environment. Add a security analysis of the DR environment to the plan.

Issue: Deployment
Detail: Engineers identified various primary firewall and network changes that were not reflected in the firewall build in the DR environment.
Impact: Documentation of changes was not available due to the lockdown of the primary environment. Engineers had to piece together emails to identify changes. Delay: 1 hr.
Mitigation: Improve the change management process to ensure consideration of DR environment synchronization, and ensure DR support documents are available outside of potentially impacted systems.

Issue: Blue Screen
Detail: A VM blue screen occurred on a restored server deemed critical.
Impact: Troubleshooting and resolution were impacted by a delayed vendor response. Delay: 1-2 hrs.
Mitigation: Increase testing frequency of the DR environment to reduce the risk of recovery issues. Ensure critical vendor support resources are governed by SLAs aligned with DR needs.

Issue: Missing Server
Detail: A recently deployed server was missing from the protection group and not available in the DR environment.
Impact: Identification and workaround resolution consumed engineering resources. Delay: 1-2 hrs.
Mitigation: Ensure procedures governing changes and deployments include DR-specific steps that specify analysis of DR requirements. Automated drift checks, such as the sketch after this table, can catch these gaps early.

Issue: Key Replication
Detail: A key replication issue prevented a server from booting.
Impact: Resolving vault access consumed engineering resources. Delay: 1-2 hrs.
Mitigation: DR testing should include real-time access to key and password vaults. Testing should not rely on data staged outside of the resources that remain available when primary systems are offline.

Issue: Restore Performance
Detail: The disk storage type was not changed prior to restore due to technical issues.
Impact: The initial restore effort was aborted based on the estimated completion time; the disk type change was resolved after a vendor delay and the initial data restore was restarted. Delay: 1-3 hrs.
Mitigation: Ensure effective leadership oversees the restore process and that deviations from documentation must be approved.

Issue: Data Integrity
Detail: The data integrity scan crashed due to a lack of system resources.
Impact: Increased memory and CPU were provisioned after a delay and the scan was restarted. Delay: 1-2 hrs.
Mitigation: Ensure an accurate sizing analysis is conducted frequently and well documented in the plan. Frequent testing can identify the negative impacts of primary system growth.

Issue: Data Integrity
Detail: The data integrity scan showed critical errors potentially caused by snapshot issues in the primary environment.
Impact: An older snapshot was selected for restore in the hope that it was not impacted by the issues in the primary environment, and the restore process was restarted. Delay: 1-2 hrs.
Mitigation: Testing should include analysis of restore integrity to increase the probability of valid restore points. Replication tools often create data integrity challenges during high-transaction-volume periods.

Issue: Authentication
Detail: A recently deployed application authentication component failed due to an unknown dependency on the primary environment.
Impact: Network engineer and third-party provider resources were needed to troubleshoot and resolve the issue. Delay: 1-2 hrs.
Mitigation: Ensure deployments of any technologies in the primary environment are evaluated for DR impacts. Frequent testing helps identify authentication issues.

Issue: Environment Certification
Detail: The plan lacked detail to guide certification of restored systems after a suspected cyber event.
Impact: Consultation among the security, legal, insurance, technology, and senior management teams further delayed system access. Delay: 1-2 hrs.
Mitigation: Develop detailed system certification criteria as part of the plan. Test the certification process as part of DR testing.
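
Several of these mitigations, particularly firewall drift and servers missing from the protection group, lend themselves to simple automation. Below is a minimal sketch, assuming hypothetical JSON inventory exports named primary_inventory.json and dr_environment.json produced by whatever tooling is already in place; it is not tied to any vendor's API and would need to be adapted to the actual environment.

#!/usr/bin/env python3
"""Nightly DR drift check (illustrative sketch only).

Assumes two hypothetical JSON exports produced by existing tooling:
  primary_inventory.json -> {"servers": [...], "firewall_rules": [...]}
  dr_environment.json    -> {"protected_servers": [...], "firewall_rules": [...]}
File names, fields, and the export mechanism are assumptions for illustration.
"""

import json
import sys


def load(path):
    # Read a JSON export produced by the inventory tooling in use.
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def main():
    primary = load("primary_inventory.json")
    dr = load("dr_environment.json")

    findings = []

    # 1. Servers deployed in primary but missing from the DR protection group.
    unprotected = set(primary["servers"]) - set(dr["protected_servers"])
    for server in sorted(unprotected):
        findings.append(f"UNPROTECTED SERVER: {server} is not in the DR protection group")

    # 2. Firewall/network rules present in primary but not reflected in the DR build.
    drift = set(primary["firewall_rules"]) - set(dr["firewall_rules"])
    for rule in sorted(drift):
        findings.append(f"FIREWALL DRIFT: rule '{rule}' missing from DR build")

    if findings:
        print("\n".join(findings))
        sys.exit(1)  # non-zero exit so a scheduler or alerting job can flag it

    print("DR environment is in sync with the primary inventory.")


if __name__ == "__main__":
    main()

Run from a scheduler that lives outside the primary environment, a check like this surfaces the "missing server" and "firewall drift" gaps before a real event does.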

Conclusion: Invest, Streamline, Test, Test, Test


DR has come a long way from the days of tape backups stored offsite. In a modern cloud environment, the simplicity and functionality offered by leading vendors provide powerful tools, but the complexity and interdependencies of IT environments create a counterbalancing set of challenges to recovery performance. While this article provides a detailed look into the complex human and technology perspectives of a recovery effort, the key takeaway is as follows:


Frequent and well-planned testing is the most effective tool to increase the probability of a successful recovery that meets RTO.


Effective and frequent testing will reveal weaknesses, oversights, mistakes, and potential pitfalls. Testing should take place under duress (resources unavailable, timed steps, interruptions, etc.). While many cyber advocates have popularized executive tabletop exercises, engineering teams must be exercised in the same manner. One way to keep "timed steps" honest is sketched below.
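
As one illustration, below is a minimal sketch of a drill timer that logs the elapsed time of each recovery step against a declared RTO budget. The step names and the six-hour RTO are assumptions drawn from the scenario above, not a prescribed runbook; substitute the steps from the actual recovery playbook.

#!/usr/bin/env python3
"""Timed DR drill log (illustrative sketch).

The step list and the 6-hour RTO are assumptions based on the scenario in
this article; replace them with the steps and target from your own playbook.
"""

import time

RTO_BUDGET_MINUTES = 6 * 60  # 6-hour RTO, as in the scenario above

# Hypothetical drill steps for illustration only.
STEPS = [
    "Secure and validate DR credentials",
    "Security analysis of DR environment",
    "Apply network/firewall configuration",
    "Restore critical servers",
    "Run data integrity scan",
    "Certify environment and release to users",
]


def run_drill():
    drill_start = time.monotonic()
    for step in STEPS:
        input(f"Working on: {step} (press Enter when the step is complete) ")
        elapsed = (time.monotonic() - drill_start) / 60
        remaining = RTO_BUDGET_MINUTES - elapsed
        status = "WITHIN RTO" if remaining > 0 else "RTO EXCEEDED"
        print(f"  {step}: {elapsed:.0f} min elapsed, {remaining:.0f} min remaining [{status}]")

    total = (time.monotonic() - drill_start) / 60
    print(f"Drill complete in {total:.0f} minutes against a {RTO_BUDGET_MINUTES}-minute RTO budget.")


if __name__ == "__main__":
    run_drill()

Capturing per-step timings during every drill makes it obvious which recovery activities are consuming the RTO budget and where the plan needs investment.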


Key Action Items for Boards/Ownership/Management:


  1. Understand backup and recovery capabilities of critical systems.

  2. Understand how recovery capabilities are tested.

  3. Require recovery performance documentation (see the example record after this list).

  4. Require frequent recovery testing commensurate with business risk (minimum quarterly).

  5. Ensure recovery testing efforts are properly funded and supported by business operations.
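
As a hypothetical illustration of item 3, recovery performance documentation can be as simple as a structured record kept for every test and real event. The field names below are assumptions about what a board-level summary might track, not a prescribed standard.

#!/usr/bin/env python3
"""Minimal recovery-performance record (illustrative sketch).

Field names are assumptions about what a board-level summary might track.
"""

import json
from datetime import date

# One record per DR test or real recovery event.
record = {
    "date": date(2025, 4, 7).isoformat(),
    "type": "unplanned event",          # or "scheduled test"
    "scope": "East cloud deployment",
    "rto_target_hours": 6,
    "rto_achieved_hours": None,          # filled in once services are restored
    "issues_encountered": [
        "DR credentials potentially compromised",
        "Firewall drift between primary and DR",
        "Server missing from protection group",
    ],
    "follow_up_actions": [
        "Add cyber considerations to DR authentication plan",
        "Automate DR drift checks",
    ],
}

# Append to a running log that management and the board can review over time.
with open("recovery_performance_log.jsonl", "a", encoding="utf-8") as fh:
    fh.write(json.dumps(record) + "\n")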


An analysis of recovery capabilities should also include critical SaaS vendors. It is insufficient to rely on a vendor's SOC 2 or similar third-party attestation.



A footnote on compliance frameworks: various compliance frameworks (SOC 2, NIST, ISO 27001) require only annual DR testing. In the opinion of the author, this is simply not frequent enough to provide a high probability that successful system recovery can be achieved within the planned RTO.

 
 
 
