| Internet-Draft | Problem Statement for Network Resilience | March 2026 |
| Zhao, et al. | Expires 3 September 2026 | [Page] |
This document defines the problem space and analyzes the limitations of current network architectures when dealing with complex, cascading, and unanticipated failures.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Traditional IP network reliability architectures are primarily built upon the principle of "Robustness." Static redundancy and deterministic topology convergence are mature enough to handle predictable single-point failures, but modern networks exhibit significant survivability gaps as business logic complexity grows. Resilience enhancement capabilities are therefore needed to improve the network's ability to adapt and to maintain service continuity in complex environments.¶
The core drivers for this shift include:¶
Evolution of Service Requirements: Critical services are shifting from simple "availability" to "deterministic survivability." This requires the network to maintain a baseline SLA even under extreme shocks, rather than accepting long-term interruptions.¶
Complexity Surpassing Human Intervention: As analyzed in [RFC7276], traditional IP OAM mechanisms primarily focus on connectivity and continuity. However, they are increasingly insufficient for detecting implicit deterioration where the failure is not binary (up/down), especially in high-precision scenarios requiring millisecond-level awareness.¶
Failure Modes Shifting from "Deterministic" to "Unanticipated": Existing robust designs focus on "all-or-nothing" failures. However, they show clear survivability deficiencies when handling cross-layer correlated risks, gray failures, and resource bottlenecks.¶
The current vulnerability of networks is manifested in four deep failure modes:¶
This failure mode refers to physical links or nodes going offline outright. While mechanisms such as BFD and FRR are mature, problems persist in multi-point failure scenarios, where backup paths may be exhausted or may lead to "black holes" because real-time capacity awareness is lacking.¶
State Deception: Traditional heartbeat mechanisms often fail to capture micro-burst degradation, so a link can be reported as healthy while its performance has already deteriorated.¶
Detection Gaps: While In-situ OAM (IOAM) as specified in [RFC9197] enables the collection of fine-grained, hop-by-hop telemetry data, it defines the data plane encapsulation rather than the operational logic for mitigation. Consequently, without an integrated automated response mechanism, traffic may remain on degraded links, leading to sustained SLA violations.¶
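As a non-normative illustration of the gap between heartbeat state and actual link quality, the following Python sketch classifies a link from a window of telemetry samples. The LinkSample schema, the classify_link function, and the 5 ms 99th-percentile latency budget are assumptions made for this example; they are not defined by [RFC9197] or by this document.¶

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class LinkSample:
    """One telemetry observation for a link (hypothetical schema)."""
    heartbeat_ok: bool
    latency_ms: float


def classify_link(samples, p99_budget_ms=5.0):
    """Return 'down', 'gray', or 'healthy' for a window of samples.

    'down'    : no heartbeat in the window succeeded (binary failure).
    'gray'    : heartbeats succeed but p99 latency exceeds the budget.
    'healthy' : heartbeats succeed and p99 latency is within the budget.
    """
    if not samples or not any(s.heartbeat_ok for s in samples):
        return "down"
    latencies = [s.latency_ms for s in samples]
    if len(latencies) < 2:
        # quantiles() needs at least two points on older Python versions.
        p99 = latencies[0]
    else:
        # n=100 yields 99 cut points; index 98 is the 99th percentile.
        p99 = quantiles(latencies, n=100, method="inclusive")[98]
    return "gray" if p99 > p99_budget_ms else "healthy"
```

A link whose heartbeats all succeed but whose 99th-percentile latency exceeds the budget is reported as "gray" rather than "healthy", which is the signal that a binary up/down check does not provide. The sketch only detects the condition; it does not define any mitigation.¶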
Under extreme pressure (e.g., traffic surges or DDoS attacks), the system hits resource bottlenecks and loses its ability to recover on its own. For example, when the control plane CPU is exhausted, the network loses its management entry point.¶
Based on the process of failure evolution, the root causes of resilience deficiency are categorized into three stages:¶
Configuration and Specification Issues: Misconfigurations or non-standard networking practices are prevalent in current network deployments.¶
Lack of Simulation/Prediction: Current networks lack the capability for integrated risk analysis and high-fidelity simulation across multi-vendor and multi-disciplinary complex environments.¶
Issues include protocol defects and a lack of real-time resource awareness on escape paths.¶
Protocol and Solution Defects: Bugs within protocols or improper coordination between solutions in complex scenarios (e.g., multi-solution stacking).¶
Escape Path and Fault Tolerance Failure: Even when backup paths exist, they often fail to provide the intended relief. This is typically due to:¶
Resource Blindness: Traffic switches to backup paths that immediately collapse because they cannot handle the sudden load surge, stemming from a lack of real-time resource awareness.¶
Ineffective Design: The backup or escape schemes themselves are improperly designed (e.g., suboptimal path calculation or logical loops), resulting in a failure to achieve the intended "escape" effect and leaving the service in a degraded or interrupted state.¶
Insufficient Cross-layer Coordination: The physical and network layers fail to collaborate, preventing rapid responses to cross-layer common-cause failures.¶
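The "Resource Blindness" case above can be made concrete with a small, non-normative Python sketch that checks spare capacity before traffic is moved. The data model (each candidate path as a list of per-link capacity and load pairs, in Gbps) and the function names are hypothetical and chosen only for illustration; a real deployment would obtain these values from its own telemetry.¶

```python
def residual_capacity(links):
    """Spare capacity of a path: the minimum spare capacity over its links.

    links: non-empty list of (capacity_gbps, current_load_gbps) tuples.
    """
    return min(capacity - load for capacity, load in links)


def pick_backup_path(candidates, shifted_load_gbps):
    """Choose the backup path with the most headroom after absorbing the load.

    candidates: mapping of path name -> list of (capacity, load) per link.
    Returns the chosen path name, or None when no candidate can carry
    the shifted load without exceeding the capacity of some link.
    """
    best_name, best_headroom = None, None
    for name, links in candidates.items():
        headroom = residual_capacity(links) - shifted_load_gbps
        if headroom < 0:
            continue  # this path would be overloaded by the switchover
        if best_headroom is None or headroom > best_headroom:
            best_name, best_headroom = name, headroom
    return best_name


candidates = {
    "backup-1": [(100, 60), (100, 85)],  # bottleneck link has 15 Gbps spare
    "backup-2": [(100, 50), (40, 5)],    # bottleneck link has 35 Gbps spare
}
```

With these example values, shifting 30 Gbps selects "backup-2", while shifting 40 Gbps returns None. That second outcome is the situation a capacity-unaware switchover would not notice until the backup path was already overloaded.¶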
Recovery relies too heavily on manual intervention. While frameworks such as Service Assurance for Intent-Based Networking (SAIN) [RFC9417] provide a foundation for modeling dependencies, fully automated "closed-loop" evolution remains in its infancy.¶
To address the aforementioned problems, a resilient architecture should satisfy:¶
Proactive Risk Awareness: The ability to identify risk trends before failures occur based on multi-dimensional telemetry data.¶
Elastic Resource Buffering: The ability to absorb instantaneous traffic shocks through elastic scheduling, isolation, and resource decoupling, without changing the topology.¶
Deterministic Self-Healing: The ability to restore service performance to baseline SLA within a predefined time limit and maintain "inertial operation" of services.¶
Closed-loop Immune Evolution: The ability to learn failure patterns through feedback loops and automatically upgrade defense strategies to raise the future anti-risk baseline.¶
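As one possible reading of "Proactive Risk Awareness", the following non-normative Python sketch smooths a telemetry series with an exponentially weighted moving average (EWMA) and flags a risk trend before a hard failure occurs. EWMA is used here only as a simple stand-in technique; the function name, the smoothing factor, and the warning threshold are assumptions for illustration and are not requirements of this document.¶

```python
def ewma_risk_flags(values, alpha=0.3, warn_threshold=0.7):
    """Smooth a metric series and flag samples whose trend is at risk.

    values: iterable of metric readings (e.g., link utilization in 0..1).
    alpha: weight of the newest reading in the moving average.
    Returns a list of (smoothed_value, at_risk) tuples, one per reading.
    """
    smoothed = None
    results = []
    for value in values:
        if smoothed is None:
            smoothed = value  # seed the average with the first reading
        else:
            smoothed = alpha * value + (1 - alpha) * smoothed
        results.append((smoothed, smoothed >= warn_threshold))
    return results


utilization = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
flags = [at_risk for _, at_risk in ewma_risk_flags(utilization, alpha=0.5)]
```

Because the average lags the raw readings, a single spike does not trigger the flag, while a sustained upward trend does. Choosing alpha trades detection speed against sensitivity to noise.¶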
TBD.¶
Resilience mechanisms may introduce new attack vectors, such as the injection of false telemetry data to trigger unnecessary path oscillations. Any resilience framework must therefore provide identity-based authentication for all sensing data and policy updates.¶
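A minimal, non-normative sketch of authenticating sensing data is shown below, assuming a pre-shared key per sender and an HMAC-SHA256 tag carried with each report. The report fields, the key table, and the function names are hypothetical. This document does not specify a particular authentication mechanism, and the sketch omits replay protection and key management.¶

```python
import hashlib
import hmac
import json


def sign_report(report, key):
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding."""
    payload = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_report(report, tag, keys_by_sender):
    """Accept a report only if its tag matches the claimed sender's key."""
    key = keys_by_sender.get(report.get("sender"))
    if key is None:
        return False  # unknown identity: reject before acting on the data
    expected = sign_report(report, key)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, tag)


keys_by_sender = {"router-1": b"example-shared-key"}  # illustrative only
report = {"sender": "router-1", "link": "a-b", "loss": 0.02}
tag = sign_report(report, keys_by_sender["router-1"])
```

A report whose contents were altered in transit, or whose claimed sender is unknown, fails verification and would not be allowed to trigger a path change.¶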
TBD.¶
TBD.¶