idr                                                             Z. Zhang
Internet-Draft                                               K. Kompella
Intended status: Standards Track                                     HPE
Expires: 3 September 2026                                      A. Mahale
                                                                    Meta
                                                             R. Bhargava
                                                                  Crusoe
                                                                A. Zhang
                                                        Westford Academy
                                                            2 March 2026


    BGP Signaling for Multipath Traffic Engineering Junction States
                   draft-zzhang-idr-mpte-signaling-00

Abstract

   Multi-Path Traffic Engineering (MPTE) combines Traffic Engineering
   with Multi-Path forwarding, offering a much-desired TE solution for
   both traditional WAN and new AI/ML DC/DCI.  MPTE tunnels are based
   on an MPTE Directed Acyclic Graph (DAG) and can be signaled with
   extensions to RSVP-TE, PCEP, or BGP.  This document specifies the
   BGP protocol extensions and procedures for signaling MPTE DAGs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 3 September 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Zhang, et al.           Expires 3 September 2026                [Page 1]

Internet-Draft           BGP Signaling for MPTE               March 2026

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Revised
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Mode of Operation
     1.2.  Collecting Topology/TE Information
     1.3.  Considerations for BGP Signaling
   2.  Specification
     2.1.  AFI/SAFI and NLRI
     2.2.  Full Link Identifier sub-TLV
     2.3.  Link Index sub-TLV
     2.4.  Interface and Node Address sub-TLV
     2.5.  Procedures
       2.5.1.  Originating Junction State Routes
       2.5.2.  Receiving Junction State Routes
       2.5.3.  Ordered Control
       2.5.4.  Route Update and Withdrawal
       2.5.5.  Routes For Other Messages
   3.  Security Considerations
   4.  IANA Considerations
   5.  Acknowledgments
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   [I-D.kompella-teas-mpte] describes the architecture and framework
   for Multipath Traffic Engineering (MPTE).  A signaling approach was
   described, which could be implemented via extensions to RSVP, PCEP,
   or BGP.

   This document specifies how to signal MPTE Directed Acyclic Graphs
   (DAGs), in particular, how to provision junctions that make up an
   MPTE DAG, using a new AFI/SAFI, the MPTE AFI/SAFI, in BGP.

   [I-D.ietf-bess-bgp-multicast-controller] specifies the BGP
   extensions to signal multicast replication states to multicast tree
   nodes.  Many of the concepts and extensions can be used to signal
   MPTE Junction states.  This section describes how that is achieved,
   and the difference between multicast signaling and MPTE signaling.

1.1.  Mode of Operation

   While the BGP signaling for MPTE is not limited to Data Centers
   (DCs), a DC using EBGP signaling is used as an example.

   Assume the EBGP sessions between switches in the DC support the MPTE
   SAFI for the signaling of junction states.  A future revision of
   this document will describe the use of Route Reflectors to a)
   isolate MPTE from the other functions of the EBGP mesh (basic
   routing), and b) scale sessions.

   For each DAG, its Signaling Source (SS), which could be a controller
   or an ingress switch, originates a set of BGP routes of the MPTE
   SAFI, one for each junction node.  Each such route is referred to as
   a Junction State route, and MUST carry a Route Target (RT) to target
   the route at the corresponding junction node.  Until the RT matches,
   the route is propagated by the BGP infrastructure.  Once the route
   reaches the targeted node, the matched RT causes the route to be
   imported by that node and stopped from being propagated further.

   Before a junction node has at least one path set up to an egress,
   its upstream node should not start sending traffic to it.  This
   ordered control is preferably done in a hop-by-hop fashion, like in
   the RSVP-TE case.
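
   The Route Target handling described above (propagate until the RT
   matches, then import and stop) can be illustrated with a minimal
   sketch.  It is not a definitive implementation: the
   JunctionStateRoute class, the handle_route function, and the string
   neighbor names are hypothetical, and only the Global Administrator
   field of the RT is modeled.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import List, Tuple


@dataclass(frozen=True)
class JunctionStateRoute:
    # Hypothetical, simplified route: only the fields needed for the
    # import/propagate decision are modeled here.
    mpted_id: int
    rt_global_admin: str  # address of the targeted junction node


def handle_route(local_address: str,
                 route: JunctionStateRoute,
                 from_neighbor: str,
                 neighbors: List[str]) -> Tuple[bool, List[str]]:
    """Return (imported, neighbors to which the route is propagated)."""
    if ip_address(route.rt_global_admin) == ip_address(local_address):
        # RT matches this node: import the route and stop propagation.
        return True, []
    # RT does not match: pass the route on to every neighbor except
    # the one it was received from.
    return False, [n for n in neighbors if n != from_neighbor]


route = JunctionStateRoute(mpted_id=7, rt_global_admin="192.0.2.3")
print(handle_route("192.0.2.3", route, "A", ["A", "B"]))
print(handle_route("192.0.2.9", route, "A", ["A", "B", "C"]))
```

   In this sketch the targeted node (192.0.2.3) imports the route and
   propagates it to nobody, while a transit node forwards it to every
   neighbor other than the sender.
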
   When a junction has its local state set up for a DAG (starting with
   an egress node), it originates a RESV route for each of its PHOPs
   for the DAG, including encapsulation (e.g., an MPLS label) and BW
   information (e.g., the maximum traffic it expects from the PHOP).
   The upstream node repeats the process, and eventually the ingress
   node can start sending traffic.

   Note that a junction node may originate RESV routes before it has
   received RESV routes from all its NHOPs.  When more or updated RESV
   routes are received from its downstream nodes, or when some of its
   downstream nodes are removed or no longer reachable, it sends
   updated RESV routes to its PHOPs.

   As an option, the ordered control could be done by the signaling
   source (SS).  In this case, the encapsulation information (e.g.,
   MPLS labels) can be assigned by the SS and included in the junction
   routes (the label assignment options are detailed in
   [I-D.ietf-bess-bgp-multicast-controller]).  After a junction node
   installs the forwarding state, it sends an acknowledgment route to
   the SS, which tallies the results and notifies the ingress when and
   how much traffic can be put onto the DAG.  This is as if the
   junction nodes were programmed with static routes, which shifts the
   burden/complexity to the SS.

1.2.  Collecting Topology/TE Information

   Typically, Traffic Engineering uses the IGP (via TE extensions) to
   distribute topology and TE information.  That is not an option for a
   DC that uses BGP signaling.

   BGP-LS [RFC9552] is a mechanism using BGP extensions to collect link
   state and TE information that has been signaled by the IGP.
   Typically, the information is distributed to a few collectors (e.g.,
   controllers) from a few distributors (e.g., IGP border routers).
   This document suggests using BGP-LS SPF [RFC9815] to distribute TE
   information for MPTE.  Every switch is a distributor of its local
   information.
   If distributed calculation is used, each switch is also a collector
   of other switches' local information.  More details will be provided
   in a future revision.

1.3.  Considerations for BGP Signaling

   [I-D.kompella-teas-mpte] outlines the information carried in the
   JUNCTION message.  When implemented in BGP, the MC ID, MPTED ID,
   MPTED Version and Tunnel Type are encoded in the NLRI of a new SAFI.
   For the tunnel information part, the ingress/egress node information
   and tunnel bandwidth are (for now) not encoded.  The junction
   bandwidth is in the NLRI as well, but is not considered part of the
   NLRI key.

   All the NHOP and PHOP information is encoded into a Tunnel
   Encapsulation Attribute (TEA) [RFC9012], with extensions specified
   in [I-D.ietf-bess-bgp-multicast-controller] and this document.  A
   TEA encodes a list of "tunnels", each of which could be a real
   tunnel or just an interface or neighbor.

   As explained in [I-D.ietf-bess-bgp-multicast-controller], when a TEA
   is attached to an NLRI of MCAST-TREE SAFI, corresponding traffic is
   replicated across the downstream tunnels in the TEA.  Otherwise
   (including in the MPTE case), traffic is load-balanced across the
   downstream (NHOP in the case of MPTE) tunnels.

   Other than that, most of the TEA extensions defined in
   [I-D.ietf-bess-bgp-multicast-controller] are applicable to MPTE,
   with the following notes:

   *  All tunnel types and sub-TLVs mentioned in
      [I-D.ietf-bess-bgp-multicast-controller] can be used.

   *  A tunnel with an RPF sub-TLV is for a PHOP.

   *  The NHOP load share is encoded in the Weight sub-TLV [RFC9830].

   *  In the case of labeled MPTE tunnels:

      -  The Tree Label Stack sub-TLV is used to signal the outgoing
         label (stack) of an NHOP.

      -  The Receiving MPLS Label Stack sub-TLV is used to signal the
         incoming label (stack) of a PHOP.

   *  For the MCAST-TREE case, only one tunnel has an RPF sub-TLV, and
      either there is only one tunnel with the Receiving MPLS Label
      Stack sub-TLV in the case of a P2MP tunnel, or each tunnel has a
      Receiving MPLS Label Stack sub-TLV in the case of an MP2MP
      tunnel.

   *  For the MPTE case with labeled MPTE tunnels, all PHOP tunnels,
      and only PHOP tunnels, have a Receiving MPLS Label Stack sub-TLV
      unless ordered control is used.

   *  An egress point (on a pure egress or on a bud node) is indicated
      by an Any-Encapsulation tunnel that has neither the RPF sub-TLV
      nor any sub-TLV that identifies a downstream interface/tunnel.
      In the bud node case, this tunnel has the Weight sub-TLV,
      indicating the load share, as the traffic is load-balanced
      between local delivery and other NHOP tunnels.

2.  Specification

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.1.  AFI/SAFI and NLRI

   This document defines a new SAFI type MPTE with value TBD1 for
   signaling MPTE junction states.  When it is used with AFI 1, the IP
   addresses in the NLRI are IPv4.  When it is used with AFI 2, the
   addresses in the NLRI are IPv6.

   The NLRI is encoded as follows:

   +-----------------------------------+
   |  Route Type (1 octet)             |
   +-----------------------------------+
   |  Length (1 octet)                 |
   +-----------------------------------+
   |  Route Type specific (variable)   |
   +-----------------------------------+

   This document defines the following Route Types:

      +  1 - Junction State route

      +  2 - Junction RESV route

   The Route Type specific part of the NLRI has the following format
   for both route types:

   +-----------------------------------+
   |  MC Address (4/16 octets)         |
   +-----------------------------------+
   |  MPTED ID (4 octets)              |
   +-----------------------------------+
   |  MPTED Version (4 octets)         |
   +-----------------------------------+
   | Tunnel Type (2 octets)|           |
   +-----------------------------------+
   |Junction Node Address (4/16 octets)|
   +-----------------------------------+
   |  Originating Node Address         |
   +-----------------------------------+
   | Junction BW (4 octets)|           |
   +-----------------------------------+

   All the fields above, except the Junction BW, along with the route
   type and length are part of the NLRI key.

2.2.  Full Link Identifier sub-TLV

   The Full Link Identifier sub-TLV is used to identify an unnumbered
   interface by the Peer Node Address, Peer Link Index and Local Link
   Index.  It is used for unnumbered PHOPs in the Junction State
   routes.

   +- - - - - - - - - - - - - - - - +
   | sub-TLV Type (1 Octet, TBD2)   |
   +- - - - - - - - - - - - - - - - +
   | sub-TLV Length (1 Octet)       |
   +- - - - - - - - - - - - - - - - +
   | Peer Node Address (4/16 Octets)|
   +- - - - - - - - - - - - - - - - +
   | Peer Link Index (4 Octets)     |
   +- - - - - - - - - - - - - - - - +
   | Local Link Index (4 Octets)    |
   +- - - - - - - - - - - - - - - - +

2.3.  Link Index sub-TLV

   The Link Index sub-TLV encodes the Link ID on a node receiving the
   route.  It is used for unnumbered PHOPs in the Junction RESV routes.

   +- - - - - - - - - - - - - - - - +
   | sub-TLV Type (1 Octet, TBD3)   |
   +- - - - - - - - - - - - - - - - +
   | sub-TLV Length (1 Octet)       |
   +- - - - - - - - - - - - - - - - +
   | Link Index (4 Octets)          |
   +- - - - - - - - - - - - - - - - +

2.4.  Interface and Node Address sub-TLV

   The Interface and Node Address sub-TLV encodes the local or neighbor
   address on an interface, and the address of the node that the
   interface connects to.  The type of address (IPv4/IPv6) is inferred
   from the sub-TLV Length.

   +- - - - - - - - - - - - - - - - +
   | sub-TLV Type (1 Octet, TBD4)   |
   +- - - - - - - - - - - - - - - - +
   | sub-TLV Length (1 Octet)       |
   +- - - - - - - - - - - - - - - - +
   | Peer Node address (4/16 Octets)|
   +- - - - - - - - - - - - - - - - +
   |  Intf/Nbr address (4/16 Octets)|
   +- - - - - - - - - - - - - - - - +

2.5.  Procedures

2.5.1.  Originating Junction State Routes

   After the MC calculates an MPTED, the SS originates a Junction State
   route for each junction node.  The route carries an IP Address
   Specific RT, with the Global Administrator field set to the junction
   node's address and the Local Administrator field set to 0.

   The route carries a Tunnel Encapsulation Attribute (TEA).  Each
   tunnel in the TEA corresponds to a PHOP or NHOP:

   *  Each tunnel is an Any-Encapsulation tunnel, with a Full Link
      Identifier sub-TLV or an Interface and Node Address sub-TLV to
      identify an incoming/outgoing link (in addition to other
      sub-TLVs).

   *  Each PHOP tunnel MUST also include the following sub-TLVs:

      -  One RPF sub-TLV to indicate it is a PHOP

      -  One Receiving MPLS Label Stack sub-TLV to encode the incoming
         label unless ordered control is used.

   *  Each NHOP tunnel MUST include one Tree Label Stack sub-TLV to
      encode the outgoing label unless ordered control is used.  It MAY
      include a Weight sub-TLV to encode the NHOP share.
      If one NHOP tunnel includes a Weight sub-TLV, then all NHOP
      tunnels MUST include a Weight sub-TLV.

2.5.2.  Receiving Junction State Routes

   Each node X that receives an MPTE route with an RT whose Global
   Administrator field does not match its loopback address propagates
   the route to all its neighbors (except the one from which it
   received the route).

   If the RT matches its own loopback address, X MUST import the route
   and MUST stop re-advertising it.

   Once the route is imported, X installs forwarding state as described
   in the following sections.  These sections cover the MPLS case when
   ordered control is not used (other tunnel types or ordered control
   will be specified in a future revision).

2.5.2.1.  Building Forwarding Nexthop

   When a data packet is received, an IP address or MPLS label lookup
   is done to produce the forwarding information about how the packet
   should be forwarded.  The forwarding information is referred to as
   the forwarding nexthop in this document, or simply the nexthop when
   it is not ambiguous.

   The forwarding nexthop for a junction is built by checking each NHOP
   tunnel.  The Interface and Node Address sub-TLV or the Full Link
   Identifier sub-TLV identifies the outgoing interface/neighbor, and
   the Tree Label Stack sub-TLV identifies the outgoing label.  The
   Weight sub-TLV provides the load-balancing share for the link, and
   bandwidth reservation can be done based on the Junction Bandwidth in
   the NLRI and the Weight sub-TLV in the NHOP.

2.5.2.2.  Installing Routes

   For each PHOP tunnel in the TEA, a label route is installed with the
   label value in the Receiving MPLS Label Stack sub-TLV, pointing to
   the forwarding nexthop built as specified above.

2.5.2.3.  Sending Junction State Route Acknowledgment

   Each junction sends an acknowledgment back to the SS.
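
   As an illustration of Sections 2.5.2.1 and 2.5.2.2, the following
   minimal sketch builds a forwarding nexthop from the NHOP tunnels and
   installs one label route per PHOP tunnel.  It rests on several
   assumptions that are not part of this specification: the Tunnel
   class and the function names are hypothetical, each label stack is
   reduced to a single label, and the Junction BW is split across NHOPs
   in proportion to their weights (this document only states that
   reservation can be based on the Junction BW and the Weight sub-TLV).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tunnel:
    # Hypothetical, flattened view of one tunnel in the TEA.
    link: str                        # from the link-identifying sub-TLV
    is_phop: bool = False            # True when an RPF sub-TLV is present
    in_label: Optional[int] = None   # Receiving MPLS Label Stack (PHOP)
    out_label: Optional[int] = None  # Tree Label Stack (NHOP)
    weight: int = 1                  # Weight sub-TLV (NHOP)


def build_nexthop(tunnels: List[Tunnel], junction_bw: float) -> List[dict]:
    """Build the forwarding nexthop from the NHOP tunnels."""
    nhops = [t for t in tunnels if not t.is_phop]
    total = sum(t.weight for t in nhops)
    nexthop = []
    for t in nhops:
        # Assumption: bandwidth is reserved in proportion to the weight.
        share = t.weight / total if total else 0.0
        nexthop.append({"link": t.link, "out_label": t.out_label,
                        "weight": t.weight, "bw": junction_bw * share})
    return nexthop


def install_routes(tunnels: List[Tunnel],
                   junction_bw: float) -> Dict[int, List[dict]]:
    """Map each PHOP incoming label to the shared forwarding nexthop."""
    nexthop = build_nexthop(tunnels, junction_bw)
    return {t.in_label: nexthop for t in tunnels if t.is_phop}


tea = [Tunnel("phop-1", is_phop=True, in_label=100),
       Tunnel("nhop-1", out_label=200, weight=1),
       Tunnel("nhop-2", out_label=300, weight=3)]
for label, nh in install_routes(tea, junction_bw=100.0).items():
    print(label, [(h["link"], h["out_label"], h["bw"]) for h in nh])
```

   With weights 1 and 3 and a Junction BW of 100, this sketch reserves
   25 and 75 units on the two NHOP links; other reservation policies
   are equally possible.
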
   Unless ordered control is used, the SS makes sure that all junctions
   are properly programmed before the tunnel is put into use.

   The acknowledgment is simply the same Junction State route modified
   as follows:

   *  The Originating Node Address is set to the junction node's
      address.

   *  The Route Target is set to target the SS.

2.5.3.  Ordered Control

   When hop-by-hop Ordered Control is used, the Junction State route
   does not carry encapsulation information (e.g., labels) in the
   PHOPs/NHOPs, and the junction's forwarding state is not installed
   until at least one Junction RESV route has been received from one of
   the NHOPs.

   Each junction originates a Junction RESV route targeted at each of
   its upstream junctions.  The route type specific part of the NLRI is
   set according to the Junction State route, with the Junction Node
   Address set to that of the upstream junction, which is taken from
   either the Interface and Node Address sub-TLV or the Full Link
   Identifier sub-TLV.  The Originating Node Address is set to that of
   this junction.  A Route Target is used to target the route at the
   upstream junction.  The Junction BW is set to the total BW to be
   reserved on the upstream junction for this junction.

   A TEA is attached, with only PHOP tunnels toward the upstream
   junctions.  The PHOP tunnel includes one of the following:

   *  A Tunnel Egress Endpoint sub-TLV, in which the address is set to
      the interface/neighbor address in the Interface and Node Address
      sub-TLV in the corresponding PHOP in the corresponding Junction
      State route.

   *  A Link Index sub-TLV, in which the Link Index is the Peer Link
      Index in the Full Link Identifier sub-TLV in the corresponding
      PHOP in the corresponding Junction State route.

2.5.4.  Route Update and Withdrawal

   When a junction is updated (e.g., with added/removed/updated
   PHOPs/NHOPs), the corresponding Junction State route is updated
   accordingly.
   If a junction is deleted, the corresponding Junction State route is
   withdrawn.  Corresponding acknowledgment and reservation routes are
   updated, originated, or withdrawn accordingly.

2.5.5.  Routes For Other Messages

   To be added.

3.  Security Considerations

   To be added.

4.  IANA Considerations

   To be added.

5.  Acknowledgments

   The authors thank Vishnu Pavan Beeram, Chandrasekar Ramachandran,
   Sudharsana Venkataraman, and Jai Hari M K for their comments and
   suggestions.

6.  References

6.1.  Normative References

   [I-D.ietf-bess-bgp-multicast-controller]
              Zhang, Z. J., Raszuk, R., Pacella, D., and A. Gulko,
              "Controller-based BGP Multicast Signaling", Work in
              Progress, Internet-Draft,
              draft-ietf-bess-bgp-multicast-controller-16,
              28 February 2025.

   [I-D.kompella-teas-mpte]
              Kompella, K., Jalil, L., Khaddam, M., and A. Smith,
              "Multipath Traffic Engineering", Work in Progress,
              Internet-Draft, draft-kompella-teas-mpte-01, 7 July 2025.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC9012]  Patel, K., Van de Velde, G., Sangli, S., and J. Scudder,
              "The BGP Tunnel Encapsulation Attribute", RFC 9012,
              DOI 10.17487/RFC9012, April 2021.

   [RFC9815]  Patel, K., Lindem, A., Zandi, S., and W. Henderickx, "BGP
              Link State (BGP-LS) Shortest Path First (SPF) Routing",
              RFC 9815, DOI 10.17487/RFC9815, July 2025.

   [RFC9830]  Previdi, S., Filsfils, C., Talaulikar, K., Ed., Mattes,
              P., and D. Jain, "Advertising Segment Routing Policies in
              BGP", RFC 9830, DOI 10.17487/RFC9830, September 2025.

6.2.  Informative References

   [RFC9552]  Talaulikar, K., Ed., "Distribution of Link-State and
              Traffic Engineering Information Using BGP", RFC 9552,
              DOI 10.17487/RFC9552, December 2023.

Authors' Addresses

   Zhaohui Zhang
   HPE
   Email: zhaohui.zhang@hpe.com

   Kireeti Kompella
   HPE
   Email: kireeti.ietf@gmail.com

   Aditya Mahale
   Meta
   Email: aditya.ietf@gmail.com

   Raghav Bhargava
   Crusoe
   Email: raghavbhargava12@gmail.com

   Aaron Zhang
   Westford Academy
   Email: aaronzhang194@gmail.com