Automatic Code Partitioning Speeds Network Processor Programming

By Prashant R. Chandra

Executing code in parallel using multiple processing engines can help developers meet the ever-increasing speed demands of network data plane operations. Partitioning a software task into code for parallel execution, however, has typically required an engineering-intensive iterative process. By using compilers that support an auto-partitioning programming model, developers can greatly speed design efforts while improving system performance.

In the data plane, processing consists primarily of receiving packets from a media interface or switch fabric, then performing a series of operations. The processor must classify packets, modify them to accommodate the protocols and control plane policies being implemented, queue the packets for transmission, and transmit them back out through a switch fabric or media interface according to a traffic management policy. All this must be accomplished fast enough to be ready for the next packet when it arrives, an increasingly challenging task for conventional processors as data rates rise. This packet-by-packet application flow fits well, however, with the use of processor parallelism. Parallel processors can work on packets separately as they are received, reducing the data rate each processor must support.

Advanced network processors (NPUs), such as the Intel® IXP2XXX product line, implement such hardware parallelism, allowing them to independently execute multiple cooperative thread contexts. This independent execution allows software developers to partition various packet processing tasks into operations that execute concurrently. In this way, the cycle budget for any one packet or cell can be multiplied by the degree of parallelism allocated to packet processing.

While conventional programming models allow full use of the hardware for implementing networking algorithms and for creating and managing parallelism, they require software developers to perform additional design tasks. One added effort is partitioning applications: transforming them from sequential, run-to-completion algorithms into multi-threaded parallel execution streams. This task requires making trade-offs among performance, instruction store usage, and coding complexity by iteratively analyzing, coding, and profiling until an optimized result is achieved.

Developers must also spend time creating schemes for locking, signaling, and synchronization among application threads to control access to shared data structures. The shared structures are needed because of the typically high cost of memory access in network processing. Developers must then establish inter-process communications between execution engines for transfer of control and data between processing stages in a pipeline.

The Auto-Partitioning Programming Model

An auto-partitioning programming model automates these software tasks, streamlining programming and freeing developers to focus on application development instead of implementation details. In Intel’s latest C compiler for the Intel® IXA software development kit (SDK), for instance, the auto-partitioning programming model exploits the explicitly parallel processor topology of Intel IXP2XXX network processors. The compiler works to meet user-specified application performance requirements by spreading processing tasks across a large number of individual RISC engines (microengines, or MEs), each of which can run as many as eight cooperative processing threads.

When using the auto-partitioning programming model, developers express the network processing application as a set of packet processing stages (PPSes). These are sequential operations that can execute concurrently and can intercommunicate. The physical and logical structure of the program consists of a set of C source files that implement one or more PPSes. The auto-partitioning C compiler analyzes the critical paths and performance requirements of these PPSes, and renders them individually on one or more microengine threads.

The C functions that implement PPSes do not take any function arguments and do not return once called. Each PPS has a distinct entry point and a main loop that runs indefinitely. Packet processing stages are implemented using familiar C function syntax, preceded by the keyword “__pps”. As shown in Figure 1, only minimal coding is required to implement a PPS.


Figure 1. Auto-partitioning starts with defining the application as a set of packet processing stages (PPSes) in C.
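A minimal sketch of such a stage, in the spirit of what Figure 1 describes, appears below. Only the __pps keyword and the no-argument, never-returning loop structure come from the model itself; the packet_t type and the helper functions are hypothetical placeholders for application code.

    /* Sketch of a minimal packet processing stage (PPS).  The __pps keyword
     * is the extension described in the text; packet_t and the helper
     * functions are hypothetical application code, not part of the SDK. */
    typedef struct packet packet_t;

    void receive_packet(packet_t **pkt);   /* hypothetical: fetch next packet  */
    void process_packet(packet_t *pkt);    /* hypothetical: application logic  */
    void transmit_packet(packet_t *pkt);   /* hypothetical: queue for transmit */

    __pps void ingress_pps(void)           /* a PPS takes no arguments...        */
    {
        for (;;)                           /* ...and its main loop never returns */
        {
            packet_t *pkt;

            receive_packet(&pkt);
            process_packet(pkt);
            transmit_packet(pkt);
        }
    }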

The auto-partitioning programming model introduces performance control directly into the compilation process by providing C language extensions for handling critical paths through the PPS. Developers use these extensions to specify the performance and throughput requirements for each critical path. The extensions take two forms: source code annotations and external throughput specifications.

Source code annotations identify the critical paths for the compiler, using the directive “__path(identifier);”. The “identifier” in the statement names the critical path and indicates that the program point where the directive occurs lies on the named path. From this annotation, the compiler determines what other program points lie on the critical path. For example, if a conditional structure or compound statement lies on a critical path, as shown in Figure 2, then by convention the else branch is assumed to lie on the critical path. Hence in Figure 2, statement 2 lies on the critical path called “X.” If a critical path directive occurs on one or more branches of a conditional statement, however, the branches that lack a directive do not belong to the critical path unless explicitly indicated. For example, in Figure 3 statements 0, 1, and 3 belong to critical path X, while statement 2 does not. All statements on the same critical path are subject to the same performance requirement, and a statement may belong to more than one critical path.


Figure 2. To automatically allocate resources, developers must first use the C extension __path(); to identify critical paths for the compiler.

Figure 3. The location of the __path(); directive in the code allows the compiler to identify all the statements that form a critical path.
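The figures themselves are simple code fragments; the following reconstruction, which follows the statement numbering used above, is a hedged approximation of what they show. Only the __path(identifier); directive is part of the model; the condition and the statement_N() calls are placeholders.

    /* Figure 2 (approximation): the directive precedes a conditional whose
     * branches carry no annotations of their own, so by convention the else
     * branch is assumed to lie on critical path X. */
    __path(X);
    if (condition)
        statement_1();      /* not assumed to be on path X             */
    else
        statement_2();      /* on path X by the else-branch convention */

    /* Figure 3 (approximation): a directive appears inside one branch, so
     * only that branch belongs to the path; the unannotated branch does not. */
    statement_0();          /* on path X                               */
    if (condition) {
        __path(X);
        statement_1();      /* on path X: explicitly annotated         */
    } else {
        statement_2();      /* not on path X                           */
    }
    statement_3();          /* on path X                               */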

The throughput specification describes the throughput requirement, in terms of loop iterations per second, for each annotated critical path, assuming the worst-case scenario in which all packets traverse that critical path. The specification is supplied through a compiler command-line switch of the form “-T<path_name=identifier>”, where “identifier” refers to the name used in the associated path directive. Multiple switches are used to specify throughput requirements for multiple critical paths.
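Pulling the pieces together, a hedged sketch of a forwarding PPS with a single annotated critical path might look like the following. The packet helpers (reused from the earlier sketch) and the path name fast_path are hypothetical; the corresponding iterations-per-second requirement would be supplied with the -T switch described above.

    /* Sketch only: the helper functions and the path name fast_path are
     * hypothetical.  The __pps and __path() extensions are those described
     * in the text; the throughput requirement for fast_path would be given
     * on the compiler command line with the -T switch. */
    __pps void forwarder(void)
    {
        for (;;) {
            packet_t *pkt;

            receive_packet(&pkt);
            if (is_ipv4(pkt)) {
                __path(fast_path);      /* worst-case path with a guarantee */
                forward_ipv4(pkt);
            } else {
                handle_exception(pkt);  /* unannotated branch: off the path */
            }
            transmit_packet(pkt);
        }
    }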

Rendering PPSes to Microengines

For each annotated critical path that is accompanied by a throughput specification, the compiler evaluates the number of microengines required and automatically chooses a partitioning that meets the performance requirements of all critical paths. If the compiler cannot satisfy the requirements of one or more critical paths, it prioritizes the paths in the order in which their throughput specifications appear on the command line and generates an error message showing the best performance achievable for each failed path. The compiler also produces a detailed performance report that includes utilization of the various hardware resources, such as memory bandwidth, along with the best achievable performance on the remaining critical paths.

In a typical design, each iteration of the PPS main loop processes a single packet. The compiler works to achieve the desired application performance by replicating the execution of the packet processing stage across enough threads or microengines to enable the worst-case execution path to accept and dispatch packets at the worst-case arrival rate. The compiler will render the PPSes to the microengines using one of the following techniques:

  • If the instruction stream for the packet processing loop fits within the code store of a single microengine, the compiler can replicate this instruction stream across enough threads to meet performance requirements. This is also referred to as functional pipelining.
  • If the instruction stream for the packet-processing loop exceeds the code store of a single microengine, or if the compute horsepower of a single microengine is insufficient to meet performance requirements, the compiler will search for an appropriate location to distribute the instruction stream onto two or more microengines as needed. The compiler automatically generates code to marshal and unmarshal the live variables into a series of words to be sent over a compiler-inserted communications “pipe” between the microengines, as shown in Figure 4.
  • To meet certain performance objectives, the compiler may choose a combination of these techniques.

Figure 4. When allocating code to the processor’s microengines (MEs), the compiler can choose from among several strategies to find the best fit between the resources and the code requirements.
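To make the second technique concrete, the sketch below approximates the kind of staged code the compiler generates on its own when it splits one PPS across two microengines, as Figure 4 depicts. It is an illustration only: developers never write this code, and pipe_write(), pipe_read(), and the marshaling helpers are hypothetical stand-ins for the compiler-inserted pipe and generated glue (declarations omitted for brevity).

    /* Conceptual illustration of a compiler-generated split, not real output.
     * The live variables at the split point (a packet handle and its
     * classification) are marshaled into words, sent over a compiler-inserted
     * pipe, and unmarshaled on the next microengine. */

    /* First half of the original PPS loop, rendered on one microengine. */
    void stage_one(void)
    {
        for (;;) {
            packet_t *pkt;
            unsigned  words[2];

            receive_packet(&pkt);
            words[0] = packet_to_handle(pkt);   /* marshal live variables */
            words[1] = (unsigned)classify(pkt);
            pipe_write(words, 2);               /* compiler-inserted pipe */
        }
    }

    /* Second half, rendered on another microengine, resumes where the
     * first half left off. */
    void stage_two(void)
    {
        for (;;) {
            unsigned  words[2];
            packet_t *pkt;

            pipe_read(words, 2);                /* receive marshaled words */
            pkt = handle_to_packet(words[0]);   /* unmarshal               */
            apply_policy(pkt, (int)words[1]);
            transmit_packet(pkt);
        }
    }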

Supporting Software Reuse

Automatically rendering the PPSes onto the microengines based on the performance requirements vastly speeds software development. A derivative advantage is that it simplifies software reuse within a processor family. Intel® network processors, for instance, support a wide range of network applications, from low-end Internet access appliances to multi-gigabit core-network routers. Many of these devices share a common set of packet processing functions, and network equipment manufacturers increasingly want to reuse application code among these related devices to consolidate their hardware and software platforms.

Intel’s current implementation of C for network processors shields the application code from underlying changes in the microengine instruction set. In addition, the compiler hides differences in microengine implementation among NPUs. The application code the developer writes, however, is typically tied directly to the type of NPU, clock speed, and the particular microengine on which the code is executing. As a result, choices made to optimize performance become hard-coded at the source level, which complicates the task of migrating applications to other NPUs. The auto-partitioning programming model will abstract such performance differences among microengines and other functional units. This abstraction then allows a developer to focus on factoring common networking functions into a reusable code base, and to incorporate this code transparently into multiple applications running on different Intel network processors without re-optimization or reconfiguration.

The auto-partitioning C compiler also provides a mechanism to reuse legacy assembly code as well as code generated from previous versions of Intel® IXA Microengine C. When using existing code, developers specify to the compiler one or more input files written with the existing programming model and indicate which microengines those files are assigned to. The compiler then excludes these microengines from its available pool of resources when rendering PPSes onto the remaining microengines. In addition to microengines, developers can specify the available bandwidth on the various memory channels and other internal resources, which the compiler uses to determine its resource pool. A specialized form of pipe syntax allows PPSes to communicate with existing code modules, and mixed-model applications may also use shared memory constructs.

By providing mechanisms for reuse of existing software as well as automating time-consuming partitioning tasks, the auto-partitioning programming model can boost the efficiency of network processing software development for parallel processing architectures, improving design performance along with designer productivity.


Prashant R. Chandra is a Principal Engineer with the Infrastructure Processor Division at Intel Corporation. He was one of the original contributors to the development of the Intel® IXA Portability Framework for the Intel® IXP12XX and IXP2XXX network processors. Currently, he leads the development of the auto-partitioning C compiler for IXP2XXX network processors. He is also a key contributor to the development of the next generation of network processor architectures.

For more information, visit: developer.intel.com/DESIGN/NETWORK/PRODUCTS/npFAMILY/INDEX.HTM