Flat Program? Overheated Multicore Application?
Roadside assistance with an industry-standard inter-process communication - MCAPI
By Ted Gribb and Martina Brehmer, PolyCore Software, and Markus Levy, The Multicore Association
Planning the Multicore Journey
Approaching multicore is like planning a road trip; the vehicle should be in good shape and ready for travel. Similarly, a multicore project must be carefully planned. Engineering projects come with unknown factors and a multicore project adds complexity.
One perspective on the multicore journey can be seen in the Multicore Association’s (MCA) roadmap, whose primary goals are to develop an extensive set of application programming interfaces (APIs) and to establish an industry-supported set of multicore programming practices and services. One of these APIs is the communications API, MCAPI, the underlying focus of this article.
Questions that Designers Face in Planning a Multicore Journey
Is the application ready for multicore? Does it have “built-in” concurrency characteristics or will it require some restructuring to efficiently run on a multicore platform? Modification is likely.
Is message passing a good approach? Synchronization may be more challenging with true concurrency.
What is the best way to ensure scalability and code re-use for the future? The next platform may have many more cores and designers should be able to re-use the application as the number of cores increases.
What tools are available? Good tools are always useful; with multicore, they are critical.
Some engineers consider homogeneous platforms to be SMP and heterogeneous platforms to be AMP. The application’s behavior should be the deciding factor for symmetric or asymmetric, rather than what type of core is being used.
Another debate is the importance of a programming model when using dual-core platforms. First, if multiple communication channels are involved, queues and data must be safeguarded, and communications quickly become complicated even with just two cores. Second, if the application needs to scale to a different platform, source code reuse for next-generation platforms is beneficial, so selecting a standard programming model with tools support is important.
Let’s take a look at shared memory and zero copy. Factors to consider are: how much transaction data exists, the number of cores sharing the memory, the memory architecture and access to DMA. When moving large amounts of data, copy by reference is attractive. However, if many cores are randomly accessing the same memory, moving the data to a local memory may be more efficient.
Lastly, a common misconception is that with SMP, the application will automatically run faster. In a system with multiple applications, such as a personal computer or a server, this would likely be the case up to a certain number of cores, because the applications are mostly independent. However, embedded systems typically have a single application, and that single application needs to be distributed across multiple cores. The application must have opportunity for concurrency, and data dependencies must be considered.
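The limit on speedup for a single application is commonly estimated with Amdahl's law, which this article does not name. The sketch below is a minimal illustration that assumes a hypothetical application in which 80 percent of the run time can execute concurrently. The function name and the 80 percent figure are assumptions made for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only part of the work can run concurrently."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time on any number of cores;
    # only the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application: 80% of the run time can be parallelized.
    for cores in (1, 2, 4, 8, 16):
        print(f"{cores:2d} cores -> {amdahl_speedup(0.8, cores):.2f}x")
```

Under this assumption, going from 8 to 16 cores adds far less than going from 1 to 2, and the speedup can never exceed 5x regardless of core count. The model ignores communication and memory contention, which usually make real results worse.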
How to Prepare
Consider which programming model is the best match, based on application behavior and the platform of choice. The model should be reusable in upcoming trips.
Can the tool model the application and allow for review of different possibilities before committing to a platform and design? Do the tools provide the flexibility to adapt and scale? Tools that support and enable designers to use standards simplify the journey. Migrating to and optimizing for multicore is an iterative process, so multiple passes of design and optimizations typically occur.
Symmetric Multi-Processing (SMP)
With SMP, there is a single OS instance running on multiple cores, and work queues are serviced by the next available core. Shared memory is used for management and data sharing, and the cores must be homogeneous because they are executing code that’s compiled with a single instruction set.
The OS manages the scheduling of threads or processes. Typically, SMP offers high throughput because of its scheduling. However, response latency is unpredictable. Threads/processes can be pinned to cores, which introduces asymmetry. The more cores that share the same bus and the same memory, the more the returns diminish.
Figure 1: MCAPI communication modes
Asymmetric Multi-Processing (AMP)
AMP has multiple instantiations of one or more OSes, with each core running an OS, a simple scheduler or no OS at all. A system could have multiple different types of operating systems. Cores can be homogeneous or heterogeneous. AMP is quite scalable because it is more loosely coupled than SMP. Memory can be shared or local, and can span chip boundaries over a variety of interconnect types. Cores can be dedicated to certain functions, allowing them to have deterministic behavior and giving more precise control over the system. The ability to use different kinds of cores matched to the workload, such as a DSP for signal processing, provides better power efficiency. However, multiple tool sets are needed when different types of cores are in the system.
MPI is used for widely distributed computing, and MCAPI for closely distributed computing. Message passing is a ubiquitous model, available within OSes and used for networking.
Message passing can be applied to SMP and AMP or a combination thereof and scales well. MPI can be found in distributed supercomputers with many cores, whereas MCAPI primarily targets closely distributed computing, defined as multiple cores on a chip, multiple chips on a board or a combination thereof.
OpenMP is primarily used for SMP systems and often used for distribution of loops across multiple cores. OpenCL is another model that is primarily used for GPU programming.
Take Advantage of Standards
The programming model should be suitable for your target market, work the same way on few and many cores, and be agnostic to the type of core, OS and transport, thereby facilitating scalability. Seek standards that are actively supported by an organization with broad industry backing, including hardware and software vendors and consumers. The standard should be available and proven to be portable.
MCAPI Communication Modes
Connectionless messages, the most flexible mode, are sent from a source to a destination without requiring a connection, using either blocking or non-blocking functions. Communication is buffered with per-message priority.
Packet channels use connected unidirectional FIFO channels, with blocking and non-blocking functions. Communication is buffered with per-channel priority.
Scalar channels carry 8-, 16-, 32- or 64-bit scalars and use connected unidirectional FIFO channels, with blocking functions and per-channel priority.
The remaining functional groups cover node and endpoint management, the management of non-blocking operations, and other support functions.
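The differences between the three modes can be illustrated with a small simulation. The sketch below is not the MCAPI API and does not use MCAPI function names. All class and method names are hypothetical, blocking versus non-blocking behavior is not modeled, and the convention that a lower number means a higher priority is an assumption of the example.

```python
import heapq
import itertools
from collections import deque


class MessageEndpoint:
    """Toy receive side for connectionless messages with per-message priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order

    def send(self, payload: bytes, priority: int = 0) -> None:
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self._heap, (priority, next(self._order), payload))

    def recv(self) -> bytes:
        if not self._heap:
            raise LookupError("no message buffered")
        return heapq.heappop(self._heap)[2]


class PacketChannel:
    """Toy connected, unidirectional FIFO channel."""

    def __init__(self):
        self._fifo = deque()
        self._connected = False

    def connect(self) -> None:
        self._connected = True

    def send(self, packet: bytes) -> None:
        if not self._connected:
            raise RuntimeError("channel must be connected before use")
        self._fifo.append(packet)

    def recv(self) -> bytes:
        if not self._fifo:
            raise LookupError("no packet buffered")
        return self._fifo.popleft()


class ScalarChannel(PacketChannel):
    """Toy channel that carries fixed-width unsigned scalars."""

    WIDTHS = (8, 16, 32, 64)

    def send_scalar(self, value: int, bits: int) -> None:
        if bits not in self.WIDTHS:
            raise ValueError("scalar width must be 8, 16, 32 or 64 bits")
        if not 0 <= value < (1 << bits):
            raise ValueError("value does not fit in the requested width")
        self.send(value.to_bytes(bits // 8, "little"))

    def recv_scalar(self) -> int:
        return int.from_bytes(self.recv(), "little")
```

In this model a message can be sent at any time and is delivered by priority, while a channel refuses traffic until it is connected and then delivers strictly in order.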
Inspect Before Embarking on the Journey
The MCAPI programming model works with AMP and SMP systems. MCAPI is very scalable and should take us well into the future, allowing reuse across platforms and product generations, and is targeted at closely distributed computing and embedded systems.
Multiple commercial implementations are available as well as an open source implementation. MCA membership spans the industry, providing a body of expertise and knowledge that can be leveraged and tools that are readily available.
MCAPI can be used stand-alone, or in combination with other synergistic MCA standards.
Tools Simplify the Journey
Ideally, all flows in an application are mapped to a multicore platform in the most efficient way. The graphic below is a function pipeline. The more compute-intensive functions, in the red ovals, are replicated and applied in parallel steps to equalize the time of each step in the pipeline.
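The replication step described above can be sketched as a small calculation. This is a minimal illustration, not the method used by any particular tool. The function names and the stage times are hypothetical, and the sketch assumes that items flowing through a replicated stage are independent and that enough cores are available.

```python
import math


def replicas_for_balance(stage_times_ms):
    """Replicas per stage so no stage is slower than the fastest stage."""
    if not stage_times_ms or any(t <= 0 for t in stage_times_ms):
        raise ValueError("stage times must be positive")
    target = min(stage_times_ms)
    return [math.ceil(t / target) for t in stage_times_ms]


def pipeline_period(stage_times_ms, replicas):
    """Steady-state time per item, set by the slowest effective stage."""
    return max(t / r for t, r in zip(stage_times_ms, replicas))


if __name__ == "__main__":
    # Hypothetical per-item stage times in milliseconds.
    stages = [2.0, 6.0, 2.0, 4.0]
    replicas = replicas_for_balance(stages)
    print("replicas per stage:", replicas)
    print("period before:", pipeline_period(stages, [1] * len(stages)))
    print("period after: ", pipeline_period(stages, replicas))
```

With these example numbers the slow stages receive three and two copies, and the time per item drops from the slowest stage's time to the fastest stage's time, at the cost of seven cores instead of four.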
Rapid prototyping capabilities lighten the programming load. If mapping can be done in a timely manner, then different topologies and configurations can be evaluated and fine-tuned for optimal results. Mapping to multicore is an iterative process, and a rapid prototyping capability will expedite the process.
Tools – MCAPI Enable
On a single processor, function parameters are passed on the call stack. When functions run on different cores (AMP), parameters are passed by explicit communication. The functions themselves are unchanged; they are simply encapsulated by communication calls.
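The encapsulation idea can be sketched with two threads standing in for two cores and in-process queues standing in for the send and receive calls. This is an analogy only: it does not use MCAPI, and the names `scale`, `worker_node` and `remote_scale` are hypothetical.

```python
import queue
import threading


def scale(values, factor):
    """The original function, unchanged by the move to another node."""
    return [v * factor for v in values]


def worker_node(requests: queue.Queue, replies: queue.Queue) -> None:
    """Runs on the 'remote' node: receive parameters, call, send the result."""
    while True:
        message = requests.get()           # blocking receive
        if message is None:                # sentinel: shut the node down
            break
        values, factor = message
        replies.put(scale(values, factor))  # send the return value back


def remote_scale(requests, replies, values, factor):
    """Caller-side wrapper: parameters travel as a message, not on the stack."""
    requests.put((values, factor))
    return replies.get(timeout=5)


if __name__ == "__main__":
    requests, replies = queue.Queue(), queue.Queue()
    node = threading.Thread(target=worker_node, args=(requests, replies))
    node.start()
    print(remote_scale(requests, replies, [1, 2, 3], 10))
    requests.put(None)
    node.join()
```

The body of `scale` is identical in both the single-core and the distributed case. Only the thin wrappers around it know that communication is involved.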
Figure 2: The application flow and the corresponding tools-generated topology map.
Tools – Instant MCAPI
The tool exemplified in the picture is aware of the underlying topology, nodes and ports, and uses this information in coding templates. The programmer selects an MCAPI function, e.g., send message or receive message, then decides whether to use the blocking or non-blocking version and selects the destination node and ports. The next step is to select parameters, either from existing variables or from those offered by the tool. The template is then completed, and the MCAPI code is inserted in the application at the programmer-specified location. Code generation speeds up programming and reduces errors and debugging.
Tools – SW vs. HW Accelerator
Setting up a DMA transaction takes time, and that setup cost must be balanced against the amount of data to be moved. For a small transaction, software copy may be better, whereas for a large transfer, DMA is better. Ideally, a threshold based on the transaction cost of the specific DMA determines automatically whether to use software copy or hardware (DMA) copy. Programming a DMA can also be complex and time-consuming. A tool that configures the DMA, including setting a trigger, instead of requiring it to be programmed makes the work easier and faster. Whether or not the DMA is used should be transparent to the MCAPI-enabled application. Tools provide a platform for such experimentation and optimization.
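One way to reason about such a threshold is a simple linear cost model: software copy time grows with size from zero, while DMA time starts at a fixed setup cost and then grows more slowly. The sketch below assumes that model. The function names and all of the numbers are hypothetical, not measurements of any real DMA engine.

```python
def dma_breakeven_bytes(dma_setup_us, dma_bytes_per_us, cpu_bytes_per_us):
    """Transfer size above which DMA finishes sooner than a software copy.

    Returns None when DMA is not faster per byte, in which case it never
    wins on elapsed time alone.
    """
    if dma_bytes_per_us <= cpu_bytes_per_us:
        return None
    # Software: n / cpu_rate.  DMA: setup + n / dma_rate.
    # The two are equal at n = setup / (1/cpu_rate - 1/dma_rate).
    return dma_setup_us / (1.0 / cpu_bytes_per_us - 1.0 / dma_bytes_per_us)


def choose_copy(nbytes, threshold):
    """Pick the copy method for one transfer, given a break-even threshold."""
    if threshold is not None and nbytes > threshold:
        return "dma"
    return "software"


if __name__ == "__main__":
    # Hypothetical platform: 5 us setup, DMA 800 B/us, CPU copy 200 B/us.
    threshold = dma_breakeven_bytes(5.0, 800.0, 200.0)
    print(f"break-even at about {threshold:.0f} bytes")
    for size in (64, 1024, 4096):
        print(size, "->", choose_copy(size, threshold))
```

This model considers elapsed time only. Because DMA also frees the CPU for other work, a real threshold may reasonably be set lower than the break-even point, which is one reason to determine it by experiment.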
Figure 3: Instant MCAPI – Code generated with an MCAPI template tool speeds up the programming process and reduces errors/debugging.
Tools – Memory Utilization
Using shared memory is simple because one can pass a pointer referencing the data, along with some metadata. The small amount of data movement is attractive for efficiency reasons. On the other hand, if a number of cores randomly access the data in shared memory, bus contention may reduce performance. In some cases, moving the data from one local memory to another local memory may be better. The data can then be processed faster, because local memory is undisturbed by other cores. Should DMA be available, it would offload the CPU and provide a compact transaction, occupying the bus for a short period of time. Some parts of the data flow may be best served by copy by reference and others by data movement. Predicting memory behavior in a multicore platform can be challenging, and having the tools and runtime capabilities to experiment with memory usage will likely produce better results.
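The distinction between copy by reference and data movement can be shown in miniature. In the sketch below a `bytearray` stands in for shared memory and a small descriptor stands in for the pointer plus metadata. The `BufferRef` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferRef:
    """Metadata passed instead of the data itself (hypothetical descriptor)."""
    offset: int
    length: int


def by_reference(shared: bytearray, ref: BufferRef) -> memoryview:
    """Zero copy: the receiver reads the sender's bytes in place."""
    return memoryview(shared)[ref.offset:ref.offset + ref.length]


def by_copy(shared: bytearray, ref: BufferRef) -> bytes:
    """Data movement: the receiver gets a private copy in 'local' memory."""
    return bytes(shared[ref.offset:ref.offset + ref.length])


if __name__ == "__main__":
    shared = bytearray(b"0123456789")
    ref = BufferRef(offset=2, length=4)
    view = by_reference(shared, ref)
    local = by_copy(shared, ref)
    shared[2] = ord("X")  # another core modifies the shared region
    print(bytes(view))    # the reference sees the change
    print(local)          # the local copy does not
```

The reference costs almost nothing to pass but stays coupled to the shared region, so every later access competes with other users of that memory. The copy pays for the transfer once and is independent afterward.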
Tools – On-Chip / Off-Chip
What if the platform spans across multiple chips? The inter-chip transport could be shared memory or a serial connection, requiring data movement. Does the application have to be aware of the topology, and if so, how about portability and spanning across multiple platforms? The application doesn’t have to know what’s under the hood, just as drivers do not have to understand all that goes on under the hood of a car in order to be able to operate the car.
Figure 4: Tools SW vs. HW – A tool that configures the DMA, including setting a trigger, instead of requiring it to be programmed.
Figure 5: Tools on-chip/off-chip – There is no programming required to invoke the SRIO transports.
Let’s call roadside assistance on this one as well. The tools can assist in both selecting the proper transport and configuring it, and the application will remain unchanged. As we can see in the case of the SRIO transport, we can select the communication type (9 or 11) and a few other parameters, and we are ready to communicate. There is no programming required to invoke the SRIO transports.
If you’re planning for a multicore journey, improve the trip with a better route and fewer surprises along the way. When planning your project, consider the application’s behavior and characteristics as well as the platform. Then select a matching programming model. Experiment before committing to the design. Access to rapid prototyping tools can be very valuable. Also, reusable programming models are cost-effective because they allow applications to span multiple platforms, both within a product line and between product generations.
MCA offers a family of standards that can be used individually or combined. Runtime solutions and tools are available, supporting MCAPI on multiple platforms, scaling across various numbers of cores and offering code reusability. The process is iterative, and being able to rapidly complete the first pass and experiment is key to a successful multicore journey. Code-generation tools speed the process and provide consistent code and repeatability.
Multicore journeys are complex, but can be simplified. Accept all available roadside assistance.
Ted Gribb is PolyCore Software’s vice president of sales. Prior to joining PolyCore Software, Gribb had sales management positions for Wind River, Diab Data and Mentor Graphics. Previously, he held management positions in software engineering. Gribb received a Bachelor of Science degree in mathematics from DeSales University.
Martina Brehmer is the director of marketing for PolyCore Software. Prior to joining PolyCore Software, Brehmer received her Bachelor of Science in International Business from the Eberhardt School of Business at University of the Pacific. She is also dedicated to photography.
Markus Levy is president of The Multicore Association and chairman of the Multicore Developer’s Conference. He is also the founder and president of EEMBC. Mr. Levy was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded industry.