Get the Truth on Multicore-Processor Performance

To make the right multicore decisions, designers must use the right multicore benchmarks.

By Markus Levy and Shay Gal-On

The multicore technology era is solidly upon us. Nearly every processor vendor is offering or developing products and architectures to support the imminent demand. Meanwhile, system developers nervously contemplate their options as they realize that adopting multicore technology presents as many challenges as it does benefits. One of those challenges lies in analyzing the potential performance of a processor and/or system-on-a-chip (SoC) that’s based on multicore technology. Not surprisingly, putting multiple execution cores into a single processor (as well as continuing to increase clock frequency) doesn’t guarantee greater multiples of processing power. For a given application, there’s also no assurance that a multicore processor will deliver a dramatic increase in a system’s throughput.

Despite the pessimism, the right combination of processor and programming techniques can scale well with the number of cores. Yet this will depend on how the processor is designed and how the application program is written. Industry-standard benchmark tests from EEMBC demonstrate a multicore system’s behavior in a wide variety of scenarios. In doing so, they model how a designer’s system will function. Results of these benchmarks reveal some interesting truths that will hopefully make designers pay close attention to the combined effects of the multicore processor, memory subsystem, operating system (OS), and other system-level characteristics.

Multicore-Processor Design And Performance Analysis

When analyzing multicore performance, one must consider scalability where contexts exceed resources. Assume that a program is composed of a varying number of threads. (It’s not unreasonable to have hundreds of threads in a relatively complex program.) If the number of threads exactly matches the number of processor cores, it’s possible that performance could scale linearly. Realistically, however, the number of threads will exceed the number of cores. In addition, performance will depend on other factors, such as memory and I/O bandwidth, intercore communications, OS scheduling support, and synchronization efficiency.

The memory bandwidth of a multicore processor depends on the memory subsystem design. That subsystem design, in turn, is dependent on the underlying multicore architecture. Multicore implies either a shared- or distributed-memory architecture. Shared memory, which is typically associated with homogeneous multicore systems, is accessed through a bus. It is controlled by some locking mechanism to avoid simultaneous access of the same memory by multiple cores. Shared memory enables a straightforward programming model, as each processor can directly access the memory (see Figure 1). The shared-memory structure can become a bottleneck when too many cores try to access it simultaneously. This bottleneck also implies that the memory architecture doesn’t scale well with an increasing number of cores.

Figure 1: Two dual-core, shared-memory architectures are depicted with different memory subsystems. There are performance advantages and disadvantages associated with each approach.

Unless an application is running on “bare metal” (i.e., directly on the processor hardware without operating-system support), OS scheduling will play a big role in determining multicore implementation behavior. Scheduling refers to the way that processes are assigned priorities in a priority queue. Yet scheduling also will be determined by the availability of on-chip processing resources. (This will be based partly on the operating system’s ability to monitor the availability of hardware resources, such as cores or hyper threads.)

High-Level Multicore Benchmark Categories

A multicore processor can be utilized in several ways. Examples include asymmetric multiprocessing (AMP), functional partitioning, and parallelization. Among other things, AMP provides a centralization of distributed processing. Here, all cores can run entirely different and theoretically unrelated tasks. On a quad-core chip, for example, this is essentially the same as four separate machines running in one package. Even though there may be minimal interaction between the cores, the overall performance will be limited by the system-level memory bandwidth.

Figure 2: Comparing two dual-core platforms demonstrates how results can vary and depend on multiple factors.

Functional partitioning is possible with multiprocessors (i.e., separate processor chips). But a system can potentially benefit from the cores’ proximity in a multicore chip—especially for data sharing between cores. An example of functional partitioning is where core_1 runs a security application, core_2 runs a routing algorithm, core_3 enforces policy, etc. Depending on the workload presented by each of these functions, it’s possible that additional functions can be added or swapped in as needed.

Finally, multicore technology can be used to increase an application’s parallelization or concurrency. From a performance perspective, this is perhaps the greatest benefit of multicore technology. But it also raises the most challenges so far, as it requires a careful analysis of the manner in which threads share data. Although inter-thread communications will benefit from the cores’ proximity, cache utilization and system-level shared resources also will have an impact.

Multicore benchmarks need to take all of these applications into account. Of course, this is more easily said than done. There’s never been a simple way to measure the performance of “normal” single-core processors and reduce it to a single measure of “goodness.” It’s far more difficult to measure the performance of a multicore device and produce a single figure of merit.

When working with multicore benchmarks, it becomes clear that the combined interactions of all of the factors outlined above result in marked performance differences, even among quite similar platforms. Tests on two dual-core processors, for example, show quite different rates of speed-up, depending on the number of concurrent streams and which specific benchmarks are running (see Figure 2). From this information, the designer can tailor the software to align with the benchmark characteristics that yielded the highest performance on that specific processor.

Benchmarks For Different Types Of Parallelism

From a parallelism perspective, a multicore benchmark must target two fundamental areas of concurrency: data throughput and computational throughput. Benchmarks that analyze data throughput will show how well a solution scales as the number of data inputs grows. This can be accomplished by duplicating the same computation and applying it to multiple different datasets. A real-world example of this method is the decoding of multiple different JPEG images (as may occur when viewing a web page).

It’s interesting to determine the point at which performance begins to degrade while increasing the number of data inputs. In developing such a benchmark test, the biggest challenge is that the code must be thread-safe to support simultaneous execution by multiple threads. Without compromising required performance throughput, it must satisfy the need for multiple threads to access the same shared data as well as the need for a shared piece of data to be accessed by only one thread at any given time.
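The shape of such a data-throughput workload can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MultiBench implementation: checksum() is a hypothetical stand-in for the real per-item work (such as decoding one image), and SharedResults and run_data_throughput() are names invented here. Note also that standard CPython threads show the structure and the locking discipline, but will not necessarily show real speed-up on CPU-bound work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def checksum(data):
    """Hypothetical stand-in for the per-item work (e.g., decoding one image)."""
    total = 0
    for byte in data:
        total = (total * 31 + byte) % 65521
    return total


class SharedResults:
    """Shared table: any thread may use it, but only one may touch it at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def store(self, item_id, value):
        with self._lock:  # serialize updates to the shared data
            self._results[item_id] = value

    def snapshot(self):
        with self._lock:
            return dict(self._results)


def run_data_throughput(datasets, contexts):
    """Apply the same computation to every dataset using `contexts` threads."""
    shared = SharedResults()

    def work(item_id):
        shared.store(item_id, checksum(datasets[item_id]))

    with ThreadPoolExecutor(max_workers=contexts) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(work, range(len(datasets))))
    return shared.snapshot()
```

Running the same list of datasets with contexts set to 1, 2, 4, and so on, and timing each run, is one way to look for the point at which throughput stops improving.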

To demonstrate computational throughput, the above approach can be extended further by developing tests that can initiate more than one task at a time. Concurrency is then implemented over both the data and code. This will demonstrate a solution’s scalability for general-purpose processing. As an example, consider the execution of MPEG decode(x) followed by MPEG encode(x). This example is similar to what one might find in a set-top box in which the satellite signal is received, decoded, and encoded into a different-quality signal for storing on the hard disk. As a benchmark, this application requires synchronization between the contexts as well as a method for determining when the benchmark completes.
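A minimal sketch of the decode-then-encode arrangement follows. It assumes a two-stage pipeline joined by a bounded queue; decode() and encode() are trivial, hypothetical placeholders rather than real MPEG routines, and the sentinel object is just one possible way to signal that the stream has ended.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def decode(frame):
    """Hypothetical stand-in for MPEG decode: bytes -> list of sample values."""
    return list(frame)


def encode(samples):
    """Hypothetical stand-in for a lower-quality encode: halve every sample."""
    return bytes(s // 2 for s in samples)


def run_pipeline(frames):
    """Run decode and encode as two concurrent contexts and return the output."""
    handoff = queue.Queue(maxsize=4)  # bounded queue synchronizes the contexts
    encoded = []

    def decoder():
        for frame in frames:
            handoff.put(decode(frame))  # blocks if the encoder falls behind
        handoff.put(_DONE)

    def encoder():
        while True:
            item = handoff.get()
            if item is _DONE:
                break
            encoded.append(encode(item))

    workers = [threading.Thread(target=decoder), threading.Thread(target=encoder)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # the run is complete once both contexts have exited
    return encoded
```

The queue handles the synchronization between contexts, while joining both threads gives the harness an unambiguous completion point.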

Data decomposition is where an algorithm is divided into multiple threads that work on a common data set. It therefore demonstrates support for fine-grain parallelism. In this situation, the algorithm could be working on a single audio and video datastream. But the code can be split so as to distribute the workload among different threads, each of which can be handled by a different processor core. These threads are distributed based on the number of available processor cores. Of course, efficient processing is possible only because the cores within the multicore device are in close proximity and can support high-bandwidth, low-latency data transfers.
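As a rough illustration of data decomposition, the sketch below divides one data set into contiguous slices, one per worker, with the worker count defaulting to the number of cores the OS reports. The function names and the sum-of-squares workload are assumptions chosen for brevity, not anything taken from the benchmark suite.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(length, parts):
    """Divide [0, length) into `parts` contiguous, nearly equal (start, stop) ranges."""
    base, extra = divmod(length, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def sum_of_squares(samples, workers=0):
    """One algorithm, one data set, with the work divided among threads."""
    workers = workers or os.cpu_count() or 1  # default: one thread per core

    def partial(bounds):
        start, stop = bounds
        return sum(x * x for x in samples[start:stop])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each thread handles one slice; the partial results are then combined.
        return sum(pool.map(partial, split_ranges(len(samples), workers)))
```

Because every thread reads the same underlying data set, a decomposition like this leans on fast data sharing between cores far more than the independent-dataset case does.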

The Effects Of Concurrency

EEMBC developed several hundred workloads for its first suite of multicore benchmarks, which are referred to as MultiBench. The combined effect of using all of these workloads provides a very comprehensive view of multicore-processor behavior. Some results, which were derived by using data-throughput tests, can serve as examples of the methods previously described.

Take a look at how two different brands of quad-core processors perform on the benchmark suite. Both chips share the same x86 instruction-set architecture. (In other words, they’re both PC-compatible processors.) In addition, both have four cores within a single device and are connected to 4 GBytes of 667-MHz DDR2 memory subsystems. About the only differences are the processors themselves and the way they’re connected to memory, which is dictated by the processor vendor. For the most part, they are similar and competitive chips. Yet they exhibit different behaviors on the same set of benchmarks. On the charts, test results are shown for the SingleWorkerMark (SWM), MultiWorkerMark (MWM), and MultiItemMark (MIM) benchmarks. For now, the only significant difference is that the SWM test is single-threaded while the other two are multithreaded benchmarks.

Figure 3

Figure 3 illustrates how the processor from “Brand X” performs on three different MultiBench tests as workloads increase. Looking at the horizontal scale, one can see that the workload increases from one context (on the left of the chart) up to 20 contexts (on the right side of the chart). The vertical axis on this and subsequent charts has been scaled so that the performance of a single context is always 1.0. This makes it easier to see how performance does or does not scale with increasing workloads.
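The normalization used on the vertical axis amounts to dividing every measurement by the single-context result. A minimal sketch, assuming raw throughput is held in a dictionary keyed by context count (the function name is invented for this example):

```python
def normalize_to_baseline(throughput_by_contexts):
    """Scale raw throughput so that the single-context result is exactly 1.0."""
    baseline = throughput_by_contexts[1]
    if baseline <= 0:
        raise ValueError("single-context throughput must be positive")
    return {
        contexts: value / baseline
        for contexts, value in sorted(throughput_by_contexts.items())
    }
```

With made-up raw scores of 50, 95, and 140 for one, two, and four contexts, this yields 1.0, 1.9, and 2.8, which is the form in which the charts are read.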

The good news is that the performance throughput of the Brand X quad-core processor increases as the number of workloads increases. If it didn’t improve, there’d be no point in using a multicore processor in this application. The bad news is that it doesn’t increase linearly. The maximum performance with 20 contexts is just shy of 3X the baseline performance with one context. In other words, with four processor cores working on 20 tasks, overall performance throughput nearly triples (which many would consider a reasonable performance increase). But performance on the multithreaded MIM test is a bit disappointing, maxing out at less than 2X the baseline performance. At least performance doesn’t decrease with increasing workloads.

Figure 4

Figure 4 shows the results of the exact same tests conducted on a nearly identical system, but with a competing “Brand Y” processor. Unlike the previous example, there’s a pronounced “kink” in the results graph. This processor’s performance on the MWM benchmark increases linearly up to four contexts (one context per processor core). As more contexts are added, however, it actually declines. The processor’s performance on the SWM and MIM benchmarks is somewhat more intuitive. Performance gradually increases, but then plateaus at around 8-12 contexts.
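Given a normalized curve, the location of such a kink can be found mechanically. The helper names below are invented for illustration, and the sample numbers in the usage note are made up to mimic the shapes described here; they are not the measured results.

```python
def peak_context_count(speedup):
    """Return the context count with the highest speed-up (lowest count wins ties)."""
    return max(sorted(speedup), key=lambda contexts: speedup[contexts])


def degrades_after_peak(speedup, tolerance=0.0):
    """True if any larger context count falls below the peak by more than `tolerance`."""
    peak = peak_context_count(speedup)
    return any(
        speedup[contexts] < speedup[peak] - tolerance
        for contexts in speedup
        if contexts > peak
    )
```

For a hypothetical curve of {1: 1.0, 2: 2.0, 4: 4.0, 8: 3.2, 16: 2.9}, the peak is at four contexts and degrades_after_peak() reports a decline; for a curve that merely plateaus, it does not.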

Figure 5

Figure 5 shows the results for a dual-processor, quad-core setup. It comprises two processor chips, each with four cores, for a total of eight processor cores. As in the first test, the system has 4 GBytes of DDR2 memory. Here, however, the two processors are sharing it. In this particular case, all of the memory is local to one of the processors. The other processor accesses it through a shared link between the two chips. This aspect gives one of the processors a built-in advantage, although both processors can access all of the available memory.

Here, performance scales better than it did with the single-processor (four-core) system from Figure 3. Peak performance on SWM is about 3.75X the baseline, much improved from before. It’s certainly not anywhere near 8X performance, but it’s a substantial improvement nonetheless. The two multithreaded benchmarks (MWM and MIM) also show steady growth, even suggesting that they might have grown further had the workload been dialed up even more.
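One way to put these figures on a common footing is to divide the speed-up by the number of cores. This ratio, commonly called parallel efficiency, is a metric introduced here for illustration and is not part of the benchmark scoring described in this article. The inputs in the usage note are the approximate peaks quoted in the text.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved: speed-up divided by core count."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return speedup / cores
```

By this measure, a speed-up of about 3.75X on eight cores works out to roughly 47% of ideal, while a speed-up of roughly 3X on four cores is about 75%. The eight-core system delivers more throughput in absolute terms, but each core contributes less.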

Now The Work Begins

The results shown here merely scratch the surface of the amount of data that can be produced by MultiBench. Because MultiBench is an industry standard, most (if not all) vendors are using the same methodology to measure the performance of their multicore processors (at least the ones with SMP architectures). This article is careful to avoid indicating which processors were being used to generate this data. The purpose here was to demonstrate the variety of results and convince the reader that no single number is enough to determine a multicore processor’s performance levels.

As the multicore revolution progresses, the number of cores per chip is expected to roughly double with each processor generation. It will thus become even more important to understand multicore-processor behavior. The work doesn’t stop with this first generation of MultiBench. EEMBC is currently developing subsequent versions of benchmark suites that will help analyze heterogeneous processors (i.e., SoCs) as well as application-specific standard benchmarks (ASSBs) that will perform tests based on real-world scenarios. Hopefully, this effort will help to arm designers with the right tools so that they can make good decisions about their next multicore-processor projects.

Authors:

Shay Gal-On is EEMBC’s director of software engineering and leader of the EEMBC Technology Center. At EEMBC, Shay created the EnergyBench and MultiBench Standards for benchmarking.

Markus Levy is founder and president of EEMBC. He also is president of the Multicore Association and chairman of Multicore Expo. Levy was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded industry. He is a volunteer firefighter as well.