Benchmarking Sea Change – Fusion and more
With the advent of asymmetric processors – most famously AMD's A-series APUs – how accurate are today's benchmarks as a measure of performance?
For years, it was reasonably simple to classify benchmarking software. On one side, we had synthetic benchmarks, which would typically focus on a single set of features, such as a CPU integer processing index, memory bandwidth, or memory latency. These benchmarks produced specific numbers that gave us the raw performance of the device from a given angle.
For instance, an Intel Nehalem CPU (with HyperThreading disabled for an apples-to-apples comparison) might be no faster than an identically clocked Core 2 Quad in a core-for-core integer test. Likewise, the Nehalem's last-level cache would have roughly three times the latency of its predecessor's, owing to the move to an L3 cache. Conversely, the Nehalem's memory bandwidth would be anywhere from two to four times higher thanks to its integrated memory controller. Also, some benchmarks would use only a single core, while others would scale across all of them.
On the other hand, there were more comprehensive benchmarks, such as 3DMark, which depended on other factors in addition to the CPU. The results of these benchmarks could vary to a greater extent.
However, the arrival of asymmetric CPU implementations on the market, combined with the increasing use of GPU computing, has changed the picture and made synthetic benchmarks more interesting. We can now run the same OpenCL (or comparable compute) code stream on the CPU cores only, the GPU cores only, or both in parallel – and even add an external GPU if one is attached. A great example is, of course, AMD's new A-series Fusion APU ('Llano').
Let us use a recent synthetic benchmark as an example: the just-released Sandra 2011 SP4a. If you start Sandra's GPGPU processing, memory or cryptography test on a Llano APU, you'll notice at least three benchmark options: internal APU GPU only, APU GPU + CPU, and CPU only. Add a discrete GPU and you get three more: external GPU only, external GPU + APU, and, of course, everything together. If the integrated GPU on Intel's upcoming Ivy Bridge processors supports OpenCL, you'll have these options there as well, adding another point of contention between AMD and Intel.
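How those six device selections relate to one another can be sketched with a toy throughput model. Everything below is an illustrative assumption – the device names and ops/sec figures are invented for the example, not Sandra's actual scoring – but it captures the basic idea that a well-partitioned workload lets the rates of the participating devices add up:

```python
# Toy model: throughput when one benchmark workload is split across
# different device sets. All figures are hypothetical, not measured.

# Assumed raw throughput of each device, in arbitrary ops/sec.
RATES = {"cpu": 40.0, "apu_gpu": 60.0, "discrete_gpu": 120.0}

def combined_rate(devices, efficiency=1.0):
    """Estimate throughput of a workload split across `devices`.

    A perfectly partitioned workload sums the per-device rates;
    efficiency < 1.0 models scheduling and data-transfer overhead.
    """
    return efficiency * sum(RATES[d] for d in devices)

# The six options a Sandra-style GPGPU test exposes on a Llano
# system with a discrete card attached:
options = {
    "CPU only":           ["cpu"],
    "APU GPU only":       ["apu_gpu"],
    "APU GPU + CPU":      ["apu_gpu", "cpu"],
    "Discrete GPU only":  ["discrete_gpu"],
    "Discrete + APU GPU": ["discrete_gpu", "apu_gpu"],
    "Everything":         ["discrete_gpu", "apu_gpu", "cpu"],
}

for name, devs in options.items():
    print(f"{name:20s} {combined_rate(devs, efficiency=0.9):6.1f} ops/s")
```

The interesting consequence for reviewers is that none of the six numbers is "the" score of the chip: each answers a different question about the platform.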
How, then, can we categorise a CPU as faster or slower than the competition? On balance, CPU core performance is currently still somewhat more important, but GPU performance is steadily gaining importance as more applications become able to offload specific workloads to the GPU. Also bear in mind that the integrated GPU has to share memory bandwidth – and, in the case of Sandy Bridge, last-level cache – with the CPU cores. More intensive GPU usage might therefore drag down simultaneous CPU performance in certain indices.
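That contention effect is easy to model in miniature. The sketch below assumes a fixed memory-bandwidth budget shared by the CPU cores and the integrated GPU, split in proportion to demand when oversubscribed; the 21 GB/s budget and the demand figures are illustrative assumptions, not measurements of any real chip:

```python
# Toy model of memory-bandwidth contention between CPU cores and an
# integrated GPU behind one shared memory controller.

TOTAL_BW = 21.0  # GB/s budget, e.g. dual-channel DDR3 (assumed figure)

def effective_cpu_bw(cpu_demand, gpu_demand):
    """Bandwidth the CPU actually receives when both clients are active.

    If total demand fits within the budget, everyone is satisfied;
    otherwise the budget is shared in proportion to demand.
    """
    total = cpu_demand + gpu_demand
    if total <= TOTAL_BW:
        return cpu_demand
    return TOTAL_BW * cpu_demand / total

# CPU memory test running alone: full demand satisfied.
print(effective_cpu_bw(12.0, 0.0))   # 12.0 GB/s
# Same test with a GPGPU workload running simultaneously: the
# CPU-side score drops even though the cores themselves are idle-free.
print(effective_cpu_bw(12.0, 16.0))  # 9.0 GB/s
```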
So, what are we likely to see in the next generation of synthetic – and other – benchmarks? Firstly, more variables: it will be increasingly hard to point to one platform as decisively 'better' or 'faster' than another. Just as in the aforementioned Sandra benchmark, there will have to be separate CPU, GPU and memory tests, as well as a combined index. This combined index will use all available components (CPU, integrated GPU, discrete GPU) to complete the benchmark, rather than being a composite of the sub-tests. Such an approach benefits AMD at the moment thanks to Llano's superior integrated GPU. It also validates AMD's Fusion approach of integrating the CPU and GPU into one logical processing entity from the programmer's point of view.
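The distinction between a composite and a combined index matters more than it might seem, because the two can rank platforms differently. The sketch below uses entirely hypothetical scores and weights: a CPU-weighted composite (reflecting the traditional view that cores matter most) favours the strong-CPU platform, while a combined run in which CPU and GPU throughputs add favours the Llano-style strong-IGP platform:

```python
# Contrast: composite index (weighted blend of separately measured
# sub-tests) vs combined index (one run using CPU and GPU together,
# so their rates add). All scores and weights are hypothetical.

def composite_index(cpu, gpu, cpu_weight=0.8):
    # Traditional approach: weight the sub-test scores, here
    # favouring the CPU as most suites of the era did.
    return cpu_weight * cpu + (1.0 - cpu_weight) * gpu

def combined_index(cpu, gpu, efficiency=0.9):
    # Fusion-style approach: the benchmark itself runs on both
    # devices in parallel, so throughputs add, minus some
    # coordination overhead.
    return efficiency * (cpu + gpu)

strong_cpu = (100.0, 20.0)  # fast cores, weak integrated GPU
strong_gpu = (60.0, 70.0)   # slower cores, Llano-style strong IGP

print(composite_index(*strong_cpu), composite_index(*strong_gpu))  # 84.0 62.0
print(combined_index(*strong_cpu), combined_index(*strong_gpu))    # 108.0 117.0
```

Note how the ranking flips between the two methods: that is exactly why a combined index flatters a design with a powerful integrated GPU.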
Other benchmarks will feel the influence too. Remember 3DMark's CPU scores? The use of PhysX hardware acceleration gave Nvidia cards a huge leg up in the CPU test – but cost 3DMark a fair amount of legitimacy in the process. What about now, when an OpenCL compute routine can run on a combination of CPU and GPU? How can you truly separate out pure CPU performance? Our upcoming Llano platform comparative test will delve deeper into this matter.