Analysis: Intel wins the expected #1 supercomputer spot, right in the middle of the Middle Kingdom
We told you all the juicy details about it months ago, down to the CPUs and accelerators used and the expected total size when finished, but we didn’t tell you it would be the world’s No 1 even at the pilot stage. But there are ifs and buts here…
The supercomputing world has a new record holder: “Tianhe 2”, a 54 PFLOPs Rpeak / 30 PFLOPs Rmax Intel-only machine destined for the brand new Guangzhou Supercomputing Center, but integrated and tested at its birthplace, the National University of Defense Technology (NUDT) in Changsha, right in the middle of the Middle Kingdom. This is actually the first stage of the planned 100+ PFLOPs machine which we have described here many times, but even the pilot configuration is by far the world’s No 1 at present, a fact the Changsha / Guangzhou team used to good advantage in its global marketing, something not yet often seen from Chinese academic establishments.
With some 32,000 12-core Ivy Bridge Xeon E5-2600 v2 CPUs and 48,000 Xeon Phi PCIe accelerator cards, in a dual-CPU, triple-Phi per node configuration, this is an astonishing machine. By the way, it’s one of the rare dual Xeon E5 configurations where 64 main PCIe v3 lanes are all used by default: 48 for the three Phis, and 16 for NUDT’s own high-end dual-rail interconnect, whose previous version performed very well in Tianhe 1.
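For readers who like to check the arithmetic, here is a minimal sketch of the node count and PCIe lane budget implied by those figures. It assumes each Phi card and the interconnect adapter occupy one x16 link apiece, which is our reading of the 48 + 16 split rather than anything NUDT published; the variable names are ours.

```python
# Figures quoted in the article.
cpus = 32_000
phis = 48_000
cpus_per_node = 2
phis_per_node = 3

# Both component counts should imply the same number of nodes.
nodes_from_cpus = cpus // cpus_per_node
nodes_from_phis = phis // phis_per_node

# Assumption: every Phi card and the interconnect adapter use a x16 link.
lanes_per_x16 = 16
phi_lanes = phis_per_node * lanes_per_x16
interconnect_lanes = lanes_per_x16
total_lanes = phi_lanes + interconnect_lanes

print(f"Nodes implied by CPU count: {nodes_from_cpus}")
print(f"Nodes implied by Phi count: {nodes_from_phis}")
print(f"PCIe v3 lanes per node: {phi_lanes} (Phis) + {interconnect_lanes} (interconnect) = {total_lanes}")
```

Both counts work out to the same number of nodes, so the quoted CPU and Phi totals are at least consistent with each other.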
All fine and dandy, but sources on the ground at the Changsha NUDT conference announcing the giant machine told us that not everything is perfect. Firstly, the very high Linpack efficiency usually seen in Intel clusters is missing here: 30/54, or roughly 56 percent Rmax to Rpeak, isn’t exactly great.
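The efficiency figure is simply Rmax divided by Rpeak. A quick sketch using the rounded numbers quoted in this article (not the official list entries):

```python
# Rounded figures as quoted in the article, in PFLOPs.
rpeak_pflops = 54.0  # theoretical peak
rmax_pflops = 30.0   # measured Linpack result

# Linpack efficiency is the fraction of theoretical peak actually sustained.
efficiency = rmax_pflops / rpeak_pflops

print(f"Linpack efficiency: {efficiency:.1%}")
```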
Of course, the Xeon Phi, still early in its computational life, is one key reason, but there is more: the main workhorses, the 12-core Xeon v2 CPUs, seem to be feeling memory system bottlenecks more than their socket-compatible 8-core Sandy Bridge ‘v1’ predecessors. Well, if basically the same memory system has to feed 50 percent more cores, even with the L3 cache also grown by half to 30 MB per socket, further optimizations are needed there. If I were Intel, I would use this time to validate DDR3-2133 server memory in addition to DDR3-1866, and encourage DIMM vendors to offer low-latency modules for these CPUs to minimize the bottlenecks in memory-intensive HPC and Big Data jobs.
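To put rough numbers on that squeeze, here is a back-of-the-envelope sketch. It assumes the memory system is truly unchanged between the two generations, and it infers the predecessor’s 20 MB L3 from the “increased by half to 30 MB” statement; neither figure comes from measurements.

```python
from fractions import Fraction

# Core counts from the article; old L3 size inferred from "grown by half to 30 MB".
old_cores, new_cores = 8, 12
old_l3_mb, new_l3_mb = 20, 30

# Growth in the number of cores sharing one socket's memory system.
core_growth = Fraction(new_cores, old_cores)

# Assumption: total memory bandwidth per socket is unchanged, so each core's
# share shrinks in inverse proportion to the core count.
bandwidth_share = Fraction(old_cores, new_cores)

# L3 cache per core, before and after.
l3_per_core_old = Fraction(old_l3_mb, old_cores)
l3_per_core_new = Fraction(new_l3_mb, new_cores)

print(f"Core count grows by a factor of {float(core_growth):.2f}")
print(f"Per-core share of memory bandwidth falls to {float(bandwidth_share):.0%} of the old value")
print(f"L3 per core: {float(l3_per_core_old)} MB before, {float(l3_per_core_new)} MB after")
```

In other words, the larger cache only keeps the per-core L3 constant, while each core’s slice of memory bandwidth drops by a third under this assumption.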
So yeah, the performance was a bit below expectations, according to internal sources, but that’s nothing that a bit of system-level tweaking to feed the CPUs, and software fixes for the Phis, couldn’t improve soon.
But there was something else, far more intriguing: power consumption. During the Linpack run, the air conditioning at Changsha NUDT had to be turned off, and yes, that means for the whole large campus. Mind you, this is a very high-end military facility, and a huge one at that, surely not without a good power supply system. But according to the team, the power requirements rose far enough above the original estimates that this ‘safety measure’ had to be implemented, in the hot summer, to ensure all went well.
Now, this is not a secret, as many conference attendees heard it too, on the sidelines. Nor is it something that would automatically open the door to AMD and Nvidia, since, until mid-2014 at least, they don’t have a matching ‘top CPU plus top accelerator’ combo ‘meal deal’ to set against Intel’s. And by then, the Haswell EP Xeon E5 v3 and the 14nm ‘Knights Landing’ Phi will be there too, each on its own or as a combo.
But local CPUs like the Loongson MIPS and the Shenwei Alpha offer ever better power efficiency in HPC, matched by respective, often not public, improvements in compilers and software suites. (The Shenwei has moved forward a generation since our last coverage, and is so successful with internal government deals that it refuses any outside marketing.) Things will get interesting, as these chips can, and will, be an option too. China will have at least four systems of 100 PFLOPs or faster by 2015. And, since even SMIC in Shanghai can now easily fab you a 28 nm part, the process lag is disappearing too… So, will this Tianhe 2 be completed to 100 PFLOPs in the present configuration, or will something brand new of the 100 PFLOPs class supplant it on the banks of the Zhujiang river in ancient Canton?