Intel to push MIC aggressively for Exascale computing

Yes, it is Larrabee reborn, but this time the focus is strictly on supercomputing and general technical computing acceleration.

Remember Larrabee, Intel's big-time attempt to get into the discrete GPU business? Well, it may have failed in that respect, but the core architecture was transplanted to another market segment that would welcome its approach: high performance computing. After all, the idea does make sense for a number of technical and scientific computing jobs: take a simplified (in this case, Pentium-like) core as a front end to a big, wide SIMD FP engine, multiply that by over 50 times on a single die with a high bandwidth interconnect between the cores, and add a high bandwidth external memory connection like that on GPUs. And, why not, top end gaming physics too…

Intel calls this family of products MIC (Many Integrated Core) chips, or the 'Knights' line. The first MIC to be offered to the discerning public, in limited quantity as a sort of pilot introduction, is 'Knights Corner', basically a GPU-like PCIe accelerator card with a 22 nm process MIC chip that integrates some 50 cores for roughly 1 TFLOPS of DP FP performance, or nearly 6 times that of the top Xeon E5 processor bin right now, within a similar power budget. That is a critical point if you want to get to, say, a Petaflop within 10 racks now, or Exaflop level performance within a single datacentre's size and power budget in 2018.
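To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. The core count, clock range and 1 TFLOPS figure come from this article; the 512-bit SIMD width and the fused multiply-add (counted as two operations per lane) are assumptions made for illustration, and the function names are hypothetical. It estimates theoretical peak only, not sustained performance.

```python
import math

def peak_dp_flops(cores, clock_hz, simd_bits=512, fma=True):
    """Theoretical peak double-precision FLOPS of a many-core SIMD chip.

    simd_bits and fma are assumptions, not figures from the article.
    """
    lanes = simd_bits // 64                      # 64-bit doubles per vector
    flops_per_cycle = lanes * (2 if fma else 1)  # an FMA counts as 2 flops per lane
    return cores * flops_per_cycle * clock_hz

def cards_needed(target_flops, flops_per_card):
    """Smallest number of accelerator cards whose combined peak reaches the target."""
    return math.ceil(target_flops / flops_per_card)

if __name__ == "__main__":
    # 50 cores at the 1 to 1.2 GHz range quoted in the caption
    for ghz in (1.0, 1.2):
        tflops = peak_dp_flops(50, ghz * 1e9) / 1e12
        print(f"{ghz:.1f} GHz: {tflops:.2f} TFLOPS peak DP")

    # Taking the article's round figure of 1 TFLOPS per card
    petaflop_cards = cards_needed(1e15, 1e12)
    print(f"Petaflop: {petaflop_cards} cards, {petaflop_cards // 10} per rack over 10 racks")
    print(f"Exaflop:  {cards_needed(1e18, 1e12)} cards at today's per-card rate")
```

Under these assumptions, 50 cores land just under the 1 TFLOPS mark at the top of the clock range, and a Petaflop in 10 racks means packing about 100 such cards into each rack, before counting the host CPUs at all.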


[Intel® Many Integrated Core Architecture (Intel® MIC Architecture) at 1-1.2 GHz]

 

The thing that differentiates MIC from ATI or Nvidia GPGPUs is that its front end is an X86 core, so the same programming model can apply to both the main CPU and the accelerator, rather than resorting to OpenCL or CUDA. On the other hand, the first MIC cores are based on a 64-bit enhanced version of the 16-year-old Pentium fronting a very wide SIMD FP unit, and the Pentium's dual-issue, in-order instruction approach limits the maximum achievable FP rates. Intel will surely fix that in the next round, but using the X86 as a front end, with all the associated baggage, remains a double-edged sword.

What's the only major problem, besides of course having to re-compile your application code to get the best performance? Well, every time you leave the on-board memory and go over the PCIe bus to retrieve data from the system memory, the performance will, just like on GPUs, drop dramatically, even by several times. Intel may solve a part of that by using a much faster, simpler sub-protocol for rapid data transfer over PCIe, since there's no QPI based MIC as of yet. In fact, Intel might decide never to make a QPI based one, if they succeed in allowing the future MIC chips to be, in a way, normal CPUs that can boot an OS (of course not Windows, I guess) on their own, without requiring a Xeon CPU front end. However, if Xeon E5 or E7 nodes are still kept as the front end, then linking a bunch of MIC chips on board via QPI to all of them as a sort of coprocessor makes much more sense from the system memory sharing point of view too.
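The PCIe penalty can be illustrated with a very simple model, sketched below in Python. It assumes that every byte has to cross the link before it can be used and that transfer and compute do not overlap, which is a pessimistic simplification. The 8 GB/s link rate and the flops-per-byte values are illustrative assumptions, not measured Knights Corner figures, and the function names are hypothetical.

```python
def effective_flops(peak_flops, link_bytes_per_s, flops_per_byte):
    """Sustained FLOPS when each byte is fetched over the link, with no overlap.

    flops_per_byte is the number of floating point operations performed
    for every byte pulled from system memory.
    """
    compute_s = flops_per_byte / peak_flops   # time to compute on one byte
    transfer_s = 1.0 / link_bytes_per_s       # time to move that byte over the link
    return flops_per_byte / (compute_s + transfer_s)

def slowdown(peak_flops, link_bytes_per_s, flops_per_byte):
    """How many times slower than peak the accelerator runs under this model."""
    return peak_flops / effective_flops(peak_flops, link_bytes_per_s, flops_per_byte)

if __name__ == "__main__":
    PEAK = 1e12   # roughly 1 TFLOPS DP, the article's figure
    LINK = 8e9    # assumed 8 GB/s PCIe link, for illustration only
    for intensity in (10, 100, 1000):
        factor = slowdown(PEAK, LINK, intensity)
        print(f"{intensity:>5} flops/byte: {factor:.2f}x slower than peak")
```

In this model the slowdown works out to 1 + peak / (link rate × flops per byte), so a job doing 100 operations per transferred byte already runs at less than half of peak, while only very compute-heavy kernels get close to the headline number.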

Now, the newest Nvidia and AMD GPUs, the top end versions of Kepler and Southern Islands, can actually boot a custom OS and run as standalone CPUs of a sort if need be. Of course, their highly custom instruction set architectures would require a custom OS environment, but that can also change. Then, the MIPS, Alpha and SPARC derivatives among the Chinese CPUs are all expected to have very wide, high core count SIMD FP versions with multi-teraflop performance per chip by 2014 too. Even if they are still a process generation behind by that time, their much simpler, more efficient RISC front ends would likely make their overall performance quite competitive.

In summary, Intel MIC is a flagship representative of the new family of devices that evolved GPU Compute into Supercompute, with higher performance per socket leading to fewer sockets required, especially as we hit the physical scaling limits of aggregating tens of thousands of multi-core CPUs to reach multi-petaflop and exaflop computing levels. The immense per-chip double-precision FP performance and the easier programmability due to the common X86 front end help too. However, several other vendors, including the GPU and CPU makers, will have very competitive solutions of their own.