IDF SF 2012: Haswell EP/EN (Grantley) in 2014, now with moar cores

The mainstream "there's simply no need for more" quad-core Haswell parts should bring further power savings, but not much extra performance unless you recompile the software to take advantage of the new ISA. The high end parts, on the other hand, will take a hulk smash approach.

Intel's mainstream processors have been stuck at four cores, as two dual-core dies per chip since Core 2 Quad in 2006, or as a single quad-core die in Nehalem LGA1156 parts in 2008, and that's not about to change at least until the first 14 nm Broadwell chips arrive sometime in 2014. As most PC apps aren't that well multithreaded yet, it makes sense to stick with this approach and simply save power, or use the extra die area for things like a better GPU (always sorely needed for Intel) or, hopefully, more cache. However, there was lots of speculation about how far Intel will go in cramming more cores per die on its high end CPU parts, where there's both a need for strong per-core performance and a multi-core benefit from heavily threaded workstation and server apps, including supercomputing, where a single task could run up to a million parallel threads in extreme situations.

Well, the ever-friendly IDF sources, exclusive to the VR-Zone team, cleared up some of those doubts for our readers. You already know that, at the very top of the enterprise range, the 2.4 GHz 10-core 'Westmere EX' Xeon E7 4800 will be replaced by mid next year with the 2.2+ GHz 15-core 'Ivy Bridge EX' Xeon E7 4800 v2, on a brand new Socket 2011 with a different pinout from the one you knew before. That rocket will get a socket compatible upgrade in the 16-20 core 'Haswell EX' Xeon E7 4800 / 8800 v3 in mid 2014, followed by 'Broadwell EX' Xeon E7 4800 / 8800 v4 (yes, you guessed it) in 2015. The last three share a triple-QPI interconnect design, compared to quad-QPI in the Westmere EX. That's a pity, as the extra QPI link could help attach coprocessors, like a future version of Xeon Phi or an FPGA, for instance, and let them share system memory at full speed.
 
The more interesting part is, of course, the dual socket one, as it shares a lot (socket and chipset, among others) with what's usually Intel's top desktop HEDT platform. Despite earlier doubts and conflicting rumours, it now looks like the current 8-core Sandy Bridge EP Xeon E5 2600 / 4600, to be followed by mid next year by the socket compatible 10-core 3.2+ GHz Ivy Bridge EP Xeon E5 2600 / 4600 v2, will have a 2014 replacement, this time once again with yet another pinout within the same physical Socket 2011 dimensions. The Haswell EP Xeon E5 2600 / 4600 v3, which will likely have a uni-CPU desktop flavour too, is expected to be a 14-core processor with 4-channel DDR4-2133 replacing the DDR3 of Ivy Bridge EP, and dual QPI links at 9.6 GT/s, a bit faster than now. So OK, a bit of extra memory bandwidth, but still way below the DDR4-3200 speeds that DDR4 server RAM is scheduled to reach. But why so many cores, when the EX parts are supposed to handle that kind of count?
 
There could be two key reasons. First, as mentioned yesterday, Haswell isn't a major jump over Ivy Bridge in per-core performance, with an average clock for clock speedup of only 10% per core unless you recompile the stuff for its new instructions, and then only if your algorithms have a use for those new instructions, like FMA or the 256-bit parallel integer handling in AVX2, among others. If there's no major (or any) clock speed improvement, given the power saving design focus of Haswell as an Ultrabook-optimised core, where would the extra performance to justify the sales come from?
 
Second, the power savings in Haswell let you cram more cores on the die while keeping the TDP. So, the 145W server and 160W workstation TDP limits of Haswell EP could accommodate 14 cores this time, yet the per-core L3 cache capacity would stay the same, at 2.5 MB per core. Frankly, within that TDP, I'd rather see a side derivative with just 8 cores, but with 4 MB of L3 per core and those cores allowed to go higher in GHz, and I think many workstation and quite a few HPC users would agree there, as per-thread performance hasn't gained much over the past 5 Intel CPU generations. Nevertheless, even the 14-core part shouldn't clock lower than its Ivy Bridge EP predecessors in the same TDP range, meaning 3.2 GHz at least for the top workstation part at launch.
 
At that speed, we're talking some three quarters of a teraflop of double precision theoretical peak per socket, or about 1.5 TFLOPS for a dual socket workstation in mid 2014, backed by 8 channels of DDR4 memory bandwidth. That's a serious 'show cause' for Nvidia Maxwell GPUs at that time to justify being plugged in, if the CPUs perform that well themselves without any messy code recompilations. And all that focus on finely threaded apps for GPUs will also help all these extra cores crammed on the Haswell EP die find a use, too. Funny thing is, that same year may see more than one party having really lovely multiteraflop general purpose CPUs out there… Opteron APU, maybe, anyone?
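For readers who want to check the back-of-envelope numbers, here is a rough sketch of the peak arithmetic. The core count, the 3.2 GHz clock and the 4-channel DDR4-2133 setup come from the figures above. The 16 double precision FLOPs per cycle per core (two 256-bit FMA units, 4 doubles each, 2 operations per FMA) and the 8 bytes per transfer per memory channel are assumptions brought in for illustration, and the function names are made up. These are theoretical peaks, not measured performance.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


def mem_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical memory bandwidth in GB/s for 64-bit (8-byte) channels."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Figures from the article: 14 cores, 3.2 GHz, 4 channels of DDR4-2133 per socket.
# Assumption (not from the article): 16 DP FLOPs per cycle per core with FMA.
CORES = 14
CLOCK_GHZ = 3.2
DP_FLOPS_PER_CYCLE = 16
CHANNELS = 4
DDR4_MTS = 2133

per_socket_gflops = peak_gflops(CORES, CLOCK_GHZ, DP_FLOPS_PER_CYCLE)
per_socket_gbs = mem_bandwidth_gbs(CHANNELS, DDR4_MTS)

print(f"Peak per socket:  {per_socket_gflops:.1f} GFLOPS")
print(f"Peak dual socket: {2 * per_socket_gflops:.1f} GFLOPS")
print(f"Memory bandwidth per socket:  {per_socket_gbs:.1f} GB/s")
print(f"Memory bandwidth dual socket: {2 * per_socket_gbs:.1f} GB/s")
```

Under those assumptions the result is roughly 717 GFLOPS per socket and a bit over 1.4 TFLOPS for two, which is where the "three quarters of a teraflop" and "about 1.5 TFLOPS" figures land. Note that the peak assumes code that actually uses FMA on every cycle.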