Intel Xeon Phi (B0 Stepping): The Knight in Shining Armor?
Several weeks ago, Intel officially announced the first products based on the 22nm Knights Corner architecture. Known as Xeon Phi, the reborn Larrabee is coming to fight Nvidia Tesla, AMD FireStream, FPGAs and Chinese CPU vendors.
Last week, Intel started sampling the B0 silicon of Xeon Phi to its preferred partners. There are five different SKUs floating around, and this article will reveal the details and intricacies surrounding Intel's attack on GPGPU and heterogeneous computing.
To every Intel insider, the codenames P54C, Camino, Coppermine, Tejas and Larrabee carry special weight. In Intel's 44-year history, these products brought major disgrace to an engineering ethos that always operates by the code of "manufacturable excellence" (even though many bad mouths claim the beancounters calculate the die sizes and feature sets). For the purpose of this story, we'll concentrate on what rose from Larrabee's ashes: the projects codenamed Aubrey Isle, Knights Corner and Knights Landing (not to be confused with King's Landing).
After the cancellation of Larrabee as a discrete graphics project, Intel took the Aubrey Isle silicon (Larrabee = architecture, Aubrey Isle = ASIC) and started sampling it to select partners. The board which hosted Aubrey Isle was codenamed Knights Ferry, and the project was officially announced as MIC (Many Integrated Core) in March 2010, with limited sampling following immediately after the announcement. The Aubrey Isle chip was manufactured on a 45nm process and consisted of 32 cores with 1MB of L1 and 8MB of L2 cache, operating at 1.2GHz. 2GB of GDDR5 memory was connected to the chip using an ultra-wide 1024-bit ringbus architecture (remember ATI R600? It utilized a 512-bit ringbus memory controller). To power the board, you needed to provide 300 Watts.
Overall, the Knights Ferry board delivered 750 GFLOPS in single precision, while double precision figures varied between 41% and 47% of that (effectively sub-400 GFLOPS). It is unknown how many boards Intel delivered to its partners, but it was pretty hard to gain access to one. The author of these lines had limited experience with one Knights Ferry board, and the state of the developer tools was nothing to write home about.
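For readers who want to see where "sub-400 GFLOPS" comes from, here is a quick back-of-the-envelope sketch. It only multiplies the figures quoted above; the function and variable names are ours and purely illustrative.

```python
# Figures quoted in the article for the Knights Ferry board.
SP_GFLOPS = 750.0
DP_RATIO_RANGE = (0.41, 0.47)  # double precision as a fraction of single precision


def dp_range(sp_gflops, ratios):
    """Return the (low, high) double precision throughput in GFLOPS."""
    return tuple(sp_gflops * r for r in ratios)


low, high = dp_range(SP_GFLOPS, DP_RATIO_RANGE)
print(f"Estimated DP throughput: {low:.1f} to {high:.1f} GFLOPS")
```

Both ends of the range land comfortably below 400 GFLOPS, which is all the original claim needs.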
The situation changed in early 2011, with the arrival of another knight in shining armor. Knights Corner is a completely new chip manufactured on the 22nm process node, using Tri-Gate transistors similar to those in today's Ivy Bridge processors. The chip carries an undisclosed number of processing cores, with different BIOS and board revisions enabling different numbers of cores and different amounts of cache and onboard memory.
For example, A0 silicon boards featured 48, 52 or 60 cores, with 1.5 to 1.9MB of L1 and 24-30MB of L2 cache. Onboard memory size went from 2GB to either 4GB or 8GB of GDDR5, clocked at 2.4 to 4.5 billion transfers per second (600-1125 MHz in QDR mode). The frequency of the Knights Corner A0 chip went down to 1GHz, though. This rough 22nm design consumed up to 300 Watts of power, but the computational power went down as the company reorganized the way the chip works. In essence, Knights Corner represents a return to the processing mantra inside Intel, ditching the overly complex ringbus architecture for a more direct approach to each and every processing core.
Back during the initial reveal fanfare, Intel used 60C A0 silicon to show 1 TFLOPS in single precision. However, the new architecture puts a lot of emphasis on double precision calculations, and the parts offered different figures for double precision. Still, the parts lagged behind Nvidia Tesla M2090 boards by all accounts. Nvidia's announcement of Tesla K10 and K20 did not change the attitude inside Intel, as the chip giant remains bullish about the prospects it can offer. Intel wants to show that Knights Corner can take on the Tesla K20 (GK110 GPU, 2304 cores, 384-bit memory interface, 12GB GDDR5). However, with the dual-GPU Tesla K10 offering 4.58 TFLOPS in single precision (but a mediocre 190 GFLOPS DP) and K20 targeting 3 TFLOPS single and 1.5 TFLOPS double precision, it is clear that Intel has to play catch-up, even though the company manufactures its Knights Corner ASIC on a leading manufacturing node (Intel 22nm Tri-Gate versus TSMC 28nm planar).
Our sources were adamant that while they were on the right track, the parts were not ready for prime time, and that there was still much to improve in the B0 and C1 silicon revisions of Knights Corner. However, with commercial shipments starting sooner rather than later (target = lead the Top 500 list in June 2013), a Chelsea miracle had to happen.
Enter the B0 ASIC. This part represents a much bigger step in the evolution of the design and introduces several "must have" features from contemporary GPUs, or should we say, "must have" features that desktop and notebook CPUs have had for quite some time.
Knights Corner B0 comes in several flavors, with 57, 60 and 61 cores being the most common configurations. Yes, the company unlocked an odd number of cores, compared to the even numbers in Larrabee and Aubrey Isle. The change in the number of processing cores changed the L1 and L2 cache totals, and we now have 1.8-1.9MB of L1 and 28-30.5MB of L2 cache. Onboard memory now varies greatly between the parts, with the available flavors being 3GB, 6GB and 8GB of GDDR5.
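The cache totals track the core count closely, which suggests a fixed slice of cache per core. The sketch below tests that idea under an assumption that is ours and not confirmed by our sources: 32KB of L1 and 512KB of L2 per core. The constants and the function name are illustrative only.

```python
# Assumed per-core cache sizes (not confirmed by the article's sources).
L1_KB_PER_CORE = 32
L2_KB_PER_CORE = 512


def cache_totals_mb(cores):
    """Return (total L1, total L2) in MB for a given number of enabled cores."""
    l1_mb = cores * L1_KB_PER_CORE / 1024
    l2_mb = cores * L2_KB_PER_CORE / 1024
    return l1_mb, l2_mb


# Core counts quoted for A0 (48, 52, 60) and B0 (57, 60, 61) silicon.
for cores in (48, 52, 57, 60, 61):
    l1_mb, l2_mb = cache_totals_mb(cores)
    print(f"{cores} cores: {l1_mb:.2f} MB L1, {l2_mb:.1f} MB L2")
```

Under that assumption, 57 to 61 cores give roughly 1.8 to 1.9MB of L1 and 28.5 to 30.5MB of L2, while 48 to 60 cores give 1.5 to 1.9MB and 24 to 30MB, in line with the B0 and A0 figures quoted in this article.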
Chips now have a set minimum clock speed of 600 MHz (57C/3GB and 57C/6GB) or 630+ MHz (the 60C/8GB and the two 61C/8GB cards). The memory clock is now set at a more realistic 5-5.5 billion transfers per second (1.25-1.375 GHz QDR), enabling memory bandwidth figures exceeding 300 GB/s. The top clock is now set at 1.1GHz for the 57C/3GB and 57C/6GB parts, while the rest work at either 1.05 or 1.09GHz. The big news is the introduction of Turbo mode, but we weren't able to sniff out the exact clock achievable in Turbo mode with a small number of active cores.
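To show how the transfer rate maps to a QDR clock and to bandwidth, here is a minimal sketch. The memory bus width is not given anywhere in this article, so the 512-bit figure below is purely our assumption; the helper names are illustrative as well.

```python
# Assumed memory bus width in bits (not stated in the article).
BUS_WIDTH_BITS = 512


def qdr_clock_ghz(gt_per_s):
    """Quad data rate moves four transfers per clock, so clock = rate / 4."""
    return gt_per_s / 4


def bandwidth_gb_s(gt_per_s, bus_width_bits=BUS_WIDTH_BITS):
    """Peak bandwidth in GB/s = transfers per second * bytes per transfer."""
    return gt_per_s * bus_width_bits / 8


for rate in (5.0, 5.5):
    print(f"{rate} GT/s -> {qdr_clock_ghz(rate):.3f} GHz QDR, "
          f"{bandwidth_gb_s(rate):.0f} GB/s on a {BUS_WIDTH_BITS}-bit bus")
```

With a 512-bit bus, both memory speeds clear the 300 GB/s mark. A narrower bus would not: 384 bits at 5.5 GT/s comes out at 264 GB/s, which is why we picked 512 bits for the illustration.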
TDP, i.e. Thermal Design Power, also underwent significant changes. Intel now offers two 245W boards (57C/3GB and 60C/6GB) and three 300W designs (57C/6GB and two 61C/8GB), and the boards are available in two configurations: with or without a passive heatsink. An active heatsink (the old Larrabee design) is only available with the 57C/6GB/300W version, while the top-range configurations (G5756-200 and G5757-200) come either without any thermal solution (probably for testing liquid cooling and 3rd party designs, such as the liquid cooling design from CoolIT Systems, which dates back to 2009) or with a passive heatsink. The passive designs are all meant for rack servers, where a large amount of air is funneled through the thick and heavy dual-slot heatsink. Personally, I had one of the cards in my hands, and trust me, they are not light by any standard.
All B0 boards carry the revision ES2 and feature a lot of changes from A0 silicon. Allegedly, a wonder BIOS managed to fix a lot of things, and our sources are confident they can take the battle to the more power efficient parts from Nvidia and AMD. Intel's internal desire is to reach 1 TFLOPS DP and 2 TFLOPS SP. According to the numbers above, we don't believe it is an impossible task. After all, if you invest billions, sooner or later you'll hit the jackpot. Or at least hope for it.
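Here is a rough sketch of why the 1 TFLOPS DP target looks reachable. The per-core throughput is our assumption, not something our sources confirmed: 16 double precision operations per cycle per core (a 512-bit vector unit holding 8 DP values, counted twice for fused multiply-add), and double that in single precision. The article does not say which of the non-57C parts runs at 1.05 or 1.09GHz, so those pairings are hypothetical too.

```python
# Assumed per-core throughput (operations per clock cycle); not from the article.
DP_FLOPS_PER_CYCLE = 16  # 8 DP lanes in a 512-bit vector, x2 for fused multiply-add
SP_FLOPS_PER_CYCLE = 32  # 16 SP lanes, x2 for fused multiply-add


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores * clock (GHz) * operations per cycle, in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle


# (cores, top clock in GHz); only the 57-core / 1.1GHz pairing is from the article.
configs = [(57, 1.10), (60, 1.05), (61, 1.09)]

for cores, clock in configs:
    dp = peak_gflops(cores, clock, DP_FLOPS_PER_CYCLE)
    sp = peak_gflops(cores, clock, SP_FLOPS_PER_CYCLE)
    print(f"{cores}C @ {clock:.2f} GHz: {dp:.0f} GFLOPS DP, {sp:.0f} GFLOPS SP")
```

Under these assumptions every configuration lands just above 1 TFLOPS DP and 2 TFLOPS SP at its top clock. These are theoretical peaks, not measured results, and sustained numbers will be lower.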
We will continue our coverage of Knights Corner, as well as the second generation architecture codenamed Knights Landing (or is it fifth… if we count all Larrabee generations).