AMD Pushes Steamroller and Excavator Forward, Bullish about Performance Increases
Recently, there have been a lot of rumors that AMD reorganized its roadmap and that some late-2012 and regular 2013 parts were cancelled. Before going any further, we can now say that some of those rumors are true.
According to our confidential sources, the roadmap cancellations were significantly affected by the departures of engineers at critical moments of project development. The roadmap shakeup affected, for better or worse, processors with codenames such as 'Kaveri', 'Kabini', 'Abu Dhabi' and 'Seoul'.
However, projects such as Steamroller (also known as the 3rd Gen Bulldozer, or K13) and the Excavator core (4th Gen Bulldozer, or K14) are well underway, and the new engineering leadership, with people such as Jim Keller (our sources from AMD/Intel/NVIDIA claim he's arguably the best CPU architect of all time) and John Gustafson (author of Gustafson's Law on parallel computing), started to make changes from their first day of employment.
There seems to be a new buzz in the company, which can be described as a fightback or rather, getting "back on track", given that AMD hasn't had a winning CPU architecture since 2006 and did not capitalize on its GPU performance leadership when it had it. According to our sources, things are thoroughly changing within the company, even though there's plenty of work ahead, especially in the Investor Relations, PR and marketing departments.
At the recently held Hot Chips 2012 conference, AMD's Mark Papermaster gave a keynote talk. During the talk, he disclosed details about upcoming architectures such as Steamroller and Jaguar, the next-generation cores from AMD. Mark also went into greater detail about the 'FF' interconnect technology, which does not mean Four-Four (Ferrari) or French Fries but rather Freedom Fabric.
Can AMD's Steamroller Fix the Mistakes of Bulldozer and Piledriver?
The history of AMD's Bulldozer architecture is a painful one. The old management led by Hector Jesus Ruiz and Dirk Meyer deliberately delayed the arrival of K10 "Barcelona" and was overtaken by Intel's Core architecture. The mad rush to the original Bulldozer caused a lot of sacrifices in design and ultimately led AMD's engineering team to cancel the architecture in 2008. Instead of creating a master-core, AMD's engineers envisioned a multi-step approach to increase the performance of its architecture. Bulldozer was the first-generation part, introducing a flexible Floating Point Unit, support for AVX, XOP, FMA4 and so on. Piledriver was envisioned as the performance improvement, followed by Steamroller, which would bring "greater parallelism"; in fact, it would fix the shortcomings of the original Bulldozer. Then we have Excavator, the 4th-generation design, which should increase performance and fight off Haswell-EP/EX in 2014.
Given that AMD had restricted resources, the company could not address the issues in both the memory controller and the execution cores at once, so Steamroller features significant improvements to the execution cores, while the memory controller is heading for a revamp with the 2014 architecture dubbed Excavator. 2014 is also the year when we are going to see the complete fusion between the GPU and CPU architectures, at least as far as AMD is concerned.
We haven't included the upcoming Piledriver core (Trinity APU, Vishera CPU) in this comparison because internally, Bulldozer and Piledriver are very similar. The major changes are added support for new instructions (FMA4 is being joined by AVX 1.1, AES, F16C etc.), improved IPC, reduced leakage through optimized transistor design, and a boost in frequency with reduced power consumption.
Comparing Steamroller to Bulldozer makes much more sense, since the two architectures are starting to differ in greater detail. First and foremost, AMD finally addressed the core starvation. Originally, Bulldozer had a single Fetch and a single Decode unit, which were in charge of feeding both the Integer and Float schedulers. It turns out that those units were too small, and quite often you'd waste precious cycles with either the ALU or FPU pipelines not doing a thing. Steamroller goes back to square one and keeps the Fetch unit as a single entity, but the Decode part is now doubled. Each Decode unit feeds one INT unit (4 pipelines) and the FP Scheduler, which has three dedicated units, including two 128-bit FMAC units that can act as a single 256-bit unit when you need 256-bit AVX. For legacy code, the MMX Unit is now a single separate entity (instead of multiple side half-units in the Bulldozer design).
Another major improvement is the increase in the instruction cache size. Up until Bulldozer, AMD featured the largest L1 cache in the field: both the L1 Instruction (I-Cache) and Data (D-Cache) caches were the same size (64KB). With 128KB of L1 cache, AMD easily compensated for the size deficit in L2 and L3 cache versus Intel's Nehalem and Sandy Bridge architectures. Bulldozer sliced the L1 cache down to "better than Pentium 4, but still crap", as one of our sources put it bluntly (16KB L1 D-Cache and 64KB L1 I-Cache). Steamroller increases the size of the Instruction cache beyond K7/K8/K10/K10.5/BD, and the L1 D-Cache won't stay the same either.
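To get a feel for why doubling the Decode stage matters, here is a toy round-robin model in Python. It is only a sketch under loud assumptions: the decode width of 4 instructions per cycle and the function name are our own placeholders, not AMD figures, and a real front end is far more complicated than "one decoder serves one core per cycle".

```python
def decoded_per_core_per_cycle(decoders: int, width: int,
                               cores: int, cycles: int) -> float:
    """Average instructions decoded per core per cycle in a toy model.

    Assumption (hypothetical): each decoder serves exactly one core per
    cycle, and cores take turns when there are fewer decoders than cores.
    """
    decoded = [0] * cores
    for cycle in range(cycles):
        for d in range(decoders):
            # Round-robin choice of which core this decoder feeds.
            core = (cycle * decoders + d) % cores
            decoded[core] += width
    return sum(decoded) / (cores * cycles)


# One shared decoder for two cores (Bulldozer-style, as described above).
shared = decoded_per_core_per_cycle(decoders=1, width=4, cores=2, cycles=1000)
# One decoder per core (Steamroller-style, as described above).
doubled = decoded_per_core_per_cycle(decoders=2, width=4, cores=2, cycles=1000)
print(shared, doubled)
```

In this simplified model each core sees half the decode bandwidth when the decoder is shared, which is the starvation the paragraph above describes. The model says nothing about real-world gains.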
According to Mark Papermaster, the improvements should yield a performance increase of up to 30%, but our sources inside the company beg to differ.
"Steamroller is not Bulldozer Enhanced. F*** no. The layout might look the same but our LEGO blocks are completely different. When all is said and done we should get 45% improvement and this goes to show how the Bulldozer was f***** design. This is all what Bulldozer was supposed to be."
This is a direct quote from a source who will remain anonymous, but who is important enough to be awake at 3:32 AM European time (the source is based in the United States). Bullish statements from engineers are nothing uncommon, as the technology and products are nothing less than their own babies.
Furthermore, AMD reorganized the L2 cache, which is now dynamic and can change its size per core. Gone are the days when every 2-core block had a fixed amount of L2 cache and that was that. Steamroller takes full advantage of the fact that AMD has an exclusive, not inclusive, cache architecture, and the L2 will stretch and shrink depending on the load on a particular core.
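AMD has not described how the resizing works, so the following Python snippet is purely illustrative: it splits a shared L2 budget between cores in proportion to their load, with a guaranteed minimum per core. The 2048KB budget, the 256KB floor, the load values and the function name are all hypothetical.

```python
def partition_l2(total_kb: int, loads: list[float],
                 floor_kb: int = 256) -> list[int]:
    """Split a shared L2 budget across cores in proportion to load.

    Hypothetical policy, not AMD's: every core keeps `floor_kb`, and the
    rest of the budget is divided according to each core's load.
    """
    n = len(loads)
    if floor_kb * n > total_kb:
        raise ValueError("per-core floor does not fit in the L2 budget")
    spare = total_kb - floor_kb * n
    total_load = sum(loads)
    if total_load == 0:
        shares = [spare // n] * n  # idle cores: split evenly
    else:
        shares = [int(spare * load / total_load) for load in loads]
    alloc = [floor_kb + share for share in shares]
    # Give any rounding remainder to the busiest core so nothing is wasted.
    alloc[loads.index(max(loads))] += total_kb - sum(alloc)
    return alloc


busy_and_idle = partition_l2(2048, [0.9, 0.1])
balanced = partition_l2(2048, [0.5, 0.5])
print(busy_and_idle, balanced)
```

With one busy and one mostly idle core, the busy core ends up with most of the budget; with equal load the split is even.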
The L3 cache is attached to the memory controller and we suspect that there will be no improvements there. Remember that in our own professional opinion, AMD's troubles with Barcelona (K10) and Shanghai (K10.5) were the sole result of a mismatched memory architecture, in which the L3 cache operated at the same clock as the memory controller, which was up to 50% slower than the core clock. As a result, the copy from one core to another was painfully slow: we've seen the latency of DDR3 memory being equal to that of the L3 cache, with the said cache offering less bandwidth than dual-channel DDR3-1600 memory. For the record, our chip-level and architecture-level sources confirmed our doubts, which in turn caused a lot of friction and cut-offs between the public relations parts of AMD and the author of these lines. With the new management in place, it looks like the new staff prefers honesty to games.
Getting back on track, Steamroller also features improved inter-core performance, but major improvements there will come with the Excavator CPU/GPU merger. This is also the time when we are going to see the new CPU sockets, but AMD is silent as the grave on that one.
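To put the L3 comparison above in numbers, the short Python sketch below computes the theoretical peak bandwidth of dual-channel DDR3-1600 (1600 MT/s over a 64-bit, i.e. 8-byte, bus per channel) and the L3 clock implied by a 50% deficit. The 3.0 GHz core clock is a hypothetical example value, not a figure from AMD, and the helper names are our own.

```python
def ddr_peak_bandwidth_gbs(transfers_mt_s: int, channels: int,
                           bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


def l3_clock_ghz(core_clock_ghz: float, deficit: float) -> float:
    """L3/memory-controller clock given a fractional deficit vs the core."""
    return core_clock_ghz * (1.0 - deficit)


ddr3_1600_dual = ddr_peak_bandwidth_gbs(1600, channels=2)
# Hypothetical 3.0 GHz core with the "up to 50% slower" L3 clock.
l3_clock = l3_clock_ghz(3.0, deficit=0.5)
print(f"{ddr3_1600_dual:.1f} GB/s, L3 at {l3_clock:.1f} GHz")
```

So the claim is that the L3 delivered less than the 25.6 GB/s theoretical peak of the memory behind it.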
Can AMD Compete Against Intel?
One of the mantras we've heard from AMD lately is that the company won't compete against Intel anymore, that the time of big performance is done and dusted, that mobile is king etc. That mantra was nothing but an understandable defensive move, but that's not where the money lies. While AMD will go and compete in the mobile field with mixed-mode x86+ARM based APUs, the real money lies in the datacenters. When people look at the growth of mobile devices, nobody talks about the correlation with datacenters, which require the most powerful CPUs you can get. Mobile processors cannot handle the load and they rely on the best server CPUs available.
With Steamroller, AMD looks to be on the path to resurgence, but Intel will not stand still. Ivy Bridge-EP and EX will mark 2013, while Haswell-EP and EX will arrive during 2014. The Steamroller core will go head to head against Ivy Bridge-EP/EX, while Excavator will go against Haswell-EP/EX. While few would give AMD a fighting chance, the truth of the matter is that the company has understood its mistakes, and it has already beaten Intel once, the 8000-pound gorilla of the semiconductor market (yes, it's an 8000-pound gorilla, not the 800-pound one).
Can Jim Keller and John Gustafson lead the resurgence? Only time will tell, but Steamroller just may be what the company needs. In the meantime, you'll have to make do with Bulldozer and Piledriver cores.