How AMD can Steamroller the competition in 2013

You've seen the Net coverage of AMD CTO Mark Papermaster's Hot Chips unveiling of the Steamroller micro-architecture. It looks like what Bulldozer should have been in the first place a year ago, but a few things are still missing for it to compete better against Intel…

… And it's not in the core, actually. The core and its related improvements, if they really deliver the claimed 45% overall performance gain including the clock frequency boost, are good. But, and it's a big but, they still sit in the same socket – with dual-channel DDR3 memory for the single-die chip, and quad-channel DDR3 for the dual-die server variety.

At up to 5 dual-core pairs per die, only two channels of DDR3 memory might not feed the CPU as well as four channels of DDR3-1866 (possibly even 2133, judging by Inphi's register clock chip announcement this summer for such ECC buffered DIMM support) on next year's Ivy Bridge EP. The effect on memory-bound apps, including the increasingly popular 'big data' and analytics workloads, would be serious.
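A rough back-of-envelope sketch of that gap, assuming DDR3-1866 on both platforms (the speed grades and channel counts are taken from the comparison above; real-world sustained throughput will be lower):

```python
# Rough peak-bandwidth comparison: dual-channel vs quad-channel DDR3.
# Assumes DDR3-1866 on both sides and a 64-bit (8-byte) channel width;
# actual speed grades and sustained throughput will differ.

def peak_bandwidth_gbs(transfer_rate_mts, channels, bus_width_bytes=8):
    """Theoretical peak bandwidth in GB/s for a DDR memory configuration."""
    return transfer_rate_mts * bus_width_bytes * channels / 1000

dual_ddr3_1866 = peak_bandwidth_gbs(1866, channels=2)   # ~29.9 GB/s
quad_ddr3_1866 = peak_bandwidth_gbs(1866, channels=4)   # ~59.7 GB/s

print(f"Dual-channel DDR3-1866: {dual_ddr3_1866:.1f} GB/s peak")
print(f"Quad-channel DDR3-1866: {quad_ddr3_1866:.1f} GB/s peak")
```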

Yet AMD is short of resources – partly of their own making – to create new sockets with greater memory and interconnect bandwidth before the Excavator generation in 2014. So, aside from asking another party to co-fund a new socket, what's the remaining option? There is one…
 
Remember, some years ago AMD was active – it even held a bunch of patents – in dense but fast eDRAM (low-latency embedded DRAM) technology, which is just a bit slower than SRAM yet offers density almost at DRAM level. That work was (almost) forgotten.
 
So let's take a page from the old Alpha and IBM Power CPUs – as well as from the GT3 workstation Haswell flavours (Xeon E3 v3) with a dedicated outside cache die, an L4 in Haswell's case, sitting on a separate wide backside bus within the chip packaging, much like an MCM.
 
Combining AMD's eDRAM with the backside L4 cache die approach could give AMD, say, a 128 MB or even 256 MB dedicated L4 cache sitting on a wide – even 1024-bit – bus within the chip packaging, and massively help counter the bandwidth drawback of only two DDR channels per die. In apps where the code and/or the big data loops fit within that footprint, you could more than double real-life performance this way alone. What say you, AMD? The time to act is now, and there aren't many options to choose from…
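To see why such a wide backside bus matters, here is a minimal sketch of what a 1024-bit link could deliver next to dual-channel DDR3-1866; the 500 MHz DDR clock for the L4 link is purely an assumed figure for illustration, not anything AMD has announced:

```python
# Back-of-envelope: peak bandwidth of a hypothetical 1024-bit backside L4 bus
# versus dual-channel DDR3-1866. The 500 MHz DDR clock for the L4 link is an
# assumption for illustration only.

def bus_bandwidth_gbs(width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s for a parallel DDR bus."""
    return width_bits / 8 * clock_mhz * transfers_per_clock / 1000

l4_backside = bus_bandwidth_gbs(width_bits=1024, clock_mhz=500)  # ~128 GB/s
dual_ddr3   = bus_bandwidth_gbs(width_bits=128,  clock_mhz=933)  # ~29.9 GB/s (2x DDR3-1866)

print(f"Hypothetical 1024-bit L4 bus: {l4_backside:.0f} GB/s peak")
print(f"Dual-channel DDR3-1866:       {dual_ddr3:.1f} GB/s peak")
```

Even at that modest assumed clock, the on-package L4 would offer roughly four times the bandwidth of the two DDR3 channels it is meant to shield.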
 
Picture Credits: Futurama S06E25 'Overclockwise'