Scalar Processor Design Improves GPU Efficiency

Leading GPUs to date have used vector processing units because many
graphics operations work on vector data (such as R-G-B-A components in
pixel shaders or 4×4 matrices for geometry transforms in vertex shaders).
However, many scalar operations also occur. During the early GeForce
8800 architecture design phases, NVIDIA engineers analyzed hundreds of shader
programs, which showed an increasing use of scalar computations. They realized
that with a mix of vector and scalar instructions, which is especially evident
in longer, more complex shaders, a vector architecture struggles to keep all
of its processing units busy at any given instant. Scalar computations are
difficult to compile and schedule efficiently on a vector pipeline.
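
The utilization problem described above can be illustrated with a minimal
sketch. It assumes a simplified model in which every shader operation occupies
one issue slot on a four-lane vector unit regardless of how many components it
actually uses. The function name and the sample instruction mix are
hypothetical and are not taken from NVIDIA's shader analysis.

```python
def vector_lane_utilization(op_widths, lanes=4):
    """Fraction of vector lanes doing useful work.

    Simplified model (an assumption, not measured hardware behavior):
    each operation takes one full issue slot on a `lanes`-wide unit,
    so an operation of width w leaves (lanes - w) lanes idle.
    """
    if not op_widths:
        raise ValueError("op_widths must not be empty")
    if any(w < 1 or w > lanes for w in op_widths):
        raise ValueError("each width must be between 1 and lanes")
    useful = sum(op_widths)              # lanes that carry real work
    available = lanes * len(op_widths)   # lanes paid for, one slot per op
    return useful / available


# Hypothetical shader: component widths of six consecutive operations.
mixed_shader = [4, 3, 1, 1, 2, 1]
print(vector_lane_utilization(mixed_shader))
```

In this model a purely four-wide workload keeps every lane busy, while each
scalar operation wastes three of the four lanes. A scalar design avoids the
loss by construction, because each component becomes its own operation on a
one-lane processor.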

Both NVIDIA and ATI vector-based GPUs have used shader hardware
that permits dual instruction issue. Recent ATI GPUs use a “3+1”
design, allowing single issue of a four-element vector instruction or dual issue
of a three-element vector instruction and a scalar instruction. NVIDIA GeForce
6x and GeForce 7x GPUs are more efficient, with both 3+1 and 2+2 dual-issue
designs, but still not as efficient as the GeForce 8800 GPU scalar design, which
can issue scalar operations to its scalar processors with 100% shader processor
efficiency. NVIDIA engineers have estimated that as much as a 2X performance improvement
can be realized from a scalar architecture that uses 128 scalar processors versus
one that uses 32 4-component vector processors, based on architectural efficiency
of the scalar design. (Note that vector-based shader program code is converted
to scalar operations inside a GeForce 8800 GPU to ensure complete efficiency.)
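
The comparison between dual-issue vector designs and a scalar design can be
sketched with a toy model. Everything in it is an assumption made for
illustration: adjacent operations are paired greedily, data dependencies are
ignored, a pair co-issues whenever the wider operation fits the larger part of
an allowed split and the narrower one fits the smaller part, and both designs
run at the same clock. The function names and the sample instruction mix are
hypothetical, and the model does not reproduce NVIDIA's internal estimate.

```python
def issue_slots(op_widths, splits=()):
    """Count issue slots needed on one four-wide vector unit.

    splits lists the allowed dual-issue shapes, e.g. [(3, 1)] for a
    "3+1" design or [(3, 1), (2, 2)] for "3+1 and 2+2". An empty
    splits means single issue only. Pairing is greedy and in order.
    """
    if any(w < 1 or w > 4 for w in op_widths):
        raise ValueError("each width must be between 1 and 4")
    slots = 0
    i = 0
    while i < len(op_widths):
        if i + 1 < len(op_widths):
            big, small = sorted((op_widths[i], op_widths[i + 1]), reverse=True)
            # Assumed rule: the pair fits if each op fits one part of a split.
            if any(big <= a and small <= b for a, b in splits):
                slots += 1
                i += 2
                continue
        slots += 1
        i += 1
    return slots


def scalar_speedup(op_widths, splits=()):
    """Throughput of 128 scalar processors relative to 32 four-wide units.

    Per-shader cost is issue slots / 32 for the vector design and
    scalar operations / 128 for the scalar design (same clock assumed).
    """
    vector_cycles = issue_slots(op_widths, splits) / 32
    scalar_cycles = sum(op_widths) / 128
    return vector_cycles / scalar_cycles


# Hypothetical shader: component widths of seven consecutive operations.
shader = [3, 1, 2, 2, 1, 1, 4]
print(issue_slots(shader))                    # single issue
print(issue_slots(shader, [(3, 1)]))          # 3+1 dual issue
print(issue_slots(shader, [(3, 1), (2, 2)]))  # 3+1 and 2+2 dual issue
print(scalar_speedup(shader))
```

For this particular mix the model needs 7 slots with single issue, 5 with 3+1
dual issue, and 4 with both 3+1 and 2+2, so more flexible dual issue narrows
the gap to the scalar design without closing it. The figures depend entirely
on the chosen widths and the pairing rule, so they should be read as an
illustration of the trend rather than as a performance prediction.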