7 NVIDIA GeForce 8800GTX Review

With its unified pipeline and shader architecture, the GeForce
8800 GPU design significantly reduces the number of pipeline stages and changes
the sequential flow to be more looping oriented. Inputs are fed to the top of
the unified shader core, outputs are written to registers, and those results are
then fed back into the top of the shader core for the next operation. The classic pipeline
uses discrete shader types represented in different colors, where data flows
sequentially down the pipeline through different shader types. The illustration
on the right depicts a unified shader core with one or more standardized, unified
shader processors.

Data entering at the top left of the unified design (such as
vertices) are dispatched to the shader core for processing, and results are
sent back to the top of the shader core, where they are dispatched again, processed
again, looped to the top, and so on until all shader operations are performed
and the pixel fragment is passed on to the ROP subsystem.
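
The looping flow described above can be pictured with a toy model. The Python sketch below is only an illustration, not a description of how the GeForce 8800 hardware is built: the function name `run_unified_core`, the stage labels, and the simple round-robin queue are all assumptions made for this example. Each work item carries the list of shader stages it still needs; the core runs one stage, then feeds the item back to the top until no stages remain, at which point the item is handed to the ROP list.

```python
from collections import deque


def run_unified_core(items):
    """Loop work items through one shared core until their stages run out.

    items: list of (name, [stage, ...]) pairs (hypothetical work items).
    Returns (rop_output, trace), where trace records each (name, stage)
    in the order the core processed it.
    """
    queue = deque((name, list(stages)) for name, stages in items)
    trace = []
    rop_output = []
    while queue:
        name, stages = queue.popleft()
        if not stages:
            # All shader operations are done: pass the result on to the ROP.
            rop_output.append(name)
            continue
        stage = stages.pop(0)          # run the next shader operation
        trace.append((name, stage))
        queue.append((name, stages))   # loop back to the top of the core
    return rop_output, trace


rop, trace = run_unified_core([
    ("tri0", ["vertex", "geometry", "pixel"]),
    ("tri1", ["vertex", "pixel"]),
])
for name, stage in trace:
    print(f"{name}: {stage}")
print("sent to ROP:", rop)
```

The point of the sketch is that there is a single processing loop rather than a chain of stage-specific units; real hardware schedules many threads in parallel rather than one item at a time.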

The GeForce 8800 design team realized that extreme amounts
of hardware-based shading horsepower would be necessary for high-end DirectX
10 3D games. While DirectX 10 specifies a unified instruction set, it does not
demand a unified GPU shader design, but NVIDIA GeForce 8800 engineers believed
a unified GPU shader architecture made the most sense to allow effective DirectX
10 shader program load-balancing, efficient GPU power utilization, and significantly
improved GPU architectural efficiency.

Note that the GeForce 8800 unified shaders can also be used
with DirectX 9, OpenGL, and older DirectX versions. No restrictions or fixed
numbers of unified shading units need to be dedicated to pixel or vertex processing
for any of the API programming models.

Numerous challenges had to be overcome with such
a radical new design over the four-year GeForce 8800 GPU development timeframe.
Looking more closely at graphics programming, we can safely say that, in general,
pixels outnumber vertices by a wide margin, which is why you
saw a much larger number of pixel shader units than vertex shader units in
prior fixed shader GPU architectures.

But different applications have different shader processing
requirements at any given point in time: some scenes may be pixel shader
intensive and other scenes may be vertex shader intensive. In a GPU with a fixed
number of specific types of shader units, restrictions are placed on operating
efficiency, attainable performance, and application design. For illustration,
the figure below shows a theoretical GPU with a fixed number of four vertex
shader units and eight pixel shader units, or a total of 12 shader units altogether.

The top scenario shows a scene that is vertex shader intensive,
so its performance is capped by the number of vertex units,
which in this case is “4”. In the bottom scenario, the scene is
pixel shader intensive, which might be due to various complex lighting effects
for the water. In this case, it is pixel shader limited and can only attain
a maximum performance of “8”, equal to the number of pixel shader
units, which is the bottleneck here. Neither situation is optimal,
because hardware sits idle and performance is left on the table. It is also
inefficient from a power (performance/watt) or die size and cost
(performance/sq. mm) perspective.

With a unified shader architecture, whenever
an application is vertex shader intensive, the majority of
unified shader processors can be applied to processing vertex data, and in this
case the overall performance increases to “11”. Similarly, if the scene is
pixel shader heavy, the majority of unified shader units can be applied to pixel
processing, also attaining a score of “11” in the example below.
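
The scores of “4”, “8” and “11” can be reproduced with a little arithmetic. The Python sketch below is a simplified model, not something taken from NVIDIA's material: the function names are hypothetical, and it assumes that 11/12 of the shader work in each scene belongs to the dominant shader type, a ratio chosen only because it matches the numbers in the example. The model asks how many units of each kind can be kept busy before one kind saturates.

```python
from fractions import Fraction


def busy_units_fixed(vertex_units, pixel_units, vertex_share):
    """Busy (vertex, pixel) units when each unit has a fixed type.

    vertex_share is the fraction of shader work that is vertex work,
    strictly between 0 and 1 (an assumed property of the scene).
    """
    pixel_share = 1 - vertex_share
    # Total throughput is capped by whichever unit type saturates first.
    throughput = min(vertex_units / vertex_share, pixel_units / pixel_share)
    return throughput * vertex_share, throughput * pixel_share


def busy_units_unified(total_units, vertex_share):
    """Busy (vertex, pixel) units when any unit can run any shader type."""
    return total_units * vertex_share, total_units * (1 - vertex_share)


# Assumed workload mix: 11/12 of the work is of the dominant type.
vertex_heavy = Fraction(11, 12)
pixel_heavy = Fraction(1, 12)

for label, share in (("vertex heavy", vertex_heavy), ("pixel heavy", pixel_heavy)):
    fixed_v, fixed_p = busy_units_fixed(4, 8, share)
    uni_v, uni_p = busy_units_unified(12, share)
    print(f"{label}: fixed busy = {float(fixed_v):.2f} vertex + {float(fixed_p):.2f} pixel, "
          f"unified busy = {float(uni_v):.2f} vertex + {float(uni_p):.2f} pixel")
```

Under that assumed mix, the fixed design keeps only 4 vertex units (or 8 pixel units) fully busy while most of the other units idle, whereas the unified design keeps all 12 units working, 11 of them on the dominant shader type.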

Unified streaming processors (SPs) in GeForce 8800 GPUs can
process vertices, pixels, or geometry—they are effectively general purpose
floating point processors. Different workloads can be mapped to the processors,
including physics and other workloads we may see in the near future.
Note that geometry shading is a new feature of the DirectX 10 specification.
The GeForce 8800 unified stream processors can process geometry shader programs,
permitting a powerful new range of effects and features, while reducing dependence
on the CPU for geometry processing.

The GPU dispatch and control logic can dynamically assign vertex,
geometry, or pixel operations to available SPs without worrying about fixed
numbers of specific types of shader units. In fact, this feature is just as
important to developers, who need not worry as much that certain aspects of
their code might be too pixel shader intensive or too vertex shader intensive.
Then again, many developers would still be mindful of what type of hardware
the majority of gamers are running…
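
As a rough picture of what "dynamically assign" means, here is a minimal Python sketch of a one-cycle dispatcher. Everything in it is an assumption for illustration: the function name `dispatch`, the backlog dictionary, and the "deepest backlog first" policy are invented here, and the real GeForce 8800 scheduling logic is not documented in this article.

```python
def dispatch(pending, sp_count):
    """Assign up to sp_count queued operations to SPs for one cycle.

    pending maps an operation type ("vertex", "geometry", "pixel") to the
    number of operations waiting. Returns (assigned, remaining), both
    dicts keyed by operation type.
    """
    assigned = {kind: 0 for kind in pending}
    remaining = dict(pending)
    for _ in range(sp_count):
        # Any SP can run any type, so pick the type with the deepest backlog.
        kind = max(remaining, key=remaining.get)
        if remaining[kind] == 0:
            break  # no work left; the remaining SPs stay idle this cycle
        remaining[kind] -= 1
        assigned[kind] += 1
    return assigned, remaining


# A pixel-heavy moment followed by a vertex-heavy one, with 12 hypothetical SPs.
print(dispatch({"vertex": 2, "geometry": 1, "pixel": 5}, 12))
print(dispatch({"vertex": 20, "geometry": 0, "pixel": 4}, 12))
```

No SP count is reserved per shader type: the split between vertex, geometry, and pixel work is decided afresh each cycle from whatever is waiting. A production scheduler would also need to avoid starving small queues, which this greedy policy does not attempt.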

Not only does a unified shader design assist in load-balancing
shader workloads, it actually helps redefine how a graphics pipeline is organized.
In the future, it is possible that other types of workloads can be run on a
unified stream processor.