CineFX 3.0 : Shader 3.0 Architecture

NV40 NVIDIA GeForce 6800 Ultra (NV40) Review

Built on the mature 130nm process technology developed by IBM,
NV40 has a hefty 222M transistor count, a 70% increase over the NV38,
the bulk of which is taken up by the pixel shader unit. NVIDIA has made NV40
much more scalable than its predecessors: with 16 parallel pixel pipelines,
products can range from entry-level parts with 8-12 pipes all the way to
high-end parts with 16 pipes. In NVIDIA's terminology, the NV40 architecture is
known as CineFX 3.0, a total redesign of the CineFX 2.0 architecture in the
NV38. The below-expectation pixel shader performance of the NV38, which sports a
4×2 / 8×0 configuration, prompted NVIDIA to redesign the whole architecture for
16×1 (16 pixels per clock, color and Z) and 32×0 (32 pixels per clock, Z only).
NV40 continues to offer full FP32 support like the NV38, but it is much more
capable with its new architecture. Through Microsoft DirectX 9.0c, Shader Model
3.0, comprising Vertex Shader 3.0 (VS 3.0) and Pixel Shader 3.0 (PS 3.0), will
be fully supported on NV40.
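
To make the pipeline notation concrete, here is a small Python sketch that reads a configuration such as 16×1 as "pixels per clock × texture units per pipe", which is the usual convention and an assumption on our part rather than NVIDIA's definition. The function names are ours, and the core clock is left as a parameter because the text above does not give one.

```python
def parse_config(config):
    """Split a 'pipes x textures-per-pipe' string such as '16x1' or '4×2'."""
    pipes, textures = config.lower().replace("×", "x").split("x")
    return int(pipes), int(textures)


def per_clock(config):
    """Pixels and texels produced per clock for a given configuration."""
    pipes, textures = parse_config(config)
    return {"pixels": pipes, "texels": pipes * textures}


def fill_rate_mpixels(config, core_mhz):
    """Peak fill rate in Mpixels/s; core_mhz is a caller-supplied assumption."""
    return per_clock(config)["pixels"] * core_mhz


if __name__ == "__main__":
    for name, config in [("NV38 color+Z", "4x2"), ("NV38 Z only", "8x0"),
                         ("NV40 color+Z", "16x1"), ("NV40 Z only", "32x0")]:
        print(name, per_clock(config))
```

Under this reading, the move from 4×2 to 16×1 quadruples the pixels written per clock and doubles the texels, which fits the emphasis on pixel shader throughput in the redesign.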

Vertex Shader

NV40 has 6 vertex shader pipes, twice as many as the NV38, to feed
the 16 hungry pixel pipes. They follow a MIMD (Multiple Instruction,
Multiple Data) design in which each engine can execute different instructions
independently. On NV40, the number of instruction slots is doubled to 512,
texture fetches are available, and dynamic flow control (additional looping and
branching options, plus new subroutine call/return functions) gives programmers
more choices for writing efficient shader programs. Geometry instancing is also
supported, and with looping the executed length of vertex shader programs
becomes effectively unlimited. The vertex processors of the NV40 can perform
texture look-ups, which can be used for geometry deformation such as
displacement mapping to add depth and realism to every component, surface, and
character in a scene.

The vertex frequency stream divider is a nice feature of the NV40's
vertex processors: it allows a subset of the input registers to be initialized
at a lower rate, whereas previously all inputs were refreshed each time the
vertex shader was invoked, once per vertex processed. The application sets a
frequency between 0 and 2^16 - 1 for a given data stream, and all elements in
that stream are affected by this value. Effects can be applied efficiently to
multiple characters or objects in a scene, providing individuality where models
are otherwise identical, e.g. a group of soldiers in which each soldier has his
own unique attributes and animations. With these VS 3.0 features in place,
shader programs can now be written with more effects than ever before.
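
The stream divider idea can be sketched in a few lines of Python. This is a simplified model of our own, not the Direct3D API: the function names are hypothetical, and we assume a divider of at least 1 so that the integer division is defined. One stream holds the shared mesh and is read once per vertex, while a second stream holds per-soldier attributes and advances only once per model.

```python
MAX_FREQUENCY = 2**16 - 1  # upper bound quoted in the text


def fetch_index(vertex_index, divider):
    """Element of a stream read for a vertex when the stream advances
    once every `divider` vertices (simplified model, divider >= 1)."""
    if not 1 <= divider <= MAX_FREQUENCY:
        raise ValueError("divider out of range for this sketch")
    return vertex_index // divider


def assemble(per_vertex, per_instance):
    """Pair each mesh vertex with the attributes of the instance it belongs to."""
    verts = len(per_vertex)
    inputs = []
    for v in range(verts * len(per_instance)):
        mesh_element = per_vertex[v % verts]                     # shared geometry
        instance_element = per_instance[fetch_index(v, verts)]   # low-rate stream
        inputs.append((mesh_element, instance_element))
    return inputs


if __name__ == "__main__":
    soldier_mesh = ["v0", "v1", "v2"]
    uniforms = ["red", "blue"]
    for pair in assemble(soldier_mesh, uniforms):
        print(pair)
```

With three mesh vertices and two instances, the per-instance stream is read only twice while the vertex shader still runs six times, which is the saving the divider is meant to provide.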

Pixel Shader

[Figure: Single pixel pipeline diagram]

[Figure: Pipelines, NV3x vs. NV40]

NV40 conforms to the DirectX 9.0c PS 3.0 requirements, with more instructions
and registers, support for dynamic branching and the full FP32 precision that
PS 3.0 demands. Each of the 16 pixel pipelines contains 1 texture processor,
2 FP32 shader units and 2 FP ALUs. The two ALUs in each NV40 pipe can perform up
to 8 ops per pixel, twice as many as the NV3x. With 32 FP ALUs working at the
same time, the NV40 can perform up to 128 ops and 64 instructions per clock
cycle. The texture units of the NV40 can perform bilinear, trilinear and
anisotropic filtering up to 16x, compared to 8x on the NV3x. It is delightful to
see NVIDIA matching ATi's R3xx and R420 architectures, which are capable of
16x.
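
The per-clock totals quoted above follow from the per-pipe figures by simple multiplication, as this short Python check shows. The constants are the numbers from the text. The per-pipe instruction count is not stated there, so it is derived here from the 64-instruction total and should be read as our inference.

```python
PIXEL_PIPES = 16
ALUS_PER_PIPE = 2
OPS_PER_PIPE = 8           # "up to 8 ops per pixel"
TOTAL_INSTRUCTIONS = 64    # per-clock total quoted in the text


def peak_per_clock(pipes=PIXEL_PIPES, alus=ALUS_PER_PIPE, ops=OPS_PER_PIPE):
    """Return (total ALUs, total ops per clock) for the whole pixel array."""
    return pipes * alus, pipes * ops


def instructions_per_pipe(total=TOTAL_INSTRUCTIONS, pipes=PIXEL_PIPES):
    """Derived figure: instructions issued per pipe per clock."""
    return total // pipes


if __name__ == "__main__":
    print(peak_per_clock())         # (32, 128)
    print(instructions_per_pipe())  # 4
```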

At the end of the pixel pipelines are the ROPs (Render Outputs), which
handle AA as well as Z checking and color compression. There are 16 of these
ROPs in NV40; each contains a Z ROP and a C ROP and is capable of writing
one color + Z pixel or 2 Z/stencil values per clock cycle. NV40 is capable of up
to 4x MSAA like the NV3x, and any higher mode requires a combination of
super-sampling and multi-sampling. For NV40, NVIDIA has moved from an ordered
grid at 4x to a rotated grid like ATi's, which means better smoothing of polygon
edges at no extra performance cost.
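
Why a rotated grid smooths edges better for the same four samples can be shown with a small Python experiment. The sample positions below are illustrative patterns we chose for the sketch, not NV40's actual sample locations. The code sweeps a vertical edge across one pixel and counts how many distinct coverage levels each pattern can report.

```python
# Sample positions inside a unit pixel (illustrative, not NV40's real pattern).
ORDERED_GRID_4X = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
ROTATED_GRID_4X = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]


def coverage_levels(samples, steps=16):
    """Distinct coverage fractions seen as a vertical edge sweeps the pixel.

    Everything left of the edge is covered; the edge moves in 1/steps increments.
    """
    levels = set()
    for i in range(steps + 1):
        edge = i / steps
        covered = sum(1 for x, _ in samples if x < edge)
        levels.add(covered / len(samples))
    return sorted(levels)


if __name__ == "__main__":
    print("ordered:", coverage_levels(ORDERED_GRID_4X))
    print("rotated:", coverage_levels(ROTATED_GRID_4X))
```

The ordered grid has only two distinct horizontal sample positions, so a near-vertical edge steps through three shades, while the rotated grid has four and steps through five. The number of samples, and therefore the cost, is the same.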

NV40 also supports Multiple Render Target (MRT) technology, which allows up to
4 values to be output from the pixel shader to separate buffers simultaneously.
It is used for deferred shading, a technique in which the lighting of a scene is
done after all of the geometry has been rendered, eliminating multiple passes
through the scene. Photorealistic lighting can be created while avoiding
unnecessary processing time for pixels that do not contribute to the visible
portions of an image. However, antialiasing is not supported when MRT is in use.
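
A toy CPU version of deferred shading helps show where the saving comes from. This is a minimal Python sketch under our own assumptions: the buffer layout (albedo, normal, position, depth), the scalar albedo and the directional lights are all choices made for illustration, not a description of how NV40 or any particular engine does it.

```python
def geometry_pass(fragments, width, height):
    """Write four per-pixel outputs (standing in for the MRT outputs),
    keeping only the nearest fragment at each pixel."""
    gbuffer = {
        "depth": [[float("inf")] * width for _ in range(height)],
        "albedo": [[None] * width for _ in range(height)],
        "normal": [[None] * width for _ in range(height)],
        "position": [[None] * width for _ in range(height)],  # kept for point lights
    }
    for frag in fragments:
        x, y = frag["x"], frag["y"]
        if frag["depth"] < gbuffer["depth"][y][x]:
            for key in ("depth", "albedo", "normal", "position"):
                gbuffer[key][y][x] = frag[key]
    return gbuffer


def lighting_pass(gbuffer, light_dirs):
    """Shade each visible pixel once; returns (image, pixels shaded)."""
    height, width = len(gbuffer["albedo"]), len(gbuffer["albedo"][0])
    image = [[0.0] * width for _ in range(height)]
    shaded = 0
    for y in range(height):
        for x in range(width):
            if gbuffer["albedo"][y][x] is None:
                continue  # nothing was drawn here
            normal = gbuffer["normal"][y][x]
            diffuse = sum(max(0.0, sum(n * l for n, l in zip(normal, light)))
                          for light in light_dirs)
            image[y][x] = gbuffer["albedo"][y][x] * diffuse
            shaded += 1
    return image, shaded


if __name__ == "__main__":
    frags = [
        {"x": 0, "y": 0, "depth": 0.8, "albedo": 0.2,
         "normal": (0, 0, 1), "position": (0, 0, 0.8)},
        {"x": 0, "y": 0, "depth": 0.3, "albedo": 1.0,
         "normal": (0, 0, 1), "position": (0, 0, 0.3)},
    ]
    image, shaded = lighting_pass(geometry_pass(frags, 2, 1), [(0, 0, 1), (0, 1, 0)])
    print(image, shaded)
```

Two fragments land on the same pixel, but the lighting loop runs only for the one that ends up visible, so the lighting cost scales with the number of visible pixels rather than with overdraw.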

 

High-Precision Dynamic-Range (HPDR) Technology

NVIDIA's High-Precision Dynamic-Range (HPDR) technology, more
commonly known as High Dynamic Range rendering, is based on the OpenEXR standard
by Industrial Light & Magic. To execute this type of rendering, the GPU
needs to be capable of floating-point shading, blending, filtering, and
texturing, and it must also be able to store colors in a way that preserves the
logarithmic nature of the data. With the NV40 architecture in place, HPDR is now
possible, as it is capable of full FP32 shading, FP16 texture filtering and 16x
anisotropic filtering. With NVIDIA's 64-bit texture filtering and blending
technology, developers can now easily take advantage of advanced shading,
blending, filtering, and texturing capabilities to create accurate effects such
as iridescence, motion blur, and soft shadows.
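
The benefit of floating-point color storage can be illustrated with Python's standard struct module, whose "e" format is IEEE 754 half precision (16 bits per channel, so 64 bits for four channels). We assume this behaves like the FP16 storage discussed above. The helper names and the exposure step are ours, added only to show what happens when a bright value is scaled down after it has been stored.

```python
import struct


def to_unorm8(value):
    """Store a color channel in a conventional 8-bit format (clamped to 0..1)."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255) / 255


def to_fp16(value):
    """Round-trip a color channel through 16-bit half-precision float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def expose(value, exposure):
    """Scale a stored value for display, clamping to the displayable range."""
    return min(value * exposure, 1.0)


if __name__ == "__main__":
    bright = 16.0  # e.g. a light source 16 times brighter than display white
    print("8-bit :", expose(to_unorm8(bright), 1 / 32))
    print("FP16  :", expose(to_fp16(bright), 1 / 32))
```

The 8-bit buffer clamps the bright value to 1.0 when it is stored, so after the exposure change it appears far too dark. The FP16 buffer keeps the original 16.0 and yields the expected mid-grey.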