Shader programming tips #1
Thursday, January 29, 2009
DX9 generation hardware was largely vector based. The DX10 generation hardware, on the other hand, is generally scalar based. This is true for both ATI and Nvidia cards. The Nvidia chips are fully scalar, and while the ATI chips still have explicit parallelism, the 5 scalars within an instruction slot don't need to perform the same operation or operate on the same registers. This is important to remember and should affect how you write shader code. Take for instance this simple diffuse lighting computation:
float3 lightVec = normalize(In.lightVec);
float3 normal = normalize(In.normal);
float diffuse = saturate(dot(lightVec, normal));
A normalize is essentially a DP3-RSQ-MUL sequence. DP3 and MUL are 3-way vector instructions and RSQ is scalar. The shader above will thus be 3 x DP3 + 2 x MUL + 2 x RSQ for a total of 17 scalar operations.
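To make that count concrete, here is a minimal sketch of what a normalize expands to, with the scalar cost of each step in the comments. The helper name NormalizeExpanded is made up for illustration and is not a real intrinsic, and the actual instruction selection is up to the compiler and driver.

```hlsl
// Hypothetical helper: roughly what normalize(v) boils down to.
float3 NormalizeExpanded(float3 v)
{
    float lenSq  = dot(v, v);     // DP3: 3 scalar operations
    float invLen = rsqrt(lenSq);  // RSQ: 1 scalar operation
    return v * invLen;            // MUL: 3 scalar operations
}
```

That is 7 scalar operations per normalize, so two of them cost 14, and the final DP3 adds 3 for the total of 17. The saturate is not included in this tally, matching the count above.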
Now instead of multiplying the RSQ values into the vectors, why don't we just multiply those scalars into the final scalar instead? Then we would get this shader:
float lightVecRSQ = rsqrt(dot(In.lightVec, In.lightVec));
float normalRSQ = rsqrt(dot(In.normal, In.normal));
float diffuse = saturate(dot(In.lightVec, In.normal) * lightVecRSQ * normalRSQ);
This replaces two vector multiplications with two scalar multiplications, saving us 4 scalar operations and bringing the total down to 13. The math savvy may also recognize that rsqrt(x) * rsqrt(y) = rsqrt(x * y) for positive x and y. So we can simplify it to:
float lightVecSQ = dot(In.lightVec, In.lightVec);
float normalSQ = dot(In.normal, In.normal);
float diffuse = saturate(dot(In.lightVec, In.normal) * rsqrt(lightVecSQ * normalSQ));
We are now down to 12 operations instead of 17. Checking things out in GPU Shader Analyzer showed that the final instruction count is 5 in both cases, but the latter shader leaves more empty scalars, which you can fill with other useful work.
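For reference, the final version can be wrapped up as a small function, with the operation tally spelled out per line. This is only a sketch: the function name DiffuseScalarized and the nDotL and invLen temporaries are made up for illustration, and the inputs are assumed to be the unnormalized interpolated vectors.

```hlsl
// Hypothetical helper; lightVec and normal are NOT assumed to be normalized.
float DiffuseScalarized(float3 lightVec, float3 normal)
{
    float lightVecSQ = dot(lightVec, lightVec);       // DP3: 3
    float normalSQ   = dot(normal, normal);           // DP3: 3
    float nDotL      = dot(lightVec, normal);         // DP3: 3
    float invLen     = rsqrt(lightVecSQ * normalSQ);  // MUL + RSQ: 2
    return saturate(nDotL * invLen);                  // MUL: 1, total 12
}
```

It would be called as float diffuse = DiffuseScalarized(In.lightVec, In.normal);. Note that the savings only hold as long as nothing else in the shader needs the normalized vectors themselves. If they are needed elsewhere, for specular for instance, the vector multiplications have to be paid for anyway.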
It should be mentioned that while this gives the best benefit to modern DX10 cards, it was always good to do this kind of scalarization. It often helps older cards too. For instance, on the R300-R580 generation it often meant more instructions could fit into the scalar pipe (they were vec3+scalar) instead of utilizing the vector pipe.
sqrt[-1]
Saturday, January 31, 2009
Thanks, this is a really useful tip.