Humus - Comments

DPPS (or why don't they ever get SSE right?)
Monday, March 16, 2009

So in my work on creating a new framework I've come to my vector class. So I decided to make use of SSE. I figure SSE3 is mainstream now so that's what I'm going to use as the baseline, with optional SSE4 support in case I ever need the extra performance, enabled with a USE_SSE4 #define.

Now, SSE is an instruction set that was large to begin with and has grown a lot with every new revision:
SSE: 70 instructions
SSE2: 144 instructions
SSE3: 13 instructions
SSSE3: 32 instructions
SSE4: 54 instructions
SSE4a: 4 instructions
SSE5: 170 instructions (not in any CPUs on the market yet)

Why all these instructions? Well, perhaps because they can't seem to get things right from the start. So new instructions are needed to overcome old limitations. There are loads of very specialized instructions while arguably very generic and useful instructions have long been missing. A dot product instruction should've been in the first SSE revision. Or at the very least a horizontal add. We got that in SSE3 finally. Yay! Only 6 years after 3DNow had that feature. As the name would make you believe, 3DNow was in its first revision very useful for anything related to 3D math, despite its limited instruction set of only 21 instructions (although to be fair it shared registers with MMX and thus didn't need to add new instructions for stuff that could already be done with MMX instructions).

So why this rant? Well, DPPS is an instruction that would at first make you think Intel finally got something really right about SSE. Maybe they have listened to a game developer for once. We finally have a dot product instruction. Yay! To their credit, it's more flexible than I ever expected such an instruction to be. But it disturbs me that instead of making it perfect they had to screw up one detail, which drastically reduces the usefulness of this instruction.

The instruction comes with an 8-bit immediate operand, which holds a 4-bit read mask and a 4-bit write mask. The read mask is done right. It selects which components to use in the dot product. So you can easily make a three- or two-component dot product, or even use XZ, for instance, for computing distance in the XZ plane. The write mask, on the other hand, is not really a write mask. Instead of simply selecting which components you want to write the result to, you select which components get the result, and the rest are set to zero. Why oh why, Intel? Why would I want to write zero to the remaining components? Couldn't you have let me preserve the values in the register instead? If I wanted them as zero I could have first cleared the register and then done the DPPS. Had the DPPS instruction had a real write mask we could've implemented a matrix-vector multiplication in four instructions. Now I have to write to different registers and then merge them with OR operations, which in addition to wasting precious registers also adds up to 7 instructions in total instead of 4, which ironically is the same number of instructions you needed in SSE3 to do the same thing. Aaargghh!!!!
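To make the mask behaviour concrete, here is a small scalar model of DPPS in plain C++. It is only a sketch, not the real instruction: `dpps_model`, `or_model` and `mat_vec_model` are made-up names for illustration, and the model adds the products in simple left-to-right order, which may round differently from hardware. In the immediate, the high four bits are the read mask and the low four bits are the "write" mask. The real intrinsic is `_mm_dp_ps` from `<smmintrin.h>`.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Vec4 = std::array<float, 4>;

// Scalar model of DPPS (hypothetical helper, not the intrinsic).
// Bits 4..7 of imm8 pick which component products go into the sum.
// Bits 0..3 pick which output lanes receive the sum; all other lanes become 0.
Vec4 dpps_model(const Vec4& a, const Vec4& b, std::uint8_t imm8) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (0x10 << i)) sum += a[i] * b[i];
    }
    Vec4 out{};  // zero-initialized: unselected lanes are cleared, not preserved
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (1 << i)) out[i] = sum;
    }
    return out;
}

// Lane-wise bitwise OR of two vectors, modelling the merge step.
Vec4 or_model(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (int i = 0; i < 4; ++i) {
        std::uint32_t x = 0, y = 0;
        std::memcpy(&x, &a[i], sizeof x);
        std::memcpy(&y, &b[i], sizeof y);
        x |= y;
        std::memcpy(&out[i], &x, sizeof x);
    }
    return out;
}

// Matrix-vector product the way the zeroing forces it:
// four dot products into four separate temporaries, then three ORs.
Vec4 mat_vec_model(const std::array<Vec4, 4>& rows, const Vec4& v) {
    const Vec4 x = dpps_model(rows[0], v, 0xF1);  // result in lane 0 only
    const Vec4 y = dpps_model(rows[1], v, 0xF2);  // result in lane 1 only
    const Vec4 z = dpps_model(rows[2], v, 0xF4);  // result in lane 2 only
    const Vec4 w = dpps_model(rows[3], v, 0xF8);  // result in lane 3 only
    return or_model(or_model(x, y), or_model(z, w));
}
```

In this model `dpps_model(a, b, 0x71)` is a three-component dot product that lands in lane 0 with zeros everywhere else, and `mat_vec_model` needs four dot products plus three ORs, which is where the count of 7 comes from. With a write mask that preserved the other lanes, the four dot products could all target the same register and the ORs would disappear.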

Aras Pranckevicius
Tuesday, March 17, 2009

We had a very similar thought at work the other day. If you take a look at PS2 VUs, or PPC Altivec, or ARM VFP - they are all reasonable. And then you have SSE with its umpteen revisions -- and it still hasn't got things right.

Greg
Tuesday, March 17, 2009

How much speed does SSE bring to a Vector3f? And can't a compiler vectorize the code itself?

Also, how do you effectively measure the performance gain?

Groovounet
Tuesday, March 17, 2009

What's actually worse is that on some CPUs (Q6600 and co, if I remember correctly) the number of cycles needed for DPPS is so high that you'd rather use the SSE2 instructions. I think it's fixed now with the Q9300 & co, but I never tested it.

I think it was much the same with the horizontal add on the P3: not efficient enough at introduction.

I wish Intel were interested in 3D; then they would know that the dot product is a common operation.

Groovounet
Tuesday, March 17, 2009

An instruction called rdtsc lets you count the number of cycles taken by a sequence of instructions. It needs to be used carefully, but it works quite well.
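As a rough sketch of that kind of measurement, the snippet below reads the time-stamp counter before and after a piece of work and keeps the smallest difference over many runs. `read_ticks` and `min_ticks` are names invented for this example. `__rdtsc()` is the compiler intrinsic for the instruction on x86, and the `steady_clock` fallback is only there so the sketch still compiles on other architectures, where the numbers are no longer TSC ticks.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define SKETCH_HAS_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
  #include <x86intrin.h>
  #define SKETCH_HAS_RDTSC 1
#else
  #include <chrono>
  #define SKETCH_HAS_RDTSC 0
#endif

// Read a tick counter: the TSC on x86, a monotonic clock elsewhere.
inline std::uint64_t read_ticks() {
#if SKETCH_HAS_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Run `work` `runs` times and return the smallest tick difference seen.
// Taking the minimum is one simple way to reduce noise from interrupts
// and cold caches; it is an assumption of this sketch, not a rule.
template <typename F>
std::uint64_t min_ticks(F&& work, int runs) {
    std::uint64_t best = ~std::uint64_t{0};
    for (int i = 0; i < runs; ++i) {
        const std::uint64_t t0 = read_ticks();
        work();
        const std::uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}
```

The "carefully" part matters: rdtsc is not a serializing instruction, so very short sequences can be reordered around it, and on many recent CPUs the counter ticks at a fixed rate rather than at the current core clock. Treat the result as a relative number for comparing two versions on the same machine.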

A compiler can vectorize code; that's what Visual C++ does. It's quite efficient but far from what a human can do in some cases. If C++ code isn't written with SIMD in mind, the compiler won't get much out of it because it will be hard to vectorize.

The scenario of vector and matrix classes is tricky. For some operations, like matrix product or vector-matrix product, you can gain a lot (60 cycles and 36 cycles on a Q6600), but when it comes to basic things like initialization or additions... the compiler works well!

It's a mistake to write everything in asm. It's fun, but it breaks the compiler's ability to optimize the code, especially cross-function optimizations.

It's all about testing.

Finally, don't bother too much! A co-worker and I had some fun optimizing the same code. We ended up finding that the code he wrote was faster on his computer (Athlon X2 6000+) and the code I wrote was faster on my Q6600. Now with Core i7 I'm sure that we could rewrite the code to reach better performance.

From CPU to CPU the number of cycles for each instruction changes, so what you took as an optimization could become slower on other CPUs because of variation in the instruction cycle counts. Even the order of instructions changes efficiency because of instruction latencies.

Humus
Tuesday, March 17, 2009

I haven't run any benchmarks so far, but I of course intend to verify that there's a performance benefit. From my past attempts with SSE I know it can certainly be worth it. So far I've only verified that my code works and that the compiler generates reasonable code. From what I can see it does pretty much what I expect. I was initially afraid there would be overhead from unnecessary loading and storing, but it in fact generates very good code from what I can tell. You do have to make sure that you enable pretty much every optimization, though, especially inlining of small functions, link-time code generation, and using SSE for general floating point instead of the FPU.

I'm of course using intrinsics rather than assembly. Using assembly is only an option if you have a quite long sequence of instructions. Besides, inline assembly is not supported anymore for 64-bit code, so I don't want to use it unless I have to. It's also nice to see that GCC supports the same intrinsics, so my code worked in Linux with very minor changes.

Groovounet
Wednesday, March 18, 2009

How do you plan to handle vec3? When storage space matters, you're quite unlikely to use vec4 in place of vec3...

Humus
Wednesday, March 18, 2009

I'll probably do two different classes, one "real" float3, and one that's more like float3_as_float4.
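A minimal sketch of what such a split could look like, keeping the two names from above. The member layout, the 16-byte alignment, the unused `w` lane and the `widen`/`narrow` helpers are all assumptions made for illustration, not settled design:

```cpp
#include <cstddef>

// Tightly packed storage type: 12 bytes, good for large arrays on disk or in
// vertex buffers.
struct float3 {
    float x, y, z;
};

// Compute type: padded to 16 bytes and 16-byte aligned so it can map onto one
// SSE register. The fourth lane (w) is padding here and kept at zero.
struct alignas(16) float3_as_float4 {
    float x, y, z, w;
};

static_assert(sizeof(float3) == 12, "float3 is expected to be tightly packed");
static_assert(sizeof(float3_as_float4) == 16, "padded type fills one register");
static_assert(alignof(float3_as_float4) == 16, "padded type is 16-byte aligned");

// Hypothetical conversion helpers between the storage and compute forms.
inline float3_as_float4 widen(const float3& v) {
    return float3_as_float4{v.x, v.y, v.z, 0.0f};
}

inline float3 narrow(const float3_as_float4& v) {
    return float3{v.x, v.y, v.z};
}
```

The trade-off is the one raised in the question: the padded type costs a third more memory per vector, so it suits temporaries and math-heavy code, while the packed type suits bulk storage.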

Rohit Garg
Thursday, March 19, 2009

Since when has Intel designed good ISAs?
