A couple of benchmarks
Saturday, March 21, 2009 | Permalink
So I put my SSE vector class to the test to see if it would give any actual performance improvement over the standard C++ implementation I've used in the past. I set up a test case with an array of 16 million random float4 vectors, which I multiplied with a matrix and stored into a result array of the same size.
First I tested the different implementations against each other. I tested the code compiled to standard FPU code, then with MSVC's /arch:SSE2 option enabled, which uses SSE2 code instead of FPU most of the time (although it mostly just uses scalar instructions), and then my own implementation using SSE intrinsics. This is the time it took to complete the task:
FPU: 328ms
SSE2: 275ms
Intrinsics: 177ms
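For readers who want something concrete, here is a minimal sketch of what an intrinsics version of such a loop could look like. It is not the vector class from this post: the Mat4 struct, its column layout, and the transform_sse name are assumptions made for illustration, and both arrays are assumed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_load_ps, _mm_mul_ps, ...)
#include <cstddef>

// Hypothetical layout: four matrix columns, each a 16-byte-aligned float4.
struct alignas(16) Mat4 { float col[4][4]; };

// out[i] = m * in[i] for n float4 vectors; in and out must be 16-byte aligned.
void transform_sse(const Mat4& m, const float* in, float* out, std::size_t n)
{
    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);

    for (std::size_t i = 0; i < n; ++i) {
        const __m128 v = _mm_load_ps(in + 4 * i);
        // Broadcast each component of v across a whole register.
        const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
        // m * v is the sum of the columns scaled by x, y, z and w.
        const __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
            _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_store_ps(out + 4 * i, r);
    }
}
```

Storing the matrix by columns keeps the inner loop down to four shuffles, four multiplies and three adds per vector, with no horizontal additions.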
That's a decent performance gain. I figured there could be some further gain from unrolling the loop and doing four vectors per loop iteration.
Unroll: 165ms
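A rough sketch of the unrolling idea, under the same assumptions as before (a hypothetical column-major Mat4 and 16-byte-aligned arrays; mul_cols and transform_unrolled are illustrative names, not code from the benchmark). Four independent vectors are processed per iteration, and a scalar-count tail loop handles whatever is left when the count isn't a multiple of four.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hypothetical layout: four matrix columns, each a 16-byte-aligned float4.
struct alignas(16) Mat4 { float col[4][4]; };

// m * v as a linear combination of the matrix columns.
static inline __m128 mul_cols(__m128 c0, __m128 c1, __m128 c2, __m128 c3, __m128 v)
{
    const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
                      _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
}

// Same transform, four vectors (64 bytes) per loop iteration.
void transform_unrolled(const Mat4& m, const float* in, float* out, std::size_t n)
{
    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const float* src = in + 4 * i;
        float* dst = out + 4 * i;
        // Four independent results give the CPU more work to overlap.
        const __m128 r0 = mul_cols(c0, c1, c2, c3, _mm_load_ps(src));
        const __m128 r1 = mul_cols(c0, c1, c2, c3, _mm_load_ps(src + 4));
        const __m128 r2 = mul_cols(c0, c1, c2, c3, _mm_load_ps(src + 8));
        const __m128 r3 = mul_cols(c0, c1, c2, c3, _mm_load_ps(src + 12));
        _mm_store_ps(dst, r0);
        _mm_store_ps(dst + 4, r1);
        _mm_store_ps(dst + 8, r2);
        _mm_store_ps(dst + 12, r3);
    }
    // Remainder: fewer than four vectors left.
    for (; i < n; ++i)
        _mm_store_ps(out + 4 * i, mul_cols(c0, c1, c2, c3, _mm_load_ps(in + 4 * i)));
}
```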
Quite a small gain, so I figured I'm probably more memory bandwidth bound than computation limited. So I added a prefetch and streaming stores just to see how that affected performance.
Prefetch: 164ms
Stream: 134ms
Prefetch + Stream: 128ms
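As an illustration of the prefetch and streaming-store combination, here is a sketch built on the standard _mm_prefetch, _mm_stream_ps and _mm_sfence intrinsics. The Mat4 layout, the function name and especially the prefetch distance (kAhead) are assumptions; the distance is an untuned guess, not a value taken from the benchmark. The output array must be 16-byte aligned for _mm_stream_ps.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hypothetical layout: four matrix columns, each a 16-byte-aligned float4.
struct alignas(16) Mat4 { float col[4][4]; };

// out[i] = m * in[i], prefetching input ahead and bypassing the cache on output.
void transform_stream(const Mat4& m, const float* in, float* out, std::size_t n)
{
    const std::size_t kAhead = 32;  // prefetch distance in vectors (512 bytes); an untuned guess

    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);

    for (std::size_t i = 0; i < n; ++i) {
        // Hint that input a few cache lines ahead will be needed soon.
        if (i + kAhead < n)
            _mm_prefetch(reinterpret_cast<const char*>(in + 4 * (i + kAhead)), _MM_HINT_NTA);

        const __m128 v = _mm_load_ps(in + 4 * i);
        const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
        const __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
            _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));

        // Non-temporal store (MOVNTPS): write the result without pulling
        // the destination cache line into the cache first.
        _mm_stream_ps(out + 4 * i, r);
    }
    // Make the streamed writes globally visible before anyone reads 'out'.
    _mm_sfence();
}
```

Streaming stores only pay off when the results aren't read back right away, which fits a test like this one where the output array is far larger than the cache.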
Final code runs 2.56x faster than the original. Not too bad.
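The numbers above can be sanity-checked with a little arithmetic. The sketch below assumes "16 million" means 16 x 10^6 vectors of 16 bytes each and counts only the useful traffic (one read and one write per vector); the function names are made up for this example.

```cpp
// Speedup of a new timing relative to a baseline timing.
constexpr double speedup(double base_ms, double new_ms)
{
    return base_ms / new_ms;
}

// Useful memory traffic in GB/s: each float4 (16 bytes) is read once and written once.
constexpr double effective_gb_per_s(double vectors, double ms)
{
    return (vectors * 16.0 * 2.0) / (ms * 1e-3) / 1e9;
}
```

With the timings from this post, speedup(328, 128) comes out at about 2.56, and effective_gb_per_s(16e6, 128) at roughly 4 GB/s of useful traffic. Ordinary stores typically also read each destination cache line before writing it, which is traffic that streaming stores avoid.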
Michael
Saturday, March 21, 2009
nice gain!
I'd love to see a fleshed out article about this- perhaps adding a GPU implementation for comparison
Groovounet
Saturday, March 21, 2009
GPUs get better performance only if the amount of data is large enough, and it depends on whether you are using it on the GPU. Good to have a good SSE implementation.
It's quite easy to demonstrate an improvement on a specific experiment; I'm looking forward to a real use-case test. Mat4 product and inverse are such great topics.
Symmenthical
Monday, March 23, 2009
Have you tried the MOVNTQ instruction?
If I remember correctly, you can get a real gain with SSE. On some tests a few years ago I gained a factor of 2 on some VDub rescale filters.
Humus
Monday, March 23, 2009
That's the "streaming stores" in my post. Or to be correct, it's a MOVNTPS instruction using the _mm_stream_ps() intrinsic, which is the SSE equivalent of the MOVNTQ instruction for MMX registers.
Symmenthical
Tuesday, March 24, 2009
Yes, I just confused the two instructions, but obviously I was thinking about MOVNTPS.
Great job, I'm one of your fans.
Paul
Tuesday, February 1, 2011
Is this SSE code available, please?