
A couple of benchmarks
Saturday, March 21, 2009

I put my SSE vector class to the test to see whether it gives any actual performance improvement over the standard C++ implementation I've used in the past. I set up a test case with an array of 16 million random float4 vectors, multiplied each one with a matrix, and stored the results into a result array of the same size.
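A test like the one described could be set up roughly as follows. This is only a sketch of the plain C++ baseline: the `Float4` and `Matrix4` types, the row-major layout, the identity matrix, the random range, and the function names are all assumptions for illustration, not the code that was actually measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical plain C++ types standing in for the post's vector and matrix classes.
struct Float4 { float x, y, z, w; };
struct Matrix4 { float m[4][4]; };  // row-major (assumption)

// Scalar matrix * vector product: each output component is a row dotted with v.
Float4 mulScalar(const Matrix4& a, const Float4& v) {
    Float4 r;
    r.x = a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z + a.m[0][3] * v.w;
    r.y = a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z + a.m[1][3] * v.w;
    r.z = a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z + a.m[2][3] * v.w;
    r.w = a.m[3][0] * v.x + a.m[3][1] * v.y + a.m[3][2] * v.z + a.m[3][3] * v.w;
    return r;
}

// Fills `count` random vectors, transforms them into `result`,
// and returns the time spent in the transform loop in milliseconds.
double timeScalarTransform(std::size_t count, std::vector<Float4>& result) {
    assert(count > 0);
    std::mt19937 rng(1234);  // fixed seed so runs are repeatable
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<Float4> input(count);
    for (Float4& v : input) {
        v = Float4{dist(rng), dist(rng), dist(rng), dist(rng)};
    }
    result.resize(count);

    // Identity matrix, chosen only to keep the sketch easy to check.
    const Matrix4 mat = {{{1.0f, 0.0f, 0.0f, 0.0f},
                          {0.0f, 1.0f, 0.0f, 0.0f},
                          {0.0f, 0.0f, 1.0f, 0.0f},
                          {0.0f, 0.0f, 0.0f, 1.0f}}};

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        result[i] = mulScalar(mat, input[i]);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling `timeScalarTransform(16u << 20, result)` would match the 16 million vectors mentioned above; that allocates two arrays of 256 MB each.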

First I tested the different implementations against each other. I tested the code compiled to standard FPU code, then with MSVC's /arch:SSE2 option enabled, which uses SSE2 code instead of FPU code most of the time (although it mostly just uses scalar instructions), and then my own implementation using SSE intrinsics. This is the time it took to complete the task:

FPU: 328ms
SSE2: 275ms
Intrinsics: 177ms
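For readers unfamiliar with the intrinsics approach, a matrix times float4 product can be written with SSE roughly like this. It is a minimal sketch, not the vector class from the post: the `Mat4` struct, its column-wise layout, and the function names are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: four columns, each held in one 128-bit SSE register.
struct Mat4 {
    __m128 col[4];
};

// result = M * v, computed as x*col0 + y*col1 + z*col2 + w*col3.
inline __m128 transform(const Mat4& m, __m128 v) {
    // Broadcast each component of v across all four lanes.
    const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    const __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

    __m128 r = _mm_mul_ps(x, m.col[0]);
    r = _mm_add_ps(r, _mm_mul_ps(y, m.col[1]));
    r = _mm_add_ps(r, _mm_mul_ps(z, m.col[2]));
    r = _mm_add_ps(r, _mm_mul_ps(w, m.col[3]));
    return r;
}

// src and dst each hold `count` float4 vectors and must be 16-byte aligned,
// because _mm_load_ps and _mm_store_ps are the aligned variants.
void transformArray(const Mat4& m, const float* src, float* dst, std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i) {
        _mm_store_ps(dst + 4 * i, transform(m, _mm_load_ps(src + 4 * i)));
    }
}
```

Storing the matrix by columns lets the whole product be done with four broadcasts, four multiplies, and three adds, with no horizontal additions.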

That's a decent performance gain. I figured there could be some further gain from unrolling the loop to process four vectors per loop iteration.

Unroll: 165ms
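Unrolling by four could look something like the sketch below. The helper `mulMatVec`, the function `transformUnrolled`, and passing the matrix as four column registers are assumptions made for illustration; the post does not show how its loop was actually unrolled.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>

// cols points at the four matrix columns (hypothetical column-wise layout).
static inline __m128 mulMatVec(const __m128* cols, __m128 v) {
    __m128 r = _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)), cols[0]);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)), cols[1]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)), cols[2]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)), cols[3]));
    return r;
}

// Transforms four float4 vectors per iteration.
// src and dst must be 16-byte aligned; count must be a multiple of 4.
void transformUnrolled(const __m128* cols, const float* src, float* dst, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        const float* s = src + 4 * i;
        float* d = dst + 4 * i;

        // Issue all four loads first so the independent multiply-add
        // chains that follow can overlap.
        const __m128 v0 = _mm_load_ps(s);
        const __m128 v1 = _mm_load_ps(s + 4);
        const __m128 v2 = _mm_load_ps(s + 8);
        const __m128 v3 = _mm_load_ps(s + 12);

        _mm_store_ps(d,      mulMatVec(cols, v0));
        _mm_store_ps(d + 4,  mulMatVec(cols, v1));
        _mm_store_ps(d + 8,  mulMatVec(cols, v2));
        _mm_store_ps(d + 12, mulMatVec(cols, v3));
    }
}
```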

Quite a small gain, so I figured I'm probably more memory-bandwidth bound than computation limited. So I added a prefetch and streaming stores just to see how that affected performance.

Prefetch: 164ms
Stream: 134ms
Prefetch + Stream: 128ms
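Combining a prefetch with streaming stores might look like the following sketch. The prefetch distance, the `_MM_HINT_NTA` hint, and the names `mulCols` and `transformStreamed` are assumptions; the post does not say which hint or distance was used.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_prefetch, _mm_stream_ps, _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>

// cols points at the four matrix columns (hypothetical column-wise layout).
static inline __m128 mulCols(const __m128* cols, __m128 v) {
    __m128 r = _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)), cols[0]);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)), cols[1]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)), cols[2]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)), cols[3]));
    return r;
}

// src and dst each hold `count` float4 vectors and must be 16-byte aligned.
void transformStreamed(const __m128* cols, const float* src, float* dst, std::size_t count) {
    // _mm_stream_ps (MOVNTPS) requires a 16-byte aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);

    // How many vectors ahead to prefetch: a hypothetical value that would need tuning.
    const std::size_t prefetchAhead = 16;

    for (std::size_t i = 0; i < count; ++i) {
        // Hint that input a few iterations ahead will be needed soon.
        // The guard keeps the pointer arithmetic inside the source array.
        if (i + prefetchAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(src + 4 * (i + prefetchAhead)),
                         _MM_HINT_NTA);
        }
        // Non-temporal store: writes the result without pulling the
        // destination cache line into the cache first.
        _mm_stream_ps(dst + 4 * i, mulCols(cols, _mm_load_ps(src + 4 * i)));
    }

    // Make the streamed writes globally visible before the results are read.
    _mm_sfence();
}
```

Streaming stores suit this test because the result array is written once and not read back inside the loop, so caching it would only evict input data.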

The final code runs 2.56x faster than the original. Not too bad.


Michael
Saturday, March 21, 2009

nice gain!

I'd love to see a fleshed-out article about this, perhaps adding a GPU implementation for comparison.

Groovounet
Saturday, March 21, 2009

GPUs get better performance only if the amount of data is large enough, and it depends on whether you are using the data on the GPU. Good to have a good SSE implementation.

It's quite easy to demonstrate an improvement on a specific experiment; I'm looking forward to a real use-case test. Mat4 product and inverse are such great topics.

Symmenthical
Monday, March 23, 2009

Have you tried the MOVNTQ instruction?
If I remember correctly, you can get a real gain with it in SSE; in some tests a few years ago I gained a factor of 2 on some VDub rescale filters.

Humus
Monday, March 23, 2009

That's the "streaming stores" in my post. Or to be correct, it's a MOVNTPS instruction using the _mm_stream_ps() intrinsic, which is the SSE equivalent of the MOVNTQ instruction in MMX.

Symmenthical
Tuesday, March 24, 2009

Yes, I just confused the two instructions; obviously I was thinking about MOVNTPS.
Great job, I'm one of your fans.

Paul
Tuesday, February 1, 2011

Is this SSE code available, please?