"When I despair, I remember that all through history the way of truth and love has always won. There have been tyrants and murderers and for a time they seem invincible, but in the end, they always fall -- think of it, ALWAYS"
- Mohandas K. Gandhi

A couple of benchmarks
Saturday, March 21, 2009 | Permalink

So I put my SSE vector class to the test to see if it would give any actual performance improvement over the standard C++ implementation I've used in the past. So I set up a test case with an array of 16 million random float4 vectors, which I multiplied with a matrix and stored into result array of the same size.

First I tested the diffent implementations against each other. I tested the code compiled to standard FPU code, and then with MSVC's /arch:SSE2 option enabled, which uses SSE2 code instead of FPU most of the time (although mostly just uses scalar instructions), and then my own implementation using SSE intrinsics. This is the time it took to complete the task:

FPU: 328ms
SSE2: 275ms
Intrisics: 177ms

That's a decent performance gain. I figured there could be some performance gain by unrolling the loop and do four vectors loop iteration.

Unroll: 165ms

Quite small gain, so I figured I'm probably more memory bandwidth bound than computation limited. So I added a prefetch and streaming stores just to see how that affected performance.

Prefetch: 164ms
Stream: 134ms
Prefetch + Stream: 128ms

Final code runs 2.56x faster than the original. Not too bad.

[ 6 comments | Last comment by Paul (2011-02-01 21:48:29) ]