A couple of benchmarks
Saturday, March 21, 2009
I put my SSE vector class to the test to see whether it would give any actual performance improvement over the standard C++ implementation I've used in the past. I set up a test case with an array of 16 million random float4 vectors, multiplied each one with a matrix, and stored the results into an array of the same size.
First I tested the different implementations against each other. I tested the code compiled to standard FPU code, then with MSVC's /arch:SSE2 option enabled, which uses SSE2 code instead of FPU most of the time (although it mostly just uses scalar instructions), and finally my own implementation using SSE intrinsics. This is the time it took to complete the task:
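For reference, the plain C++ baseline for a test like this could look roughly like the sketch below. The names (Float4, Matrix4, makeRandomVectors, transformArray), the row-major layout and the random range are my assumptions for illustration, not the actual code that was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-ins for the vector and matrix types used in the test.
struct Float4 { float x, y, z, w; };
struct Matrix4 { float m[4][4]; };  // row-major: m[row][column]

// Fill an array with random vectors; the seed and the [-1, 1] range are arbitrary.
std::vector<Float4> makeRandomVectors(std::size_t count, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<Float4> vectors(count);
    for (auto& v : vectors) v = {dist(rng), dist(rng), dist(rng), dist(rng)};
    return vectors;
}

// Scalar matrix * vector: each result component is one row dotted with v.
inline Float4 transform(const Matrix4& mat, const Float4& v) {
    Float4 r;
    r.x = mat.m[0][0] * v.x + mat.m[0][1] * v.y + mat.m[0][2] * v.z + mat.m[0][3] * v.w;
    r.y = mat.m[1][0] * v.x + mat.m[1][1] * v.y + mat.m[1][2] * v.z + mat.m[1][3] * v.w;
    r.z = mat.m[2][0] * v.x + mat.m[2][1] * v.y + mat.m[2][2] * v.z + mat.m[2][3] * v.w;
    r.w = mat.m[3][0] * v.x + mat.m[3][1] * v.y + mat.m[3][2] * v.z + mat.m[3][3] * v.w;
    return r;
}

// The loop being timed: transform every input vector into the result array.
void transformArray(const Matrix4& mat, const std::vector<Float4>& in,
                    std::vector<Float4>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = transform(mat, in[i]);
}
```

Compiled normally this gives the FPU variant; the same source built with /arch:SSE2 would correspond to the second measurement.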
FPU: 328ms
SSE2: 275ms
Intrinsics: 177ms
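An intrinsics version of the matrix multiply can be sketched as follows. This is a minimal sketch, not my actual vector class: Matrix4SSE, transformSSE and transformArraySSE are made-up names, and I'm assuming the matrix is stored as four columns so that the product becomes col0*x + col1*y + col2*z + col3*w, plus 16-byte aligned arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical layout: the four matrix columns, one SSE register each.
struct alignas(16) Matrix4SSE { __m128 col[4]; };

// M * v computed as a sum of columns scaled by the components of v.
inline __m128 transformSSE(const Matrix4SSE& m, __m128 v) {
    // Broadcast each component of v across a whole register.
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

    __m128 r = _mm_mul_ps(m.col[0], x);
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], y));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], z));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], w));
    return r;
}

// in and out hold count float4 vectors (4 floats each) and must be 16-byte aligned.
void transformArraySSE(const Matrix4SSE& m, const float* in, float* out,
                       std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i) {
        __m128 v = _mm_load_ps(in + 4 * i);
        _mm_store_ps(out + 4 * i, transformSSE(m, v));
    }
}
```

The point of the column layout is that all four result components are produced by four multiplies and three adds on whole registers, with no horizontal sums.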
Going from 328ms to 177ms is a decent performance gain. I figured there could be some further gain from unrolling the loop and doing four vectors per loop iteration.
Unroll: 165ms
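The unrolled loop might look something like this. Again a sketch under assumptions: Mat4, mulVec and transformUnrolled are hypothetical names, the matrix is stored as four columns, the arrays are 16-byte aligned, and the vector count is a multiple of four so no remainder loop is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical layout: the four matrix columns, one SSE register each.
struct alignas(16) Mat4 { __m128 col[4]; };

// M * v as a sum of columns scaled by the broadcast components of v.
static inline __m128 mulVec(const Mat4& m, __m128 v) {
    __m128 r = _mm_mul_ps(m.col[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))));
    return r;
}

// Four vectors (64 bytes) per iteration; count is the number of float4 vectors.
void transformUnrolled(const Mat4& m, const float* in, float* out, std::size_t count) {
    assert(count % 4 == 0);  // this sketch has no remainder loop
    for (std::size_t i = 0; i < count; i += 4) {
        const float* src = in + 4 * i;
        float* dst = out + 4 * i;
        // Issue all four loads first so the multiplies are independent of each other.
        __m128 v0 = _mm_load_ps(src + 0);
        __m128 v1 = _mm_load_ps(src + 4);
        __m128 v2 = _mm_load_ps(src + 8);
        __m128 v3 = _mm_load_ps(src + 12);
        _mm_store_ps(dst + 0, mulVec(m, v0));
        _mm_store_ps(dst + 4, mulVec(m, v1));
        _mm_store_ps(dst + 8, mulVec(m, v2));
        _mm_store_ps(dst + 12, mulVec(m, v3));
    }
}
```

Four float4 vectors are 64 bytes, which matches the cache line size on many x86 CPUs, so one iteration handles about one cache line of input.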
Quite a small gain, so I figured I'm probably more memory bandwidth bound than computation limited. So I added a prefetch and streaming stores just to see how that affected performance.
Prefetch: 164ms
Stream: 134ms
Prefetch + Stream: 128ms
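A sketch of the combined prefetch and streaming store variant is below. The names are hypothetical, and both the prefetch distance (16 vectors ahead) and the non-temporal hint are guesses for illustration rather than the values I measured with; they would need tuning per machine. It assumes 16-byte aligned arrays and a vector count that is a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, including prefetch and streaming stores

// Hypothetical layout: the four matrix columns, one SSE register each.
struct alignas(16) Mat4 { __m128 col[4]; };

// M * v as a sum of columns scaled by the broadcast components of v.
static inline __m128 mulVec(const Mat4& m, __m128 v) {
    __m128 r = _mm_mul_ps(m.col[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))));
    return r;
}

void transformStreamed(const Mat4& m, const float* in, float* out, std::size_t count) {
    assert(count % 4 == 0);  // this sketch has no remainder loop
    const std::size_t kPrefetchAhead = 16;  // in vectors (256 bytes); an assumed value
    for (std::size_t i = 0; i < count; i += 4) {
        // Ask for input we will need a few iterations from now, staying inside the array.
        if (i + kPrefetchAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(in + 4 * (i + kPrefetchAhead)),
                         _MM_HINT_NTA);
        }
        const float* src = in + 4 * i;
        float* dst = out + 4 * i;
        // Streaming stores write around the cache, since the results are not read back soon.
        _mm_stream_ps(dst + 0, mulVec(m, _mm_load_ps(src + 0)));
        _mm_stream_ps(dst + 4, mulVec(m, _mm_load_ps(src + 4)));
        _mm_stream_ps(dst + 8, mulVec(m, _mm_load_ps(src + 8)));
        _mm_stream_ps(dst + 12, mulVec(m, _mm_load_ps(src + 12)));
    }
    _mm_sfence();  // make the streamed results visible before anyone reads them
}
```

Streaming stores only pay off when the output is not touched again right away, which is the case here: the result array is far larger than the cache and is only written once.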
The final code runs 2.56x faster than the original FPU build (328ms down to 128ms). Not too bad.