How to cut your D3D call cost by a tiny immeasurable fraction
Wednesday, August 4, 2010
One difference between D3D and OpenGL is that the former uses an object-oriented API. All API calls are virtual rather than being plain C calls like in OpenGL. The main advantage of this is of course flexibility. The runtime can easily provide many different implementations and hand you back any one depending on your device creation call parameters. The obvious example of that would be the debug and retail runtimes. I suppose D3DCREATE_PUREDEVICE in DX9 also handed you a different implementation than the standard functions. It's of course faster to have a D3D runtime function that's trimmed down than to have one shared function that looks at IsDebug and IsPure booleans. The disadvantage of having virtual functions is that dispatching virtual function calls comes with a bit of overhead.
One thing to note though is that once you've looked up the actual address for a virtual function, the actual function call is no different than calling a non-virtual function. In fact, the only thing different from a plain C function or static member function is that you pass the this pointer as well. Consider the following D3D11 call:
virtual void STDMETHODCALLTYPE DrawIndexed(UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);
We can declare the equivalent C-style function pointer type like this:
typedef void (STDMETHODCALLTYPE *DrawIndexedFunc)(ID3D11DeviceContext *ctx, UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);
Then we can create a function pointer like so:
DrawIndexedFunc MyDrawIndexed;
And the call to ID3D11DeviceContext::DrawIndexed(...) can be done with MyDrawIndexed(context, ...) provided that MyDrawIndexed has been loaded with the correct function pointer. Very straightforward. So how do we find the function pointer? Virtual functions are looked up through a v-table, which is essentially a list of function pointers for all the virtual functions in the class. When an object of a class which has virtual functions is created, a pointer to a static v-table will be stored in the object. The C++ standard doesn't require any particular memory layout, or even require a v-table at all to solve the virtual function dispatch problem, so this code will be highly unportable. But if you're coding DirectX you're only building for Windows anyway and chances are you're using the MSVC compiler. In that case the v-table pointer will be the very first member of the class. Other compilers may do it differently. From what I gather, some older Unix compilers put the v-table pointer after the data members instead, although compilers following the Itanium C++ ABI, such as current GCC and Clang, also put it first.
Given an ID3D11DeviceContext pointer, let's call it "ctx", the first thing we need to do is grab its v-table:
void **v_table = *(void ***) ctx;
This somewhat cryptic code basically just grabs the first 4 bytes (8 bytes on x64) out of the memory ctx points to, which will be the v-table pointer. Now we just need to know which entry in the table represents DrawIndexed. The hard way is to look in the D3D headers. DrawIndexed is the 6th member declared in ID3D11DeviceContext, however it also inherits from ID3D11DeviceChild which has 4 virtual functions and which in turn inherits from IUnknown which has 3. So it's the 13th function, which puts it at index 12. So we can find the pointer like this:
DrawIndexedFunc MyDrawIndexed = (DrawIndexedFunc) v_table[12];
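If you want to try the lookup without the D3D headers, here is a minimal sketch using a made-up interface. IMockContext, MockContext and LookUpDrawIndexed are hypothetical names for illustration only, and the slot index is 2 here simply because the mock has two methods ahead of DrawIndexed. It assumes an ABI where the v-table pointer sits first in the object and this is passed as an ordinary first argument (x64 MSVC, or GCC/Clang). On 32-bit MSVC both the method and the typedef would need the same explicit calling convention, which is what STDMETHODCALLTYPE provides in the real headers. None of this is guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a COM-style interface; not a real D3D type.
struct IMockContext {
    virtual void Begin() = 0;                                      // slot 0
    virtual void End() = 0;                                        // slot 1
    virtual int DrawIndexed(unsigned indexCount, unsigned startIndex,
                            int baseVertex) = 0;                   // slot 2
};

struct MockContext : IMockContext {
    int drawCalls = 0;
    void Begin() override {}
    void End() override {}
    int DrawIndexed(unsigned indexCount, unsigned, int) override {
        ++drawCalls;
        return static_cast<int>(indexCount);
    }
};

// C-style equivalent: the hidden this pointer becomes an explicit parameter.
typedef int (*DrawIndexedFunc)(IMockContext *ctx, unsigned indexCount,
                               unsigned startIndex, int baseVertex);

// Reads the v-table pointer out of the object and returns slot 2.
// Unportable by design: it relies on the compiler's object layout.
DrawIndexedFunc LookUpDrawIndexed(IMockContext *ctx) {
    assert(ctx != nullptr);
    void **v_table;
    // Same effect as *(void ***) ctx, done with memcpy to sidestep aliasing rules.
    std::memcpy(&v_table, ctx, sizeof(v_table));
    return reinterpret_cast<DrawIndexedFunc>(v_table[2]);
}
```

With a MockContext named mock, LookUpDrawIndexed(&mock)(&mock, 36, 0, 0) ends up in MockContext::DrawIndexed just like mock.DrawIndexed(36, 0, 0) would.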
The easy way to figure this out is to just set a breakpoint at a regular DrawIndexed call and switch to disassembly view to see what code the compiler generated. It could for instance look like this:
mov eax, dword ptr [esi]
mov ecx, dword ptr [eax]
mov edx, dword ptr [ecx+30h]
mov eax, dword ptr [esi]
Here esi points to the class which holds "ctx". So first it grabs the ctx pointer, then grabs the v-table from it, and on the third line looks up the function address at offset 0x30 in the v-table. 0x30 / sizeof(void *) is 12, so there's your index. The following three lines push the arguments to the function on the stack in reverse order, and then the this pointer. The this pointer, which in this case is "ctx", was fetched to eax on the first line.
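The index arithmetic can be checked at compile time. This is only a sketch: the constant and function names are made up, and the method counts are the ones quoted above rather than anything read from the headers. The 0x60 value for x64 is simply 12 slots times 8 bytes, so it is what one would expect to see in a 64-bit disassembly, not something taken from an actual listing.

```cpp
#include <cassert>
#include <cstddef>

// Method counts as quoted in the text (hypothetical constant names).
constexpr std::size_t kIUnknownMethods         = 3;
constexpr std::size_t kDeviceChildMethods      = 4;
constexpr std::size_t kContextMethodsBeforeDraw = 5;  // DrawIndexed is the 6th

// Zero-based v-table index of DrawIndexed: everything declared before it.
constexpr std::size_t kDrawIndexedIndex =
    kIUnknownMethods + kDeviceChildMethods + kContextMethodsBeforeDraw;

// Convert between a byte offset seen in disassembly and a v-table index.
constexpr std::size_t SlotFromOffset(std::size_t byteOffset, std::size_t pointerSize) {
    return byteOffset / pointerSize;
}
constexpr std::size_t OffsetFromSlot(std::size_t slot, std::size_t pointerSize) {
    return slot * pointerSize;
}

static_assert(kDrawIndexedIndex == 12, "header count gives index 12");
static_assert(SlotFromOffset(0x30, 4) == 12, "32-bit: offset 30h is slot 12");
static_assert(OffsetFromSlot(12, 8) == 0x60, "64-bit: slot 12 is offset 60h");
```

Both routes, counting declarations and dividing the offset, land on the same slot, which is a decent sanity check before you hard-code the number.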
Now what happens if we make this call through MyDrawIndexed? Well, this:
mov ecx, dword ptr [esi]
call dword ptr [MyDrawIndexed]
That's two instructions less. Woot!
Also note that the first call was daisy chaining the fetches. Two of those indirections were removed. It should be noted however that for this to work, the MyDrawIndexed variable must either be a static member variable or a global variable. In other words, its address should be resolvable at compile time. If you only have one device context this should be no problem. If you are using multiple contexts, for instance for threaded rendering, you may not want to rely on both contexts having the same function pointers in their v-tables, although this is likely to be true if they were created with the same parameters. You could in that case simply store the function pointer next to the device context in whatever encapsulating class you have, like my "Context" class I referred to earlier. While this is not as optimal, it still cuts down some work:
mov ecx,dword ptr [esi]
mov edx,dword ptr [esi+4]
This is one instruction longer, although still one shorter than the initial code. The most important thing though is that this code still only has one level of indirection, whereas the original one has three.
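Storing the pointer next to the context could look roughly like the sketch below. Everything here is hypothetical: the mock interface stands in for ID3D11DeviceContext, the two mock classes stand in for contexts that happen to have different implementations, and this Context wrapper is not the actual class from my framework. The same layout and calling convention assumptions as before apply.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in interface with DrawIndexed in slot 0.
struct IMockContext {
    virtual int DrawIndexed(unsigned indexCount, unsigned startIndex,
                            int baseVertex) = 0;
};

// Two implementations, as if two contexts were created with different parameters.
struct MockImmediate : IMockContext {
    int DrawIndexed(unsigned n, unsigned, int) override { return static_cast<int>(n); }
};
struct MockDeferred : IMockContext {
    int DrawIndexed(unsigned n, unsigned, int) override { return -static_cast<int>(n); }
};

typedef int (*DrawIndexedFunc)(IMockContext *, unsigned, unsigned, int);

// Wrapper holding the context and its own cached function pointer side by side.
class Context {
public:
    explicit Context(IMockContext *ctx) : m_ctx(ctx) {
        assert(ctx != nullptr);
        void **v_table;
        std::memcpy(&v_table, ctx, sizeof(v_table));  // read the v-table pointer
        m_drawIndexed = reinterpret_cast<DrawIndexedFunc>(v_table[0]);
    }

    // One load for the context, one for the function pointer, then the call.
    int DrawIndexed(unsigned indexCount, unsigned startIndex, int baseVertex) const {
        return m_drawIndexed(m_ctx, indexCount, startIndex, baseVertex);
    }

private:
    IMockContext   *m_ctx;          // first member
    DrawIndexedFunc m_drawIndexed;  // directly after it
};
```

Each Context looks up its pointer from its own object, so nothing depends on two contexts sharing a v-table.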
I should also mention that C++ has some fancy syntax for pointers to C++ member functions. The underlying mechanism for how those work is somewhat different from how standard C functions work. However, using a static or global function pointer the actual code generated with that is the same as with a regular function pointer. If you put it next to "ctx" though it will generate one more instruction, or the same as the original virtual call. It's still only one level of indirection though, so it's still better. The actual function pointer appears to have a sizeof() of 16. I don't know what the unused bytes are for, only the last 8 are actually used in the call. The advantage of using C++ function pointers though is that you can assign to it by name instead of figuring out the v_table index, so it creates somewhat prettier code.
typedef void (STDMETHODCALLTYPE ID3D11DeviceContext::*DrawIndexedFunc)(UINT IndexCount, UINT StartIndexLocation, INT BaseVertexLocation);
DrawIndexedFunc MyDrawIndexed = &ID3D11DeviceContext::DrawIndexed;
And then the fancy calling syntax:
(ctx->*MyDrawIndexed)(IndexCount, StartIndexLocation, BaseVertexLocation);
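For a version that compiles anywhere, here is the member-pointer variant against a made-up interface. IMockContext, MockContext and DrawThroughMemberPointer are hypothetical names, and the mock only mirrors the shape of the real call. This part is standard C++, so unlike the v-table tricks it does not depend on any particular compiler.

```cpp
#include <cassert>

// Hypothetical stand-in interface; not a real D3D type.
struct IMockContext {
    virtual int DrawIndexed(unsigned indexCount, unsigned startIndex,
                            int baseVertex) = 0;
};

struct MockContext : IMockContext {
    int DrawIndexed(unsigned indexCount, unsigned, int) override {
        return static_cast<int>(indexCount);
    }
};

// Pointer-to-member type, assigned by name rather than by v-table index.
typedef int (IMockContext::*DrawIndexedFunc)(unsigned, unsigned, int);
DrawIndexedFunc MyDrawIndexed = &IMockContext::DrawIndexed;

// The ->* operator binds the member pointer to an object for the call.
int DrawThroughMemberPointer(IMockContext *ctx, unsigned indexCount) {
    assert(ctx != nullptr);
    return (ctx->*MyDrawIndexed)(indexCount, 0, 0);
}
```

One thing to keep in mind is that a pointer to a virtual member function still resolves the override when it is called, so the v-table lookup has not necessarily gone away. It may just have moved behind the pointer.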
So what does all this messing around actually gain you? Performance-wise probably somewhere between infinitesimal and nothing. The number of cycles spent inside the DrawIndexed call probably far outweighs any slight gain in calling it. In fact, if you set a breakpoint and step inside the function you will find that you're stepping over quite a large number of instructions before you return. You'll also notice that DrawIndexed in turn calls a few other virtual functions under the hood. If anything, you gain insight into the underlying mechanisms of virtual function calls. Plus of course, messing with v-tables is a lot of fun.