"They have mis-underestimated us!"
- George W. Bush

Some thoughts on the compute shader and the future
Saturday, May 16, 2009 | Permalink

GPGPU has been a buzzword for a while now. The first serious attempts in this area came around 2005/2006 with shader model 3.0 cards, and while the potential has always been great, up until now it's basically been a field for a small number of researchers with very specialized applications. One of the problems has been that to use the GPU for general computing you had to go through graphics APIs like DirectX and OpenGL. These were designed for graphics, not for general computations. To really use a GPU to its full potential you often needed more low-level access to the hardware. This resulted in vendor-specific APIs like CTM, Stream SDK, CUDA etc. From the original stumbling attempts the technology in this field has moved at an amazing pace, and while GPGPU is still not exactly mainstream, it's just a matter of time now before it is. The main driving forces behind this are the compute shader in DX11 and OpenCL. With vendor-agnostic APIs and straightforward interaction with game graphics, we are likely to see more generic use of the GPU in future games. Most games will probably still stick to DirectX, so the DX11 compute shader is likely to be the most relevant API for game developers, whereas researchers and general application developers will probably prefer OpenCL.

An interesting thing about the compute shader is that while the programming approach is radically different from traditional vertex and pixel shader programming, the underlying hardware hasn't radically changed to accommodate this change in API. It's mostly about using existing hardware capabilities in a smarter way. Yes, there are features being added and hardware of course continues to get more flexible, but the radical change is in how we program the GPU, rather than in how the GPU itself works. This is illustrated by the fact that DX11 adds compute shader 4.0/4.1 profiles for existing DX10 and DX10.1 hardware.

So what separates a GPU from a CPU? How come a GPU is generally able to reach its full potential and extract close to theoretical max performance, whereas a CPU rarely gets anywhere close to its theoretical performance? The difference is in how they approach long-latency operations, in particular memory accesses. When a CPU needs a value from memory it will stall waiting for the value to arrive and then continue execution once it's available. To avoid stalling all the time, CPUs have large caches so that the huge stalls can be avoided for frequently accessed data. In fact, these days CPUs contain more cache than computing logic. GPUs on the other hand are mostly logic and little cache. While there are some small caches to improve performance of localized accesses, the main approach is to hide the latency of memory accesses through threading. Basically, when a GPU hits a memory access, such as a texture fetch, it won't stall waiting for it to arrive; instead it'll just switch to another thread and continue working on that one. And when that thread hits a memory access, it'll switch to the next. At some point it has launched as many threads as it can run simultaneously, and will return to the first thread again. At that point chances are the memory request has finished and the results are readily available in the destination register. Because of this approach, GPUs, unlike CPUs, need a very large register set to hold the active values of all threads in flight. The more registers (GPRs) a shader uses, the fewer threads a GPU can fit into the register set. Thus the number of GPRs can have a significant impact on performance, since fewer threads means a greater chance that by the time you return to a thread that issued a memory request, the request is not yet complete.
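As a rough back-of-the-envelope illustration (the numbers here are invented purely for the sake of the example and don't describe any particular GPU): if the register set has room for, say, 16384 GPRs, then the number of threads the GPU can keep in flight is simply that capacity divided by the GPRs each thread needs:

$$\frac{16384}{8\ \text{GPRs per thread}} = 2048\ \text{threads} \qquad \text{vs.} \qquad \frac{16384}{32\ \text{GPRs per thread}} = 512\ \text{threads}$$

So a shader that needs 32 GPRs instead of 8 has only a quarter as many threads available for hiding memory latency.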

A couple of side notes:
While I don't think the size of the NV30's register set was ever publicly documented, the notably poor performance of that chip was probably mainly down to a small register set, which would also explain why the Cg compiler at the time quite heavily traded extra ALU instructions for fewer GPRs in shaders.
There are also some GPU-like approaches in the CPU world. For instance, Intel's HyperThreading also hides latencies through threading, hence the name of the technology. Instead of idling while a memory request finishes, it switches to another thread. So a single core can appear as two logical cores, run two different threads, and switch between the two while it's waiting for memory.
The huge register set of GPUs also has an equivalent in the SPUs of the Cell processor. The SPUs don't have the automatic threading of GPUs, but you could manually emulate that behavior.

So what is a thread in the context of GPUs? It's basically an instance running the shader, for example a pixel in a pixel shader, or a vertex in a vertex shader. While GPUs have always done threading, it has been implicit and hidden from the developer, and it's not very relevant in the programming model of vertex and pixel shaders since each thread is completely independent of all other threads anyway. What the compute shader does is make threading a little bit more explicit. In the shader you tell the dispatcher how many threads you want in a threadgroup, and more importantly you can share data between threads in a threadgroup. So while each thread gets its own registers as usual, there are also some registers that are shared between threads in a threadgroup. Given that they are shared, they don't increase the register pressure by much. The biggest chunk will still generally be the local registers for each individual thread. And in practice you'll probably see the number of local registers needed go down, since values that previously had to be duplicated in each thread's registers can now be shared, with some care from the developer. As a result, the register pressure goes down, potentially increasing performance.
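To make this a bit more concrete, here's a minimal sketch in CUDA, where the same concepts exist under different names: a thread block corresponds to a threadgroup, __shared__ storage to the shared registers, and __syncthreads() to a sync point. The kernel and its sizes are made up purely for illustration:

```cuda
// Minimal sketch: a 64-thread group where each thread does exactly one memory
// fetch into shared storage, and then freely reads its neighbor's value.
__global__ void share_with_neighbor(const float* input, float* output, int count)
{
    __shared__ float tile[64];          // shared between all 64 threads in the group

    int local  = threadIdx.x;                       // index within the threadgroup
    int global = blockIdx.x * blockDim.x + local;   // index within the whole dispatch

    // One fetch per thread into the shared registers.
    tile[local] = (global < count) ? input[global] : 0.0f;

    // Sync point: make sure every thread's value is written before anyone reads it.
    __syncthreads();

    // Any thread can now use its neighbor's value without another memory fetch.
    int neighbor = (local + 1) % 64;
    if (global < count)
        output[global] = 0.5f * (tile[local] + tile[neighbor]);
}
```

Each thread still has its own local variables in GPRs as before; only the tile array lives in the storage shared by the whole threadgroup. It would be launched with 64 threads per group, e.g. share_with_neighbor<<<(count + 63) / 64, 64>>>(input, output, count).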

So with this long introduction, what can we expect of future hardware and software? For developers the initial challenge will be to wrap their heads around the compute shader model and understand how it maps to the underlying hardware. The vertex and pixel shader model is easy and intuitive. The compute shader, not so much. The good news is that developers who know how to write a good compute shader likely also have a better understanding of the hardware, and will probably write better vertex and pixel shaders too.
Once you get into the idea of explicit threading on the GPU, you will probably realize that threading on the GPU, like threading on the CPU, in addition to being able to extract additional performance also opens up plenty of opportunities for shooting yourself in the foot. The joy of race conditions is coming to the GPU. Hooray! So whereas any working pixel shader written today can be expected to work on any future hardware, if you mess up with the compute shader it could break on future hardware. For instance, if you forget a sync point, or accidentally write results to the same memory address from two different threads, the timing on current hardware could be such that it happens to work anyway, whereas newer hardware gets slightly different timing and the end result in shared registers or memory comes out different. The good news is that if you're doing graphics you'll probably notice most race conditions much more easily than in CPU code, because for instance some strange flickering might show up randomly. But chances are that at some point a game will ship with an undetected broken compute shader that breaks on future hardware. I'm going to guess that by, say, 2012 we'll see some games from 2010/2011 break on the latest and greatest GPU.
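To illustrate how easy it is to get wrong, here's a contrived CUDA sketch (the same idea applies to the compute shader) where leaving out a single sync point turns working code into a race:

```cuda
// Each 64-thread group reverses its chunk of the input through shared storage.
__global__ void reverse_in_group(const float* input, float* output)
{
    __shared__ float tile[64];

    int i    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[i] = input[base + i];

    // Without this sync point, tile[63 - i] may not have been written yet by the
    // other thread. On some hardware the timing happens to hide the bug; on other
    // hardware it won't, and the output silently changes.
    __syncthreads();

    output[base + i] = tile[63 - i];
}
```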

Another important aspect of this added flexibility of the hardware is how it will change the work balance of the GPU. Over the years the ALU:TEX ratio has slowly increased. Shaders get more and more compute bound, and hardware is ramping up ALU power faster than texturing power. One factor that has slowed down this process is that there still exist some important parts of many games that are heavily TEX bound. This is particularly true for post-effects. The good news is that the compute shader comes in very handy for most post-effects, and by sharing data between threads you can significantly cut down on the texture fetches needed. For instance, instead of running a 3x3 blur filter in the pixel shader and requiring 9 texture fetches per pixel, you could launch a threadgroup of 10x10 where each thread takes one texture sample and writes it to the shared registers, and then 8x8 results are computed and written to memory. This takes the number of texture fetches per pixel down from 9 to an average of 100/64 = 1.5625, a 5.76x reduction. Given that the compute shader will likely replace most pixel shader based post-effects, there remain very few reasons to be conservative about the ALU:TEX ratio, so most likely in the next couple of generations we'll see the ratio increase faster than it has so far.
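Here's a rough sketch of that blur scheme, again in CUDA rather than HLSL and with made-up names, just to show the structure: a 10x10 threadgroup loads a 10x10 tile into shared storage with one fetch per thread, and the inner 8x8 threads then average their 3x3 neighborhoods straight out of the shared tile.

```cuda
// Tiled 3x3 box blur: one 10x10 threadgroup produces an 8x8 block of output
// pixels, with exactly one memory fetch per thread (100 fetches for 64 pixels).
__global__ void box_blur_3x3(const float* src, float* dst, int width, int height)
{
    __shared__ float tile[10][10];

    int tx = threadIdx.x;                  // 0..9
    int ty = threadIdx.y;                  // 0..9
    int blockX = blockIdx.x * 8;           // top-left of this group's 8x8 output block
    int blockY = blockIdx.y * 8;

    // Fetch one texel each, offset by -1 so the tile includes a one-pixel border,
    // clamped at the image edges.
    int x = min(max(blockX + tx - 1, 0), width  - 1);
    int y = min(max(blockY + ty - 1, 0), height - 1);
    tile[ty][tx] = src[y * width + x];

    __syncthreads();

    // Only the inner 8x8 threads compute and write a result.
    if (tx >= 1 && tx <= 8 && ty >= 1 && ty <= 8)
    {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                sum += tile[ty + dy][tx + dx];

        int outX = blockX + tx - 1;
        int outY = blockY + ty - 1;
        if (outX < width && outY < height)
            dst[outY * width + outX] = sum / 9.0f;
    }
}
```

It would be launched with one 10x10 group per 8x8 output tile, e.g. dim3 group(10, 10) and a grid of ((width + 7) / 8, (height + 7) / 8).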

Finally, with the compute shader it's going to be an increasingly attractive option to do more than just graphics on the GPU. Nvidia already has PhysX physics on the GPU, and most likely we'll see cross-vendor physics middleware in the future. Some developers will probably do their own physics, especially when it comes to effect physics, like cloth, water/fire/smoke animation and particle systems. Some might put parts of AI and game logic on the GPU as well. Audio is another good candidate for GPU processing. With things like this processed by the GPU, chances are you will also want to keep the data there, make some systems self-contained on the GPU, and only communicate results back to the CPU as necessary. When this becomes mainstream, it's probably the nail in the coffin for current multi-GPU solutions. While CrossFire/SLI solutions will probably work fine for at least the next generation and perhaps another one, in the long term it'll be hard to continue with AFR rendering. Self-contained systems will simply create so many nasty inter-frame dependencies and data copies between GPUs that it'll be hard to maintain reasonable performance scaling. This doesn't necessarily mean multi-GPU will die, but my prediction is that it'll have to change such that the two GPUs work together on the same frame, use a shared memory, and for all practical purposes behave as if they were a single GPU. That probably means two separate video cards in the same system running in CrossFire/SLI will die off, whereas solutions using two GPUs on the same board will continue to exist.

I'll throw in a final prediction as well. I think the GPU and CPU will eventually merge onto the same die. Not so that one will replace the other, or that they'll become so similar that they are essentially the same; no, I think that we will still have a CPU and GPU very much like today, even on a single chip. But I think the multi-core trend on CPUs will take a different turn. Instead of an ever-increasing number of standard complex cores, you'll probably see just a few standard cores optimized for highest performance of sequential code. They'll execute all the hard-to-parallelize code. Then you'll have a large array of simple CPU cores, Larrabee style, which are individually not as fast, but you can have many of them, and they will take on CPU tasks that are mostly parallel. Including both simple and complex cores, rather than doing one or the other, is motivated by Amdahl's law. Not everything can be parallelized, so let's keep at least a few complex cores that can take on those tasks instead of constraining the whole system's scaling to the performance of the simple cores. Then finally you'll have the GPU where you do specialized super-parallel work, in particular (of course) graphics.
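For reference, Amdahl's law: if a fraction p of the work can be parallelized across N cores, the overall speedup is

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}$$

so even with 95% of the work parallelized the speedup can never exceed 20x, no matter how many simple cores you throw at it. Keeping a few fast complex cores for the serial part is what keeps the (1 - p) term from dominating.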


Overlord
Saturday, May 16, 2009

So in essence, Tim Sweeney was right.
I totally agree with these thoughts.
I also think it's not long before we will see OpenCL-GL (as in CL with minimum graphics functions).

BKLA
Saturday, May 16, 2009

>>>"you'll probably see just a few standard cores optimized for highest performance of sequential code. <cut> Then you'll have a large array of simple CPU cores"

Yey! CELL processor is from the future!!!

A.
Saturday, May 16, 2009

Would you prefer a coherent memory shared by all cores or a more Cell-like model?

Humus
Sunday, May 17, 2009

Overlord, I wouldn't say that Tim Sweeney was right. He's been predicting that as CPUs become faster the GPU will become irrelevant. But the trend has always been that the GPU has become more and more relevant at the expense of the CPU, and the performance gap has widened so much that at this point GPUs are orders of magnitude faster and are now competing for work that has traditionally been done on the CPU. Although I'm sure he takes Larrabee as proof that he was right all along, except that Larrabee is a GPU, albeit a very CPU-like one.

BKLA, well, Cell is pretty close to what I imagine future CPUs will be like, except that the SPUs all work in their own little world, quite isolated from the rest of the system. In a sense it's quite a GPU-like approach, but I'm not sure it makes sense for CPUs.

Overlord
Saturday, May 23, 2009

It depends a little on how you interpret that interview. I read it as saying there will come a time when programming for the GPU or the CPU becomes irrelevant, that you will basically write programs that run on any resource seamlessly, be it the GPU or CPU or both.
It's interesting to note that the Cell processor has about the same processing power as the lowest end today, and so would any high-end Intel or AMD if they were designed a little bit differently.
I wouldn't say that the GPU is orders of magnitude faster; it might be for a single application, but not generally speaking.

I think that in the not so distant future CPUs (like the Cell) will be equipped with 2-4 different kinds of cores, all good at doing their own thing but still coded the same way.
BTW, Larrabee is both GPU and CPU, though I don't think it will come out in time to make any difference in the GPU market.

ULJarad
Sunday, June 28, 2009

Hello Humus. I have some questions!

What exactly about the GPU allows it to perform so well on parallelized code without individual cores? Does Amdahl's Law apply to a GPU's ability to do parallel work?
http://en.wikipedia.org/wiki/Amdahl%27s_law