A developer's diary of Graphics and Physics - Gerasimos "Gerry" Raptis: Compute Shaders

The compute shaders in both DirectX and OpenGL allow us to use the GPU in a way that is not pipelined as object->primitive->pixel.
In one sentence, what they allow you to do is dispatch an arbitrary number of GPU computing threads that operate in an arbitrary way on resources.

They are launched explicitly: you actually ask for the number of threads you want. They don't operate on any kind of primitive, which means that what each thread corresponds to is entirely up to you.

It would take too long to explain them fully, so I will just suggest reading the MSDN documentation on compute shaders and checking out the compute shader samples in the DirectX SDK.

In general, the idea is this (with DirectX semantics):

You bind any kind of classic resources (textures, constant buffers etc.) as shader resources, pretty much as you would do for any other kind of shader.

You normally create one or more Unordered Access Views in order to write to them.

You determine how many threads you want to launch (determined by the kind of work you want to do), and how they are going to be grouped (very important for caching and synchronization).

You launch as many groups as you need.

First of all, the "number of threads" is a 3D vector, in order to assist with mapping the threads to resources (1D, 2D, 3D textures etc.) so that you need less index math to address them.

Example: assume we want to work on a 1024x512 image, and we want to operate on it once per pixel, pretty much like a pixel shader. This quite commonly means launching one thread per pixel.

We might conceptually break the work into 16x16 tiles (in this case, arbitrarily). We would define a shader that is marked [numthreads(16,16,1)], which means that each group will contain 16x16x1 threads, indexed inside the group from (0,0,0) to (15,15,0).
In order to cover our entire 1024x512 image, we would need to launch (1024/16) x (512/16) = 64x32 groups.
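The group-count arithmetic above can be sketched on the CPU side as follows. This is only an illustration: groupCount, DispatchSize and dispatchSizeFor are hypothetical helper names, not part of DirectX, and the rounding up is my assumption for images whose size is not a multiple of the group size (our 1024x512 case divides exactly).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a DirectX API): number of thread groups needed
// along one axis so that groups * threadsPerGroup covers at least `extent`.
inline std::uint32_t groupCount(std::uint32_t extent, std::uint32_t threadsPerGroup)
{
    assert(threadsPerGroup > 0);
    // Ceiling division: a partially filled last tile still needs a group.
    return (extent + threadsPerGroup - 1) / threadsPerGroup;
}

// Hypothetical container for the three group counts of one dispatch.
struct DispatchSize
{
    std::uint32_t x, y, z;
};

// Group counts for a 2D image processed by a [numthreads(tx, ty, 1)] shader.
inline DispatchSize dispatchSizeFor(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t tx, std::uint32_t ty)
{
    return { groupCount(width, tx), groupCount(height, ty), 1 };
}
```

For our image, dispatchSizeFor(1024, 512, 16, 16) gives 64, 32 and 1, which are the three group counts you would hand to the Dispatch call.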

Let us do something pretty basic: get our RenderTarget texture as input, draw a 10 pixel wide red border around it, and spit it out as output. This doesn't really need a compute shader and could be done in place, but let us take things step by step.

cbuffer Cbuffer : register(b0)
{
    float width;   // image width in pixels
    float height;  // image height in pixels
};

Texture2D textureBufferInput : register(t0);   // read-only input
RWTexture2D<float4> uavOutput : register(u0);  // writable output (UAV)

#define THREADS_X 16
#define THREADS_Y 16

[numthreads(THREADS_X, THREADS_Y, 1)]
void main(uint3 pixelId : SV_DispatchThreadID)
{
    // Within 10 pixels of any edge: paint the border red.
    if (pixelId.x < 10 ||
        pixelId.x + 10 >= (uint)width ||
        pixelId.y < 10 ||
        pixelId.y + 10 >= (uint)height)
    {
        uavOutput[pixelId.xy] = float4(1, 0, 0, 1);
    }
    else
    {
        // Everywhere else: copy the input pixel unchanged.
        uavOutput[pixelId.xy] = textureBufferInput[pixelId.xy];
    }
}

All right, it is pretty obvious here, even to someone looking at a compute shader for the first time, that RWTexture2D is what it says (a texture that can be written to), and that we have somehow bound it as an output with API calls that we can find in MSDN. Apart from the border, all we're doing here is a simple copy, and we didn't even need to do that (as we understand that a RW texture is exactly that: readable and writable).
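If you want to check the border test on the CPU, for example to verify a read-back image, a minimal sketch could look like this. The function name isBorderPixel and the default border of 10 are my own choices for illustration. Note that it uses >= on the far edges so that the border is exactly 10 pixels wide on every side; a strict > would leave the right and bottom borders only 9 pixels wide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU-side mirror of the shader's border test.
// Returns true when (x, y) lies within `border` pixels of any image edge.
inline bool isBorderPixel(std::uint32_t x, std::uint32_t y,
                          std::uint32_t width, std::uint32_t height,
                          std::uint32_t border = 10)
{
    // The sketch assumes the image is large enough to have an interior.
    assert(width >= 2 * border && height >= 2 * border);

    return x < border ||
           x + border >= width ||   // last `border` columns
           y < border ||
           y + border >= height;    // last `border` rows
}
```

On the 1024x512 image, column 1014 is the first column of the right border and row 502 is the first row of the bottom border.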

Keep in mind that we could of course write to any part of the output buffer, we are just choosing to "correspond" the same pixel in the input and output. For example, we could mirror it, or conditionally write, or blend, or do whatever else we want. By doing that, though, we introduce a need for synchronization, but let's leave that for later.

I would expect only one question here. How on earth did we get the correct pixel into the mysterious pixelId parameter? And what on earth is SV_DispatchThreadID?

Well, that is actually simple, but it can get confusing. You should check the Dispatch command in MSDN for the detailed explanation.

The short story is, there are a few different semantics, identifying the group XYZ, the thread inside the group XYZ, the thread inside the group as a flat index, and the global thread XYZ. SV_DispatchThreadID is the global thread XYZ, which means that it is unique for every thread of a Dispatch call, and we have configured the dispatch so that:

(threadsX, threadsY, threadsZ) x (groupsX, groupsY, groupsZ) = (ImageX, ImageY, 1), which in our case is (16, 16, 1) x (64, 32, 1) = (1024, 512, 1).

The full set of semantics available here (without any kind of preparation on the API side) is:

uint3 SV_GroupID : The 3D index of the current group. In our case, it will go from (0,0,0) to (63,31,0), as it comes from our Dispatch call.

uint3 SV_GroupThreadID : The 3D index in the current group ("relative") of the current thread. In our case, it will go from (0,0,0) to (15,15,0), as it comes from our numthreads attribute.

uint SV_GroupIndex : The 1D "flattened" index in the current group ("relative") of the current thread. In our case, it will go from 0 to 255 (16x16 = 256 threads per group), as it is nothing more than a different way to expose SV_GroupThreadID.

uint3 SV_DispatchThreadID : This is the "global", or "absolute" id of a thread. It is defined by both the Dispatch call and numthreads, and in our case will go from (0,0,0) to (1023, 511, 0). It is defined as SV_GroupID * (groupsizeX, groupsizeY, groupsizeZ) + SV_GroupThreadID.
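To make the relationship between these ids concrete, here is a small CPU-side sketch of the arithmetic. It is only an emulation for illustration: UInt3, dispatchThreadId and groupIndex are hypothetical names of mine, and the flattening order (x fastest, then y, then z) is how the flat in-group index (SV_GroupIndex in HLSL) is documented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint3.
struct UInt3
{
    std::uint32_t x, y, z;
};

// Global thread id: group id times group size, plus the id inside the group.
inline UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numThreads)
{
    // The in-group id must be smaller than the numthreads dimensions.
    assert(groupThreadId.x < numThreads.x &&
           groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);

    return { groupId.x * numThreads.x + groupThreadId.x,
             groupId.y * numThreads.y + groupThreadId.y,
             groupId.z * numThreads.z + groupThreadId.z };
}

// Flattened in-group index: x varies fastest, then y, then z.
inline std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numThreads)
{
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x +
           groupThreadId.x;
}
```

With numthreads (16,16,1), the last thread (15,15,0) of the last group (63,31,0) gets the global id (1023,511,0) and the flat index 255, matching the ranges listed above.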

You can use whichever combination of the above suits your application. Simplistic per pixel calculations benefit from SV_DispatchThreadID. In order to make the more useful compute shaders shine you usually use group caching (group shared memory), which makes it necessary to use either SV_GroupThreadID or SV_GroupIndex.

In other words, while nothing built in makes a thread correspond to a pixel, we have launched the compute shader in such a way that we can easily select a single pixel from it. In fact, we might say that in this case, we have just made it into a simple pixel shader.

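Keep in mind that the output Unordered Access View can be both read and written to for in place editing, although in Direct3D 11 reading from a typed UAV is limited to certain formats, so check the documentation for the format you use.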

After this simple intro, we are ready for the next post, with some real compute-shader post-processing.


Sunday, May 12, 2013

Compute Shaders - Overview and theory
