Sunday, June 2, 2013

Compute Shaders - Parallel reduction for luminance measurement

So, how do we actually use the HDR render target mentioned in the previous post? What can we actually DO with color values that are greater than 1.0, since the screen will ultimately display them as 1.0? And how do we use a compute shader to make our life easier?

What we want to do is convert the HDR image into a better LDR image than we would get if we simply clamped everything to 1.0. This is called tone mapping: mapping one range of values (the 0..X range of the HDR render target) to another range (the 0..1 of the final LDR image).

The basic components of a tone-mapping algorithm are deceptively simple:
1) Get a heuristic of the scene's overall brightness
2) Using the above heuristic and a suitable algorithm, map the values of the HDR frame buffer to a 0..1 range. In the process, use the initial data to apply any number of effects to very bright or very dark areas (a sketch of such an operator follows below).
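 
To make 2) a bit more concrete, here is a minimal sketch of what such an operator could look like, using the well-known Reinhard curve x / (1 + x). The function name, the 0.18 "key" value and the way averageLuminance arrives are illustrative assumptions, not something fixed by this post:

//Minimal sketch of a tone mapping operator (Reinhard-style).
//averageLuminance is assumed to be the value measured by the
//reduction passes described below.
float3 toneMapSketch(float3 hdrColor, float averageLuminance)
{
    //Expose the color relative to the scene's average brightness
    //(0.18 is a commonly used "middle grey" key value)...
    float3 scaled = hdrColor * (0.18 / averageLuminance);
    //...then compress the result into the 0..1 range.
    return scaled / (1.0 + scaled);
}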

Part 1) is where the compute shader really shines. What we want to do here is sum all the pixels of the initial buffer.
The conventional, non-compute-shader way to do that would be to use the classic pipeline and downscale/downsample the render target repeatedly until we got a 1x1 texture containing the average color of the image (one step of that approach is sketched below).
A compute shader can do this very elegantly and efficiently, using a process called parallel reduction.
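 
For contrast, one step of the conventional downsampling approach might look like the following sketch; the names and register assignments are my own, and it relies on bilinear filtering doing the averaging when rendering into a half-resolution target:

//One step of the conventional approach: render a full-screen quad into a
//target of half the source resolution. With bilinear filtering, a single
//sample at a destination texel center averages a 2x2 block of source
//texels, so repeating the pass eventually yields a 1x1 "average" texture.
Texture2D sourceTexture : register(t0);
SamplerState bilinearSampler : register(s0);

float4 downsamplePS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    return sourceTexture.Sample(bilinearSampler, uv);
}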

Reduction in this context is an operation that is applied to an input set and outputs a single value. The operation used here is either summation or averaging, with the same end result.

As an example, obtaining the sum of the values (1.0, 0.5, 3.2, 7.1, 5.6, 0.01, 0.6, 3.2) via reduction would be the following iterative procedure:
1) 1.0 + 0.5 = 1.5, 3.2 + 7.1 = 10.3, 5.6 + 0.01 = 5.61, 0.6 + 3.2 = 3.8
2) 1.5 + 10.3 = 11.8, 5.61 + 3.8 = 9.41
3) 11.8 + 9.41 = 21.21
This allows us to calculate, for example, the average of these values as 21.21 / 8 = 2.65125.
The exact same result could be obtained by averaging at each step instead of summing, the only difference being that the division is spread over the steps instead of performed once at the end: the pairwise averages 0.75, 5.15, 2.805, 1.9 reduce to 2.95, 2.3525, and finally to the same 2.65125.

In any case, it should be obvious that if we perform this operation on our render buffer, we easily obtain a single value, the "average color" of our scene, which we can then use for the tone mapping.

A compute shader snippet to that effect can be found in the HDR tutorial of the DirectX samples; I have modified it to work as a loop, as shown further below.

So, the idea is pretty simple but, as always, the devil is in the details.
  • First, we need to cache, cache, cache: there are a lot of texture fetches here, and we really need to work with the local (groupshared) memory of the GPU as much as possible.
  • Second, we must know where to store the values, as it is not as obvious as it seems at first.
  • Third, and probably most importantly, there is a lot of synchronization that must be done between the steps: if you don't synchronize, one thread WILL start a reduction step while another thread is still calculating the previous one, and you will end up with garbage values.
  • Fourth, keep in mind that you can only synchronize per group. What that means is that unless you use some fancy global-access atomic operations (which I suppose is quite doable, but I doubt the performance of directly hammering global memory this way), you can only reduce within a single group. Which means that you need two passes: one to reduce the image to a number of values equal to the number of groups in the initial dispatch, and another that launches a single group to reduce these values to a single luminance value.
  • While it is tempting to launch a single group and do the entire reduction at once, I really doubt (though until I profile I cannot insist) that synchronizing a few million threads' worth of work in one group would perform well.
So, the first pass needs to reduce the image down to one value per group. My shader looks like this:

Texture2D textureBuffer: register(t5);
SamplerState textureSampler: register(s10);
RWTexture2D<float4> uavLuminance: register(u0);

#define SIZEX 16
#define SIZEY 16
#define GROUPSIZE (SIZEX * SIZEY)
groupshared float4 accumulator[GROUPSIZE];

//The dispatch on the CPU side is assumed to consist of
//DispatchSizeX = ceil(projectionDetails.width / SIZEX) by
//DispatchSizeY = ceil(projectionDetails.height / SIZEY) groups,
//where projectionDetails is a constant buffer holding the viewport size,
//so that every pixel of the render target is covered by exactly one thread.

[numthreads(SIZEX, SIZEY, 1)]
void main(uint3 inp : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID, uint gtidx : SV_GroupIndex)
{
    //Load pixel into local memory buffer.
    accumulator[gtidx] = textureBuffer[inp.xy];

    //Wait for all
    GroupMemoryBarrierWithGroupSync();

    //Parallel reduction to one value per group.
    //Values are accumulated in the bottom half of the accumulator at each
    //step. For example, using sums (the code below averages instead, but
    //the access pattern is identical):
    //0,1,2,3,4,5,6,7 ->
    //0+4=4,1+5=6,2+6=8,3+7=10,4,5,6,7 ->
    //4+8=12,6+10=16,8,10,4,5,6,7 ->
    //12+16=28,16,8,10,4,5,6,7 ->
    //for a final value of 28
    [unroll]
    for (uint ix = GROUPSIZE >> 1; ix > 0 ; ix = ix >> 1) {
        //Check if we are in the bottom half of the part under consideration.
        //If we are, add the corresponding top half value to it. If not, this
        //thread's job is finished. Unfortunately it cannot return yet,
        //because it still has to participate in the synchronization below.
        if (gtidx < ix) {
            accumulator[gtidx] = (accumulator[gtidx] + accumulator[gtidx + ix])*.5;
        }
        GroupMemoryBarrierWithGroupSync();
    }
    if (gtidx != 0) { return; }
    //Here we store the accumulated value of the group to the global texture
    uavLuminance[gid.xy] = accumulator[0];
}

This concludes the first pass. You really have a lot of options here: the actual place in the buffer where the values are stored, as well as the bounds checking you need to do, depends on the actual implementation, but this is the basic operation that must be performed.

The second pass needs to perform the same operation, and again the way it is dispatched varies, but the important point is that it MUST be a single group. How this is guaranteed is again up to the specific implementation. At the moment, I have just done it the "quick" way: over-allocating threads and zeroing out the out-of-bounds ones. A better solution might be to re-define SIZEX and SIZEY (and recompile the shader) on screen reshape, something that in most applications happens so rarely that it might be worth it.
In any case (I have purposely omitted the definitions that have to do with dispatch), the actual work done is the same:

[numthreads(SIZEX, SIZEY, 1)]
void main(uint3 inp : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID, uint gtidx : SV_GroupIndex)
{
    //actual_number_of_values is assumed to arrive via a constant buffer and
    //holds how many per-group values the first pass actually wrote.
    //textureBuffer is now bound to the texture the first pass wrote into.
    accumulator[gtidx] = gtidx >= actual_number_of_values ? 0.0f : textureBuffer[inp.xy]; //bounds check
    GroupMemoryBarrierWithGroupSync();

    [unroll]
    for (uint iy = GROUPSIZE >> 1; iy > 0 ; iy = iy >> 1) {
        if (gtidx < iy) {
            accumulator[gtidx] = (accumulator[gtidx] + accumulator[gtidx + iy]);
        }
        GroupMemoryBarrierWithGroupSync();
    }
    if (gtidx != 0) { return; }

    //The first pass stored per-group averages, so after summing them we
    //divide by the number of values to get the average of the whole screen.
    uavLuminance[uint2(0, 0)] = accumulator[0] / actual_number_of_values;
}
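 
Just to illustrate where this leads (the binding and names below are my own assumptions, not dictated by the code above), the tone mapping pass can later fetch the measured value like so:

//Hypothetical sketch: the luminance texture written above, re-bound as a
//shader resource for the tone mapping pass (the register is illustrative).
Texture2D<float4> measuredLuminance : register(t6);

float3 fetchAverageColor()
{
    //The whole-screen average color sits in the first texel.
    return measuredLuminance[uint2(0, 0)].rgb;
}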

And there you have it. You now proudly sport a texture whose first element is the average color of the screen. In the next post, we can actually use that for tone mapping.
