Hierarchical Depth Buffers
March 25, 2020

Overview

A hierarchical depth buffer is a multi-level depth (Z) buffer used as an acceleration structure for depth queries. As with normal texture mip chains, the dimensions of each level are generally successive power-of-2 fractions of the full-resolution buffer’s dimensions. In this article I present two techniques for generating a hierarchical depth buffer from a full-resolution one.

First I show how to generate the full mip chain for a depth buffer in a way that preserves depth query accuracy in texture coordinate (or NDC) space even for non-power-of-2 depth buffer dimensions. (I’ve seen sample code online that doesn’t provide this guarantee, which makes accurately querying higher mip levels trickier.)

Second, for cases where only a single downsampled level is needed, I show how to generate this level using a single compute shader invocation that uses atomic operations in workgroup shared memory. For my application—where only 1/16th x 1/16th resolution (mip level 4) is needed—the compute shader technique is 2–3x faster than the usual multi-pass mip chain downsampling approach.

Introduction

Hierarchical depth (also known as Hi-Z) is a technique that comes up often in graphics. It’s used to accelerate occlusion culling (on the CPU and the GPU), screen-space reflections, screen-space ambient occlusion, volumetric fog, and more.

Additionally, it’s common for GPUs to implement Hi-Z as part of the rasterization pipeline. Fast Hi-Z lookups in on-chip caches allow full tiles of fragments to be skipped when they’re fully-occluded by previously-rendered primitives.

The general idea of Hi-Z is to accelerate depth queries by reading from reduced-resolution buffers. This is faster than reading from the full-res depth buffer for two reasons:

  1. A single texel (or just a few texels) of the reduced-res buffer can be used as a proxy for many at full-res.
  2. The reduced-res buffer can be small enough to fit in cache, making all lookups (especially random-access ones) fast.

The contents of the downsampled levels of a Hi-Z buffer vary depending on usage (e.g., whether the depth buffer is “reversed” or not, and what types of queries need to be enabled). Generally a texel in a level of a Hi-Z buffer stores the min or max of all the texels it subsumes from the previous level. Sometimes both min and max values are stored. It’s rare to store a simple average (as is often done for texture mip levels), as average depths are rarely useful for the types of queries desired.

Hi-Z buffers are most commonly queried as an “early out” to avoid further processing—and particularly more-precise lookups in the full-res buffer. For example, if we store the max values for a non-reversed depth buffer (where larger depth values are further away), we can quickly determine if a particular screen-space position is definitely occluded by the depth buffer (if its Z coordinate is > the (max) value stored at some higher (lower-res) level of the Hi-Z buffer).

Note the usage of the word “definitely” in the previous sentence: if the Z coordinate tested is <= to the (max) value retrieved, then it may or may not be occluded. For some applications, full-res depth lookups may be needed for the “maybe” case; for others, they may not (e.g., if all that’s at stake is wasted computation, rather than correctness).
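
To make this concrete, here is a minimal sketch of such an early-out test in GLSL (the names are hypothetical; u_hiZBuffer stores max depths for a non-reversed depth buffer, and u_hiZLevel is the Hi-Z mip level being queried):

uniform sampler2D u_hiZBuffer; // Stores max depth values.
uniform int u_hiZLevel;

// Returns true if the point at 'texCoords' with depth 'z' is definitely occluded.
// A false result means "maybe occluded": a more precise test may be needed.
bool isDefinitelyOccluded(vec2 texCoords, float z) {
	float maxDepth = textureLod(u_hiZBuffer, texCoords, float(u_hiZLevel)).r;
	return z > maxDepth;
}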

My Application: Compute Shader Particle Rendering

I encountered the need for Hi-Z when implementing compute shader particle rendering in the engine for my VR app PARTICULATE. Because this rendering technique forgoes fixed-function rasterization, it must do its own depth tests—one for each single-pixel particle. And because particles are not sorted in any way, access to the depth buffer is (in the worst case) effectively random.

Random-access lookups in a full-screen texture are a recipe for poor performance. To mitigate this cost, I first do the depth lookups in a 1/16th x 1/16th resolution downsampled depth buffer. This buffer contains min depth values, which allows the rendering compute shader to skip the full-res depth test for the vast majority of visible particles. (If a particle’s depth is < the min depth stored in the reduced-res buffer, then we know that it’s definitely visible. If it’s >= the min, then we must check the full-res depth buffer.)
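
In shader terms, the test works out to something like the following sketch (names are hypothetical; u_minDepthBuffer is the 1/16th x 1/16th min-depth buffer):

uniform sampler2D u_fullResDepthBuffer;
uniform sampler2D u_minDepthBuffer; // 1/16th x 1/16th res, stores min depths.

bool isParticleVisible(ivec2 fullResTexelCoords, float particleDepth) {
	float minDepth = texelFetch(u_minDepthBuffer, fullResTexelCoords / 16, 0).r;
	if (particleDepth < minDepth) {
		return true; // Definitely visible - no full-res lookup needed.
	}
	// The "maybe" case: fall back to the precise full-res test.
	return particleDepth < texelFetch(u_fullResDepthBuffer, fullResTexelCoords, 0).r;
}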

This makes the depth test generally cheap for visible particles. (It’s more expensive for occluded particles, but that’s fine: occluded particles skip the rendering work entirely, so they’re cheap overall.)

Doing the reduced-res depth lookups first (as just described) reduced particle rendering times by up to 35% compared to always doing full-res lookups, so Hi-Z is a big win for my application.

I will now discuss two techniques for generating a hierarchical depth buffer.

Technique 1: Generating the Full Mip Chain

Many Hi-Z applications require the full depth buffer mip chain. Hi-Z occlusion culling, for instance, works by projecting a bounding volume into screen-space and using the projected size to choose the appropriate mip level (so that a fixed number of texels are accessed per occlusion test).

It’s generally straightforward to generate the mip chain from the full-res depth buffer—for each texel in level N, take the max (or min, or both) of the corresponding 4 texels in the previously-generated level N-1. Do successive passes of this (halving both dimensions each time) until the final 1x1 mip level is generated.

However, things are more complicated for depth buffers with non-power-of-2 dimensions. Because we’re often building Hi-Z for depth buffers of typical screen resolutions (which are rarely power-of-2), this is something that should be dealt with robustly.

Let’s first establish what the value in each texel of a depth buffer mip level is supposed to represent. For the purposes of this discussion, assume the mip chain stores min values. Depth lookups should use nearest-neighbor filtering, as interpolating min values is not useful and would also interfere with the intended hierarchical nature of the depth mip chain.

So when we get the value of a particular texel in mip level N, what does its value signify, precisely? It should be the min of all texels in the full-res depth buffer that occupy the same footprint in (normalized) texture coordinate space.

To put it another way, if a particular texture coordinate (in \( [0, 1]^2 \)) maps (using nearest-neighbor filtering) to a particular texel in the full-res buffer, then that full-res texel must be considered for the min that is computed for the texel in each higher mip level that the same texture coordinate also maps to.

When this correspondence is guaranteed, we know that a lookup in the higher mip levels will never yield a depth value that is > the value of the texel the same texture coordinate maps to in the full-res buffer (level 0). For a particular level N, this guarantee also holds true for all levels under that level (< N).

For even level dimensions (which for power-of-2-dimensioned full-res buffers will be the case for every level until a dimension becomes 1), this is easy. In the one-dimensional case, for a texel in level N with index \( i \), we need to consider the texels in level N-1 with indices \( 2i \) and \( 2i + 1 \) for the min. Thus, \( D_{N}[i] = \text{min}(D_{N-1}[2i], D_{N-1}[2i + 1]) \). There is a direct 2-to-1 mapping between texels (and thus texture coordinate footprints) because the dimension of each level is exactly half of the previous.

Example for even level dimensions: The 6 texels in this level reduce to 3 in the next level up. The texture coordinate footprints of the (3) higher-level texels each overlap exactly 2 texels from the lower level. (Circles are texel centers, and the boxes around them represent texture coordinate footprints when using nearest-neighbor filtering.)

For odd level dimensions (of which there will be at least one for non-power-of-2 full-res dimensions), things are more complicated. For a level N-1 of odd dimension \( \mathit{dim}_{N-1} \), the dimension of the next highest level (N) will be \( \mathit{dim}_{N} = \lfloor \frac{\mathit{dim}_{N-1}}{2} \rfloor \), which is \( \neq \frac{\mathit{dim}_{N-1}}{2} \).

What this means is that there is no longer a straightforward 2-to-1 mapping of texels in level N-1 to level N. Instead, the texture-coordinate-space footprint of each texel in level N overlaps the footprint of 3 texels in level N-1.

Example for odd level dimensions: The 7 texels in this level reduce to 3 in the next level up. The texture coordinate footprints of the (3) higher-level texels each overlap the footprints of 3 texels from the lower level.

Thus, \( D_{N}[i] = \text{min}(D_{N-1}[2i], D_{N-1}[2i + 1], D_{N-1}[2i + 2]) \). This means that the same texel in level N-1 sometimes contributes to the min computed for 2 texels in level N. This is necessary to maintain the correspondence described earlier.

The above description was done in one dimension for simplicity. For two dimensions, if both dimensions of level N-1 are even, a 2x2 texel region in level N-1 maps to a single texel in level N. If one dimension is odd, then either a 2x3 or 3x2 region in level N-1 maps to a single texel in level N. If both dimensions are odd, then the “corner” texel shared by the row and column extensions must also be considered—so a 3x3 region in level N-1 maps to a single texel in level N.

Example code

The following GLSL fragment shader code implements the algorithm just described. It should be run for each successive mip level, starting with level 1 (level 0 being full-res).

uniform sampler2D u_depthBuffer;
uniform int u_previousLevel;
uniform ivec2 u_previousLevelDimensions;

void main() {
	ivec2 thisLevelTexelCoord = ivec2(gl_FragCoord);
	ivec2 previousLevelBaseTexelCoord = 2 * thisLevelTexelCoord;

	vec4 depthTexelValues;
	depthTexelValues.x = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord,
                                    u_previousLevel).r;
	depthTexelValues.y = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(1, 0),
                                    u_previousLevel).r;
	depthTexelValues.z = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(1, 1),
                                    u_previousLevel).r;
	depthTexelValues.w = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(0, 1),
                                    u_previousLevel).r;

	float minDepth = min(min(depthTexelValues.x, depthTexelValues.y),
                         min(depthTexelValues.z, depthTexelValues.w));

    // Incorporate additional texels if the previous level's width or height (or both)
    // are odd.
	bool shouldIncludeExtraColumnFromPreviousLevel = ((u_previousLevelDimensions.x & 1) != 0);
	bool shouldIncludeExtraRowFromPreviousLevel = ((u_previousLevelDimensions.y & 1) != 0);
	if (shouldIncludeExtraColumnFromPreviousLevel) {
		vec2 extraColumnTexelValues;
		extraColumnTexelValues.x = texelFetch(u_depthBuffer,
                                              previousLevelBaseTexelCoord + ivec2(2, 0),
                                              u_previousLevel).r;
		extraColumnTexelValues.y = texelFetch(u_depthBuffer,
                                              previousLevelBaseTexelCoord + ivec2(2, 1),
                                              u_previousLevel).r;

		// In the case where the width and height are both odd, need to include the
        // 'corner' value as well.
		if (shouldIncludeExtraRowFromPreviousLevel) {
			float cornerTexelValue = texelFetch(u_depthBuffer,
                                                previousLevelBaseTexelCoord + ivec2(2, 2),
                                                u_previousLevel).r;
			minDepth = min(minDepth, cornerTexelValue);
		}
		minDepth = min(minDepth, min(extraColumnTexelValues.x, extraColumnTexelValues.y));
	}
	if (shouldIncludeExtraRowFromPreviousLevel) {
		vec2 extraRowTexelValues;
		extraRowTexelValues.x = texelFetch(u_depthBuffer,
                                           previousLevelBaseTexelCoord + ivec2(0, 2),
                                           u_previousLevel).r;
		extraRowTexelValues.y = texelFetch(u_depthBuffer,
                                           previousLevelBaseTexelCoord + ivec2(1, 2),
                                           u_previousLevel).r;
		minDepth = min(minDepth, min(extraRowTexelValues.x, extraRowTexelValues.y));
	}

	gl_FragDepth = minDepth;
}

Caveats with this listing

For full-res depth buffers where one dimension is more than double the other, the calls to texelFetch may index out of the bounds of u_depthBuffer. (In such cases the smaller dimension reaches 1 before the other does.) I wanted to use texelFetch (which takes integer coordinates) in this example to make what’s happening as clear as possible, and I have not personally encountered such especially wide or tall depth buffers. If you do encounter them, you can clamp the coordinates passed to texelFetch, or use texture with normalized texture coordinates (and clamp-to-edge set for the sampler). When computing a min or max, it’s always fine to consider the same texel multiple times in these boundary cases.
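
For example, a clamped version of one of the extra-column fetches might look like this (a sketch that reuses the uniforms from the listing above):

ivec2 maxTexelCoords = u_previousLevelDimensions - ivec2(1);
float clampedTexelValue = texelFetch(u_depthBuffer,
                                     min(previousLevelBaseTexelCoord + ivec2(2, 0),
                                         maxTexelCoords),
                                     u_previousLevel).r;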

Also note that though the first 4 texelFetch calls could be replaced with a single textureGather, doing so complicates things (textureGather can’t specify a mip level), and I have not observed a speedup from using it.

Performance

I used the above fragment shader to generate the full mip chain for the (two) per-eye depth buffers in my VR engine. For the test, each eye’s resolution was 1648x1776, which results in 10 additional downsampled mip levels (and thus 10 passes). The time to generate the full chain for both eyes together was 0.25 ms on an NVIDIA GTX 980 and 0.30 ms on an AMD R9 290.

Mip levels 4, 5, and 6 generated from the depth buffer corresponding to the color buffer shown above them. (Note that much of the scene is transparent and thus does not affect the depth buffer.) Mip level 4 is the first whose dimensions (103x111) are not evenly divisible by 2.

An alternative method of generating the mip chain

The goal of the previously-described algorithm is to preserve depth query accuracy in texture coordinate (or NDC) space. For completeness (and because I forgo this guarantee for Technique 2 below), I want to show another method that I’ve seen presented (e.g., in this article).

Note that as with the previous method, this alternative method is designed to address non-power-of-2 full-res buffers (though of course it works on power-of-2 buffers as well).

In this alternative method, when downsampling a level with odd width (or height), instead of incorporating an additional column (or row) of texels from the previous (lower) level for every output texel, we do this only for the output’s highest-indexed “edge” texels. The only thing that changes in the above fragment shader is how shouldIncludeExtraColumnFromPreviousLevel and shouldIncludeExtraRowFromPreviousLevel are set:

// If the previous level's width is odd and this is the highest-indexed "edge" texel for
// this level, incorporate the rightmost edge texels from the previous level. The same goes
// for the height.
bool shouldIncludeExtraColumnFromPreviousLevel =
    (previousLevelBaseTexelCoord.x == u_previousLevelDimensions.x - 3);
bool shouldIncludeExtraRowFromPreviousLevel =
    (previousLevelBaseTexelCoord.y == u_previousLevelDimensions.y - 3);

The effect of this is that the highest-indexed edge texels become very “fat”, as each division by 2 of an odd dimension causes them to occupy a proportionally wider range of normalized texture coordinate space.

The downside of doing things this way is that it becomes trickier to query the depth of higher mip levels accurately. Instead of simply using normalized texture coordinates, it’s necessary to first determine the full-res texel that corresponds to these coordinates and then translate that texel’s coordinates to the corresponding ones in the mip level being queried.

The following code snippet does the translation from NDC space (\( [-1, 1]^2 \)) to texel coordinates in mip level higherMipLevel:

vec2 windowCoords = (0.5 * ndc.xy + vec2(0.5)) * textureSize(u_depthBuffer, 0);
// Account for texel centers being halfway between integers.
ivec2 texelCoords = ivec2(round(windowCoords.xy - vec2(0.5)));
ivec2 higherMipLevelTexelCoords =
    min(texelCoords / (1 << higherMipLevel),
        textureSize(u_depthBuffer, higherMipLevel).xy - ivec2(1));
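
The translated coordinates can then be used for the actual depth lookup, e.g.:

float higherMipLevelDepth = texelFetch(u_depthBuffer, higherMipLevelTexelCoords,
                                       higherMipLevel).r;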

Technique 2: Generating A Single Hi-Z Level Using a Compute Shader

Generating the full mip chain is pretty fast, but I was slightly bothered that my app was generating all those levels and then using only one of them (level 4). Besides this small inefficiency, I also wanted to see how much faster I could get things if I used a single compute shader dispatch to generate just the level I needed. (Note that my app could stop with level 4 when using the multi-pass fragment shader approach, so that’s what I use as the basis for comparison in the timings at the end of this section.)

A large fraction of Hi-Z use cases only need one downsampled depth level, so I believe this is a common scenario. I wrote a compute shader for my specific need (generating level 4, which is 1/16th x 1/16th resolution). Similar code could be used to generate different levels.

A compute shader is a good fit for this problem because it can take advantage of workgroup shared memory for inter-thread communication. Each workgroup is responsible for a single output texel (of the downsampled buffer), and the threads in the workgroup divide the work of min'ing the corresponding full-res texels, communicating their results through shared memory.

I tried two main compute-shader-based approaches. The first was to have each thread call atomicMin on a single shared memory variable.

Note that because it’s not possible (without vendor-specific extensions) to do atomic ops on non-integer values (and my depths are stored as float), a trick is necessary here. Because non-negative IEEE 754 floating point values maintain their ordering when their bits are interpreted as unsigned integers, we can use floatBitsToUint to “reinterpret cast” float depth values to uint before calling atomicMin (and then uintBitsToFloat on the final minimum uint value).
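
In isolation (using the same shared variable name as the full shader below), the round trip looks like this:

// Each thread folds its depth value into the shared minimum.
// ('depthValue' here stands for the thread's local minimum.)
atomicMin(s_workgroupMinDepthEncodedAsUint, floatBitsToUint(depthValue));

// ...and after a barrier, the final minimum is recovered as a float.
float workgroupMinDepth = uintBitsToFloat(s_workgroupMinDepthEncodedAsUint);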

The most obvious atomicMin approach is to have 16x16 thread groups, where each thread fetches a single texel and atomicMin’s it with the shared memory value. I compared this to having smaller thread blocks (8x8, 4x8, 4x4, 2x4, 2x2), where each thread fetches a region of texels and computes its own local minimum before calling atomicMin.

The fastest of all these atomicMin approaches on both the NVIDIA and AMD GPUs I tested was 4x4 thread blocks (where each thread itself fetches a 4x4 region of texels). I am not entirely sure why this was the fastest, but perhaps it reflects a tradeoff between atomic op contention and independent thread computation. Note also that the 4x4 workgroup size uses only 16 threads per warp/wave (compared to the available 32 or 64), which is curious. The example code below implements this approach.

As an alternative to using atomicMin, I tried doing a parallel reduction using the techniques in this commonly-cited NVIDIA presentation. (The basic idea is to use a shared memory array of the same size as the number of threads in the workgroup and \( \log_{2} (n) \) passes to successively min together the per-thread minimums until the final workgroup-wide minimum is obtained.)

I tried this approach with all the same workgroup dimensions as I tried for the atomicMin approach. Even with all the optimizations in the NVIDIA presentation linked above, the parallel reduction approach was slightly slower (by a single-digit percentage on both GPUs) than the atomicMin approach I had arrived at. The atomicMin approach was also substantially simpler code-wise.
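
For reference, the core of the reduction looks something like this sketch (assuming a 16-thread workgroup whose threads have already written their local minimums to a shared array):

shared float s_depths[16];

// After each thread writes its local minimum to s_depths[gl_LocalInvocationIndex]:
for (uint stride = 8u; stride > 0u; stride >>= 1) {
	memoryBarrierShared();
	barrier();
	if (gl_LocalInvocationIndex < stride) {
		s_depths[gl_LocalInvocationIndex] =
		    min(s_depths[gl_LocalInvocationIndex],
		        s_depths[gl_LocalInvocationIndex + stride]);
	}
}
// s_depths[0] now holds the workgroup-wide minimum.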

Example code

For this method, it’s simplest if we don’t try to maintain the correspondence in normalized texture coordinate space between reduced- and full-res texels. It’s easy to convert from full-res texel coordinates to reduced-res texel coordinates:

ivec2 reducedResTexelCoords = texelCoords / ivec2(downscalingFactor);

For my usage (generating the equivalent of mip level 4), downscalingFactor is 16.

As mentioned above, this GLSL compute shader implements the atomicMin approach with 4x4 workgroup size and each thread fetching a 4x4 texel region from the full-res buffer. The resulting downsampled min depth buffer is 1/16th x 1/16th the size of the full-res one (rounded up when the full-res dimensions aren’t evenly divisible by 16).

uniform sampler2D u_inputDepthBuffer;
uniform restrict writeonly image2D u_outputDownsampledMinDepthBufferImage;
// The dimension in normalized texture coordinate space of a single texel in
// u_inputDepthBuffer.
uniform vec2 u_texelDimensions;

// Resulting image is 1/16th x 1/16th resolution, but we fetch 4x4 texels per thread, hence
// the divisions by 4 here.
layout(local_size_x = 16/4, local_size_y = 16/4, local_size_z = 1) in;

// This is stored as uint because atomicMin only works on integer types. Luckily
// (non-negative) floats maintain their order when their bits are interpreted as uint (using
// floatBitsToUint).
shared uint s_workgroupMinDepthEncodedAsUint;

void main() {
	if (gl_LocalInvocationIndex == 0) {
        // Initialize to 1.0 (max depth) before performing atomicMin's.
		s_workgroupMinDepthEncodedAsUint = floatBitsToUint(1.0);
	}

	memoryBarrierShared();
	barrier();

	// Fetch a 4x4 texel region per thread with 4 calls to textureGather. 'gatherCoords'
    // are set up to be equidistant from the centers of the 4 texels being gathered (which
    // puts them on integer values). In my tests textureGather was not faster than 
    // individually fetching each texel - I use it here only for conciseness.
    //
    // Note that in the case of the full-res depth buffer's dimensions not being evenly 
    // divisible by the downscaling factor (16), these textureGather's may try to fetch 
    // out-of-bounds coordinates - that's fine as long as the texture sampler is set to 
    // clamp-to-edge, as redundant values don't affect the resulting min.

	uvec2 baseTexelCoords = 4 * gl_GlobalInvocationID.xy;
	vec2 gatherCoords1 = (baseTexelCoords + uvec2(1, 1)) * u_texelDimensions;
	vec2 gatherCoords2 = (baseTexelCoords + uvec2(3, 1)) * u_texelDimensions;
	vec2 gatherCoords3 = (baseTexelCoords + uvec2(1, 3)) * u_texelDimensions;
	vec2 gatherCoords4 = (baseTexelCoords + uvec2(3, 3)) * u_texelDimensions;

	vec4 gatheredTexelValues1 = textureGather(u_inputDepthBuffer, gatherCoords1);
	vec4 gatheredTexelValues2 = textureGather(u_inputDepthBuffer, gatherCoords2);
	vec4 gatheredTexelValues3 = textureGather(u_inputDepthBuffer, gatherCoords3);
	vec4 gatheredTexelValues4 = textureGather(u_inputDepthBuffer, gatherCoords4);

	// Now find the min across the 4x4 region fetched, and apply that to the workgroup min
    // using atomicMin.
	vec4 gatheredTexelMins = min(min(gatheredTexelValues1, gatheredTexelValues2),
                                 min(gatheredTexelValues3, gatheredTexelValues4));
	float finalMin = min(min(gatheredTexelMins.x, gatheredTexelMins.y),
                         min(gatheredTexelMins.z, gatheredTexelMins.w));
	atomicMin(s_workgroupMinDepthEncodedAsUint, floatBitsToUint(finalMin));

	memoryBarrierShared();
	barrier();

    // Thread 0 writes workgroup-wide min to image.
	if (gl_LocalInvocationIndex == 0) {
		float workgroupMinDepth = uintBitsToFloat(s_workgroupMinDepthEncodedAsUint);

		imageStore(u_outputDownsampledMinDepthBufferImage,
		           ivec2(gl_WorkGroupID.xy),
                   // imageStore can only be passed vec4, but only a float is stored.
				   vec4(workgroupMinDepth));
	}
}

Performance

I used the above compute shader to process the same full-res depth buffer dimensions as I used when generating the full mip chain (two 1648x1776 eye buffers). It ran in 0.12 ms on an NVIDIA GTX 980 and 0.08 ms on an AMD R9 290. Compared to the time to generate just mip levels 1–4 (0.22 ms on NVIDIA, 0.25 ms on AMD), the compute shader approach was 87% faster on the NVIDIA GPU and 197% faster on the AMD one.

Granted this is not a huge speedup in an absolute sense, but every 0.1 ms counts, especially for VR :)

Follow me on Twitter: @miketuritzin