Computer Graphics TU Braunschweig

Logo CG

 

Home

Team

Research

Teaching

Publications

 


 

GigaVoxels: Ray-Guided Streaming for Efficient and Detailed Voxel Rendering – Cyril Crassin, Fabrice Neyret, Sylvain Lefebvre, Elmar Eisemann - presented by Martin Wahnschaffe

Abstract

Ray casting of huge volumetric datasets is a time and memory consuming process. This article will explain an approach to ray cast large voxel scenes on the graphics processor in real-time. It can deal with several billion voxels, because its main focus lies on an efficient update method for the data needed in graphics memory. The rendering not only considers which data is visible and which is occluded, it also takes account of the needed detail level and filtering.

Table of Contents

1. Introduction

1.1  Basics of Volume Ray Casting

1.2  Advantages & Problems

2. Data Structure

2.1  Node Pool & Brick Pool

3. Rendering

3.1 Ray Casting

3.1.1 Tree traversal

3.1.2 Mipmaps

3.2 Data update

3.3 Rendering output

3.3.1 Compaction

4. Results

5. Summary 

1. Introduction

Volume data based on voxels is often used for scientific purposes, like displaying a computed tomography 3D scan. Besides, voxel engines are capable of rendering cinematic effects with very fuzzy or complex data. An example for this is the visualization of clouds, smoke or boats in Pirates of the Caribbean.

This first chapter will explain the general volume ray cast rendering process with its inherent advantages and problems.

1.1 Basics of Volume Ray Casting

The idea behind ray casting is to cast rays from the viewer’s eye into the scene [2]. In general this scene consists of 3D objects like spheres or more complex objects based on triangles. An alternative to this is the usage of voxels, which can be seen as small cubes filled with an opaque or transparent colour. The important difference to normal 3D geometry is that voxels also form the inner of an object. Therefore the rendering process is called volume ray casting. For each ray an intersection test against all relevant voxels in the scene is done in direction of the ray. The colour of the hit voxels is used to calculate the displayed pixel colour with a transfer function. In difference to ray tracing, ray casting does not work recursively – this means there are no secondary rays as needed for reflection, refraction or shadows.

 

 Fig.1 

Basic volume ray casting.

Take a look at figure 1 to comprehend this process. Here the camera symbolizes the viewer’s eye and the black grid represents the image rendered to your display. The scene consists of voxels which form a grey cloud. For each displayed pixel a ray is cast into the scene and all hit voxels are collected in direction of the ray. The colour and transparency values are used to calculate the resulting colour for the displayed pixel. A ray stops when it leaves the scene or when then accumulated voxel's colour becomes opaque.

1.2 Advantages & Problems

Volumetric data can be used for very detailed and realistic scenes and has an easy regular structure. Additionally, the basic volume ray casting is simple, as you have seen.
Especially the regular structure gives several advantages in contrast to other rendering techniques:

·         Volumetric data is easily manipulable.

·         Filtering methods can be applied in order to avoid aliasing.

·         Well defined level of detail (lod) techniques increase performance.

·         Occlusion culling can be done without difficulty.

·         Procedural data generation allows additional details and quality.

Nevertheless voxels are seldom used in applications that need real time performance. The reason is the huge amount of data needed for a scene built of voxels. As an example consider a 3D scan with a resolution of 2048³ voxels. Each voxel needs 4 bytes for its colour, which sums up to 32 GB of memory required. Rendering that much data is very expensive and takes a lot of CPU time.

Using the graphics processing unit (GPU) instead speeds up the rendering process, but comes with another problem: Transferring 32 GB of data to the graphics card clearly exceeds the available memory and would make real time performance impossible again.

Summarized, the following two problems have to be solved:

1.     Speed up the rendering process.

2.     Find an intelligent way to update the data in the graphics memory.

An important fact that helps to solve these problems is the following: Only a small subset of all data is needed for a given viewpoint, because most voxels are not in view, hidden by others or far away.

 

 Fig.2 

Photograph of a canopy of trees.

A good example is the rendering of a pseudo-surface like the canopy of trees shown in figure 2. From distance it looks like a closed surface. If you come closer, the single leaves will be visible and the former surface becomes a very detailed structure. An intelligent rendering system should be able to handle these requirements. The following chapters will explain the GigaVoxels approach - starting with the underlying data structure.

2. Data Structure

The data structure used for GigaVoxels is primarily a N³-tree. For example, with N=2 this means that every node has eight children (an octree). The tree partitions the whole scene into smaller and smaller parts. The smallest partition unit of a scene is called brick. It contains M³ voxels, where values around 16-32 are commonly used for M, depending on the size of the scene.
A feature of the GigaVoxels approach is that not only the leaves point to bricks, but also each parent node. Their bricks are calculated as sum of the child bricks and form different detail levels, called mipmaps. This allows high quality filtering at low performance costs.

 

 Fig.2 

A N²-tree. Each node points to a brick with a specific detail level [1].

2.1 Node Pool & Brick Pool

The node and brick pools are subsets of the full N³-tree and the bricks held in main memory. They are needed because the GPU memory is too small to hold the complete data. The node pool contains all nodes that are probably needed for the rendering on the GPU. Accordingly, the brick pool holds all bricks needed by the actual node pool. Both pools are kept in the graphics memory and a copy is kept in CPU memory to allow a faster data update.

 

 Fig.3 

Node and brick pool as used by the GPU [1].

As figure 3 shows, all children of a node are grouped. This saves memory, because a node needs just a single pointer to all its children. In addition, each node has a pointer to a brick in the brick pool, which is held in a large 32bit 3D texture. Correspondingly, the node pool is saved into a luminance alpha 3D texture with 64 bit per texel. The 3D structure is necessary, because the grouped child nodes have to be ordered in a 3D manner, based on their location within the parent node. Figure 4 shows the data structure in detail:

 

 Fig.4 

Data structure of a node texel [1].

The luminance part of the texel (orange in figure 4) holds a 30 bit pointer to the child nodes. If this pointer is zero no children are available in the active node pool. A complementary “max subdivision flag” indicates whether this node has children in the full tree or not. The last bit of the luminance part is used for a “node type flag” to determine if the alpha part of the texel (blue in figure 4) contains a constant colour or a brick pointer. The constant colour is an optimization to save memory for bricks that consist of homogenous voxels, for example empty regions.

3. Rendering

The following list of steps is meant to give you a basic understanding of the rendering process and are done each rendering:

1.     Ray casting: Cast a ray for each rendered pixel (GPU)

1.1. Traverse the tree to find a node which contains the ray’s origin and has an appropriate level of detail

1.2. Use the node’s brick pointer or constant colour to calculate the visible colour

1.3. Update the ray’s position and repeat the traversal until the accumulated color is opaque or the ray leaves the scene

1.4. Collect all nodes used during traversal

2.     Rendering output: Transfer the information about all used nodes to the CPU

2.1. Timestamp the used nodes and bricks (CPU)

3.     Data update: Update node and brick pools based on the set timestamps (CPU)

3.1. Add more level of detail to nodes that have recently been used

3.2. Remove the oldest nodes not used to free memory for the new nodes

In the following subchapters this process will be discussed in more detail.

3.1 Ray Casting

At this point it is assumed that all data needed for rendering is present in the graphics memory. In a fragment shader (= pixel shader) a ray is cast for each rendered pixel. The shader program initializes the ray’s origin and direction by using the near plane. A kd-restart [3] like algorithm is used to locate the origin in the N³-tree. It starts each traversal from the root node and therefore does not need a stack. The top-down traversal stops at a node with the needed level of detail. As mentioned in 2.1, this node contains either a constant colour (a) or a brick pointer (b).

a)     Constant colour: Integrate the node’s colour over the intersection length.

b)    Brick pointer: Sample the brick’s voxels (as described in 1.1).

The position where the ray leaves the node serves as new origin for the next iteration. A ray stops when its accumulated colour is opaque or when the scene’s bounding box is left. For the calculation of the final pixel colour pre-integrated transfer functions [4] are used. Additionally, a phase function and phong lighting based on GPU computed normals are applied.

Let us have a closer look at the math behind the tree traversal.

3.1.1 Tree traversal

As mentioned above, each traversal starts in the root node. During the traversal the active node is interpreted as a volume with range [0, 1]³. The ray’s origin x ϵ [0, 1]³ is located within this node. Now it is necessary to find out which child-node n is situated at the origin x.
The needed formula is:

       (1)

Here c points to the grouped children in the node pool of the N³-tree. As explained in 2.1 the child nodes are ordered based on their location within the parent node. This is why the integer parts of (x * N) point to the correct child node.

 

 Fig.5 

Example node group in a node pool of a N² tree with N=2.

 

Since the traversal is not finished yet, it is necessary to update x to the range of the new node.
The appropriate formula is:

      (2)

The new x is the remainder of the (x * N) integer conversion. Now the traversal continues with formula (1) and loops until the needed level of detail is reached.

3.1.2 Mipmaps

What is the correct level of detail? To answer this question another observation is considered: A simple ray casting algorithm uses single voxels – as shown in figure 1. Actually, the part of the scene visible through a pixel is not a single ray, but a whole view cone of rays (figure 6). Therefore, this cone will hit big groups of voxels instead of single voxels, depending on the distance.

 

 Fig.6 

Volume ray casting with a view cone.

To simulate this behaviour with a single ray, the mipmaps that have been introduced in chapter 2 are used. The correct level of detail is reached when each voxel of the actual brick maps approximately to one pixel of the display. Referring to the view cone, this means that the voxel’s size is about the diameter of the cone. Since the bricks are quite large, the best level of detail might change while doing ray-marching within the brick. To follow this issue the last three mipmaps are collected during tree traversal.

A nice side-effect of the mipmap approach is that it automatically adds filtering and therefore a better image quality to the render process. Furthermore, the ray casting will be faster, because the tree traversal mostly stops before reaching a leaf.

3.2 Data update

In 3.1 it was assumed that all needed data is available while rendering on the GPU. In practice this seldom is the case, because the GPU memory is too small for the complete scene data. This leads to the core competence of GigaVoxels – the data update. While the rendering is in progress the tree traversal might come to a node whose children are not available in GPU memory. A data update is needed, which takes five steps:

1.   Cancel the rendering.

2.   Save the current state by locking each finished pixel in the z-buffer.

3.   Tell the CPU which data was missing.

4.   Copy the needed data from CPU memory to GPU memory.

5.   Start a new rendering pass.

These steps are very expensive and therefore should be avoided as often as possible. A way to do this is to take a compromise: If a child node is missing take its parent instead. This means a less detailed brick will be used for the rendering, which results in a worse image quality. In fact, this is not as problematic as it sounds, because data updates are only needed when the visible scene changes. This is especially the case, while the camera is moving fast and therefore the lower details will hardly be remarkable. A stop of the rendering process on the GPU is only done when no higher mipmap level is available or the data is too coarse for the current ray position.

Nevertheless, the data in GPU memory still has to be updated. Whenever data is missing, it should at least be available for the next rendering. To assure this a technique called “last recently used (LRU)” is applied:

 

 Fig.7 

Volume ray casting with a view cone.

While rendering, a list of all used nodes and nodes needing subdivisions is created by the GPU. Next, this list is transferred to the CPU, which documents all used data by setting a timestamp for each node und brick (LRU data in figure 7). These timestamps are used afterwards to update the node and brick pool duplicates in CPU memory – recently used nodes replace the oldest nodes. When the next rendering is started the pool-changes are copied to GPU memory via texture-update calls.

3.3 Rendering output

How is the list of used nodes transferred to the CPU? A 2D RGBA32 render target texture serves as data container, allocating 128 bit for each pixel. A single render target can hold the ids of 4 used nodes for each related ray, including 1 bit as marker for needed subdivision. Tests showed that using 3 render targets for the output is sufficient to handle the needed information. This means that 12 nodes can be reported to the CPU for each rendered pixel and ray.

The number of actually transferred node ids can be further extended by taking advantage of coherence.

 

 Fig.8 

The nodes of 4 rays are summarized. [1]

The two types of coherence exploited here are spatial and temporal coherence:

a.      Spatial coherence means that neighbouring rays are likely to traverse the same nodes. This can be used to effectively save more nodes by grouping rays and sharing their nodes. Instead of creating a node list for each single ray, a node list is created for each group of 2x2 rays. As shown in figure 8, this extends the node lists from 12 to 48 nodes. During traversal the first 12 nodes hit by the 4 rays are saved in the first ray’s list (green), the second 12 in the second ray’s list (red) and so on.

b.      Temporal coherence means that rays are likely to traverse the same nodes as during the last rendering call – especially when the camera has not moved. For example the first 48 nodes traversed in a frame will be nearly the same 48 nodes traversed first in the next frame. To make use of this fact, a window of 48 nodes is shifted over all used nodes. In the first frame, the first 48 node ids are written to the render targets. In the next frame, the 49th to 96th nodes of the traversal are written to the render targets. For scenes with a high node count this might be continued for 1-3 more frames. Following, the shifting restarts at the first 48 nodes.

3.3.1 Compaction

At the current point there are three render targets that together contain a list of 12 nodes in each texel (the coherence optimizations do not change the actual count of nodes per texel). This is a lot of data and will slow down the rendering when transferred from GPU to CPU. In fact the data contains a lot of redundancy, because nodes are likely to be referenced multiple times. It is sufficient for the data update when each used node occurs once in the output. Therefore, it is possible to do some more compaction in order to transfer as few data from GPU to CPU as possible.

The goal now is to find out which nodes are required. To do this a selection mask is generated. This mask uses 32 bit for each ray’s node list and marks which nodes are needed.

 

 Fig.9 

(a) Nodes are not kept (NK) if they occur in a neighbour list. Otherwise they are kept (K). [1]

(b) Neighbours used for the comparison. [1]

The first 12 bits in this mask indicate whether each corresponding node should be kept or not. To identify the referred node list, its texel position is saved in the remaining 20 bits. The decision, whether to keep or not to keep the node, depends on the neighbour lists (figure 9 (b)). The lists are compared and nodes that occur multiple times are kept only once. Figure 9 (a) shows an example: The “yellow” node entry is not kept (NK) because it already occurs in the left neighbour. The top-left “blue” node is kept (K), because the considered neighbours do not reference it. For purposes of comprehensibility this explanation ignores the spatial coherence optimization made in 3.3.

The comparison of the node lists can be speed up by another simplification:

 

 Fig.10 

Comparison is restricted to the elements (i-1, i, i+1) of the neighbouring list. [1]

During the rendering process most nodes will be crossed at the same moment by neighbouring rays. For the created node lists this means that the node will have about the same index in all neighbours. Accordingly, it is a good approximation to compare the node with list index i to the indices i-1, i, i+1 of its neighbour lists.

To build a compact output of all used nodes, based on the generated selection mask, a technique called histogram pyramids [5] is used. In general, histogram pyramids are used to convert a sparse matrix into a point list. Here they are utilised to fill a render target (the point list) with all nodes marked as “kept” in the selection mask (the sparse matrix). Since the selection mask contains mostly 2-3 “kept” entries, one render target gives sufficient space in practice.

This concludes the rendering output. Using the node ids, the CPU can update the node and brick pool for the next rendering as explained in 3.2.

4 Results

All images you see in figure 11 are taken from real-time renderings at a resolution of 512x512 using a Core2 Duo E6600 CPU at 2,4 GHz and an NVIDIA 8800 GTS graphics card with 512 MB memory and G92 GPU. The underlying data structure is mostly an octree and the brick pool is set to a size of 42³ entries, needing about 430 MB of the GPU memory.

 

 Fig.11 

(a) Trabecular bone [6].

(b) Procedural volume [6].

(c) Animal cranium (screenshot of a video) [6].

(d) Human being CT scan [6].

Image (a) shows a scanned volume of a trabecular bone. It has a resolution of 1024³ voxels with 4GB of data, arranged into bricks with 16³ voxels. For the rendered scene the volume is copied multiple times to simulate a resolution of 8192³. With additional perlin noise filtering the scene is rendered at 20-40 frames per second (FPS), without at 60 FPS.

Image (b) is based on a single brick of 81³ voxels, used recursively in a fractal dataset. Here N=3 is used for the tree. Because of the single brick, it does not need any data updates and allows a virtual resolution of 8192³ at around 60 FPS.

Image (c) and (d) show an animal cranium and a CT scan of a human being. Both consist of 2048³ RGBA voxels (brick size M=32) and need more than 32GB memory. This means that missing parts have to be loaded from hard disk. Furthermore, the human being’s data is enriched with perlin noise, making it smoother and more detailed by adding transparency. Nevertheless, the GigaVoxels rendering still achieves 10-20 FPS including the correct visualization of semi-transparent parts.

5 Summary

The GigaVoxels ray casting technique shows a way to efficiently render large voxel scenes on the GPU. The biggest problem of the GPU volume ray casting is the expensive transfer of the needed data to the limited graphics memory. The GigaVoxels approach points out a ray-guided solution to this problem. While rendering a compacted list of all needed scene parts is created. Afterwards, the GPU memory is updated with the presumably needed rendering data for the next frame. The usage of mipmaps allows filtering, speeds up the rendering and saves memory on the graphics card.

The future development of the rendering system will cover shadows and effects like depth of field. Further interesting enhancements are the handling of animations and the integration into triangle scenes, for example in video games.

Bibliography

[1]

C. Crassin, F. Neyret, S. Lefebvre, E. Eisemann: GigaVoxels: Ray-Guided Streaming for Efficient and Detailed Voxel Rendering. ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games 2009.

[2]

K. Engel, M. Hadwiger, J.M. Kniss, A.E. Lefohn, C.R. Salama, D. Weiskopf: Real-time volume graphics. ACM SIGGRAPH 2004 Course Notes.

[3]

D. R. Horn, J. Sugerman, M. Houston, P. Hanrahan: Interactive k-d tree GPU raytracing. ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games 2007.

[4]

K. Engel, M. Kraus, T. Ertl: High-quality pre-integrated volume rendering using hardware-accelerated pixel shading. Visualization and Interactive Systems Group, University of Stuttgart 2001.

[5]

G. Ziegler, A. Tevs, C. Theobalt, H.-P. Seidel: GPU point list generation through histogram pyramids. Proceedings of VMV (2006), pp. 137–141.

[6]

Cyril Crassin Home page. http://artis.imag.fr/Membres/Cyril.Crassin/

Line
TU Braunschweig - Fakultät für Mathematik und Informatik - Computer Graphics - CG Seminar SS 09