A group called Euclideon from Australia has made extraordinary claims about Unlimited Detail technology – the ability to achieve much more detail in a world as you zoom in or out by representing structures in terms of atoms of arbitrary precision. It is widely believed that they use sparse voxel octrees (or similar variants, such as kd-trees) to achieve these properties. Euclideon has been silent about such speculation.
The notion of unlimited detail interests me with regards to scientific visualizations, zoomable user interfaces (ZUI), augmented reality (with glasses), and gaming.
The ZUI principle (my terminology) is that computation costs (CPU, network, and memory) associated with a UI should be roughly proportional to screen real-estate, i.e. independent of world size, number of objects, or precision of available detail. Even logarithmic costs in these other factors are undesirable, though acceptable if necessary. Aiming to achieve the ZUI principle has a profound impact on how you model the world and objects and details. Unlimited detail technology seems to promise the ZUI principle.
Unlimited detail is, of course, ultimately limited by ability to retrieve tiny details in real-time. We must compute color for each pixel in our display very quickly (preferably at least 60 times each second) as we manipulate a camera. Euclideon describes their technology as finding one atom for each pixel. That sounds like ray tracing. Euclideon has claimed it is not ray tracing. Regardless, it involves a lot of lookups, each of which might penetrate many transparent volumes (where we test for an intersection but don’t find one). There is much exploitable parallelism and redundancy, both spatial (aggregate computations) and temporal (caching).
Sparse voxel octrees (SVOs) are one well-known technology that supports effective compression and real-time lookup of very fine details. (This article will assume you are already familiar with how they work.) However, SVOs face at least three significant challenges:
- Compression by SVOs, though excellent for an indexed structure, leaves much to be desired – e.g. pick up your typical chunk of bark and scan it at high precision, compress it into an SVO, and you’ll have megabytes of data. For one piece of bark. It’s wonderfully detailed bark, but you need more than one piece to build a world. Consequently, you’d be tempted to use the same chunk of bark many hundreds or thousands of times to amortize that cost. (You can see this technique, along with tiling of superstructures that repeat a piece of bark again and again, in Euclideon’s example video. Ugh.)
- It is difficult to animate SVOs of unlimited detail. The traditional imperative animation techniques (bones, skinning, etc.) fare poorly at manipulating large numbers of tiny voxels. I’ve often seen comments along lines of: “You’d have to move millions of these atoms dozens of times per second along their correct paths. Instead of a polygon budget, we have an animated atom budget. You can have an amazing 3d landscape so long as the trees never blow in the wind.”
- It seems difficult to leverage GPUs to render many highly detailed SVO structures. SVOs consume much memory, but allow cheap searches for just the details that are actually rendered. GPUs tend to be memory bound, but are able to perform reasonably rich computations on the details that are available in memory (shaders, etc.). I.e. GPUs are a poor match for the challenges SVOs present. Euclideon has avoided GPUs.
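Before addressing those concerns, it may help to make the lookup cost concrete. The core SVO operations can be sketched in a few lines of Python. This is a hypothetical, minimal model – real implementations pack child-presence masks into bits and pool nodes for compression – and all names here are illustrative:

```python
# Minimal sparse voxel octree sketch (hypothetical, illustrative only).
# Each node covers a cube of side `size`; children subdivide it into 8 octants.

class SVONode:
    __slots__ = ("children", "color")
    def __init__(self, color=None):
        self.children = {}   # octant index (0..7) -> SVONode; absent = empty
        self.color = color   # filled leaf color, or None for interior/empty

def octant(x, y, z, half):
    """Which of the 8 child cubes contains (x, y, z)?"""
    return (x >= half) << 2 | (y >= half) << 1 | (z >= half)

def insert(node, x, y, z, size, color):
    if size == 1:
        node.color = color
        return
    half = size // 2
    i = octant(x, y, z, half)
    child = node.children.setdefault(i, SVONode())
    insert(child, x % half, y % half, z % half, half, color)

def lookup(node, x, y, z, size):
    """Descend toward one voxel. Cost is O(log size), and empty
    regions are pruned immediately (the 'sparse' part)."""
    while size > 1:
        half = size // 2
        i = octant(x, y, z, half)
        if i not in node.children:
            return None        # empty space: no atom here
        node = node.children[i]
        x, y, z, size = x % half, y % half, z % half, half
    return node.color
```

The sparseness is what makes the searches cheap: an absent child prunes an entire empty octant, so a query for one on-screen atom touches only a logarithmic number of nodes.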
In this article, I will address all three concerns. I caution the reader that the solutions I present here are untested design hypotheses. I would like to eventually test these with an implementation, but it has lower priority than some of my other hobby work.
Scene Graphs for Rigid Body Animations
Scene-graphs can easily address many common, inflexible animations: scaling, rotation, and translation. We represent the world as a spatially indexed collection of nodes. Each node has a volume at some location. The most useful and common node represents simply a `bounding volume`, which may contain more scene-graph nodes or directly renderable structures (such as an SVO, or a mesh), which exist entirely within that volume. Nodes may overlap arbitrarily – somewhat like map layers, but for 3D space. When nodes overlap, computation is necessary to determine which intersections occur earlier along the camera’s Z axis. Scene-graphs pay off because many nodes (or sub-nodes) do not overlap. Via the spatial index, it is easy to find which bounding volumes are potentially visible from a camera, and thus reduce the search space for objects presented to screen.
In addition to bounding volumes, scene graphs may support a variety of other node types. For lighting, one might specify (for each light) several volumes from which the light is potentially visible such that, at any given point and time, you can find all the lights that might illuminate it. Nodes may exist to support occlusion analysis, fog, or refraction (e.g. for heat mirages or rippling water). These volumes are often bounding boxes or spheres. I am partial to using spheres. They are easy to index, their computations are easier under rotation (a sphere’s bounds never change as it rotates), and they encourage more flexible techniques for modeling walls or long wires (lots of small spheres, etc.).
With scene-graphs, we can rotate, translate, or scale whole nodes – without concerning ourselves about the individual internal nodes. This does introduce some cost to maintain the scene-graph’s spatial index, but it is relatively low if we move objects of coarse granularity. In some cases, we may also alias a scene-graph node and use it in many locations, saving space for fractal or repetitive structures (which are not uncommon in architecture). There is nothing unique to SVOs, here: the same techniques work for SVOs as work for meshes, and the two may even coexist.
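As a small illustration of coarse-grained motion, here is a hypothetical sketch of a sphere-bounded scene-graph node (the names are mine, not from any particular engine): moving a whole subtree touches only the root’s center, and the cheap sphere test prunes invisible subtrees.

```python
# Sketch of a scene-graph node with a sphere bounding volume.
# Hypothetical names, illustrative only.
import math

class Node:
    def __init__(self, center, radius, children=(), payload=None):
        self.center = center            # (x, y, z) in parent coordinates
        self.radius = radius            # bounding sphere radius
        self.children = list(children)  # sub-nodes, in this node's local frame
        self.payload = payload          # e.g. an SVO or mesh reference

def translate(node, dx, dy, dz):
    """Move a whole subtree by moving only the root's center --
    internal nodes are in local coordinates and need no update."""
    x, y, z = node.center
    node.center = (x + dx, y + dy, z + dz)

def potentially_visible(node, cam, max_dist):
    """Cheap sphere test to prune subtrees beyond camera range."""
    return math.dist(node.center, cam) - node.radius <= max_dist
```

Because children live in the parent’s local frame, an aliased subtree can be referenced from many parents at different positions, which is the space savings for repetitive structure mentioned above.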
SVOs themselves may be understood as scene-graphs – albeit, very rigid scene-graphs that enable very tight compression.
Traditional techniques for animating scene-graphs have been applied to SVOs, for example in this animation of an ogre (a skin and bones animation) described in Dennis Bautembach’s 2011 thesis, Animated Sparse Voxel Octrees. While Bautembach’s work is technically impressive (making SVOs work on the GPU, and performing animation via transforms on the GPU), it is a brute force solution and will not scale cheaply to larger worlds or higher levels of detail. Using Bautembach’s approach, we really will have an “animated atom budget”, and we must keep much information on the GPU that will be potentially occluded or exposed by the animation. Bautembach’s solution is very memory-bound, and violates the ZUI principle.
I advocate scene-graphs for moving SVO structures through a world, and for composing the world, but at a relatively coarse granularity – e.g. per character and vehicle, and perhaps a few levels deeper to support component-level configuration (e.g. switching heads and weapons on a stock body, or coolness components for a motorcycle). For performance and scalability, I must propose other solutions for detailed animations.
Spatial-Temporal Data Structures for “Static” Animations
One direct way to animate SVOs is to have a separate SVO for each pose. However, this technique is far too expensive (with respect to storage) to be practical. The difficulty is that, while each SVO compresses fairly well, we’re failing to achieve compression in the temporal dimension. SVO per pose is analogous to Motion JPEG for video – easily an order of magnitude more expensive than video streams that utilize temporal compression. We can take a lesson from video.
It is straightforward to extend SVOs, kd-trees, etc. in the temporal dimension. Each voxel then describes a volume at a particular location in time. To fit the SVO techniques, our SVOs must be bounded in the temporal dimension – i.e. having a clear “center”, just as they do in the spatial dimensions. Thus, there must be a limited number of poses represented. It is useful to generally arrange things so that a series of poses corresponds to a particular animation – e.g. walking, sitting, standing, jumping. Since rotation and translation are handled by other means, they do not need to be addressed – the character would always be at the center of this animated model.
The compression schemes associated with SVOs would then apply directly to the temporal dimension – i.e. if many points in time are filled by identical voxels (which happens when a voxel is not moving, or when identical voxels are flowing through a volume), we can combine those points. Granted, the compression is less efficient due to `spikiness` as we add dimensions. It seems that further compression should be achievable by leveraging that the “camera” is usually centered with respect to time – i.e. analogous to a camera above a 3D scene that looks straight down. But the mathematics I’ll leave to others.
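A minimal sketch of this temporal compression, under my own simplifying assumptions (a dense `frames[t][z][y][x]` input, power-of-two dimensions, and hash-consing as the sharing mechanism; nothing here is Euclideon’s or anyone’s published scheme):

```python
# Sketch: extend octree subdivision to time, giving up to 2^4 = 16
# children per node (x, y, z, t each halved). Identical subtrees are
# deduplicated, so a static voxel costs O(1) across all frames.
# Hypothetical structure, illustrative only.

def build(frames, x, y, z, t, size, tsize, cache):
    """frames[t][z][y][x] -> color. Returns a canonical (shared) node."""
    if size == 1 and tsize == 1:
        node = ("leaf", frames[t][z][y][x])
    else:
        hs = max(size // 2, 1)
        ht = max(tsize // 2, 1)
        kids = tuple(
            build(frames, x + dx * hs, y + dy * hs, z + dz * hs,
                  t + dt * ht, hs, ht, cache)
            for dt in range(min(tsize, 2))
            for dz in range(min(size, 2))
            for dy in range(min(size, 2))
            for dx in range(min(size, 2)))
        if all(k == kids[0] for k in kids) and kids[0][0] == "leaf":
            node = kids[0]            # uniform in space AND time: merge
        else:
            node = ("branch", kids)
    return cache.setdefault(node, node)  # hash-cons: share identical subtrees
```

A voxel that never moves collapses across all frames into a single shared leaf, which is exactly the temporal analogue of an SVO’s spatial merging.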
Aside: isometric projection on the temporal dimension would model motion blur, which can be useful for temporal anti-aliasing. There may be other advantages of being able to angle the camera upon the time axis, e.g. for temporal pattern recognition (verbs and adverbs) and scientific visualization. See Bret Victor’s Up and Down the Ladder of Abstraction for some examples of spatially representing the temporal abstraction.
Many animations would be described by temporal loops within the spatial-temporal model. Scene-graph nodes must, in addition to describing spatial translations, describe time-translation functions applied to sub-structures. We could display the same character/body in many different poses based on temporal translation functions, while only paying once for the storage.
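The time-translation function itself can be trivially small. A sketch, with hypothetical names:

```python
# Sketch of a time-translation on a looping animated model: a
# scene-graph node maps world time into the model's local timeline,
# so one stored walk cycle serves many characters. Illustrative only.

def local_time(world_t, offset, period):
    """Phase-shift, then wrap into the stored loop [0, period)."""
    return (world_t - offset) % period

# Two characters could share one stored walk-cycle model, out of phase:
#   pose_a = lookup_pose(model, local_time(t, 0.0, 1.2))
#   pose_b = lookup_pose(model, local_time(t, 0.3, 1.2))
```

The storage for the cycle is paid once; each additional character costs only a node carrying its own `offset`.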
Euclideon boasts of their ability to convert meshes into detailed atomic structures (which are possibly modeled by SVOs). Something similar can be achieved by converting animations generated by existing tools into spatial-temporal SVOs.
Higher level scene-graphs can also be described and indexed as temporal structures. The compression advantages are weaker, but are still accessible if you plan to reuse many instances of one scene-graph. Also, bounding volumes must be larger to account for the range of motion, which can undermine precision of lookups and cost more CPU – but this can be kept under control; it’s easy to switch techniques where necessary. The primary advantage of temporal scene-graphs is to save CPU costs for predictable, relatively cyclic animations – avoiding efforts to continuously update and repair (or create and garbage collect) the indexes. Storage costs (and potentially bandwidth requirements) will be higher, but we weigh that against the advantage of passively representing a heavily animated world.
By handling the predictable, “static” animations with passive temporal structures, we avoid tapping that “animated atom budget”. Our CPUs can operate at a higher level, and focus upon the reactive and interactive aspects of animation.
Procedural Generation for Compression, Variation, and Data Binding Animation
A promising approach to avoid the repetitive worlds presented by Euclideon is procedural generation. (And here the phrase “procedural generation” should be accepted in broad terms to include any computational paradigm, not just procedural programming.) Procedural generation allows a form of semantic compression, in terms of a high level world-description grammar. Instead of storing a 2 MB SVO description of bark (just in case we look very closely), we store ‘this is a piece of bark’ and a procedure that generates an SVO or mesh for bark at whatever level of detail we need. For variation, we may introduce parameters such as adjectives (e.g. ‘aged, birch bark’), and we can use noise models (e.g. Perlin noise) to create many variations of a theme, as needed.
Similarly, since it would be irritating to specify all the bark in the scene, we could generate the scene and detritus on the fly. Procedural generation can be used to build scene-graphs, meshes, SVOs, and more procedures (i.e. meta-programming). Procedural generation is great: it can save a lot of space, achieve richer variation, and describe our scenes at a much higher level, in a language adaptable (via libraries and extensions) to the artist. Procedural generation has been effectively used for artistic purposes (e.g. POV-Ray and context free art). It has seen real-time use in games like .kkrieger and the more recent Malevolence (which boasts an “infinite” world).
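A toy sketch of the idea, with hypothetical parameters (`age` standing in for an adjective) and a simple hash-based noise in place of Perlin noise:

```python
# Sketch of parameterized procedural generation: instead of storing a
# multi-megabyte bark SVO, store "bark" plus a few parameters and
# derive voxels on demand. Names and formulas are hypothetical,
# illustrative only.
import hashlib

def noise(seed, *coords):
    """Deterministic pseudo-random value in [0, 1) for a lattice point."""
    h = hashlib.blake2b(repr((seed,) + coords).encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big") / 2**64

def bark_voxel(x, y, z, seed=42, age=0.5):
    """Density for one bark voxel. Because the field is a pure function
    of coordinates and seed, any level of detail can be sampled on
    demand -- and regenerated identically later."""
    ridge = noise(seed, x // 8, y // 8)   # coarse vertical ridges
    grain = noise(seed + 1, x, y, z)      # fine surface grain
    return 0.6 * ridge + 0.4 * (1.0 - age) * grain
```

Varying `seed` yields many distinct pieces of bark from one description; varying `age` illustrates the adjective-as-parameter idea.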
Taken to its extreme, and in an ideal world, procedural generation would by itself be sufficient. However, procedural generation is often ad-hoc, with complex dependencies, hence difficult to parallelize. And computing one pixel often requires computing many neighboring volumes and pixels, so we should not waste that effort. Thus, procedural generation “renders” to an intermediate scene-graph with meshes or SVOs and the ability to obtain more information as needed.
An interesting advantage of procedural generation is that it can be bound easily to external data resources – e.g. records of weather and stocks, traffic reports, open street map, directories and websites. Such information sources can serve as a basis for structured noise and parameters. Time-varying information becomes a basis for slow animations (e.g. a generated city might change over time) without need for randomness or continuous computations. A lot of real world data is stable, or relatively so, and the occasional vast changes can make for interesting player experiences.
Procedural generation for unlimited detail does face several challenges:
- stability on regeneration – Often procedural generation is used because we cannot afford to keep the full renderable version of the world in memory, nor even on disk. We must garbage collect parts of the detailed world that we expect we won’t be visiting again soon. (We can use a number of heuristics to judge what to GC, such as direction of the character and least-recently-used.) It is important, when we do revisit a location, that it be consistent with its prior version – change can be tolerated, but should be explainable.
- consistency with neighbors – Often the structure of a region will depend on the structure of neighboring regions. Roads and rivers must align, for example. And weather effects must often taper off gently – e.g. patches of snow near the snowy mountain. The difficulty is achieving this consistency without generating the whole world up front, and without losing stability.
- progressive detail – If a player stands in a tower and looks down upon an opulent garden, we cannot afford to procedurally generate every blade of grass, every cobblestone, every rose petal in the whole (vast) range of view. However, we also cannot afford for dominant colors to “pop in” or shift when we move closer to the garden – i.e. we need to know the dominant colors, and especially colors with strong contrasts, even if those colors are provided by many very small structures. We also can’t have larger structures popping in.
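The stability concern suggests a common mitigation (my sketch, not anyone’s published scheme): derive each region’s generator seed deterministically from a world seed and the region’s coordinates, so a garbage-collected region regenerates identically when the player returns.

```python
# Sketch of the stability property: each region's generator is seeded
# by a hash of (world seed, region coordinates), so regeneration after
# GC is bit-identical. Hypothetical scheme, illustrative only.
import hashlib, random

def region_seed(world_seed, rx, ry, rz):
    h = hashlib.sha256(f"{world_seed}:{rx}:{ry}:{rz}".encode())
    return int.from_bytes(h.digest()[:8], "big")

def generate_region(world_seed, rx, ry, rz):
    rng = random.Random(region_seed(world_seed, rx, ry, rz))
    # ... place trees, rocks, detritus using rng ...
    return [rng.random() for _ in range(4)]  # stand-in for region content
```

Neighbor consistency can build on the same trick: a shared feature such as a river spline crossing a boundary can be derived from the seeds of both adjacent regions, so each side computes the same boundary without generating the whole world up front.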
Ideally, artists could specify as much or as little detail as they wish, to any depth, with hard and soft preferences. I feel that constraint logic is a good match for high-level descriptions of worlds and the occasional low-level detail. Stability and consistency are easy to achieve with constraint logic, while allowing artists to provide just the details that interest them (test rendering and tweak constraints to satisfaction).
Progressive detail, OTOH, seems difficult to model. There is similarity to those “bounding volumes” for scene-graphs – i.e. once it is decided something is inside the bounding-volume, it cannot be placed outside without breaking the rendering algorithms. I would like a similar constraint for dominant colors – i.e. once it is decided that a structure has a dominant color at a given angle and level of detail, it cannot be changed without breaking the user experience. Unfortunately, while it’s easy to enforce bounding volumes (e.g. by a type system or assertion), it does not seem so easy to enforce “dominant colors” that result from many very small objects (without actually computing the result to the finest level of detail). Lacking more effective answers, I would leave this property to discipline, foresight, and systematic testing.
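One partial mitigation for dominant colors – a standard prefiltering idea, not a full answer to the enforcement problem above – is to compute and store an aggregate color at every interior node when the structure is built, so a coarse, distant lookup stops early yet returns exactly what the fine details average to. A sketch, assuming equal-sized children as in an octree (names are illustrative):

```python
# Sketch: precompute an average color per interior node, so coarse
# (distant) lookups return what the fine detail averages to -- the
# coarse answer cannot "pop" because it was derived from the fine one.
# Assumes equal-sized children (true for octrees); illustrative only.

def prefilter(node):
    """node: ('leaf', (r, g, b)) or ('branch', [children...]).
    Returns (annotated_node, average_color)."""
    if node[0] == "leaf":
        return node, node[1]
    annotated, colors = [], []
    for child in node[1]:
        a, c = prefilter(child)
        annotated.append(a)
        colors.append(c)
    avg = tuple(sum(channel) / len(colors) for channel in zip(*colors))
    return ("branch", annotated, avg), avg
```

This handles averages well; genuinely *dominant* high-contrast features (the blinking red light) would still need a separate aggregate, e.g. a small per-node palette of extremes.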
Procedural Generation of Spatial-Temporal Structures
Procedural generation has been explored in the spatial dimension – i.e. nouns. And, even restricted to spatial models, it seems to be a field where any determined and interested developer could achieve vast advances (much low hanging fruit). Procedural generation has also been explored in some temporal structures, such as voice and music. An area that does not appear to have been explored much is procedural generation of spatial-temporal structures.
It is feasible to directly generate temporal scene-graphs and SVOs as we need them.
The challenge, as above, is to do so while achieving stability, consistency, and progressive detail. The stability aspect means that animations should be regenerable as if they had been active the entire time (a living world; the world isn’t in stasis when we aren’t looking). Consistency means the animations must align with each other smoothly. Regarding progressive detail, it should be acceptable to render fewer intermediate frames for objects that lack user focus – so long as we don’t skip anything that should have been obvious (e.g. a high-contrast blinking red light).
At the moment, I am not aware of a good programming language for describing such spatial-temporal structures. Natural languages, such as English, certainly have extensive vocabularies for describing things and their actions together (verbs, adverbs). But programming languages tend to separate them – e.g. create a spatial structure, then animate it by some entirely different mechanism. For example, we might create a mesh (spatial), add some bones to it (spatial), then use scripts to describe motion of those bones. This separation is useful for flexible animation, but can be awkward. And that flexibility often has a cost for later computation.
Reactive programming models – functional reactive programming, synchronous reactive, and my reactive demand programming – are not the solution. They, too, separate space and time, and tend to fare poorly for describing continuous animations. Reactive models offer an effective way to access and leverage temporal scene-graphs, just not to generate them under the constraints of procedural generation, or to store them efficiently.
Even without great ways to procedurally generate spatial-temporal structures on-the-fly, we can still use less effective ways – e.g. reconstructing and explicitly caching animated scene-graphs or SVOs when we need them. That’s probably where I’d start, and build a language or DSL for animated scene-graph description as it becomes more obvious what would be useful.
Leveraging the GPU
The design elements I’ve been presenting rely even more heavily on indexed data structures. I’m aiming to represent motion with passive structure, such that rendering becomes an almost uniform process involving pixel lookups. Given my preference for reactive programming, by ‘pixel lookups’ I mean continuous queries with reactive semantics. The lookups themselves would be on semi-lazy, procedurally generated structures that present a view as a scene-graph. The structure can keep details it will need, and garbage-collect those details it believes it will not need (to recompute later, using stable model mathematics). Continuous lookups offer an advantage for deciding what to cache, anticipating what details will be needed in the future, and supporting parallelism.
To the extent we need big indexes, a lot of ad-hoc lazy procedural generation, cache management, and binding of external data resources, the GPU seems to be a poor fit. If we’re getting precise pixels from remote models, the GPU lacks the ability to query for them on demand. The GPU forces us to have more of the model built up front and cached in GPU memory, which doesn’t scale well to very small details or very large worlds.
But GPUs are still powerful computing devices, and we should leverage them.
Reactive dataflow models actually fit very well into the GPU architecture, and it is quite feasible to model the GPU as performing parts of an FRP or RDP sequence that composes scene-graph elements into a coherent rendered image. In the case of RDP, there may be `side` outputs – demands on other behaviors – which can be maintained over frames. Conversely, from a reactive model (FRP, Sirea, etc.) one could leverage a GPU implicitly by use of packages like Haskell’s accelerate, so long as we can express a sufficiently large portion of a program as array computations to be worth the overhead of compiling CUDA code and managing memory.
Much of that lazy, ad-hoc procedural generation could be performed on GPU, even compressed and prepped for indexing, then stored – all transparent to someone who is performing the lookup. (Or we could be more explicit about it, and have the lookup for some bounding nodes return procedures.)
Outside of rendering, GPUs and like devices are also good for physics modeling and simulations, i.e. where most strong interactions are local. By comparison, light is all about action-at-a-distance: it bounces, reflects, refracts, and primarily influences intelligent entities very far from the object off of which it bounced. The use of GPUs for rendering (given that GPUs excel at local computations) is decidedly awkward.
I see the role of GPUs shifting more towards modeling, simulation, and procedural generation, while the role for rendering gets pushed to large, mostly passive indexes and flexible processors.