An interpretable, data-efficient, and scalable neural scene representation.
We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent scenes as continuous functions that map world coordinates to a feature representation of local scene properties. By formulating image formation as a neural, 3D-aware rendering algorithm, SRNs can be trained end-to-end from only 2D observations, without access to depth or geometry. SRNs do not discretize space: they parameterize scene surfaces smoothly, and their memory requirements do not scale directly with spatial resolution. This formulation naturally generalizes across scenes, learning powerful geometry and appearance priors in the process.
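At its core, the scene representation is simply a fully connected network that can be queried at any continuous 3D location. Below is a minimal PyTorch sketch of such a function; the layer widths and depth are illustrative placeholders, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class SceneRepresentation(nn.Module):
    """Continuous scene function: maps 3D world coordinates to feature vectors
    describing local scene properties. Illustrative sketch only."""
    def __init__(self, feature_dim=256, hidden_dim=256, num_layers=4):
        super().__init__()
        layers = [nn.Linear(3, hidden_dim), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, feature_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, coords):
        # coords: (..., 3) world coordinates -> (..., feature_dim) local features
        return self.net(coords)

# The function can be queried at arbitrary continuous points -- no voxel grid,
# and memory is set by the network size, not the scene resolution.
phi = SceneRepresentation()
points = torch.rand(1024, 3)   # any set of world coordinates
features = phi(points)         # (1024, 256) feature vectors
```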
SRNs explain all 2D observations in 3D, leading to unsupervised, yet explicit, reconstruction of geometry jointly with appearance. The reconstructed geometry can be visualized via normal maps, which makes SRNs interpretable. On the left, you can see the normal maps of the reconstructed geometry - note that these are learned fully unsupervised! In the center, you can see novel views generated by SRNs, and on the right, the ground-truth views. This model was trained on 50 2D observations each of ~2.5k cars from the ShapeNet v2 dataset.
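For illustration, one way to obtain such a normal map is to take finite differences of the per-pixel 3D surface points found by the renderer and cross them; the sketch below assumes this scheme, which may differ in detail from the exact computation behind the figures.

```python
import torch
import torch.nn.functional as F

def normals_from_surface_points(points):
    """Estimate per-pixel surface normals from a map of 3D surface points
    (e.g. the points where the ray marcher stopped). Sketch / assumption:
    normals via finite differences and a cross product.

    points: (H, W, 3) world-space surface points, one per pixel.
    """
    dx = points[:, 1:, :] - points[:, :-1, :]   # horizontal neighbor differences
    dy = points[1:, :, :] - points[:-1, :, :]   # vertical neighbor differences
    normals = torch.cross(dx[:-1], dy[:, :-1], dim=-1)
    return F.normalize(normals, dim=-1)          # (H-1, W-1, 3), visualize as RGB

surface_points = torch.rand(64, 64, 3)           # stand-in for ray-marcher output
normal_map = normals_from_surface_points(surface_points)
```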
SRNs generate images without using convolutional neural networks (CNNs) - pixels of a rendered image are connected only via the 3D scene representation and can be generated completely independently. SRNs can thus be sampled at arbitrary image resolutions without retraining, and naturally generalize to completely unseen camera transformations. The model that generated the images above was trained on cars, but only on views at a constant distance to each car - yet it flawlessly handles zoom and camera roll, even though these transformations were entirely unobserved at training time. In contrast, models with black-box neural renderers fail entirely to generate such novel views.
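To make the per-pixel rendering concrete, here is a simplified sketch: it replaces the learned, LSTM-based ray marcher of the paper with a fixed number of uniform steps and uses placeholder network widths. The point it illustrates is that each pixel's color depends only on its own camera ray and the shared 3D representation, so the image resolution is simply the number of rays you choose to cast.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the learned modules (widths are placeholders):
phi = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 256))              # scene function
pixel_generator = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 3))  # per-pixel MLP

def render(cam_pos, ray_dirs, num_steps=10, step_size=0.1):
    """Render every pixel independently by marching along its camera ray.
    Simplification: uniform steps stand in for the learned ray marcher,
    which predicts each step length from the queried features."""
    depth = torch.full((ray_dirs.shape[0], 1), 0.5)   # initial depth guess per ray
    for _ in range(num_steps):
        points = cam_pos + depth * ray_dirs           # current 3D sample on each ray
        features = phi(points)                        # query the scene representation
        depth = depth + step_size                     # fixed step (placeholder)
    return pixel_generator(features)                  # (num_pixels, 3) RGB, no CNN

# Any resolution, any camera pose: just cast the corresponding set of rays.
cam_pos = torch.tensor([0.0, 0.0, -2.0])
ray_dirs = F.normalize(torch.randn(128 * 128, 3), dim=-1)
image = render(cam_pos, ray_dirs).reshape(128, 128, 3)
```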
By generalizing over a class of scenes, SRNs enable few-shot reconstruction of both shape and appearance - a car, for instance, may be reconstructed from only a single observation, enabling novel view generation that is almost perfectly multi-view consistent.
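Conceptually, few-shot reconstruction freezes the learned prior and optimizes only a per-scene latent code to explain the given observation(s). The sketch below substitutes a hypothetical toy decoder for the full SRN plus neural renderer and ignores camera poses; it is only meant to show the optimization loop.

```python
import torch
import torch.nn as nn

latent_dim = 256
# Hypothetical stand-in decoder: latent code -> 64x64 image. In an SRN, this role
# is played by the hypernetwork, scene function, and neural renderer together.
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                        nn.Linear(512, 64 * 64 * 3))

def reconstruct_from_single_view(observed_image, num_iters=500, lr=1e-3):
    """Freeze the prior (decoder weights) and optimize only the latent code of
    the new scene against a single 2D observation."""
    z = torch.zeros(1, latent_dim, requires_grad=True)      # new scene's latent code
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        rendered = decoder(z).reshape(1, 64, 64, 3)
        loss = ((rendered - observed_image) ** 2).mean()     # 2D re-rendering error only
        loss.backward()
        optimizer.step()
    return z.detach()   # encodes shape and appearance; render novel views from it

observed = torch.rand(1, 64, 64, 3)                          # the single observation
z_new_car = reconstruct_from_single_view(observed)
```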
Because surfaces are parameterized smoothly, SRNs naturally allow for non-rigid deformation. The model above was trained on 50 images each of 1000 faces, where we used the ground-truth identity and expression parameters as latent codes. Each identity was observed with only a single facial expression. By fixing identity parameters and varying expression parameters, SRNs allow for non-rigid deformation of the learned face model, effortlessly generalizing facial expressions across identities (right). Similar to the cars and chairs above, interpolating latent vectors yields smooth interpolation of the respective identities and expressions (left). Note that all movements are reflected in the normal map as well as in the appearance.
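The interpolation itself is straightforward: latent codes are blended linearly and each intermediate code is rendered. The identity/expression code sizes and their concatenation in the sketch below are assumptions made purely for illustration.

```python
import torch

def interpolate_codes(code_a, code_b, num_steps=10):
    """Linear interpolation between two latent codes; rendering the intermediate
    codes yields smooth transitions between the corresponding scenes."""
    alphas = torch.linspace(0.0, 1.0, num_steps).unsqueeze(-1)
    return (1 - alphas) * code_a + alphas * code_b   # (num_steps, code_dim)

# Hypothetical identity and expression codes (dimensions are placeholders):
identity_a, identity_b = torch.randn(64), torch.randn(64)
expression_smile = torch.randn(64)

# Expression transfer: keep the expression code fixed, swap the identity code.
code_a = torch.cat([identity_a, expression_smile])
code_b = torch.cat([identity_b, expression_smile])
for code in interpolate_codes(code_a, code_b):
    pass  # feed `code` to the face SRN and render one frame per step
```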
Here, we show first results for inside-out novel view synthesis. We rendered 500 images of a Minecraft room and trained a single SRN with 500k parameters on this dataset.
Neural Scene Representations. Computer vision has developed many different mathematical models of our 3D world, such as voxel grids, point clouds, and meshes. Yet, feats that are easy for a human - such as inferring the shape, material, or appearance of a scene from only a single picture - have eluded algorithms so far.
The advent of deep learning has given rise to neural scene representations. Instead of relying on a hand-crafted representation, these approaches learn a feature representation from data. However, many of them do not explicitly reason about geometry and thus do not account for the underlying 3D structure of our world, making them data-inefficient and opaque.
The trouble with voxel grids. Recent work (including our own) explores voxel grids as a middle ground. Features are stored in a 3D grid, and view transformations are hard-coded to enforce 3D structure. Voxel grids, however, are an unlikely candidate for the "correct" representation, as they require memory that scales cubically with spatial resolution. This is acceptable for small objects, but doesn't scale to larger scenes. Moreover, voxel grids do not parameterize scene surfaces smoothly, and priors on shape are learnt as joint probabilities of voxel neighborhoods.
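A quick back-of-the-envelope calculation makes the cubic scaling concrete (the feature dimension of 64 and 4 bytes per float are arbitrary example values):

```python
# Memory of a feature voxel grid grows cubically with resolution, while an MLP
# scene representation has a fixed parameter count independent of resolution.
feature_dim = 64        # example value
bytes_per_float = 4

for res in [32, 64, 128, 256]:
    gigabytes = res ** 3 * feature_dim * bytes_per_float / 1e9
    print(f"{res}^3 grid: {gigabytes:.2f} GB")
# 32^3:  0.01 GB
# 64^3:  0.07 GB
# 128^3: 0.54 GB
# 256^3: 4.29 GB  -- doubling the resolution multiplies memory by 8x.
```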
SRNs. With SRNs, we take steps towards a neural scene representation that is interpretable, allows the learning of shape and appearance priors across scenes, and has the potential to scale to large scenes and high spatial resolutions.