Obsolete. This documentation refers to either a deprecated system or an unsupported function or feature. This will be removed in the 3.7.x series.
With the Photobump tool you can create high resolution normal maps from uncalibrated photos taken with an off-the-shelf hand-held digital camera. The tool relies on two images and an automated system, where the only required user input is selecting the first image of the sequence. The method first reconstructs the underlying 3D object geometry using a novel dense matching diffusion algorithm, starting from an initial sparse set of automatically detected matching points. Normal map reconstruction examples are shown and reconstruction results are compared with state-of-the-art results in computer vision research. Limitations and possible improvements are discussed as well.
For a more specific tutorial on using the Photobump tool, please see the Photobump tutorial.
Figure 1: The normal map (bottom) is automatically generated from the two rock photos above.
Figure 2: The untextured 3D model (bottom left) is reconstructed from the two images shown in the top row. The bottom right image shows the same 3D model as seen from the recovered virtual camera position where the second shot was taken (top right).
Bump maps are usually created by artists, using real-world photo references, to create the illusion of certain surfaces or surface structures. Simply converting a color image to grayscale, as was done in the past, does not work, because the bright and dark regions of the image have no correspondence whatsoever to the underlying surface shape. Therefore, to obtain an accurate normal map, artists create two versions of the same model from a reference picture: a very high-polygon version, in the order of several hundred thousand polygons, and a low-polygon version displayed at runtime, often consisting only of a flat surface for all environmental art (walls, grounds, rocks, etc.). A separate tool, such as Crytek's Polybump, creates the normal map from the two models; it is then used for real-time per-pixel lighting calculations over the low-poly model, to simulate the original high-poly model. Accuracy depends on normal map quality: the more detail (and hence the more polygons) put into the high-poly model, the more detailed the resulting normal map will be. The process of hand-modeling the high-poly model and creating the normal map is very time consuming.
This tool creates normal maps directly from real world photo references. It requires taking two separate pictures of the surface of interest at slightly different locations. Pictures don't have to be calibrated, i.e. they can be taken with a hand-held digital camera, with no prior knowledge of the camera parameters required. The system will first generate the underlying high-poly 3D mesh of the model (in the order of millions of polygons), and then project it onto a simplified surface to generate the normal map.
The process for generating normal maps is roughly subdivided into 3 major steps (although this is all transparent to the user): calibration, 3D reconstruction and normal map extraction.
The first step consists of three sub-steps: detecting interesting features in both images, matching them, and simultaneously calibrating the cameras and the initial seed points. The tutorial assumes that the user has taken two separate photos of the surface of interest at slightly different locations and angles (please read the user manual for more details about this). For the tool to automatically load the pictures, they should be given sequential numbers, for example: 'Surface01.jpg', 'Surface02.jpg' and they should reside in the same folder. Digital cameras automatically assign sequential numbers to photos.
Run the Photobump tool (Photobump20.exe) and load the first image of the sequence into the tool (File->Open):
This opens a standard Windows browsing interface. Locate and click on the first image of the sequence, which in the case above would be 'DSC_4940.jpg'. Upon selecting the image, the Reconstruction Dialog Options will pop up.
Figure 3: Reconstruction dialog options.
In the dialog you can change the following values or leave the defaults:
Click Ok and the process will start. The first time the sequence is loaded and calibrated, Photobump will generate some binary files in the same folder for image segmentation, 3D points and camera calibration. After the image has been reconstructed once, the above values can be changed much more quickly by reloading the first image of the sequence: Photobump will find the saved binary files in the folder, skip steps 1 and 2, and proceed directly to normal map extraction. The first run takes about 10 minutes to process two images at a resolution of 2240x1536 on a 2.4 GHz machine.
Starting from the two images, the first step consists in relating them to each other. This is not an easy problem. A restricted number of corresponding points is sufficient to determine the geometric relationship between the images. Since not all points are equally suited for matching (e.g. a pixel in a homogeneous region), the first step consists of selecting a number of interesting points, or feature points. To detect the features, first convert the images to greyscale to minimize (constant) changes in illumination, then use the algorithm proposed by Tomasi and Kanade in 'Detection and Tracking of Point Features' (1991). To avoid corners due to image noise, first smooth the images with a bilateral filter (see the article Bilateral Filtering for Gray and Color Images for more information).
Figure 4: Two images showing extracted corners.
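The following sketch illustrates this corner-detection step with OpenCV and NumPy; it is not the Photobump source, and the file names and filter parameters are placeholder assumptions:

```python
# Illustrative corner detection (not the Photobump source).
import cv2

def detect_corners(path, max_corners=1000):
    # Convert to greyscale to minimise (constant) changes in illumination.
    grey = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)

    # Bilateral filter: smooths image noise while preserving edges, so that
    # spurious corners caused by noise are suppressed.
    smooth = cv2.bilateralFilter(grey, 9, 75, 75)

    # Tomasi-Kanade / Shi-Tomasi "good features to track" corner detector,
    # capped so only the strongest features are kept.
    corners = cv2.goodFeaturesToTrack(smooth, max_corners, 0.01, 10)
    return smooth, corners.reshape(-1, 2)

# Placeholder file names following the sequential naming convention.
img1, corners1 = detect_corners("Surface01.jpg")
img2, corners2 = detect_corners("Surface02.jpg")
```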
Feature points are compared and a number of potential correspondences are obtained. From these the multi-view constraints can be computed. Keep the initial number of features below 1000, in order not to detect features that aren't strong enough. Use a radius of 5 pixels around each interesting point to match the features against the other image, and try a variety of orientations and scales. The majority of points will not be matched between the two photos, and since the correspondence problem is ill-posed, the set of corresponding points is typically contaminated with a number of wrong matches, or outliers. This initial set of matching points is too small to start the diffusion process for dense matching and to calibrate the cameras. You should boost the number of matches by matching points along the edges, using this initial set of points to guide the search along the edges. Extract edges from the images by applying a Canny edge detector, and then calculate the unit normal orientation vector of each pixel detected as an edge.
Figure 5: Edge detection example.
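A minimal sketch of the edge-normal idea, assuming the greyscale images from the previous step: Canny provides the edge pixels, and the normalized image gradient at each edge pixel serves as its unit normal, since the gradient is perpendicular to the edge. The thresholds are illustrative, not the tool's values:

```python
# Illustrative edge-normal extraction (not the tool's code).
import cv2
import numpy as np

def edge_normals(grey):
    edges = cv2.Canny(grey, 50, 150)                  # binary edge map
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    mag = np.sqrt(gx * gx + gy * gy) + 1e-6           # avoid division by zero
    normals = np.dstack((gx / mag, gy / mag))         # per-pixel unit normal
    ys, xs = np.nonzero(edges)                        # coordinates of edge pixels
    return list(zip(xs, ys)), normals

# img1, img2 are the smoothed greyscale images from the previous step.
edge_pixels1, normals1 = edge_normals(img1)
edge_pixels2, normals2 = edge_normals(img2)
```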
Now, recursively match points along the edges by matching their corresponding normal vectors. This usually boosts the number of matched points by a factor of 10 and gives enough points for the calculation of the fundamental matrix using a robust RANSAC algorithm.
Figure 6: Epipolar geometry. Corresponding features must lie on corresponding epipolar lines (for clarity only a few lines are shown here).
Using the fundamental matrix you can recover the epipolar geometry. All points that are not within 1 pixel of the corresponding epipolar line are considered false matches and are rejected. The remaining points will be used for camera calibration.
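For illustration, this robust epipolar step could look like the following OpenCV sketch, which estimates the fundamental matrix with RANSAC and then rejects matches lying more than 1 pixel from their epipolar line; pts1 and pts2 stand for the matched point arrays built in the previous steps and are an assumption, not the tool's code:

```python
# Illustrative robust epipolar filtering with OpenCV.
import cv2
import numpy as np

def filter_by_epipolar_geometry(pts1, pts2, max_dist=1.0):
    pts1 = np.float32(pts1)
    pts2 = np.float32(pts2)

    # RANSAC tolerates the outliers contaminating the initial matches.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

    # Epipolar lines in image 2 corresponding to the points of image 1.
    lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F)
    lines2 = lines2.reshape(-1, 3)                     # each line as (a, b, c)

    # Distance of each point in image 2 from its epipolar line:
    # |a*x + b*y + c| / sqrt(a^2 + b^2).
    hom = np.hstack([pts2, np.ones((len(pts2), 1), np.float32)])
    dist = np.abs(np.sum(lines2 * hom, axis=1))
    dist /= np.sqrt(lines2[:, 0] ** 2 + lines2[:, 1] ** 2)

    keep = (dist <= max_dist) & (inlier_mask.ravel() == 1)
    return F, pts1[keep], pts2[keep]
```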
You are now left with 2D point correspondences and the epipolar geometry. Calibration when no internal camera parameters and no 3D points in the scene are known is called self-calibration, and relies upon the projection of the absolute conic onto the images; more details can be found in this article on Self-Calibration & Metric Reconstruction. It is, however, considered numerically hard and requires at least 3 images to upgrade to a Euclidean reconstruction.
Instead, solve the problem using the following approach. Since the model has an unknown orientation in space (you don't know the real location and orientation from which the photo was taken), place the first camera at the origin (x=0, y=0, z=0), looking down the Z axis. You don't know the internal camera parameters (mainly the focal length) used to take the first photo, but you do know that they didn't change in the second photo. Therefore, you transform the first camera into a standard OpenGL-like camera, transforming the focal length into an arbitrary field of view. The idea is that if you had a 3D model of the surface, there would exist a pair of cameras that would have produced the two photos of that model. You also don't know the position in space of the 2D feature points; however, they project along a line of sight from the camera center C. Therefore, by back-projecting a set of points from the first camera, you extract two planes in 3D space; the intersection of these two planes gives the baseline CC', i.e. the line connecting the two cameras on the epipolar plane. From epipolar geometry theory, you know that the second camera is positioned somewhere on the epipolar plane along the baseline CC' (see Figure 6).
Since you don't know the real scale of the photo subject (it could be several meters tall or only a few centimeters), position the second camera at an arbitrary distance from the first along the baseline.
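As a hedged illustration of the geometry involved (not the tool's implementation), the baseline direction can be derived from the fundamental matrix: the epipole in the first image is the null vector of F and is the projection of the second camera centre, so back-projecting it through the assumed intrinsics K gives the direction along which the second camera is placed:

```python
# Illustrative baseline recovery from the fundamental matrix F and an
# assumed pinhole intrinsic matrix K (built from the arbitrary field of view).
import numpy as np

def baseline_direction(F, K):
    # The right null vector of F is the epipole e in image 1 (F @ e = 0),
    # i.e. the projection of the second camera centre into the first image.
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]                                  # homogeneous coordinates

    # Back-project the epipole: the ray from the first camera centre through
    # K^-1 e contains the second camera centre, i.e. it is the baseline CC'.
    direction = np.linalg.inv(K) @ e
    return direction / np.linalg.norm(direction)

# The real scale is unknown, so the second camera is simply placed at an
# arbitrary distance (here 1.0) from the first along this direction.
# C2 = baseline_direction(F, K) * 1.0
```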
You are now left with the problem of finding the orientation (i.e. the view angles) of the second camera. You do that by hierarchically matching the epipolar lines on screen, with a refining step until the error value falls below a threshold. Once the cameras and the initial 3D structure have been obtained, perform a bundle adjustment process to refine the data. Typical bundle adjustment theory requires solving a global minimization problem via Levenberg-Marquardt. Instead, use again an iterative process, as you did for computing the camera orientation, this time using only the points closest to the center of the picture, where radial distortion is lower. You can iteratively refine the focal length of the cameras (to account for slight changes due to auto-focus), their orientation and the position of the initial 3D structure. The results are shown in Figure 7.
Figure 7: Example of Photobump OpenGL preview of the initial reconstruction. The green points are the features matched in both photos now reconstructed in 3D space. The recovered cameras are shown in brown, with their orientation and field of view shown in white. The white lines show the bounding box surrounding the object.
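For reference only, the sketch below shows the textbook reprojection-error objective that any such refinement tries to minimize, here fed to SciPy's least-squares solver for brevity; Photobump itself uses the iterative scheme described above, and every name in this snippet is an illustrative assumption:

```python
# Textbook reprojection-error objective, for reference only.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(pts3d, rvec, tvec, focal, principal_point):
    R = Rotation.from_rotvec(rvec).as_matrix()
    cam = pts3d @ R.T + tvec                    # world -> camera frame
    return focal * cam[:, :2] / cam[:, 2:3] + principal_point

def residuals(params, n_pts, obs1, obs2, principal_point):
    focal = params[0]                           # shared focal length (same camera)
    rvec, tvec = params[1:4], params[4:7]       # pose of the second camera
    pts3d = params[7:].reshape(n_pts, 3)        # sparse 3D structure
    r1 = project(pts3d, np.zeros(3), np.zeros(3), focal, principal_point) - obs1
    r2 = project(pts3d, rvec, tvec, focal, principal_point) - obs2
    return np.concatenate([r1.ravel(), r2.ravel()])

# x0 packs [focal, rvec, tvec, flattened 3D points]; obs1/obs2 are the Nx2
# arrays of matched feature positions in the two photos.
# result = least_squares(residuals, x0, args=(n_pts, obs1, obs2, pp))
```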
The previous step delivers a sparse surface model based on distinct feature points. However, this is not sufficient to reconstruct geometrically correct and visually pleasing surface models. This task is accomplished by our dense matching hierarchical diffusion algorithm, which estimates correspondences directly from the photos by exploiting epipolar geometry and additional constraints. Stereo matching is a problem that has been studied for several decades in computer vision, and many researchers have worked on solving it. The stereo matching problem can be solved much more efficiently if the images are rectified. This step consists of transforming the images so that the epipolar lines (Figure 6) are aligned horizontally, which is the case when the images are taken with a calibrated stereo camera rig. Here, however, users take photos with a normal hand-held camera. Therefore, after self-calibrating the cameras in step 1, you could transform the image planes so that they are aligned in space, and rectify them for radial distortion as well.
Instead, move efficiently by propagating and matching along the epipolar lines (Figure 6), so the images don't have to be rectified. The system uses a novel hierarchical diffusion algorithm. In order to recover the 3D structure, the approach is to recover a dense inverse depth map (i.e. recover a depth value for each pixel) from the first camera view. The initial seeds are the recovered 3D points (Figure 7), which are added directly to the depth map. Each pixel propagates into its 3x3 neighborhood, looking for corresponding matches in the other image. The error measure used to verify a match is a robust normalized cross-correlation over a rectangular block of 5x3 pixels oriented along the epipolar lines. All the directions around each pixel are tested, and the one with the highest normalized cross-correlation score is taken, as long as it is higher than a threshold.
In fact, the approach uses an iterative threshold process, starting from a high threshold value of 0.9 (normalized cross-correlation scores are in the range [-1, 1]) and decreasing the error threshold at every iteration. The effect is that when the threshold is lowered in order to propagate more pixels, those pixels have been seeded by a previously higher cross-correlation score and are therefore diffused correctly. Stop the process when no more pixels can be diffused at the lowest threshold iteration.
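A minimal sketch of the matching score driving the diffusion, assuming greyscale float images and, for simplicity, an axis-aligned 5x3 block (the tool orients the block along the epipolar lines); the threshold schedule and the propagation routine named in the comments are hypothetical:

```python
# Illustrative normalized cross-correlation for one candidate match.
import numpy as np

def ncc(img1, img2, x1, y1, x2, y2, half_w=2, half_h=1):
    # Extract the two 5x3 blocks centred on the candidate correspondence.
    a = img1[y1 - half_h:y1 + half_h + 1, x1 - half_w:x1 + half_w + 1].astype(np.float64)
    b = img2[y2 - half_h:y2 + half_h + 1, x2 - half_w:x2 + half_w + 1].astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)         # score in [-1, 1]

# Iterative threshold schedule: start strict, relax, stop when nothing diffuses.
# for threshold in (0.9, 0.8, 0.7, 0.6):        # illustrative values
#     propagate_seeds(threshold)                # hypothetical propagation routine
```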
Figure 8: Example of inverse depth map extracted from a stereo sequence (brighter values are closer to the camera).
The initial depth map will present many artifacts. To solve these problems you need to obtain reliable depth at depth boundaries and for thin structures, correct depth in textureless regions, and plausible depth hypotheses in unmatched regions. Therefore, perform image color segmentation on the original photos to overcome many of the problems listed above.
Figure 9: Example of initial color segmented image.
The segmentation will break the image into many small segments or regions (Figure 9). Usually the segmented image contains too many regions, so you should merge small regions into bigger ones. Consider each region to be a continuous surface. Therefore, based on the depth values already calculated (Figure 8), fit each color segment with either a plane or a paraboloid surface, depending on which one minimizes the sum of squared distances. This model guarantees smoothness in textureless and unmatched regions, as well as precise depth boundaries. To give an overall smoothness to the structure, fit the entire model with a bicubic spline and smooth all valid depth points through it. Finally, apply a bilateral filtering pass on the depth map to remove the small noise introduced during the segmentation process while preserving the depth boundaries. After the depth map has been denoised, convert the depth values to 3D values and rectify them, since they could have been moved slightly out of their original positions by the smoothing process.
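To make the surface-fitting idea concrete, here is a minimal least-squares plane fit for one colour segment (the tool also tries a paraboloid and keeps whichever fits better; the function names are illustrative, not the tool's API):

```python
# Illustrative least-squares plane fit for one colour segment.
import numpy as np

def fit_plane_to_segment(xs, ys, depths):
    # xs, ys: pixel coordinates inside the segment that have a valid depth;
    # depths: the corresponding inverse-depth values.
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    coeffs, _, _, _ = np.linalg.lstsq(A, depths, rcond=None)
    return coeffs                                # depth ~ a*x + b*y + c

def evaluate_plane(coeffs, xs, ys):
    # Re-evaluating the plane over the whole segment fills textureless and
    # unmatched pixels with smooth, plausible depth.
    return coeffs[0] * xs + coeffs[1] * ys + coeffs[2]
```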
Now that you have obtained the underlying high-poly 3D mesh of the surface, you can easily generate a normal map by projecting the mesh onto a simplified flat surface. Obtain the normal vector for each pixel and repack the normalized vector into an RGB value in the normal map image (Fig. 1). As a final step, align the normal map to the Z axis (it could be any other axis). The reason for this is that the photos, being taken with a hand-held camera, are not aligned to any plane, and the model itself is not flat, but you must make sure that when two identical normal maps are placed next to each other in a virtual environment, the user won't notice any discontinuity. To align the normal map, calculate the average error against the axis-aligned vector by convolving the normal map image with a Gaussian filter, and then rotate each vector by the amount needed to align it to the chosen axis.
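As a rough illustration of the packing step for a flat projection (not the tool's mesh-projection code), per-pixel normals can be derived from the recovered height or inverse-depth map by finite differences and remapped into an 8-bit RGB image; the scale factor is an assumption:

```python
# Illustrative height-map-to-normal-map packing.
import cv2
import numpy as np

def normal_map_from_height(height, scale=1.0):
    h = height.astype(np.float32)
    # Finite-difference gradients of the height field.
    gx = cv2.Sobel(h, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(h, cv2.CV_32F, 0, 1, ksize=3)

    # The surface normal of z = f(x, y) is proportional to (-df/dx, -df/dy, 1).
    normals = np.dstack((-gx * scale, -gy * scale, np.ones_like(h)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)

    # Remap each component from [-1, 1] to [0, 255] and store X, Y, Z as R, G, B.
    return np.uint8((normals * 0.5 + 0.5) * 255.0)

# cv2.imwrite("normalmap.png", cv2.cvtColor(normal_map_from_height(depth), cv2.COLOR_RGB2BGR))
```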
Note that you can also export both the recovered high-poly model (1 triangle per pixel) and a simplified low-poly version. The simplification process is based on edge detection; if the low-poly version is going to be used in real time, the two models can be used to generate a classic normal map using the Polybump system. After the model has been generated the first time, you can quickly re-open it and change normal map parameters such as detail and scale in the reconstruction dialog options (Figure 3). The user manual document covers the usage of the tool.
The feature matching sub-system could be improved using some kind of automatic histogram quantization to take into account non-constant changes in exposure and lighting between the two photos. Changes in lighting and camera rotations around the picture's center often cause problems when detecting correspondences between the two pictures. Shadows can also cause problems during the reconstruction. It would be nice to perform automatic tiling of the reference photo before generating the normal map, in order to generate tileable diffuse and normal map textures. Recent research on image completion techniques could come in handy here.
Some other examples are shown in Figures 1 and 2 and in the user manual document.
Two high resolution photos (2240 x 1488) of an ancient wall in Rome.
Reconstructed 3D model (about 3 million triangles).
Aligned normal map (generated at 2240 x 1488) for usage on a flat surface (on the left). Inverse depth map (brighter means closer to the camera) generated from the photos (on the right).
Normal map generated from two photos of car trails on sand. Note how the shoe detail in the picture, although probably not meant to be part of the model, has been reconstructed by the tool as well.
Close-ups of the model visible in Figure 2; the left image shows a detail of the reconstructed model in wireframe mode.
Two photos of a canyon-like rock formation.
Inverse depth map (brighter means closer to the camera) and normal map.
In-game diffuse texture and normal map, after being tiled, cropped and cleaned up by the texture artist Pino Gengo. Diffuse texture (coming from the reference photo)
Normal map corresponding to the texture above. This normal map was generated from a model of over 3 million polygons.
In-game screenshot of a generated normal map, applied to a sea rock.