### Archive format

All downloads are provided as compressed archives in the 7-Zip format due to its good compression ratio. For Windows, the program can be downloaded from its webpage. On Ubuntu Linux, it can be installed and used to extract an archive as follows (after installation, it may also work from your file manager's right-click menu):

```shell
# Installation
sudo apt-get install p7zip-full

# Extract archive.7z
7z x archive.7z
```

### Format of multi-view data

#### Image formats

Images of the Nikon D3X camera are provided in three versions:

• In the camera's raw image format (with lossless compression, file extension NEF)
• As JPEG files (with lossy compression) which were directly saved by the camera in addition to the raw image files
• In a pre-undistorted version as JPEG files

A few images which we had to mask for data protection reasons are only available as JPEG files.

Images of the multi-camera rig are provided as PNG files (with lossless compression), both as original images and in a pre-undistorted version.

Note that the original images can also be undistorted with custom settings, for example to change how blank pixels are handled; see the custom undistortion section.

#### Calibration

The calibration is provided in the format of the COLMAP Structure-from-Motion program. Camera intrinsics are given in a text file cameras.txt, while image extrinsics are given in a text file images.txt.

In addition to this, keypoint observations and triangulated keypoint matches are provided for better direct usability in COLMAP and other reconstruction programs. COLMAP was used to compute these based on the images only. They are provided because they may be helpful, but since they are not based on ground truth data, they may be safely ignored. This data is given in images.txt and points3D.txt.

Notice that COLMAP comes with Matlab and Python scripts to load this format. The dataset files can be imported into COLMAP to view the image poses and triangulated points.

We additionally provide a standalone C++ implementation for loading the intrinsic and extrinsic calibrations from cameras.txt and images.txt, which only depends on the Eigen library: Format loader on GitHub. Using CMake, it should compile for Linux, Windows, and MacOS.

The relevant parts of the data format are described below.

#### cameras.txt

This text file defines the intrinsic calibration of each camera used in the scene. Due to potentially differing camera settings, in some cases multiple different intrinsics blocks are given for images of the same physical camera. Lines starting with a hash (#) are comments. Each other (non-empty) line specifies an intrinsics block as a list of parameters, separated by spaces. The format for these lines is:

```
CAMERA_ID MODEL WIDTH HEIGHT PARAMS[]
```

An example line is:

```
0 THIN_PRISM_FISHEYE 6048 4032 3408.35 3408.8 3033.92 2019.32 0.21167 0.20864 0.00053 -0.00015 -0.16568 0.4075 0.00048 0.00028
```

• The first parameter, 0 in the example, specifies the ID of this intrinsics block, which is referenced by images which use this block.
• The second parameter, THIN_PRISM_FISHEYE, specifies the camera model. For multi-view datasets from this benchmark, it is always THIN_PRISM_FISHEYE if the original images are used. For the undistorted images, the camera model is PINHOLE. These models are defined in the camera models section.
• The next two parameters, 6048 4032, are the width and height of the camera images.
• The remaining parameters are camera model parameters. See the description of the camera models.
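For illustration, such a line can be parsed with a few lines of Python (a minimal sketch; `parse_camera_line` is a hypothetical helper, not part of any provided tooling):

```python
def parse_camera_line(line):
    """Parse one non-comment line of cameras.txt into a dict."""
    tokens = line.split()
    return {
        "camera_id": int(tokens[0]),
        "model": tokens[1],
        "width": int(tokens[2]),
        "height": int(tokens[3]),
        # Remaining entries are the camera model parameters,
        # e.g. fx fy cx cy k1 k2 p1 p2 k3 k4 sx1 sy1 for THIN_PRISM_FISHEYE.
        "params": [float(t) for t in tokens[4:]],
    }

# The example line from above:
camera = parse_camera_line(
    "0 THIN_PRISM_FISHEYE 6048 4032 3408.35 3408.8 3033.92 2019.32 "
    "0.21167 0.20864 0.00053 -0.00015 -0.16568 0.4075 0.00048 0.00028"
)
```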

#### images.txt

This text file provides the extrinsic calibration of each image in the scene, and the keypoint observations of each image. As for cameras.txt, lines beginning with a hash (#) are comments. Other non-empty lines specify an image's data. The format for lines with image extrinsics data is:

```
IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME
```

An example line is:

```
49 0.995 0.00592 -0.0882 0.0311 3.21 3.60 9.19 1 dslr_images/DSC_0323.JPG
```

• The first number, 49 in the example, is the image ID. For the purposes of the benchmark, this is irrelevant.
• The following 7 numbers specify the image pose as the transformation that brings 3D points from the global coordinate system to the image's local coordinate system. This local system is defined such that the x-axis points to the right, the y-axis points down, and the z-axis points forward (in the direction of the optical axis) relative to the image. The origin is at the projection center. The global system is a right-handed system with arbitrary origin and orientation.
The first four numbers specify the rotation as a quaternion, the following numbers specify the translation. The quaternion is given in the Hamilton convention (as used by the Eigen library, for example).
• The next number, 1, is the ID of the intrinsics block which is used by this image.
• After this, the path to the image file is given.
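To illustrate the pose convention, the following Python sketch (with hypothetical helper names) parses an extrinsics line and applies the global-to-local transform using a Hamilton-convention quaternion:

```python
import math

def parse_image_line(line):
    """Parse an extrinsics line: IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME."""
    tokens = line.split()
    return {
        "image_id": int(tokens[0]),
        "q": [float(t) for t in tokens[1:5]],  # (qw, qx, qy, qz), Hamilton convention
        "t": [float(t) for t in tokens[5:8]],  # translation of the global-to-local transform
        "camera_id": int(tokens[8]),           # references an intrinsics block in cameras.txt
        "name": tokens[9],
    }

def quaternion_to_matrix(q):
    """Rotation matrix for a Hamilton-convention quaternion (qw, qx, qy, qz)."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)  # normalize to guard against rounding in the text file
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]

def global_to_local(image, p_global):
    """p_local = R * p_global + t: bring a global 3D point into the image's local frame."""
    R = quaternion_to_matrix(image["q"])
    return [sum(R[i][j] * p_global[j] for j in range(3)) + image["t"][i]
            for i in range(3)]

image = parse_image_line(
    "49 0.995 0.00592 -0.0882 0.0311 3.21 3.60 9.19 1 dslr_images/DSC_0323.JPG")
```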

After the extrinsics line for an image, a second line provides the keypoint observations of this image. The format for these lines is a list of tuples:

```
POINTS2D[] as (X Y POINT3D_ID)
```

An example line is:

```
64.3 1948.0 60 237.1 897.6 -1
```

X and Y are given in pixel coordinates. The example line defines two observations. For the first one, the ID of the observed 3D point is 60, while for the second one it is -1. The special value -1 is used to indicate that this keypoint observation does not correspond to a triangulated 3D point.
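Such a line can be parsed as follows (a minimal Python sketch with a hypothetical helper name):

```python
def parse_points2d_line(line):
    """Parse a keypoint observations line: repeated (X Y POINT3D_ID) tuples."""
    tokens = line.split()
    observations = []
    for i in range(0, len(tokens), 3):
        x, y = float(tokens[i]), float(tokens[i + 1])
        point3d_id = int(tokens[i + 2])  # -1: no corresponding triangulated 3D point
        observations.append((x, y, point3d_id))
    return observations

# The example line from above defines two observations:
obs = parse_points2d_line("64.3 1948.0 60 237.1 897.6 -1")
```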

The keypoint observations are provided since they may be useful, for example for co-visibility estimation or depth range estimation, but they are not ground truth data and may thus be safely ignored.

#### points3D.txt

This text file provides the IDs and locations of the triangulated 3D points. As for the other files, lines beginning with a hash (#) are comments. Other non-empty lines specify point data. The format of each such line is:

```
POINT3D_ID X Y Z R G B ERROR TRACK[] as (IMAGE_ID POINT2D_IDX)
```

An example line is:

```
569 -0.567627 -3.79941 4.89522 241 255 255 3.63797 2 4568 13 379
```

• The first item is the ID, which is referenced by the keypoint observations in images.txt.
• The next three numbers define the point position in global coordinates.
• The following three numbers define the point color, with values in [0, 255].
• Then the reprojection error is given.
• As the last component, the line contains the track, which is a list of all images in which this point has been observed, together with the index of the observation.

As for the keypoint observations, the triangulated points are not ground truth data and may be safely ignored.
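A minimal Python sketch (with a hypothetical helper name) for parsing such a line:

```python
def parse_point3d_line(line):
    """Parse a points3D.txt line: POINT3D_ID X Y Z R G B ERROR (IMAGE_ID POINT2D_IDX)*."""
    tokens = line.split()
    track_tokens = tokens[8:]
    return {
        "point3d_id": int(tokens[0]),
        "xyz": [float(t) for t in tokens[1:4]],   # position in global coordinates
        "rgb": [int(t) for t in tokens[4:7]],     # color, values in [0, 255]
        "error": float(tokens[7]),                # reprojection error
        # Track: pairs of (image ID, index of the observation within that image).
        "track": [(int(track_tokens[i]), int(track_tokens[i + 1]))
                  for i in range(0, len(track_tokens), 2)],
    }

# The example line from above:
point = parse_point3d_line(
    "569 -0.567627 -3.79941 4.89522 241 255 255 3.63797 2 4568 13 379")
```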

#### Training data

For the training datasets, the laser scan point clouds and depth maps rendered from these point clouds are publicly available. The laser scans were recorded with a Faro Focus X 330 scanner and include point colors (in the "raw" and "clean" versions of the point clouds). The laser scanner measurement origin is at the coordinate system origin of each scan point cloud. The point clouds are provided as binary PLY files, which can for example be loaded using PCL.

We offer each laser scan in three different versions:

• raw: The raw scan (including outliers) as recorded by the scanner.
• clean: A version with outliers removed (using an automatic tool and manual work).
• eval: A version with outliers removed, and all points removed which are assumed to be observed by at most one image only. This is the version used for evaluation. If a scene is available with both high-res multi-view and low-res many-view images, then there are two different evaluation point clouds, corresponding to the two scenarios. These versions of the point cloud do not include point colors.

In addition, for scenes with more than one laser scan, a file scan_alignment.mlp gives the relative poses of each scan. This is a MeshLab project file (and can thus be opened with MeshLab to view the laser scans), which follows a simple XML-based format. The transformation bringing points from the scan (mesh) coordinate system into the global coordinate system is given as a 4x4 transformation matrix within the MLMatrix44 tag of each MLMesh tag corresponding to one laser scan. As an example, the pose information for a scan file scan1.ply might look as follows:

```xml
<MLMesh label="scan1.ply" filename="scan1.ply">
  <MLMatrix44>
    -0.196132 0.980193 -0.0274673 -4.17914
    -0.378691 -0.101553 -0.919935 0.96876
    -0.904503 -0.170027 0.391108 2.36282
    0 0 0 1
  </MLMatrix44>
</MLMesh>
```

A point $$\mathbf{p}_{\text{local}}$$ taken from a point cloud can thus be transformed to global coordinates with a homogeneous multiplication with the point cloud's pose matrix $$\mathbf{M}$$: $$\begin{pmatrix} \mathbf{p}_{\text{global}} \\ 1 \end{pmatrix} = \mathbf{M} \cdot \begin{pmatrix} \mathbf{p}_{\text{local}} \\ 1 \end{pmatrix}$$
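This transformation can be sketched in Python using the example pose matrix above (`transform_point` is a hypothetical helper):

```python
def transform_point(M, p_local):
    """Transform a local point into global coordinates via the 4x4 pose matrix M."""
    ph = list(p_local) + [1.0]  # homogeneous coordinates
    return [sum(M[i][j] * ph[j] for j in range(4)) for i in range(3)]

# Pose matrix from the scan1.ply example above:
M = [
    [-0.196132,  0.980193, -0.0274673, -4.17914],
    [-0.378691, -0.101553, -0.919935,   0.96876],
    [-0.904503, -0.170027,  0.391108,   2.36282],
    [ 0.0,       0.0,       0.0,        1.0],
]

# The scan origin (the laser scanner measurement origin) maps to the translation column:
origin_global = transform_point(M, [0.0, 0.0, 0.0])
```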

If you prefer to work with a single point cloud which already has the correct transformation (and do not require the positions of the laser scanner measurement origins), you can use MeshLab to merge the point clouds as follows (tested with MeshLab version 1.3.2):

• Open the project file in MeshLab.
• Choose View -> Show Layer Dialog.
• Right-click a layer item in the dialog and choose Flatten Visible Layers.
• Activate Keep unreferenced vertices, then click Apply.
• Export the resulting merged point cloud.

The rendered depth maps are provided in the archives ending in _depth.7z. They contain depth images that match the original (distorted) versions of the images, not the pre-undistorted ones. The occlusion mesh and image masks are taken into account in the renderings. The depth map files are provided as binary dumps of 4-byte floats, one float for each pixel in the image, in row-major order. Pixels without a depth value are set to infinity. Notice that the depth values are generally not dense in the DSLR images (when using them at their original resolution).
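A depth map file can be read with the Python standard library, for example (a minimal sketch; `demo_depth.bin` is a scratch file created only for the demo, and little-endian data is assumed):

```python
import math
import struct

def read_depth_map(path, width, height):
    """Read a rendered depth map: one 4-byte float per pixel, row-major.
    Pixels without a depth value are set to infinity."""
    with open(path, "rb") as f:
        values = struct.unpack("<%df" % (width * height),
                               f.read(4 * width * height))
    # Split the flat row-major buffer into rows.
    return [values[row * width:(row + 1) * width] for row in range(height)]

# Self-contained demo with a synthetic 2x2 depth map:
with open("demo_depth.bin", "wb") as f:
    f.write(struct.pack("<4f", 1.5, 2.5, math.inf, 4.0))
depth = read_depth_map("demo_depth.bin", width=2, height=2)
```

With NumPy available, `numpy.fromfile(path, dtype=numpy.float32).reshape(height, width)` achieves the same.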

#### Occlusion data

In addition to the laser scan point clouds, separate archives are provided with files for occlusion handling. Correct occlusion handling, for example, enables rendering depth maps while discarding laser scan points which are occluded (rendered depth maps are provided separately in the files ending in _depth.7z). The occlusion files consist of a (potentially edited) Poisson surface reconstruction of the point cloud (as .ply mesh), a splat cloud (as .ply mesh), and image masks (as .png images). The files may be used for occlusion handling as follows:

• Render the surface reconstruction and splat cloud as a depth map into an image of the dataset. Distort the rendering to match the distorted version of the image.
• Discard pixels close to depth discontinuities as those can be uncertain.
• Read the image mask for the image for which occlusion handling is performed, and the camera mask for this image's camera (if those masks exist). If a pixel is set to value 2 in at least one of those (grayscale) masks, discard it.

Furthermore, the masks may contain pixels with the value 0, which means they are not masked, and pixels with the value 1, which means that they were not used for image-to-laser-scan registration. The latter is, for example, applied to reflective surfaces, specular reflections, shadow boundaries that move over time, and lens artifacts.
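The mask semantics can be summarized in code (a small sketch with hypothetical names; the masks themselves are grayscale images):

```python
UNMASKED = 0        # pixel is not masked
NOT_REGISTERED = 1  # pixel was not used for image-to-laser-scan registration
DISCARD = 2         # pixel must be discarded during occlusion handling

def keep_for_occlusion_handling(image_mask_value, camera_mask_value=UNMASKED):
    """A pixel is discarded if it is set to 2 in the image mask or the camera mask."""
    return image_mask_value != DISCARD and camera_mask_value != DISCARD
```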

#### Custom undistortion

In order to undistort the original images, one must decide how to treat pixels for which there is no corresponding pixel in the original image. The provided undistorted images were created by running COLMAP's image undistorter with default parameters. The following command can be used to create custom undistorted images:

```shell
colmap_build_folder/src/exe/image_undistorter \
    --input_path /path/to/cameras-images-points3D.txt \
    --image_path /path/to/cameras-images-points3D.txt/images \
    --output_path /path/to/output
```

The following command line parameters control the undistortion:

```
--blank_pixels arg (=0)
```

• blank_pixels=0 will change the image size such that no pixel is blank in the undistorted image (leading to missing pixels of the original image in case of distortion).
• blank_pixels=1 will change the image size such that all pixels of the original image will be in the undistorted image (leading to blank pixels in case of distortion).

```
--min_scale arg (=0.2)
```

Minimum scaling factor of the image to achieve the above constraints.

```
--max_scale arg (=2)
```

Maximum scaling factor of the image to achieve the above constraints.

```
--max_image_size arg (=-1)
```

Maximum image size to achieve the above constraints.

#### IMU data

For the multi-camera videos added on 2017-10-04 we provide inertial measurement unit (IMU) measurements since they can be useful if the datasets are used for odometry / SLAM experiments. The data is not useful for the tasks evaluated in the ETH3D benchmark, however, and can thus be safely ignored. This section describes the data format.

Each camera on the multi-camera rig has an associated IMU. We recorded the IMU readings of two of them. Those are provided in rosbag files in the topics /uvc_camera/cam_0/imu and /uvc_camera/cam_1/imu. The "cam_0" in the rosbag corresponds to images_rig_cam7 in the dataset, and "cam_1" in the rosbag corresponds to images_rig_cam6 in the dataset. The message type is sensor_msgs/Imu. The rosbags cover a longer time span than the datasets. The image time stamps are given in the image filenames (in nanoseconds).

In addition, we provide the noise statistics of the IMUs and a camera-IMU calibration obtained from a calibration dataset with Kalibr. This is included with each IMU data archive.

### Format of two-view data

The two-view datasets provide stereo-rectified image pairs, i.e., for a given pixel in one image the corresponding epipolar line in the other image is the image row having the same y-coordinate as the pixel. These datasets, just like the multi-view datasets, come with the cameras.txt and images.txt files specifying the intrinsic and extrinsic camera parameters of the images. See above for a description of their format. In the two-view case all images are pre-undistorted, so their camera model is PINHOLE. This model is defined in the camera models section. We do not provide keypoint matches and triangulated keypoints for this type of data.

Furthermore, the two-view datasets also come with a file calib.txt which is formatted according to the Middlebury data format - version 3. Note that those files do not provide any information about the disparity range: the corresponding field is set to the image width.

#### Training data

The ground truth follows the same format as the Middlebury stereo benchmark, version 3. The ground truth disparity for the left image is provided as a file disp0GT.pfm in the PFM format using little endian data. The ASCII header may look as follows, for example:

```
Pf
752 480
-1
```

The first line is always "Pf", indicating a grayscale PFM image. The second line specifies the width and height of the image. The third line is always "-1", indicating the use of little endian. After this header (where each line is followed by a newline character), the ground truth disparity image follows in row-major binary form as 4-byte floats. The rows are ordered from bottom to top. Positive infinity is used for invalid values.
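Such a PFM file can be read with the Python standard library, for example (a minimal sketch; `demo.pfm` is a scratch file created only for the demo):

```python
import struct

def read_pfm(path):
    """Read a grayscale little-endian PFM disparity file ("Pf", "W H", negative scale)."""
    with open(path, "rb") as f:
        assert f.readline().strip() == b"Pf"       # grayscale PFM
        width, height = map(int, f.readline().split())
        scale = float(f.readline().strip())        # negative means little endian
        assert scale < 0
        data = struct.unpack("<%df" % (width * height),
                             f.read(4 * width * height))
    # Rows are stored bottom to top; flip so that row 0 is the top image row.
    rows = [data[r * width:(r + 1) * width] for r in range(height)]
    return rows[::-1]

# Self-contained demo: write a tiny 2x2 PFM and read it back.
with open("demo.pfm", "wb") as f:
    f.write(b"Pf\n2 2\n-1\n")
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))  # bottom row first
disp = read_pfm("demo.pfm")
```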

The occlusion mask for the left image is given as a file "mask0nocc.png". Pixels without ground truth have the color (0, 0, 0). Pixels which are only observed by the left image have the color (128, 128, 128). Pixels which are observed by both images have the color (255, 255, 255). For the "non-occluded" evaluation, the evaluation is limited to the pixels observed by both images.

Additional training data is available as part of the multi-view training datasets from the multi-camera rig: The archives named [dataset]_rig_stereo_pairs_gt.7z contain rectified two-view stereo frames with ground truth disparity for each image pair in the video. The ground truth disparity quality may be slightly worse for these videos since no additional image masking was done; nevertheless, the large amount of data may be helpful for training two-view stereo algorithms.

The format is the same as for the individual stereo frames. The video datasets additionally contain the files disp1GT.pfm and mask1nocc.png for each frame, the disparity and occlusion mask for the right image of the pair.

### Camera models

Given a 3D point in global coordinates, it can be transformed to the local coordinate system of an image using the extrinsic parameters of the image, as described in the sections above. Given a local 3D point, the camera model defines how the corresponding pixel location in the image is calculated.

#### PINHOLE camera model

This camera model is used for undistorted images. It has the following parameters, which are listed in this order in cameras.txt:
$$f_x ~~ f_y ~~ c_x ~~ c_y$$ Projecting a local 3D point to the image plane is done as follows:

1. If the z-coordinate of the point is smaller than or equal to zero, the point is unobserved.
2. Project the point onto the virtual image plane by dividing its x and y coordinates by its z coordinate, resulting in normalized image coordinates $$(u_n, v_n)$$.
3. Convert to pixel coordinates $$(u, v)$$: $$u = f_x \cdot u_n + c_x$$ $$v = f_y \cdot v_n + c_y$$
4. Check whether the pixel coordinates are within the image bounds. The convention for the pixel coordinate origin is that (0, 0) is the top-left corner of the image (as opposed to the center of the top-leftmost pixel).
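The steps above can be sketched in Python as follows (`project_pinhole` is a hypothetical helper):

```python
def project_pinhole(point, fx, fy, cx, cy, width, height):
    """Project a local 3D point with the PINHOLE model.
    Returns pixel coordinates, or None if the point is unobserved."""
    x, y, z = point
    if z <= 0:
        return None                        # step 1: behind (or on) the camera plane
    un, vn = x / z, y / z                  # step 2: normalized image coordinates
    u = fx * un + cx                       # step 3: pixel coordinates,
    v = fy * vn + cy                       #         origin at top-left image corner
    if not (0 <= u < width and 0 <= v < height):
        return None                        # step 4: outside the image bounds
    return (u, v)
```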

#### THIN_PRISM_FISHEYE camera model

This camera model is used for the distorted (original) images. The model corresponds to the ThinPrismFisheyeCameraModel class in COLMAP. It has the following parameters, which are listed in this order in cameras.txt:
$$f_x ~~ f_y ~~ c_x ~~ c_y ~~ k_1 ~~ k_2 ~~ p_1 ~~ p_2 ~~ k_3 ~~ k_4 ~~ s_{x1} ~~ s_{y1}$$ Projecting a local 3D point to the image plane is done as follows:

1. If the z-coordinate of the point is smaller than or equal to zero, the point is unobserved.
2. Project the point onto the virtual image plane by dividing its x and y coordinates by its z coordinate, resulting in the normalized coordinates $$(x_d, y_d)$$.
3. Apply the equidistant radial distortion model to get $$(u_d, v_d)$$: $$r = \sqrt{x_d^2 + y_d^2}$$ $$\theta = \tan^{-1}(r)$$ $$\begin{pmatrix}u_d\\v_d\end{pmatrix} = \frac{\theta}{r} \begin{pmatrix}x_d\\y_d\end{pmatrix}$$
4. Apply the remaining distortion terms to get the distorted normalized image coordinates $$(u_n, v_n)$$, where the radius is recomputed from the result of the previous step as $$r_d^2 = u_d^2 + v_d^2$$: $$t_r = 1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + k_4 r_d^8$$ $$u_n = u_d t_r + 2 p_1 u_d v_d + p_2 (r_d^2 + 2 u_d^2) + s_{x1} r_d^2$$ $$v_n = v_d t_r + 2 p_2 u_d v_d + p_1 (r_d^2 + 2 v_d^2) + s_{y1} r_d^2$$
5. Convert to pixel coordinates $$(u, v)$$: $$u = f_x \cdot u_n + c_x$$ $$v = f_y \cdot v_n + c_y$$
6. Check whether the pixel coordinates are within the image bounds. The convention for the pixel coordinate origin is that (0, 0) is the top-left corner of the image (as opposed to the center of the top-leftmost pixel).
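The steps above can be sketched in Python as follows (a minimal sketch based on our reading of COLMAP's ThinPrismFisheyeCameraModel, in which the radial terms use the radius recomputed after the fisheye mapping; the border-based cutoff against re-projection is omitted here):

```python
import math

def project_thin_prism_fisheye(point, params, width, height):
    """Project a local 3D point with the THIN_PRISM_FISHEYE model.
    Returns pixel coordinates, or None if the point is unobserved."""
    fx, fy, cx, cy, k1, k2, p1, p2, k3, k4, sx1, sy1 = params
    x, y, z = point
    if z <= 0:
        return None                              # step 1: behind the camera
    xd, yd = x / z, y / z                        # step 2: normalized coordinates
    r = math.hypot(xd, yd)
    if r > 1e-8:                                 # step 3: equidistant fisheye mapping
        theta = math.atan(r)
        ud, vd = theta / r * xd, theta / r * yd
    else:
        ud, vd = xd, yd
    r2 = ud * ud + vd * vd                       # step 4: radius after fisheye mapping
    tr = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3 + k4 * r2**4
    un = ud * tr + 2 * p1 * ud * vd + p2 * (r2 + 2 * ud * ud) + sx1 * r2
    vn = vd * tr + 2 * p2 * ud * vd + p1 * (r2 + 2 * vd * vd) + sy1 * r2
    u, v = fx * un + cx, fy * vn + cy            # step 5: pixel coordinates
    if not (0 <= u < width and 0 <= v < height):
        return None                              # step 6: outside the image bounds
    return (u, v)
```

For a fully validated implementation, please refer to the camera model code linked below rather than this sketch.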

With this camera model, certain values of the radial distortion parameters may cause parts of the scene which are close to 90 degrees from the viewing direction to be projected into the image again. For example, this can happen if the polynomial used for radial distortion goes to minus infinity as its argument goes to infinity. Therefore, in addition to the process described above, we employ a step to prevent this issue: for a given camera, we unproject all pixels at the border of its image to determine the maximum distance $$r$$ from the optical center on the virtual image plane at which coordinates should still project into the image. In the projection process, we then discard as unobserved all points with a larger distance. Notice that since the unprojection is an optimization procedure that may find any result depending on its initialization, some care may be necessary to obtain the innermost results. This can be done by starting the optimization from several starting points and keeping only the innermost result. At the same time, results which are significantly farther out provide an upper bound for the cutoff distance (since they should not be projected into the image).

We provide a C++ implementation of this camera model, both as a CUDA-compatible variant and as a variant without any dependencies. In addition, we provide a short octave / Matlab script which can be used to compute derivatives of the projection: Camera model implementations on GitHub.

### Evaluation

For details on the evaluation procedure, please see our paper.

#### Multi-view evaluation

Download the multi-view evaluation program source code (on GitHub) or Windows binaries. This program depends on Boost.Filesystem, Eigen, and PCL. Using CMake, it should compile for Linux, Windows, and MacOS.

#### Two-view evaluation

Download the two-view evaluation program source code (on GitHub) or Windows binaries. This program depends on Boost.Filesystem and libpng. Using CMake, it should compile for Linux, Windows, and MacOS.

### Result submission

Important: Please note that we are interested in the methods' runtimes and request these as mandatory information for each submission. Therefore, please record the runtime of your method on each dataset individually when you submit. File I/O should ideally be excluded from this. See below for how the submission format, including the runtimes, should look.

Partial submissions are possible, but we only compute category averages (and therefore rank a method in a category) if results for all of the datasets within a category are provided. If a method fails on a dataset and you would still like to rank it in a category containing this dataset, you can provide an empty result file for this dataset.

#### Submission format

Results for a method must be uploaded as a single .zip or .7z file (7z is preferred due to smaller file sizes), which must in its root contain at least one of the following folders (case matters for all directories and filenames): high_res_multi_view, low_res_many_view, low_res_two_view. In each folder, the results for the corresponding scenario must be placed (both training and test data can be evaluated). There must not be any additional files or folders in the archive.

For the multi-view scenarios, results should be provided as binary PLY files having the same name as the dataset, with the extension .ply. For example, the result file for the courtyard dataset should be named courtyard.ply. Please make sure that the files can be loaded by PCL. You can test this by running the evaluation program on one of your results for a training dataset. See the example Python script for Linux which runs the COLMAP multi-view stereo pipeline and produces the correct output format.

For the two-view scenario, results should be generated as 4-byte floating-point PFM files, i.e., the same format which is used for the ground truth training data. The files should have the same name as the dataset, for example "delivery_area_1s", with the file extension ".pfm". Sparse results can be submitted: missing pixels can be set to NaN, infinity, or a disparity beyond the image width. See the example Python script for Linux which runs the ELAS program and produces the correct output format.

In addition, for both scenarios, for each result file a metadata file having the same name, but with extension .txt, must be provided. For example, for the result file courtyard.ply the metadata file would be named courtyard.txt. Currently, this file is only required to contain the method's runtime, as described above. The format is plain text with the following structure:

```
runtime <runtime_in_seconds>
```

Replace "<runtime_in_seconds>" with your runtime, and do not include the <> characters. For example:

```
runtime 1.42
```

The number of seconds does not need to be an integer.

In summary, the file structure for a multi-view submission archive should look as follows if results for both the high-res multi-view and low-res many-view scenarios are submitted:

```
├── high_res_multi_view
│   ├── courtyard.ply
│   ├── courtyard.txt
│   ├── delivery_area.ply
│   ├── delivery_area.txt
│   └── ...
└── low_res_many_view
    ├── delivery_area.ply
    ├── delivery_area.txt
    ├── electro.ply
    ├── electro.txt
    └── ...
```

The file structure for a two-view submission archive should look as follows:

```
└── low_res_two_view
    ├── delivery_area_1l.pfm
    ├── delivery_area_1l.txt
    ├── delivery_area_1s.pfm
    ├── delivery_area_1s.txt
    └── ...
```

We also provide some example result archives which demonstrate this file structure: