
Stereoscopic depth perception through foliage

Occlusions caused by vegetation can severely hinder aerial operations, such as search and rescue, wildfire detection, wildlife observation, security, or surveillance. For example, it is almost impossible to detect a standing person in the thermal drone recording shown in Fig. 1b (in the blue box). One of the most promising solutions to this problem is synthetic aperture sensing1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17, in which multiple images taken at different positions are computationally combined to simulate an advanced (virtual) sensor with a wider (synthetic) aperture. An example result of such an integral image is shown in Fig. 1c, in which a standing person can be identified much more easily (in the blue box). Here we show the result for thermal imaging, but synthetic aperture sensing is equally applicable to radar18,19,20, radio telescopes21,22, interferometric microscopy23, sonar24,25, ultrasound26,27, LiDAR28,29, and optical imaging30,31,32.

Fig. 1

(a) Optical synthetic aperture sensing principle. Registering and integrating multiple images captured along a synthetic aperture of size a while computationally focusing on focal plane F at distance h will defocus occluders O at distance o from F (with a point-spread of b) while focusing targets on F. (b) Conventional thermal aerial image of woodland with an occluded person on the ground (blue box). (c) The same scene as (b) but with suppressed occlusion by integrating 30 thermal images captured along a synthetic aperture of a=14 m at h=26 m AGL. (d) An ambiguous example of an integral image in which true (lying and standing persons in the green box) and false (heated ground patches in red boxes) detections can be made. They cannot be differentiated by other discriminators, such as shape.

An optical synthetic aperture image is formed by superimposing regular images taken with small-aperture optics so that the depth of interest (e.g., the ground level) is brought into focus. This is illustrated in Fig. 1, in which a drone captures multiple images along a flight path over a woodland. The pixels from each camera image are projected onto a hypothetical (virtual) focal plane at distance h from the synthetic aperture’s plane (i.e., at the altitude of the flight path); see the cyan lines projecting onto the ground plane in Fig. 1a. Even though the object of interest is occluded in some camera images (dashed lines in Fig. 1a), other views reveal the object under the foliage. Aligning the focal plane with the forest floor and repeating this for all of its locations results in a shallow depth-of-field integral image of the ground surface (cf. Figs. 1c,d). It approximates the signal of a physically impossible optical lens of the size of the synthetic aperture. The optical signal of out-of-focus occluders, such as the tree crowns, is suppressed (blurred); see the pink lines projecting onto the ground plane in Fig. 1a. Focused targets on or near the ground are emphasized. Computation of the integral images can be achieved in real time and is wavelength-independent. Thus, the method can be applied in the visible, near-infrared, or far-infrared (thermal) range to address many different use cases. It has previously been explored in search and rescue with autonomous drones8,9, bird census in ornithology5, and through-foliage tracking for surveillance and wildlife observation12,14.
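To make the integration step concrete, the following sketch outlines the shift-and-add principle under strongly simplifying assumptions: a nadir-looking camera, pre-registered frames, and a straight flight path, so that focusing on a plane at distance h reduces to shifting each frame by its camera offset scaled by the focal length over h and averaging. It is an illustration of plane-focused integration, not the authors' actual pipeline; the function name, parameters, and sign convention are hypothetical.

```python
import numpy as np
import cv2


def integral_image(frames, offsets_m, f_px, h_m):
    """Average all frames after re-projecting each onto the focal plane at
    distance h_m (shift-and-add). frames: list of HxW float arrays;
    offsets_m: per-frame (dx, dy) camera offsets in metres relative to the
    synthetic-aperture centre; f_px: focal length in pixels."""
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for img, (dx, dy) in zip(frames, offsets_m):
        # A camera offset of (dx, dy) metres maps to a pixel shift of
        # (dx, dy) * f_px / h_m for points on the focal plane (the sign
        # depends on the chosen image/world axis convention).
        shift = np.float32([[1, 0, dx * f_px / h_m],
                            [0, 1, dy * f_px / h_m]])
        acc += cv2.warpAffine(img.astype(np.float32), shift,
                              (img.shape[1], img.shape[0]))
    return acc / len(frames)
```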

The main limitation of optical synthetic aperture sensing is that its results can be ambiguous if true targets cannot be differentiated from false targets on the basis of clear features such as shape. An example of this is illustrated in Fig. 1d, where strong thermal signatures of multiple potential targets near the forest floor are visible. While some of them are the result of sun-heated ground patches, only two are the true thermal signatures of people. With two-dimensional information alone, a clear distinction is impossible. However, the height differences between people and the forest floor could serve as an additional cue if they can be preserved in the final image. A computational 3D reconstruction from the sampled multi-view aerial images or the corresponding integral images is currently impossible with state-of-the-art methods in the case of strong occlusion1, as shown and explained in the Appendix. Airborne laser scanning, such as LiDAR, has clear advantages over image-based 3D reconstruction when it comes to partially occluded surfaces, but it also has clear limitations1: First, it is not sensitive to the target’s emitted or reflected wavelengths. Thus, far-infrared (thermal) signals, for instance, cannot be detected. Second, the point clouds cannot be scanned in high resolution and in real time, due to mechanical laser deflection and high processing requirements. This makes laser scanning unsuitable for many applications that require instant results and high resolutions.

In this article, we explore the synergy between optical synthetic aperture sensing and the human ability to sense depth in stereoscopic images. We introduce binocular disparity to optical synthetic aperture images, which then serves as an additional cue and discriminator in identification and classification tasks. This enables tasks that cannot be completed with human or computer vision alone. To prove that binocular depth perception is possible for thermal optical synthetic aperture images, which are unnatural for human vision, we test whether human observers can infer depth from such images and complete high-level tasks.

Let us theoretically analyze under what conditions the visual system can fuse and discriminate depth differences between small and occluded targets, such as standing humans (up to 2 m) occluded by tall trees (15–20 m), seen from high altitudes (20–30 m for drones flying above tree level). First, objects that differ greatly in height (e.g., tree crowns vs. targets on the ground) and are located close together in the image will result in large disparity gradients (disparity difference divided by the distance between the two objects). If the disparity gradient exceeds the limit of human visual perception, diplopia33 will result, making stereoscopic fusion impossible. Second, if objects are seen from relatively far distances and their height difference is small, the disparity difference between them may fall below the stereo acuity limit34,35,36, which makes depth discrimination impossible. The latter problem can be addressed by enlarging (scaling) disparities by assuming large viewing baselines (e.g., much larger than a typical inter-ocular distance of 6.5 cm).
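As a rough illustration of these two conditions, the sketch below checks a pair of image points against a disparity-gradient limit and a stereo-acuity threshold (default values taken from the example figures quoted in the caption of Fig. 2). The helper name and the simplification of comparing the raw disparity difference directly against the acuity threshold are ours, not the paper's.

```python
def depth_perceivable(disp_a_arcmin, disp_b_arcmin, separation_arcmin,
                      gradient_limit=1.0, stereo_acuity_arcmin=6.0):
    """Return True if the depth difference between two image points is likely
    to be perceivable: the disparity gradient must stay below the fusion
    (diplopia) limit, and the disparity difference must exceed the
    stereo-acuity threshold."""
    disparity_diff = abs(disp_a_arcmin - disp_b_arcmin)
    gradient = disparity_diff / separation_arcmin
    return gradient <= gradient_limit and disparity_diff >= stereo_acuity_arcmin
```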

Fig. 2

The increase in perceived target height (PTH) with an increasing stereo baseline for three unoccluded objects of different heights (solid plots): tree crowns, lying person, and standing person. Stereo acuity sets the just-detectable depth interval (JDDI) required for perceiving height differences (dashed lines). Both the conservative (0.3 arcmin) and the realistic (6 arcmin) JDDIs are plotted. Disparities, or rather disparity gradients (numbers next to the markers), limit the maximum length of the baseline above which objects cannot be fused due to diplopia. Consequently, the grayed region represents the range in which depth can be perceived (assuming, for example, a disparity gradient limit of 1.0 and a stereo acuity of 6.0 arcmin). Display disparities are given with respect to the ground level and for object distances of 60 arcmin. For this plot, we assume the capturing and display parameters provided in the “Methods” section.

Fig. 2 illustrates these two problems for the unoccluded case. Let us consider some geometric constraints: For a given screen distance v, inter-ocular distance e, and object disparity d, the perceived object distance z is given by37 (ch. 9.2.2)

$$\begin{aligned} z=\frac{ev}{e-d}. \end{aligned}$$

(1)

It follows that

$$\begin{aligned} d=\frac{e(z-v)}{z}. \end{aligned}$$

(2)

Applying Eqn. 2 to compute the disparity for the focal plane at distance \(v_f\) (equal to h in Fig. 1a), the camera baseline \(e_f\) on the synthetic aperture plane, and the target distance \(z_f=v_f-h_t\) (\(h_t\) being the target height) from the synthetic aperture plane, and then scaling the resulting disparity to the display parameters to determine the perceived object distance on the display \(z_d\) using Eqn. 1, results in

$$\begin{aligned} z_d=\frac{e_dv_d}{e_d-\frac{v_d\tan (FOV_d/2)e_f(z_f-v_f)}{v_f\tan (FOV_f/2)z_f}}, \end{aligned}$$

(3)

where \(e_d\) and \(v_d\) are the inter-ocular distance and the distance of the display image plane, and \(FOV_d\) and \(FOV_f\) are the fields of view of the display and camera, respectively.

Consequently, the perceived target height is

$$\begin{aligned} PTH=v_d-z_d. \end{aligned}$$

(4)

The just-detectable depth interval is given by38,39

$$\begin{aligned} JDDI=\frac{d_\gamma v_d^2}{ce_d+v_d}, \end{aligned}$$

(5)

where \(d_\gamma\) is the stereo acuity (in arcmin) and \(c=3437.75\) (1 radian in arcmin).
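For reference, a minimal sketch transcribing Eqns. 1–5 follows. The function names are ours, and the numeric values in the usage example (camera baseline, fields of view, display distance) are placeholder assumptions rather than the parameters from the “Methods” section; angles are expected in radians wherever the tangent is applied.

```python
import math

C = 3437.75  # arcmin per radian


def display_disparity(e_f, v_f, z_f, fov_f, v_d, fov_d):
    """Capture-side disparity of Eqn. 2 rescaled to the display geometry,
    i.e., the inner term in the denominator of Eqn. 3. FOVs in radians."""
    return (v_d * math.tan(fov_d / 2) * e_f * (z_f - v_f)) / (
        v_f * math.tan(fov_f / 2) * z_f)


def perceived_target_height(e_d, v_d, d):
    """Eqns. 3 and 4: perceived distance z_d on the display and PTH = v_d - z_d."""
    z_d = (e_d * v_d) / (e_d - d)
    return v_d - z_d


def jddi(d_gamma_arcmin, e_d, v_d):
    """Eqn. 5: just-detectable depth interval for stereo acuity d_gamma (arcmin)."""
    return (d_gamma_arcmin * v_d ** 2) / (C * e_d + v_d)


# Assumed example values: e_f = 3 m, v_f = 26 m, standing person (h_t = 1.8 m),
# 50 deg capture and display FOVs, display at v_d = 0.5 m, e_d = 0.065 m.
d = display_disparity(e_f=3.0, v_f=26.0, z_f=26.0 - 1.8,
                      fov_f=math.radians(50), v_d=0.5, fov_d=math.radians(50))
print(perceived_target_height(e_d=0.065, v_d=0.5, d=d))  # PTH in metres
print(jddi(d_gamma_arcmin=6.0, e_d=0.065, v_d=0.5))       # JDDI in metres
```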

Now, considering Fig. 2 and the above geometric constraints, it can be seen that the perceived target height (PTH, y-axis) increases with an increasing stereo baseline (x-axis). The solid lines show the increase in perceived target height for three different object types: tree crowns at 21 m (green), a lying person at 0.3 m (blue), and a standing person at 1.8 m above the surface (orange). The numbers above the markers indicate the corresponding display disparities and disparity gradients for a given stereoscopic display (assuming minimal object distances of 60 arcmin, or 1 deg). The just-detectable depth interval (JDDI) threshold (dashed lines) varies between individuals. With poorer stereo acuity, larger depth intervals are required for perceiving height differences. Under these conditions and assuming an inter-ocular distance of 6.5 cm (the leftmost point in Fig. 2), the height differences between target objects on the ground are unlikely to be detected, even if excellent stereo acuity is assumed. Larger baselines improve the ability to discriminate depth, but they also increase the disparity gradients. If the disparity gradient is excessively large (e.g., 133–340, outside the gray box in Fig. 2), the stereo images cannot be fused.

The third problem is that view-dependent occlusion in the stereo pairs causes binocular rivalry. Rivalry appears when radically different images are presented to each eye; when it is too strong, it prevents stereoscopic fusion41,42. Examples are illustrated in Fig. 3. It has been found that, if partially occluded object fragments are horizontally aligned and match a continuous surface, our visual system tends to extrapolate a coherent surface at an incorrect depth43. Horizontally aligned continuous object surfaces, however, are usually not present under realistic occlusion conditions such as ours. Although depth cannot be reconstructed computationally, we show that surface continuity can be restored computationally, which enables human depth perception.

Fig. 3

Stereoscopic thermal aerial recordings (left- and right-eye image pairs) of a sparsely occluded person standing with arms outstretched to the sides (blue box) in woodland. View-dependent partial (a) or full (b) occlusion in the stereo pairs causes binocular rivalry and prevents stereo fusion and, consequently, depth perception.

In this approach, we suppress occlusion by means of optical synthetic aperture sensing, as explained above and illustrated in Fig. 1. This also implies that we can compute stereoscopic integral images with suppressed occlusion for a given synthetic aperture of size a and for two viewing positions within a that are separated by a given baseline d. While for the monoscopic case the center of the synthetic aperture is used as the reference perspective of the resulting integral image, the two baseline-shifted viewing positions are applied for the stereoscopic case, as sketched below. This results in two integral images that reveal a parallax for all objects not located on the focal plane. With a large d (larger than the inter-ocular distance), we upscale disparities so that they do not fall below the limits of stereo acuity. The larger a, the more occlusion is suppressed, and binocular rivalry and extreme disparity gradients caused by tree crowns can consequently be reduced. However, a wide synthetic aperture also leads to a shallow depth of field and thus to defocus blur and lower contrast. The reduction in contrast and the loss of high spatial frequencies result in degradation of stereo acuity44. This is illustrated in Fig. 4.
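Building on the integral_image() sketch above (and under the same simplifying assumptions), generating a stereoscopic integral pair only requires integrating the same frames twice, with the per-frame offsets expressed relative to two virtual viewpoints shifted by half the stereo baseline to either side of the aperture center. The helper below is again hypothetical, not the authors' implementation.

```python
def stereo_integral_pair(frames, offsets_m, f_px, h_m, baseline_m):
    """Two integral images for virtual left/right viewpoints placed at
    -baseline/2 and +baseline/2 along the flight path within the aperture."""
    half = baseline_m / 2.0
    left = integral_image(frames, [(dx + half, dy) for dx, dy in offsets_m],
                          f_px, h_m)
    right = integral_image(frames, [(dx - half, dy) for dx, dy in offsets_m],
                           f_px, h_m)
    return left, right
```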

Fig. 4

Integral stereo pairs of the scenario shown in Fig. 3, where the applied synthetic aperture a is narrower (a) or wider (b). The larger a, the shallower the depth of field. This leads to a reduction in sharpness and contrast.

With the results presented below, we report three main findings: First, occlusion removal in stereoscopic images is of fundamental importance for object identification tasks. Stereoscopic perception alone leads to no significant improvement in the presence of occlusion. In fact, in all test cases with occlusions, observers’ performance for stereoscopic images was comparable to that for monoscopic images, and it was not improved by the introduction of motion parallax. Second, while discriminating depth computationally (e.g., using 3D reconstruction from sampled multi-view images) is currently impossible with state-of-the-art methods in the case of strong occlusion, it becomes feasible visually by fusing binocular images with scaled disparities, which can easily be generated with optical synthetic aperture sensing.

Third, the sampling and visualization parameters (best baseline and synthetic aperture size), although restricted by the acuity limits and disparity gradients (refer to Fig. 2), were found to be fairly consistent across all test cases evaluated.

Our findings are discussed in the Summary and Conclusion. They demonstrate that it is possible to discriminate the depths of objects seen through foliage on the basis of optical synthetic aperture imagery captured with first-person-view (FPV) controlled drones or a manned aircraft. This capability has the potential to support challenging search and detection tasks in which occlusion caused by vegetation is currently the limiting factor, including use cases such as search and rescue, wildfire detection, wildlife observation, security, and surveillance.
