Visual Slam-Based Mapping and Localization for Aerial Images

: Fast and accurate observation of an area in disaster scenarios such as earthquake, ﬂood and avalanche is crucial for ﬁrst aid teams. Digital surface models, orthomosaics and object detection algorithms can play an important role for rapid decision making and response in such scenarios. In recent years, Unmanned Aerial Vehicles (UAVs) have become increasingly popular because of their ability to perform diﬀerent tasks at lower costs. A real-time orthomosaic generated by using UAVs can be helpful for various tasks where both speed and eﬃciency are required. An orthomosaic provides an overview of the area to be observed, and helps the operator to ﬁnd the regions of interest. Then, object detection algorithms help to identify the desired objects in those regions. In this study, a monocular SLAM based system, which combines the camera and GPS data of the UAV, has been developed for mapping the observed environment in real-time. A deep learning based state-of-the-art object detection method is adapted to the system in order to detect objects in real time and acquire their global positions. The performance of the developed method is evaluated in both single and multiple UAVs scenarios.


Introduction
Classical 2D image stitching methods that perform real-time mapping from monocular camera in aerial images are built based on feature extraction and matching in consecutive frames [1]. These methods are mainly based on the calculation of homography, which defines the motion between two image planes. Since the calculations are limited to a planar surface in these methods, the 3D structure of the observed environment cannot be obtained. To solve this problem, authors in [2] used the Kanade-Lucas-Tomasi feature tracker, and fused the UAV's IMU (Inertial Measurement Unit) and GPS (Global Positioning System) sensor data. Dense point cloud and digital surface model were generated with the 3D camera position obtained by sensor fusion techniques.
Structure from Motion methods can also be used in orthomosaic generation. There are several algorithms that use Structure from Motion methods, such as OpenMVG [3], PhotoScan [4]. These methods generally follow feature extraction and matching, image alignment and bundle adjustment algorithm, sparse point cloud generation, dense point cloud and mesh generation, orthomosaic stages. In order to generate the final orthomosaic image with Structure from Motion methods, all images to be used in the mapping process must be prepared in advance and the mapping process takes a long time. Therefore, Structure from Motion based methods are not suitable for real-time and incremental usage.
Aside from Structure from Motion algorithms, SLAM (Simultaneous Localization and Mapping) methods are used for real-time 3D mapping and localization. Monocular camera based SLAM applications have recently become one of the most studied topics in robotics and computer vision. SLAM is considered a key technique for navigation and mapping in unknown environments. Monocular SLAM algorithms are basically categorized as feature-based and direct methods. Feature-based SLAM algorithms detect features in frames and track them in consecutive frames. Then, they use the resulting features to estimate the camera pose and generate the point cloud map [5,6,7]. On the other hand, direct methods directly use pixel intensities of the images instead of extracting features from the images. Therefore, direct methods tend to create a much more detailed map of the observed environment since they use more information coming from images [8,9]. However, in case of illumination changes and sudden camera movements, feature-based methods are more robust and can estimate camera pose with higher accuracy compared to direct methods. There are also semi-direct approaches such as SVO [10] that compute strong gradient pixels and achieve high speed. In [11], the authors proposed a dense monocular reconstruction method that integrates SVO as camera pose estimation module. In [12], authors use a feature-based SLAM method and the GPS data of the UAV to generate 2D orthomosaic maps from aerial images. In this study, to generate 2D and 3D orthomosaics, ORB-SLAM [6] which is a very fast and robust feature-based monocular SLAM method is used for camera pose estimation and spare point cloud generation. In [13], author propose a monocular SLAM-based method to generate 2D and 3D orthomosaic images. Similar to our study, the method uses ORB-SLAM as the monocular SLAM method. In addition, in their paper, values of the cells on the overlapping regions were determined by using a probabilistic approach at the orthomosaic stage. As opposed to this method, the values of the cells are determined according to the minimum elevation angle similar to [2] in our proposed method. Moreover, in our method, a deep learning based object detection method which is trained on a novel dataset was integrated to the mapping pipeline to detect objects on the rectified images. By marking these detections on 2D and 3D maps, the real world positions of the detected objects can be calculated, and these positions can be used to create a better map. Another method which is similar to ours is GPS-SLAM [14]. GPS-SLAM is expected to perform well on scarce datasets where FPS (frames per second) is 1 or below. The method augments ORB-SLAM's pose estimation by fusing GPS and inertial data. In addition to this, the authors increase the number of features that are extracted by ORB-SLAM, which highly affects and reduces the computation speed. The method works robust and more accurately compared to the ORB-SLAM on scarce datasets where FPS is 1 or less. A drawback is that as FPS increases (above 1 FPS) GPS-SLAM fails to track and estimate the camera pose which prevents the usage of the method in the pose estimation stage of an end-to-end mapping and localization pipeline. Unlike GPS-SLAM, we can achieve more robust camera pose estimation compared to ORB-SLAM at higher FPS as demonstrated in our experiments.
Object detection is the process of estimating the position and scale of an object instance in an image. Recently, deep learning based methods have achieved the highest accuracies in object detection. With the increasing processing capabilities of GPUs, deep learning based detection methods can now operate in real time. YOLOv3 [15], RetinaNet [16], SSD [17], Faster R-CNN [18] are the most important object detectors that can work in real time with high accuracies. Recently, anchor-free methods are proposed for object detection task [19,20]. In this study, YOLOv3, which is a very fast and highly accurate object detection network, is used in the proposed mapping pipeline.
Our Contributions: In this study, we propose a novel realtime mapping and localization pipeline for aerial images built on top of a highly accurate monocular feature-based SLAM method. We fuse GPS and SLAM data for better localization of the UAV and to be able to map the observed environment in multiple UAVs scenario. Although there are methods using SLAM as backbone in mapping methods, we use a fast interpolation method to densify the sparse depth map of feature-based SLAM in order to generate dense 3D maps. While this approach generates semantically strong maps, its main advantage is fast operating time which is crucial for multiple UAVs scenario. In addition, we integrate a state-of-the-art object detection method to the pipeline that allows to acquire real-world positions of the desired objects in the observed environment. Detection of certain objects will be extremely useful for certain applications such as locating lost persons in wilderness or locating some military targets (e.g., vehicles or radar systems) in military operations. Moreover, successful object detection systems return positions of these objects more accurately, therefore these positions can be used as fiducial marker points to align different frames and improve the accuracy of the created 2D/3D maps. To this end, the positions of stationary objects such as buildings, bridges, or military targets such as airports, radomes, and heliports will be extremely useful. An overview of the proposed mapping pipeline is given in Figure 1.

Method
In this study, a real-time 2D/3D mapping and object detection pipeline has been developed for both single and multiple UAVs scenarios. The main steps applied for this purpose are analyzed under the following sections: global pose estimation, dense point cloud generation, digital surface map, orthorectification, object detection and orthomosaic.

Camera Pose Estimation, Sparse Point Cloud Generation and Global Pose Estimation
Digital surface model and orthorectification steps are required to create an orthomosaic with images acquired from a ground-facing monocular camera attached to a UAV. For this purpose, the camera pose estimation and sparse point cloud generation steps are computed by using a featurebased SLAM method in the proposed method. The chosen monocular SLAM method in our work, ORB-SLAM, is a very fast and robust algorithm that can perform in large scale environments. It can compute camera pose accurately due to its robust ORB-based feature tracking, pose graph optimization and loop detection features. However, in monocular SLAM systems, the real world scale cannot be estimated, and the camera pose is given in a local coordinate system. The transformation of the camera pose estimated by the ORB-SLAM to the global coordinate system can be achieved by fusing the GPS data of the UAV and the computed camera pose. In order to calculate the real world scale, the first 20 frames were used for scale estimation at the initialization stage. To this end, the visual and geographic distances between frames are calculated first. Then, the average scale is computed by averaging the scales of the distances between frames which is calculated as, where m gps is the distance between the GPS positions of the two frames, and m slam is the distance between the visual camera pose of the two frames. Once the scale is computed and initialization is done, the camera pose transformation is calculated for the next frames. In order to find the camera pose transformation, a transformation function must be calculated between the two 3D point models of the global positions in UTM (Universal Transverse Mercator) coordinates and the visual poses obtained by ORB-SLAM. Least-squares estimation of transformation parameters between two point patterns [21] is used to calculate the transformation parameters between two 3D point models. After calculating the scale and similarity transformation matrix, the camera pose matrix can be denoted in the global coordinate system as, where r r i j represents the rotation matrix and t represents the position in UTM coordinate system.

Dense Point Cloud Generation
Sparse cloud interpolation is used to densify the sparse point cloud of the images after computing global camera pose. First, the depth values of the 3D points calculated by ORB-SLAM are written to the pixel positions in the image plane according to, where r 3 r 3 j is the last row of the rotation matrix, x, y, z, 1 is the point in the world frame, and z is the height. After re-projecting depth values to the image plane, inpainting is applied to fill the spaces between sampling points. Navier-Stokes based Inpainting [22] method is used for inpainting process which is inspired from fluid dynamics. This method travels along the edges of the known regions to the unknown regions as the edges are intended to be continuous. It continues isophotes (lines joining points with the same intensity) and matches gradient vectors at the border of inpainting regions continuously. Empty pixels are filled in a way to reduce the minimum variance in the region. Although the method cannot produce the details of the structures in the observed regions, the output depth maps usually have low noise and homogeneous appearance, and the method works very fast. Sparse point cloud computed by ORB-SLAM must be highly accurate for the method to produce accurate dense depth maps.

Digital Surface Map
A grid-based method as used in [2] is adapted to create a digital surface map using the generated dense point cloud in the previous stage. GridMap library [23] is used to efficiently handle RGB, elevation, point cloud data with multi-layer grid maps. The library provides rapid manipulation of the image data represented in global coordinates and speeds up the global orthomosaic operation with multi-layer approach.
In order to represent the observed surface with a 2.5D grid map containing elevation data, the x and y coordinates of the dense point cloud are structured as a 2-dimensional binary kd-tree [24]. Within a pre-defined interpolation radius (20 pixels radius), the nearest neighbors of the points are calculated. Then, inverse distance weighting is applied to interpolate the map points. Inverse distance weighting intuitively determines the height of the cell using a linear weighted combination of the nearest neighbors. Thus, by giving higher weight to the points closer to the center of the cell, interpolation provides smooth transition and noiseless elevations in the elevation map.

Orthorectification
After digital surface map generation stage, we have all the data that is required to create the final orthomosaic image. However, the visual distortion caused by the viewing angle and surface structure of the images should be corrected. Rectification of the images is done by using the camera pose and the digital surface model. Incremental grid-based orthomosaic method [2] is used in orthorectification step.
With the corresponding camera pose and the intrinsic matrix of the camera, it is checked whether each cell is in the visible camera cone. The grid map containing the elevation layer of the image is re-sized to the desired ground sampling distance. A 3D point X x i , y i , z i ) is created for each cell of the grid. These 3D points are reprojected to the camera image as, x KR|tX, where K is the intrinsic matrix, R|t is the global camera pose matrix. Due to the noisy and erroneous pose estimations on the elevation map, some of the projected points can fall outside of the image boundaries. These points are specified as invalid points and were not back-projected to the image plane.

Object Detection
The aim of object detection is to classify and localize objects with their labels and bounding boxes in an image.
The architectures of generic object detectors can mainly be categorized into two types as regression-based and region proposal based. Region proposal based frameworks first propose regions and then apply regression and classification on these proposals. The other regression-based frameworks skip the region proposal stage and directly predict class probabilities and their bounding boxes using an end-to-end learnable one stream network. In this study, once the rectification stage is completed, object detection can be performed on the RGB data stored in the multi-layer grid map. We use YOLOv3 for object detection in the proposed method. YOLOv3 is an extremely fast regression-based object detection algorithm. YOLOv3 divides images into S × S grids and each grid predicts a fixed number of bounding boxes. YOLOv3 predicts offsets tx, ty, tw, th to the anchor boxes and box confidence score that represents the class probabilities. Instead of directly computing the bounding box coordinates, computing the offsets to the ground-truth box results in a more robust training procedure. YOLOv3 architecture uses a more robust and deeper network called Darknet-53 instead of Darknet-19 [25]. There are residual connections between convolutional layers similar to ResNet [26] structure. YOLOv3 makes predictions at three different scales to improve the performance of detecting small objects in the images which is crucial for aerial images captured in high altitudes. We also used YOLOv3-Tiny architecture which is a faster version of YOLOv3. The local and global coordinates of the detected object can be obtained on rectified images. Then, these positions can be used as fiducial marker points to align consecutive frames and improve the quality of the resulting maps.

Orthomosaic
All of the data computed in previous stages can be combined to form high resolution 2D and 3D maps. First image taken from the orthorectification stage initializes the global map. After the initialization, the 2D and 3D maps are continuously updated with dense point clouds and orthorectified images using the multi-layer grid map. There are two conditions in the map updating stage. Observed area in the image can be a completely unknown region or the image can overlap with the global map. In the first case, new image data is directly added to the global map and the map is

Experiments
The performance of the developed mapping and object detection pipeline is evaluated in both single and multiple UAVs scenarios on two different datasets. In addition, we also collected a new dataset to train the utilized car detection algorithm. Here, we will first explain our car detector below and then evaluate the performance of the proposed mapping and localization system on two datasets.

Car Detection:
Here, we focus on cars in the mapping area, and train YOLOv3 and YOLOv3-Tiny for car detection in aerial images. To this end, we created our own data set consisting of colored digital images obtained in different weather conditions and scales by using DJI Matrice 600 Pro and DJI Inspire 1 unmanned aerial vehicles. The data set consists of approximately 10.000 colored digital images, and it contains approximately 30.000 aerial view car images. We annotated the cars by using the bounding boxes and created the data belonging to the positive class. Figure 2 shows examples from the data set that contains positive images. We used 80% of the data (approximately 24 K positive samples) for training the YOLOv3 and YOLOv3-Tiny detectors and the remaining 20% of the data is used for testing. NVIDIA Quadro P5000 GPU was used to train the YOLOv3 and YOLOv3-Tiny car models.
To evaluate the performance of the trained detectors, we used PASCAL VOC criterion which is the most commonly used metric for object detection. According to this metric, the position of the object is classified as true or false in accordance with the overlapping ratio of the detected coordinates and the ground truth positions. This overlapping was calculated by using area|Q∩R| area|Q∪R| formula. In this formula, Q shows the ground-truth location of object and R shows the location returned by the algorithm. If this ratio is over 50%, the detected position is considered as true positive -TP, otherwise, it is considered as false positive -FP. Then the mean average precision-mAP was determined by using www.jenrs.com Journal of Engineering Research and Sciences, 1(1): 01-09, 2022 precision-recall curves. Table 1 gives the mAP scores and speeds of the car detectors on the created dataset. The trained YOLOv3 and YOLOv3-Tiny detectors achieved accuracies of 83.4% and 80.2%, respectively in terms of mAP score. In terms of the speed, YOLOv3 and YOLOv3-Tiny methods operate at 22 FPS and 55 FPS respectively, on a laptop with i7 7700HQ processor and NVIDIA GTX1050TI GPU.  Orthomosaic Generation: For this purpose, aerial images captured with a ground-facing camera attached to a UAV at an altitude of approximately 100 meters were used. Images are captured at 10 FPS with a GoPro Hero3+ camera. The dataset consists of 1500 frames with a size of 1280×720 and the GPS data of each frame is kept in Exif format. Fig. 3 shows the generated orthomosaic and the detected objects in the single UAV scenario. Centers of the detected objects on the map are indicated with red boxes. The orthomosaic given in the figure is generated by using the incremental grid-based orthomosaic approach, selecting the cell values of the image which has the smallest elevation angle in the intersection regions. A good orthomosaic should have high resolution, homogeneous appearance and low geometric distortion. The observed area should be represented as much detailed as possible and aligned correctly in global scale. In the figure, incorrect alignments and seams at the alignment points can be observed due to the incorrect estimation of the global positions computed by fusing the SLAM and GPS data. However, in the high resolution orthomosaic, structures and the roads can be observed well and cars are localized correctly. The observed environment is reconstructed in a global coordinate system using UTM coordinates. Thus, distances can be measured on a realworld scale, and real-world positions of desired objects can be obtained.
The real world scaling accuracy of the method can be observed better in multiple UAVs scenario. In order to simulate the multiple UAVs scenario, the dataset is divided into two sets and these sets are arranged to have a small intersection between them. The method successfully maps the observed regions to the same plane as shown in Fig. 4 by using the data incoming from different sources. The generated 3D map by using sparse cloud interpolation is given in Fig. 5. Sparse cloud interpolation fails to recover information in sharp elevation transitions. This stems from the fact that the utilized interpolation approach is based on the sparse point cloud computation of ORB SLAM, which uses highly discriminative features for 3D mapping. Consequently, the low resolution of the sparse cloud leads to uncertain elevations, which distorts visual appearance. However, the generated 3D map is homogeneous with low noise and contains sufficient information about the structure of the observed area. Since there is no ground-truth data to evaluate the performance of the proposed method, ground-truth data is generated with PhotoScan [4] which is a highly accurate offline photogrammetry software. PhotoScan is an advanced computer vision algorithm that robustly estimates camera positions and creates high quality 2D-3D maps. Although it does not perform well in oblique images, it works better than other methods such as Bundler and Photosynth in moving, still, sequential and unordered images. The algorithm consists of three steps. The first step is alignment of the images, estimating camera parameters and positions. Second step consists of generating dense point clouds and building the geometric scene details on these aligned images. Final step is the texturing of the map with the images. Map and camera positions generated by using Photoscan are considered as ground-truth data.  The relative pose error (RPE) and absolute pose error (APE) metrics were used to evaluate the performance of the camera pose estimation [27]. APE, is also called the absolute trajectory error, makes a direct comparison between predictions and reference camera locations. APE tests the global consistency of the trajectory. RPE compares camera movements, motions and calculates translational and rotational drift per meter. We generated trajectories of the proposed method, ORB-SLAM and GPS on the dataset. RPE and APE values of the generated trajectories are compared to the ground-truth data generated by Photoscan in Table 2 and Table 3. The low standard deviation (std) values in the table (below 1 meters for both RPE and APE) indicate that camera pose estimation has low error. In addition, proposed method obtains lower error values than ORB-SLAM and GPS. Which shows that proposed method increases the robustness of ORB-SLAM's pose estimation.   6 shows the trajectory graphs according to the plane x, y, z for orthomosaic generated in single UAV scenario.
In the x and y planes, the method has not suffered from any shifts or errors. However, when the elevation increases, an instant shift and error occurs in the z plane. The main reason for this situation is ORB-SLAM's erroneous depth estimation in regions with insufficient features to detect and track. Generated trajectory of the UAV is given in Fig. 7. Robustness and success of the method can also be observed from the figure since trajectory generated with the method is almost equal to ground-truth trajectory. Operation time for the dataset is approximately 4 FPS.
The performance the method is also evaluated in single UAV scenario on phantom3-village dataset introduced in Map2DFusion [12]. Video sequence is recorded at an altitude of approximately 160 meters using DJI Phantom3 equipped with a ground-facing GoPro Hero3+ camera. The dataset consists of 200 frames. Fig. 8  tory results on plain regions and areas containing buildings and trees. The method fails to recover information on water regions since generation of the map is highly dependent to the feature-based SLAM approach. This results in some mis-alignments on the generated orthomosaic. Since the altitude is very high, YOLOv3 model failed to detect some of the vehicles.
x (m )  The 3D map generated by using sparse cloud interpolation is given in Fig. 9. Phantom3-village dataset is captured at 1 FPS. Low frame rate severely reduces the ORB-SLAM's depth map generation performance and this highly affects the depth map interpolation. Pipeline cannot generate 3D surfaces properly on featureless regions due to nature of the feature-based SLAM. Sparse cloud interpolation has difficulties in sharp elevation transitions as in the previous experiment. As seen in the generated 2D map, the method also fails to generate map points for water regions because of the featureless surface of the water. Since there is no ground-truth data for the dataset, ground-truth data is generated with Photoscan as before. We generated trajectories of the proposed method, ORB-SLAM and GPS on the dataset. RPE and APE values of the methods are compared to the ground-truth data in Table  4 and Table 5. Similar to first dataset, proposed method achieves lowest error for both APE and RPE. The large standard deviation (std) values indicate erroneous camera pose estimations. In the dataset, images captured in high altitudes. This highly affects ORB-SLAM's pose estimation performance. Fig. 10 shows the trajectory graphs according to the x, y, z plane for orthomosaic generated in single UAV scenario on phantom3-village dataset.   Fig. 6, the method has not suffered from any shifts or errors in the x and y planes. However, instant shifts and errors can be observed in the z plane. These errors are due to the erroneous depth estimations in featureless regions such as water regions and due to the capturing images at very high altitudes. Generated trajectory of the UAV is given in Fig. 11. Operation time for the dataset is approximately 4.5 FPS.

Conclusion
In this study, a real-time 2D and 3D mapping pipeline with object detection ability is developed by combining the camera and GPS data of UAVs. The method uses a robust monocular SLAM method, ORB-SLAM, and point cloud interpolation algorithm, which operates at a very high speed to generate the dense point cloud efficiently. The proposed method creates semantically strong, high-resolution maps and detects objects in real-time using incremental grid-based mosaic and YOLOv3 object detection methods. The proposed method also reconstructs the map in a global coordinate system and obtains the real-world positions of the detected objects. The developed method performs global scaling, object detection, alignment operations efficiently and accurately in both single and multiple UAVs scenarios.
The experimental results on two tested datasets show that the mapping pipeline generates a 3D map in real world scale, operates in real time, and the resulting generated map contains semantically strong information about structure of the observed region.

Conflict of Interest
The authors declare no conflict of interest.