STEP-Mover


STEP-Mover: Stratified and Tiered Elimination Process for Efficient LiDAR Dynamic Removal

T-ITS 2025

Yanpeng Jia1,2          Ting Wang1*      Shiliang Shao1      Xieyuanli Chen3     

* Corresponding Author

1State Key Laboratory of Robotics at Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China     2University of Chinese Academy of Sciences, Beijing, China     3National University of Defense Technology, Beijing, China    

Video

Abstract

Clean and high-fidelity global static maps are essential for precise localization and efficient navigation. However, dynamic objects in the environment may introduce "ghost trail" artifacts during map construction, which significantly degrade map quality and limit downstream tasks. Existing dynamic removal methods often struggle to balance computational efficiency and accuracy. To address this issue, this paper presents a Stratified and Tiered dynamic object removal framework, STEP-Mover, which efficiently filters dynamic points from the global map while preserving its high-fidelity features. The proposed method generates a voxel-based multi-scale descriptor and progressively localizes dynamic regions in a coarse-to-fine-to-refine manner by leveraging feature differences between the query scan and the prior map. To compensate for blind spots in the query scan, a high point retrieval strategy is introduced to retrieve static points outside the field of view. Finally, the coarsely extracted ground points are used as priors, and union-find-based connectivity clustering is applied to refine ground point recovery. Extensive comparative experiments are conducted on the SemanticKITTI, HeLiMOS, and self-collected M2UD datasets. Quantitative and qualitative results demonstrate that, compared with other baselines, the proposed method achieves superior performance in both accuracy and efficiency.

System Framework

The figure illustrates the overall system architecture. The system takes as input the prior map M, which contains the dynamic points to be removed, the query scans in the LiDAR coordinate system L, and the pose transformation from the LiDAR coordinate system to the world coordinate system W. The system first applies voxel hashing to partition and manage both the prior map and the query scans, enabling efficient data processing. Multi-scale descriptors are then generated to support subsequent dynamic point identification. Next, a coarse-to-fine-to-refine dynamic point elimination strategy is employed, in which dynamic voxels are detected from the feature differences between the query scan descriptors and the prior map descriptors. To mitigate quantization effects, dynamic voxels are further refined using union-find-based connectivity clustering, followed by fine-level point retrieval. Static points outside the sensor's FOV are restored through a high point extension method and fine ground fitting guided by the coarsely estimated ground prior. Ultimately, this pipeline efficiently eliminates dynamic points from the prior map and outputs a clean, high-fidelity global point cloud map.
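The paper's implementation is not reproduced here, but two of the building blocks named above, voxel hashing and union-find connectivity clustering, can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions: a fixed voxel size, 6-connectivity between occupied voxels, and identifiers such as `voxel_hash_partition` that are ours, not the authors'.

```python
import numpy as np
from collections import defaultdict

class UnionFind:
    """Disjoint-set structure with path halving, used to merge voxel clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def voxel_hash_partition(points, voxel_size=0.5):
    """Bucket an (N, 3) point array into voxels keyed by integer grid indices."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    voxels = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        voxels[key].append(i)              # store point indices per voxel
    return voxels

def connectivity_clusters(occupied):
    """Cluster occupied voxels by 6-connectivity using union-find."""
    occupied = set(occupied)
    uf = UnionFind()
    for (x, y, z) in occupied:
        uf.find((x, y, z))                 # register isolated voxels too
        for nb in ((x + 1, y, z), (x, y + 1, z), (x, y, z + 1)):
            if nb in occupied:
                uf.union((x, y, z), nb)
    clusters = defaultdict(list)
    for v in occupied:
        clusters[uf.find(v)].append(v)
    return list(clusters.values())

# Example: partition a synthetic scan and cluster its occupied voxels.
scan = np.random.rand(10000, 3) * 20.0
voxels = voxel_hash_partition(scan, voxel_size=0.5)
clusters = connectivity_clusters(voxels.keys())
```

The appeal of this combination is that the hash map gives O(1) point-to-voxel lookups without allocating a dense grid, while union-find lets quantized dynamic detections grow into object-level regions, which is how the pipeline mitigates voxelization artifacts.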

Experiment Setup

To comprehensively evaluate the performance of the baseline methods, we conduct extensive experiments on the SemanticKITTI, HeLiMOS, and M2UD datasets. These datasets cover diverse environments, including urban streets, highways, and rural areas, and involve various LiDAR types and configurations. Detailed dataset statistics are provided in Table I.




Quantitative Results

1) SemanticKITTI Dataset: The comparison results on the SemanticKITTI dataset are presented in Table II. Learning-based methods achieve competitive performance on most sequences, benefiting from extensive pre-training on the dataset. Among traditional methods, ERASOR suffers from quantization errors, which lead to the removal of many static points and thus degrade its overall performance. OctoMap and OctoMapFG remove dynamic objects using ray-casting-based visibility checking; however, inaccuracies in incident angle and pose estimation reduce their static accuracy. DUFOMap and BeautyMap perform well across multiple sequences by incorporating localization uncertainty or combining binary encoding with static recovery, respectively. However, their performance drops on Sequence 02, which differs significantly from the other sequences, likely reflecting the challenges posed by rural environments. In contrast, our method maintains stable performance across all five sequences. Although it does not always achieve the highest score in SA or DA, it obtains the highest HA on the most sequences, demonstrating its superior overall performance.
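As background for reading the tables, all three metrics can be computed from per-point labels. The sketch below assumes the definitions common in dynamic-removal benchmarks, where SA is the fraction of static points preserved and DA the fraction of dynamic points removed; treating HA as their harmonic mean is our assumption based on the name, so consult the paper for the exact definition.

```python
import numpy as np

def removal_metrics(gt_dynamic, removed):
    """SA/DA/HA from boolean per-point masks over the prior map.

    gt_dynamic: True where ground truth labels a map point as dynamic.
    removed:    True where the method deleted that point.
    HA is taken here as the harmonic mean of SA and DA (our assumption).
    """
    sa = (~gt_dynamic & ~removed).sum() / max((~gt_dynamic).sum(), 1)  # statics kept
    da = (gt_dynamic & removed).sum() / max(gt_dynamic.sum(), 1)       # dynamics removed
    ha = 2.0 * sa * da / max(sa + da, 1e-12)
    return sa, da, ha

# Toy example: 3 static points (2 kept) and 2 dynamic points (2 removed).
gt = np.array([False, False, False, True, True])
rm = np.array([False, False, True,  True, True])
print(removal_metrics(gt, rm))  # SA ≈ 0.667, DA = 1.0, HA = 0.8
```

A harmonic-style aggregate punishes imbalance: a method that deletes everything scores DA = 1 but SA ≈ 0, and HA collapses accordingly, which is why HA is the fairest single number for ranking.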

2) HeLiMOS Dataset: To further evaluate the generalization ability of the baselines, we conduct experiments on the HeLiMOS dataset, which involves multiple types of LiDAR sensors. Ground-truth poses are used as input to isolate performance from localization errors, and the results are presented in Table III. Learning-based methods exhibit significant domain adaptation issues, resulting in substantial performance degradation or even complete failure when applied to different LiDAR types. For sparse LiDAR data, Removert fails to generate robust range image projections and performs poorly on the Velodyne sequence. ERASOR retains its dynamic object removal ability across varying LiDAR types through the scan ratio test; however, its reliance on polar coordinate partitioning can produce large blank bins, leading to inefficient resource utilization. BeautyMap fails on solid-state LiDAR, likely because FOV inconsistencies negatively affect its static recovery module. Our method maintains stable performance across all sequences and achieves competitive results on multiple evaluation metrics, which we attribute to the voxel-based representation and multi-scale descriptors that enhance its generalization capability.
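For context on the scan ratio test mentioned above, the sketch below gives a simplified, unofficial rendition of ERASOR-style bin comparison, assuming pseudo-occupancy is a bin's height span; it also makes visible where the empty polar bins criticized above arise. Thresholds and parameter names are illustrative, not ERASOR's actual values.

```python
import numpy as np

def scan_ratio_test(map_pts, scan_pts, n_rings=20, n_sectors=60,
                    max_range=80.0, ratio_thresh=0.2):
    """Simplified ERASOR-style scan ratio test over polar (ring, sector) bins.

    Pseudo-occupancy of a bin is its height span (max z - min z). Bins where
    the query scan occupies far less height than the map are candidates for
    containing points left behind by departed dynamic objects.
    """
    def bin_height_spans(pts):
        r = np.linalg.norm(pts[:, :2], axis=1)
        theta = np.arctan2(pts[:, 1], pts[:, 0])
        ring = np.clip((r / max_range * n_rings).astype(int), 0, n_rings - 1)
        sector = ((theta + np.pi) / (2.0 * np.pi) * n_sectors).astype(int) % n_sectors
        zmin = np.full((n_rings, n_sectors), np.inf)
        zmax = np.full((n_rings, n_sectors), -np.inf)
        np.minimum.at(zmin, (ring, sector), pts[:, 2])
        np.maximum.at(zmax, (ring, sector), pts[:, 2])
        return np.where(zmax >= zmin, zmax - zmin, 0.0)

    span_map, span_scan = bin_height_spans(map_pts), bin_height_spans(scan_pts)
    ratio = span_scan / np.maximum(span_map, 1e-6)
    # Empty map bins (span 0) are the "large blank bins" noted above: far
    # polar bins cover huge areas yet often receive no points at all.
    return (span_map > 1e-6) & (ratio < ratio_thresh)
```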

Furthermore, to evaluate the effect of additional localization errors on dynamic object removal performance, we adopt the pose estimates from KISS-ICP as input. The results are presented in Table IV. When localization errors are introduced, the performance of all methods degrades, with visibility-based methods being particularly affected. Benefiting from the binary occupancy comparison and the GMM descriptor in the fine dynamic removal stage, our method exhibits a certain level of robustness to localization errors and achieves satisfactory performance.
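To see why these two cues tolerate pose noise: a binary occupancy check ignores sub-voxel drift, and a distribution-level descriptor compares voxel statistics rather than exact point positions. The sketch below is our illustration only, standing in a single Gaussian per voxel and a Bhattacharyya distance for the paper's GMM descriptor and its actual comparison rule.

```python
import numpy as np

def gaussian_descriptor(points):
    """Mean/covariance of one voxel's points (single-Gaussian stand-in
    for the paper's GMM descriptor)."""
    mu = points.mean(axis=0)
    cov = np.cov(points.T) + 1e-6 * np.eye(3)   # regularize sparse voxels
    return mu, cov

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 3D Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    d = mu1 - mu2
    term_mean = 0.125 * d @ np.linalg.solve(cov, d)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov

def voxel_is_dynamic_candidate(map_voxel_pts, scan_voxel_pts, dist_thresh=1.0):
    """Binary occupancy disagreement first, then a distributional check.

    FOV handling is omitted: a voxel empty in the scan may simply be
    unobserved, which is what the high point retrieval stage addresses.
    """
    if scan_voxel_pts is None or len(scan_voxel_pts) < 5:
        return True   # occupied in map, (nearly) empty in scan
    d = bhattacharyya(*gaussian_descriptor(map_voxel_pts),
                      *gaussian_descriptor(scan_voxel_pts))
    return d > dist_thresh
```

Because both checks operate on voxel-level statistics, a pose error smaller than the voxel size mostly shifts points within the same cell rather than flipping the decision, which matches the robustness observed in Table IV.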




Qualitative Results

To intuitively illustrate the performance differences between our method and the baseline approaches, we present qualitative results on the SemanticKITTI, HeLiMOS, and M2UD datasets. The figure presents the accuracy and recall of dynamic object removal on the SemanticKITTI and HeLiMOS datasets.

To further evaluate the effectiveness of our method for dynamic object removal on robots equipped with low-cost sensors in real-world scenarios, we conduct experiments on the self-collected M2UD [26] dataset. Specifically, we select park sequences with low dynamics involving pedestrians and cyclists, as well as urban sequences containing various types of highly dynamic vehicles. Poses are estimated using GR-LOAM [34]. As shown in the figure, STEP-Mover achieves high-precision dynamic point removal even when deployed on a low-speed robot equipped with low-cost sensors. In park scenarios with heavy tree occlusion, our method accurately identifies dynamic points. In highly dynamic urban environments, the proposed algorithm effectively removes dense dynamic objects. Furthermore, the static structures in the scenes are largely preserved, which can be attributed to the fine static retrieval module.




Ablation Study

To further investigate the contribution of each component of our algorithm, we conduct an ablation study on voxel resolution and the individual modules. The figure illustrates the effect of varying voxel resolution along the xy-axis on dynamic object removal performance.

Table V presents the impact of different modules on overall system performance.




Runtime Analysis

Table VI presents the computational time of each module in the proposed method and compares it with that of the baselines. The proposed method achieves real-time performance across various LiDAR types. Furthermore, the coarse-to-fine-to-refine design uses a rough yet efficient first pass to reduce the number of voxels processed by the subsequent modules, thereby achieving a favorable trade-off between accuracy and efficiency. However, the fine retrieval module consumes a significant portion of the computational time: further analysis reveals that, even with coarsely extracted ground points as prior guidance, ground fitting still accounts for the majority of the runtime of the entire pipeline. Compared with the baselines, the proposed method demonstrates superior and stable efficiency across LiDAR types, highlighting its potential for integration into SLAM systems.
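To make that cost concrete, a generic RANSAC-style plane fit seeded by the coarse ground prior is sketched below. This is our illustration of why ground fitting is expensive, not the paper's implementation; whether STEP-Mover uses RANSAC or another estimator is not stated here.

```python
import numpy as np

def fit_ground_plane(prior_ground_pts, iters=100, inlier_thresh=0.1, seed=0):
    """RANSAC plane fit over coarsely extracted ground points.

    Returns the best plane (n, d) with n.p + d = 0 and its inlier mask.
    The coarse prior shrinks the candidate set, but the per-iteration
    distance check over all candidates remains the hot loop, consistent
    with ground fitting dominating the pipeline's runtime.
    """
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(iters):
        idx = rng.choice(len(prior_ground_pts), size=3, replace=False)
        p0, p1, p2 = prior_ground_pts[idx]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                     # degenerate (collinear) sample
        n /= norm
        d = -n @ p0
        inliers = np.abs(prior_ground_pts @ n + d) < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers
```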