UniStateDLO: Unified Generative State Estimation
and Tracking of Deformable Linear Objects
Under Occlusion for Constrained Manipulation

Anonymous Author(s)
Submitted to IEEE T-RO


We propose UniStateDLO, a novel unified perception framework for deformable linear objects (DLOs) that supports both single-frame state estimation and cross-frame tracking of DLOs under severe occlusions. Leveraging diffusion-based generative modeling, UniStateDLO reconstructs complete DLO configurations even from highly partial point clouds, with high accuracy, strong robustness, and real-time performance. Trained entirely on synthetic data, it generalizes zero-shot to diverse real-world DLOs and provides a reliable perception front-end for constrained manipulation tasks.

Abstract

Perception of deformable linear objects (DLOs), such as cables, ropes, and wires, is fundamental to downstream robotic manipulation. Despite extensive progress, vision-based perception remains highly vulnerable to occlusions caused by obstacles and large deformations. Moreover, the high dimensionality of the state space, the lack of distinctive visual features, and sensor noise further compound the challenges of reliable DLO perception. To address these open issues, this paper presents UniStateDLO, the first complete deep-learning-based DLO perception pipeline that achieves robust performance under severe occlusion, covering both single-frame state estimation and cross-frame state tracking from partial point clouds. Both tasks are formulated as conditional generative problems, leveraging the strong capability of diffusion models to capture the complex mapping between highly partial observations and high-dimensional DLO states. Trained solely on large-scale synthetic data, UniStateDLO is highly data-efficient: it generalizes zero-shot from simulation to the real world without any real-world training data. Comprehensive simulation and real-world experiments demonstrate that UniStateDLO outperforms all state-of-the-art baselines in both estimation and tracking, producing globally smooth yet locally precise DLO state predictions in real time, even under substantial occlusions. Integration into a closed-loop DLO manipulation system further validates its ability to support stable feedback control in complex, constrained 3-D environments.
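Both tasks amount to sampling a DLO state from a diffusion model conditioned on an observation. The sketch below shows generic DDPM ancestral sampling for an N×3 array of node positions; it illustrates the technique and is not the UniStateDLO implementation. The noise predictor `eps_model`, the linear β schedule, and T = 50 are assumptions. The toy "oracle" predictor is conditioned directly on the target state only to show that the reverse process recovers it, whereas the real model is conditioned on features of the partial point cloud.

```python
import numpy as np


def linear_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule with derived alpha and cumulative alpha-bar terms."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def ddpm_sample(eps_model, cond, shape, T=50, seed=0):
    """Generic DDPM ancestral sampling of an array of `shape`, conditioned on `cond`."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = linear_schedule(T)
    x = rng.standard_normal(shape)  # start from pure noise x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # add noise with variance beta_t on every step except the final one
        x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else mean
    return x


def make_oracle(T=50):
    """Toy noise predictor (hypothetical) that knows the clean target, passed as `cond`."""
    _, _, alpha_bars = linear_schedule(T)

    def eps_model(x, t, cond):
        return (x - np.sqrt(alpha_bars[t]) * cond) / np.sqrt(1.0 - alpha_bars[t])

    return eps_model


# Toy usage: a straight 40-node "cable" along the x-axis as the target state.
target = np.stack([np.linspace(0.0, 1.0, 40), np.zeros(40), np.zeros(40)], axis=1)
sample = ddpm_sample(make_oracle(), cond=target, shape=target.shape)
```

Because the oracle predicts the exact noise, the last reverse step returns the target up to floating-point error; with a learned predictor, the output is instead a sample from the learned conditional distribution.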


Method Overview

Overview of the proposed UniStateDLO pipeline, comprising Single-Frame State Estimation for initialization and Cross-Frame State Tracking for sequential motion tracking. Given a partial DLO point cloud, the state estimation module first produces coarse predictions through two complementary branches built on PointNet++ features and then refines them with a diffusion model. For cross-frame tracking, a KNN-based feature aggregation module extracts node-wise local features around the previous frame's predictions, and a second diffusion model uses these features to infer per-node cross-frame motion.
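A minimal NumPy sketch of node-wise KNN feature aggregation: for each node of the previous frame's prediction, it gathers the k nearest points of the current partial cloud and max-pools their per-point features together with their offsets relative to the node. The value k = 16, the concatenation of relative offsets, and max pooling are assumptions in the spirit of PointNet-style grouping, not details confirmed for UniStateDLO; `points`, `feats`, and `nodes` are hypothetical inputs.

```python
import numpy as np


def knn_aggregate(points, feats, nodes, k=16):
    """Pool features of the k nearest observed points around each DLO node.

    points: (P, 3) current partial point cloud
    feats:  (P, C) per-point features (e.g. from a point-cloud backbone)
    nodes:  (M, 3) node positions predicted for the previous frame
    returns (M, C + 3) node-wise local features
    """
    k = min(k, points.shape[0])
    # squared distances between every node and every point, shape (M, P)
    d2 = ((nodes[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # indices of the k nearest points per node (unordered), shape (M, k)
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rel = points[idx] - nodes[:, None, :]  # neighbour offsets, (M, k, 3)
    grouped = np.concatenate([feats[idx], rel], axis=-1)  # (M, k, C + 3)
    return grouped.max(axis=1)  # symmetric max-pool over the k neighbours


rng = np.random.default_rng(0)
points = rng.random((200, 3))
feats = rng.random((200, 8))
nodes = rng.random((30, 3))
node_feats = knn_aggregate(points, feats, nodes)  # shape (30, 11)
```

Features of this kind would then condition the tracking diffusion model, whose per-node motion output presumably updates the previous nodes (new nodes = previous nodes + predicted motion).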

Large-Scale Data Synthesis in Simulation

A dataset of 300K samples is generated entirely in simulation by randomly deforming DLOs, using a simulator built on the Unity3D engine with the Obi Rope package. DLO length, diameter, and stiffness, as well as camera viewpoints, are randomized.
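The per-sample randomization described above can be sketched as a simple parameter sampler. All numeric ranges below are placeholders chosen for illustration, not the values used to build the 300K-sample dataset, and the spherical camera parameterization is an assumption; passing the sampled values to the Unity3D/Obi Rope scene is not shown.

```python
import math
import random
from dataclasses import dataclass

# Placeholder (min, max) ranges for illustration only; NOT the paper's settings.
RANGES = {
    "length_m": (0.5, 1.5),
    "diameter_m": (0.005, 0.02),
    "stiffness": (0.0, 1.0),  # hypothetical normalized stiffness scale
    "cam_azimuth_deg": (0.0, 360.0),
    "cam_elevation_deg": (20.0, 80.0),
    "cam_distance_m": (0.8, 1.5),
}


@dataclass
class SceneParams:
    length_m: float
    diameter_m: float
    stiffness: float
    cam_azimuth_deg: float
    cam_elevation_deg: float
    cam_distance_m: float

    def camera_position(self):
        """Camera position on a sphere around the scene origin (assumed parameterization)."""
        az = math.radians(self.cam_azimuth_deg)
        el = math.radians(self.cam_elevation_deg)
        r = self.cam_distance_m
        return (r * math.cos(el) * math.cos(az), r * math.cos(el) * math.sin(az), r * math.sin(el))


def sample_scene(rng):
    """Draw each parameter independently and uniformly from its range."""
    return SceneParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Independent uniform sampling is the simplest choice; a real pipeline might instead sample some parameters jointly (for example, tying diameter to stiffness).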

Real-World Experimental Results

The UniStateDLO model trained on the synthetic dataset can be applied directly to diverse real-world DLOs without collecting any real-world data or retraining. All inference runs in real time on a single NVIDIA RTX 4090 GPU, taking on average 94.19 ms/frame for single-frame estimation and 89.35 ms/frame for cross-frame tracking.
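Since these figures are per-frame latencies, the corresponding update rates follow directly, assuming frames are processed one at a time without pipelining:

```latex
f_{\text{est}} = \frac{1000\ \text{ms/s}}{94.19\ \text{ms/frame}} \approx 10.6\ \text{Hz},
\qquad
f_{\text{track}} = \frac{1000\ \text{ms/s}}{89.35\ \text{ms/frame}} \approx 11.2\ \text{Hz}
```

Under that assumption, a downstream controller receives a new state estimate roughly eleven times per second once tracking has taken over from initialization.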

DLOs Used in Real-World Experiments

We use four DLOs with distinct materials and physical properties to evaluate the real-world generalization performance of the proposed UniStateDLO.

Real-World Single-Frame State Estimation

Qualitative comparison with single-frame estimation baselines.

Additional visualized cases of real-world state estimation by UniStateDLO.

Real-World Cross-Frame State Tracking

Qualitative comparison with cross-frame tracking baselines.

Tracking performance on a long-term DLO motion sequence under severe, dynamic occlusions and large-scale deformation.

Integration in Constrained DLO Manipulation

A dual-arm robot rigidly grasps the two ends of a DLO and manipulates it toward a desired 3-D configuration,
where continuous collision avoidance and occlusion-robust perception are required. The proposed UniStateDLO serves as
the front-end perception module, providing real-time feedback to the downstream controller.
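The estimate-then-track flow of the perception front-end can be sketched as a small generator that yields one DLO state per incoming point cloud: single-frame estimation initializes on the first frame, and cross-frame tracking handles every later frame. `estimate` and `track` are hypothetical stand-ins for the two trained modules, initializing once without any re-initialization logic is a simplification, and the controller interface is not shown.

```python
from typing import Callable, Iterable, Iterator, Optional

import numpy as np

Cloud = np.ndarray  # (P, 3) partial point cloud of the current frame
State = np.ndarray  # (N, 3) DLO node positions


def perception_stream(
    clouds: Iterable[Cloud],
    estimate: Callable[[Cloud], State],
    track: Callable[[Cloud, State], State],
) -> Iterator[State]:
    """Initialize with single-frame estimation, then track frame to frame."""
    state: Optional[State] = None
    for cloud in clouds:
        if state is None:
            state = estimate(cloud)  # first frame: no previous state available
        else:
            state = track(cloud, state)  # later frames: condition on previous state
        yield state  # e.g. handed to the shape controller as feedback


# Toy usage with dummy modules: estimation returns zeros, tracking shifts nodes by +1 in x.
def dummy_estimate(cloud: Cloud) -> State:
    return np.zeros((10, 3))


def dummy_track(cloud: Cloud, prev: State) -> State:
    return prev + np.array([1.0, 0.0, 0.0])


frames = [np.random.default_rng(i).random((100, 3)) for i in range(4)]
states = list(perception_stream(frames, dummy_estimate, dummy_track))
```

Using a generator keeps the perception loop decoupled from the controller, which can consume states at whatever rate they are produced.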

DLO Shape Control Task 1

Accurate state estimation and tracking despite occlusions in the initial state, enabling the DLO to be manipulated into the desired shape while avoiding obstacles.

DLO Shape Control Task 2

Reliable perception throughout the intermediate stages of manipulation, with the DLO finally reaching complex configurations with self-intersections.

DLO Shape Control Task 3

Robust under large-scale occlusion, even when one endpoint of the DLO remains completely invisible for an extended period.