(last updated March 19th, 2026)
Below we provide a short overview of approaches and a summary at the end.
Wu et al., 2024 (UniGarmentManip) — Dense Correspondence
Wu, R., Lu, H., Wang, Y., Wang, Y., & Dong, H. (2024). UniGarmentManip: A unified framework for category-level garment manipulation via dense visual correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16340-16350).
Input: partial 3D point cloud
Output: dense topological correspondence field on garment surface
Metrics: task success rate (unfolding, folding, hanging)
Performance: ~83–89% success across tasks (sim), 11/15 real folding success
Dataset: CLOTH3D-based simulation + small real-world test → sim + limited real
Code is public.
Longhini et al., 2025 (Cloth-Splatting) — 3D Reconstruction
Longhini, A., Büsching, M., Duisterhof, B. P., Lundell, J., Ichnowski, J., Björkman, M., & Kragic, D. (2025, January). Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision. In Conference on Robot Learning (pp. 2845-2865).
Input: RGB images
Output: full 3D cloth mesh (via Gaussian splatting refinement)
Metrics: median trajectory error (MTE), the distance between the estimated cloth tracks and ground-truth tracks; position accuracy (δ), the percentage of tracks within predefined distance thresholds (10, 20, 40, 80, and 160 mm) of the ground truth; and survival rate, the average fraction of frames until the tracking error exceeds a threshold, which the authors set to 50 mm.
Performance:
MTE (mm): 4.928 ± 5.240 (shorts), 1.703 ± 1.213 (towel), 3.159 ± 2.818 (t-shirt)
δ: 0.851 ± 0.075 (shorts), 0.879 ± 0.057 (towel), 0.858 ± 0.080 (t-shirt)
Survival rate: 0.917 ± 0.067 (shorts), 0.927 ± 0.059 (towel), 0.888 ± 0.084 (t-shirt)
Dataset: VR-Folding + CLOTH3D → sim
Code is public.
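To make the three Cloth-Splatting metrics above concrete, here is a minimal sketch of how MTE, position accuracy (δ), and survival rate could be computed over point tracks. The function names and the toy track layout are our own assumptions, not the paper's released code; tracks are lists of per-frame points, with distances in the tracks' units (e.g. mm).

```python
# Sketch of the three track-based metrics (MTE, delta, survival rate).
# Hypothetical implementation; not taken from the Cloth-Splatting codebase.
import math

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def median_trajectory_error(pred_tracks, gt_tracks):
    """Median over all (track, frame) point distances."""
    errs = sorted(
        _dist(p, g)
        for pt, gt in zip(pred_tracks, gt_tracks)
        for p, g in zip(pt, gt)
    )
    n, mid = len(errs), len(errs) // 2
    return errs[mid] if n % 2 else 0.5 * (errs[mid - 1] + errs[mid])

def position_accuracy(pred_tracks, gt_tracks, thresholds=(10, 20, 40, 80, 160)):
    """Fraction of points within each threshold, averaged over thresholds."""
    errs = [_dist(p, g)
            for pt, gt in zip(pred_tracks, gt_tracks)
            for p, g in zip(pt, gt)]
    fracs = [sum(e <= t for e in errs) / len(errs) for t in thresholds]
    return sum(fracs) / len(fracs)

def survival_rate(pred_tracks, gt_tracks, thresh=50.0):
    """Average fraction of frames survived before error first exceeds thresh."""
    rates = []
    for pt, gt in zip(pred_tracks, gt_tracks):
        alive = len(pt)
        for i, (p, g) in enumerate(zip(pt, gt)):
            if _dist(p, g) > thresh:
                alive = i
                break
        rates.append(alive / len(pt))
    return sum(rates) / len(rates)
```

Note that δ here is averaged over the five thresholds, which is consistent with the single per-garment δ values reported above.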
Tian et al., 2025 (UniClothDiff) — Diffusion-based State Estimation
Tian, T., Li, H., Ai, B., Yuan, X., Huang, Z., & Su, H. (2025). Diffusion dynamics models with generative state estimation for cloth manipulation. arXiv preprint arXiv:2503.11999.
Input: sparse RGB-D observations
Output: full cloth state reconstruction + dynamics prediction
Metrics: state reconstruction error, long-horizon prediction error
Performance: order-of-magnitude reduction in prediction error vs GNN baselines
Dataset: simulated cloth environments (CLOTH3D-like) + real robot demos → sim + partial real
Project page: https://uniclothdiff.github.io/
Wang et al., 2024 — TRTM: Template-based Mesh Reconstruction
Wang, W., Li, G., Zamora, M., & Coros, S. (2024, May). TRTM: Template-based reconstruction and target-oriented manipulation of crumpled cloths. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (pp. 12522-12528).
Input: single depth image (top view)
Output: full cloth mesh (vertex positions + visibility) + grasp vertices
Metrics: vertex error, reconstruction loss, task coverage (flattening)
Performance: improved flattening efficiency vs baselines; near-target shapes in 1–2 steps
Dataset: synthetic mass–spring + ~3k real RGB-D cloth states → sim + real
Lips et al., 2024 — Keypoints (sim-to-real)
Lips, T., De Gusseme, V. L., & Wyffels, F. (2024). Learning keypoints for robotic cloth manipulation using synthetic data. IEEE Robotics and Automation Letters, 9(7), 6528-6535.
Input: single RGB image
Output: semantic garment keypoints (e.g., sleeves, corners)
Metrics: mean Average Precision (mAP)
Performance: 64.3% mAP (synthetic only), 74.2% mAP (after real fine-tuning)
Dataset: aRTF Clothes (~2k real images, >100 garments) + synthetic training → real + sim
Code, dataset, and trained models are released.
Tabernik et al., 2024 (CeDiRNet-3DoF) — Grasp Point Detection
Tabernik, D., Muhovič, J., Urbas, M., & Skočaj, D. (2024). Center direction network for grasping point localization on cloths. IEEE Robotics and Automation Letters, 9(10), 8913-8920.
Input: RGB images (optionally depth)
Output: 2D grasp points (corners) + orientation (3DoF)
Metrics: F1 score
Performance: ~78% F1 (best), >97% in easy (fully visible) cases
Dataset: ViCoS Towel Dataset (8k real + 12k synthetic images) → real + sim
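Since grasp-point detection is evaluated with an F1 score, a short sketch of how such a score is typically computed may help: predictions are greedily matched to unused ground-truth points within a pixel radius, and precision/recall follow from the match count. The matching rule and the default radius below are our assumptions, not the exact CeDiRNet-3DoF protocol.

```python
# Distance-thresholded F1 for 2D grasp-point detection (illustrative sketch).
import math

def match_points(preds, gts, radius):
    """Greedily match each prediction to the nearest unused GT within radius."""
    used, tp = set(), 0
    for p in preds:
        best, best_d = None, radius
        for i, g in enumerate(gts):
            if i in used:
                continue
            d = math.dist(p, g)  # Euclidean distance (Python 3.8+)
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            tp += 1
    return tp

def f1_score(preds, gts, radius=10.0):
    """F1 from matched true positives; 0.0 when nothing matches."""
    if not preds or not gts:
        return 0.0
    tp = match_points(preds, gts, radius)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall)
```

Greedy nearest-neighbor matching is a simplification; a stricter evaluation would use optimal (e.g. Hungarian) assignment.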
Deng & Hsu, 2025 (CLASP) — Semantic Keypoints + Language
Deng, Y., & Hsu, D. (2025, May). General-purpose clothes manipulation with semantic keypoints. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 13181-13187)
Input: depth images
Output: semantic keypoints with language labels (e.g., sleeve, collar)
Metrics: average keypoint distance (AKD), the average distance between ground-truth and detected keypoints, and average precision (AP), the proportion of keypoints correctly detected within a given threshold. Since the observed depth images have a resolution of 224 × 224, the authors set thresholds of 8, 4, and 2 pixels.
Performance: AKD 3.8 px; AP_8 91.0%, AP_4 75.4%, AP_2 50.2%
Dataset: SoftGym + CLOTH3D (simulation) → sim
Code is available.
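The AKD and AP metrics above admit a very compact implementation once keypoints are paired with their ground truth. The sketch below assumes correspondence by semantic label (i.e. pred[i] and gt[i] refer to the same keypoint), which is our assumption rather than something stated in the paper.

```python
# AKD and threshold-based AP for semantic keypoints (illustrative sketch).
# Assumes pred[i] corresponds to gt[i] via the shared semantic label.
import math

def akd(pred, gt):
    """Average pixel distance between corresponding detected and GT keypoints."""
    ds = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(ds) / len(ds)

def ap_at(pred, gt, thresh):
    """Fraction of keypoints detected within thresh pixels of the GT."""
    return sum(math.dist(p, g) <= thresh for p, g in zip(pred, gt)) / len(gt)
```

With this reading, AP_8, AP_4, and AP_2 in the performance line are simply ap_at evaluated at 8, 4, and 2 pixels.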
Tzelepis et al., 2024 — Semantic State Classification
Input: RGB images / video frames
Output: cloth manipulation state (flat, folded, crumpled, etc.)
Metrics: classification accuracy
Performance: 96.0% (human data), 82.2% (UR robot), 70.6% (Kinova, domain transfer)
Dataset: 33.6k human images + 48.2k robot images → real
Code is available.
Qian et al., 2020 — Cloth Region Segmentation for Grasping
Qian, J., Weng, T., Zhang, L., Okorn, B., & Held, D. (2020, October). Cloth region segmentation for robust grasp selection. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 9553-9560).
Input: depth image (RGB-D used during training)
Output: pixel-wise segmentation (edges, corners, regions) + grasp location and direction
Metrics: grasp success rate (task-driven evaluation)
Performance: higher grasp success than classical and learning baselines, particularly robust in crumpled configurations
Dataset: ~8 minutes of real RGB-D video with self-supervised labels + real robot experiments → real
Summary
- Sparse outputs (keypoints, grasp points) → strong real-world performance, limited expressiveness
- Dense / geometric outputs (correspondences, meshes) → richer representation, more simulation reliance
- Full-state / generative models → highest expressiveness, still weak real-world benchmarking
- Datasets: recent works increasingly include real data, but evaluation is still often task-driven rather than purely perceptual
Current cloth perception systems remain far from real-time, deployable pipelines due to several structural gaps.
First, there is a trade-off between representation richness and speed: sparse keypoints and grasp points run in real time but lack sufficient state information, while dense or full-state methods (meshes, correspondences, diffusion-based reconstructions) are computationally too heavy for closed-loop control.
Second, partial observability and self-occlusion are still poorly handled—most methods rely on single-view inputs and do not robustly infer hidden layers or topology changes.
Third, there is limited temporal consistency and tracking: perception is often frame-based, with weak integration of dynamics, leading to unstable estimates during manipulation.
Fourth, generalization across garment categories and materials remains unresolved, especially for multi-layer or highly deformable items.
Fifth, there is a lack of standardized real-time benchmarks, meaning latency, robustness, and failure modes are rarely reported.
Finally, integration with control is still shallow: few systems provide uncertainty-aware, actionable representations that can be directly used in fast feedback loops.
Addressing these gaps likely requires hybrid approaches combining compact state representations, temporal modeling, and physically grounded priors with strict real-time constraints.
