(Last updated March 18th, 2026)
Planning for robotic cloth manipulation has progressively shifted from explicit, model-based formulations toward representation-driven and data-driven approaches. Early methods relied on geometric or physics-based models to plan over simplified cloth states, but suffered from poor scalability and model mismatch. More recent work instead focuses on learning compact, task-relevant representations, ranging from images and optical flow to latent embeddings and semantic keypoints, that make action selection tractable under high-dimensional, partially observable dynamics. As a result, much of what is termed “planning” is now implemented implicitly through goal-conditioned or predictive policies, often combined with hierarchical decomposition or intermediate canonical states to handle long-horizon tasks such as folding. At the same time, emerging latent-space planners and foundation models introduce new paradigms, enabling either structured global planning in learned spaces or end-to-end generalist policies trained on large-scale data. Despite these advances, key challenges remain: robust long-horizon reasoning, consistent generalization across fabrics and garment categories, tight integration of high-level task structure with low-level contact-rich control, and standardized evaluation protocols that allow fair comparison across methods.
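To make the "implicit planning" idea concrete: a latent-space predictive planner encodes the current and goal observations into latent vectors, rolls candidate action sequences through a learned latent dynamics model, and executes the first action of the lowest-cost sequence before replanning. The sketch below illustrates this loop with a cross-entropy-method sampler. It is a minimal illustration of the general pattern behind visual-MPC-style planners such as [4] and [14], not a reimplementation of any of them; the `encode`/`dynamics` functions and their weight matrices are toy placeholders standing in for learned networks.

```python
# Minimal sketch of sampling-based planning over a learned latent dynamics model.
# `encode` and `dynamics` are toy placeholders for learned networks so that the
# sketch runs end to end; in practice they would be trained from interaction data.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((64, 8))   # placeholder "encoder" weights
W_dyn = rng.standard_normal((4, 8))    # placeholder "dynamics" weights

def encode(obs):
    # Placeholder for a learned image encoder: observation -> latent vector.
    return np.tanh(obs @ W_enc)

def dynamics(z, a):
    # Placeholder for a learned latent transition model: (latent, action) -> next latent.
    return z + 0.1 * np.tanh(a @ W_dyn)

def plan_cem(z0, z_goal, horizon=5, action_dim=4, n_samples=256, n_elite=32, n_iters=5):
    """Cross-entropy-method planner: sample action sequences, roll them out in
    latent space, and refit the sampling distribution to the lowest-cost elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, horizon, action_dim))
        costs = np.empty(n_samples)
        for i, seq in enumerate(samples):
            z = z0
            for a in seq:
                z = dynamics(z, a)
            costs[i] = np.linalg.norm(z - z_goal)   # distance to goal latent
        elites = samples[np.argsort(costs)[:n_elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]   # execute the first action, then replan (MPC)

obs, goal_obs = rng.standard_normal(64), rng.standard_normal(64)
first_action = plan_cem(encode(obs), encode(goal_obs))
print("first planned action:", first_action)
```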
Summary of planning approaches
| Representation | Paper | Planning level | Training data | Input | Action representation | Output | State scope | Performance | Generalization | Key idea | Main limitations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mesh / explicit geometry | [3] Arnold et al. (2023) | Explicit planning (trajectory optimization) | Simulation + limited real | Mesh / voxel + goal | Continuous Cartesian trajectory / control sequence | Action sequence | Cloth (+ partial robot) | Moderate | Low–moderate | Planning via forward dynamics optimization | Model mismatch; expensive |
| Image | [4] Hoque et al. (2022) | Predictive (visual MPC) | Large sim + real | Image + goal image | End-effector displacement (sampled action sequences) | Best action sequence | Cloth (implicit) | Good (short horizon) | Moderate | Plan via learned visual dynamics | Sampling cost; error accumulation |
| Image | [5] Weng et al. (2022) | Goal-conditioned reactive | Real + sim | Image + flow | Pick-and-place in pixel space (2D → 3D projection) | Single action | Cloth (flow/image) | High | Moderate–high | Flow defines action target | Limited structure; no long-horizon reasoning |
| Image | [6] Ha & Song (2022) | Reactive (closed-loop) | Large real dataset | Image | Parameterized dynamic primitive (pick point + fling motion) | Primitive execution | Cloth (image) | High | Moderate | Dynamic primitives reduce planning need | Task-specific |
| Latent | [2] Tanaka et al. (2018) | Latent one-step planning | Simulation | Image → latent | Latent action vector | Next state / action | Cloth (latent) | Moderate | Low–moderate | Planning in latent space | Hard to interpret; limited horizon |
| Latent | [11] Lippi et al. (2022) | Graph-based global planning | Sim + real | Image → latent | Edges in latent roadmap (discrete transitions) | Latent path + actions | Cloth (latent graph) | Moderate–high | Moderate | Global planning in latent graph (LSR); see the sketch after the table | Requires coverage of the state space; scaling issues |
| Latent | [14] Yan et al. (2021) | Model-based predictive planning (latent dynamics) | Simulation (large-scale random interaction) + sim2real transfer | Image → latent embedding + goal | Continuous latent action via learned dynamics model (implicit through prediction) | Action sequence (via planning in latent space) | Cloth (latent dynamics model) | Moderate–high (improves over standard visual model-based methods) | Moderate (sim2real works with domain randomization) | Learn joint representation + dynamics via contrastive estimation and plan in latent space | Latent space quality critical; planning still short-horizon; limited structure for long tasks |
| Semantic | [10] Deng & Hsu (2025) | Hierarchical planning | Real datasets | Image / point cloud → keypoints | Keypoint-conditioned actions (e.g., grasp semantic points) | Action | Cloth (semantic) | High | Higher than most prior methods | Use semantic structure for planning | Perception dependency |
| Hybrid (image + task stage) | [7] Avigal et al. (2022) | Hierarchical staged planning | Real + sim | Image + stage | Discrete primitives (pick, place, smooth) with parameterization | Primitive action | Cloth + task state | High | Moderate | Task decomposition simplifies planning | Manual design; brittle |
| Hybrid (canonical state) | [8] Canberk et al. (2023) | Planning via canonicalization | Real + sim | Image | Pick-and-place toward canonical alignment | Action | Cloth (canonicalized) | High | Moderate–high | Funnel states to canonical config | Limited by the canonicalization step |
| Hybrid (unified pipeline) | [9] Xue et al. (2023) | Unified hierarchical planning | Real + sim + human-in-the-loop (HITL) | Partial observation | Parameterized pick-and-place (3D grasp + target) | Action | Cloth + task abstraction | High | Moderate | Unified multi-stage planning | Still task-family specific |
| Symbolic / classical | [1] Stria et al. (2014) | Predefined sequence | Minimal training | Segmented garment | Predefined grasp + trajectory templates | Fixed sequence | Cloth (symbolic) | High (controlled) | Very low | Hard-coded actions | No adaptability |
| Geometric / symbolic | [12] Miller et al. (2012) | Explicit planning (geometric, constrained) | No learning (analytical model) | Polygonal garment model + predefined fold sequence | Parameterized fold primitives (g-folds) | Action sequence (fold plan) | Cloth (geometric, quasi-static) | High (in controlled flat-folding tasks) | Very low (requires flat, known garment, controlled setup) | Planning via geometric reasoning over constrained cloth state (quasi-static folds) | Strong assumptions (flat cloth, known model); no dynamics; poor generalization |
| Foundation / generative policies (no planning) | | | | | | | | | | | |
| Image + semantic (VLM) | [13] Black et al. (2024) π₀ | Reactive end-to-end policy (foundation model) | Large-scale multi-robot, multi-task real + sim datasets | Image + language instruction | Continuous action generation via flow model (trajectory / control distribution) | Action chunk (closed-loop execution) | Cloth + robot + environment (implicit, multimodal) | High (broad task success incl. cloth manipulation) | High (zero-shot across tasks, setups, embodiments) | Learn a generalist vision-language-action policy using large-scale data and flow matching | No explicit planning; hard to interpret; inconsistent success; sensitive to prompts |
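In contrast to the per-step predictive planners in the table, the latent space roadmap of [11] performs global planning over a graph whose nodes are latent codes of previously visited cloth states and whose edges are action-labelled transitions observed in the training data; a plan is a shortest path from the current latent to the goal latent. The sketch below is a minimal illustration of that pattern under toy assumptions: the roadmap data, edge costs, and all function names are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of global planning over a latent roadmap (toy stand-in for the
# LSR idea in [11]). Nodes are latent codes of previously observed cloth states;
# edges record that an action moved the system between two such states.
import heapq
import numpy as np

rng = np.random.default_rng(1)

# Toy roadmap: 6 latent states (8-D) and action-labelled edges from interaction data.
nodes = rng.standard_normal((6, 8))
edges = {0: [(1, "pick-place A")], 1: [(2, "pick-place B"), (3, "fold corner")],
         2: [(4, "fold corner")], 3: [(4, "smooth")], 4: [(5, "final fold")], 5: []}

def nearest_node(z):
    """Map an arbitrary latent code to its closest roadmap node."""
    return int(np.argmin(np.linalg.norm(nodes - z, axis=1)))

def shortest_action_plan(z_start, z_goal):
    """Dijkstra over the roadmap; returns the action labels along the cheapest path."""
    start, goal = nearest_node(z_start), nearest_node(z_goal)
    frontier = [(0.0, start, [])]
    visited = set()
    while frontier:
        cost, node, plan = heapq.heappop(frontier)
        if node == goal:
            return plan
        if node in visited:
            continue
        visited.add(node)
        for nxt, action in edges[node]:
            step = float(np.linalg.norm(nodes[nxt] - nodes[node]))  # edge cost in latent space
            heapq.heappush(frontier, (cost + step, nxt, plan + [action]))
    return None  # goal not reachable within the roadmap

plan = shortest_action_plan(nodes[0] + 0.01, nodes[5] - 0.01)
print("planned action sequence:", plan)
```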
References
[1] Stria, J., Průša, D., Hlaváč, V., Wagner, L., Petrik, V., Krsek, P., & Smutný, V. (2014, September). Garment perception and its folding using a dual-arm robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 61-67). IEEE.
[2] Tanaka, D., Arnold, S., & Yamazaki, K. (2018). EMD Net: An encode–manipulate–decode network for cloth manipulation. IEEE Robotics and Automation Letters, 3(3), 1771-1778.
[3] Arnold, S., Tanaka, D., & Yamazaki, K. (2023). Cloth manipulation planning on basis of mesh representations with incomplete domain knowledge and voxel-to-mesh estimation. Frontiers in Neurorobotics, 16, 1045747.
[4] Hoque, R., Seita, D., Balakrishna, A., Ganapathi, A., Tanwani, A. K., Jamali, N., … & Goldberg, K. (2022). Visuospatial foresight for physical sequential fabric manipulation. Autonomous Robots, 46(1), 175-199.
[5] Weng, T., Bajracharya, S. M., Wang, Y., Agrawal, K., & Held, D. (2022, January). FabricFlowNet: Bimanual cloth manipulation with a flow-based policy. In Conference on Robot Learning (pp. 192-202).
[6] Ha, H., & Song, S. (2022, January). FlingBot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. In Conference on Robot Learning (pp. 24-33).
[7] Avigal, Y., Berscheid, L., Asfour, T., Kröger, T., & Goldberg, K. (2022, October). Speedfolding: Learning efficient bimanual folding of garments. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1-8). IEEE.
[8] Canberk, A., Chi, C., Ha, H., Burchfiel, B., Cousineau, E., Feng, S., & Song, S. (2023, May). Cloth funnels: Canonicalized-alignment for multi-purpose garment manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5872-5879). IEEE.
[9] Xue, H., Li, Y., Xu, W., Li, H., Zheng, D., & Lu, C. (2023, December). UniFolding: Towards Sample-efficient, Scalable, and Generalizable Robotic Garment Folding. In Conference on Robot Learning (pp. 3321-3341).
[10] Deng, Y., & Hsu, D. (2025, May). General-purpose clothes manipulation with semantic keypoints. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 13181-13187). IEEE.
[11] Lippi, M., Poklukar, P., Welle, M. C., Varava, A., Yin, H., Marino, A., & Kragic, D. (2022). Enabling visual action planning for object manipulation through latent space roadmap. IEEE Transactions on Robotics, 39(1), 57-75.
[12] Miller, S., Van Den Berg, J., Fritz, M., Darrell, T., Goldberg, K., & Abbeel, P. (2012). A geometric approach to robotic laundry folding. The International Journal of Robotics Research, 31(2), 249-267.
[13] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., … & Zhilinsky, U. (2024). $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
[14] Yan, W., Vangipuram, A., Abbeel, P., & Pinto, L. (2021, October). Learning predictive representations for deformable objects using contrastive estimation. In Conference on Robot Learning (pp. 564-574).
