Planning

(Last updated March 18th, 2026)

Planning for robotic cloth manipulation has progressively shifted from explicit, model-based formulations toward representation-driven and data-driven approaches. Early methods relied on geometric or physics-based models to plan over simplified cloth states, but suffered from poor scalability and model mismatch. More recent work instead focuses on learning compact, task-relevant representations, ranging from images and optical flow to latent embeddings and semantic keypoints, that make action selection tractable under high-dimensional and partially observable dynamics. As a result, much of what is termed "planning" is now implemented implicitly through goal-conditioned or predictive policies, often combined with hierarchical decomposition or intermediate canonical states to handle long-horizon tasks such as folding. At the same time, emerging latent-space planners and foundation models introduce new paradigms, enabling either structured global planning in learned spaces or end-to-end generalist policies trained on large-scale data. Despite these advances, key challenges remain: robust long-horizon reasoning, consistent generalization across fabrics and garment categories, tight integration between high-level task structure and low-level contact-rich control, and standardized evaluation protocols that allow fair comparison across methods.
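
To make the predictive, goal-conditioned style of planning concrete before the summary table, the sketch below shows a generic random-shooting model-predictive loop over a learned dynamics model, in the spirit of the visual/latent MPC entries ([4], [14]). It is a minimal sketch under assumed interfaces: `encode`, `predict`, and `cost` are placeholder callables standing in for a learned observation encoder, latent dynamics model, and goal-distance metric; they are not components of any specific paper.

```python
import numpy as np

def plan_action_sequence(obs, goal, encode, predict, cost,
                         horizon=5, n_samples=256, action_dim=4):
    """Random-shooting MPC over a learned latent dynamics model.

    `encode`, `predict`, and `cost` are placeholders for learned
    components (observation encoder, latent dynamics, goal distance);
    each system cited below defines these differently.
    """
    z_goal = encode(goal)
    # Sample candidate action sequences, e.g. pick-and-place parameters
    # normalized to [-1, 1].
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(n_samples, horizon, action_dim))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        z = encode(obs)
        total = 0.0
        for a in seq:              # roll out the learned dynamics model
            z = predict(z, a)
            total += cost(z, z_goal)
        if total < best_cost:
            best_cost, best_seq = total, seq
    # Execute the first action of best_seq, then replan (receding horizon).
    return best_seq
```

In practice only the first action of the best candidate is executed before replanning on the new observation, and more sample-efficient optimizers such as the cross-entropy method are often substituted for uniform sampling.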

Summary of planning approaches

| Representation | Paper | Planning level | Training data | Input | Action representation | Output | State scope | Performance | Generalization | Key idea | Main limitations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mesh / explicit geometry | [3] Arnold et al. (2023) | Explicit planning (trajectory optimization) | Simulation + limited real | Mesh / voxel + goal | Continuous Cartesian trajectory / control sequence | Action sequence | Cloth (+ partial robot) | Moderate | Low–moderate | Planning via forward dynamics optimization | Model mismatch; expensive |
| Image | [4] Hoque et al. (2022) | Predictive (visual MPC) | Large sim + real | Image + goal image | End-effector displacement (sampled action sequences) | Best action sequence | Cloth (implicit) | Good (short horizon) | Moderate | Plan via learned visual dynamics | Sampling cost; error accumulation |
| Image | [5] Weng et al. (2022) | Goal-conditioned reactive | Real + sim | Image + flow | Pick-and-place in pixel space (2D → 3D projection) | Single action | Cloth (flow/image) | High | Moderate–high | Flow defines action target | Limited structure; no long-horizon planning |
| Image | [6] Ha & Song (2022) | Reactive (closed-loop) | Large real dataset | Image | Parameterized dynamic primitive (pick point + fling motion) | Primitive execution | Cloth (image) | High | Moderate | Dynamic primitives reduce planning need | Task-specific |
| Latent | [2] Tanaka et al. (2018) | Latent one-step planning | Simulation | Image → latent | Latent action vector | Next state / action | Cloth (latent) | Moderate | Low–moderate | Planning in latent space | Hard to interpret; limited horizon |
| Latent | [11] Lippi et al. (2022) | Graph-based global planning | Sim + real | Image → latent | Edges in latent roadmap (discrete transitions) | Latent path + actions | Cloth (latent graph) | Moderate–high | Moderate | Global planning in latent graph (LSR) | Requires coverage; scaling |
| Latent | [14] Yan et al. (2021) | Model-based predictive planning (latent dynamics) | Simulation (large-scale random interaction) + sim2real transfer | Image → latent embedding + goal | Continuous latent action via learned dynamics model (implicit through prediction) | Action sequence (via planning in latent space) | Cloth (latent dynamics model) | Moderate–high (improves over standard visual model-based methods) | Moderate (sim2real works with domain randomization) | Learn joint representation + dynamics via contrastive estimation and plan in latent space | Latent-space quality critical; planning still short-horizon; limited structure for long tasks |
| Semantic | [10] Deng & Hsu (2025) | Hierarchical planning | Real datasets | Image / point cloud → keypoints | Keypoint-conditioned actions (e.g., grasp semantic points) | Action | Cloth (semantic) | High | Higher than most | Use semantic structure for planning | Perception dependency |
| Hybrid (image + task stage) | [7] Avigal et al. (2022) | Hierarchical staged planning | Real + sim | Image + stage | Discrete primitives (pick, place, smooth) with parameterization | Primitive action | Cloth + task state | High | Moderate | Task decomposition simplifies planning | Manual design; brittle |
| Hybrid (canonical state) | [8] Canberk et al. (2023) | Planning via canonicalization | Real + sim | Image | Pick-and-place toward canonical alignment | Action | Cloth (canonicalized) | High | Moderate–high | Funnel states to canonical config | Canonicalization limits |
| Hybrid (unified pipeline) | [9] Xue et al. (2023) | Unified hierarchical planning | Real + sim + HITL | Partial observation | Parameterized pick-and-place (3D grasp + target) | Action | Cloth + task abstraction | High | Moderate | Unified multi-stage planning | Still task-family specific |
| Symbolic / classical | [1] Stria et al. (2014) | Predefined sequence | Minimal training | Segmented garment | Predefined grasp + trajectory templates | Fixed sequence | Cloth (symbolic) | High (controlled) | Very low | Hard-coded actions | No adaptability |
| Geometric / symbolic | [12] Miller et al. (2012) | Explicit planning (geometric, constrained) | No learning (analytical model) | Polygonal garment model + predefined fold sequence | Parameterized fold primitives (g-folds) | Action sequence (fold plan) | Cloth (geometric, quasi-static) | High (in controlled flat-folding tasks) | Very low (requires flat, known garment, controlled setup) | Planning via geometric reasoning over constrained cloth state (quasi-static folds) | Strong assumptions (flat cloth, known model); no dynamics; poor generalization |
| Foundation / generative policies (no planning) | | | | | | | | | | | |
| Image + semantic (VLM) | [13] Black et al. (2024), π₀ | Reactive end-to-end policy (foundation model) | Large-scale multi-robot, multi-task real + sim datasets | Image + language instruction | Continuous action generation via flow model (trajectory / control distribution) | Action chunk (continuous, via flow matching) | Cloth + robot + environment (implicit, multimodal) | High (broad task success incl. cloth manipulation) | High (zero-shot across tasks, setups, embodiments) | Learn a generalist vision-language-action policy using large-scale data and flow matching | No explicit planning; hard to interpret; inconsistent success; sensitive to prompts |
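
The latent-roadmap entry ([11]) differs from the receding-horizon loop sketched above in that it plans globally over a discrete graph built from observed transitions. The following is a minimal sketch under assumed data structures: latent states are presumed to be already quantized into discrete node ids (the clustering step is omitted), and `transitions` is a hypothetical list of (node, action, node) tuples rather than the exact interface of any cited system.

```python
from collections import deque

def build_roadmap(transitions):
    """Assemble a directed graph from observed (z_from, action, z_to)
    transitions, where latent states have already been quantized to
    discrete node ids (a stand-in for the clustering used in practice)."""
    graph = {}
    for z_from, action, z_to in transitions:
        graph.setdefault(z_from, []).append((z_to, action))
    return graph

def plan_in_roadmap(graph, start, goal):
    """Breadth-first search for the shortest action sequence connecting
    two latent nodes; returns None if the roadmap does not link them."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, actions = queue.popleft()
        if node == goal:
            return actions
        for nxt, action in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, actions + [action]))
    return None
```

Each planned edge must then be grounded by a low-level action module that turns the stored action (or the pair of latent nodes it connects) into an executable pick-and-place command; that grounding step is where most of the method-specific machinery lives.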

References

[1] Stria, J., Průša, D., Hlaváč, V., Wagner, L., Petrik, V., Krsek, P., & Smutný, V. (2014, September). Garment perception and its folding using a dual-arm robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 61-67). IEEE.

[2] Tanaka, D., Arnold, S., & Yamazaki, K. (2018). EMD Net: An encode–manipulate–decode network for cloth manipulation. IEEE Robotics and Automation Letters, 3(3), 1771-1778.

[3] Arnold, S., Tanaka, D., & Yamazaki, K. (2023). Cloth manipulation planning on basis of mesh representations with incomplete domain knowledge and voxel-to-mesh estimation. Frontiers in Neurorobotics, 16, 1045747.

[4] Hoque, R., Seita, D., Balakrishna, A., Ganapathi, A., Tanwani, A. K., Jamali, N., … & Goldberg, K. (2022). Visuospatial foresight for physical sequential fabric manipulation. Autonomous Robots, 46(1), 175-199.

[5] Weng, T., Bajracharya, S. M., Wang, Y., Agrawal, K., & Held, D. (2022, January). FabricFlowNet: Bimanual cloth manipulation with a flow-based policy. In Conference on Robot Learning (pp. 192-202).

[6] Ha, H., & Song, S. (2022, January). Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. In Conference on Robot Learning (pp. 24-33).

[7] Avigal, Y., Berscheid, L., Asfour, T., Kröger, T., & Goldberg, K. (2022, October). Speedfolding: Learning efficient bimanual folding of garments. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1-8). IEEE.

[8] Canberk, A., Chi, C., Ha, H., Burchfiel, B., Cousineau, E., Feng, S., & Song, S. (2023, May). Cloth funnels: Canonicalized-alignment for multi-purpose garment manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5872-5879). 

[9] Xue, H., Li, Y., Xu, W., Li, H., Zheng, D., & Lu, C. (2023, December). UniFolding: Towards Sample-efficient, Scalable, and Generalizable Robotic Garment Folding. In Conference on Robot Learning (pp. 3321-3341).

[10] Deng, Y., & Hsu, D. (2025, May). General-purpose clothes manipulation with semantic keypoints. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 13181-13187).

[11] Lippi, M., Poklukar, P., Welle, M. C., Varava, A., Yin, H., Marino, A., & Kragic, D. (2022). Enabling visual action planning for object manipulation through latent space roadmap. IEEE Transactions on Robotics, 39(1), 57-75.

[12] Miller, S., Van Den Berg, J., Fritz, M., Darrell, T., Goldberg, K., & Abbeel, P. (2012). A geometric approach to robotic laundry folding. The International Journal of Robotics Research, 31(2), 249-267.

[13] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., … & Zhilinsky, U. (2024). π₀: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.

[14] Yan, W., Vangipuram, A., Abbeel, P., & Pinto, L. (2021, October). Learning predictive representations for deformable objects using contrastive estimation. In Conference on Robot Learning (pp. 564-574).