Planning

(Last updated March 18th, 2026)

Planning for robotic cloth manipulation has progressively shifted from explicit, model-based formulations toward representation-driven and data-driven approaches. Early methods relied on geometric or physics-based models to plan over simplified cloth states, but suffered from poor scalability and model mismatch. More recent work instead focuses on learning compact, task-relevant representations, ranging from images and optical flow to latent embeddings and semantic keypoints, that make action selection tractable under high-dimensional and partially observable dynamics. As a result, much of what is termed "planning" is now implemented implicitly through goal-conditioned or predictive policies, often combined with hierarchical decomposition or intermediate canonical states to handle long-horizon tasks such as folding. At the same time, emerging latent-space planners and foundation models introduce new paradigms, enabling either structured global planning in learned spaces or end-to-end generalist policies trained on large-scale data. Despite these advances, key challenges remain: robust long-horizon reasoning, consistent generalization across fabrics and garment categories, tight integration between high-level task structure and low-level contact-rich control, and standardized evaluation protocols that allow fair comparison across methods.
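
To make the predictive, goal-conditioned style of planning concrete before the summary table, the sketch below shows a generic random-shooting model-predictive loop over a learned dynamics model, in the spirit of the visual/latent MPC entries ([4], [14]). It is a minimal sketch under assumed interfaces: `encode`, `predict`, and `cost` are placeholder callables standing in for a learned observation encoder, latent dynamics model, and goal-distance metric; they are not components of any specific paper.

```python
import numpy as np

def plan_action_sequence(obs, goal, encode, predict, cost,
                         horizon=5, n_samples=256, action_dim=4):
    """Random-shooting MPC over a learned latent dynamics model.

    `encode`, `predict`, and `cost` are placeholders for learned
    components (observation encoder, latent dynamics, goal distance);
    each system cited below defines these differently.
    """
    z_goal = encode(goal)
    # Sample candidate action sequences, e.g. pick-and-place parameters
    # normalized to [-1, 1].
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(n_samples, horizon, action_dim))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        z = encode(obs)
        total = 0.0
        for a in seq:              # roll out the learned dynamics model
            z = predict(z, a)
            total += cost(z, z_goal)
        if total < best_cost:
            best_cost, best_seq = total, seq
    # Execute the first action of best_seq, then replan (receding horizon).
    return best_seq
```

In practice only the first action of the best candidate is executed before replanning on the new observation, and more sample-efficient optimizers such as the cross-entropy method are often substituted for uniform sampling.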

Summary of planning approaches

| Representation | Paper | Planning level | Training data | Input | Action representation | Output | State scope | Performance | Generalization | Key idea | Main limitations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mesh / explicit geometry | [3] Arnold et al. (2023) | Explicit planning (trajectory optimization) | Simulation + limited real | Mesh / voxel + goal | Continuous Cartesian trajectory / control sequence | Action sequence | Cloth (+ partial robot) | Moderate | Low–moderate | Planning via forward dynamics optimization | Model mismatch; expensive |
| Image | [4] Hoque et al. (2022) | Predictive (visual MPC) | Large sim + real | Image + goal image | End-effector displacement (sampled action sequences) | Best action sequence | Cloth (implicit) | Good (short horizon) | Moderate | Plan via learned visual dynamics | Sampling cost; error accumulation |
| Image | [5] Weng et al. (2022) | Goal-conditioned reactive | Real + sim | Image + flow | Pick-and-place in pixel space (2D → 3D projection) | Single action | Cloth (flow/image) | High | Moderate–high | Flow defines action target | Limited structure; no long-horizon planning |
| Image | [6] Ha & Song (2022) | Reactive (closed-loop) | Large real dataset | Image | Parameterized dynamic primitive (pick point + fling motion) | Primitive execution | Cloth (image) | High | Moderate | Dynamic primitives reduce planning need | Task-specific |
| Latent | [2] Tanaka et al. (2018) | Latent one-step planning | Simulation | Image → latent | Latent action vector | Next state / action | Cloth (latent) | Moderate | Low–moderate | Planning in latent space | Hard to interpret; limited horizon |
| Latent | [11] Lippi et al. (2022) | Graph-based global planning | Sim + real | Image → latent | Edges in latent roadmap (discrete transitions) | Latent path + actions | Cloth (latent graph) | Moderate–high | Moderate | Global planning in latent graph (LSR) | Requires coverage; scaling |
| Latent | [14] Yan et al. (2021) | Model-based predictive planning (latent dynamics) | Simulation (large-scale random interaction) + sim2real transfer | Image → latent embedding + goal | Continuous latent action via learned dynamics model (implicit through prediction) | Action sequence (via planning in latent space) | Cloth (latent dynamics model) | Moderate–high (improves over standard visual model-based methods) | Moderate (sim2real works with domain randomization) | Learn joint representation + dynamics via contrastive estimation and plan in latent space | Latent-space quality critical; planning still short-horizon; limited structure for long tasks |
| Semantic | [10] Deng & Hsu (2025) | Hierarchical planning | Real datasets | Image / point cloud → keypoints | Keypoint-conditioned actions (e.g., grasp semantic points) | Action | Cloth (semantic) | High | Higher than most | Use semantic structure for planning | Perception dependency |
| Hybrid (image + task stage) | [7] Avigal et al. (2022) | Hierarchical staged planning | Real + sim | Image + stage | Discrete primitives (pick, place, smooth) with parameterization | Primitive action | Cloth + task state | High | Moderate | Task decomposition simplifies planning | Manual design; brittle |
| Hybrid (canonical state) | [8] Canberk et al. (2023) | Planning via canonicalization | Real + sim | Image | Pick-and-place toward canonical alignment | Action | Cloth (canonicalized) | High | Moderate–high | Funnel states to canonical config | Canonicalization limits |
| Hybrid (unified pipeline) | [9] Xue et al. (2023) | Unified hierarchical planning | Real + sim + HITL | Partial observation | Parameterized pick-and-place (3D grasp + target) | Action | Cloth + task abstraction | High | Moderate | Unified multi-stage planning | Still task-family specific |
| Symbolic / classical | [1] Stria et al. (2014) | Predefined sequence | Minimal training | Segmented garment | Predefined grasp + trajectory templates | Fixed sequence | Cloth (symbolic) | High (controlled) | Very low | Hard-coded actions | No adaptability |
| Geometric / symbolic | [12] Miller et al. (2012) | Explicit planning (geometric, constrained) | No learning (analytical model) | Polygonal garment model + predefined fold sequence | Parameterized fold primitives (g-folds) | Action sequence (fold plan) | Cloth (geometric, quasi-static) | High (in controlled flat-folding tasks) | Very low (requires flat, known garment, controlled setup) | Planning via geometric reasoning over constrained cloth state (quasi-static folds) | Strong assumptions (flat cloth, known model); no dynamics; poor generalization |
| Foundation / generative policies (no planning) | | | | | | | | | | | |
| Image + semantic (VLM) | [13] Black et al. (2024), π₀ | Reactive end-to-end policy (foundation model) | Large-scale multi-robot, multi-task real + sim datasets | Image + language instruction | Continuous action generation via flow model (trajectory / control distribution) | Action chunk (continuous, via flow matching) | Cloth + robot + environment (implicit, multimodal) | High (broad task success incl. cloth manipulation) | High (zero-shot across tasks, setups, embodiments) | Learn a generalist vision-language-action policy using large-scale data and flow matching | No explicit planning; hard to interpret; inconsistent success; sensitive to prompts |
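
The latent-roadmap entry ([11]) differs from the receding-horizon loop sketched above in that it plans globally over a discrete graph built from observed transitions. The following is a minimal sketch under assumed data structures: latent states are presumed to be already quantized into discrete node ids (the clustering step is omitted), and `transitions` is a hypothetical list of (node, action, node) tuples rather than the exact interface of any cited system.

```python
from collections import deque

def build_roadmap(transitions):
    """Assemble a directed graph from observed (z_from, action, z_to)
    transitions, where latent states have already been quantized to
    discrete node ids (a stand-in for the clustering used in practice)."""
    graph = {}
    for z_from, action, z_to in transitions:
        graph.setdefault(z_from, []).append((z_to, action))
    return graph

def plan_in_roadmap(graph, start, goal):
    """Breadth-first search for the shortest action sequence connecting
    two latent nodes; returns None if the roadmap does not link them."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, actions = queue.popleft()
        if node == goal:
            return actions
        for nxt, action in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, actions + [action]))
    return None
```

Each planned edge must then be grounded by a low-level action module that turns the stored action (or the pair of latent nodes it connects) into an executable pick-and-place command; that grounding step is where most of the method-specific machinery lives.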

References

[1] Stria, J., Průša, D., Hlaváč, V., Wagner, L., Petrik, V., Krsek, P., & Smutný, V. (2014, September). Garment perception and its folding using a dual-arm robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 61-67). IEEE.

[2] Tanaka, D., Arnold, S., & Yamazaki, K. (2018). EMD Net: An encode–manipulate–decode network for cloth manipulation. IEEE Robotics and Automation Letters, 3(3), 1771-1778.

[3] Arnold, S., Tanaka, D., & Yamazaki, K. (2023). Cloth manipulation planning on basis of mesh representations with incomplete domain knowledge and voxel-to-mesh estimation. Frontiers in Neurorobotics, 16, 1045747.

[4] Hoque, R., Seita, D., Balakrishna, A., Ganapathi, A., Tanwani, A. K., Jamali, N., … & Goldberg, K. (2022). Visuospatial foresight for physical sequential fabric manipulation. Autonomous Robots, 46(1), 175-199.

[5] Weng, T., Bajracharya, S. M., Wang, Y., Agrawal, K., & Held, D. (2022, January). FabricFlowNet: Bimanual cloth manipulation with a flow-based policy. In Conference on Robot Learning (pp. 192-202).

[6] Ha, H., & Song, S. (2022, January). Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. In Conference on Robot Learning (pp. 24-33).

[7] Avigal, Y., Berscheid, L., Asfour, T., Kröger, T., & Goldberg, K. (2022, October). Speedfolding: Learning efficient bimanual folding of garments. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1-8). IEEE.

[8] Canberk, A., Chi, C., Ha, H., Burchfiel, B., Cousineau, E., Feng, S., & Song, S. (2023, May). Cloth funnels: Canonicalized-alignment for multi-purpose garment manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5872-5879). 

[9] Xue, H., Li, Y., Xu, W., Li, H., Zheng, D., & Lu, C. (2023, December). UniFolding: Towards Sample-efficient, Scalable, and Generalizable Robotic Garment Folding. In Conference on Robot Learning (pp. 3321-3341).

[10] Deng, Y., & Hsu, D. (2025, May). General-purpose clothes manipulation with semantic keypoints. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 13181-13187).

[11] Lippi, M., Poklukar, P., Welle, M. C., Varava, A., Yin, H., Marino, A., & Kragic, D. (2022). Enabling visual action planning for object manipulation through latent space roadmap. IEEE Transactions on Robotics, 39(1), 57-75.

[12] Miller, S., Van Den Berg, J., Fritz, M., Darrell, T., Goldberg, K., & Abbeel, P. (2012). A geometric approach to robotic laundry folding. The International Journal of Robotics Research, 31(2), 249-267.

[13] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., … & Zhilinsky, U. (2024). π₀: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.

[14] Yan, W., Vangipuram, A., Abbeel, P., & Pinto, L. (2021, October). Learning predictive representations for deformable objects using contrastive estimation. In Conference on Robot Learning (pp. 564-574).