For more in-depth material, you are welcome to join the first full-stack embodied intelligence learning community in China: the 具身智能之心 Knowledge Planet, which covers everything you are looking for.
We have been sharing VLA-related work in earlier posts; here we also compile embodied Vision + Action (VA) work for you, covering robot manipulation, diffusion policies (DP), whole-body control, one-shot learning, sim2real, end-to-end approaches, and more. The content comes from the 具身智能之心 Knowledge Planet!
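Since many of the entries below build on "DP" (Diffusion Policy), here is a minimal, illustrative sketch of the core idea: a noise-prediction network iteratively denoises a randomly initialized action chunk, conditioned on the current observation. The network architecture, the dimensions (`OBS_DIM`, `ACT_DIM`, `HORIZON`), and the simple DDPM-style update below are assumptions for illustration only, not the implementation of any specific paper in this list.

```python
# Minimal sketch of diffusion-policy-style action sampling (illustrative only).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HORIZON, STEPS = 32, 7, 16, 50  # hypothetical sizes

class NoisePredictor(nn.Module):
    """Predicts the noise added to a flattened action chunk, given obs and step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM * HORIZON + 1, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM * HORIZON),
        )

    def forward(self, obs, noisy_actions, t):
        x = torch.cat([obs, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x)

@torch.no_grad()
def sample_action_chunk(model, obs):
    """Reverse diffusion: start from Gaussian noise, denoise into an action chunk."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(obs.shape[0], HORIZON, ACT_DIM)
    for k in reversed(range(STEPS)):
        t = torch.full((obs.shape[0], 1), k / STEPS)
        eps = model(obs, a, t).view_as(a)
        # DDPM posterior mean; fresh noise is added back on all but the last step
        coef = betas[k] / torch.sqrt(1.0 - alpha_bars[k])
        a = (a - coef * eps) / torch.sqrt(alphas[k])
        if k > 0:
            a = a + torch.sqrt(betas[k]) * torch.randn_like(a)
    return a

obs = torch.randn(1, OBS_DIM)                      # e.g. encoded image + proprioception
actions = sample_action_chunk(NoisePredictor(), obs)
print(actions.shape)                               # torch.Size([1, 16, 7])
```

In practice, the methods listed below condition on richer visual encoders, use more elaborate noise schedules or flow matching, and typically execute only a prefix of the predicted action chunk before re-planning.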

Work from 2025
[2025] Steering Your Diffusion Policy with Latent Space Reinforcement Learning
[2025] [ByteDance Seed] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
[2025] [RSS 25] Unified Video Action Model
[2025] Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories
[2025] Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition
[2025] Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
[2025] BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
[2025] [RSS 25] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
[2025] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
[2025] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations
[2025] ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
[2025] VILP: Imitation Learning with Latent Video Planning
[2025] Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
[2025] When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning
[2025] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control
[2025] CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World
[2025] Learning to Group and Grasp Multiple Objects
[2025] Beyond Behavior Cloning: Robustness through Interactive Imitation and Contrastive Learning
[2025] COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping
[2025] DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References
[2025] S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
[2025] MTDP: Modulated Transformer Diffusion Policy Model
[2025] FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation
[2025] RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations
[2025] Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control
[2025] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum
[2025] IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation
[2025] X-IL: Exploring the Design Space of Imitation Learning Policies
[2025] Towards Fusing Point Cloud and Visual Representations for Imitation Learning
[2025] Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach
[2025] FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning
[2025] DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning
[2025] Human2Robot: Learning Robot Actions from Paired Human-Robot Videos
[2025] AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency
[2025] COMPASS: Cross-embOdiment Mobility Policy via ResiduAl RL and Skill Synthesis
[2025] Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand
[2025] From planning to policy: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation
[2025] FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real
[2025] Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation
[2025] FuseGrasp: Radar-Camera Fusion for Robotic Grasping of Transparent Objects
[2025] Sensor-Invariant Tactile Representation
[2025] Generalist World Model Pre-Training for Efficient Reinforcement Learning
[2025] ProDapt: Proprioceptive Adaptation using Long-term Memory Diffusion
[2025] Falcon: Fast Visuomotor Policies via Partial Denoising
[2025] HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models
[2025] SHADOW: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer
[2025] Phantom: Training Robots Without Robots Using Only Human Videos
[2025] General Force Sensation for Tactile Robot
[2025] Action Tokenizer Matters in In-Context Imitation Learning
[2025] AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization
[2025] FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation
[2025] Variable-Friction In-Hand Manipulation for Arbitrary Objects via Diffusion-Based Imitation Learning
[2025] Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion
[2025] RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking
[2025] ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation
[2025] SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
[2025] GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping
[2025] OPG-Policy: Occluded Push-Grasp Policy Learning with Amodal Segmentation
[2025] RA-DP: Rapid Adaptive Diffusion Policy for Training-Free High-frequency Robotics Replanning
[2025] Robotic Compliant Object Prying Using Diffusion Policy Guided by Vision and Force Observations
[2025] CoinRobot: Generalized End-to-end Robotic Learning for Physical Intelligence
[2025] Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects
[2025] How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning
[2025] One-Shot Dual-Arm Imitation Learning
[2025] GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping
[2025] Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning
[2025] ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network
[2025] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
[2025] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
[2025] RILe: Reinforced Imitation Learning
[2025] HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots
[2025] Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion
[2025] Trinity: A Modular Humanoid Robot AI System
[2025] LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures
[2025] Elastic Motion Policy: An Adaptive Dynamical System for Robust and Efficient One-Shot Imitation Learning
[2025] Learning Gentle Grasping Using Vision, Sound, and Touch
[2025] RoboCopilot: Human-in-the-loop Interactive Imitation Learning for Robot Manipulation
[2025] Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
[2025] MoE-Loco: Mixture of Experts for Multitask Locomotion
[2025] Humanoid Policy ~ Human Policy
[2025] Dense Policy: Bidirectional Autoregressive Learning of Actions
[2025] Learning to Play Piano in the Real World
[2025] CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
[2025] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
[2025] AdaWorld: Learning Adaptable World Models with Latent Actions
[2025] Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing
[2025] Empirical Analysis of Sim-and-Real Cotraining of Diffusion Policies for Planar Pushing from Pixels
[2025] ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
[2025] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
[2025] HACTS: A Human-As-Copilot Teleoperation System for Robot Learning
[2025] ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
[2025] Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models
[2025] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
[2025] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics
[2025] Slot-Level Robotic Placement via Visual Imitation from Single Human Video
[2025] Robust Dexterous Grasping of General Objects from Single-view Perception
[2025] Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
[2025] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
[2025] Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation
[2025] Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs
[2025] Few-Shot Vision-Language Action-Incremental Policy Learning
[2025] Latent Diffusion Planning for Imitation Learning
[2025] Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models
[2025] PRISM-DP: Spatial Pose-based Observations for Diffusion-Policies via Segmentation, Mesh Generation, and Pose Tracking
[2025] Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation
[2025] Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
[2025] Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings
[2025] KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation
[2025] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations
[2025] H3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
[2025] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
[2025] Learning Long-Context Diffusion Policies via Past-Token Prediction
[2025] DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
[2025] [ICLR 25] Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[2025] IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning
[2025] NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning
[2025] EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
[2025] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
[2025] Conditioning Matters: Training Diffusion Policies is Faster Than You Think
[2025] H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos
[2025] GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation
[2025] Zero-Shot Visual Generalization in Robot Manipulation
[2025] Object-Centric Representations Improve Policy Generalization in Robot Manipulation
[2025] LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
[2025] GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
[2025] A Practical Guide for Incorporating Symmetry in Diffusion Policy
[2025] Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation
[2025] EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation
[2025] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy
[2025] Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt
[2025] [AAAI 25] FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy via Consistency Flow Matching for Robot Manipulation
[2025] Object-centric 3D Motion Field for Robot Learning from Human Videos
[2025] Evaluating Robot Policies in a World Model
[2025] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
[2025] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game
[2025] SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
[2025] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
[2025] Touch begins where vision ends: Generalizable policies for contact-rich manipulation
[2025] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
[2025] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation
[2025] Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation
[2025] Latent Action Diffusion for Cross-Embodiment Manipulation
[2025] Vision in Action: Learning Active Perception from Human Demonstrations
[2025] [IROS 25] Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation
[2025] [RSS 25] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation
[2025] DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
[2025] World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation
[2025] ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
[2025] [ICCV 25] Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Work from 2024
[2024] Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching
[2024] Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning
[2024] [RSS 25] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
[2024] Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning
[2024] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation
[2024] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
[2024] [ICLR 25] Diffusion Policy Policy Optimization
[2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation
[2024] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning
[2024] Equivariant Diffusion Policy
[2024] [IROS 25] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models
[2024] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
[2024] Motion Before Action: Diffusing Object Motion as Manipulation Condition
[2024] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation
[2024] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation
[2024] SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation
[2024] Few-Shot Task Learning through Inverse Generative Modeling
[2024] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
[2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
[2024] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies
[2024] Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies
[2024] Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
[2024] Data Scaling Laws in Imitation Learning for Robotic Manipulation
[2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
[2024] Learning Universal Policies via Text-Guided Video Generation
[2024] Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
[2024] Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
[2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy
[2024] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
[2024] Prediction with Action: Visual Policy Learning via Joint Denoising Process
[2024] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
[2024] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
[2024] Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models
[2024] CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
[2024] In-Context Imitation Learning via Next-Token Prediction
[2024] Learning Diffusion Policies from Demonstrations for Compliant Contact-rich Manipulation
Work from 2023
[2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
[2023] Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods