🎬 ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Meta Reality Labs, SpAItial, University College London

🎬 ActionMesh is a feed-forward video-to-4D model that generates high-quality animated meshes.

Abstract

Generating animated 3D objects is at the heart of many applications, yet the most advanced methods are often difficult to apply in practice because of their restrictive setups, long runtimes, or limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs such as a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology-consistent, enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model delivers animated 3D meshes with unprecedented speed and quality.
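The abstract describes a two-stage pipeline. Below is a minimal sketch of how the two stages could be chained for the video-to-4D case, assuming a hypothetical interface: `temporal_diffusion`, `temporal_autoencoder`, the tensor shapes, and the function itself are illustrative placeholders rather than the released ActionMesh code.

```python
# Minimal sketch of the two-stage "temporal 3D diffusion" pipeline described
# in the abstract. All names and shapes below are assumptions for illustration,
# not the actual ActionMesh implementation.
import torch


def video_to_4d(frames: torch.Tensor, temporal_diffusion, temporal_autoencoder):
    """frames: (T, H, W, 3) monocular video used as conditioning."""
    # Stage 1: a 3D latent diffusion model extended with a temporal axis
    # samples one latent per frame. The latents are synchronized across time,
    # but each latent still encodes an independent 3D shape.
    latents = temporal_diffusion.sample(condition=frames)        # (T, D)

    # Stage 2: the temporal 3D autoencoder decodes the latent sequence into a
    # reference mesh plus per-frame vertex offsets, i.e. deformations of a
    # single shape, which keeps the topology fixed over the whole animation.
    ref_vertices, faces, offsets = temporal_autoencoder.decode(latents)
    # ref_vertices: (V, 3), faces: (F, 3), offsets: (T, V, 3)

    animated_vertices = ref_vertices.unsqueeze(0) + offsets      # (T, V, 3)
    return animated_vertices, faces
```

Because the animation is expressed as per-frame offsets on a single fixed-topology reference mesh, downstream steps such as texturing and retargeting can operate on one set of vertices and faces shared across all frames.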

Its flexible design enables several applications:

📹 Video-to-4D: from a monocular video
📦+✏️ {3D+text}-to-4D: e.g., a 3D mesh + "casually walking"
🖼️+✏️ {Image+text}-to-4D: e.g., a mushroom image + "singing opera"
✏️ Text-to-4D: e.g., "An octopus playing maracas"
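These four modes differ only in their conditioning. A hypothetical usage sketch is shown below; the `ActionMesh` class and its methods are illustrative assumptions, not the released interface.

```python
# Illustrative only: the class name, loader, and method signatures are
# assumptions, not the actual ActionMesh API.
model = ActionMesh.load_pretrained("actionmesh")                        # hypothetical loader

anim = model.from_video("input_video.mp4")                              # video-to-4D
anim = model.from_mesh("mushroom.glb", prompt="casually walking")       # {3D+text}-to-4D
anim = model.from_image("mushroom.png", prompt="singing opera")         # {image+text}-to-4D
anim = model.from_text("An octopus playing maracas")                    # text-to-4D

anim.export("result.glb")  # animated mesh with fixed topology, no rig required
```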


📹 Application 1: video-to-4D

A. Results on DAVIS

(Input videos shown side by side with our generated animated meshes.)

B. Comparison to SOTA

(Qualitative comparison: input, LIM, DM4D, V2M4, SG4D, and ours.)

C. Longer animation generation (61 frames)

(Input videos shown with our corresponding 61-frame animated meshes.)

📦+✏️ Application 2: {3D+text}-to-4D

(Input meshes animated with the prompts "casually walking", "happily nodding", "doing martial-art", "spreading its tentacles", "swimming", "flapping its wings", "skiing", and "playing jazz".)

🖼️+✏️ Application 3: {image+text}-to-4D

(Input images animated with the prompts "dancing", "lying down", "showing her weapon", "side-jumping", "vibing", "singing opera", "angry", "rotating arms", "greeting", "resting", "curious", and "surprised".)

✏️ Application 4: text-to-4D

(Generated results for the prompts "An octopus playing maracas", "A baby dragon drinking boba", "A bear dancing ballet", "A corgi taking a selfie", "A crocodile playing a drum set", "A sheepdog running", "A beaver crying", "A cute T-Rex flying", "A kangaroo boxing", "A pirate attacking", "A cute spring bouncing", and "A squirrel lifting a dumbbell".)

🔀 Application 5: Motion transfer

(Input motion transferred to two different outputs.)

Limitations

We assume fixed connectivity, so changes in topology cannot be modeled. Although our model is able to hallucinate parts that are not visible, it sometimes fails to reconstruct occluded regions, in particular when they are missing from the reference frame or when they disappear during a complex motion.

(Failure cases: topological changes, occlusion in the reference frame, occlusion during motion.)