LIM: Large Interpolator Model for Dynamic Reconstruction

CVPR 2025

University College London, Meta

Abstract

Reconstructing dynamic assets from video data is central to many tasks in computer vision and graphics. Existing 4D reconstruction approaches are limited by category-specific models or slow, optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times \( t_{0} \) and \( t_{1} \), LIM produces a deformed shape at any continuous time \( t \in [t_{0},t_{1}] \), delivering high-quality interpolations in seconds per frame. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM [3]) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.


I - Frame Interpolation

Our first model, LIM, directly interpolates the color and opacity fields in 3D space in the multi-view setting. We observe that: (i) linear interpolation in triplane space fails on dynamic parts; (ii) image-based interpolation (FiLM [3]) in the multi-view setting leads to defective reconstructions, with ghosting around dynamic parts; (iii) LIM yields the most plausible results, without artifacts.
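To make the baseline in (i) concrete, the sketch below shows the naive per-feature blend of two triplanes; the commented lim_model call only illustrates LIM's feed-forward interface and is not a released API.

```python
import torch

def lerp_triplanes(tri_t0: torch.Tensor, tri_t1: torch.Tensor, t: float) -> torch.Tensor:
    """Naive baseline: per-feature linear blend of two triplane grids.
    tri_t0, tri_t1: (3, C, H, W) triplane features at times t0 and t1;
    t in [0, 1] is the normalized query time."""
    return (1.0 - t) * tri_t0 + t * tri_t1

# Hypothetical LIM interface (illustrative only): the transformer consumes both
# endpoint triplanes plus the query time and predicts the deformed triplane
# directly, rather than blending features.
# tri_t = lim_model(tri_t0, tri_t1, t=0.5)
```

Blending features in this way tends to average the two poses rather than transport the surface, which is consistent with the failures we observe on dynamic parts.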

Figure 2: Frame Interpolation. We add two interpolated frames between each pair of frames rendered from multi-view LRM. Oracle is the upper bound, where we perform LRM reconstruction with multi-view images on all frames (including the ones interpolated by the three methods). For each method, we render the video with the LRM renders and the interpolated frames in between. We highlight a single interpolation step at the end.

II - 4D Reconstruction

An extension of our model, LIM can track the deformable shape through time. Given a multi-view sequence with RGB inputs, we start by extracting the triplane on the first frame with LRM. We render the triplane to obtain a depth map that we unproject to get multi-view renders of XYZ coordinates on the first (canonical) frame. Finally, from the XYZ renders on the first frame and the RGB inputs on all frames, LIM propagates the canonical XYZ coordinates from the first timestep to the remaining ones.
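The unprojection step can be sketched as follows, assuming a pinhole camera and z-depth; this is only an illustration of how the canonical XYZ renders are obtained, not the exact implementation.

```python
import torch

def unproject_depth_to_xyz(depth: torch.Tensor, K: torch.Tensor,
                           cam_to_world: torch.Tensor) -> torch.Tensor:
    """Back-project a rendered depth map into world-space XYZ coordinates.
    depth: (H, W) z-depth rendered from the canonical-frame triplane;
    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) camera pose."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # homogeneous pixel coords (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T                      # camera-space directions
    cam_pts = rays * depth.unsqueeze(-1)                    # scale by depth -> camera-space points
    cam_h = torch.cat([cam_pts, torch.ones_like(depth)[..., None]], dim=-1)
    world = cam_h @ cam_to_world.T                          # to world space (H, W, 4)
    return world[..., :3]                                   # XYZ render of the canonical frame
```

Repeating this for every input view yields the multi-view XYZ renders that, together with the per-frame RGB inputs, condition LIM.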

Figure 3: XYZ coordinate tracking. LIM propagates the XYZ coordinates from the first (canonical) frame to the subsequent ones.

We use these surface annotations to reconstruct a time-deforming mesh with fixed topology and texture.

Figure 4: Mesh deformation. We reconstruct a time-deforming mesh with fixed topology in the multi-view setting.
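One plausible way to turn the tracked canonical coordinates into the fixed-topology mesh of Figure 4 is a nearest-neighbour lookup in canonical space; the sketch below is illustrative and not necessarily the exact procedure used in the paper.

```python
import torch

def deform_canonical_mesh(verts_canon: torch.Tensor,
                          xyz_canon_at_t: torch.Tensor,
                          xyz_world_at_t: torch.Tensor) -> torch.Tensor:
    """Move canonical mesh vertices to their positions at time t.
    verts_canon:    (V, 3) vertices of the fixed-topology canonical mesh
    xyz_canon_at_t: (N, 3) canonical coordinates predicted at surface points
                    visible at time t (flattened multi-view renders)
    xyz_world_at_t: (N, 3) world positions of those same surface points at time t
    Returns (V, 3) deformed vertices; topology, UVs, and texture stay fixed."""
    d = torch.cdist(verts_canon, xyz_canon_at_t)   # (V, N) brute-force distances in canonical space
    idx = d.argmin(dim=1)                          # nearest tracked surface point per vertex
    return xyz_world_at_t[idx]                     # vertex position at time t
```

Because only the vertex positions change, the UV layout and texture of the canonical mesh carry over unchanged to every timestep.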

Application: Monocular Reconstruction

Our method, combined with a monocular-to-multiview video diffusion model, extends to the monocular reconstruction setting. We observe that results from TripoSR [2] (an image-to-3D reconstructor) jitter, since the frames are reconstructed independently. Consistent4D [1] renders are temporally consistent, but the method is slow. Our method is consistent in time, has a single topology, and is significantly faster.
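For reference, a minimal sketch of how these pieces could be wired together is given below; every callable (to_multiview, lrm, lim_track, extract_mesh) is a placeholder for the corresponding component rather than a released API, and the last step reuses the deform_canonical_mesh sketch from above.

```python
from typing import Callable, List, Sequence, Tuple
import torch

def reconstruct_4d_from_monocular(
    frames: Sequence[torch.Tensor],                        # monocular RGB frames
    to_multiview: Callable[[torch.Tensor], torch.Tensor],  # video-diffusion multi-view generator
    lrm: Callable[[torch.Tensor], torch.Tensor],           # multi-view images -> canonical triplane
    lim_track: Callable[..., List[Tuple[torch.Tensor, torch.Tensor]]],  # per-frame (canon XYZ, world XYZ)
    extract_mesh: Callable[[torch.Tensor], Tuple[torch.Tensor, torch.Tensor]],  # e.g. marching cubes
) -> List[torch.Tensor]:
    """Placeholder wiring of the monocular 4D pipeline; all callables are assumptions."""
    multiview = [to_multiview(f) for f in frames]          # 1. lift each monocular frame to multi-view
    canon_triplane = lrm(multiview[0])                     # 2. canonical reconstruction on the first frame
    verts, faces = extract_mesh(canon_triplane)            # 3. fixed-topology canonical mesh
    tracked = lim_track(canon_triplane, multiview)         # 4. canonical XYZ propagated to every frame
    return [deform_canonical_mesh(verts, xyz_canon_t, xyz_world_t)  # 5. see sketch above
            for xyz_canon_t, xyz_world_t in tracked]
```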

Figure 5: Monocular 4D Reconstruction. Our method is the only one to output a time-deforming mesh with fixed topology and texture.