Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times \( t_{0} \) and \( t_{1} \), LIM produces a deformed shape at any continuous time \( t \in [t_{0},t_{1}] \), delivering high-quality interpolations in seconds per frame. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM [3]) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
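As an illustration, the interface this implies can be sketched as follows; the function and argument names are ours and are not part of any released model, and the triplane layout is an assumption.

```python
import torch

def interpolate_triplane(lim_model: torch.nn.Module,
                         triplane_t0: torch.Tensor,  # (3, C, H, W) implicit representation at t0 (assumed layout)
                         triplane_t1: torch.Tensor,  # (3, C, H, W) implicit representation at t1
                         t: float) -> torch.Tensor:
    """Feed-forward interpolation of an implicit 3D representation at a continuous time t in [t0, t1]."""
    # Hypothetical call signature: the model conditions on both keyframe triplanes and the query time.
    tau = torch.tensor([t], dtype=triplane_t0.dtype, device=triplane_t0.device)
    with torch.no_grad():
        return lim_model(triplane_t0, triplane_t1, tau)  # triplane renderable at time t
```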
Our first model, LIM, directly interpolates the color and opacity fields in 3D space in the multi-view setting. We observe that: (i) linear interpolation in triplane space fails on dynamic parts; (ii) image-based interpolation (FiLM [3]) in the multi-view setting leads to defective reconstructions, with ghosting around dynamic parts; (iii) LIM yields the most plausible results, without artifacts.
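For reference, the linear triplane baseline mentioned in (i) amounts to the following sketch (not the LIM model itself); it blends features in place rather than transporting them with the moving surface, which is why dynamic parts degrade.

```python
import torch

def linear_triplane_baseline(triplane_t0: torch.Tensor,
                             triplane_t1: torch.Tensor,
                             t: float) -> torch.Tensor:
    """Naive baseline: per-feature lerp of the two keyframe triplanes at normalized time t in [0, 1]."""
    return (1.0 - t) * triplane_t0 + t * triplane_t1
```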
An extension of our model, LIM, can track the deforming shape through time. Given a multi-view RGB sequence, we first extract the triplane of the first frame with LRM. We render this triplane to obtain depth maps, which we unproject into multi-view renders of XYZ coordinates on the first (canonical) frame, as sketched below. Finally, from the XYZ renders of the first frame and the RGB inputs of all frames, LIM propagates the canonical XYZ coordinates from the first timestep to the remaining timesteps.
We use these surface annotations to reconstruct a time-deforming mesh with fixed topology and texture.
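A minimal sketch of the unprojection step above, assuming a standard pinhole model with z-depth renders; the intrinsics/pose conventions (K, cam2world) are our assumptions, not fixed by the paper.

```python
import numpy as np

def unproject_depth_to_xyz(depth: np.ndarray,      # (H, W) depth render of the first-frame triplane
                           K: np.ndarray,          # (3, 3) pinhole intrinsics
                           cam2world: np.ndarray   # (4, 4) camera-to-world pose
                           ) -> np.ndarray:
    """Turn a depth render of the canonical (first) frame into an XYZ image: each pixel
    stores the world-space coordinate of the canonical surface point it sees."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3) homogeneous pixels
    rays_cam = pix @ np.linalg.inv(K).T                                  # back-project through the intrinsics
    pts_cam = rays_cam * depth[..., None]                                # scale by z-depth to camera-space points
    pts_world = pts_cam @ cam2world[:3, :3].T + cam2world[:3, 3]         # move into world space
    return pts_world                                                     # (H, W, 3) canonical XYZ render
```

Repeating this for every input view yields the multi-view XYZ renders that LIM propagates to later timesteps, which in turn provide the surface annotations used to deform a single fixed-topology, fixed-texture mesh.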
Our method, combined with a monocular-to-multiview video diffusion model, extends to the monocular reconstruction setting. We observe that results from TripoSR [2] (an image-to-3D reconstructor) jitter, since the frames are reconstructed independently. Consistent4D [1] renders are temporally consistent, but the method is slow. Our method is consistent in time, has a single topology, and is significantly faster.
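At a high level, the monocular pipeline can be summarized by the following sketch; the component names and call signatures are placeholders standing in for the multiview video diffusion model, the LRM-style reconstructor, and LIM tracking, not a concrete API.

```python
from typing import Callable, Sequence

def monocular_to_4d(frames: Sequence,                     # monocular RGB video frames
                    lift_to_multiview: Callable,          # monocular-to-multiview video diffusion (placeholder)
                    reconstruct_canonical: Callable,      # LRM-style reconstructor on the first frame (placeholder)
                    propagate_with_lim: Callable):        # LIM tracking of canonical coordinates (placeholder)
    """Monocular video -> per-frame multi-view images -> canonical reconstruction -> tracked 4D sequence."""
    multiview_video = lift_to_multiview(frames)           # per-frame multi-view images
    canonical = reconstruct_canonical(multiview_video[0]) # triplane / mesh on the first (canonical) frame
    return propagate_with_lim(canonical, multiview_video) # single-topology, time-consistent mesh sequence
```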