We present a method to build animatable dog avatars from monocular videos.
This is challenging, as animals display a range of unpredictable non-rigid movements and a variety of appearance details (e.g., fur, spots, tails). We develop an approach that links the video frames via a 4D solution that jointly solves for the animal's pose variation and for its appearance in a canonical pose.
To this end, we significantly improve the quality of template-based shape fitting by endowing the SMAL parametric model with Continuous Surface Embeddings (CSE), which bring image-to-mesh reprojection constraints that are denser, and thus stronger, than the previously used sparse semantic keypoint correspondences. To model appearance, we propose an implicit duplex-mesh texture that is defined in the canonical pose, but can be deformed using SMAL pose coefficients and then rendered to enforce photometric compatibility with the input video frames. On the challenging CoP3D dataset, we demonstrate results superior to existing template-free (RAC) and template-based (BARC, BITE) approaches, both in terms of pose estimates and predicted appearance.
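To make the dense constraint concrete, here is a minimal PyTorch sketch of such a reprojection term, assuming per-pixel CSE embeddings and per-vertex embeddings are available; all names (`pixel_uv`, `vert_emb`, `cam`, ...) and the nearest-neighbour matching are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dense_cse_reprojection_loss(pixel_uv, pixel_emb, vert_emb, verts_3d, cam):
    """Dense image-to-mesh reprojection term (illustrative sketch).

    pixel_uv : (P, 2) image coordinates of foreground pixels
    pixel_emb: (P, D) CSE embedding predicted for each pixel
    vert_emb : (V, D) CSE embedding attached to each SMAL vertex
    verts_3d : (V, 3) posed mesh vertices
    cam      : callable projecting (V, 3) points to (V, 2) pixel coordinates
    """
    # Match every foreground pixel to its nearest mesh vertex in embedding space.
    sim = F.normalize(pixel_emb, dim=1) @ F.normalize(vert_emb, dim=1).T  # (P, V)
    nearest = sim.argmax(dim=1)                                           # (P,)
    # Reproject the matched vertices and penalize the 2D residual.
    proj = cam(verts_3d)[nearest]                                         # (P, 2)
    return ((proj - pixel_uv) ** 2).sum(dim=1).mean()
```

Unlike a sparse keypoint loss, this term yields one residual per foreground pixel, which is what makes the fit robust to viewpoint.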
Traditional solutions rely mainly on sparse keypoints. However, the training of these keypoint predictors is biased toward front-facing views. As a result, few keypoints are observed in frames where the dog is filmed from behind. In contrast, the CSE embeddings (third column from left) on which our method relies provide a dense signal regardless of the viewpoint.
Our solution leverages (left to right): image, mask, dense correspondences, and sparse keypoints.
We compare our pose reconstruction on a set of videos from the CoP3D dataset. First, we compare to BARC and BITE; like ours, they are template-based models built on SMAL, but they take only a single frame as input. Additionally, we evaluate against RAC, a template-free reconstructor that predicts shape (and texture) from input videos. We observe that:
Pose prediction (left to right): ours, BARC, BITE, RAC.
In this project we introduced the canonical duplex-mesh renderer, a new deformable implicit shape model (a toy sketch of the idea follows the figure below). We evaluate the quality of texture reconstruction on the same videos used for the shape evaluation. We compare to RAC, since BARC and BITE only provide shape reconstructions without texture. We observe that:
Texture reconstruction (left to right): ours, RAC.
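To give an intuition of how such a canonical texture can be queried, here is a toy PyTorch sketch: samples taken along camera rays in posed space are warped back to the canonical pose and fed to a small MLP that returns color and density for volume rendering. The nearest-vertex anchoring used for the warp is a hypothetical stand-in for the actual duplex-mesh machinery, and every name is illustrative.

```python
import torch
import torch.nn as nn

class CanonicalTextureField(nn.Module):
    """Toy implicit texture defined in canonical space (illustrative only)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, x_canonical):
        out = self.mlp(x_canonical)
        rgb = torch.sigmoid(out[..., :3])    # color in [0, 1]
        density = torch.relu(out[..., 3:])   # non-negative opacity
        return rgb, density

def warp_to_canonical(x_posed, verts_posed, verts_canonical):
    """Map posed-space ray samples to canonical space (hypothetical stand-in
    for the duplex-mesh warp): anchor each sample to its nearest posed vertex
    and carry the local offset over to the canonical mesh."""
    idx = torch.cdist(x_posed, verts_posed).argmin(dim=1)  # nearest vertex
    offset = x_posed - verts_posed[idx]                    # local surface offset
    return verts_canonical[idx] + offset
```

Volume-rendering the warped samples and comparing against the input frames then gives the photometric loss mentioned above.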
Once an avatar model (composed of a shape, a time-dependent pose, and a texture) is extracted from a scene, it can be animated with another pose sequence, as sketched after the figure below. Here, we animate a set of K=6 avatars, extracted from various scenes, with the same source dynamic motion.
Avatar reanimation (left to right): original scene, avatar reconstructed from that scene, K=6 avatars driven by the motion from that scene.
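Conceptually, reanimation keeps each avatar's shape and texture fixed and only replays the pose coefficients recovered from the driving scene. A minimal sketch, assuming a SMAL-style forward function `smal_forward(betas, theta)` that maps shape and pose coefficients to posed vertices (this helper, and all names, are hypothetical):

```python
def reanimate(avatars, source_motion, smal_forward, renderer):
    """Drive several reconstructed avatars with one source pose sequence.

    avatars      : list of dicts, each holding a dog's 'betas' and 'texture'
    source_motion: (T, P) array of SMAL pose coefficients from the driving scene
    """
    frames = []
    for theta in source_motion:  # one set of pose coefficients per time step
        # Pose every avatar's canonical shape with the shared driving pose.
        posed = [smal_forward(avatar["betas"], theta) for avatar in avatars]
        frames.append(renderer(posed, [avatar["texture"] for avatar in avatars]))
    return frames
```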