Abstract
Reconstructing a 3D face model from a single image is an inherently ill-posed problem, and it becomes even more challenging in the presence of occlusions. Beyond reducing the number of available observations, occlusions introduce an extra source of ambiguity in which multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper, we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models that generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This captures the multi-modal nature of the problem, producing a distribution of solutions as output. While this addresses the ambiguity, a new challenge arises: picking the best-matching shape to ensure consistency across diverse expressions. To this end, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network by predicted shape accuracy scores and selects the best match. We evaluate our method on standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, together with the ability to generate multiple expressions for a given image.
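To make the sample-then-rank idea concrete, here is a minimal PyTorch-style sketch of the pipeline described above. The `shape_diffusion`, `expression_diffusion`, and `ranker` objects and their interfaces are hypothetical placeholders for the paper's trained networks, not OFER's actual API.

```python
import torch

# Hypothetical sketch of the sample-then-rank pipeline. The networks and
# their interfaces below are assumptions for illustration, not OFER's code.

def reconstruct(image_features: torch.Tensor,
                shape_diffusion, expression_diffusion, ranker,
                num_hypotheses: int = 16):
    """Generate diverse 3D face hypotheses and select the best shape."""
    # 1. Sample multiple shape-coefficient hypotheses from the shape
    #    diffusion model, conditioned on the input image.
    shape_samples = torch.stack([
        shape_diffusion.sample(condition=image_features)
        for _ in range(num_hypotheses)
    ])  # (num_hypotheses, n_shape_coeffs)

    # 2. Score each hypothesis with the ranking network and keep the
    #    top-ranked one, so a single identity shape is shared across
    #    all generated expressions.
    scores = ranker(shape_samples, image_features)  # (num_hypotheses,)
    best_shape = shape_samples[scores.argmax()]

    # 3. Sample diverse expression coefficients; each pairs with the
    #    selected shape to yield one plausible reconstruction.
    expressions = [expression_diffusion.sample(condition=image_features)
                   for _ in range(num_hypotheses)]
    return best_shape, expressions
```

The shape and expression coefficients would then be decoded by the face parametric model into 3D meshes, one per sampled expression.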
Download paper here