Research Publications

Selected publications from our researchers.

clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP

BMVC 2022

We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or fine-tuning. This is achieved by training a diffusion model, which we call clip2latent, conditioned on CLIP embeddings to sample latent vectors of a pre-trained StyleGAN. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data when training the conditional diffusion model. We demonstrate that clip2latent allows us to generate high-resolution (1024x1024 pixels) images based on text prompts with fast sampling, high image quality, and low training compute and data requirements. We also show that using the well-studied StyleGAN architecture, without further fine-tuning, allows us to directly apply existing methods to control and modify the generated images, adding a further layer of control to our text-to-image pipeline.

NPRportrait 1.0: A three-level benchmark for non-photorealistic rendering of portraits

Computational Visual Media 2022

Recently, there has been an upsurge of activity in image-based non-photorealistic rendering (NPR), and in particular portrait image stylisation, due to the advent of neural style transfer (NST). However, the state of performance evaluation in this field is poor, especially compared to the norms in the computer vision and machine learning communities. Unfortunately, the task of evaluating image stylisation is thus far not well defined, since it involves subjective, perceptual, and aesthetic aspects. To make progress towards a solution, this paper proposes a new structured, three-level benchmark dataset for the evaluation of stylised portrait images. Rigorous criteria were used for its construction, and its consistency was validated by user studies. Moreover, a new methodology has been developed for evaluating portrait stylisation algorithms, which makes use of the different benchmark levels as well as annotations provided by user studies regarding the characteristics of the faces. Using the new benchmark dataset, we evaluate a wide variety of image stylisation methods, covering both portrait-specific and general-purpose methods as well as both traditional NPR approaches and NST.

Multiple pairwise ranking networks for personalized video summarization

ICCV 2021

In this paper, we investigate video summarization in the supervised setting. Since video summarization is subjective and depends on the preferences of the end user, a single one-size-fits-all model is of limited use. In this work, we propose a model that provides personalized video summaries by conditioning the summarization process on predefined categorical user labels, referred to as preferences. The underlying method is based on multiple pairwise rankers (called Multi-ranker), where the rankers are trained jointly to provide local summaries as well as a global summarization of a given video. To demonstrate the relevance and applications of our method in contrast with a classical global summarizer, we conduct experiments on multiple benchmark datasets, notably through a user study and comparisons with state-of-the-art methods on the global video summarization task.
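
As a rough illustration of what a pairwise ranker optimizes, the sketch below computes a generic pairwise hinge loss over video segments. It is not taken from the paper: the function name `pairwise_hinge_loss`, the `margin` parameter, and the choice of hinge loss are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_hinge_loss(scores, relevance, margin=1.0):
    """Average hinge loss over all segment pairs with different relevance.

    scores:    predicted score per video segment (hypothetical ranker output)
    relevance: ground-truth importance per segment (higher means preferred)
    """
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if relevance[i] == relevance[j]:
            continue  # no preference between equally relevant segments
        # Orient the pair so that `hi` is the preferred segment.
        hi, lo = (i, j) if relevance[i] > relevance[j] else (j, i)
        # Penalize the pair unless `hi` outscores `lo` by at least `margin`.
        losses.append(max(0.0, margin - (scores[hi] - scores[lo])))
    return sum(losses) / len(losses) if losses else 0.0


# A correctly ordered pair with a wide gap costs nothing ...
good = pairwise_hinge_loss([2.0, 0.0], [1, 0])
# ... while an inverted pair is penalized.
bad = pairwise_hinge_loss([0.0, 2.0], [1, 0])
```

Under the same assumptions, a multi-ranker setup could evaluate one such loss per preference label plus one for the global summary, and sum them for joint training.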

Adversarial Monte Carlo denoising with conditioned auxiliary feature modulation

ACM Transactions on Graphics 2019

Along with the rapid improvements in hardware and gradually increasing perceptual demands of users, Monte Carlo path tracing is becoming more popular in movie production and video games due to its generality and unbiased nature [Keller et al. 2015; Zwicker et al. 2015]. However, its high estimator variance and low convergence rate motivate researchers to investigate efficient denoising approaches at reduced sample rates with the help of inexpensive by-products (e.g., feature buffers). In the past few years, regression-based kernel filtering approaches [Bitterli et al. 2016; Moon et al. 2014] and learning-based methods [Bako et al. 2017; Chaitanya et al. 2017; Kalantari et al. 2015; Vogels et al. 2018] have achieved great success. In particular, the deep learning based methods have achieved more plausible denoising results, since they effectively leverage convolutional neural networks to break the limitation of only utilizing information from pixel sets in specific images. However, based on our practice of employing the state-of-the-art methods, we found that nearly all of them rely on handcrafted optimization objectives like MSE or MAPE loss, which do not necessarily ensure perceptually plausible results. Fig. 1 shows some typical cases where recent works [Bako et al. 2017; Chaitanya et al. 2017] have struggled to handle extremely noisy regions, such as high-frequency areas, and thus produced over-smoothed output with approximately correct colors. Our primary focus is to reconstruct visually convincing global illumination, as previous approaches do, while recovering high-frequency details as much as possible.
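
To make the point about handcrafted objectives concrete, here is a minimal sketch of MSE and MAPE on a toy 1D "image". The helper names, the `eps` stabilizer in MAPE, and the pixel values are assumptions chosen for illustration, not the losses exactly as configured in the cited works.

```python
def mse(pred, ref):
    """Mean squared error between predicted and reference pixel values."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def mape(pred, ref, eps=1e-2):
    """Mean absolute percentage error; eps (assumed) avoids division by zero."""
    return sum(abs(p - r) / (abs(r) + eps) for p, r in zip(pred, ref)) / len(ref)


# Toy high-frequency reference: alternating dark and bright pixels.
reference = [0.0, 1.0, 0.0, 1.0]
# An over-smoothed result: the average color is right, the detail is gone.
smoothed = [0.5, 0.5, 0.5, 0.5]
# A sharp result whose pattern is shifted by one pixel.
shifted = [1.0, 0.0, 1.0, 0.0]

mse_smoothed = mse(smoothed, reference)  # 0.25
mse_shifted = mse(shifted, reference)    # 1.0
```

In this toy case MSE ranks the flat, over-smoothed signal well ahead of the sharp one, which is the kind of behavior that can favor blurred output in high-frequency regions.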

HoloGAN: Unsupervised Learning of 3D Representations from Natural Images

ICCV 2019

We propose a novel generative adversarial network (GAN) for the task of unsupervised learning of 3D representations from natural images. Most generative models rely on 2D kernels to generate images and make few assumptions about the 3D world. These models therefore tend to create blurry images or artefacts in tasks that require a strong 3D understanding, such as novel-view synthesis. HoloGAN instead learns a 3D representation of the world and learns to render this representation in a realistic manner. Unlike other GANs, HoloGAN provides explicit control over the pose of generated objects through rigid-body transformations of the learnt 3D features. Our experiments show that using explicit 3D features enables HoloGAN to disentangle 3D pose and identity, which is further decomposed into shape and appearance, while still being able to generate images with similar or higher visual quality than other generative models. HoloGAN can be trained end-to-end from unlabelled 2D images only. In particular, we do not require pose labels, 3D shapes, or multiple views of the same objects. This shows that HoloGAN is the first generative model that learns 3D representations from natural images in an entirely unsupervised manner.
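
The rigid-body transformation mentioned above can be sketched, under simplifying assumptions, as a rotation of 3D coordinates about the vertical axis. The function `rotate_y` and the sample points are hypothetical and only show the geometry; a full model would presumably also resample the learnt feature volume at the rotated coordinates.

```python
import math


def rotate_y(points, azimuth_deg):
    """Rotate (x, y, z) points about the vertical (y) axis by azimuth_deg."""
    a = math.radians(azimuth_deg)
    c, s = math.cos(a), math.sin(a)
    # Standard rotation about y: x' = c*x + s*z, y' = y, z' = -s*x + c*z.
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


# Hypothetical coordinates of a few feature locations.
grid = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0)]
turned = rotate_y(grid, 90.0)

# A rigid-body transform preserves each point's distance from the origin.
norms_before = [math.sqrt(x * x + y * y + z * z) for x, y, z in grid]
norms_after = [math.sqrt(x * x + y * y + z * z) for x, y, z in turned]
```

Because the transform is applied to the 3D representation rather than to the 2D output, the pose changes while the content being rendered stays the same.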

RenderNet: A Deep Convolutional Network for Differentiable Rendering from 3D Shapes

NIPS 2018

Traditional computer graphics rendering pipelines are designed for procedurally generating 2D images from 3D shapes with high performance. The nondifferentiability due to discrete operations (such as visibility computation) makes it hard to explicitly correlate rendering parameters and the resulting image, posing a significant challenge for inverse rendering tasks. Recent work on differentiable rendering achieves differentiability either by designing surrogate gradients for non-differentiable operations or via an approximate but differentiable renderer. These methods, however, are still limited when it comes to handling occlusion, and restricted to particular rendering effects. We present RenderNet, a differentiable rendering convolutional network with a novel projection unit that can render 2D images from 3D shapes. Spatial occlusion and shading calculation are automatically encoded in the network. Our experiments show that RenderNet can successfully learn to implement different shaders, and can be used in inverse rendering tasks to estimate shape, pose, lighting and texture from a single image. 

Benchmarking Non-Photorealistic Rendering of Portraits

NPAR 2017

We present a set of images for helping NPR practitioners evaluate their image-based portrait stylisation algorithms. Using a standard set both facilitates comparisons with other methods and helps ensure that presented results are representative. We give two levels of difficulty, each consisting of 20 images selected systematically so as to provide good coverage of several possible portrait characteristics. We applied three existing portrait-specific stylisation algorithms, two general-purpose stylisation algorithms, and one general learning based stylisation algorithm to the first level of the benchmark, corresponding to the type of constrained images that have often been used in portrait-specific work. We found that the existing methods are generally effective on this new image set, demonstrating that level one of the benchmark is tractable; challenges remain at level two. Results revealed several advantages conferred by portrait-specific algorithms over general-purpose algorithms: portrait-specific algorithms can use domain-specific information to preserve key details such as eyes and to eliminate extraneous details, and they have more scope for semantically meaningful abstraction due to the underlying face model. Finally, we provide some thoughts on systematically extending the benchmark to higher levels of difficulty.

Deep Learning and Face Recognition: The State of the Art

SPIE 2015

Deep Neural Networks (DNNs) have established themselves as a dominant technique in machine learning. DNNs have been top performers on a wide variety of tasks including image classification, speech recognition, and face recognition [1–3]. Convolutional neural networks (CNNs) have been used in nearly all of the top performing methods on the Labeled Faces in the Wild (LFW) dataset [3–6]. In this talk and accompanying paper, I attempt to provide a review and summary of the deep learning techniques used in the state-of-the-art. In addition, I highlight the need for both larger and more challenging public datasets to benchmark these systems...
