DAVIDE: Depth-Aware Video Deblurring

German F. Torres¹, Jussi Kalliola¹, Soumya Tripathy², Erman Acar², and Joni Kämäräinen¹
¹Tampere University  ²Huawei Technologies
ECCV Workshop AIM 2024

Abstract

Video deblurring aims at recovering sharp details from a sequence of blurry frames. Despite the proliferation of depth sensors in mobile phones and the potential of depth information to guide deblurring, depth-aware deblurring has received only limited attention. In this work, we introduce the 'Depth-Aware VIdeo DEblurring' (DAVIDE) dataset to study the impact of depth information on video deblurring. The dataset comprises synchronized blurred, sharp, and depth videos. We investigate how depth information should be injected into existing deep RGB video deblurring models, and propose a strong baseline for depth-aware video deblurring. Our findings reveal the significance of depth information in video deblurring and provide insights into the use cases where depth cues are beneficial. In addition, our results demonstrate that while depth improves deblurring performance, this effect diminishes when models are provided with a longer temporal context.


Dataset

The DAVIDE dataset consists of synchronized blurred, depth, and sharp videos. The dataset comprises 90 video sequences divided into 69 for training, 7 for validation, and 14 for testing. The test set includes annotations for seven content attributes grouped into three categories: 1) environment (indoor/outdoor), 2) motion (camera motion only / camera and object motion), and 3) scene proximity (close/mid/far). These annotations are intended to support further analysis of the scenarios where depth information is most beneficial. Here, we provide a preview of some test samples.
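To illustrate how the synchronized blurred/sharp/depth triplets could be consumed, below is a minimal PyTorch-style loading sketch. The directory layout ('blur', 'gt', and 'depth' subfolders per sequence), file names, and file formats (PNG frames and NumPy depth maps) are assumptions made for illustration only; adapt the paths to the released dataset structure.

# Minimal sketch of loading a sequence of synchronized blurred / sharp / depth
# frames from DAVIDE. NOTE: the folder layout and file formats below are
# assumptions for illustration, not the dataset's documented structure.
from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class DavideSequenceSketch(Dataset):
    def __init__(self, seq_dir: str):
        seq = Path(seq_dir)
        # Assumed layout: <seq_dir>/blur/*.png, <seq_dir>/gt/*.png, <seq_dir>/depth/*.npy
        self.blur_paths = sorted((seq / "blur").glob("*.png"))
        self.sharp_paths = sorted((seq / "gt").glob("*.png"))
        self.depth_paths = sorted((seq / "depth").glob("*.npy"))
        assert len(self.blur_paths) == len(self.sharp_paths) == len(self.depth_paths)

    def __len__(self) -> int:
        return len(self.blur_paths)

    def __getitem__(self, idx: int):
        # Blurred input and sharp target as float images in [0, 1].
        blur = np.asarray(Image.open(self.blur_paths[idx]), dtype=np.float32) / 255.0
        sharp = np.asarray(Image.open(self.sharp_paths[idx]), dtype=np.float32) / 255.0
        # Depth map aligned with the blurred frame.
        depth = np.load(self.depth_paths[idx]).astype(np.float32)
        return blur, sharp, depth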

Depth injection method

We propose a depth injection method to incorporate depth information into Shift-Net, an existing RGB-only video deblurring model. The depth information is injected at different stages of the network architecture. The method comprises the Grouped Spatial Shift (GSS) module and our Depth-aware Transformer (DaT) block. The GSS module expands the receptive field of the depth features $F_D$ through spatial shifts, while the DaT block produces features $\tilde{F}_I$ that capture depth cues by aggregating the shifted depth features $F'_D$ into the RGB features $F_I$. The DaT block consists of: 1) a cross-attention module ('X-Atten' block in the figure) that attends over depth-based key-value pairs with the RGB query $Q$, 2) a Spatial Feature Transform (SFT) layer that modulates the RGB features $F_I$ with the adapted depth features $z$, and 3) a gated feed-forward network ('GDFN' block in the figure) that performs feature aggregation over the modulated features.
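To make the fusion flow concrete, the following is a minimal PyTorch sketch of a depth-aware fusion block in the spirit of the DaT block described above: cross-attention with an RGB query over depth-based keys and values, SFT-style modulation of the RGB features, and a gated feed-forward aggregation. All layer choices, tensor shapes, and hyperparameters are illustrative assumptions, and the class name DaTBlockSketch is hypothetical; this is not the authors' implementation.

# Sketch of a DaT-style fusion block: cross-attention (RGB query, depth key/value),
# SFT modulation, and gated feed-forward aggregation. Shapes and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DaTBlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, ffn_expansion: int = 2):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

        # 1) Cross-attention: RGB features provide the query Q,
        #    (shifted) depth features provide the keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # 2) SFT-style layer: predict per-token scale and shift from the
        #    adapted depth features z and modulate the RGB features F_I.
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

        # 3) Gated feed-forward network (GDFN-style): one branch gates the other.
        hidden = dim * ffn_expansion
        self.ffn_in = nn.Linear(dim, hidden * 2)
        self.ffn_out = nn.Linear(hidden, dim)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_depth: (B, N, C) token sequences, e.g. flattened H*W feature maps.
        q = self.norm_rgb(f_rgb)
        kv = self.norm_depth(f_depth)

        # Adapted depth features z: depth keys/values attended by the RGB query.
        z, _ = self.cross_attn(query=q, key=kv, value=kv)

        # SFT-style modulation of the RGB features with z.
        modulated = f_rgb * (1.0 + self.to_scale(z)) + self.to_shift(z)

        # Gated feed-forward aggregation over the modulated features.
        x1, x2 = self.ffn_in(modulated).chunk(2, dim=-1)
        out = self.ffn_out(F.gelu(x1) * x2)
        return f_rgb + out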

Video results

We conducted a comprehensive evaluation of the role of depth in video deblurring. Our findings indicate that video deblurring methods can compensate for the lack of depth information by accessing longer context windows. For single-image deblurring, scenes that are indoor, in close proximity, or contain only camera motion clearly benefit from depth. Here, we show some video results from our extensive evaluation.

Citation

@article{torres2024davide,
  title={DAVIDE: Depth-Aware Video Deblurring},
  author={Torres, German F and Kalliola, Jussi and Tripathy, Soumya and Acar, Erman and K{\"a}m{\"a}r{\"a}inen, Joni-Kristian},
  journal={arXiv preprint arXiv:2409.01274},
  year={2024}
}