DAVIDE: Depth-Aware Video Deblurring

German F. Torres¹, Jussi Kalliola¹, Soumya Tripathy², Erman Acar², and Joni Kämäräinen¹
¹Tampere University  ²Huawei Technologies
ECCV Workshop AIM 2024

Abstract

Video deblurring aims at recovering sharp details from a sequence of blurry frames. Despite the proliferation of depth sensors in mobile phones and the potential of depth information to guide deblurring, depth-aware deblurring has received only limited attention. In this work, we introduce the 'Depth-Aware VIdeo DEblurring' (DAVIDE) dataset to study the impact of depth information on video deblurring. The dataset comprises synchronized blurred, sharp, and depth videos. We investigate how depth information should be injected into existing deep RGB video deblurring models, and propose a strong baseline for depth-aware video deblurring. Our findings reveal the significance of depth information in video deblurring and provide insights into the use cases where depth cues are beneficial. In addition, our results demonstrate that while depth improves deblurring performance, this effect diminishes when models are provided with a longer temporal context.


Dataset

The DAVIDE dataset consists of synchronized blurred, depth, and sharp videos. The dataset comprises 90 video sequences divided into 69 for training, 7 for validation, and 14 for testing. The test set includes annotations for seven content attributes grouped into three categories: 1) environment (indoor/outdoor), 2) motion (camera motion only / camera and object motion), and 3) scene proximity (close/mid/far). These annotations are intended to support further analysis of the scenarios where depth information is most beneficial. Here, we provide a preview of some test samples.
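To illustrate how the synchronized blurred/sharp/depth triplets could be consumed, below is a minimal PyTorch-style loading sketch. The directory layout ('blur', 'gt', and 'depth' subfolders per sequence), file names, and file formats (PNG frames and NumPy depth maps) are assumptions made for illustration only; adapt the paths to the released dataset structure.

# Minimal sketch of loading a sequence of synchronized blurred / sharp / depth
# frames from DAVIDE. NOTE: the folder layout and file formats below are
# assumptions for illustration, not the dataset's documented structure.
from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class DavideSequenceSketch(Dataset):
    def __init__(self, seq_dir: str):
        seq = Path(seq_dir)
        # Assumed layout: <seq_dir>/blur/*.png, <seq_dir>/gt/*.png, <seq_dir>/depth/*.npy
        self.blur_paths = sorted((seq / "blur").glob("*.png"))
        self.sharp_paths = sorted((seq / "gt").glob("*.png"))
        self.depth_paths = sorted((seq / "depth").glob("*.npy"))
        assert len(self.blur_paths) == len(self.sharp_paths) == len(self.depth_paths)

    def __len__(self) -> int:
        return len(self.blur_paths)

    def __getitem__(self, idx: int):
        # Blurred input and sharp target as float images in [0, 1].
        blur = np.asarray(Image.open(self.blur_paths[idx]), dtype=np.float32) / 255.0
        sharp = np.asarray(Image.open(self.sharp_paths[idx]), dtype=np.float32) / 255.0
        # Depth map aligned with the blurred frame.
        depth = np.load(self.depth_paths[idx]).astype(np.float32)
        return blur, sharp, depth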

Depth injection method

We propose a depth injection method to incorporate depth information into Shift-Net, an existing RGB-only video deblurring model. The depth information is injected at different stages of the network architecture. The method comprises the Grouped Spatial Shift (GSS) module and our Depth-aware Transformer (DaT) block. The GSS module expands the receptive field of the depth features $F_D$ through spatial shifts, while the DaT block produces features $\tilde{F}_I$ that capture depth cues by aggregating the shifted depth features $F'_D$ into the RGB features $F_I$. The DaT block consists of: 1) a cross-attention module ('X-Atten' block in the figure) that attends over depth-based key-value pairs with the RGB query $Q$, 2) a Spatial Feature Transform (SFT) layer that modulates the RGB features $F_I$ with the adapted depth features $z$, and 3) a gated feed-forward network ('GDFN' block in the figure) that performs feature aggregation over the modulated features.
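To make the fusion flow concrete, the following is a minimal PyTorch sketch of a depth-aware fusion block in the spirit of the DaT block described above: cross-attention with an RGB query over depth-based keys and values, SFT-style modulation of the RGB features, and a gated feed-forward aggregation. All layer choices, tensor shapes, and hyperparameters are illustrative assumptions, and the class name DaTBlockSketch is hypothetical; this is not the authors' implementation.

# Sketch of a DaT-style fusion block: cross-attention (RGB query, depth key/value),
# SFT modulation, and gated feed-forward aggregation. Shapes and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DaTBlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, ffn_expansion: int = 2):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

        # 1) Cross-attention: RGB features provide the query Q,
        #    (shifted) depth features provide the keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # 2) SFT-style layer: predict per-token scale and shift from the
        #    adapted depth features z and modulate the RGB features F_I.
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

        # 3) Gated feed-forward network (GDFN-style): one branch gates the other.
        hidden = dim * ffn_expansion
        self.ffn_in = nn.Linear(dim, hidden * 2)
        self.ffn_out = nn.Linear(hidden, dim)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_depth: (B, N, C) token sequences, e.g. flattened H*W feature maps.
        q = self.norm_rgb(f_rgb)
        kv = self.norm_depth(f_depth)

        # Adapted depth features z: depth keys/values attended by the RGB query.
        z, _ = self.cross_attn(query=q, key=kv, value=kv)

        # SFT-style modulation of the RGB features with z.
        modulated = f_rgb * (1.0 + self.to_scale(z)) + self.to_shift(z)

        # Gated feed-forward aggregation over the modulated features.
        x1, x2 = self.ffn_in(modulated).chunk(2, dim=-1)
        out = self.ffn_out(F.gelu(x1) * x2)
        return f_rgb + out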

Video results

We conducted a comprehensive evaluation of the role of depth in video deblurring. Our findings indicate that video deblurring methods can compensate for the lack of depth information by accessing longer context windows. For single-image deblurring, scenes that are indoor, in close proximity, or contain only camera motion clearly benefit from depth. Here, we show some video results from our extensive evaluation.

Citation

@article{torres2024davide,
  title={DAVIDE: Depth-Aware Video Deblurring},
  author={Torres, German F and Kalliola, Jussi and Tripathy, Soumya and Acar, Erman and K{\"a}m{\"a}r{\"a}inen, Joni-Kristian},
  journal={arXiv preprint arXiv:2409.01274},
  year={2024}
}