AudioViewer: Learning to Visualize Sounds
Chunjin Song*1, Yuchi Zhang*1, Willis Peng1, Parmis Mohaghegh1, Bastian Wandt1,2, Helge Rhodin1
1The University of British Columbia   2Linköping University, Sweden
[paper] [code]
Abstract

A long-standing goal in the field of sensory substitution is enabling sound perception for deaf and hard of hearing (DHH) people by visualizing audio content. Unlike existing models that translate to sign language, between speech and text, or between text and images, we target immediate, low-level audio-to-video translation that applies to generic environment sounds as well as human speech. Since such a substitution is artificial and no labels exist for supervised learning, our core contribution is to build a mapping from audio to video that learns from unpaired examples via high-level constraints. For speech, we additionally disentangle content from style, such as gender and dialect. Qualitative and quantitative results, including a human study, demonstrate that our unpaired translation approach maintains important audio features in the generated video, and that videos of faces and numbers are well suited for visualizing high-dimensional audio features that humans can parse to match and distinguish between sounds and words.
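To make the content/style disentanglement concrete, here is a toy sketch of splitting an audio latent code into a content part and a style part and swapping styles between two utterances. The dimensions and the content/style split are illustrative assumptions, not the released implementation.

import torch

LATENT_DIM, CONTENT_DIM = 64, 48   # assumed sizes; style gets the remaining dims

z = torch.randn(1, LATENT_DIM)     # latent code from an audio encoder (stand-in)
z_content = z[:, :CONTENT_DIM]     # what is said (phonetic content)
z_style = z[:, CONTENT_DIM:]       # how it is said (e.g., speaker gender/dialect)

# Swapping the style part between two utterances keeps the spoken content
# but changes the speaker characteristics in the resulting visualization.
z_other = torch.randn(1, LATENT_DIM)
z_swapped = torch.cat([z_content, z_other[:, CONTENT_DIM:]], dim=1)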

Overview

AudioViewer is a tool developed toward the long-term goal of helping hearing-impaired people see what they cannot hear. We map an audio stream to video, using faces or numbers to visualize the high-dimensional audio features intuitively. Unlike lip reading, this can encode general sound and convey information about the style of the spoken language. A minimal sketch of the mapping idea is given below.
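The sketch below illustrates one visualization step: a spectrogram window is encoded to a latent vector, which is then decoded into a face frame. The module names, latent size, and spectrogram shape are assumptions for exposition, not the actual trained networks.

import torch
import torch.nn as nn

LATENT_DIM = 64            # assumed shared latent size
MEL_BINS, FRAMES = 80, 32  # assumed spectrogram window shape

class AudioEncoder(nn.Module):
    """Encodes one spectrogram window into a latent vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(MEL_BINS * FRAMES, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )
    def forward(self, mel):            # mel: (B, MEL_BINS, FRAMES)
        return self.net(mel)

class FaceDecoder(nn.Module):
    """Decodes a latent vector into one 64x64 grayscale video frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64), nn.Sigmoid(),
        )
    def forward(self, z):
        return self.net(z).view(-1, 1, 64, 64)

# One visualization step: spectrogram window -> latent -> video frame.
encoder, decoder = AudioEncoder(), FaceDecoder()
mel_window = torch.randn(1, MEL_BINS, FRAMES)  # stand-in for a real mel window
frame = decoder(encoder(mel_window))           # shape (1, 1, 64, 64)
print(frame.shape)

Running this over consecutive spectrogram windows and stacking the decoded frames yields the output video.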

Human Speech Visualization

We use our model to visualize three test-set utterances, arranged like a training triplet: in the first two, different people speak the same sentence; the last two share the same speaker but differ in content. All three utterances are concatenated into a single audio/video clip.

Environment Sound Visualization

Our phone-level model generalizes well to natural sounds despite being trained only on human speech.

A Live Demo of AudioViewer

This live recording of our AudioViewer prototype demonstrates that human speech can be visualized in real time (shown here in a low-resolution version) and that the system works for different speakers. The slight delay stems from running our web application on a consumer PC and from the windowed FFT, which is not yet optimized for streaming. We speak slowly to make it easier to associate the visualization with the speech despite the noticeable delay.
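The sketch below shows a windowed FFT front end of the kind mentioned above and why it introduces latency: a full analysis window must be buffered before the first spectrum can be produced. The window and hop sizes are typical values, not the exact prototype settings.

import numpy as np

SAMPLE_RATE = 16000
WIN = 1024   # analysis window length in samples (~64 ms at 16 kHz)
HOP = 256    # hop between successive windows (~16 ms)

def stream_spectra(samples):
    """Yield one magnitude spectrum per hop from a buffered audio stream."""
    window = np.hanning(WIN)
    for start in range(0, len(samples) - WIN + 1, HOP):
        chunk = samples[start:start + WIN] * window
        yield np.abs(np.fft.rfft(chunk))  # feed this to the visualization model

# The first spectrum is only available after WIN samples have been captured,
# so the visualization lags the speech by at least WIN / SAMPLE_RATE seconds.
audio = np.random.randn(SAMPLE_RATE)      # stand-in for one second of mic input
spectra = list(stream_spectra(audio))
print(len(spectra), spectra[0].shape)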

@inproceedings{audioviewer,
  title={AudioViewer: Learning to Visualize Sounds},
  author={Song, Chunjin and Zhang, Yuchi and Peng, Willis and Wandt, Bastian and Rhodin, Helge},
  booktitle={WACV},
  year={2023}
}