Description and purpose
Despite impressive advances in driverless vehicle technology, several limitations remain. Most of them concern the ability of autonomous systems to effectively perceive entities and events in the environment and to compute timely navigation and reaction commands. The core of this project is the investigation of multimodal perception-to-action solutions that tackle these issues and the development of frameworks able to perceive and interpret entities and events even outside the field of view.
Website: https://isar.unipg.it/project/lisa-listen-see-and-act/
Purpose
This project aims to empower Autonomous Vehicles with novel perception-to-action capabilities that rely on multiple, heterogeneous data sources. In particular, we leverage the combination of visual and audio information towards a more robust, efficient, and descriptive representation of the vehicle's surroundings. Sound, in fact, provides omnidirectional perception, overcoming the limitations imposed by occlusions and thus enhancing the vehicle's awareness of the scene.
Expected results
We have been working towards:
- Developing methods and models to detect and localise acoustic events in urban scenarios;
- Generating joint representations of audio-visual events to enable the development of multimodal systems that model the spatiotemporal relationships of audio-visual inputs;
- Developing perception-to-action methodologies to map audio-visual cues to vehicle control commands and improve autonomous navigation capabilities (see the sketch below).
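To make the perception-to-action idea more concrete, the following is a minimal, hypothetical sketch of how visual and audio embeddings could be fused and mapped to control commands. It is not the project's actual pipeline: the network layout, input sizes, and the AudioVisualPolicy name are illustrative assumptions only.

```python
# Illustrative only: a minimal audio-visual fusion network mapping a camera
# frame and an audio spectrogram to low-level control commands (steering,
# throttle). Architecture and dimensions are hypothetical assumptions.
import torch
import torch.nn as nn


class AudioVisualPolicy(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Visual branch: small CNN over an RGB frame (3 x 128 x 128 assumed).
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, embed_dim),
        )
        # Audio branch: CNN over a single-channel log-mel spectrogram
        # (1 x 64 x 64 assumed), providing omnidirectional context such as
        # a siren approaching from outside the camera's field of view.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, embed_dim),
        )
        # Late fusion of the two embeddings, then regression to
        # [steering, throttle] in [-1, 1].
        self.policy_head = nn.Sequential(
            nn.Linear(2 * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),
        )

    def forward(self, frame: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(frame)
        a = self.audio_encoder(spectrogram)
        return self.policy_head(torch.cat([v, a], dim=-1))


if __name__ == "__main__":
    policy = AudioVisualPolicy()
    frame = torch.randn(1, 3, 128, 128)       # dummy camera frame
    spectrogram = torch.randn(1, 1, 64, 64)   # dummy log-mel spectrogram
    print(policy(frame, spectrogram))         # two control values in [-1, 1]
```

In a late-fusion design of this kind, the audio branch can still trigger a reaction (for example, yielding to an approaching emergency vehicle) even when the corresponding event is occluded from the camera.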
Achieved results
The project is progressing as planned, meeting its objectives and milestones on schedule. Specifically, our efforts have led to the successful development of an audio-video perception pipeline for the detection and localisation of events in urban scenes. Furthermore, we created a fully customisable audio-visual simulator for data collection and for testing our perception-to-action modules. We are currently working on a final real-world evaluation of our systems.