Creating a new research community on detection and classification of acoustic scenes and events: Lessons from the first ten years of DCASE challenges and workshops

Mark D. Plumbley 1
Centre for Vision, Speech and Signal Processing, University of Surrey
Guildford, Surrey GU2 7XH, United Kingdom

Tuomas Virtanen 2
Audio Research Group, Tampere University
P.O. Box 553, FI-33101 Tampere, Finland

1 m.plumbley@surrey.ac.uk
2 tuomas.virtanen@tuni.fi

ABSTRACT

Research work on automatic speech recognition and automatic music transcription has been around for several decades, supported by dedicated conferences or conference sessions. However, while individual researchers have been working on recognition of more general environmental sounds, until ten years ago there were no regular workshops or conference sessions where this research, or its researchers, could be found. There was also little available data for researchers to work on or to benchmark their work. In this paper we will outline how a new research community working on Detection and Classification of Acoustic Scenes and Events (DCASE) has grown over the last ten years, from two challenges on acoustic scene classification and sound event detection with a small workshop poster session, to an annual data challenge with six tasks and a dedicated annual workshop, attracting hundreds of delegates and strong industry interest. We will also describe how the analysis methods have evolved, from mel frequency cepstral coefficients (MFCCs) or cochleagrams classified by support vector machines (SVMs) or hidden Markov models (HMMs), to deep learning methods such as transfer learning, transformers, and self-supervised learning. We will finish by suggesting some potential future directions for automatic sound recognition and the DCASE community.

1. INTRODUCTION

Imagine you are standing on a street corner in a city. Close your eyes: what do you hear? Perhaps some cars and buses driving on the road, footsteps of people on the pavement, beeps from a pedestrian crossing, rustling and clonks from shopping bags and boxes, and the hubbub of talking shoppers. To most people, this skill of listening to everyday scenes and events is so natural that it is taken for granted. However, this has been a very challenging task for computers.

The ability to automatically recognize sound scenes and events has major potential impact in a wide range of applications. Some examples include: for broadcasters or sound designers, an ability to edit programmes and soundtracks by seeing the events in the track, or perhaps automatically; for ecology, the use of sound to monitor populations and movements of birds, animals and insects, informing better environmental policies; and for self-driving vehicles, improved awareness of warnings and events in the vicinity. However, until recently there has been relatively little research into automatic recognition of everyday sounds, compared to other topics such as computer vision, with no community for researchers interested in this area.

Over the last decade or so this has begun to change, with the creation of a research community around the now-annual challenges and workshops on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we will take a look at how this community was created, starting with the first DCASE challenge in 2012-13, the changes in technology over this period, and some lessons learned for the future.
2. EARLY WORK ON EVERYDAY SOUND RECOGNITION

While analysis of general environmental audio has received less attention than speech or musical audio, there has been some research on recognition of general environmental sounds for at least the last three decades or so [1], including for applications such as alarm sound detection [2], personal audio recordings [3], or audio-assisted cameras [4]. Early work in audio event recognition often used features developed for speech recognition, such as mel-frequency cepstral coefficients (MFCCs) [1], or generic features such as spectral centroid or skewness [5]. Sequences could be recognised using a hidden Markov model [6], or a bag-of-frames approach could be used for acoustic scene recognition [7].

Despite this research activity, there was no community or home for researchers to meet, and no "critical mass" to drive research in this area. Researchers working on general sound recognition were typically scattered across other communities, such as noise control (Internoise), speech processing (Interspeech, Eurospeech), music processing (ISMIR), or acoustics (Acoustical Society of America). Similarly, publications and presentations were often found scattered throughout workshops, conferences and journals in the speech, music and acoustics communities.

3. TOWARDS DCASE

3.1. Initial explorations (2010-2011)

To explore potential interest in real-world sound recognition, one of us organized a one-day Machine Listening Workshop in December 2010 (http://c4dm.eecs.qmul.ac.uk/mlw2010/), to bring together interested researchers and to explore opportunities. From discussions at that workshop, and with other international groups during 2011 (e.g. at the IEEE ICASSP International Conference on Acoustics, Speech and Signal Processing), it became clear that there was interest and some existing work in this research area, but that there were barriers holding back progress. The main barriers appeared to be: (a) there was no coherent identity for the research area (it was sometimes awkwardly referred to as "non-speech non-music audio"); (b) there was no "home" for research papers, with publications scattered across different journals, workshops, and conference sessions; and (c) there were few sharable datasets, making it hard for researchers to compare their research.

From these discussions, the idea of organizing a data challenge emerged. This was inspired by experience with data challenges in related fields, such as the music information retrieval evaluation eXchange (MIREX) challenge series [8] and the signal separation evaluation campaign (SiSEC) challenge series [9]. There had also been the closely-related CLEAR evaluations in 2006 and 2007 [10], which included an evaluation on "Acoustic Event Detection and Classification", but these evaluations did not continue beyond the end of the CHIL project which sponsored them.

3.2. Organizing the DCASE 2013 Challenge (2011-2013)

The first DCASE challenge was developed as a collaboration between the groups of Plumbley (Queen Mary University of London, UK) and Lagrange (Ecole Centrale de Nantes, France), with discussions starting in late 2011. We successfully proposed this to the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee (TC) as an "AASP Challenge".
This recognition by an academic society TC helped us to secure an exceptional special poster session and overview talk from the organizers of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA, October 2013), including some reserved delegate places for DCASE presenters.

To help ensure the success of the challenge design, and to help with additional visibility, the challenge team held a community discussion (Aug-Sep 2012) on challenge task design. These discussions helped to address issues such as data formats and metrics, and ensured that there was an international community already aware of the challenge, and ready to participate when it was launched. Data was released in December 2012, with a deadline for submissions in April 2013. The challenge attracted 24 submissions from 18 teams. To participate, researchers submitted software code for their systems, which was run by the challenge team. While attempts were made to avoid portability issues through the use of either Matlab submissions or the provision of a virtual machine, it did prove somewhat time-consuming for the challenge team to run submitted code. Subsequent challenges have instead mostly required participants to run their own code and submit labels for evaluation.

3.3. Dissemination Activity

To raise additional awareness of the area, and to bring together national and international groups before the challenge results were available, a special session was organized at the ICASSP international signal processing conference (May 2013), and a further one-day workshop on "Listening in the Wild" was organized at Queen Mary University of London (June 2013).

Immediate dissemination of the challenge tasks and results was made initially through overview conference publications at the European Signal Processing Conference (EUSIPCO, September 2013) and the IEEE WASPAA workshop (October 2013) [11], with 12 posters presented in a special poster session at WASPAA 2013. The challenge team proposed and published a paper on acoustic scene classification, designed to reach a broad readership, in IEEE Signal Processing Magazine (published May 2015) [12]. An academic journal paper describing the DCASE challenge and results was published in IEEE Transactions on Multimedia (published October 2015) [13]. To support our submission to the academic journal, the Chair of the IEEE AASP Challenges Subcommittee provided a supporting letter, since it was a somewhat unusual type of paper. The Transactions paper was for many months the most downloaded paper from the journal, and has now been cited over 500 times.

4. AN ANNUAL DCASE CHALLENGE AND WORKSHOP

Following the conclusion of the first DCASE challenge, the two current authors discussed possible future directions, including at the ICASSP 2014 conference in Florence. With the award of an ERC Starting Grant "EVERYSOUND" starting in 2015, Virtanen and his team led the organization of a follow-up challenge, with a dedicated one-day post-conference workshop after the EUSIPCO European Signal Processing Conference (September 2016). The workshops have expanded steadily since then, with a 1½-day workshop in Munich, Germany in 2017, 2-day workshops in Surrey, UK in 2018 and New York, USA in 2019, and more flexible formats in 2020 (virtual Tokyo) and 2021 (virtual Barcelona) during the COVID-19 pandemic. Participation has risen from 82 challenge submissions in 2016 to around 400 in 2020-2021.
Workshop attendance has also risen, from 80 in 2016 to over 550 virtual attendees in 2021. For more details see Figure 1 and https://dcase.community/events.

Figure 1: DCASE Challenge entries and Workshop attendance for 2013 and 2016-2021 (2020 and 2021 were virtual). [Two panels: challenge teams and entries per year; workshop attendance and papers per year.]

The first DCASE challenge in 2013 consisted of two tasks, which form the core of the DCASE research area: acoustic scene classification (ASC) and sound event detection (SED). Over the years, the number of tasks has increased: the 2016 and 2017 challenges consisted of four tasks, the 2018 and 2019 challenges of five tasks, and the 2020-2022 challenges of six tasks. The two core tasks (ASC and SED) have always been part of DCASE, but they have been continually renewed, for example by using larger datasets to address the problem of robustness to different capture devices, and by placing limits on computational complexity to make the submitted systems more applicable in real systems. In addition to the ASC and SED core tasks, new tasks have been added to address new and emerging problems such as sound event localization and detection, anomalous sound detection, and audio captioning.

As new tasks were added, more people have become involved in challenge organization. Since 2016, DCASE has followed a format where an independent organizing team is responsible for organizing each task. Data, baseline methods, and evaluation methods are provided by each task organizing team. Also since 2016, the challenge has followed a format where development data is first made available to participants for around two months to develop their systems. The development data of each task typically consists of audio and reference system output, divided into training and development-testing subsets, so that participants can score and develop their systems during the development stage. Evaluation data is then released, and participants have about one month to produce the output of their system, which they send for evaluation. For each of the submissions, challenge organizers then compute the scores, such as accuracy, and these are made publicly available on the challenge website (a minimal sketch of this scoring step is given below). The change from code submission to submission of system outputs has allowed participation from those who do not want to submit their code.

Since 2019, the DCASE tasks have been selected based on a public call for task proposals, which are evaluated by the DCASE Steering Group. While each task is organized independently by a separate organizing team, all the tasks share the same schedule in terms of the release of development data, the release of evaluation data, the deadline for submitting system outputs, and the publication of results. DCASE also has a common website for all the tasks, and there has been coordination to present information about each of the tasks in a coherent way, to allow easy participation in multiple tasks.

One of the core principles of DCASE has been supporting reproducibility of research and open science. Therefore, all the DCASE tasks have used data that is publicly available. DCASE has also required that participants write technical reports about their systems, which are made publicly available on the challenge website. Reproducibility is further encouraged by inviting participants to make software implementations of their systems publicly available, and in recent years "Judge's Award" prizes have been given which include reproducibility as an evaluation criterion.
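To make the scoring step described above concrete, the following is a minimal sketch of how challenge organizers might compute a score such as accuracy from submitted system outputs for a classification task. The two-column CSV format and the file names here are hypothetical, for illustration only; each DCASE task defines its own submission format and metrics (for example, event-based F-scores for sound event detection).

# Minimal sketch of the scoring step for a classification task: organizers
# hold the reference labels for the evaluation set, participants submit one
# predicted label per file, and accuracy is computed over all files.
# The "filename,label" CSV format and file names are hypothetical.
import csv

def load_labels(path: str) -> dict[str, str]:
    """Read 'filename,label' rows into a dictionary."""
    with open(path, newline="") as f:
        return {row[0]: row[1] for row in csv.reader(f)}

def accuracy(reference: dict[str, str], submitted: dict[str, str]) -> float:
    """Fraction of evaluation files whose submitted label matches the reference."""
    correct = sum(1 for name, label in reference.items()
                  if submitted.get(name) == label)
    return correct / len(reference)

if __name__ == "__main__":
    ref = load_labels("evaluation_reference.csv")   # held by the organizers
    sub = load_labels("team_submission.csv")        # sent by a participant
    print(f"Accuracy: {accuracy(ref, sub):.3f}")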
As well as the DCASE workshop itself, the area has now become an established part of the audio signal processing research community. For example, the IEEE now has an official topic classification (EDICS) term for "Detection and Classification of Acoustic Scenes and Events" (AUD-CLAS), and recent ICASSP conferences have included regular sessions on this topic. The present authors, together with Ellis, have also edited a book [14] to act as a reference work for the field.

5. TASKS AND DATASETS

Current technologies for audio analysis rely heavily on machine learning, and therefore the data used to train a system plays a critical role. Test data is also needed to evaluate the performance of systems. Many datasets that have been introduced or used in the DCASE Challenge have become established in the general field of environmental audio analysis.

All DCASE tasks to date (except 2022 Task 6b, Language-Based Audio Retrieval) are formulated as audio analysis tasks in which a system takes a piece of audio as input and must produce an output specific to each task. Datasets used in DCASE are organized to support the development of systems for these kinds of tasks, consisting of audio and task-specific annotations. The annotations serve as a target output for training systems, and as a reference for testing the performance of systems. The development data has typically been split into training and development-testing subsets by dataset collectors or task organizers, and participants are required to report the results they obtain on the development-testing set in their technical reports. Since the annotations of the development set are publicly available, participants can also use them in scientific publications to benchmark methods developed outside the DCASE Challenge, with an evaluation setup whose results can be compared with those obtained by others. At the DCASE evaluation stage, evaluation datasets consisting of audio only are published. Some tasks have published the annotations of evaluation data after the evaluation stage, but others have kept the annotations unpublished, to allow the same data to be reused in later editions of the task. We will now outline some of the most common tasks in DCASE and some important associated datasets.

Acoustic scene classification: The goal in ASC is to classify the acoustic scene of an input audio signal. The set of possible acoustic scene classes is predefined and there is training material available for each of the classes, so this is a single-label supervised learning problem. Scene classes can include, for example, "Indoor shopping mall", "Metro station", "Pedestrian street", and "Public square", as in DCASE ASC Task 1 between 2018 and 2022. The most important application of ASC is context-aware devices, where a device obtains information about its context based on audio.
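As an illustration of this single-label supervised formulation, here is a minimal ASC sketch in the spirit of the early systems discussed in Section 6: clip-level MFCC statistics classified with an SVM. It uses librosa and scikit-learn; the file paths and labels are placeholders, and a real system would train on a dataset's official training split and report results on its development-testing split.

# Minimal single-label ASC sketch in the spirit of early DCASE systems:
# summarize each clip by MFCC statistics, then train an SVM classifier.
# The file paths and labels below are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Summarize one audio clip as the mean and std of its MFCC frames."""
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_files = ["bus_001.wav", "park_001.wav"]   # hypothetical paths
train_labels = ["bus", "park"]                  # one scene label per clip

X = np.stack([clip_features(f) for f in train_files])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, train_labels)

# Predict the scene class of a new, unlabeled clip.
print(model.predict(np.stack([clip_features("unknown_clip.wav")])))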
DCASE ASC tasks have been organized around variants of two datasets. Firstly, in 2016 and 2017 the task was based on TUT Acoustic Scenes 2016 / 2017 [15,16], which consists of 30-second (2016) or 10-second (2017) segments of audio from 15 different classes, such as "Bus", "Cafe / Restaurant", "Car", and "City center". The audio is binaural, and the recordings were mostly made in two cities in Finland (Tampere and Helsinki). The total amount of audio in the development set is 13 hours. Secondly, between 2018 and 2022 the task has been based on variants of the TAU Urban Acoustic Scenes 2020 Mobile dataset [17], which consists of 10-second audio signals from 10 classes such as "Indoor shopping mall", "Metro station", "Pedestrian street", and "Public square". The dataset was recorded in 12 large European cities such as Amsterdam, Barcelona, Helsinki, and Lisbon. The recordings were made with four different devices (high-quality binaural microphones, two smartphones, and a GoPro), to allow robustness to different devices to be studied. The total amount of high-quality binaural audio is 40 hours. The biggest changes in the ASC datasets over the years are the increasing amount of data, the increasing diversity of recording locations, and the addition of multiple recording devices.

Sound event detection: The goal in SED is to estimate when specific target sound classes are active in an input audio signal. It differs from ASC by including the estimation of temporal activity; furthermore, in an SED task multiple target classes can be active simultaneously (a minimal sketch of this formulation is given at the end of this section). Producing reference annotations of sound events requires laborious manual work, and therefore the first SED datasets were somewhat small-scale and used synthetic mixtures, in which isolated sound events were mixed with background audio. As annotation procedures have developed, recent SED datasets have increased in size. Typical SED applications include smart homes, smart cities, and bioacoustics, and DCASE has had tasks related to each of these domains. Example target classes include "Baby crying", "Glass breaking", "Alarm/bell/ringing", and "Ambulance siren". Some SED applications do not require exact temporal information about the activities of sound events, only whether they are present in an audio signal or not. DCASE has also included tasks related to this kind of weak-label estimation, or "audio tagging", including tasks using subsets of the large-scale AudioSet dataset [18].

Sound event localization and detection: A natural extension of SED is to estimate the locations of sound sources, when a multi-microphone recording is available to allow spatial analysis. Annotating the locations of sound sources is even more laborious than annotating temporal activities, and therefore the first editions of the task (2019-2021) used data in which sound events at different positions were simulated by convolving isolated sound event signals with impulse responses corresponding to different positions, and mixing them with real ambience recordings. The latest edition (2022) includes real recordings, where the annotations were produced with the help of additional sensors.

The most recent task additions to DCASE are anomalous sound detection and audio captioning (both introduced in 2020), and language-based audio retrieval (introduced in 2022). Anomalous sound detection datasets include different types of machinery, such as fans, gearboxes, and bearings, and the goal is to analyze whether the sound produced by the machinery is normal or anomalous, the latter indicating a problem in the condition of the machinery [19]. In audio captioning, the goal of a system is to output text describing the contents of an input audio signal. In language-based audio retrieval, the goal of a system is to retrieve from a database an audio signal matching a given textual description. These two tasks have used the Clotho dataset [20], which consists of audio files paired with textual captions describing their contents.
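To illustrate the SED formulation mentioned above, here is a minimal sketch of the output stage of a detector: per-frame, per-class probabilities (random numbers below, standing in for the output of a trained model) are thresholded and merged into (onset, offset) event annotations, with several classes allowed to be active at once. The class names, hop size, and threshold are illustrative only.

# Minimal sketch of the SED output stage: frame-wise class probabilities
# are binarized per class and merged into (onset, offset) events.
# Class names, hop size, and threshold are illustrative assumptions.
import numpy as np

CLASSES = ["alarm", "dog_bark", "speech"]   # illustrative target classes
HOP_SECONDS = 0.02                          # assumed frame hop of the model

def activity_to_events(active: np.ndarray, hop: float) -> list[tuple[float, float]]:
    """Turn a boolean per-frame activity vector into (onset, offset) pairs."""
    events, onset = [], None
    for i, a in enumerate(active):
        if a and onset is None:
            onset = i * hop                 # event starts
        elif not a and onset is not None:
            events.append((onset, i * hop)) # event ends
            onset = None
    if onset is not None:
        events.append((onset, len(active) * hop))
    return events

# Stand-in for model output: (frames, classes) probabilities.
probs = np.random.rand(500, len(CLASSES))
active = probs > 0.8                        # per-class binarization threshold

for c, name in enumerate(CLASSES):
    for onset, offset in activity_to_events(active[:, c], HOP_SECONDS):
        print(f"{name}: {onset:.2f}-{offset:.2f} s")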
6. TECHNOLOGIES

Since the first DCASE Challenge in 2013, methods have evolved substantially. Typical systems submitted to DCASE 2013 used features such as mel frequency cepstral coefficients (MFCCs), cochleagrams, or non-negative matrix factorization (NMF) of spectrograms, with classification performed using a support vector machine (SVM), random forest, Gaussian mixture model (GMM), or hidden Markov model (HMM). Most submitted systems used Matlab, with some using Python [13]. By the DCASE 2016 Challenge, while spectrogram, MFCC and NMF features were still being used, as well as others such as the constant-Q transform (CQT) and mel filterbank, we were beginning to see classification using deep learning methods, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Another emerging trend was the use of ensemble classifiers [21].

As the challenges have progressed, methods used include a wide variety of CNN and other deep learning architectures, including models based on ResNet, EfficientNet and MobileNetV2; pre-trained models such as VGGish and PANNs; transfer learning and teacher-student models; and recent approaches such as Transformers and self-supervised learning. For more details, DCASE workshop editions typically include reports or posters on the challenge tasks; see e.g. https://dcase.community/workshop2021/program#challenge-posters
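As an illustration of this shift to deep learning, the following is a minimal sketch of the kind of CNN classifier that became common from DCASE 2016 onwards: a small convolutional network over a log-mel spectrogram input, producing per-class logits. The layer sizes, input dimensions, and number of classes are illustrative assumptions, not taken from any particular submission.

# Minimal sketch of a small CNN scene classifier over log-mel spectrograms,
# in the spirit of post-2016 DCASE systems. All sizes are illustrative.
import torch
import torch.nn as nn

class SmallCnnClassifier(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # input: (batch, 1, mels, frames)
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)
        h = h.mean(dim=(2, 3))        # global average pooling over time-frequency
        return self.head(h)           # per-scene-class logits

# One random stand-in "log-mel" batch: 4 clips, 64 mel bands, 500 frames.
logits = SmallCnnClassifier()(torch.randn(4, 1, 64, 500))
print(logits.shape)                   # torch.Size([4, 10])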
7. CONCLUSIONS

We have described the growth of a new research community, on Detection and Classification of Acoustic Scenes and Events, since the first DCASE Challenge was organized in 2012-13. This has created a "home" for researchers in this area, with an annual challenge and workshop, an expanding set of datasets and code, and emerging tasks such as anomaly detection and audio captioning.

As this community has emerged, we have learned a number of useful lessons along the way. A successful data challenge can take some time to organize, including discussions with collaborators, ethics approval, data collection, design of baseline systems and evaluation criteria, and dissemination of outcomes. Nevertheless, the datasets and code encourage a reproducible research mindset that can draw researchers into a new field, creating a community where "a rising tide lifts all boats". This new community also facilitates academic-industry knowledge exchange, with a third to a half of workshop delegates from industry, and companies both large and small sponsoring the workshop. It is also important to link the community to related communities, such as machine learning, audio signal processing, and noise control. While virtual discussions can be challenging, we have also seen that virtual workshops offer an opportunity for many more people to engage with the community.

Future workshops may use a hybrid format, combining in-person interaction for those who can attend with remote attendance as an affordable option for those who cannot travel. Challenges will continue to evolve, with new tasks such as language-based audio retrieval emerging in 2022. We are also seeing more tasks clearly related to beneficial outcomes, such as sound event detection in domestic environments that can be applied to assisted living, bioacoustic event detection that can be applied to environmental monitoring, or anomalous sound detection for machine condition monitoring. We look forward to DCASE continuing to be a supportive and open research community, helping researchers realize their potential to create beneficial results for society.

8. ACKNOWLEDGEMENTS

The authors would like to thank the DCASE Challenge organizers and participants, and the DCASE Workshop hosts and delegates, for helping to make the DCASE community what it is today.

This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound". For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. Data on DCASE Workshop statistics is available at https://dcase.community/events.

9. REFERENCES

1. Goldhor, R. S. (1993). Recognition of environmental sounds. In Proc. ICASSP-93, pp. I-149-152.
2. Ellis, D. (2001). Detecting alarm sounds. In Proc. Workshop on Consistent & Reliable Acoustic Cues for Sound Analysis (CRAC), pp. 59-62.
3. Ogle, J. & Ellis, D. (2007). Fingerprinting to identify repeated sound events in long-duration personal audio recordings. In Proc. ICASSP 2007, pp. I-233-236.
4. Smaragdis, P. & Raj, B. (2007). Audio-assisted cameras and acoustic doppler sensors. In Z. Zhu & T. Huang (eds), Multimodal Surveillance: Sensors, Algorithms and Systems, Artech House.
5. Casey, M. (2002). General sound classification and similarity in MPEG-7. Organised Sound, 6:153-164.
6. Wang, J., Xu, C., & Chng, E. (2006). Automatic sports video genre classification using pseudo-2d-HMM. In Proc. ICPR'06, pp. 778-781.
7. Aucouturier, J. J., & Defreville, B. (2007). Sounds like a park: A computational technique to recognize soundscapes holistically, without source identification. In Int. Congr. Acoust., pp. 2-7.
8. Downie, J. S., Ehmann, A. F., Bay, M., & Jones, M. C. (2010). The music information retrieval evaluation exchange: Some observations and insights. In Advances in Music Information Retrieval (pp. 93-115). Springer, Berlin, Heidelberg.
9. Vincent, E., Araki, S., Theis, F., Nolte, G., Bofill, P., Sawada, H., Ozerov, A., Gowreesunker, V., Lutter, D. & Duong, N. Q. (2012). The signal separation evaluation campaign (2007-2010): Achievements and remaining challenges. Signal Processing, 92, pp. 1928-1936.
10. Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R. T., Michel, M., & Garofolo, J. (2007). The CLEAR 2007 evaluation. In Multimodal Technologies for Perception of Humans (pp. 3-34). Springer, Berlin, Heidelberg.
11. Giannoulis, D., Benetos, E., Stowell, D., Rossignol, M., Lagrange, M., & Plumbley, M. D. (2013). Detection and classification of acoustic scenes and events: An IEEE AASP challenge. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2013).
12. Barchiesi, D., Giannoulis, D., Stowell, D., & Plumbley, M. D. (2015). Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3), pp. 16-34.
13. Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., & Plumbley, M. D. (2015). Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17, pp. 1733-1746.
14. Virtanen, T., Plumbley, M. D., & Ellis, D. (Eds.). (2018). Computational Analysis of Sound Scenes and Events. Heidelberg: Springer.
15. Mesaros, A., Heittola, T., & Virtanen, T. (2016). TUT database for acoustic scene classification and sound event detection. In Proc. EUSIPCO 2016.
16. Mesaros, A., Heittola, T., Diment, A., Elizalde, B., Shah, A., Vincent, E., Raj, B., & Virtanen, T. (2017). DCASE 2017 challenge setup: Tasks, datasets and baseline system. In Proc. DCASE 2017, pp. 85-92.
17. Mesaros, A., Heittola, T., & Virtanen, T. (2018). A multi-device dataset for urban acoustic scene classification. In Proc. DCASE 2018.
18. Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP 2017, pp. 776-780.
19. Kawaguchi, Y., Imoto, K., Koizumi, Y., Harada, N., Niizumi, D., Dohi, K., ... & Endo, T. (2021). Description and discussion on DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions. In Proc. DCASE 2021, pp. 186-190.
20. Drossos, K., Lipping, S., & Virtanen, T. (2020). Clotho: An audio captioning dataset. In Proc. ICASSP 2020, pp. 736-740.
21. Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., & Plumbley, M. D. (2018). Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26, pp. 379-393.