Influence of several audio parameters in urban sound event classification

Simona Domazetovska ¹, Viktor Gavriloski ², Maja Anachkova ³
Faculty of Mechanical Engineering in Skopje, Ss. Cyril and Methodius University in Skopje, North Macedonia

ABSTRACT
As urban areas become more dynamic and overcrowded, the problems with noise pollution grow, motivating the creation of urban sound classification systems. This paper presents a system for the classification of sound events, built around a process for detecting and parameterizing sound signals that are then classified with machine learning algorithms. Since urban sounds are the sounds of interest in this study, the UrbanSound8K dataset was used for training and testing the system. Five audio parameters were analyzed and tested with three machine learning algorithms in order to investigate their influence on the accuracy of recognizing the class of an urban sound event, yielding a collection of 48 different model combinations. The applied audio parameters (MFCC, Mel spectrogram, Chromagram, Spectral Contrast, and Tonal Centroid) were chosen because they are widely used for urban noise classification, and combining them into diverse feature vectors can guide the choice of the right set of parameters. The accuracy results differed for each combination of audio parameters, which led to the selection of the set of parameters that reaches the highest recognition accuracy. The classification process was established using three ML algorithms: Support Vector Machines (SVM), Random Forest (RF), and Naïve Bayes (NB).

¹ simona.domazetovska@mf.edu.mk
² viktor.gavriloski@mf.edu.mk
³ maja.anachkova@mf.edu.mk

1. INTRODUCTION
The application of advanced methods for environmental noise analysis through the development of sound event classification significantly improves and simplifies the process of noise assessment. Classification systems are built by applying digital signal processing techniques that help make sense of the sound signal by forming patterns that are easier for machine learning (ML) algorithms to recognize. The audio parametrization process, which extracts the important features from the audio signal, is essential, as its output serves as the input to the ML algorithms for identifying features, learning from the data, and classifying the sound events. Audio parametrization is a key requirement for audio signal classification [1]. In [2], the researchers reviewed the existing audio parameters and divided them into six domains: temporal, frequency, cepstral, modulation frequency, eigen domain, and phase space domain. From there, depending on the dataset and the application of the classification system, it becomes easier to choose which parameters to use in the feature extraction process.

Many studies in the literature have examined classification systems using different audio parameters and machine learning algorithms. To test the classification accuracy of road vehicle sources in [3], 13 signal features and 4 machine learning techniques were applied, giving a collection of 52 tested combinations.
In [4], classification of urban sound events was established by applying five classifiers and six audio parameters. Similarly, researchers have analyzed urban sounds by combining feature extraction and deep learning techniques to process a stream of audio and label it with its class [5]. A comparison between machine learning and deep learning classifiers in the field of environmental sound recognition was made in [6]; the experimental results show that machine learning classifiers can be combined to achieve results similar to deep learning models, and even to outperform them in terms of accuracy. Urban sound classification algorithms could be further upgraded for use in many real-time applications, contributing to a feasible and deployable real-time sound classification system [7].

Based on this review, this paper investigates the accuracy achieved when using five audio parameters and three ML algorithms for urban sound event classification, yielding a collection of 48 different model combinations. The focus is on the feature extraction process and the accuracy achieved with different audio parameters. The accuracy results guided the choice of the set of parameters that achieves the highest accuracy in the classification process. Applying hyperparameter optimization to the ML algorithms resulted in predictions with more than 90% accuracy.

The structure of this paper is as follows: first, the design of the classification system and the dataset used for urban sound classification are presented; afterwards, the steps for extracting the features using the chosen audio parameters are described. Next, the machine learning algorithms are briefly explained. Finally, the results are discussed, and conclusions and future work are proposed.

2. DESIGN OF THE URBAN SOUND CLASSIFICATION SYSTEM
For the purpose of this study, the classification of urban sound events is established through two main processes: audio parametrization, which extracts the important features from the audio file, and classification using ML algorithms. The applied architecture of the classification system is shown in Figure 1.

First, the important features were extracted from the audio files in the dataset by applying 5 different audio parameters. By combining the different audio parameters, the goal is to find the most suitable set of parameters, i.e., the one that achieves the highest classification accuracy. After the audio parametrization, the machine learning process is applied for training and testing the system with a supervised learning method, using the support vector machines, random forest, and naïve Bayes algorithms.

Figure 1: Classification system architecture

The UrbanSound8K dataset, created by a team of researchers in [8], was used as the input for training and testing the system. It consists of 8725 labeled audio files representing 10 classes of urban sound events, with a total duration of almost 9 hours; each sound event lasts at most 4 seconds. The dataset covers 10 classes of disturbing urban sounds: siren, air conditioner, dog bark, street music, drilling, children playing, gun shot, car horn, jackhammer, and engine idling. Four important properties of the database were analyzed to gain deeper insight into the dataset: time duration, sample rate, bit depth, and the distribution of the sound classes.
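As an illustration of how these dataset properties could be collected, the short script below reads the sample rate, bit depth, and duration of every file. This is a minimal sketch, not the authors' code; it assumes Python with the pandas and soundfile libraries and the standard UrbanSound8K folder layout with its metadata CSV.

```python
# Sketch: collecting the four dataset properties discussed above.
# Assumes the standard UrbanSound8K layout (audio/fold1..fold10 plus
# metadata/UrbanSound8K.csv); paths are illustrative.
from collections import Counter
from pathlib import Path

import pandas as pd
import soundfile as sf

root = Path("UrbanSound8K")
meta = pd.read_csv(root / "metadata" / "UrbanSound8K.csv")

# Class distribution comes straight from the metadata file.
print(meta["class"].value_counts(normalize=True))

rates, depths, durations = Counter(), Counter(), []
for row in meta.itertuples():
    info = sf.info(str(root / "audio" / f"fold{row.fold}" / row.slice_file_name))
    rates[info.samplerate] += 1   # e.g. 44100 or 48000 Hz
    depths[info.subtype] += 1     # e.g. PCM_16 or PCM_24 (bit depth)
    durations.append(info.duration)

print(rates.most_common(3), depths.most_common(3))
print("3-4 s clips:", sum(3.0 <= d <= 4.0 for d in durations) / len(durations))
```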
Figure 2 shows the results of this analysis. From the percentage distribution of each class in the whole dataset, it can be noticed that the classes are represented in approximately equal proportions, except for the gun shot and car horn classes. The sampling rate is the number of audio samples recorded per second, expressed in Hz. Most of the sounds (around 90%) have a sample rate of 48 000 or 44 100 Hz, indicating good sound quality. The bit depth refers to the resolution of the sound data captured and stored in the audio file; most of the sound events have a resolution between 16 and 24 bits. Regarding the time duration, the recorded sounds are limited to 4 seconds, and most of them (around 85.52%) last between 3 and 4 seconds.

Figure 2: Analysis of audio files from the UrbanSound8K dataset

When parameterizing sound events and forming a feature vector, it is very important that all data have the same duration. Therefore, normalization and zero padding are applied to the signals during processing. This technique does not hinder the algorithms in recognizing features; on the contrary, it improves the performance of the machine learning algorithms.

3. AUDIO PARAMETRIZATION
Audio parametrization is used to extract meaningful information from audio signals, whose properties and similarities can hardly be derived from the waveforms directly. A key requirement for audio classification is the extraction of appropriate audio parameters that represent and discriminate the signal features.

In the feature extraction process, the chosen audio parameters were extracted and the audio was converted into a feature vector, represented as an array of numbers that can be used further in the classification process. First, the continuous signal was sampled into a discrete signal at a rate of 22 050 Hz and divided into short time frames. Then, additional processing techniques were applied to extract the five audio parameters. The steps for extracting the features with each audio parameter are shown in Figure 3, and a code sketch of the complete extraction step follows the parameter list below.

Figure 3: Steps for extracting the features using 5 audio parameters

The used audio parameters are:
1. Mel Frequency Cepstral Coefficients (MFCC). The coefficients are derived from a cepstral representation of the audio file. After the fast Fourier transform, the Mel scale is applied using triangular overlapping windows. The Mel scale is a perceptual scale of pitches judged by listeners to be equally spaced, and it is widely used for processing environmental sound events. By taking the logs of the powers and applying the discrete cosine transform, 40 MFCC coefficients are extracted. While extracting the MFCC, the values are padded at the edges of each axis with a constant value. Scaling is also applied to effectively reduce the dimension.
2. Mel Spectrogram (MS). This audio parameter visualizes the spectrum of frequencies over the time of the audio signal based on the Mel scale. First, the audio signal is converted from the time to the frequency domain using the fast Fourier transform. By converting the frequency on the y-axis to a log scale and the amplitude into a color dimension, a spectrogram is formed. Applying the Mel scale to the frequency axis yields the Mel spectrogram, which allows a better interpretation of the processed image. For the purpose of this paper, 128 coefficients are extracted.
3. Chromagram (Ch). This feature is related to the perception of pitch and represents the spectrum-based energy across the 12 pitch classes of one octave. The computation process is shown in Figure 3; 12 coefficients are extracted. The Chromagram appears to provide a more direct measure of pitch-related variations and gives higher prediction accuracy than the Mel spectrogram feature.
4. Spectral Contrast (SC). The spectral contrast is the decibel difference between spectral peaks (which generally correspond to harmonic content in music) and valleys (where non-harmonic or noise components dominate), measured in sub-bands defined by octave-scale filters [9]. As shown in Figure 3, the fast Fourier transform is first applied to the digital samples to obtain the spectrum. The frequency domain is then divided into sub-bands by several octave-scale filters, and the strength of the spectral peaks, valleys, and their differences is estimated in each sub-band. After translation into the logarithmic domain, the raw spectral contrast feature is mapped to an orthogonal space by the Karhunen-Loève transform, which eliminates the relativity among the different dimensions.
5. Tonal Centroid (TC). The tonal centroid detects changes in the harmonic content of the audio signal [10]. As all six dimensions of the tonal centroid vector are important, 6 coefficients were extracted, each representing a single dimension of the vector.

The feature vector for each audio file is formed using different combinations of the chosen audio parameters, with the aim of achieving high accuracy in distinguishing the classes of sound events. The MFCC parameter is used as the base audio parameter, as it has been widely employed in the field of urban sound classification because of its high accuracy. By combining the different audio parameters, different feature vectors were formed and applied to the ML classifiers, as sketched below.
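The extraction step described above can be summarized in code. The paper does not name its software tools, so the sketch below assumes the librosa library, whose default settings match the reported coefficient counts (40 MFCC, 128 Mel bands, 12 chroma bins, 7 spectral contrast values, and 6 tonal centroid dimensions, i.e., 193 coefficients in total); the time-axis averaging used to collapse the frames into a fixed-length vector is likewise an assumption.

```python
# Sketch of the feature extraction for one audio file. librosa is assumed
# because its default settings match the coefficient counts reported above.
import librosa
import numpy as np

def extract_features(path: str) -> np.ndarray:
    # Resample to the 22 050 Hz rate used in the processing step above.
    y, sr = librosa.load(path, sr=22050)
    stft = np.abs(librosa.stft(y))

    # Each extractor returns a (n_coefficients, n_frames) matrix; averaging
    # over the time axis collapses it into one value per coefficient (a
    # common, but here assumed, way to obtain a fixed-length vector).
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1
    )

    # MFCC + Chroma + Spectral Contrast alone give the 59-coefficient vector
    # highlighted in Section 5; all five parameters together give 193.
    return np.concatenate([mfcc, chroma, contrast, tonnetz, mel])
```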
4. MACHINE LEARNING ALGORITHMS
The support vector machines, random forest, and naïve Bayes algorithms were chosen for the sound event classification process. Since the audio files in the UrbanSound8K dataset contain a certain amount of noise, algorithms known to achieve high accuracy on noisy datasets were selected. According to the research in [11], the Support Vector Machines and Random Forest classifiers have a high tolerance for noise in datasets. The Naïve Bayes classifier has also been shown to be robust against data noise and is recommended for noisy data [12].

Random Forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, in order to create an uncorrelated forest of trees whose joint prediction is highly accurate. As each decision tree in the forest receives different training data, the trees are not identical and may predict different answers for the same input; the forest lets each of its trees vote on a classification, and the class is determined by the votes of the individual trees.

The Support Vector Machines algorithm is typically used for two-class classification, but it has been shown to achieve high accuracy in multi-class classification as well (as in this case, with 10 classes). The SVM classifier is trained on labeled training data, which it maps into a multi-dimensional space for the multi-class classification. Among classification methods, SVM can find the best compromise between model complexity and learning ability based on the limited information in the samples, improving the recognition accuracy and reducing the computational workload [13].

The Naïve Bayes classification algorithm is based on Bayes' theorem with an assumption of independence among predictors. The NB classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature, i.e., every pair of features is independent. Even though this independence assumption rarely holds in the real world, the NB classifier is known to reach high accuracy levels.

5. RESULTS AND DISCUSSION
By combining the five audio parameters, 16 different feature vector combinations were tested with the three machine learning algorithms and compared with respect to the achieved accuracy. Table 1 shows the results. The system was trained on 90% of the data and tested on the remaining 10% of unseen data.

Table 1: Accuracy results of the classification using different audio parameters for three different classifiers

Feature set                Random Forest   Support Vector Machines   Naïve Bayes
MFCC                       55.07%          51.05%                    47.19%
MFCC + SC                  63.56%          58.78%                    50.53%
MFCC + TC                  57.22%          50.29%                    45.04%
MFCC + Ch                  60.08%          53.17%                    47.72%
MFCC + MS                  42.16%          49.91%                    22.33%
MFCC + SC + TC             65.47%          58.66%                    47.91%
MFCC + SC + Ch             66.31%          59.38%                    50.78%
MFCC + SC + MS             58.12%          55.45%                    26.33%
MFCC + TC + Ch             61.31%          52.78%                    48.92%
MFCC + TC + MS             62.22%          52.92%                    48.02%
MFCC + Ch + MS             57.58%          52.21%                    26.16%
MFCC + SC + TC + Ch        65.88%          58.42%                    48.98%
MFCC + SC + TC + MS        60.45%          55.13%                    29.92%
MFCC + SC + Ch + MS        61.41%          57.59%                    26.28%
MFCC + TC + Ch + MS        58.13%          56.72%                    26.22%
All 5 parameters           62.60%          55.67%                    26.40%

The MFCC parameter was used as the base audio parameter, as it achieves relatively high accuracy on its own: 55.07% for the RF, 51.05% for the SVM, and 47.19% for the NB classifier.

First, the MFCC parameter was combined and tested separately with each of the other four parameters. Combining MFCC with the Mel spectrogram lowers the prediction accuracy, especially for the Naïve Bayes classifier, where the achieved accuracy drops by more than 50% compared with MFCC alone. Adding the spectral contrast feature increases the classification accuracy for all classifiers. The tonal centroid and the chromagram also tend to achieve higher accuracy than MFCC alone, with the chromagram giving the slightly larger improvement.

Next, the MFCC feature was tested in combination with two additional audio parameters. Of the six possible combinations, the highest accuracy was achieved when combining MFCC, Chromagram, and Spectral Contrast, forming a feature vector of 59 coefficients (40 MFCC, 12 Chromagram, and 7 Spectral Contrast). The Random Forest classifier reaches the highest accuracy with 66.31%, followed by the SVM with 59.38% and the NB classifier with 50.78%. This result was expected, since the Chromagram and Spectral Contrast features also increased the accuracy of the AED/C system when used separately.
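For reference, the evaluation loop behind Table 1 might look like the following. This is a sketch only: scikit-learn and a Gaussian naïve Bayes variant are assumptions, and X_parts (a per-parameter feature matrix for every file) and y (the class labels) are hypothetical names for data prepared with the extraction step sketched in Section 3.

```python
# Sketch of the evaluation loop behind Table 1, assuming scikit-learn.
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

classifiers = {
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
    "NB": GaussianNB(),  # a Gaussian variant of naive Bayes is assumed
}

extras = ["SC", "TC", "Ch", "MS"]
for r in range(len(extras) + 1):        # 1 + 4 + 6 + 4 + 1 = 16 feature sets
    for combo in combinations(extras, r):
        # MFCC is the base parameter; the others are concatenated to it.
        X = np.hstack([X_parts["MFCC"]] + [X_parts[p] for p in combo])
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.10, stratify=y, random_state=0
        )
        for name, clf in classifiers.items():
            acc = clf.fit(X_tr, y_tr).score(X_te, y_te)  # refit per feature set
            print(" + ".join(["MFCC", *combo]), name, f"{acc:.2%}")
```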
It is again noticeable that the Mel spectrogram significantly reduces the accuracy of the system for all classifiers.

Afterwards, four combinations of four audio parameters were tested. As the table shows, the accuracy decreases, especially when the Mel spectrogram is combined with the other parameters. Although the Mel spectrogram is known to show promising results with neural network classifiers, the results in this paper confirm that it significantly reduces the accuracy with the NB, RF, and SVM classifiers.

When all five parameters are used, forming a feature vector of 193 coefficients, the accuracy decreases further, especially for the NB classifier. Although this configuration does not achieve the highest accuracy, the results are still higher than those in [4].

From this analysis, it can be concluded that the highest accuracy for all three classifiers is achieved with the MFCC, Chromagram, and Spectral Contrast audio parameters.

When analyzing the per-class results, the 'engine idling' sound event has the highest accuracy, while the largest errors occur for the 'street music' and 'air conditioner' classes. Listening to the sounds of these classes suggests an explanation: 'street music' contains many different elements and types of sound, whereas the 'engine idling' recordings are similar to one another across the audio files that represent this event.

Although the achieved accuracy is the highest of all 16 combinations, the model still needed improvement. To this end, the next phase consisted of applying hyperparameter optimization to the ML algorithms in order to improve the model and achieve higher accuracy. The hyperparameter optimization was applied to all three algorithms using MFCC, spectral contrast, and chromagram as audio parameters. After the optimization, an accuracy of 92.9% was obtained for the SVM algorithm, 91.53% for the RF algorithm, and 53.68% for the NB algorithm.

Figure 4 shows the confusion matrices between the actual and predicted results for the most successful model of each of the three classifiers after the applied hyperparameter optimization. The vertical axis shows the actual class, while the horizontal axis shows the predicted class of the results obtained while testing the AED/C system.

From the confusion matrices it can be concluded that the models mostly make the same kinds of mistakes a human might: street music is confused with children playing, engine idling with the air conditioner, and drilling with the jackhammer. This similarity indicates that the feature extraction techniques represent the data in a way similar to human perception.

Figure 4: Confusion matrices of the tested results for the MFCC, Chromagram and Spectral Contrast features (1. Random Forest, 2. Support Vector Machines, 3. Naïve Bayes classifier)
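The optimization step described above could be reproduced along the following lines. Again a sketch: the paper does not state which hyperparameters were searched or by what method, so scikit-learn's GridSearchCV and the parameter grid below are purely illustrative, and the 90/10 split (X_tr, X_te, y_tr, y_te) is reused from the previous sketch.

```python
# Sketch of the hyperparameter-optimization step for the SVM; the grid
# below is illustrative, not the grid the paper actually searched.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_tr, y_tr)

best = search.best_estimator_
print(search.best_params_, best.score(X_te, y_te))

# Rows are actual classes, columns predicted classes, as in Figure 4.
print(confusion_matrix(y_te, best.predict(X_te)))
```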
6. CONCLUSIONS
This paper proposes advanced methods for the detection of noise events in urban areas by creating a system for the classification of urban sound events. By changing the audio parameters during the feature extraction process, different feature vectors were created and used as input to three different machine learning algorithms, resulting in different accuracies of the tested models. The applied methodology shows that a key role in building a successful system for the recognition and classification of sound events is played by the choice of sound event parameters, as well as by the choice of machine learning algorithms and the applied hyperparameter optimization. To reach high accuracy, hyperparameter optimization had to be performed, resulting in accuracies of 92.9% with the SVM algorithm and 91.53% with the RF algorithm. Naïve Bayes showed low accuracy with the chosen audio parameters and the urban sound dataset. Future research will focus on further validation and implementation of these classification systems in real smart-city applications using Internet of Things technology.

7. REFERENCES
1. Socoró, Joan Claudi, et al. "B3-Report describing the ANED algorithms for low and high computation capacity sensors." 2016.
2. Mitrović, Dalibor, Matthias Zeppelzauer, and Christian Breiteneder. "Features for content-based audio retrieval." Advances in Computers, Vol. 78, Elsevier, 2010, pp. 71-150.
3. Valero, Xavier, and Francesc Alías. "Narrow-band autocorrelation function features for the automatic recognition of acoustic environments." The Journal of the Acoustical Society of America 134.1 (2013): 880-890.
4. Chang, C., and B. Doran. "Urban Sound Classification: With Random Forest, SVM, DNN, RNN and CNN Classifiers." CSCI E-81 Machine Learning and Data Mining Final Project, Fall 2016, Harvard University, Cambridge.
5. Agarwal, I., P. Yadav, N. Gupta, and S. Yadav. "Urban Sound Classification Using Machine Learning and Neural Networks." In: Mahapatra R.P., Panigrahi B.K., Kaushik B.K., Roy S. (eds), Proceedings of the 6th International Conference on Recent Trends in Computing 2021, Lecture Notes in Networks and Systems, vol. 177, Springer, Singapore.
6. Lhoest, Lancelot, et al. "MosAIc: A Classical Machine Learning Multi-Classifier Based Approach against Deep Learning Classifiers for Embedded Sound Classification." Applied Sciences 11.18 (2021): 8394.
7. Baucas, Marc Jayson, and Petros Spachos. "Using cloud and fog computing for large scale IoT-based urban sound classification." Simulation Modelling Practice and Theory 101 (2020): 102013.
8. Salamon, Justin, Christopher Jacoby, and Juan Pablo Bello. "A dataset and taxonomy for urban sound research." Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
9. Sethares, William A., Robin D. Morris, and James C. Sethares. "Beat tracking of musical performances using low-level audio features." IEEE Transactions on Speech and Audio Processing 13.2 (2005): 275-285.
10. Harte, Christopher, Mark Sandler, and Martin Gasser. "Detecting harmonic change in musical audio." Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, 2006.
11. Schooltink, Willem Theodorus. Testing the Sensitivity of Machine Learning Classifiers to Attribute Noise in Training Data. BS thesis, University of Twente, 2020.
12. Stribos, Reinier H. The Impact of Data Noise on a Naive Bayes Classifier. BS thesis, University of Twente, 2021.
13. Jianfeng, Xi, et al. "A classification and recognition model for the severity of road traffic accident." Advances in Mechanical Engineering 11.5 (2019).