Institute of Acoustics: Paper Detail

Study on sound source localization inside a structure using a domain transfer model for real-world adaption of a trained model

Shunsuke Kita 1

Osaka Research Institute of Industrial Science and Technology, 2-7-1 Ayumino, Izumi-shi, Osaka, 594-1157, Japan; Faculty of Engineering Science, Kansai University, 3-3-35 Yamate, Suita-shi, Osaka, 564-8680, Japan

Yoshinobu Kajikawa 2

Faculty of Engineering Science, Kansai University, 3-3-35 Yamate, Suita-shi, Osaka, 564-8680, Japan

ABSTRACT

In this study, we propose a method for adapting a sound source localization model trained on simulation data to real-world data, within a previously developed method for localizing a source inside a structure. The localization model is a deep neural network or a convolutional neural network that predicts the source position inside the structure from the frequency spectra measured by accelerometers on the outer surface of the structure. The proposed method uses a domain transfer model that transforms real data into pseudo-simulation data to improve the source localization performance of the trained model. The domain transfer model is built from an autoencoder or a deep convolutional autoencoder. The performance of both models is evaluated using real data under semi-supervised conditions. The deep convolutional autoencoder led the sound source localization model to higher-than-baseline performance.

1. INTRODUCTION

Sound source localization (SSL) methods are used to estimate the positions of noise sources in machinery and other equipment, and are important for reducing the noise level of products. Currently, the most common SSL approach uses a microphone array to estimate the position of the source from the time difference of arrival of acoustic signals [1, 2]. In recent years, several methods have been proposed that incorporate deep learning and handle scenarios that have been a challenge for conventional methods [3]. However, these methods assume that the sound source and the microphone exist in the same acoustic space.

This study concerns the problem of estimating the position of a source inside a structure from the outside. In this case, the sound waves are observed indirectly because the source and microphone exist in different acoustic spaces; therefore, conventional methods cannot be applied because microphone independence is not satisfied. To solve this problem, an SSL method inside a structure using deep neural networks (DNN) and computer-aided engineering (CAE) has been developed [4]. Numerical and experimental results show that it is possible to estimate the position of the sound source from the frequency spectrum of the accelerometers (FSA) measured by acceleration sensors installed on the structure's exterior. Nevertheless, the developed framework still fails to adapt the network trained on the simulation domain to the real environment domain. To address this issue, we propose a domain transfer model responsible for domain transformation under semi-supervised conditions. The model transforms real data into pseudo-simulated data and is trained on paired real and simulation data within the framework. In this study, an autoencoder (AE) [5] and a deep convolutional autoencoder (DCAE) [6] are considered as transfer models, and the dependence of performance on the amount of semi-supervised data is examined.

1 kitas@orist.jp

2 kaji@kansai-u.ac.jp

inter.noise, 21-24 August, Scottish Event Campus, Glasgow

[Figure 1: panels (a), (b), (c). Labels: simulation data, structure, sound source, acceleration signal, sound signal, sound source position, trained DNN, measurement data in real world, prediction of source.]

Figure 1: Framework of the proposed method

2. PROPOSED METHOD WITH DOMAIN TRANSFER MODEL

A domain transfer model is incorporated into the framework for SSL inside a structure.

2.1. Framework for SSL inside structure

As shown in Figure 1, the framework for SSL inside a structure is implemented in the following three steps:

(a) Data generation by CAE. A coupled acoustic-structure analysis is used to generate datasets that consist of data observed outside the structure and the position of the sound source. For example, the finite element method is used to generate analytical data such as acceleration signals on the exterior surface of the structure and acoustic signals around the structure in response to acoustic excitation at the source position.

(b) Training of SSL model. The analytical data obtained from the coupled acoustic-structure analysis are defined as the input data for the DNN, and the positions of the sources paired with the input data are defined as the labels for the DNN. The input-output relationships are learned by the DNN.

(c) Prediction of source. The data measured in the real world are given to the trained DNN, which predicts the position of the sound source.

[Figure 2: encoder blocks and decoder blocks mapping measured real-world data to pseudo-simulated data, shown in vector format and in image format.]

Figure 2: Domain transfer model.

[Figure 3: panels (a), (b), (c). Labels: acrylic box, speaker, sound source, and sensors Sen.1 to Sen.3.]

Figure 3: Experimental setup.

2.2. Domain transfer model under semi-supervised conditions

The real-world SSL performance of a simulation-trained DNN is significantly worse than that of a DNN trained on real data because of differences between simulated and real data. Additionally, the decision boundaries of a DNN constructed in the simulation domain (source domain) and in the real domain (target domain) are different [7, 8]. Therefore, a domain transfer model is added to the framework to transform the real data into simulated data using paired data from the two domains. In this study, the FSA is used as the observation data. The domain transfer model is an AE or a DCAE trained with the real data as input and the simulated data as the label. The transfer performance of the AE and the DCAE, which process vector and image data, respectively, is compared (Figure 2). Finally, SSL performance on real data is tested by running the SSL model on the pseudo-simulated data that the domain transfer model produces from the real data.
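
The data flow of the transfer step can be illustrated with a minimal sketch. It assumes synthetic paired data and swaps in a linear least-squares map for the AE or DCAE; the data, the dimensions, and the helper names `fit_transfer` and `apply_transfer` are hypothetical and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical): "simulation" spectra and a distorted,
# noisy "real" version of the same source positions.
n_pairs, n_test, dim = 40, 10, 6
x_sim = rng.random((n_pairs + n_test, dim))
distortion = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
x_real = x_sim @ distortion + 0.05 + 0.01 * rng.standard_normal(x_sim.shape)


def fit_transfer(x_real_train, x_sim_train):
    """Fit a map real -> simulation (linear stand-in for the AE/DCAE)."""
    a = np.hstack([x_real_train, np.ones((len(x_real_train), 1))])  # bias column
    w, *_ = np.linalg.lstsq(a, x_sim_train, rcond=None)
    return w


def apply_transfer(w, x_real_new):
    """Transform real data into pseudo-simulation data."""
    a = np.hstack([x_real_new, np.ones((len(x_real_new), 1))])
    return a @ w


# Train on the paired (semi-supervised) part, transform the held-out part.
w = fit_transfer(x_real[:n_pairs], x_sim[:n_pairs])
x_pseudo_sim = apply_transfer(w, x_real[n_pairs:])

# Distance to the simulation domain before and after the transfer.
err_before = np.sqrt(np.mean((x_real[n_pairs:] - x_sim[n_pairs:]) ** 2))
err_after = np.sqrt(np.mean((x_pseudo_sim - x_sim[n_pairs:]) ** 2))
print(err_before, err_after)
```

In the framework, the pseudo-simulation data produced this way would then be passed to the SSL model trained on simulation data; a nonlinear model such as the AE or DCAE takes the place of the linear map.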

3. DATASETS GENERATED FROM SIMULATION AND REAL DOMAINS

The datasets are FSAs observed by three accelerometers on the outer surface of the structure, paired with the source position labels, and are collected from both the simulation and real domains. The subject is the acrylic box with a thickness of 3 mm shown in Figure 3. The size of the acoustic volume is 400 × 400 × 400 mm³. A situation was assumed in which acoustic excitation from a single source inside the structure is measured using three acceleration sensors attached to the outer surface of the structure. The three sensors are positioned asymmetrically with respect to the structure and are in the same positions in the simulation and the real setup.

Table 1: Analysis conditions

Parameter                  Value
Acrylic Young's modulus    27 MPa
Acrylic density            1180 kg/m³
Acrylic damping ratio      0.8
Interval of sound source   50 mm
Noise source               No.1 - No.512
Observation                Sen.1 - Sen.3
Frequency range            10 Hz - 1.5 kHz

Table 2: Measurement conditions

Parameter                  Value
Interval of sound source   100 mm
Noise source               No.1 - No.64
Observation                Sen.1 - Sen.3
Input signal               Swept sinusoidal
Frequency range            10 Hz - 1.5 kHz
Sampling rate              4.8 kHz
Sound pressure             90 dB at 1 m
Subband width              10 Hz

The conditions of the simulation and the experiment are shown in Table 1 and Table 2, respectively. The data for the simulation domain are generated by coupled acoustic-structure analysis using the finite element method. The finite element solver is the full harmonic analysis in Ansys Mechanical. In the experiment, one speaker (Visaton FRS 7) is placed inside the acrylic box as the source. The acoustic excitation of the structure is measured by three acceleration sensors (Analog Devices ADXL354) installed on the outer surface of the structure. The sweep signal is played by the speaker via a sound card (Fireface UCX) and a speaker amplifier (LP-2024A+). The source positions are spaced at intervals of 50 mm in the simulation and 100 mm in the experiment, giving 512 and 64 source points, respectively. Each observed FSA is subbanded to 150 bins per accelerometer. The FSAs measured at the three positions are concatenated in sequence from Sen. 1 to Sen. 3 when treated as vector data. Therefore, the size of the observation data for one source point is 450 × 1. For the image data, the FSAs are defined as 150 × 3 data arranged in columns.
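
The two data layouts described above can be sketched as follows. This is only an illustration with random placeholder spectra; the helper names `to_vector` and `to_image` are hypothetical, and the real values would come from the finite element analysis or the measurement.

```python
import numpy as np

N_BINS = 150     # subbands per accelerometer (from the text)
N_SENSORS = 3    # Sen. 1 to Sen. 3


def to_vector(fsa_per_sensor):
    """Concatenate the per-sensor spectra in the order Sen. 1, Sen. 2, Sen. 3."""
    return np.concatenate(fsa_per_sensor)        # 450 values


def to_image(fsa_per_sensor):
    """Arrange the per-sensor spectra as the columns of a 150 x 3 array."""
    return np.stack(fsa_per_sensor, axis=1)      # shape (150, 3)


rng = np.random.default_rng(0)
# Placeholder spectra, one array of 150 bins per sensor.
fsa = [rng.random(N_BINS) for _ in range(N_SENSORS)]

vec = to_vector(fsa)   # input format for the AE and the DNN
img = to_image(fsa)    # input format for the DCAE and the CNN
print(vec.shape, img.shape)
```

Both layouts hold the same 450 values; only the arrangement differs, which is what lets the convolutional models see neighbouring frequency bins and sensors together.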

4. EXPERIMENTAL SETUP UNDER SEMI-SUPERVISED CONDITIONS APPLYING DOMAIN TRANSFER MODEL

The conditions for the domain transfer models are shown in Table 3, where "F-" represents fully connected layers and "C-" represents convolutional layers. When both input and label data are vector data, the AE is adopted as the domain transfer model. When both input and label data are image data, the DCAE is adopted. Batch normalization is applied between the layers of each model. The DCAE uses a (2, 3) kernel size, a (1, 1) stride, and (2, 1) 'same' padding. In References [9, 10], the frequency response data are normalized from 0 to $2^{16} - 1$ after hyperbolic tangent transformation. In this study, because the measured data are between 0 and 1, the only preprocessing is min-max normalization per dataset.

Table 3: Domain transfer model conditions

Item             Setting
Hidden layer     AE: F-400, F-350, F-300, F-250, F-200, F-100,
                     F-100, F-200, F-250, F-300, F-350, F-400
                 DCAE: C-400, C-200, C-100, C-100, C-200, C-400
Activation       Hidden layer: ReLU
                 Output: AE (Linear), DCAE (Sigmoid)
Optimization     Adam: learning rate = 0.001 (β1 = 0.9, β2 = 0.999)
Loss function    Mean squared error
Initialization   He normal
Batch size       5
Epochs           1000
Preprocessing    Min-max normalization
Metrics          Hold-out validation

Note: AE is autoencoder, CNN is convolutional neural network, and DCAE is deep convolutional autoencoder.

The conditions of the SSL model are shown in Table 4. A DNN is used when the input data are vectors, and a convolutional neural network (CNN) is used when the input data are images. Optimization, preprocessing, and metrics are the same as the domain transfer model conditions. In the case of the classification problem, the total acoustic space is divided into eight acoustic subvolumes, and each subvolume is labeled according to the 1-of-K coding scheme. In the case of the regression problem, the X-, Y-, and Z-coordinates are directly defined as label data.

Table 4: Sound source localization model conditions

Item             Setting
Hidden layer     AE: F-400, F-350, F-300, F-200, F-100, F-50
                 CNN: C-256, C-128, C-64, C-32, C-16,
                      F-200, F-150, F-100, F-50, F-25, F-20
Activation       Hidden layer: ReLU
                 Output: Linear (Reg.), Softmax (Class)
Loss function    Mean squared error (Reg.), Cross entropy error (Class)
Initialization   He normal
Batch size       50
Epochs           1000

The flowchart for evaluating the models' performance on the datasets is shown in Figure 4.

[Figure 4 panels: (a) Training of domain transfer model, (b) Test data transfer, (d) SSL. Labels: real data $X_{R(T)}$ and $X_{R(V)}$, simulation data $X_{S(T)}$, transfer data $X_{RS(V)}$, domain transfer model $h_{RS}$, SSL model $f_{SS}$, labels $T_{S(T)}$ and $T_{RS(V)}$.]

Figure 4: Flowchart for evaluating sound source localization performance

The matrix $X$ of $D$-dimensional input vectors $x_i$ and the matrix $T$ of $K$-dimensional label vectors $t_i$ are given by

$$X = (x_1, x_2, x_3, \cdots, x_N)^{\mathsf{T}}, \tag{1}$$

$$T = (t_1, t_2, t_3, \cdots, t_N)^{\mathsf{T}}, \tag{2}$$

where the subscript $N$ indicates the number of samples. The FSAs of the simulation and real data are represented by $X_S$ and $X_R$, respectively. Similarly, their paired source position labels are denoted by $T_S$ and $T_R$. Here, the subscripts $S$ and $R$ represent the simulation and experimental data, respectively. The goal is to estimate $T_R$ from $X_R$ using an SSL model built on datasets obtained by simulation. In most cases, $X_S$ does not equal $X_R$, so the SSL models $f_{SS}$ and $f_{RR}$ for each domain are different, where $f_{SS}$ and $f_{RR}$ are models built in the simulation and real environment domains, respectively. As a result, applying the model trained on simulation data to real environments is difficult. Therefore, a domain transfer model $h_{RS}$ is constructed to transfer the experimental data into simulation data. The model uses $N_{R(T)}$ pairs of simulated data $X_{S(T)}$ and experimental data $X_{R(T)}$ as training data (Figure 4 (a)). Here, the subscripts $S(T)$ and $R(T)$ represent the simulation and experimental training data, respectively. The domain transfer model transforms the experimental test data $X_{R(V)}$ into the transformed data $X_{RS(V)}$, which are the pseudo-simulation data (Figure 4 (b)). The subscripts $R(V)$ and $RS(V)$ denote the experimental test data and the transformed data, respectively. The SSL model $f_{SS}$ is built using a dataset $(X_{S(T)}, T_{S(T)})$ with $N_{S(T)}$ samples (Figure 4 (c)). By providing the transformed test data $X_{RS(V)}$ as input to the trained SSL model, the source positions in the real environment data are estimated (Figure 4 (d)). The amounts of data in the hold-out validation satisfy the following equations:

$$N_{R(T)} + N_{R(V)} = N_R, \quad N_R = 64, \tag{3}$$

where $N_R$ is the amount of experimental data.

$$N_{S(T)} = N_S + N_{R(T)}, \quad N_S = 512, \tag{4}$$

where $N_S$ is the amount of simulation data. The accuracy (Acc.) shown in Eq. (5) is used to evaluate the classification problem, and the root mean square error (RMSE) shown in Eq. (6) is used for the regression problem.

$$\mathrm{Acc.} = \frac{\text{The number of correct answers}}{N_{R(V)}} \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_{R(V)}} \left( T_{R(V)} - X_{RS(V)} w^{*} \right)^{\mathsf{T}} \left( T_{R(V)} - X_{RS(V)} w^{*} \right)}, \tag{6}$$

where $w^{*}$ is the weight of the trained model. The label data comprise the X-, Y-, and Z-coordinates; hence, the RMSE is expressed as Eq. (6) using the matrices $T_{R(V)}$ and $X_{RS(V)}$. The percentage of experimental data used as training data for the domain transfer model is varied from 10% to 90%, and the SSL performance is measured in each case.
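
The bookkeeping of the hold-out split, the eight-class labels, and the two metrics can be sketched as below. Several details are assumptions made for illustration only: the rounding of the training share, the numbering of the eight subvolumes, and the reduction of the matrix RMSE to a scalar by summing squared errors over the X-, Y-, and Z-coordinates. All function names are hypothetical.

```python
import numpy as np

N_R = 64     # experimental source points, Eq. (3)
N_S = 512    # simulation source points, Eq. (4)


def holdout_sizes(train_fraction):
    """Return (N_R(T), N_R(V), N_S(T)) for a given share of experimental training data."""
    n_rt = int(round(train_fraction * N_R))   # rounding rule is an assumption
    n_rv = N_R - n_rt                         # Eq. (3)
    n_st = N_S + n_rt                         # Eq. (4)
    return n_rt, n_rv, n_st


def octant_one_hot(xyz, side=400.0):
    """1-of-K label (K = 8) for the subvolume of a cube of edge `side` containing xyz."""
    bits = (np.asarray(xyz, dtype=float) >= side / 2).astype(int)  # one bit per axis
    k = bits[0] + 2 * bits[1] + 4 * bits[2]   # assumed subvolume numbering
    label = np.zeros(8, dtype=int)
    label[k] = 1
    return label


def accuracy(t_true, t_pred):
    """Eq. (5): share of test samples whose predicted class matches the label."""
    return float(np.mean(np.argmax(t_true, axis=1) == np.argmax(t_pred, axis=1)))


def rmse(t_true, t_pred):
    """Scalar RMSE: squared errors summed over coordinates, averaged over samples."""
    diff = np.asarray(t_true, dtype=float) - np.asarray(t_pred, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2) / diff.shape[0]))


for share in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(share, holdout_sizes(share))
```

With this convention the RMSE has the same unit as the coordinates, so it can be compared directly with the source spacing of each domain.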

5. PERFORMANCE OF DATA TRANSFER BY DOMAIN TRANSFER MODEL AND SOUND SOURCE PREDICTIONS BY SSL MODEL

Figure 5: FSA transformation of training and test data. Semi-supervised data is 70% of the total experimental data.

Figure 5 shows an example of data transformation by the AE. Red, blue, and green solid lines represent experimental, simulation, and transformed data, respectively. The figure shows that the resonance frequencies of the transformed data are shifted and that the transformed data lie closer to the simulation data. However, the visualization by t-SNE [11] based on classification shows that domain matching is not achieved (Figure 6). This can be understood from the fact that the AE cannot learn local features of the data, and the transformed data had some negative values. Figure 7 shows the t-SNE visualization of the data transformed by the DCAE. This distribution shows that most of the training data can be domain matched. These results show that the DCAE, which uses convolutional layers, is effective for domain transformation.

[Figure 6: t-SNE scatter plot; axes: Component 1 (-), Component 2 (-); legend: Real, Sim., Transfer.]

Figure 6: Visualization of transformed data by AE. Subscripts "T" and "V" denote training data and test data, respectively.

[Figure 7: t-SNE scatter plot; axes: Component 1 (-), Component 2 (-); legend: Real, Sim., Transfer.]

Figure 7: Visualization of transformed data by DCAE. Subscripts "T" and "V" denote training and test data, respectively.

Furthermore, Figure 8 shows the transfer performance of the AE, DCAE, and DCAE-DA under semi-supervised conditions as evaluated by RMSE. The "-DA" denotes data augmentation by masking, that is, replacement with zero values. The RMSE of the model with convolutional layers is lower than that of the model with fully connected layers, and it decreases as the amount of semi-supervised data increases. The models with data augmentation also show higher generalization performance than the non-augmented models.

Finally, the performance of the SSL model in classification and regression is shown in Figure 9. The training data given to the model are all the simulation data and the semi-supervised experimental data, the validation data are the semi-supervised experimental data, and the test data are unknown experimental data. In classification, the accuracy for both the training and validation data is close to 100%. For the test data, the accuracy improved as the amount of semi-supervised data increased, but is only 57% at best. However, the accuracy is above the baseline in all conditions. In regression, the RMSE for the training and validation data is above the spatial sampling of the simulation domain (50 mm) and below the spatial sampling of the real domain (100 mm). The RMSE for the test data is greater than the spatial sampling of the real domain. These results indicate that the regression model is still underfitted and that the CNN structure and hyperparameters need to be tuned.
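
The masking augmentation denoted by "-DA" can be sketched as below. The masked share, the number of copies, and the function name `mask_augment` are assumptions chosen for illustration; the text only states that masking replaces values with zero.

```python
import numpy as np


def mask_augment(x_real, x_sim, mask_fraction=0.1, n_copies=4, seed=0):
    """Return the original pairs plus n_copies pairs whose inputs are randomly masked.

    Masked input bins are replaced with zero; the simulation targets are unchanged.
    """
    rng = np.random.default_rng(seed)
    x_real = np.asarray(x_real, dtype=float)
    x_sim = np.asarray(x_sim, dtype=float)
    inputs, targets = [x_real], [x_sim]
    for _ in range(n_copies):
        keep = rng.random(x_real.shape) >= mask_fraction  # False marks a masked bin
        inputs.append(x_real * keep)
        targets.append(x_sim)
    return np.concatenate(inputs, axis=0), np.concatenate(targets, axis=0)


# Placeholder pairs: 8 samples in the 150 x 3 image layout.
rng = np.random.default_rng(1)
real = rng.random((8, 150, 3))
sim = rng.random((8, 150, 3))
real_aug, sim_aug = mask_augment(real, sim)
print(real_aug.shape, sim_aug.shape)
```

Because only the inputs are masked, the transfer model is pushed to reconstruct the full simulated spectrum from incomplete real spectra, which is one plausible reason such augmentation helps generalization.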

[Figure 8: RMSE (-) versus semi-supervised data (10, 30, 50, 70, 90%); series: AE (Training), AE (Test), DCAE (Training), DCAE (Test), DCAE-DA (Training), DCAE-DA (Test).]

Figure 8: RMSE of supervised conditions.

[Figure 9: (a) Classification: accuracy (%) versus semi-supervised data (%), series Training, Validation, Test, with the baseline and the 57% level marked. (b) Regression: versus semi-supervised data (%), with the spatial sampling of the real domain and of the simulation domain marked.]

Figure 9: Results of classification and regression problems.

6. CONCLUSIONS

The performance of the SSL method inside a structure is improved by transforming experimental data into pseudo-simulation data using the domain transfer model. The two-dimensional t-SNE visualization indicates that the DCAE had better transfer performance than the AE for the FSA of the structure's exterior. Furthermore, when masking by zero-value replacement is applied to the DCAE, the performance is further improved. In the case of classification, SSL accuracy exceeded the baseline for all semi-supervised data conditions. However, the RMSEs in the regression problem are higher than the spatial sampling in the real domain. This indicates that the CNN structure and hyperparameters need to be tuned in future research.

ACKNOWLEDGEMENTS

This paper is financially supported by JSPS KAKENHI (20K14687, 22K03991).

REFERENCES

[1] Charles Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320–327, 1976.

[2] G. Clifford Carter. Coherence and time delay estimation. Proceedings of the IEEE, 75(2):236–255, 1987.

[3] Pierre-Amaury Grumiaux et al. A survey of sound source localization with deep learning methods, 2021.

[4] Shunsuke Kita and Yoshinobu Kajikawa. Fundamental study on sound source localization inside a structure using a deep neural network and computer-aided engineering. Journal of Sound and Vibration, 513:116400, 2021.

[5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[6] Jonathan Masci et al. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.

[7] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[8] Jose G. Moreno-Torres et al. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.

[9] Tyler Dare. Experimental force reconstruction using a neural network and simulated training data. In Proceedings of 2020 International Congress on Noise Control Engineering, INTER-NOISE 2020, 2020.

[10] Tyler Dare. Experimental force reconstruction on plates of arbitrary shape using neural networks. In Proceedings of 2021 International Congress on Noise Control Engineering, INTER-NOISE 2021, 2021.

[11] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
