A Study on sound source localization inside a structure using a domain transfer model for real-world adaption of a trained model

Shunsuke Kita¹
Osaka Research Institute of Industrial Science and Technology
2-7-1 Ayumino, Izumi-shi, Osaka, 594-1157, Japan
Faculty of Engineering Science, Kansai University
3-3-35 Yamate, Suita-shi, Osaka, 564-8680, Japan

Yoshinobu Kajikawa²
Faculty of Engineering Science, Kansai University
3-3-35 Yamate, Suita-shi, Osaka, 564-8680, Japan

ABSTRACT
In this study, we propose a method for adapting a sound source localization model trained on simulation data to real-world data, within a previously developed method for localizing a source inside a structure. The localization model is a deep neural network or a convolutional neural network that predicts the source position inside the structure from the frequency spectra measured by accelerometers on the outer surface of the structure. The proposed method uses a domain transfer model that transforms real data into pseudo-simulation data to improve the localization performance of the trained model. The domain transfer model is built from an autoencoder or a deep convolutional autoencoder. The performance of both models is evaluated using real data under semi-supervised conditions. The deep convolutional autoencoder led the sound source localization model to a higher-than-baseline performance.

1. INTRODUCTION
Sound source localization (SSL) methods are used to estimate the positions of noise sources in machinery and other equipment, and are important for reducing the noise level of products. Currently, the most common SSL method uses a microphone array to estimate the position of the source based on the time difference of arrival of acoustic signals [1, 2].
In recent years, several methods have been proposed that incorporate deep learning and handle various scenarios that have been a challenge for conventional methods [3]. However, these methods assume that the sound source and the microphone exist in the same acoustic space. This study concerns the problem of estimating the position of a source inside a structure from the outside. In this case, the sound waves are observed indirectly because the source and the microphone exist in different acoustic spaces; therefore, conventional methods cannot be applied because microphone independence is not satisfied.

¹ kitas@orist.jp
² kaji@kansai-u.ac.jp
inter.noise, 21-24 August, Scottish Event Campus, Glasgow

Figure 1: Framework of the proposed method. Panels (a), (b), and (c).

To solve this problem, an SSL method inside a structure using Deep Neural Networks (DNN) and Computer-Aided Engineering (CAE) has been developed [4]. Numerical and experimental results show that it is possible to estimate the position of the sound source from the frequency spectrum of the accelerometers (FSA) measured by acceleration sensors installed on the structure's exterior. Nevertheless, the developed framework still fails to adapt the network trained in the simulation domain to the real environment domain. To address this issue, we propose a domain transfer model responsible for domain transformation under semi-supervised conditions. The model transforms real data into pseudo-simulated data by using real and simulation data as datasets within the framework. In this study, an autoencoder (AE) [5] and a deep convolutional autoencoder (DCAE) [6] are considered as transfer models, and the dependence of performance on the amount of semi-supervised data is examined.

2. PROPOSED METHOD WITH DOMAIN TRANSFER MODEL
A domain transfer model is incorporated into the framework for SSL inside a structure.

2.1. Framework for SSL inside structure
As shown in Figure 1, the framework for SSL inside a structure is implemented by the following three steps:

(a) Data generation by CAE. A coupled acoustic-structure analysis is used to generate datasets that consist of data observed outside the structure and the position of the sound source. For example, the finite element method is used to generate analytical data such as acceleration signals on the exterior surface of the structure and acoustic signals around the structure, corresponding to acoustic excitation at the source position.

(b) Training of SSL model. The analytical data obtained from the coupled acoustic-structure analysis are defined as the input data for the DNN, and the source positions paired with the input data are defined as the labels. The input-output relationships are learned by the DNN.

(c) Prediction of source positions. The trained DNN constructed in the simulation is used in the real world for SSL.

Figure 2: Domain transfer model. Encoder and decoder blocks map measured real-world data to pseudo-simulated data, in vector format or in image format.

Figure 3: Experimental setup. Panels (a), (b), and (c); labels: Sen.1, Sen.2, Sen.3, sound source, speaker, acrylic box.

2.2. Domain transfer model under semi-supervised conditions
The real-world SSL performance of a simulation-trained DNN is significantly worse than that of a DNN trained on real data because of differences between simulated and real data. Additionally, the decision boundaries of a DNN constructed in the simulation domain (source domain) and in the real domain (target domain) are different [7,8].
Therefore, a domain transfer model is introduced into the framework to transform real data into simulated data, using paired samples from the two domains' datasets. In this study, the FSA is used as the observation data. The domain transfer model is constructed as an AE or a DCAE, with the real data as the input and the simulated data as the label. The transfer performance of the AE and the DCAE, which process vector and image data, respectively, is compared (Figure 2). Finally, SSL performance on real data is tested by applying the SSL model to the pseudo-simulated data that the domain transfer model produces from the real data.

3. DATASETS GENERATED FROM SIMULATION AND REAL DOMAINS
The datasets consist of FSAs observed by three accelerometers on the outer surface of the structure, paired with source position labels, and are collected from both the simulation and real domains. The subject is the acrylic box with a thickness of 3 mm shown in Figure 3. The size of the acoustic volume is 400 × 400 × 400 mm³. A situation is assumed in which acoustic excitation from a single source inside the structure is measured using three acceleration sensors attached to the outer surface of the structure. The three sensors are positioned asymmetrically with respect to the structure, at the same positions in the simulation and in the real setup.

Table 1: Analysis conditions
  Acrylic Young's modulus     27 MPa
  Acrylic density             1180 kg/m³
  Acrylic damping ratio       0.8
  Interval of sound source    50 mm
  Noise source                No.1 - No.512
  Observation                 Sen.1 - Sen.3
  Frequency range             10 Hz - 1.5 kHz

Table 2: Measurement conditions
  Interval of sound source    100 mm
  Noise source                No.1 - No.64
  Observation                 Sen.1 - Sen.3
  Input signal                Swept sinusoidal
  Frequency range             10 Hz - 1.5 kHz
  Sampling rate               4.8 kHz
  Sound pressure              90 dB at 1 m
  Subband width               10 Hz

The conditions of the simulation and the experiments are shown in Table 1 and Table 2, respectively.
The data for the simulation domain are generated by coupled acoustic-structure analysis using the finite element method. The solver is the full harmonic analysis in Ansys Mechanical. In the experiment, one speaker (Visaton FRS 7) is placed inside the acrylic box as the source. The acoustic excitation of the structure is measured by three acceleration sensors (Analog Devices ADXL354) installed on the outer surface of the structure. The sweep signal is played by the speaker via a sound card (Fireface UCX) and a speaker amplifier (LP-2024A+). The source positions are spaced at intervals of 50 mm in the simulation and 100 mm in the experiment, giving 512 and 64 source points, respectively. Each observed FSA is subbanded to 150 bins per accelerometer. When treated as vector data, the FSAs from the three sensors are concatenated into a single vector in the order Sen.1 to Sen.3, so the observation data for one source point have size 450 × 1. For the image data, the FSAs are arranged in columns as 150 × 3 data.

4. EXPERIMENTAL SETUP UNDER SEMI-SUPERVISED CONDITIONS APPLYING DOMAIN TRANSFER MODEL
The conditions for the domain transfer models are shown in Table 3, where "F-" represents fully connected layers and "C-" represents convolutional layers. When both input and label data are vector data, the AE is adopted as the domain transfer model. When both input and label data are image data, the DCAE is adopted. Batch normalization is applied between the layers of each model. The DCAE is applied with (2, 3) kernel size, (1, 1) stride, and (2, 1) 'same' padding. In References [9,10], the frequency response data are normalized from 0 to $2^{16} - 1$ after a hyperbolic tangent transformation. In this study, because the measured data are between 0 and 1, the only preprocessing is min-max normalization per dataset.

Table 3: Domain transfer model conditions
  Hidden layer      AE: F-400, F-350, F-300, F-250, F-200, F-100, F-100, F-200, F-250, F-300, F-350, F-400
                    DCAE: C-400, C-200, C-100, C-100, C-200, C-400
  Activation        Hidden layer: ReLU
                    Output: AE (Linear), DCAE (Sigmoid)
  Optimization      Adam: learning rate = 0.001 (β1 = 0.9, β2 = 0.999)
  Loss function     Mean squared error
  Initialization    He normal
  Batch size        5
  Epochs            1000
  Preprocessing     Min-max normalization
  Metrics           Hold-out validation
Note: AE is autoencoder, CNN is convolutional neural network, and DCAE is deep convolutional autoencoder.

The conditions of the SSL model are shown in Table 4. A DNN is used when the input data are vectors, and a convolutional neural network (CNN) is used when the input data are images. Optimization, preprocessing, and metrics are the same as in the domain transfer model conditions.

Table 4: Sound source localization model conditions
  Hidden layer      AE: F-400, F-350, F-300, F-200, F-100, F-50
                    CNN: C-256, C-128, C-64, C-32, C-16, F-200, F-150, F-100, F-50, F-25, F-20
  Activation        Hidden layer: ReLU
                    Output: Linear (Reg.), Softmax (Class)
  Loss function     Mean squared error (Reg.), Cross entropy error (Class)
  Initialization    He normal
  Batch size        50
  Epochs            1000

In the case of the classification problem, the total acoustic space is divided into eight acoustic subvolumes, and each subvolume is labeled according to the 1-of-K coding scheme. In the case of the regression problem, the X-, Y-, and Z-coordinates are directly defined as label data. The flowchart for evaluating the models' performance on the datasets is shown in Figure 4.
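As a minimal sketch of the two label formats described above, the following Python snippet builds 1-of-K class labels for eight subvolumes and coordinate labels for regression. The grid coordinates, the split of the 400 mm cube at its midplanes, and the function names `make_source_grid` and `octant_one_hot` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

BOX_SIZE_MM = 400.0  # edge length of the acoustic volume (Section 3)


def make_source_grid(interval_mm=100.0, size_mm=BOX_SIZE_MM):
    """Return (N, 3) source coordinates on a regular grid.

    Assumption: points sit at the centres of cubic cells of edge
    `interval_mm`; the actual source coordinates are not given in the text.
    """
    n = int(round(size_mm / interval_mm))
    axis = (np.arange(n) + 0.5) * interval_mm
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel(), zs.ravel()], axis=1)


def octant_one_hot(xyz, size_mm=BOX_SIZE_MM):
    """1-of-K labels (K = 8) from the octant that contains each source.

    Assumption: the eight subvolumes are obtained by halving the cube
    along each axis.
    """
    bits = (np.asarray(xyz) >= size_mm / 2.0).astype(int)  # (N, 3) array of 0/1
    class_index = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]
    return np.eye(8)[class_index]


coords = make_source_grid(interval_mm=100.0)  # regression labels, shape (64, 3)
classes = octant_one_hot(coords)              # classification labels, shape (64, 8)
```

Under these assumptions, a 100 mm interval yields 64 points and a 50 mm interval yields 512 points, which matches the source counts in Tables 1 and 2.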
Figure 4: Flowchart for evaluating sound source localization performance. (a) Training of domain transfer model, (b) test data transfer, (c) training of SSL model, (d) SSL.

The matrix $X$ of $D$-dimensional input vectors $x_i$ and the matrix $T$ of $K$-dimensional label vectors $t_i$ are given by

$$X = (x_1, x_2, x_3, \cdots, x_N)^{\mathsf{T}}, \tag{1}$$

$$T = (t_1, t_2, t_3, \cdots, t_N)^{\mathsf{T}}, \tag{2}$$

where $N$ is the number of samples. The FSAs of the simulation and real data are represented by $X_S$ and $X_R$, respectively. Similarly, their paired source position labels are denoted by $T_S$ and $T_R$. Here, the subscripts $S$ and $R$ represent the simulation and experimental data, respectively. The goal is to estimate $T_R$ from $X_R$ using an SSL model built on datasets obtained by simulation. In most cases, $X_S$ does not equal $X_R$, so the SSL models $f_{SS}$ and $f_{RR}$ for each domain are different, where $f_{SS}$ and $f_{RR}$ are models built in the simulation and real environment domains, respectively. As a result, SSL in real environments using the model trained on simulation is difficult. Therefore, a domain transfer model $h_{RS}$ is constructed to transfer the experimental data into simulation data. The model uses $N_{R(T)}$ pairs of simulated data $X_{S(T)}$ and experimental data $X_{R(T)}$ as training data (Figure 4 (a)). Here, the subscripts $S(T)$ and $R(T)$ represent the simulation and experimental training data, respectively. The domain transfer model transforms the experimental test data $X_{R(V)}$ into the transformed data $X_{RS(V)}$, which is the pseudo-simulation data (Figure 4 (b)). The subscripts $R(V)$ and $RS(V)$ denote the experimental test data and the transformed data, respectively.
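The data flow of Figure 4 (a) and (b) can be sketched as follows. This is only an illustration under stated assumptions: a ridge-regularized affine map stands in for the AE/DCAE, the arrays are random placeholders rather than measured or simulated FSAs, the 70% training ratio is an example value, and the names `hold_out_split`, `fit_transfer`, and `apply_transfer` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_R, D = 64, 450  # number of experimental samples, FSA vector length
# Placeholder data (assumption): simulated FSAs at the measured source
# positions, and "real" FSAs as a distorted, noisy copy of them.
X_S_paired = rng.random((N_R, D))
X_R = 0.8 * X_S_paired + 0.1 + 0.01 * rng.standard_normal((N_R, D))


def hold_out_split(n_samples, train_ratio, rng):
    """Randomly split sample indices into training (T) and test (V) parts."""
    order = rng.permutation(n_samples)
    n_train = int(round(train_ratio * n_samples))
    return order[:n_train], order[n_train:]


def fit_transfer(x_real, x_sim, ridge=1e-3):
    """Fit h_RS as a ridge-regularised affine map (stand-in for the AE/DCAE)."""
    a = np.hstack([x_real, np.ones((len(x_real), 1))])  # append a bias column
    gram = a.T @ a + ridge * np.eye(a.shape[1])
    return np.linalg.solve(gram, a.T @ x_sim)           # weights, shape (D + 1, D)


def apply_transfer(weights, x_real):
    """Transform real data into pseudo-simulation data."""
    a = np.hstack([x_real, np.ones((len(x_real), 1))])
    return a @ weights


train_idx, test_idx = hold_out_split(N_R, train_ratio=0.7, rng=rng)
h_RS = fit_transfer(X_R[train_idx], X_S_paired[train_idx])  # Figure 4 (a)
X_RS_V = apply_transfer(h_RS, X_R[test_idx])                # Figure 4 (b)
```

The linear map is chosen only to keep the sketch dependency-free; it shows which arrays play the roles of input, label, and transformed output, not how well an autoencoder would perform.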
The SSL model $f_{SS}$ is built using a dataset $(X_{S(T)}, T_{S(T)})$ with $N_{S(T)}$ samples (Figure 4 (c)). By providing the transformed test data $X_{RS(V)}$ as input data to the trained SSL model, the source positions in the real environment data are estimated (Figure 4 (d)). The amounts of data in the hold-out validation satisfy the following equations:

$$N_{R(T)} + N_{R(V)} = N_R, \quad N_R = 64, \tag{3}$$

where $N_R$ is the amount of experimental data, and

$$N_{S(T)} = N_S + N_{R(T)}, \quad N_S = 512, \tag{4}$$

where $N_S$ is the amount of simulation data. The accuracy (Acc.) shown in Eq. (5) is used to evaluate the classification problem, and the root mean square error (RMSE) shown in Eq. (6) is used for the regression problem.

$$\mathrm{Acc.} = \frac{\text{The number of correct answers}}{N_{R(V)}}, \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_{R(V)}} \left( T_{R(V)} - X_{RS(V)} w^{*} \right)^{\mathsf{T}} \left( T_{R(V)} - X_{RS(V)} w^{*} \right)}, \tag{6}$$

where $w^{*}$ is the weight of the trained model. The label data consist of the X-, Y-, and Z-coordinates; hence, the RMSE is expressed as Eq. (6) using the matrices $T_{R(V)}$ and $X_{RS(V)}$. The percentage of experimental data used as training data for the domain transfer model is varied from 10% to 90%, and the SSL performance is measured in each case.

5. PERFORMANCE OF DATA TRANSFER BY DOMAIN TRANSFER MODEL AND SOUND SOURCE PREDICTIONS BY SSL MODEL

Figure 5: FSA transformation of training and test data. Semi-supervised data is 70% of the total experimental data.

Figure 5 shows an example of data transformation by the AE. Red, blue, and green solid lines represent experimental, simulation, and transformed data, respectively. The figure shows that the resonance frequencies of the transformed data are shifted and that the transformed data are closer to the simulation data. However, the visualization by t-SNE [11] based on classification shows that domain matching is not achieved (Figure 6).
This can be understood from the fact that the AE cannot learn local features of the data, and the transformed data had some negative values. Figure 7 shows the t-SNE visualization of the data transformed by the DCAE. This distribution shows that most of the training data can be domain matched. These results show that the DCAE, which uses convolutional layers, is effective for domain transformation.

Figure 6: Visualization of transformed data by AE. Subscripts "T" and "V" denote training data and test data, respectively. Legend: Real, Sim., Transfer.

Figure 7: Visualization of transformed data by DCAE. Subscripts "T" and "V" denote training and test data, respectively. Legend: Real, Sim., Transfer. Axes: Component 1 (-), Component 2 (-).

Furthermore, Figure 8 shows the transfer performance of AE, DCAE, and DCAE-DA under semi-supervised conditions as evaluated by RMSE. The "-DA" denotes data augmentation by masking, that is, replacement with zero values. The RMSE of the model with convolutional layers is lower than that of the model with fully connected layers, and it decreases as the amount of semi-supervised data increases. The models with data augmentation also show higher generalization performance than the non-augmented models.

Finally, the performance of the SSL model in classification and regression is shown in Figure 9. The training data given to the model are all of the simulation data plus the semi-supervised experimental data; the validation data are the semi-supervised experimental data; and the test data are unseen experimental data. In classification, the accuracy for both the training and validation data is close to 100%. For the test data, the accuracy improves as the amount of semi-supervised data increases, but is only 57% at best. Nevertheless, the accuracy is above the baseline in all conditions. In regression, the RMSE for the training and validation data is above the spatial sampling of the simulation domain and below the spatial sampling of the real domain.
The RMSE for the test data is greater than the spatial sampling of the real environment. From these results, the regression model is still underfitted, and the CNN structure and hyperparameters need to be tuned.

Figure 8: RMSE of supervised conditions. Axes: RMSE (-) versus semi-supervised data (%), from 10% to 90%. Series: AE, DCAE, and DCAE-DA, each for training and test data.

Figure 9: Results of classification and regression problems. (a) Classification: accuracy (%) versus semi-supervised data (%), with the baseline and the best test accuracy of 57% marked. (b) Regression: RMSE versus semi-supervised data (%), with the spatial sampling of the real domain and of the simulation domain marked. Series: training, validation, and test.

6. CONCLUSIONS
The performance of the SSL method inside a structure is improved by transforming experimental data into pseudo-simulation data using the domain transfer model. The two-dimensional t-SNE visualization indicates that the DCAE has better transfer performance than the AE for the FSA of the structure's exterior. Furthermore, when masking by zero-value replacement is applied to the DCAE, the performance is further improved. In the case of classification, SSL accuracy exceeded the baseline for all semi-supervised data conditions. However, the RMSEs in the regression problem are higher than the spatial sampling in the real domain. This indicates that the CNN structure and hyperparameters need to be tuned in future research.

ACKNOWLEDGEMENTS
This work was financially supported by JSPS KAKENHI (20K14687, 22K03991).

REFERENCES
[1] Charles Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320–327, 1976.
[2] G. Clifford Carter. Coherence and time delay estimation.
Proceedings of the IEEE, 75(2):236–255, 1987.
[3] Pierre-Amaury Grumiaux et al. A survey of sound source localization with deep learning methods, 2021.
[4] Shunsuke Kita and Yoshinobu Kajikawa. Fundamental study on sound source localization inside a structure using a deep neural network and computer-aided engineering. Journal of Sound and Vibration, 513:116400, 2021.
[5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[6] Jonathan Masci et al. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.
[7] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[8] Jose G. Moreno-Torres et al. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.
[9] Tyler Dare. Experimental force reconstruction using a neural network and simulated training data. In Proceedings of 2020 International Congress on Noise Control Engineering, INTER-NOISE 2020, 2020.
[10] Tyler Dare. Experimental force reconstruction on plates of arbitrary shape using neural networks. In Proceedings of 2021 International Congress on Noise Control Engineering, INTER-NOISE 2021, 2021.
[11] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.