Volume 44, Part 2

Spatial Interpolation of Early Room Impulse Responses Using Equivalent Source Method Based on Grouped Image Sources

Haruka Matsuhashi (21fmi18@ms.dendai.ac.jp), Izumi Tsunokuni (21udc02@ms.dendai.ac.jp), Yusuke Ikeda (yusuke.ikeda@mail.dendai.ac.jp)
Tokyo Denki University, 5 Senju-Asahi-Cho, Adachi-ku, Tokyo, Japan

inter.noise, 21-24 August, Scottish Event Campus, Glasgow

ABSTRACT
The measurement of room impulse responses (RIRs) is helpful in various acoustical applications. In particular, multi-point measurements of RIRs are useful for applications that require spatial information. However, such measurements are difficult because they require many measurement positions and large-scale microphone arrays. To efficiently obtain multi-point RIR measurements, spatial interpolation methods for RIRs based on physical models have been proposed. In particular, a combination of the sparse equivalent source method (ESM) with the image source method (ISM) achieves spatial interpolation of the early reflected sounds at higher frequencies. However, as the order of the reflected sounds increases, the optimization problem becomes larger and more difficult to solve. In this study, to reduce the complexity of the optimization problem, we propose a sparse ESM that groups the reflected sounds by their arrival times. Simulation experiments revealed that the proposed method estimates higher-order reflections with higher accuracy than the conventional method. In addition, the region estimated with high accuracy became broader.

1. INTRODUCTION
The room impulse response (RIR) is used in various acoustic applications such as room acoustics, sound field synthesis, and sound field visualization. When the spatial information of sound propagation is essential, RIRs need to be measured at multiple points. However, multi-point measurement of RIRs requires many microphones or a system to move a microphone. Thus, spatial interpolation methods based on the compressed sensing framework, which use a small number of microphones, have been proposed to obtain RIRs at many points efficiently. For example, one interpolation method for RIRs uses randomly placed microphones in a room [1]. However, the geometry of such microphone arrays is complex, and the frequencies attainable with practically feasible microphone positions are usually low. To estimate RIRs at desired grid points, another interpolation method moves a microphone through the room [2]. However, this method requires a device to move the microphone to the desired locations. Thus, obtaining multi-point RIR measurements is difficult because it requires considerable time and effort.

RIR interpolation methods involving physical models have been proposed to obtain multi-point RIRs efficiently. Moiola et al. proposed a plane-wave decomposition method (PWDM) that represents the sound field by a superposition of plane-wave transfer functions [3]. Verburg et al. proposed a sound field reconstruction using plane waves that exploits the spatial sparsity of the sound field [4]. The equivalent source method (ESM), which uses the transfer functions of point sources, has also been proposed [5]. Fernandez-Grande et al. proposed a sparse ESM that exploits the spatial sparsity of the sound sources [6]. Koyama et al. proposed a sound field decomposition method in which a reverberant sound field is modeled to estimate the sound field of the direct sound [7]; the direct and reverberant sound components are reconstructed with the ESM and the PWDM, respectively. A sparse ESM using sparsity in the time domain has also been proposed [8].
An estimation method for the direct and early reflected sound components of RIRs was proposed in which the sparse ESM and the image source method (ISM) are combined [9–11]. This method reconstructs RIRs from a small number of fixed microphones by using the spatial sparsity of the sound sources in a room. However, as the order of the reflected sounds increases, the interpolation accuracy decreases. One reason for the decrease in accuracy is that, as the reflection order increases, the basis used to represent the data, that is, the dictionary, becomes too large.

In this study, we propose a spatial interpolation method for early RIRs with sparse equivalent sources (ESs) and grouped image sources (ISs). The RIRs are decomposed in the time domain based on ISs grouped according to their arrival times. The decomposed RIRs are then represented by a superposition of sparse ESs in the frequency domain. Numerical experiments were conducted to evaluate the estimation accuracy of the proposed method by comparing it with a conventional method that does not decompose the RIRs.

2. METHOD
We consider modeling the early RIRs around the microphones using the RIRs measured at a small number of microphones. Figure 1 presents a conceptual diagram of the proposed method. Assuming that the location of the sound source and the room geometry are approximately known, the positions of the ISs up to the reflection order under consideration are calculated based on the ISM [12]. The sound source and ISs are then grouped by their distances from the center of the microphone array. Each microphone's RIR is decomposed by windowing the signal so that each window covers the arrival times of all the sources in the corresponding group.
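As an illustration, the image-source computation, the distance-based grouping, and the windowing step described above can be sketched in Python. This is a minimal sketch under assumed conditions (a rigid-walled shoebox room, a speed of sound of 343 m/s, rectangular windows with an assumed 1 ms margin); the per-axis image positions 2nL + (-1)^p s and reflection counts |2n - p| follow the standard ISM [12], and all helper names are illustrative, not the authors' implementation.

```python
import numpy as np

C = 343.0  # assumed speed of sound [m/s]

def image_sources(src, room, max_order):
    """Image-source positions for a shoebox room [0, Lx] x [0, Ly] x [0, Lz].

    Along each axis, images lie at 2*n*L + (-1)**p * s with per-axis
    reflection count |2*n - p|; the total reflection order is the sum
    over the three axes (standard ISM, Allen & Berkley [12]).
    """
    pos, order = [], []
    rng = range(-max_order, max_order + 1)
    for nx in rng:
        for px in (0, 1):
            for ny in rng:
                for py in (0, 1):
                    for nz in rng:
                        for pz in (0, 1):
                            o = abs(2 * nx - px) + abs(2 * ny - py) + abs(2 * nz - pz)
                            if o > max_order:
                                continue
                            pos.append([2 * nx * room[0] + (-1) ** px * src[0],
                                        2 * ny * room[1] + (-1) ** py * src[1],
                                        2 * nz * room[2] + (-1) ** pz * src[2]])
                            order.append(o)
    return np.asarray(pos), np.asarray(order)

def group_by_distance(pos, center, bin_width=1.0):
    """Assign each (image) source to a group l by its distance from the array center."""
    d = np.linalg.norm(pos - np.asarray(center), axis=1)
    return np.floor(d / bin_width).astype(int), d

def group_windows(dist, groups, fs, n_samples, margin=1e-3):
    """One rectangular window per group, covering the arrival times (dist / c)
    of all sources in that group, padded by a small margin [s]."""
    wins = np.zeros((groups.max() + 1, n_samples))
    for l in range(groups.max() + 1):
        d = dist[groups == l]
        if d.size == 0:  # a distance bin may contain no source
            continue
        i0 = max(int((d.min() / C - margin) * fs), 0)
        i1 = min(int((d.max() / C + margin) * fs) + 1, n_samples)
        wins[l, i0:i1] = 1.0
    return wins
```

Multiplying each microphone signal by `wins[l]` yields the decomposed group signals of Eqs. (1)–(2); a tapered (e.g., Hann-edged) window could be substituted for the rectangular edges.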
In the time domain, the m-th microphone signal y_m(t) can be represented as follows:

y_m(t) = \sum_{l=1}^{L} w_{\mathrm{win}}^{(l)}(t) \cdot y_m(t),   (1)

y_m^{(l)}(t) \equiv w_{\mathrm{win}}^{(l)}(t) \cdot y_m(t),   (2)

where w_{\mathrm{win}}^{(l)}(t) is a window function that extracts only the reflected sounds corresponding to the l-th group of arrival times, t is the time index, and y_m^{(l)}(t) is the signal of the l-th group.

Figure 1: Conceptual diagram of the proposed method. The gray bands on the left represent the range of each grouped image source. The measured RIRs at the microphones are decomposed by windowing in the time domain, as shown on the right. Parameters l and L are the index and number of image-source groups, respectively. Each windowed RIR signal is modeled using a sparse ESM.

We then model the RIRs per group using the sparse ESM in the frequency domain [6]. Based on the ISM, virtual sound sources (ESs) are placed around the positions of the sound source and the ISs. Using the transfer function D from the ES at x_n^{(l,g)} to the m-th microphone at x'_m and the weight coefficient w_n of the ES, the signal of the l-th group is as follows:

y_m^{(l,\omega)} = \sum_{g=1}^{G_l} \sum_{n=1}^{N} D(x'_m, x_n^{(l,g)}, \omega)\, w_n^{(l,g,\omega)},   (3)

D(x'_m, x_n^{(l,g)}, \omega) = \frac{e^{-ik|x'_m - x_n^{(l,g)}|}}{4\pi |x'_m - x_n^{(l,g)}|},   (4)

where G_l is the total number of sound sources in the l-th group, N is the total number of ESs for each source, and \omega is the angular frequency. Since Eq. (3) holds for all microphones, it can be written in matrix form as follows:

\mathbf{y}^{(l,\omega)} = \mathbf{D}^{(l,\omega)} \mathbf{w}^{(l,\omega)},   (5)

\mathbf{D}^{(l,\omega)} = \begin{bmatrix} D(x'_1, x_1^{(l)}, \omega) & \cdots & D(x'_1, x_{N_l}^{(l)}, \omega) \\ \vdots & \ddots & \vdots \\ D(x'_M, x_1^{(l)}, \omega) & \cdots & D(x'_M, x_{N_l}^{(l)}, \omega) \end{bmatrix},   (6)

\mathbf{w}^{(l,\omega)} = [\, w_1^{(l,\omega)}, \ldots, w_{N_l}^{(l,\omega)} \,]^{\mathsf{T}},   (7)

where \mathbf{y}^{(l,\omega)} is the vector of the transfer functions from the sources in the l-th group, \mathbf{D}^{(l,\omega)} is the transfer function matrix of the ESs calculated using the Green's function, N_l is the number of ESs included in the l-th group, and \mathbf{w}^{(l,\omega)} denotes the vector of weight coefficients.

To obtain the weight vector, we assume spatial sparsity of the sound sources; in other words, the total number of sound sources and ISs is significantly lower than the total number of ESs. Because only a few elements of the vector \mathbf{w} have nonzero values, we solve the following optimization problem with \ell_1-norm regularization:

\underset{\mathbf{w}^{(l,\omega)}}{\mathrm{minimize}} \;\; \frac{1}{2} \| \mathbf{y}^{(l,\omega)} - \mathbf{D}^{(l,\omega)} \mathbf{w}^{(l,\omega)} \|_2^2 + \lambda_l \| \mathbf{w}^{(l,\omega)} \|_1,   (8)

where \|\cdot\|_2 is the \ell_2-norm, \|\cdot\|_1 is the \ell_1-norm, and \lambda_l denotes the penalty parameter.

In a previous method [11], the reflected sounds were modeled by a superposition of all ESs simultaneously. Thus, the size of the matrix of ES transfer functions grows as the order of the reflected sounds increases. In the proposed method, the size of the matrix is reduced by grouping the reflected sounds based on their arrival times, so that the optimization problem remains tractable. Using the weight vectors obtained from the optimization problem, the transfer function at any point x is represented as follows:

y(x, \omega) = \sum_{l=1}^{L} \sum_{g=1}^{G_l} \sum_{n=1}^{N} D(x, x_n^{(l,g)}, \omega)\, w_n^{(l,g,\omega)}.   (9)

To obtain the estimated RIRs, the transfer functions obtained from Eq. (9) are converted to the time domain using the inverse Fourier transform.

3. EXPERIMENTS
3.1. Experimental conditions

Figure 2: Arrangement for the simulation. The position of the sound source is the origin of the coordinate system. The simulated room was square shaped (2.70 m x 2.70 m) and had four walls.
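For illustration, Eq. (8) is a complex-valued LASSO solved per group and frequency. The paper solves it with CVX [15]; the sketch below instead uses a basic ISTA iteration (proximal gradient descent with complex soft-thresholding), where the step size, iteration count, and all function names are assumptions, and then evaluates Eq. (9) at arbitrary points with the fitted weights.

```python
import numpy as np

def greens_matrix(mics, srcs, omega, c=343.0):
    """Free-field Green's functions D[m, n] = exp(-ik r) / (4 pi r), as in Eq. (4)."""
    r = np.linalg.norm(np.asarray(mics)[:, None, :] - np.asarray(srcs)[None, :, :],
                       axis=2)
    return np.exp(-1j * (omega / c) * r) / (4.0 * np.pi * r)

def solve_sparse_esm(D, y, lam, n_iter=500):
    """min_w 0.5 * ||y - D w||_2^2 + lam * ||w||_1 (complex LASSO) via ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(D.shape[1], dtype=complex)
    for _ in range(n_iter):
        z = w - step * (D.conj().T @ (D @ w - y))        # gradient step
        mag = np.maximum(np.abs(z), 1e-12)
        w = z * np.maximum(1.0 - step * lam / mag, 0.0)  # complex soft-threshold
    return w

def interpolate(points, es_pos, w, omega):
    """Eq. (9): transfer functions at arbitrary points from the fitted ES weights."""
    return greens_matrix(points, es_pos, omega) @ w
```

An inverse Fourier transform over the per-frequency outputs of `interpolate` then yields the estimated RIRs, as described in the text.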
We used 20 microphones for modeling, and evaluation was performed at 421 points around the microphone array.

We conducted simulation experiments to evaluate the estimation accuracy of the proposed method by comparing it with that of the conventional method, which interpolates RIRs using the ESM and ISM without grouping the ISs or decomposing the RIR signals. Figure 2 shows the simulation conditions. The sampling rate was 48 kHz, the reflection coefficient was 1.0, and the frequency range of the interpolated sound field was 0.5–8.5 kHz. In this experiment, we compared the interpolation accuracies obtained by increasing the maximum order R of the reflected sounds in the RIRs from 1 to 3. The sound field was simulated using the ISM [13]. Random noise with an SNR of 30 dB was added to the microphone signals used for modeling. The experiments were conducted 20 times, and the results were averaged. The sound propagated in three dimensions, and interpolation and evaluation targeted approximately 400 points on a two-dimensional plane. The ISs were grouped by source distance every 1 m. Four hundred ESs were placed around each source. The penalty parameter \lambda_l was set to 10^{-4}. MATLAB (R2020a) [14] and CVX (ver. 2.1) [15] were used to solve the optimization problem. We evaluated the accuracy using the mean SNR in the frequency domain, SNR_freq, and the distribution of the mean SNR in the time domain, SNR_time, as follows:

\mathrm{SNR_{freq}} = \frac{\sum_{\omega=\omega_{\mathrm{min}}}^{\omega_{\mathrm{max}}} \sum_{e=1}^{E} |y(\omega, e)|^2}{\sum_{\omega=\omega_{\mathrm{min}}}^{\omega_{\mathrm{max}}} \sum_{e=1}^{E} |y(\omega, e) - \hat{y}(\omega, e)|^2},   (10)

\mathrm{SNR_{time}} = \frac{\sum_{t=0}^{T} |y(t)|^2}{\sum_{t=0}^{T} |y(t) - \hat{y}(t)|^2},   (11)

where y is the true value, \hat{y} is the estimated value, e is the index of the evaluation points, T is the signal length, \omega is the center frequency of the frequency band, and E is the number of evaluation points.
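The metrics in Eqs. (10) and (11) are plain signal-to-error power ratios; a minimal helper (reporting in dB, as in the figures, which is an assumption about the plotted quantity) could be:

```python
import numpy as np

def snr_db(y_true, y_est):
    """Signal-to-error ratio of Eqs. (10)/(11) in dB. The summation runs over
    all axes of the arrays (frequency bins and evaluation points for SNR_freq,
    or time samples for SNR_time)."""
    y_true = np.asarray(y_true)
    y_est = np.asarray(y_est)
    sig = np.sum(np.abs(y_true) ** 2)
    err = np.sum(np.abs(y_true - y_est) ** 2)
    return 10.0 * np.log10(sig / err)
```

The same function covers both metrics because only the axes being summed over differ.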
3.2. Results

Figure 3: Mean SNR in the frequency domain (A) when R = 1, (B) when R = 2, and (C) when R = 3. R is the maximum order of the reflected sounds in the RIRs. The black line and the gray dotted line indicate the mean SNRs obtained by the proposed method and the previous method, respectively. The horizontal axis represents the center frequency of the 1 kHz frequency band; for example, 1 kHz means the mean SNR between 0.5 kHz and 1.5 kHz.

First, we evaluated the interpolation accuracies in the frequency domain. Figure 3 shows the mean SNR for each reflection order R. From Fig. 3(A), when R = 1, there was no significant difference in interpolation accuracy between the proposed method and the previous method. From Fig. 3(B), when R = 2, the interpolation accuracy of the proposed method improved by about 5 dB at 1 kHz and 10 dB at 8 kHz. From Fig. 3(C), when R = 3, the accuracy improved significantly, by about 15 dB on average; in particular, at 2 kHz, the accuracy improved by approximately 20 dB.

Next, we evaluated the interpolation accuracies in the time domain and the size of the estimation region. Figure 4 shows the SNR distribution. From Fig. 4(i)–(iii), as the maximum order of the reflected sounds increased, the region estimated with a high SNR of about 25 dB by the previous method became narrower. From Fig. 4(iv)–(vi), the proposed method maintained its estimation accuracy even when the maximum order of the reflected sounds increased. Finally, we evaluated the RIR signals. Figure 5 shows the RIR signals at (X, Y) = (0.55, 0.325). In the previous method, the estimated signal contained apparent errors.
In the proposed method, the error in the estimated signal decreased significantly. In addition, the previous method yielded errors in the estimated amplitudes of the direct and reflected sounds, whereas the proposed method estimated the amplitudes with improved accuracy.

Figure 4: SNR map of the interpolated RIRs in the evaluation area. The grid points of the map are the positions of the gray points in Fig. 2. The sound source is on the left side of the map. R is the maximum order of the reflected sounds included in the RIRs. The cross at the center of each map shows the position of the microphone array.

Figure 5: Example of RIR signals at (X, Y) = (0.55, 0.325) when R = 3. (A) Ground truth of the RIR signal. (B) RIR signal estimated using the proposed method. (C) RIR signal estimated using the conventional method.

4. CONCLUSIONS
In this study, we proposed a spatial interpolation method for RIRs based on the sparse ESM that takes the arrival times of the reflected sounds into consideration. In the simulation experiments, the proposed method improved the interpolation accuracy by approximately 25 dB in SNR compared to the previous method. In addition, the proposed method achieved spatial interpolation with high accuracy over a broader region. In future work, we will investigate grouping methods for the reflected sounds.

ACKNOWLEDGEMENTS
This work was partially supported by JSPS KAKENHI Grant Numbers 19K12049 and 20K11872.

REFERENCES
[1] R. Mignot, L. Daudet, and F. Ollivier. Room reverberation reconstruction: Interpolation of the early part using compressed sensing. IEEE Transactions on Audio, Speech, and Language Processing, 21(11):2301–2312, 2013.
[2] F. Katzberg, R. Mazur, M. Maass, P. Koch, and A. Mertins. Sound-field measurement with moving microphones. The Journal of the Acoustical Society of America, 141(5):3220–3235, 2017.
[3] A. Moiola, R. Hiptmair, and I. Perugia. Plane wave approximation of homogeneous Helmholtz solutions. Zeitschrift für angewandte Mathematik und Physik, 62:809–837, 2011.
[4] S. Verburg and E. Fernandez-Grande. Reconstruction of the sound field in a room using compressive sensing. The Journal of the Acoustical Society of America, 143:3770–3779, 2018.
[5] G. H. Koopmann, L. Song, and J. B. Fahnline. A method for computing acoustic fields based on the principle of wave superposition. The Journal of the Acoustical Society of America, 86(6):2433–2438, 1989.
[6] E. Fernandez-Grande, A. Xenaki, and P. Gerstoft. A sparse equivalent source method for near-field acoustic holography. The Journal of the Acoustical Society of America, 141:532–542, 2017.
[7] S. Koyama and L. Daudet. Sparse representation of a spatial sound field in a reverberant environment. IEEE Journal of Selected Topics in Signal Processing, 13(1):172–184, 2019.
[8] N. Antonello, E. De Sena, M. Moonen, P. A. Naylor, and T. van Waterschoot. Room impulse response interpolation using a sparse spatio-temporal representation of the sound field. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(10):1929–1941, 2017.
[9] K. Tanaka, I. Tsunokuni, Y. Ikeda, and N. Osaka. Estimation method of transfer functions of primary reflection for local sound field by equivalent point sources (in Japanese). In Proceedings of the Autumn Meeting of the Acoustical Society of Japan, 2019.
[10] I. Tsunokuni, K. Kurokawa, Y. Ikeda, and N. Osaka. Modeling of direct-sound transfer functions in local area by sparse equivalent sources. In 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), pages 747–751, 2019.
[11] I. Tsunokuni, K. Kurokawa, H. Matsuhashi, Y. Ikeda, and N. Osaka. Spatial extrapolation of early room impulse responses in local area using sparse equivalent sources and image source method. Applied Acoustics, 179:108027, 2021.
[12] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
[13] R. Scheibler, E. Bezzam, and I. Dokmanić. Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 351–355, 2018.
[14] The MathWorks, Inc., Natick, Massachusetts. MATLAB version 9.8.0.1417392 (R2020a), 2020.
[15] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.