
Time delay estimation via average magnitude differences among multiple microphone signals

Zhen Zhu 1 , Hongsen He 2

School of Information Engineering and Robot Technology Used for Special Environment Key Laboratory of Sichuan Province, Southwest University of Science and Technology, Mianyang 621010, China

Jingdong Chen 3

CIAIC and Shaanxi Provincial Key Laboratory of Artificial Intelligence, Northwestern Polytechnical University, Xi’an 710072, China

ABSTRACT Time delay estimation (TDE) plays a significant role in hands-free speech communication systems for localizing and tracking speakers. To boost the robustness of time delay estimators in room acoustic environments, a novel TDE approach is proposed in this paper. This method first exploits the reciprocals of the average magnitude difference functions of the sound signals captured at a microphone array, instead of cross-correlation coefficients, to construct the entries of the parameterized correlation coefficient matrix in the multichannel cross-correlation coefficient algorithm. A multichannel average magnitude difference coefficient is then defined to establish the time delay estimator. Simulation results demonstrate that the proposed TDE strategy yields better performance in noisy and reverberant environments than the multichannel cross-correlation coefficient method.

1. INTRODUCTION

Time delay estimation (TDE), which aims at estimating the relative time difference of arrival using the signals received at an array of microphones, plays a fundamental role in hands-free speech communication and human-machine voice interaction for localizing and tracking speakers. Common time delay estimators utilize cross-correlation methods [1–3], approaches based on specific properties of the speech signal [4, 5], adaptive eigenvalue decomposition-based strategies [6–8], adaptive blind multichannel identification schemes [9, 10], information theory-based techniques [11–13], and so forth.

1 1462246736@qq.com

2 hongsenhe@gmail.com

3 jingdongchen@ieee.org

inter.noise, 21–24 August, Scottish Event Campus, Glasgow

Due to its simplicity and ease of implementation, the cross-correlation technique is widely applied in practical localization systems, among which the generalized cross-correlation (GCC) approach employing two sensors is the most popular [1]. However, cross-correlation-based estimators show large variance even under high signal-to-noise ratio (SNR) conditions [14]. Chen et al. proposed a multichannel cross-correlation coefficient (MCCC) method on the basis of microphone arrays [15, 16]. It is shown that the algorithm can efficiently exploit the redundancy among multiple microphones to improve the TDE performance [15–17]. Another family of simple methods uses the average magnitude difference function (AMDF) to construct the time delay estimator [14]. It has been shown that the AMDF-based TDE algorithm presents small variance in noisy environments. To enhance the TDE performance in practical acoustic environments, two improved AMDF-based approaches were proposed [18], from which one can find that the time delay estimators with two microphones exhibit better robustness to reverberation than the GCC approach. In this paper, we extend the AMDF to the multichannel case in the same manner as the MCCC algorithm. The reciprocals of the AMDFs of the sound signals observed at an array of microphones are employed to construct the entries of the parameterized average magnitude difference matrix, based on which a multichannel average magnitude difference coefficient (MAMDC) is then defined to establish the time delay estimator.

2. TIME DELAY ESTIMATOR BASED ON MAMDC

2.1. Signal Model Assume that a broadband sound source impinges on a uniform linear array of M microphones, and that the size of the acoustic array aperture is much smaller than the distance from the source to the array. Without loss of generality, we select microphone 1 as the reference point; the propagation of the signal from the acoustic source to the m-th microphone at time n is then modeled as

x_m(n) = α_m s[n − t − f_m(D)] + w_m(n), (1)

where α_m, m = 1, 2, …, M, are the attenuation factors due to propagation effects, s(n) is the unknown zero-mean and reasonably broadband source signal, t is the propagation time from the source to microphone 1, D is the relative delay between the first and second microphones due to the source, f_m(D) = (m − 1)D is the relative delay between microphones 1 and m, and w_m(n) is the additive noise at the m-th microphone, which is assumed to be uncorrelated with both the source signal and the noise observed at the other microphones. With the above signal model, the goal of TDE is to estimate the time delay D given the signals received at the M microphones. For a hypothesized time delay p, we define the parameterized time-aligned signal of x_m(n) as follows:

x_m[n + f_m(p)] = α_m s[n − t − f_m(D) + f_m(p)] + w_m[n + f_m(p)]. (2)

When p = D, the signals received at the M microphones are in phase, and the correlation among the M microphone signals reaches its maximum.
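The signal model of Eqs. (1)–(2) can be sketched numerically. The following minimal Python illustration synthesizes M delayed, attenuated, noisy copies of a broadband source; the values of M, t, D, the attenuation factors, and the noise level are hypothetical, chosen only for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 4, 1024        # number of microphones and samples (illustrative)
t, D = 20, 3          # propagation time to mic 1 and inter-mic delay (samples)
alpha = [1.0, 0.9, 0.8, 0.7]   # hypothetical attenuation factors
noise_std = 0.1                # hypothetical noise level

s = rng.standard_normal(N)     # zero-mean broadband source signal

def mic_signal(m):
    """x_m(n) = alpha_m * s[n - t - f_m(D)] + w_m(n), with f_m(D) = (m - 1) D."""
    delay = t + (m - 1) * D
    x = np.zeros(N)
    x[delay:] = alpha[m - 1] * s[: N - delay]
    return x + noise_std * rng.standard_normal(N)

x = np.stack([mic_signal(m) for m in range(1, M + 1)])
```

Shifting channel m by f_m(D) realigns it with channel 1, which is exactly the in-phase property that the estimator of Section 2.2 searches for over hypothesized delays p.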

2.2. Time Delay Estimator Based on MAMDC The parameterized AMDF between the time-shifted signals x_i[n + f_i(p)] and x_j[n + f_j(p)] is defined as [14]

Φ_AMDF,ij(p) = E{ |x_i[n + f_i(p)] − x_j[n + f_j(p)]| }, i, j = 1, 2, …, M, (3)

where E(·) denotes the mathematical expectation. To extend the AMDF-based TDE algorithm to the multichannel case, we employ the idea of the MCCC algorithm to construct the multichannel average magnitude difference matrix as follows:





D̃_AMDF(p) =
⎡ γ_AMDF,11(p)  γ_AMDF,12(p)  ⋯  γ_AMDF,1M(p) ⎤
⎢ γ_AMDF,21(p)  γ_AMDF,22(p)  ⋯  γ_AMDF,2M(p) ⎥
⎢      ⋮              ⋮        ⋱        ⋮     ⎥
⎣ γ_AMDF,M1(p)  γ_AMDF,M2(p)  ⋯  γ_AMDF,MM(p) ⎦  (4)

with the (i, j)-th entry

γ_AMDF,ij(p) = ξ / [Φ_AMDF,ij(p) + ξ], 1 ≤ i, j ≤ M, (5)

where ξ is a small, fixed positive number that prevents division by zero. Since 0 < γ_AMDF,ij(p) ≤ 1, one can check that

0 ≤ det[ D̃_AMDF(p) ] ≤ 1, (6)

where det(·) stands for the determinant of a square matrix. By analogy with the squared MCCC [15, 16], the squared MAMDC among the M signals x_1[n + f_1(p)], x_2[n + f_2(p)], …, x_M[n + f_M(p)] is defined as

γ²_MAMDC,1:M(p) = 1 − det[ D̃_AMDF(p) ]. (7)

Therefore, we obtain the new time delay estimator on the basis of the MAMDC as follows:

D̂ = arg max_p γ²_MAMDC,1:M(p). (8)
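Equations (3)–(8) can be condensed into a short sketch. In the Python illustration below (function names and the candidate-delay range are our own; ξ = 0.01 as used in Section 3), the parameterized AMDF matrix of Eq. (4) is built for each hypothesized delay p, and the p maximizing the squared MAMDC of Eq. (7) is returned:

```python
import numpy as np

def mamdc_sq(x, p, xi_reg=0.01):
    """Squared MAMDC of Eq. (7) for a hypothesized delay p (in samples)."""
    M, N = x.shape
    shifts = [m * p for m in range(M)]        # f_m(p) = (m - 1) p, 0-based here
    lo, hi = min(min(shifts), 0), max(max(shifts), 0)
    L = N - (hi - lo)                         # common valid window length
    aligned = np.stack([x[m, shifts[m] - lo : shifts[m] - lo + L]
                        for m in range(M)])
    # sample estimate of Eq. (3), then entries of Eq. (5)
    Phi = np.mean(np.abs(aligned[:, None, :] - aligned[None, :, :]), axis=2)
    D_tilde = xi_reg / (Phi + xi_reg)
    return 1.0 - np.linalg.det(D_tilde)       # Eq. (7)

def estimate_delay(x, p_max):
    """D_hat = argmax_p of the squared MAMDC (Eq. 8), p in [-p_max, p_max]."""
    return max(range(-p_max, p_max + 1), key=lambda p: mamdc_sq(x, p))
```

For noiseless, purely delayed copies of a common signal, the matrix at p = D becomes the all-ones matrix, its determinant vanishes, and γ²_MAMDC attains its maximum value of one.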

2.3. Performance Evaluation The relationship between the AMDF and the cross-correlation can be derived from the following well-known inequality:

(1/L) Σ_{n=1}^{L} |x(n)| ≤ √[ (1/L) Σ_{n=1}^{L} x²(n) ]. (9)

The left-hand side of (9) is the average magnitude of a sequence x(n) of length L, while the right-hand side is the root mean square of the sequence. Using (9), one can approximate Φ_AMDF,ij(p) in (3) as

Φ_AMDF,ij(p) = E{ |x_i[n + f_i(p)] − x_j[n + f_j(p)]| }
             ≈ β_ij(p) √[ (1/L) Σ_{n=1}^{L} ( x_i[n + f_i(p)] − x_j[n + f_j(p)] )² ]
             = β_ij(p) √[ r_ii + r_jj − 2 r_ij(p) ], (10)

where

r_ii = (1/L) Σ_{n=1}^{L} x_i²[n + f_i(p)], (11)

r_jj = (1/L) Σ_{n=1}^{L} x_j²[n + f_j(p)], (12)

r_ij(p) = (1/L) Σ_{n=1}^{L} x_i[n + f_i(p)] x_j[n + f_j(p)], (13)


Figure 1: Layout of the microphone array and the sound source positions in the simulated room.

r_ii and r_jj are the variances of channels i and j, respectively, r_ij(p) is the parameterized cross-correlation function between channels i and j, and β_ij(p) ∈ (0, 1] is a factor closely related to the joint probability density function of x_i[n + f_i(p)] and x_j[n + f_j(p)] [19]. It is seen from (10) that the function Φ_AMDF,ij(p) depends not only on r_ij(p), but also on β_ij(p). This shows that Φ_AMDF,ij(p) exploits more information than the cross-correlation function r_ij(p) to estimate the time delay, which helps enhance the robustness of TDE. Thus, we can deduce that in the multichannel case, the proposed MAMDC algorithm can achieve better TDE performance than the MCCC algorithm.
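The inequality (9) and the role of the factor β can be checked numerically. In the hedged sketch below, a zero-mean Gaussian sequence stands in for the residual signal; for that distribution the ratio of average magnitude to RMS is known to approach √(2/π) ≈ 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10000)   # zero-mean broadband surrogate sequence

avg_mag = np.mean(np.abs(x))     # left-hand side of (9)
rms = np.sqrt(np.mean(x ** 2))   # right-hand side of (9)

# beta plays the role of beta_ij(p) in (10): it lies in (0, 1] and depends
# on the amplitude distribution of the sequence, not only on its power
beta = avg_mag / rms
```

Because β depends on the amplitude distribution of the residual x_i − x_j, the AMDF carries distributional information that the cross-correlation r_ij(p) alone does not, which is the intuition behind the robustness claim above.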

3. SIMULATION EXPERIMENTS

3.1. Experimental Environment Experiments are carried out in a simulated room of size 10 m × 8 m × 6 m. A uniform linear array of six omnidirectional microphones is used, with an inter-element spacing of 0.1 m. For ease of exposition, positions in the room are designated by (x, y, z) coordinates in meters with reference to the southwest corner of the room floor. The first and the sixth microphones of the array are situated at (4.75, 4.00, 1.40) and (5.25, 4.00, 1.40), respectively. A sound source is successively located at eleven points, denoted P_1, P_2, …, P_11, along a circular arc with center (5.00, 4.00, 1.40) and radius 2 m. The layout of the microphone array and the sound source positions is illustrated in Fig. 1.

The impulse responses of the room acoustic channels are simulated using the image model [20], and the source signal is convolved with the synthetic impulse responses. In the calculation of the SNR, the signal component includes reverberation. Three reverberation levels are simulated: an anechoic environment, a moderately reverberant environment with reverberation time T_60 = 300 ms, and a heavily reverberant one with T_60 = 600 ms. In the simulations, all the microphones are calibrated with unity gains, and appropriately scaled temporally white Gaussian noise is added to each microphone signal to produce the required SNR. The parameter ξ is set to 0.01. The source signal is a 4-min female speech signal sampled at 16 kHz. The microphone signals are segmented into nonoverlapping frames of 128 ms (2048 samples), giving 1876 frames in total, and each frame is windowed with a Hamming window. For each signal frame, the proposed algorithm produces one time delay estimate.
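As a quick sanity check of this geometry, the true delay between the first two microphones for a source on the arc can be computed directly from the positions above. This is only a sketch: the speed of sound c = 343 m/s and the angle convention (measured from east, counterclockwise, in the array plane) are our assumptions:

```python
import numpy as np

c = 343.0     # assumed speed of sound (m/s)
fs = 16000.0  # sampling rate (Hz)

mic1 = np.array([4.75, 4.00, 1.40])
mic2 = np.array([4.85, 4.00, 1.40])     # 0.1 m spacing along the array axis
center = np.array([5.00, 4.00, 1.40])   # center of the source arc
r = 2.0                                  # arc radius (m)

def true_delay(theta_deg):
    """Delay (samples) between mics 1 and 2 for a source on the arc."""
    th = np.radians(theta_deg)
    src = center + r * np.array([np.cos(th), np.sin(th), 0.0])
    return (np.linalg.norm(src - mic2) - np.linalg.norm(src - mic1)) * fs / c
```

Near the endfire directions the delay approaches ±d·fs/c = ±0.1 × 16000/343 ≈ ±4.7 samples, consistent with the extreme true delays used in the evaluation.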
When the sound source is situated at each of the eleven positions in turn, the true time delays between the first two microphones are 4.7, 4.0, 3.0, 2.0, 1.0, 0.0, −1.0, −2.0, −3.0, −4.0, and −4.7 samples, respectively. In the simulations, we use two criteria [21, 22], namely the probability of anomalous estimates and the root mean square error (RMSE) of the nonanomalous estimates, to evaluate the performance of the proposed algorithm. For the i-th delay estimate D̂_i, if the absolute error |D̂_i − D| > T_c/2, where D is the true delay and T_c is the signal self-correlation time, the estimate is identified as anomalous; otherwise, it is deemed nonanomalous. For the particular speech signal used in this study, T_c equals 4.0 samples. Once the time delays at the eleven source positions are estimated, the average probability of anomalous estimates and the average RMSE of the nonanomalous estimates can be obtained. The lower these two quantities are, the better the TDE performance.
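The two evaluation criteria can be expressed compactly. The sketch below (a hypothetical helper of our own, with T_c = 4.0 samples as stated above) classifies a batch of estimates and computes both figures of merit:

```python
import numpy as np

def evaluate(estimates, D_true, T_c=4.0):
    """Split estimates into anomalous / nonanomalous and score them.

    An estimate is anomalous if |D_hat - D| > T_c / 2; the metrics are the
    probability of anomalies (percent) and the RMSE of the nonanomalous ones.
    """
    est = np.asarray(estimates, dtype=float)
    err = np.abs(est - D_true)
    anomalous = err > T_c / 2
    p_anom = np.mean(anomalous) * 100.0
    non = est[~anomalous]
    rmse = np.sqrt(np.mean((non - D_true) ** 2)) if non.size else np.nan
    return p_anom, rmse
```

For example, with a true delay of 2.0 samples, evaluate([2.0, 2.5, 10.0, 1.5], 2.0) flags one of the four estimates as anomalous (25%) and scores the remaining three by their RMSE.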


Figure 2: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the anechoic environment. The fitting curves are third-order polynomials.


Figure 3: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the moderately reverberant environment. The fitting curves are third-order polynomials.


Figure 4: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the heavily reverberant environment. The fitting curves are third-order polynomials.

3.2. Results Figures 2–4 present the TDE results in the anechoic, moderately reverberant, and heavily reverberant environments, where the probability of anomalous estimates and the RMSE of nonanomalous estimates are plotted as functions of the number of microphones. It is seen from Figs. 2–4 that under low SNR conditions, such as SNR = −10 dB and −5 dB, the TDE performance of both the MAMDC and MCCC algorithms improves as the number of microphones increases, indicating that both can take advantage of the redundancy among multiple microphones to combat strong noise. It can also be observed that for a fixed number of microphones, the proposed MAMDC algorithm outperforms the MCCC algorithm, especially when more microphones are used, which demonstrates that it is effective to generalize the AMDF to the multichannel case for time delay estimation.


Figure 5: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus SNR in the anechoic and moderately reverberant environments, where five microphones are utilized. The fitting curves are third-order polynomials.

Figure 5 depicts the TDE results versus SNR in the anechoic and moderately reverberant environments where five microphones are utilized. The probability of anomalous estimates and the RMSE of nonanomalous estimates of the proposed MAMDC algorithm are lower than those of the MCCC algorithm regardless of the SNR condition, which validates that the proposed MAMDC algorithm is more robust to noise than the MCCC algorithm, especially under heavy noise. Figure 6 illustrates the TDE results versus T_60 at SNRs of −5 dB and −10 dB where five microphones are employed. As seen from Fig. 6, the proposed MAMDC algorithm maintains its superiority over the MCCC algorithm regardless of the reverberation time. This demonstrates that, compared to the MCCC algorithm, the MAMDC algorithm can more effectively use the redundancy of multiple microphones to mitigate the adverse effects of noise and reverberation on TDE.


Figure 6: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus T_60 at SNRs of −10 dB and −5 dB, where five microphones are utilized. The fitting curves are third-order polynomials.

4. CONCLUSIONS

In this work, we proposed a new TDE algorithm for localizing and tracking speakers. The algorithm generalizes the AMDF-based method from the two-channel to the multichannel case. The reciprocals of the AMDFs of the sound signals received at a microphone array are employed to construct the multichannel average magnitude difference matrix, based on which the MAMDC is defined to establish the time delay estimator. Simulation experiments showed that the proposed MAMDC algorithm can effectively exploit the redundancy of multiple microphones to achieve better robustness against noise and reverberation than the MCCC algorithm.

ACKNOWLEDGEMENTS

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62071399) and the Open Foundation of the Robot Technology Used for Special Environment Key Laboratory of Sichuan Province (Grant No. 20kfkt03).

REFERENCES

[1] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(8): 320–327, 1976.

[2] G. C. Carter. Time delay estimation for passive sonar signal processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(1): 463–470, 1981.

[3] J. Chen, J. Benesty, and Y. Huang. Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environments. EURASIP Journal on Applied Signal Processing, 2005: 25–36, 2005.

[4] M. S. Brandstein. A pitch-based approach to time-delay estimation of reverberant speech. Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, 1997.

[5] T. G. Dvorkind and S. Gannot. Time difference of arrival estimation of speech source in a noisy and reverberant environment. Signal Processing, 85(1): 177–204, 2005.

[6] Y. Huang, J. Benesty, and G. W. Elko. Adaptive eigenvalue decomposition algorithm for real time acoustic source localization system. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 937–940, 1999.

[7] J. Benesty. Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. Journal of the Acoustical Society of America, 107(1): 384–391, 2000.

[8] S. Doclo and M. Moonen. Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP Journal on Applied Signal Processing, 2003(11): 1110–1124, 2003.

[9] Y. Huang, J. Benesty, and J. Chen. Acoustic MIMO Signal Processing. Springer, Berlin, Heidelberg, Germany, 2006.

[10] H. He, J. Chen, J. Benesty, Y. Zhou, and T. Yang. Robust multichannel TDOA estimation for speaker localization using the impulsive characteristics of speech spectrum. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 6130–6134, 2017.

[11] F. Talantzis, A. G. Constantinides, and L. C. Polymenakos. Estimation of direction of arrival using information theory. IEEE Signal Processing Letters, 12(8): 561–564, 2005.

[12] J. Benesty, Y. Huang, and J. Chen. Time delay estimation via minimum entropy. IEEE Signal Processing Letters, 14(3): 157–160, 2007.

[13] H. He, J. Lu, L. Wu, and X. Qiu. Time delay estimation via non-mutual information among multiple microphones. Applied Acoustics, 74: 1033–1036, 2013.

[14] G. Jacovitti and G. Scarano. Discrete time techniques for time delay estimation. IEEE Transactions on Signal Processing, 41: 525–533, 1993.

[15] J. Chen, J. Benesty, and Y. Huang. Robust time delay estimation exploiting redundancy among multiple microphones. IEEE Transactions on Speech and Audio Processing, 11(6): 549–557, 2003.

[16] J. Benesty, J. Chen, and Y. Huang. Time-delay estimation via linear interpolation and cross-correlation. IEEE Transactions on Speech and Audio Processing, 12(5): 509–519, 2004.

[17] J. Chen, J. Benesty, and Y. Huang. Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing, 2006: 1–19, 2006.

[18] J. Chen, J. Benesty, and Y. Huang. Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environments. EURASIP Journal on Applied Signal Processing, 2005: 25–36, 2005.

[19] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley. Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech, and Signal Processing, 22: 353–362, 1974.

[20] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65: 943–950, 1979.

[21] J. P. Ianniello. Time delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30: 998–1003, 1982.

[22] B. Champagne, S. Bedard, and A. Stephenne. Performance of time-delay estimation in presence of room reverberation. IEEE Transactions on Speech and Audio Processing, 4: 148–152, 1996.