
Time delay estimation via average magnitude differences among multiple microphone signals

Zhen Zhu 1 , Hongsen He 2

School of Information Engineering and Robot Technology Used for Special Environment Key Laboratory of Sichuan Province, Southwest University of Science and Technology, Mianyang 621010, China

Jingdong Chen 3

CIAIC and Shaanxi Provincial Key Laboratory of Artificial Intelligence, Northwestern Polytechnical University, Xi’an 710072, China

ABSTRACT Time delay estimation (TDE) plays a significant role in hands-free speech communication systems for localizing and tracking speakers. To boost the robustness of time delay estimators in room acoustic environments, a novel TDE approach is proposed in this paper. This method first exploits the reciprocals of the average magnitude difference functions of the sound signals captured at a microphone array, instead of cross-correlation coefficients, to construct the entries of the parameterized correlation coefficient matrix in the multichannel cross-correlation coefficient algorithm. A multichannel average magnitude difference coefficient is then defined to establish the time delay estimator. Simulation results demonstrate that the proposed TDE strategy yields better performance in noisy and reverberant environments than the multichannel cross-correlation coefficient method.

1. INTRODUCTION

Time delay estimation (TDE), which aims at estimating the relative time difference of arrival using the signals received at an array of microphones, plays a fundamental role in hands-free speech communication and human-machine voice interaction for localizing and tracking speakers. Common time delay estimators utilize cross-correlation methods [1–3], approaches based on specific properties of the speech signal [4, 5], adaptive eigenvalue decomposition-based strategies [6–8], adaptive blind multichannel identification schemes [9, 10], information theory-based techniques [11–13], and so forth.

1 1462246736@qq.com

2 hongsenhe@gmail.com

3 jingdongchen@ieee.org

inter.noise, 21–24 August, Scottish Event Campus, Glasgow

Due to its simplicity and ease of implementation, the cross-correlation technique is widely applied in practical localization systems, among which the generalized cross-correlation (GCC) approach employing two sensors is the most popular [1]. However, cross-correlation-based estimators show large variance even under high signal-to-noise ratio (SNR) conditions [14]. Chen et al. proposed a multichannel cross-correlation coefficient (MCCC) method on the basis of microphone arrays [15, 16]. It is shown that the algorithm can efficiently exploit the redundancy among multiple microphones to improve the TDE performance [15–17]. Another family of simple methods uses the average magnitude difference function (AMDF) to construct the time delay estimator [14]. It has been shown that the AMDF-based TDE algorithm presents small variance in noisy environments. To enhance the TDE performance in practical acoustic environments, two improved AMDF-based approaches were proposed [18], from which one can find that the time delay estimators with two microphones exhibit better robustness to reverberation than the GCC approach. In this paper, we extend the AMDF to the multichannel case in the same manner as the MCCC algorithm. The reciprocals of the AMDFs of the sound signals observed at an array of microphones are employed to construct the entries of the parameterized average magnitude difference matrix, based on which a multichannel average magnitude difference coefficient (MAMDC) is then defined to establish the time delay estimator.

2. TIME DELAY ESTIMATOR BASED ON MAMDC

2.1. Signal Model Assume that a broadband sound source impinges on a uniform linear array of M microphones, and that the size of the acoustic array aperture is much smaller than the distance from the source to the array. Without loss of generality, we select microphone 1 as the reference point; the propagation of the signal from the acoustic source to the m-th microphone at time n is then modeled as

x_m(n) = α_m s[n − t − f_m(D)] + w_m(n), (1)

where α_m, m = 1, 2, …, M, are the attenuation factors due to propagation effects, s(n) is the unknown zero-mean and reasonably broadband source signal, t is the propagation time from the source to microphone 1, D is the relative delay between the first and second microphones due to the source, f_m(D) = (m − 1)D is the relative delay between microphones 1 and m, and w_m(n) is the additive noise at the m-th microphone, which is assumed to be uncorrelated with both the source signal and the noise observed at the other microphones. With the above signal model, the goal of TDE is to estimate the time delay D given the signals received at the M microphones. For a hypothesized time delay p, we define the parameterized time-aligned signal of x_m(n) as follows:

x_m[n + f_m(p)] = α_m s[n − t − f_m(D) + f_m(p)] + w_m[n + f_m(p)]. (2)

When p = D, the signals received at the M microphones are in phase, and the correlation among the M microphone signals reaches its maximum.
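The signal model of Eqs. (1)–(2) can be sketched numerically. The following minimal Python illustration synthesizes M delayed, attenuated, noisy copies of a broadband source; the values of M, t, D, the attenuation factors, and the noise level are hypothetical, chosen only for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 4, 1024        # number of microphones and samples (illustrative)
t, D = 20, 3          # propagation time to mic 1 and inter-mic delay (samples)
alpha = [1.0, 0.9, 0.8, 0.7]   # hypothetical attenuation factors
noise_std = 0.1                # hypothetical noise level

s = rng.standard_normal(N)     # zero-mean broadband source signal

def mic_signal(m):
    """x_m(n) = alpha_m * s[n - t - f_m(D)] + w_m(n), with f_m(D) = (m - 1) D."""
    delay = t + (m - 1) * D
    x = np.zeros(N)
    x[delay:] = alpha[m - 1] * s[: N - delay]
    return x + noise_std * rng.standard_normal(N)

x = np.stack([mic_signal(m) for m in range(1, M + 1)])
```

Shifting channel m by f_m(D) realigns it with channel 1, which is exactly the in-phase property that the estimator of Section 2.2 searches for over hypothesized delays p.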

2.2. Time Delay Estimator Based on MAMDC The parameterized AMDF between the time-shifted signals x_i[n + f_i(p)] and x_j[n + f_j(p)] is defined as [14]

Φ_AMDF,ij(p) = E{ |x_i[n + f_i(p)] − x_j[n + f_j(p)]| }, i, j = 1, 2, …, M, (3)

where E(·) denotes the mathematical expectation. To extend the AMDF-based TDE algorithm to the multichannel case, we employ the idea of the MCCC algorithm to construct the multichannel average magnitude difference matrix as follows:





D̃_AMDF(p) =
⎡ γ_AMDF,11(p)  γ_AMDF,12(p)  ⋯  γ_AMDF,1M(p) ⎤
⎢ γ_AMDF,21(p)  γ_AMDF,22(p)  ⋯  γ_AMDF,2M(p) ⎥
⎢      ⋮              ⋮        ⋱        ⋮     ⎥
⎣ γ_AMDF,M1(p)  γ_AMDF,M2(p)  ⋯  γ_AMDF,MM(p) ⎦  (4)

with the (i, j)-th entry

γ_AMDF,ij(p) = ξ / [Φ_AMDF,ij(p) + ξ], 1 ≤ i, j ≤ M, (5)

where ξ is a small, fixed positive number that prevents division by zero. Since 0 < γ_AMDF,ij(p) ≤ 1, one can check that

0 ≤ det[ D̃_AMDF(p) ] ≤ 1, (6)

where det(·) stands for the determinant of a square matrix. By analogy with the squared MCCC [15, 16], the squared MAMDC among the M signals x_1[n + f_1(p)], x_2[n + f_2(p)], …, x_M[n + f_M(p)] is defined as

γ²_MAMDC,1:M(p) = 1 − det[ D̃_AMDF(p) ]. (7)

Therefore, we obtain the new time delay estimator on the basis of the MAMDC as follows:

D̂ = arg max_p γ²_MAMDC,1:M(p). (8)
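Equations (3)–(8) can be condensed into a short sketch. In the Python illustration below (function names and the candidate-delay range are our own; ξ = 0.01 as used in Section 3), the parameterized AMDF matrix of Eq. (4) is built for each hypothesized delay p, and the p maximizing the squared MAMDC of Eq. (7) is returned:

```python
import numpy as np

def mamdc_sq(x, p, xi_reg=0.01):
    """Squared MAMDC of Eq. (7) for a hypothesized delay p (in samples)."""
    M, N = x.shape
    shifts = [m * p for m in range(M)]        # f_m(p) = (m - 1) p, 0-based here
    lo, hi = min(min(shifts), 0), max(max(shifts), 0)
    L = N - (hi - lo)                         # common valid window length
    aligned = np.stack([x[m, shifts[m] - lo : shifts[m] - lo + L]
                        for m in range(M)])
    # sample estimate of Eq. (3), then entries of Eq. (5)
    Phi = np.mean(np.abs(aligned[:, None, :] - aligned[None, :, :]), axis=2)
    D_tilde = xi_reg / (Phi + xi_reg)
    return 1.0 - np.linalg.det(D_tilde)       # Eq. (7)

def estimate_delay(x, p_max):
    """D_hat = argmax_p of the squared MAMDC (Eq. 8), p in [-p_max, p_max]."""
    return max(range(-p_max, p_max + 1), key=lambda p: mamdc_sq(x, p))
```

For noiseless, purely delayed copies of a common signal, the matrix at p = D becomes the all-ones matrix, its determinant vanishes, and γ²_MAMDC attains its maximum value of one.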

2.3. Performance Evaluation The relationship between the AMDF and the cross-correlation can be derived from the following well-known inequality:

(1/L) Σ_{n=1}^{L} |x(n)| ≤ √[ (1/L) Σ_{n=1}^{L} x²(n) ]. (9)

The left-hand side of (9) is the average magnitude of a sequence x(n) of length L, while the right-hand side is the root mean square of the sequence. Using (9), one can approximate Φ_AMDF,ij(p) in (3) as

Φ_AMDF,ij(p) = E{ |x_i[n + f_i(p)] − x_j[n + f_j(p)]| }
             ≈ β_ij(p) √[ (1/L) Σ_{n=1}^{L} ( x_i[n + f_i(p)] − x_j[n + f_j(p)] )² ]
             = β_ij(p) √[ r_ii + r_jj − 2 r_ij(p) ], (10)

where

r_ii = (1/L) Σ_{n=1}^{L} x_i²[n + f_i(p)], (11)

r_jj = (1/L) Σ_{n=1}^{L} x_j²[n + f_j(p)], (12)

r_ij(p) = (1/L) Σ_{n=1}^{L} x_i[n + f_i(p)] x_j[n + f_j(p)], (13)


Figure 1: Layout of the microphone array and the sound source positions in the simulated room.

r_ii and r_jj are the variances of channels i and j, respectively, r_ij(p) is the parameterized cross-correlation function between channels i and j, and β_ij(p) ∈ (0, 1] is a factor closely related to the joint probability density function of x_i[n + f_i(p)] and x_j[n + f_j(p)] [19]. It is seen from (10) that the function Φ_AMDF,ij(p) depends not only on r_ij(p), but also on β_ij(p). This shows that Φ_AMDF,ij(p) exploits more information than the cross-correlation function r_ij(p) to estimate the time delay, which helps enhance the robustness of TDE. Thus, we can deduce that in the multichannel case, the proposed MAMDC algorithm can achieve better TDE performance than the MCCC algorithm.
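The inequality (9) and the role of the factor β can be checked numerically. In the hedged sketch below, a zero-mean Gaussian sequence stands in for the residual signal; for that distribution the ratio of average magnitude to RMS is known to approach √(2/π) ≈ 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10000)   # zero-mean broadband surrogate sequence

avg_mag = np.mean(np.abs(x))     # left-hand side of (9)
rms = np.sqrt(np.mean(x ** 2))   # right-hand side of (9)

# beta plays the role of beta_ij(p) in (10): it lies in (0, 1] and depends
# on the amplitude distribution of the sequence, not only on its power
beta = avg_mag / rms
```

Because β depends on the amplitude distribution of the residual x_i − x_j, the AMDF carries distributional information that the cross-correlation r_ij(p) alone does not, which is the intuition behind the robustness claim above.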

3. SIMULATION EXPERIMENTS

3.1. Experimental Environment Experiments are carried out in a simulated room of size 10 m × 8 m × 6 m. A uniform linear array of six omnidirectional microphones is used, with an inter-element spacing of 0.1 m. For ease of exposition, positions in the room are designated by (x, y, z) coordinates in meters with reference to the southwest corner of the room floor. The first and the sixth microphones of the array are situated at (4.75, 4.00, 1.40) and (5.25, 4.00, 1.40), respectively. A sound source is successively located at eleven points, denoted P_1, P_2, …, P_11, along a circular arc with center (5.00, 4.00, 1.40) and radius 2 m. The layout of the microphone array and the sound source positions is illustrated in Fig. 1.

The impulse responses of the room acoustic channels are simulated using the image model [20], and the source signal is convolved with the synthetic impulse responses. In the calculation of the SNR, the signal component includes reverberation. Three reverberation levels are simulated: an anechoic environment, a moderately reverberant environment with reverberation time T_60 = 300 ms, and a heavily reverberant one with T_60 = 600 ms. In the simulations, all the microphones are calibrated with unity gains, and appropriately scaled temporally white Gaussian noise is added to each microphone signal to produce the required SNR. The parameter ξ is set to 0.01. The source signal is a 4-min female speech signal sampled at 16 kHz. The microphone signals are segmented into nonoverlapping frames of 128 ms (2048 samples), giving 1876 frames in total, and each frame is windowed with a Hamming window. For each signal frame, the proposed algorithm produces one time delay estimate.
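As a quick sanity check of this geometry, the true delay between the first two microphones for a source on the arc can be computed directly from the positions above. This is only a sketch: the speed of sound c = 343 m/s and the angle convention (measured from east, counterclockwise, in the array plane) are our assumptions:

```python
import numpy as np

c = 343.0     # assumed speed of sound (m/s)
fs = 16000.0  # sampling rate (Hz)

mic1 = np.array([4.75, 4.00, 1.40])
mic2 = np.array([4.85, 4.00, 1.40])     # 0.1 m spacing along the array axis
center = np.array([5.00, 4.00, 1.40])   # center of the source arc
r = 2.0                                  # arc radius (m)

def true_delay(theta_deg):
    """Delay (samples) between mics 1 and 2 for a source on the arc."""
    th = np.radians(theta_deg)
    src = center + r * np.array([np.cos(th), np.sin(th), 0.0])
    return (np.linalg.norm(src - mic2) - np.linalg.norm(src - mic1)) * fs / c
```

Near the endfire directions the delay approaches ±d·fs/c = ±0.1 × 16000/343 ≈ ±4.7 samples, consistent with the extreme true delays used in the evaluation.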
When the sound source is situated at each of the eleven positions in turn, the true time delays between the first two microphones are 4.7, 4.0, 3.0, 2.0, 1.0, 0.0, −1.0, −2.0, −3.0, −4.0, and −4.7 samples, respectively. In the simulations, we use two criteria [21, 22], namely the probability of anomalous estimates and the root mean square error (RMSE) of the nonanomalous estimates, to evaluate the performance of the proposed algorithm. For the i-th delay estimate D̂_i, if the absolute error |D̂_i − D| > T_c/2, where D is the true delay and T_c is the signal self-correlation time, the estimate is identified as anomalous; otherwise, it is deemed nonanomalous. For the particular speech signal used in this study, T_c equals 4.0 samples. Once the time delays at the eleven source positions are estimated, the average probability of anomalous estimates and the average RMSE of the nonanomalous estimates can be obtained. The lower these two quantities are, the better the TDE performance.
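The two evaluation criteria can be expressed compactly. The sketch below (a hypothetical helper of our own, with T_c = 4.0 samples as stated above) classifies a batch of estimates and computes both figures of merit:

```python
import numpy as np

def evaluate(estimates, D_true, T_c=4.0):
    """Split estimates into anomalous / nonanomalous and score them.

    An estimate is anomalous if |D_hat - D| > T_c / 2; the metrics are the
    probability of anomalies (percent) and the RMSE of the nonanomalous ones.
    """
    est = np.asarray(estimates, dtype=float)
    err = np.abs(est - D_true)
    anomalous = err > T_c / 2
    p_anom = np.mean(anomalous) * 100.0
    non = est[~anomalous]
    rmse = np.sqrt(np.mean((non - D_true) ** 2)) if non.size else np.nan
    return p_anom, rmse
```

For example, with a true delay of 2.0 samples, evaluate([2.0, 2.5, 10.0, 1.5], 2.0) flags one of the four estimates as anomalous (25%) and scores the remaining three by their RMSE.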


Figure 2: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the anechoic environment. The fitting curves are third-order polynomials.


Figure 3: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the moderately reverberant environment. The fitting curves are third-order polynomials.


Figure 4: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus the number of microphones in the heavily reverberant environment. The fitting curves are third-order polynomials.

3.2. Results Figures 2–4 present the TDE results in the anechoic, moderately reverberant, and heavily reverberant environments, where the probability of anomalous estimates and the RMSE of nonanomalous estimates are plotted as functions of the number of microphones. It is seen from Figs. 2–4 that under low SNR conditions, such as SNR = −10 dB and −5 dB, the TDE performance of both the MAMDC and MCCC algorithms improves as the number of microphones increases, indicating that both can take advantage of the redundancy among multiple microphones to combat strong noise. It can also be observed that for a fixed number of microphones, the proposed MAMDC algorithm outperforms the MCCC algorithm, especially when more microphones are used, which demonstrates that it is effective to generalize the AMDF to the multichannel case for time delay estimation.


Figure 5: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus SNR in the anechoic and moderately reverberant environments, where five microphones are utilized. The fitting curves are third-order polynomials.

Figure 5 depicts the TDE results versus SNR in the anechoic and moderately reverberant environments where five microphones are utilized. The probability of anomalous estimates and the RMSE of nonanomalous estimates of the proposed MAMDC algorithm are lower than those of the MCCC algorithm regardless of the SNR condition, which validates that the proposed MAMDC algorithm is more robust to noise than the MCCC algorithm, especially under heavy noise. Figure 6 illustrates the TDE results versus T_60 at SNRs of −5 dB and −10 dB where five microphones are employed. As seen from Fig. 6, the proposed MAMDC algorithm maintains its superiority over the MCCC algorithm regardless of the reverberation time. This demonstrates that, compared to the MCCC algorithm, the MAMDC algorithm can more effectively use the redundancy of multiple microphones to mitigate the adverse effects of noise and reverberation on TDE.


Figure 6: (a) Probability of anomalous estimates and (b) RMSE of nonanomalous estimates versus T_60 at SNRs of −10 dB and −5 dB, where five microphones are utilized. The fitting curves are third-order polynomials.

4. CONCLUSIONS

In this work, we proposed a new TDE algorithm for localizing and tracking speakers. The algorithm generalizes the AMDF-based method from the two-channel to the multichannel case. The reciprocals of the AMDFs of the sound signals received at a microphone array are employed to construct the multichannel average magnitude difference matrix, based on which the MAMDC is defined to establish the time delay estimator. Simulation experiments showed that the proposed MAMDC algorithm can effectively exploit the redundancy of multiple microphones to achieve better robustness against noise and reverberation than the MCCC algorithm.

ACKNOWLEDGEMENTS

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62071399) and the Open Foundation of the Robot Technology Used for Special Environment Key Laboratory of Sichuan Province (Grant No. 20kfkt03).

REFERENCES

[1] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(8): 320–327, 1976.

[2] G. C. Carter. Time delay estimation for passive sonar signal processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(1): 463–470, 1981.

[3] J. Chen, J. Benesty, and Y. Huang. Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environments. EURASIP Journal on Applied Signal Processing, 2005: 25–36, 2005.

[4] M. S. Brandstein. A pitch-based approach to time-delay estimation of reverberant speech. Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, 1997.

[5] T. G. Dvorkind and S. Gannot. Time difference of arrival estimation of speech source in a noisy and reverberant environment. Signal Processing, 85(1): 177–204, 2005.

[6] Y. Huang, J. Benesty, and G. W. Elko. Adaptive eigenvalue decomposition algorithm for real time acoustic source localization system. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 937–940, 1999.

[7] J. Benesty. Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. Journal of the Acoustical Society of America, 107(1): 384–391, 2000.

[8] S. Doclo and M. Moonen. Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP Journal on Applied Signal Processing, 2003(11): 1110–1124, 2003.

[9] Y. Huang, J. Benesty, and J. Chen. Acoustic MIMO Signal Processing. Springer, Berlin, Heidelberg, Germany, 2006.

[10] H. He, J. Chen, J. Benesty, Y. Zhou, and T. Yang. Robust multichannel TDOA estimation for speaker localization using the impulsive characteristics of speech spectrum. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 6130–6134, 2017.

[11] F. Talantzis, A. G. Constantinides, and L. C. Polymenakos. Estimation of direction of arrival using information theory. IEEE Signal Processing Letters, 12(8): 561–564, 2005.

[12] J. Benesty, Y. Huang, and J. Chen. Time delay estimation via minimum entropy. IEEE Signal Processing Letters, 14(3): 157–160, 2007.

[13] H. He, J. Lu, L. Wu, and X. Qiu. Time delay estimation via non-mutual information among multiple microphones. Applied Acoustics, 74: 1033–1036, 2013.

[14] G. Jacovitti and G. Scarano. Discrete time techniques for time delay estimation. IEEE Transactions on Signal Processing, 41: 525–533, 1993.

[15] J. Chen, J. Benesty, and Y. Huang. Robust time delay estimation exploiting redundancy among multiple microphones. IEEE Transactions on Speech and Audio Processing, 11(6): 549–557, 2003.

[16] J. Benesty, J. Chen, and Y. Huang. Time-delay estimation via linear interpolation and cross-correlation. IEEE Transactions on Speech and Audio Processing, 12(5): 509–519, 2004.

[17] J. Chen, J. Benesty, and Y. Huang. Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing, 2006: 1–19, 2006.

[18] J. Chen, J. Benesty, and Y. Huang. Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environments. EURASIP Journal on Applied Signal Processing, 2005: 25–36, 2005.

[19] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley. Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech, and Signal Processing, 22: 353–362, 1974.

[20] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65: 943–950, 1979.

[21] J. P. Ianniello. Time delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30: 998–1003, 1982.

[22] B. Champagne, S. Bedard, and A. Stephenne. Performance of time-delay estimation in presence of room reverberation. IEEE Transactions on Speech and Audio Processing, 4: 148–152, 1996.