
A time delay estimation approach with low computational complexity for speaker localization

Bangguo Song 1, Hongsen He 2, and Tao Yang 3

School of Information Engineering, and Robot Technology Used for Special Environment Key Laboratory of Sichuan Province, Southwest University of Science and Technology, Mianyang 621010, China

ABSTRACT Time delay estimation (TDE), which aims at estimating the time difference of arrival using the signals captured by an array of microphones, plays an essential role in hands-free speech communication systems for localizing and tracking speakers. The sparse linear prediction model, which is based on ℓ1-norm optimization with respect to the prediction-error vector and the coefficient vector of the linear predictor, can be effectively used to prefilter the microphone signals so as to establish a time delay estimator robust to background noise and reverberation. In the course of solving such a sparse model, however, a high-dimensional cross-correlation matrix has to be inverted, so the corresponding TDE algorithm carries a heavy computational load. In this paper, we propose an improved TDE approach based on a dual ℓ1-norm optimization model. The original prediction-error filter is decomposed by the Kronecker product into two groups of short subpredictors. Accordingly, the size of the cross-correlation matrix in the TDE algorithm is reduced, which significantly lowers the computational complexity of prewhitening the microphone signals. Simulation experiments in room acoustic environments demonstrate the computational efficiency of the proposed TDE algorithm as well as its estimation precision.

1. INTRODUCTION

Time delay estimation (TDE), which aims to estimate the relative time difference of arrival using the signals received by an array of sensors, is a crucial component of speaker localization technology, which is extensively applied in hands-free speech communication and human-machine voice interaction systems [1]. However, TDE performance in room acoustic environments is usually adversely affected by reverberation and background noise [2]. A large number of TDE algorithms have been developed to address this challenge [2–6]. Among these approaches, the most popular one is the generalized cross-correlation (GCC) method [3], which whitens the microphone signals to equally emphasize all frequencies and then obtains the time delay between two microphones as the time lag that maximizes the cross-correlation function between the whitened signals. Although the GCC

1 1790964210@qq.com

2 hongsenhe@gmail.com

3 yangtao@swust.edu.cn

inter.noise, 21–24 August, Scottish Event Campus, Glasgow

scheme achieves robustness to reverberation, it is sensitive to noise because it completely whitens the microphone signals. Linear prediction is an important prefiltering approach [7] that can remove the common spectral characteristics of the signals captured by an array of microphones. In TDE, linear prediction can be employed to prewhiten the microphone signals so that a prediction-error-based TDE algorithm achieves robustness to reverberation [8]. To improve the noise immunity of linear-prediction-based TDE, a class of trade-off linear prediction algorithms was proposed [5, 9, 10]. These approaches introduce the sparsity of the prediction coefficient vector into the optimization model. As a result, the TDE algorithms achieve a robustness balance between reverberation and noise. In [10], dual sparse constraints were imposed on the prediction-error vector and the coefficient vector of the linear predictor, respectively, so that the whitening method can exploit the non-Gaussian distribution of speech signals to gain robustness. However, these algorithms incur a huge computational cost due to the inversion of high-dimensional correlation matrices. In this work, we propose an improved TDE algorithm based on Ref. [10]. We exploit the Kronecker product decomposition to establish a group of sparse linear prediction models based on ℓ1-norm optimization with respect to the prediction-error vector and the coefficient vector of the linear predictor. Accordingly, the size of the cross-correlation matrices in the TDE algorithm is reduced, which significantly lowers the computational complexity of prewhitening the microphone signals. The effectiveness of the proposed TDE algorithm is verified via simulation experiments.

2. TDE APPROACH WITH LOW COMPUTATIONAL COMPLEXITY

2.1. Signal Model Assume that there is a broadband sound source in the far field which radiates a plane wave, and we use an array of M microphones to collect the signals. If we choose the first microphone as the reference point, the signal captured by the mth microphone at time n is modeled as

x_m(n) = α_m s[n − t − f_m(D)] + w_m(n), m = 1, 2, …, M, (1)

where α_m, m = 1, 2, …, M, are the attenuation factors due to propagation effects, s(n) is the unknown zero-mean and reasonably broadband source signal, t is the propagation time from the source to microphone 1, w_m(n) is the additive noise at the mth microphone, D is the relative delay between the first two microphones due to the source, and f_m(D) = (m − 1)D is the relative delay between microphones 1 and m. With the above signal model, the objective of TDE is to estimate the delay D given the signals received at the M microphones. For a hypothesized time delay τ, we use the time-shifted signal x_m[n + f_m(τ)]. When τ = D, it can be checked that the desired signal components received at the M microphones are aligned. To simplify the notation, we write x_m[n + f_m(τ)] as x_m(n, τ).
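As a quick illustration of model (1), the following sketch synthesizes M delayed, attenuated, noisy copies of a broadband source and checks that the cross-correlation between the first two channels peaks at the true delay. All parameter values here are hypothetical, white noise stands in for speech, and the common propagation time t is absorbed into the indexing origin:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper).
M, D, N, snr_db = 4, 3, 1024, 20.0
alpha = [1.0, 0.9, 0.8, 0.7]           # attenuation factors alpha_m

pad = (M - 1) * D                      # room for the largest relative delay
s = rng.standard_normal(N + pad)       # zero-mean broadband source s(n)

# Model (1): x_m(n) = alpha_m * s[n - f_m(D)] + w_m(n), f_m(D) = (m - 1) D.
x = np.empty((M, N))
for m in range(M):
    clean = alpha[m] * s[pad - m * D : pad - m * D + N]
    noise_var = np.mean(clean ** 2) / 10.0 ** (snr_db / 10.0)
    x[m] = clean + np.sqrt(noise_var) * rng.standard_normal(N)

# Sanity check: the cross-correlation between the first two microphones
# peaks at the true relative delay D.
c = np.correlate(x[1], x[0], mode="full")
D_hat = int(np.argmax(c)) - (N - 1)
print(D_hat)  # -> 3
```

Aligning each channel by its hypothesized shift f_m(τ) and testing τ = D would bring all M desired components into phase, which is what the TDE search exploits.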

2.2. TDE Algorithm Based on Kronecker Product Decomposition To enhance the robustness of TDE, the microphone signals usually have to be prefiltered to remove the common spectral features present in doublet signals (including the pitch) and to minimize the spectral fluctuation induced by multipath effects. In this work, we consider using a linear prediction model to prewhiten the microphone signals. For instance, the current sample x_m(n, τ) of channel m (m = 1, 2, …, M) can be predicted from its multiple past samples, which is formulated by

x(n, τ) = ∑_{k=1}^{K} a_k x(n − k, τ) + e(n, τ), (2)

where a_k, k = 1, 2, …, K, are the prediction coefficients, K is the length of the predictor, and e(n, τ) is the prediction error. For simplicity, the subscript m of x_m(n, τ) that indexes the microphone channels is omitted. For convenience of formulation, let us consider the most recent L signal samples, so that (2) can be written in the following vector/matrix form:

x(n, τ) = X(n, τ) a(τ) + e(n, τ), (3)

where

x(n, τ) = [x(n, τ) x(n + 1, τ) ··· x(n + K + L − 1, τ)]^T, (4)

a(τ) = [a_1(τ) a_2(τ) ··· a_K(τ)]^T, (5)

e(n, τ) = [e(n, τ) e(n + 1, τ) ··· e(n + K + L − 1, τ)]^T, (6)





X(n, τ) =
[ x(n − 1, τ)          x(n − 2, τ)          ···  x(n − K, τ)
  x(n, τ)              x(n − 1, τ)          ···  x(n − K + 1, τ)
  ⋮                    ⋮                    ⋱    ⋮
  x(n + K + L − 2, τ)  x(n + K + L − 3, τ)  ···  x(n + L − 1, τ) ], (7)

and (·)^T denotes the transpose of a vector or matrix. In [10], a dual ℓ1-norm based optimization model was proposed to prewhiten the microphone signals. The resulting TDE method, however, incurs a huge computational cost due to the inversion of a high-dimensional correlation matrix. Unlike that model, we propose a group of optimization submodels to reduce the computational load. To this end, we use the Kronecker product to decompose the coefficient vector a(τ) of the linear predictor into

a(τ) = ∑_{p=1}^{P} a_{2,p}(τ) ⊗ a_{1,p}(τ), (8)

where a_{1,p}(τ) and a_{2,p}(τ) are two groups of sub-filters of lengths K_1 and K_2, respectively, and ⊗ denotes the Kronecker product. We assume that K = K_1 K_2 and P < min{K_1, K_2}. By exploiting the following relationships [11, 12]:

a_{2,p}(τ) ⊗ a_{1,p}(τ) = [a_{2,p}(τ) ⊗ I_{K_1}] a_{1,p}(τ) = [I_{K_2} ⊗ a_{1,p}(τ)] a_{2,p}(τ), (9)

where I_{K_1} and I_{K_2} are identity matrices of sizes K_1 and K_2, respectively, X(n, τ) a(τ) in (3) can be expressed in two equivalent forms as

X(n, τ) a(τ) = ∑_{p=1}^{P} X(n, τ) [a_{2,p}(τ) ⊗ I_{K_1}] a_{1,p}(τ)
             = ∑_{p=1}^{P} X_{2,p}(n, τ) a_{1,p}(τ)
             = X_2(n, τ) a_1(τ) (10)

and

X(n, τ) a(τ) = ∑_{p=1}^{P} X(n, τ) [I_{K_2} ⊗ a_{1,p}(τ)] a_{2,p}(τ)
             = ∑_{p=1}^{P} X_{1,p}(n, τ) a_{2,p}(τ)
             = X_1(n, τ) a_2(τ), (11)

where

X_{2,p}(n, τ) = X(n, τ) [a_{2,p}(τ) ⊗ I_{K_1}], (12)

X_2(n, τ) = [X_{2,1}(n, τ) X_{2,2}(n, τ) ··· X_{2,P}(n, τ)], (13)

a_1(τ) = [a_{1,1}^T(τ) a_{1,2}^T(τ) ··· a_{1,P}^T(τ)]^T, (14)

X_{1,p}(n, τ) = X(n, τ) [I_{K_2} ⊗ a_{1,p}(τ)], (15)

X_1(n, τ) = [X_{1,1}(n, τ) X_{1,2}(n, τ) ··· X_{1,P}(n, τ)], (16)

a_2(τ) = [a_{2,1}^T(τ) a_{2,2}^T(τ) ··· a_{2,P}^T(τ)]^T. (17)

Thus, the error signal vector e(n, τ) in (3) can be expressed in the following equivalent forms:

e_1(n, τ) = x(n, τ) − X_2(n, τ) a_1(τ), (18)

e_2(n, τ) = x(n, τ) − X_1(n, τ) a_2(τ). (19)

Using the above prediction-error vectors, we establish a group of optimization submodels with dual ℓ1-norm constraints as follows:

min_{a_1(τ)} { ‖a_1(τ)‖_{ℓ1} + (1/2μ) ‖x(n, τ) − X_2(n, τ) a_1(τ)‖_{ℓ1} }, (20)

min_{a_2(τ)} { ‖a_2(τ)‖_{ℓ1} + (1/2μ) ‖x(n, τ) − X_1(n, τ) a_2(τ)‖_{ℓ1} }, (21)

where ‖·‖_{ℓ1} stands for the ℓ1-norm, the first ℓ1-norm term in each optimization submodel is introduced to emphasize the sparsity of the coefficient vector of the linear predictor, the second ℓ1-norm term exploits the non-Gaussian property of speech to improve the TDE performance, and the parameter μ > 0 balances the sparsity level between the prediction coefficients and the prediction error. We exploit the augmented Lagrangian alternating direction method of multipliers (ADMM) [13] to solve this group of dual ℓ1-norm optimization problems in (20) and (21). To do so, we reformulate (20) and (21) as

min_{a_1(τ), e_1(n,τ)} { ‖a_1(τ)‖_{ℓ1} + (1/2μ) ‖e_1(n, τ)‖_{ℓ1} : e_1(n, τ) = x(n, τ) − X_2(n, τ) a_1(τ) } (22)

and

min_{a_2(τ), e_2(n,τ)} { ‖a_2(τ)‖_{ℓ1} + (1/2μ) ‖e_2(n, τ)‖_{ℓ1} : e_2(n, τ) = x(n, τ) − X_1(n, τ) a_2(τ) }, (23)

respectively, which have the following augmented Lagrangian subproblems:

min_{a_1(τ), e_1(n,τ)} { ‖a_1(τ)‖_{ℓ1} + (1/2μ) ‖e_1(n, τ)‖_{ℓ1}
    − η_1^T(τ) [X_2(n, τ) a_1(τ) + e_1(n, τ) − x(n, τ)]
    + (β/2) ‖X_2(n, τ) a_1(τ) + e_1(n, τ) − x(n, τ)‖_{ℓ2}^2 }, (24)

min_{a_2(τ), e_2(n,τ)} { ‖a_2(τ)‖_{ℓ1} + (1/2μ) ‖e_2(n, τ)‖_{ℓ1}
    − η_2^T(τ) [X_1(n, τ) a_2(τ) + e_2(n, τ) − x(n, τ)]
    + (β/2) ‖X_1(n, τ) a_2(τ) + e_2(n, τ) − x(n, τ)‖_{ℓ2}^2 }, (25)

where η_1(τ) and η_2(τ) are two Lagrangian multiplier vectors, β > 0 is a penalty parameter, and ‖·‖_{ℓ2} denotes the ℓ2-norm. According to the ADMM method, given the current iterates [e_{1,t}(n, τ), η_{1,t}(τ)] and [e_{2,t}(n, τ), η_{2,t}(τ)], where t is the iteration index, we can obtain [a_{1,t+1}(τ), e_{1,t+1}(n, τ), η_{1,t+1}(τ)] and [a_{2,t+1}(τ), e_{2,t+1}(n, τ), η_{2,t+1}(τ)] by alternately minimizing (24) and (25) with respect to each variable while keeping the other variables fixed. First, with e_1(n, τ) = e_{1,t}(n, τ), e_2(n, τ) = e_{2,t}(n, τ), η_1(τ) = η_{1,t}(τ), and η_2(τ) = η_{2,t}(τ) fixed, the minimization of (24) with respect to a_1(τ) and the minimization of (25) with respect to a_2(τ) are equivalent to

min_{a_1(τ)} { ‖a_1(τ)‖_{ℓ1} + (β/2) ‖X_2(n, τ) a_1(τ) + e_{1,t}(n, τ) − x(n, τ) − η_{1,t}(τ)/β‖_{ℓ2}^2 } (26)

and

min_{a_2(τ)} { ‖a_2(τ)‖_{ℓ1} + (β/2) ‖X_1(n, τ) a_2(τ) + e_{2,t}(n, τ) − x(n, τ) − η_{2,t}(τ)/β‖_{ℓ2}^2 }, (27)

respectively, for which the solutions can be approximated by the following soft-thresholding operators:

a_{1,t+1}(τ) = soft( a_{1,t}(τ) − ρ g_{1,t}(n, τ), ρ/β ), (28)

a_{2,t+1}(τ) = soft( a_{2,t}(τ) − ρ g_{2,t}(n, τ), ρ/β ), (29)

where ρ > 0 is a proximal parameter,

g_{1,t}(n, τ) = 2 X_2^T(n, τ) [X_2(n, τ) a_{1,t}(τ) + e_{1,t}(n, τ) − x(n, τ) − η_{1,t}(τ)/β], (30)

g_{2,t}(n, τ) = 2 X_1^T(n, τ) [X_1(n, τ) a_{2,t}(τ) + e_{2,t}(n, τ) − x(n, τ) − η_{2,t}(τ)/β], (31)

and the soft function is defined by

soft(p, λ) = sign(p) ⊙ max{|p| − λ, 0}, ∀ p ∈ ℝ^K, λ > 0, (32)

with sign(·) being the signum function, ⊙ denoting the element-wise (Hadamard) product of two vectors, and all the other operations being performed component-wise. Then, with a_1(τ) = a_{1,t+1}(τ), a_2(τ) = a_{2,t+1}(τ), η_1(τ) = η_{1,t}(τ), and η_2(τ) = η_{2,t}(τ) fixed, the minimization of (24) with respect to e_1(n, τ) and the minimization of (25) with respect to e_2(n, τ) are equivalent to
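The soft operator in (32) is a one-liner; a minimal NumPy version:

```python
import numpy as np

def soft(p, lam):
    # Component-wise soft-thresholding of (32):
    # sign(p) * max(|p| - lam, 0), applied element-wise.
    return np.sign(p) * np.maximum(np.abs(p) - lam, 0.0)

# Entries with magnitude below lam are zeroed; the rest shrink toward zero.
out = soft(np.array([2.0, -0.3, 0.5, -1.5]), 0.5)
print(out)
```

This shrinkage is what promotes sparsity in both the coefficient updates (28)–(29) and the error updates later in the derivation.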

min_{e_1(n,τ)} { (1/2μ) ‖e_1(n, τ)‖_{ℓ1} + (β/2) ‖X_2(n, τ) a_{1,t+1}(τ) + e_1(n, τ) − x(n, τ) − η_{1,t}(τ)/β‖_{ℓ2}^2 } (33)

and

min_{e_2(n,τ)} { (1/2μ) ‖e_2(n, τ)‖_{ℓ1} + (β/2) ‖X_1(n, τ) a_{2,t+1}(τ) + e_2(n, τ) − x(n, τ) − η_{2,t}(τ)/β‖_{ℓ2}^2 }, (34)

respectively, which have the following exact solutions:

e_{1,t+1}(n, τ) = soft( x(n, τ) + η_{1,t}(τ)/β − X_2(n, τ) a_{1,t+1}(τ), 1/(2βμ) ), (35)

e_{2,t+1}(n, τ) = soft( x(n, τ) + η_{2,t}(τ)/β − X_1(n, τ) a_{2,t+1}(τ), 1/(2βμ) ). (36)

Finally, the Lagrangian multiplier vectors η_1(τ) and η_2(τ) are updated by

η_{1,t+1}(τ) = η_{1,t}(τ) − γβ [X_2(n, τ) a_{1,t+1}(τ) + e_{1,t+1}(n, τ) − x(n, τ)] (37)

and

η_{2,t+1}(τ) = η_{2,t}(τ) − γβ [X_1(n, τ) a_{2,t+1}(τ) + e_{2,t+1}(n, τ) − x(n, τ)], (38)

respectively, where γ > 0 is a constant. Once the microphone signals are prewhitened by the proposed optimization algorithm, the time delay can be obtained by finding the maximum of the cross-correlation function between the prediction-error signals.
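Putting updates (28), (35), and (37) together, one submodel's prewhitening loop can be sketched as follows. This is an illustrative reimplementation under the parameter choices listed in the experimental section (β = 10, γ = 1.199, μ = 0.1, ρ = 0.001, 32 iterations), not the authors' code, and the toy data merely stand in for x(n, τ) and X_2(n, τ):

```python
import numpy as np

def soft(p, lam):
    # Component-wise soft-thresholding operator of (32).
    return np.sign(p) * np.maximum(np.abs(p) - lam, 0.0)

def admm_prewhiten(x_vec, X_mat, mu=0.1, beta=10.0, gamma=1.199,
                   rho=0.001, n_iter=32):
    # One ADMM submodel, e.g. (22) with X_mat standing in for X_2(n, tau).
    a = np.zeros(X_mat.shape[1])
    e = np.zeros_like(x_vec)
    eta = np.zeros_like(x_vec)
    for _ in range(n_iter):
        r = X_mat @ a + e - x_vec - eta / beta
        g = 2.0 * X_mat.T @ r                        # gradient term, cf. (30)
        a = soft(a - rho * g, rho / beta)            # coefficient update (28)
        e = soft(x_vec + eta / beta - X_mat @ a,     # error update (35)
                 1.0 / (2.0 * beta * mu))
        eta = eta - gamma * beta * (X_mat @ a + e - x_vec)  # multiplier update (37)
    return a, x_vec - X_mat @ a                      # coefficients, prediction error

# Toy usage with random stand-in data.
rng = np.random.default_rng(2)
X_mat = rng.standard_normal((40, 12))
x_vec = 0.8 * X_mat[:, 0] + 0.01 * rng.standard_normal(40)
a, err = admm_prewhiten(x_vec, X_mat)
print(a.shape, err.shape)  # (12,) (40,)
```

Note that every step is a matrix-vector product or an element-wise shrinkage; no matrix inversion appears inside the loop, and the per-iteration cost is governed by the (reduced) column dimension of the data matrix.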

3. COMPUTATIONAL COMPLEXITY

The computational cost is evaluated in terms of the number of real-valued multiplications/divisions required for the implementation of each algorithm. The number of additions/subtractions is neglected because they are much quicker to compute on most generic hardware platforms. The number of multiplications for computing the inverse of a matrix of size L × L (using the LU decomposition) is assumed to be L³ − L [14]. Figure 1 illustrates the computational complexity of the ℓ1/ℓ1-norm constraint based linear prediction (ℓ1/ℓ1-LP) algorithm and the ℓ1/ℓ1-norm constraint based Kronecker product decomposition (ℓ1/ℓ1-KPD) algorithm. The computational complexity is plotted as a function of the predictor length K and the frame length L, respectively. It can be seen that the computational cost of the proposed ℓ1/ℓ1-KPD algorithm is significantly lower than that of the ℓ1/ℓ1-LP algorithm, especially when the predictor length or the frame length is large. The reason is that the original prediction-error filter is decomposed by the Kronecker product into two groups of short subpredictors, so that the size of the cross-correlation matrices in the TDE algorithm is reduced.
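To see why the decomposition helps, one can compare just the matrix-inversion term under the L³ − L count. The snippet below is a rough per-update illustration (the full operation counts behind Figure 1 also include matrix products), with K_1 = K_2 = 10 as arbitrary example sizes:

```python
def inv_cost(n):
    # Multiplications to invert an n x n matrix via LU decomposition [14].
    return n ** 3 - n

# Full predictor: one K x K correlation matrix to invert.
# Decomposed predictor (P = 1): one K1 x K1 and one K2 x K2 matrix,
# with K = K1 * K2.
K1, K2 = 10, 10
K = K1 * K2
full = inv_cost(K)
decomposed = inv_cost(K1) + inv_cost(K2)
print(full, decomposed)  # 999900 1980
```

The cubic growth of the inversion cost is exactly what makes shrinking the matrix dimensions from K to K_1 and K_2 pay off so strongly.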


Figure 1: Computational complexity of the proposed ℓ1/ℓ1-KPD and ℓ1/ℓ1-LP algorithms versus the predictor length K and the frame length L, respectively, where P = 1.

4. EXPERIMENTS

Experiments are carried out in a simulated room of size 7 m × 6 m × 3 m. For ease of exposition, positions in the room are designated by (x, y, z) coordinates in meters with reference to the southwest corner of the room floor. Two microphones are situated at (3.25, 3.00, 1.40) and (3.35, 3.00, 1.40), respectively. A sound source is successively located at four different positions to examine the performance of the TDE algorithms. The distance from each source position to the midpoint between the two microphones is 2 m. The impulse responses from the source to the two microphones are generated using the image model [15]. The microphone outputs are obtained by convolving the source signal with the corresponding generated impulse responses and then adding zero-mean white Gaussian noise to the results to control the signal-to-noise ratio (SNR). For the studied algorithms, the predictor length K is set to 100, β = 10, γ = 1.199, μ = 0.1, ρ = 0.001, the initial values of a_1(τ), a_2(τ), e_1(n, τ), e_2(n, τ), η_1(τ), and η_2(τ) are all null vectors, and the number of iterations is set to 32. In the experiments, the microphone signals are partitioned into nonoverlapping frames with a frame length of 64 ms. Each frame is windowed with a Hamming window, and a time delay estimate is then obtained. The sound source signal used in this paper, which is sampled at 16 kHz, is a segment of speech from a male talker about one minute long. The total number of frames is 950 (the frame length is 1024 samples). We use two criteria [16, 17], namely the probability of anomalous estimates and the root mean square error (RMSE) of the nonanomalous estimates, to

[Figure 2 comprises two panels: (a) probability of anomalies (%) and (b) RMSE (samples) for the GCC-PHAT, LS-LP, ℓ1/ℓ1-LP, and ℓ1/ℓ1-KPD algorithms over SNRs from −5 to 25 dB.]

Figure 2: TDE performance of the proposed ℓ1/ℓ1-KPD and ℓ1/ℓ1-LP algorithms versus SNR in a reverberant environment with T60 = 300 ms and P = 1. The fitting curve is a fourth-order polynomial.

evaluate the performance of the proposed algorithm. For the ith delay estimate D̂_i, if the absolute error |D̂_i − D| > T_c/2, where D is the true delay and T_c is the signal self-correlation time, the estimate is identified as an anomalous estimate. Otherwise, the estimate is deemed nonanomalous. For the particular speech signal used in this study, T_c is equal to 4.0 samples. Once the time delays at the twelve source positions are estimated by the proposed algorithm, the average probability of anomalous estimates and the average RMSE of the nonanomalous estimates can be obtained. Figure 2 shows the TDE performance of the studied algorithms versus SNR in a reverberant environment, where P = 1. It is seen that the trade-off prewhitening algorithms, including the ℓ1/ℓ1-LP and ℓ1/ℓ1-KPD algorithms, basically perform better than the GCC-PHAT and LS-LP algorithms. In high-SNR circumstances, such as SNR > 20 dB, the RMSE of the trade-off prewhitening algorithms is greater than that of the GCC-PHAT and LS-LP algorithms. The reason is that the complete whitening ability of the latter removes the common spectral characteristics of the signals captured by the array of microphones. It can also be seen that the proposed ℓ1/ℓ1-KPD algorithm achieves performance comparable to the ℓ1/ℓ1-LP algorithm, indicating that, under the framework of the linear prediction optimization model based on the dual ℓ1-norm constraints, it is effective to use the Kronecker product to decompose the coefficient vector of the predictor for prewhitening the microphone signals.
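The two criteria can be computed directly from a list of per-frame estimates; a small helper of this kind (the numbers below are toy values, not the paper's data) makes the |D̂_i − D| > T_c/2 split explicit:

```python
import numpy as np

def tde_metrics(estimates, true_delay, t_c=4.0):
    # Split estimates into anomalous / nonanomalous by |D_hat - D| > T_c / 2,
    # then report the anomaly probability (%) and the RMSE (samples)
    # of the nonanomalous estimates.
    estimates = np.asarray(estimates, dtype=float)
    err = np.abs(estimates - true_delay)
    anomalous = err > t_c / 2.0
    p_anom = 100.0 * np.mean(anomalous)
    good = estimates[~anomalous]
    rmse = np.sqrt(np.mean((good - true_delay) ** 2)) if good.size else np.nan
    return p_anom, rmse

# Toy usage: true delay of 5 samples, one anomaly among five estimates.
p, r = tde_metrics([5.0, 5.5, 4.5, 9.0, 5.0], true_delay=5.0)
print(p, round(r, 3))  # 20.0 0.354
```

Averaging these two quantities over the source positions yields the curves reported in Figures 2 and 3.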

[Figure 3 comprises two panels: (a) probability of anomalies (%) and (b) RMSE (samples) for the GCC-PHAT, LS-LP, ℓ1/ℓ1-LP, and ℓ1/ℓ1-KPD algorithms over T60 from 100 to 700 ms.]

Figure 3: TDE performance of the proposed ℓ1/ℓ1-KPD and ℓ1/ℓ1-LP algorithms versus T60 in a noisy environment with SNR = 15 dB and P = 1. The fitting curve is a fourth-order polynomial.

Figure 3 illustrates the TDE performance of the studied algorithms versus the reverberation time T60 in a noisy environment, where P = 1. It can be seen from Fig. 3 that as T60 increases, the TDE performance of all the algorithms degrades, which indicates that reverberation usually has an adverse influence on TDE. Under different reverberation conditions, the trade-off prewhitening algorithms attain better performance than the GCC-PHAT and LS-LP algorithms in noisy and reverberant environments. In addition, the proposed ℓ1/ℓ1-KPD algorithm achieves performance similar to the ℓ1/ℓ1-LP algorithm, again showing the effectiveness of the Kronecker product decomposition applied to the coefficient vector of the linear predictor to prewhiten the microphone signals for TDE.

5. CONCLUSION

In this paper, we proposed an improved TDE algorithm based on ℓ1-norm optimization with respect to the prediction-error vector and the coefficient vector of the linear predictor. The coefficient vector of the linear predictor is decomposed by the Kronecker product into two groups of short prediction-error subfilters. The algorithm retains the robustness of the original method to noise and reverberation and, more importantly, attains low computational complexity thanks to the low-dimensional cross-correlation matrices resulting from the Kronecker product decomposition. Simulation experiments in noisy and reverberant environments demonstrated the effectiveness of the proposed TDE algorithm as well as its high computational efficiency.

ACKNOWLEDGEMENTS

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62071399) and the Open Foundation of Robot Technology Used for Special Environment Key Laboratory of Sichuan Province (Grant No. 20kfkt03).

REFERENCES

[1] S. L. Gay and J. Benesty. Acoustic Signal Processing for Telecommunication . Kluwer, Boston, USA, 2000.

[2] J. Chen, J. Benesty, and Y. Huang. Time delay estimation in room acoustic environments: An overview. EURASIP Journal on Applied Signal Processing , 1–19, 2006.

[3] C. H. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4): 320–327, 1976.

[4] H. He, J. Chen, J. Benesty, Y. Zhou, and T. Yang. Robust multichannel TDOA estimation for speaker localization using the impulsive characteristics of speech spectrum. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing , pp. 6130–6134, 2017.

[5] H. He, J. Chen, J. Benesty, W. Zhang, and T. Yang. A class of multichannel sparse linear prediction algorithms for time delay estimation of speech sources. Signal Processing , 169: 1–14, 2020.

[6] Y. H. Shin and J. H. Lee. Prefiltering approach of adaptive eigenvalue decomposition method for an improved time delay estimation in room environment. IEEE Access , 9: 118443–118452, 2021.

[7] R. P. Ramachandran and P. Kabal. Pitch prediction filters in speech coding. IEEE Transactions on Acoustics, Speech and Signal Processing , ASSP-37: 467–478, 1989.

[8] E. D. Claudio and R. Parisi. Multi-source localization strategies. Microphone Arrays: Signal Processing Techniques and Applications , edited by M. Brandstein and D. Ward, Springer, New York, USA, 181–201, 2001.

[9] H. He, T. Yang, and J. Chen. On time delay estimation from a sparse linear prediction perspective. Journal of the Acoustical Society of America , 137(2): 1044–1047, 2015.

[10] H. He, X. Wang, Y. Zhou, and T. Yang. An ℓ1/ℓ1-norm based linear prediction algorithm to time delay estimation via alternating direction method of multipliers. Proc. IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1–5, 2017.

[11] D. A. Harville. Matrix Algebra From a Statistician’s Perspective . Springer-Verlag, New York, USA, 1997.

[12] C. Paleologu, J. Benesty, and S. Ciochina. Linear system identification based on a Kronecker product decomposition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26: 1793–1808, 2018.

[13] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning , 3(1): 1–122, 2010.

[14] L. Fox. An Introduction to Numerical Linear Algebra . Clarendon, Oxford, U.K., 1964.

[15] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65: 943–950, 1979.

[16] J. P. Ianniello. Time delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech, and Signal Processing , 30: 998–1003, 1982.

[17] B. Champagne, S. Bedard, and A. Stephenne. Performance of time-delay estimation in presence of room reverberation. IEEE Transactions on Speech and Audio Processing , 4: 148–152, 1996.