Proceedings of the Institute of Acoustics, Vol. 44, Pt. 3

Analysis of complex-valued neural networks for audio source localisation

Vlad S. Paul, Institute of Sound and Vibration Research, University of Southampton, UK
Philip A. Nelson, Institute of Sound and Vibration Research, University of Southampton, UK

1 INTRODUCTION

Machine learning has become increasingly popular in the audio signal processing field, with comprehensive surveys [1-3] showing the rapid increase in applications and network performance. Among the most popular applications are speech analysis and synthesis, audio source separation and detection, and audio source localisation and tracking. The latter has attracted increasing interest due to the development of challenges such as Detection and Classification of Acoustic Scenes and Events (DCASE) and Learning 3D Audio Sources (L3DAS), which provide large datasets on which participants can train various neural network architectures to solve the tasks. The rapid progress of machine learning techniques in the audio source localisation field was driven by earlier papers, notably [4-6], which demonstrated improved localisation accuracy when using neural networks compared to well-established signal processing techniques such as beamforming, MUSIC or ESPRIT.

Considering the use of machine learning for sound source localisation alone, the survey by Grumiaux et al. [1] shows the considerable amount of work undertaken in recent years and, interestingly, all but one of the papers mentioned there use real-valued neural networks (RVNNs), in which all parameters are real. The exception, by Tsuzuki et al. [7], used a complex-valued neural network (CVNN) to localise sound sources and showed that, for their particular application, the localisation accuracy improved compared to the performance of RVNNs.

In general, RVNNs remain the most popular network architecture. One possible reason is that CVNNs contain complex-valued inputs, training parameters and outputs, so every operation on the data incurs greater computational complexity. In addition, the design of fully complex activation functions is complicated by Liouville's theorem [8], which states that a complex-valued function cannot be both bounded and differentiable everywhere unless it is a constant. Consequently, complex-valued activation functions can be divided into those that are fully differentiable, those that are differentiable only around certain points, and those that are not differentiable at all [9].

Despite these design difficulties, recent studies have shown the benefits of CVNNs, especially in signal processing applications such as MRI, speech enhancement and image editing. The survey by Lee et al. [10] documents the growing popularity of CVNNs over RVNNs in some cases, especially when the relation between phase and magnitude (or real and imaginary parts) is of central importance. RVNNs usually treat the real and imaginary parts of an input as separate components, which doubles the number of learnable parameters, whereas CVNNs work directly with the complex-valued data. Another important advantage of CVNNs during training has been discussed by Hirose [11], who suggests that because complex-valued operations amount to a phase rotation and a magnitude attenuation, the degrees of freedom in the neural network are reduced, which can lead to better generalisation.
Based on the observation that the use of CVNNs is increasing in the signal processing field, while little research has yet been undertaken on audio source localisation using complex-valued data, this paper compares the performance of CVNNs with that of RVNNs on different localisation tasks. A particular focus is the influence of the choice of input features on localisation performance.

2 THE COMPLEX MULTILAYER PERCEPTRON

This section presents the analytical extension of the real multilayer perceptron (MLP) to its complex version, using complex-valued data as a one-dimensional input. The derivation is based on the rules of the Wirtinger calculus [12], and the gradients are derived in a final form that can easily be implemented for a complex MLP with any number of hidden layers. Since most of the theory has been presented previously by Paul and Nelson [13], only the essential parts are discussed here.

2.1 Forward and Backward Propagation in the Complex MLP

The architecture of a multilayer perceptron (MLP) with one hidden layer is shown in Figure 1.

Figure 1: MLP model with one hidden layer

The forward propagation of such an MLP network can be written in matrix form as

a^(2) = W^(2) x + b^(2),   z^(2) = f(a^(2)),
a^(1) = W^(1) z^(2) + b^(1),   z^(1) = f(a^(1)),

where the weights of the network are denoted by W^(2), W^(1), the biases by b^(2), b^(1), and a^(2), z^(2) denote respectively the inputs and outputs of the hidden layer. Similarly, a^(1) is the input into the output layer and z^(1) = f(a^(1)) is the output of the network. All variables are defined in the complex domain.

The loss function can be defined as the sum of squared errors

L = e^H e = Σ_k |e_k|²,

where e_k = z_k^(1) − d_k is the difference between the network output and the desired output, k denotes the k-th component of the vector d, and the superscript H denotes the Hermitian transpose. Using an approach similar to that used for the MLP with real variables [14], the backpropagation for the complex MLP is derived by evaluating the gradients of the real loss function L with respect to the complex weights.

2.1.1 Complex Backpropagation for an MLP with any number of Hidden Layers

The detailed derivation of the complex gradients with respect to the weight matrices W^(1), W^(2) has been presented in Paul and Nelson [13] and is not repeated here. Using the chain rule, the complex backpropagation of an MLP with any number of hidden layers can be expressed in terms of the vector w^(l) = vec(W^(l)) containing all the values in W^(l). Following the steps detailed in Paul and Nelson [13], a general gradient expression for the l-th layer is

∂L/∂W^(l)* = δ^(l) (z^(l+1))^H,

where z^(l+1) is the output of the preceding layer, that is, the input into the l-th layer (equal to the network input x for the last hidden layer). The vector δ^(l) contains the gradient terms accumulated from the output back to the input of the l-th layer, and is built recursively as

δ^(l) = Λ^(l) Ω^(l−1) δ^(l−1),   with δ^(1) = Λ^(1) e,

where Λ^(l) is the composite matrix containing the diagonal matrices of derivatives of the activation functions at the l-th layer, and Ω^(l−1) is the composite matrix built from the weight matrix of the previous layer. The exact block forms of Λ^(l) and Ω^(l−1), which account for the Wirtinger derivatives with respect to both a^(l) and its conjugate, are given in full in Paul and Nelson [13].

To make the dependencies between the variables clear, Figure 2 shows a diagram of the complex MLP with any number of hidden layers, together with the method for computing the backpropagation algorithm. The update equations for the weight matrices are shown in the figure below, where ∂L/∂W^(l)* is given by the general gradient expression above.

Figure 2: The backpropagation algorithm for a complex MLP with L hidden layers
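To make the propagation and gradient rules above concrete, the following MATLAB sketch implements one training loop of a complex MLP with a single hidden layer. It is a minimal illustration under stated assumptions, not the authors' implementation: it uses the holomorphic tanh activation rather than the complex cardioid function adopted later in the paper, random data in place of array features, and plain gradient descent with an illustrative complex-valued learning rate.

```matlab
% Minimal sketch of complex backpropagation for a one-hidden-layer MLP.
% Layer numbering follows the paper: layer 2 = hidden, layer 1 = output.
rng(0);
Nin = 8; Nh = 20; Nout = 1;                    % illustrative layer sizes
x = randn(Nin,1) + 1j*randn(Nin,1);            % complex input vector
d = exp(1j*pi/6);                              % target, e.g. e^{j*theta}

W2 = (randn(Nh,Nin)  + 1j*randn(Nh,Nin))  / sqrt(Nin);  b2 = zeros(Nh,1);
W1 = (randn(Nout,Nh) + 1j*randn(Nout,Nh)) / sqrt(Nh);   b1 = zeros(Nout,1);

f  = @(z) tanh(z);                 % holomorphic activation (assumption)
fp = @(z) 1 - tanh(z).^2;          % its complex derivative
mu = 0.002 + 0.001j;               % complex learning rate (assumption)

for epoch = 1:100
    % forward propagation
    a2 = W2*x + b2;   z2 = f(a2);              % hidden layer
    a1 = W1*z2 + b1;  z1 = f(a1);              % output layer
    e  = z1 - d;                               % complex error
    L  = real(e'*e);                           % sum of squared errors

    % backward propagation: Wirtinger gradients with respect to W^*
    d1 = conj(fp(a1)) .* e;                    % delta at the output layer
    d2 = conj(fp(a2)) .* (W1'*d1);             % delta at the hidden layer

    W1 = W1 - mu*(d1*z2');   b1 = b1 - mu*d1;  % z2' is the Hermitian transpose
    W2 = W2 - mu*(d2*x');    b2 = b2 - mu*d2;
end
```

With a real-valued μ this is the standard complex gradient descent update W ← W − μ ∂L/∂W*; a complex-valued μ additionally rotates the gradient in the complex plane, which is the effect exploited by the complex-valued learning rates discussed in Section 4.2.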
3 FUNDAMENTALS OF MICROPHONE ARRAY PROCESSING

Consider a sensor array consisting of M microphones that can be placed anywhere in two- or three-dimensional space, and L far-field sources with source strengths q_l(n), l = 1, ..., L, whose signals are recorded by the sensor array. An example of the coordinate system is shown in Figure 3.

Figure 3: 3D coordinate system with an audio source of strength q

The figure shows a source of strength q positioned in 3D space, whose location can be defined using the rectangular coordinates x, y, z or the spherical coordinates r, θ, φ. With φ denoting azimuth and θ elevation, the relationship between these two coordinate systems is described by

x = r cos θ cos φ,   y = r cos θ sin φ,   z = r sin θ.

Assuming spherical coordinates, the difference between two source locations (or an estimated versus a target location) can be computed using an angular Euclidean distance, defined for a single angle θ as

d_E(θ₁, θ₂) = |e^{jθ₁} − e^{jθ₂}| = 2 |sin((θ₁ − θ₂)/2)|,

where θ₁, θ₂ are the two locations. For a location consisting of both θ and φ, the difference is computed by averaging d_E(θ₁, θ₂) and d_E(φ₁, φ₂). This error measure is chosen to account for the fact that some angle values can be close to each other on the circle but far away from each other in terms of mean-squared error. As an example, 359° and 1° are close to each other on the circle, but the mean-squared error between them is high.

The pressure at the microphones can be computed using a spherical wave propagation monopole source model, in which the source strength is delayed and attenuated according to the distance between the source and the microphones. Assuming there are L sources, the pressure at the m-th microphone can be expressed as

p_m(t) = Σ_{l=1}^{L} q_l(t − r_{l,m}/c) / (4π r_{l,m}),

where q_l(t) denotes the strength of the l-th source and r_{l,m} denotes the distance from the l-th source to the m-th microphone. The distance r_{l,m} is simply computed as

r_{l,m} = √[(x_l − x_m)² + (y_l − y_m)² + (z_l − z_m)²],

where x_l, y_l, z_l are the coordinates of the l-th source and x_m, y_m, z_m are the coordinates of the m-th microphone.
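As a concrete illustration of the propagation model, the following MATLAB sketch synthesises the frequency-domain pressures received by a small array from a single monopole source and evaluates the angular error between two angles. The geometry, sampling rate and FFT length are illustrative assumptions rather than the exact values used in the simulations below.

```matlab
% Sketch: monopole pressures at an array and the angular error measure.
c  = 343;                                % speed of sound in m/s
fs = 16000; N = 256;                     % sampling rate, FFT length (assumptions)
f  = (0:N-1).' * fs/N;                   % frequency axis in Hz

mics = [0 0 0; 0.02 0 0; 0 0.02 0; 0 0 0.02];  % example 4-mic layout (m)
src  = [cos(pi/3) sin(pi/3) 0];                % one source at 1 m, 60 deg azimuth
q    = fft(randn(N,1));                        % white-noise source strength (freq. domain)

M = size(mics,1);  P = zeros(N,M);
for m = 1:M
    r = norm(src - mics(m,:));                 % distance r_{l,m}
    P(:,m) = q .* exp(-1j*2*pi*f*r/c) / (4*pi*r);  % delay and 1/(4*pi*r) attenuation
end

% Angular error: Euclidean distance on the unit circle, mapped back to an angle
dE = @(t1,t2) 2*asin(min(abs(exp(1j*t1) - exp(1j*t2))/2, 1));
rad2deg(dE(deg2rad(359), deg2rad(1)))          % returns 2 degrees, not 358
```

Taking the inverse FFT of each column of P would give the delayed and attenuated time signals; here the complex spectra are kept directly, since they form the basis of the input features described in Section 4.3.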
4 METHODOLOGY

4.1 Dataset Generation

The comparison between the real- and complex-valued networks was performed for two different cases: a 2D scenario with 4 sensors placed in a linear microphone array (LMA) with a spacing of 2 cm, and a 3D case with 4 sensors placed in a tetrahedral arrangement (one microphone on each of the 3 axes and one at the centre), where the distance between microphones was also 2 cm. In the 2D scenario, 10 equally spaced source locations were picked on the half-plane in front of the array at a radius of 1 m, while in the 3D case, 25 sources equally spaced on a sphere of radius 1 m around the array were used to train the networks. Figures 4 and 5 show the positions of the microphones and the sources for both cases. The distance from sources to microphones has been rescaled for visualisation purposes. The circles correspond to the sources and the symbol x corresponds to the microphone positions.

Figure 4: Source and microphone positions in the 2D scenario

Figure 5: Source and microphone positions in the 3D scenario

The source strength signal used for all sources was white noise; for each source location, around 1200 signals were used during training and 150 for testing the generalisation of the network.

4.2 Network Initialisation

Several multilayer perceptron networks (MLPs) were designed in MATLAB for both real- and complex-valued data. The real-valued MLP is denoted rMLP and the complex-valued MLP cMLP. One hidden layer was used and the size of the layer was varied. Two different sizes of input feature were used.

The networks were trained to estimate the location of the sources in terms of azimuth φ and elevation θ, but in order to enable a fair comparison between the rMLPs and cMLPs, the locations were expressed as complex numbers using e^{jθ} = cos(θ) + j sin(θ). For example, if a source is positioned at φ = 60° (π/3) and θ = 30° (π/6), the target location is e^{jφ} ≈ 0.5 + 0.866j and e^{jθ} ≈ 0.866 + 0.5j. For the rMLP, the real and imaginary values were concatenated into a 2×1 vector, while in the cMLP case, the output was a single complex-valued number.

The activation function used for the rMLP was ReLU in the hidden layer and tanh in the output layer, since the output values lie between −1 and 1. In the cMLP case, the complex cardioid function [15] was used in both layers. The error between the estimated and target outputs during training was computed using the mean squared error. Training used a real-valued learning rate for the rMLP and either a real- or a complex-valued learning rate for the cMLP. The choice of a complex-valued rate is based on recent studies [16,17] that showed the potential of a complex-valued step size when implementing the complex-valued gradient descent algorithm. In short, a complex-valued learning rate makes it possible to weight the real and imaginary gradient terms differently when updating the weight matrices.

4.3 Feature Extraction

Once the pressure at the microphone arrays had been computed, several different features were used to evaluate the network performance. The first input feature was the FFT spectrum of the multichannel signals: for the rMLP, the real and imaginary values were concatenated into a long one-dimensional vector (denoted real-imag), while for the cMLP, the FFT spectrum was applied directly as a complex-valued input to the network. An additional input feature for the rMLP was the magnitude and phase of the FFT spectrum concatenated into a vector (denoted mag-phase).

The second type of input feature is related to the generalised cross-correlation with phase transform (GCC-PHAT) between the microphone channels [18]. The GCC-PHAT is defined for a pair of microphones as

R₁₂(τ) = Σ_{k=0}^{N−1} [ X₁(k) X₂*(k) / |X₁(k) X₂*(k)| ] e^{j2πkτ/N},

where τ is the discrete time delay used when computing the inverse FFT, X₁, X₂ are the N-point FFT spectra of the two microphone signals, and k corresponds to the frequency bin, which should not be confused with the acoustic wavenumber. The expression above can be seen as the inverse FFT of the weighted cross power spectrum (CPS) between the microphone signals. For the training of both rMLPs and cMLPs, only the weighted CPS was used, without the inverse FFT, in order to keep the data in the complex-valued domain. This is given by

G₁₂(k) = X₁(k) X₂*(k) / |X₁(k) X₂*(k)|,

which takes a different value at each frequency bin k. For the real networks, both the real-imag and the mag-phase pairs of the CPS were used as separate input features, while for the cMLP, the output of the weighted CPS was used directly to train the network. To create two different input feature sizes, two FFT lengths were chosen (256 and 512 samples). Since the MLP accepts vectors as inputs, the multichannel features were concatenated into a single long 1D vector.
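The feature construction described above can be sketched in a few lines of MATLAB. The snippet below builds, from one block of multichannel signals, both the concatenated FFT-spectrum features and the PHAT-weighted CPS features for all microphone pairs; the placeholder signals and variable names are illustrative assumptions, not the authors' code.

```matlab
% Sketch of the input features of Section 4.3.
N = 256;                                   % FFT length (256 or 512 in the paper)
p = randn(N, 4);                           % placeholder: N samples x 4 microphones
X = fft(p, N, 1);                          % N-point FFT of every channel

% Feature 1: FFT spectrum
x_cmlp     = X(:);                         % complex-valued input for the cMLP
x_realimag = [real(X(:)); imag(X(:))];     % real-imag input for the rMLP
x_magphase = [abs(X(:));  angle(X(:))];    % mag-phase input for the rMLP

% Feature 2: PHAT-weighted cross power spectrum (CPS) for every pair (i,j):
% G_ij(k) = X_i(k) X_j*(k) / |X_i(k) X_j*(k)|
M = size(X,2);  G = [];
for i = 1:M-1
    for j = i+1:M
        C = X(:,i) .* conj(X(:,j));        % cross power spectrum of the pair
        G = [G; C ./ max(abs(C), eps)];    % PHAT weighting, guarded against /0
    end
end
cps_cmlp     = G;                          % complex-valued input for the cMLP
cps_realimag = [real(G); imag(G)];         % real-imag input for the rMLP
cps_magphase = [abs(G);  angle(G)];        % mag-phase input for the rMLP
```

For 4 microphones this yields 6 pairwise CPS vectors; in practice the frequency bins would also be restricted to the aliasing-free range discussed next, which shortens the final feature vectors accordingly.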
The frequency range used during the analysis was chosen so that there was no spatial aliasing at high frequencies. A maximum frequency limit was chosen based on the relation between the microphone spacing d and the speed of sound c, given by

f_max = c / (2d).

For the microphone arrays used in these simulations, a spacing of 2 cm between microphones results in a maximum frequency of f_max ≈ 8000 Hz.

5 RESULTS

The comparisons between the rMLP and cMLP are divided into the 2D and the 3D scenarios. The performance of the networks is expressed using the testing dataset computed from 150 signals per location. The networks were trained 10 times using the same input features and training parameters and the performance results were averaged, since the random weight initialisation leads to slightly different performance on every run. The performance of the networks was computed using the angular error defined above, and the resulting values were converted to degrees for easier interpretation.

5.1 2D Localisation Scenario

Table 1 shows the localisation performance of both network types using the input features described above, for an rMLP and a cMLP with an FFT length of 256 samples used to create the input features. One hidden layer of 20 neurons was used in the network architecture. The angular error was averaged first over the 10 trials and then over all location estimates.

Table 1: 2D localisation performance in degrees using 1 hidden layer with 20 neurons and an FFT length of 256 to create the input feature. The performance is computed using the angular error and is averaged over 10 trials.

Network | FFT spectrum, real-imag | FFT spectrum, mag-phase | CPS, real-imag | CPS, mag-phase
rMLP    | 51.52°                  | 38.60°                  | 3.03°          | 2.97°
cMLP    | 14.50°                  | N/A                     | 5.29°          | N/A

The results show that the cMLP works better than the rMLP if the FFT spectrum is used directly as an input feature; however, the rMLP outperforms the cMLP by a small amount when the CPS is used as input. In the table above, the best cMLP performance was achieved using the complex-valued learning rate 0.001 + 0.007j, while the real learning rate was 0.002. It is interesting to observe that the performance difference between the two main input features is quite high in the rMLP case (2.97° compared to 51.52°), which suggests that the rMLP was not able to learn well using the FFT spectrum as an input feature. The cMLP appears less sensitive to the choice of input feature, as illustrated by the smaller differences in performance. Comparing the estimated locations with the target locations, it can be observed in the figures below that both the rMLP and the cMLP give results very close to the target values.

Figure 6: Target vs. estimated angle values for the 2D task using the rMLP

Figure 7: Target vs. estimated angle values for the 2D task using the cMLP

5.2 3D Localisation Scenario

Moving on to the 3D case, Table 2 shows the same comparison between networks and features as in the 2D case.

Table 2: 3D localisation performance in degrees using 1 hidden layer with 20 neurons and an FFT length of 256 to create the input feature. The performance is computed using the angular error and is averaged over 10 trials.

Network | FFT spectrum, real-imag | FFT spectrum, mag-phase | CPS, real-imag | CPS, mag-phase
rMLP    | 58.45°                  | 37.77°                  | 0.65°          | 1.75°
cMLP    | 9.20°                   | N/A                     | 3.84°          | N/A

The behaviour of the real- and complex-valued MLPs in the 3D task is similar to the 2D case: the rMLP performs best using the CPS as input feature, while the cMLP performs better using the FFT spectrum as input. Overall, using the CPS as input feature gives better performance than using the FFT spectrum.
If one looks at the angle estimates, the figures below show the difference between the best rMLP performance and the best cMLP performance in the 3D localisation scenario.

Figure 8: Target vs. estimated angle values for the 3D task using the rMLP

Figure 9: Target vs. estimated angle values for the 3D task using the cMLP

5.3 Discussion

Overall, it can be observed that the best performance of the rMLP is similar to that of the cMLP, with the cMLP estimates being slightly worse only at some locations. Even when the network architecture is changed in terms of the size of the input or hidden layer, the ranking of the performance remains similar. The main observation is that the cMLP with the chosen hyperparameters can outperform the rMLP when the FFT spectrum is used as input, and comes very close to the performance of the rMLP when the CPS is used. In addition, the cMLP appears quite robust to different input features, while the rMLP is very sensitive and is not able to learn using the FFT spectrum as input feature.

The main issue discovered during training of the cMLP was its slightly irregular convergence compared to that of the rMLP. This can be seen in the figures below, where the decay of the cost function for the same training task is shown on a logarithmic scale for all 10 trials, using 1 hidden layer with 20 neurons and an FFT length of 256 to create the input features.

Figure 10: Convergence of the rMLP for 10 trials using 1 hidden layer with 20 neurons and an FFT length of 256 to create the input features

Figure 11: Convergence of the cMLP for 10 trials using 1 hidden layer with 20 neurons and an FFT length of 256 to create the input features

It can be observed that all 10 trials of the rMLP converge in a similar way, without any sudden changes in the decay of the cost function. In the cMLP case, the cost functions of the 10 trials decay less smoothly, which on the one hand can make it easier to escape local minima, but on the other hand can lead to worse generalisation performance. The choice of training hyperparameters such as the learning rate, network shape or activation functions therefore appears crucial for a smooth convergence of the cMLP. Our simulations show that even a small change in, for example, the learning rate can lead to a different cMLP performance. The convergence behaviour of the proposed cMLP therefore has to be investigated further in order to show its full potential. A different activation function or a different complex-valued learning rate might improve the localisation performance; however, a cMLP with more hidden layers or a different hidden layer size did not show any change in performance in our simulations.

6 CONCLUSION

This paper has attempted to present a fair comparison between real- and complex-valued MLPs for localising a sound source in two different scenarios, using two different 4-microphone arrays. Simulation results show the clear potential of the complex-valued MLP for audio source localisation using only the FFT of the multichannel signal as a training feature. In future research, the authors plan to analyse how training hyperparameters such as complex-valued activation functions and learning rates influence the convergence of the cMLP. In addition, other data representation techniques such as spherical harmonics will be tested as input features for training the networks.
7 REFERENCES

[1] P.-A. Grumiaux, S. Kitic, L. Girin, and A. Guérin, "A survey of sound source localization with deep learning methods," The Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 107-151, 2022.
[2] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath, "Deep learning for audio signal processing," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206-219, 2019.
[3] M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, "Machine learning in acoustics: Theory and applications," The Journal of the Acoustical Society of America, vol. 146, no. 5, pp. 3590-3628, 2019.
[4] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2814-2818. IEEE, 2015.
[5] S. Chakrabarty and E. A. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 136-140. IEEE, 2017.
[6] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1462-1466. IEEE, 2018.
[7] H. Tsuzuki, M. Kugler, S. Kuroyanagi, and A. Iwata, "An approach for sound source localization by complex-valued neural network," IEICE Transactions on Information and Systems, vol. 96, no. 10, pp. 2257-2265, 2013.
[8] M. Rosenlicht, "Liouville's theorem on functions with elementary integrals," Pacific Journal of Mathematics, vol. 24, no. 1, pp. 153-161, 1968.
[9] J. Bassey, L. Qian, and X. Li, "A survey of complex-valued neural networks," arXiv preprint arXiv:2101.12249, 2021.
[10] C. Lee, H. Hasegawa, and S. Gao, "Complex-valued neural networks: A comprehensive survey," IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 8, pp. 1406-1426, 2022.
[11] A. Hirose, "Nature of complex number and complex-valued neural networks," Frontiers of Electrical and Electronic Engineering in China, vol. 6, no. 1, pp. 171-180, 2011.
[12] W. Wirtinger, "Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen," Mathematische Annalen, vol. 97, no. 1, pp. 357-375, 1927.
[13] V. S. Paul and P. A. Nelson, "Complex-valued neural networks for audio signal processing," in Proceedings of the Institute of Acoustics, Vol. 43, Pt. 3, Milton Keynes, United Kingdom, 2021.
[14] V. S. Paul and P. A. Nelson, "Matrix analysis for fast learning of neural networks with application to the classification of acoustic spectra," The Journal of the Acoustical Society of America, vol. 149, no. 6, pp. 4119-4133, 2021.
[15] P. Virtue, S. X. Yu, and M. Lustig, "Better than real: Complex-valued neural nets for MRI fingerprinting," in 2017 IEEE International Conference on Image Processing (ICIP), pp. 3953-3957. IEEE, 2017.
[16] H. Zhang and D. P. Mandic, "Is a complex-valued stepsize advantageous in complex-valued gradient learning algorithms?," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2730-2735, 2015.
[17] Y. Zhang and H. Huang, "Adaptive complex-valued stepsize based fast learning of complex-valued neural networks," Neural Networks, vol. 124, pp. 233-242, 2020.
[18] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.