Proceedings of the Institute of Acoustics, Vol. 44, Pt. 2

Inter-channel Conv-TasNet for source-agnostic multichannel audio enhancement

Dongheon Lee1, KAIST, Daejeon, Republic of Korea
Jung-Woo Choi2, KAIST, Daejeon, Republic of Korea

ABSTRACT

Deep neural network (DNN) models for the audio enhancement task have been developed in various ways. Most of them rely on source-dependent characteristics, such as the temporal or spectral characteristics of speech, to suppress the noise embedded in measured signals. Only a few studies have attempted to exploit the spatial information embedded in multichannel data. In this work, we propose a DNN architecture that fully exploits inter-channel relations to realize source-agnostic audio enhancement. The proposed model is based on the fully convolutional time-domain audio separation network (Conv-TasNet) but is extended to extract and learn spatial features from multichannel input signals. The use of spatial information is facilitated by separating each convolutional layer into dedicated inter-channel 1×1 Conv blocks and 2-D spectro-temporal Conv blocks. The performance of the proposed model is verified through training and testing with heterogeneous datasets including speech and other audio datasets, which demonstrates that the enriched spatial information extracted by the proposed architecture enables versatile audio enhancement in a source-agnostic way.

1. INTRODUCTION

Recently, many DNN models for single-channel speech separation or enhancement have been proposed [1, 2]. These models learn the characteristics of sound sources to separate them from a single-channel mixture of multiple sources. However, if a model is trained for a certain type of sound source, such as speech, it does not work well for other source types. One possible remedy to this source-dependency problem is to use spatial information from multichannel measurements rather than source characteristics. The directional characteristics of a target signal and background noise differ and can be exploited to reduce the noise in the audio enhancement task.

The spatial information between sound sources and microphones, or between microphones themselves, can be utilized in many different ways. Popular choices for binaural signals are the interaural phase difference (IPD), interaural time difference (ITD), and interaural level difference (ILD). For multichannel signals, beamforming techniques have been developed to separate or enhance signals in terms of their angles of incidence [3], inter-channel coherence, correlation, and relative transfer functions. However, beamforming techniques require pre-defined models of acoustic propagation or the array manifold, and only linear filtering is possible to separate signals. To exploit spatial information, DNN-based multichannel source separation techniques have been developed [4]. The approach proposed in this work also utilizes spatial information from the multichannel input signal to realize source-agnostic audio enhancement.

In recent years, deep learning-based audio denoising and enhancement techniques have been studied extensively. These techniques aim to generate a clean audio signal by estimating spectrograms [5] or time-domain waveforms [6]. In the spectrogram estimation approach, a DNN model is configured to learn a mapping between input and output spectrograms.
Because of its weakness in phase reconstruction, the magnitude spectrogram-based approach has quickly been replaced by complex spectrogram-based techniques [7]. However, both approaches use the short-time Fourier transform (STFT) to obtain the spectrogram, which is not optimized for sound enhancement tasks [8]. For this reason, several end-to-end approaches that directly use time-domain waveforms as input and output have been proposed [6, 8]. Among end-to-end DNN architectures, Conv-TasNet [9] is a popular model proposed for speech separation and later applied to speech and audio enhancement. Conv-TasNet utilizes a temporal convolutional network (TCN) [10] composed of depthwise dilated one-dimensional convolutional (1-D Conv) blocks. This reduces the checkerboard artifacts induced by the down- and up-sampling blocks of other architectures such as autoencoders and U-Nets, which is important for the quality of the enhanced audio signal. Its computation time can also be reduced by performing the convolutions in parallel.

Following the success of Conv-TasNet in the single-channel speech separation task, extensions to multichannel speech separation have been attempted. For example, multichannel Conv-TasNet (MC Conv-TasNet) [4] adds a multichannel encoder to single-channel Conv-TasNet (SC Conv-TasNet): the encoder processes the multichannel input data, and all encoded outputs are summed. However, the spatial information is lost by this summation, so the multichannel information cannot be fully utilized in the TCN.

To accomplish end-to-end source-agnostic audio enhancement, we use the multichannel architecture inter-channel Conv-TasNet (IC Conv-TasNet) [11], which was recently proposed for multichannel speech enhancement and outperforms existing state-of-the-art models on the CHiME-3 dataset [12]. IC Conv-TasNet enhances speech by extracting rich inter-channel features through the combination of point-wise and 2-D dilated convolutions in the TCN. In this work, we demonstrate that IC Conv-TasNet can be applied to various source signals and can even enhance source signals unseen during the training stage, realizing true source-agnostic audio enhancement independent of the source signal type.

The rest of this paper is organized as follows. In the following section, the structures of SC and IC Conv-TasNet are briefly explained. In the experiment section, the construction of the source-agnostic dataset is discussed. We then compare training and test results obtained with two datasets (speech-only and source-agnostic audio). Finally, the analysis of the results and their significance are summarized in the conclusion.

2. MODEL ARCHITECTURE

The baseline architecture, SC Conv-TasNet, is briefly reviewed here. SC Conv-TasNet was developed for single-channel audio separation and consists of an encoder, a masking network, and a decoder. The input waveform is divided into L overlapping segments of window length K. The 1-D Conv layer of the encoder module converts each segment into F features. The encoder output can be expressed as

w = ReLU(xU),

where x ∈ ℝ^{L×K} is the segmented input time-domain waveform, U ∈ ℝ^{K×F} is an encoding matrix, ReLU [13] is the rectified linear unit guaranteeing a non-negative representation, and w ∈ ℝ^{L×F} is the encoder output. The encoder output is then fed to the masking network that estimates an audio separation mask. For the audio enhancement task, however, only one target signal is to be separated.
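As a reading aid, the encoder stage described above can be sketched in a few lines of PyTorch. This is a minimal illustration only, not the authors' implementation; the hyperparameter values (window length K = 256 with 50% overlap and F = 512 features) are taken from Section 3.2, and all names are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the Conv-TasNet encoder stage, w = ReLU(xU).

    A strided 1-D convolution splits the waveform into L overlapping
    segments of length K and maps each segment to F non-negative features.
    """
    def __init__(self, K: int = 256, F: int = 512):
        super().__init__()
        # stride = K // 2 gives 50% overlap between consecutive segments
        self.conv = nn.Conv1d(1, F, kernel_size=K, stride=K // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) -> w: (batch, F, L)
        return torch.relu(self.conv(x))

# Example: one second of audio at 16 kHz -> encoder output of shape (1, 512, L)
w = Encoder()(torch.randn(1, 1, 16000))
```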
The estimated mask m is applied to the encoder output as

d = w ⊙ m,

where ⊙ denotes the element-wise product, m ∈ ℝ^{L×F} is the estimated mask, and d ∈ ℝ^{L×F} is the masked encoder output. The decoder decodes the masked encoder output by a transposed 1-D Conv layer. The decoding process is the inverse operation of the encoder and transforms the masked encoder output into a time-domain waveform. For a decoding matrix V ∈ ℝ^{F×K}, the decoded signal segments are given by

ŝ = dV,

which are then combined into the final waveform using an overlap-and-add operation.

The masking network of SC Conv-TasNet consists of a single 1×1 Conv layer compressing the number of features from F to N, followed by multiple stacks of 1-D Conv blocks. We denote this 1×1 Conv layer as the ‘bottleneck layer’. Each 1-D Conv block includes a 1×1 Conv layer upsizing the number of features from N to the number of hidden channels H, a 1-D dilated depthwise convolution (D-Conv) layer, and two 1×1 Conv layers for a residual path and a skip connection, respectively. A nonlinear activation function (PReLU) and layer normalization are applied after the 1×1 Conv and D-Conv layers, respectively. The series of 1-D Conv blocks constituting a single TCN stack uses increasing dilation factors to enlarge the receptive field. Multiple stacks are used to extract different information through multiple skip connections, which are subsequently added to estimate a source separation mask. The estimated mask is multiplied element-wise with the encoder output, and the product is transformed into a separated waveform through a transposed 1-D Conv layer.

In MC Conv-TasNet, introduced to extend SC Conv-TasNet to multichannel signals, the encoder module has multiple 1-D Conv layers connected to the individual microphone channels. The 3-D tensor of size (L, F, M) obtained from the encoder is superposed along the microphone channel dimension to yield a tensor of size (L, F). This superposition makes the 2-D encoder output of MC Conv-TasNet compatible with the TCN architecture of SC Conv-TasNet. The rest of the network, i.e., the mask estimation network and decoder, has the same structure as SC Conv-TasNet [9]. However, spatial information such as inter-channel relations is completely lost afterward because all channel signals are simply superposed.

In contrast, IC Conv-TasNet can fully exploit the spatial information between channels throughout the TCN blocks of the masking network. The differences between IC Conv-TasNet and the previous networks are the stacking of encoder outputs into a 3-D tensor and the use of 2-D Conv instead of 1-D Conv in the TCN. The overall architecture of IC Conv-TasNet is shown in Figure 1(a). The 1-D Conv layers of the encoder are applied to each microphone channel, and the encoder outputs of the individual channels are stacked along the channel dimension in the form of a 3-D tensor. The distinguishing feature of IC Conv-TasNet is its TCN blocks, which are modified to handle the 3-D tensor. There are two 1×1 Conv blocks in the bottleneck, one increasing the channel dimension from M to C and the other decreasing the feature dimension from F to N. The enriched channel and feature dimensions are processed separately in the following TCN blocks. Inside each 2-D Conv block of the TCN (Figure 1(b)), a 1×1 Conv layer increases the channel dimension from C to the number of hidden channels H. This operation mixes the spatial information along the channel dimension, while the feature and time dimensions remain unaltered.
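As a minimal illustration of this channel-wise mixing, the sketch below places the stacked microphone channels on the Conv2d channel axis (an assumed PyTorch tensor layout; the dimension sizes follow Section 3.2). A 1×1 2-D convolution then forms learned mixtures of the C channel values at each (feature, time) position while leaving the feature and time axes untouched.

```python
import torch
import torch.nn as nn

C, H, N, L = 8, 32, 128, 124      # channels, hidden channels, features, time frames
x = torch.randn(1, C, N, L)       # 3-D feature map stacked along the channel axis

# 1x1 Conv over the channel axis: each output value is a learned mixture of the
# C channel values at the same (feature, time) position.
channel_mix = nn.Conv2d(C, H, kernel_size=1)
y = channel_mix(x)

print(y.shape)                    # torch.Size([1, 32, 128, 124]) -- N and L unchanged
```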
The D-Conv then takes the resulting feature map of size (L, N, H) and extracts feature- and time-dependent information through a 2-D dilated convolution. The size of the feature map is maintained by applying 2-D zero padding. Spatial information is then extracted by two 1×1 Conv layers preceding the skip connection and the residual path. These 1×1 Conv layers expand or compress information between channels and hence can extract inter-channel features. SC and MC Conv-TasNet apply 1×1 Conv along the feature dimension and therefore cannot exploit inter-channel relations. In contrast, IC Conv-TasNet separates the feature and channel dimensions and extracts spatial information through 1×1 Conv layers operating along the channel dimension, focusing on the relationship between channels.

The skip connections from the TCN stacks are summed to build a single-channel mask for audio enhancement. The resultant feature map is fed to a PReLU [14] activation function and compressed to a single channel through a 1×1 Conv layer. Another 1×1 Conv layer then increases the feature dimension from N to F to match the mask size to that of the encoder output. A sigmoid function constrains the values of the estimated mask within [0, 1]. The generated mask of size (L, F) and the encoder output of the reference channel are multiplied element-wise to produce the masked encoder output. The decoder part of this model is the same as that of SC Conv-TasNet, in which a transposed 1-D Conv block transforms the masked encoder output into a 1-D waveform.

In summary, the 2-D Conv block of IC Conv-TasNet aggregates spatial information through its 1×1 Conv layers. The main question, however, is how much of this spatial information is useful for enhancing an audio signal compared with single-channel techniques. Since single-channel techniques rely on source characteristics identified from the training data, their performance degrades for signals unseen during training. To investigate the source-signal dependency of the single- and multichannel models, we conducted tests with various source signals seen or unseen during the training stage.

Figure 1: (a) Overall architecture of IC Conv-TasNet; (b) structure of the 2-D Conv block.

3. EXPERIMENT

3.1. Construction of Datasets

To test the performance of the models against various source signals, we constructed two different multichannel datasets. The first dataset is CHiME-3, which includes 10,098 speech recordings captured by six microphones installed on a tablet device. CHiME-3 provides both clean speech and background noise data in a multichannel format, as well as tools to extract multichannel impulse responses (IRs) based on the tracked time differences of arrival (TDoAs) from a speaker to the microphones. Since posture changes and head movements of the speaker were allowed during the measurement, the IRs of the individual recordings are all different. The CHiME-3 dataset, however, contains only speech signals, so DNN models trained on this dataset would exploit source context information for audio enhancement. To evaluate the source-agnostic performance, we prepared a second multichannel dataset using the Freesound Dataset 50k (FSD50k) [15]. FSD50k contains 51,197 mono audio clips covering 200 class labels drawn from the AudioSet ontology. Accordingly, FSD50k has no unified context and is advantageous for testing the source-independent performance.
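Since FSD50k clips are mono, they must first be rendered into multichannel noisy mixtures; the exact procedure used here is described in the following paragraph. As a rough illustration only, a generic spatialization step could look like the numpy/scipy sketch below. The array shapes, the helper name, and the absence of any level adjustment are assumptions, and the CHiME-3 IR-extraction tools actually used by the authors are not reproduced.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono_clip: np.ndarray, irs: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Render a mono clip into a noisy multichannel mixture (illustrative only).

    mono_clip: (T,)       single-channel source signal
    irs:       (M, T_ir)  multichannel room impulse responses
    noise:     (M, T)     multichannel background-noise recording
    returns    (M, ~T)    noisy multichannel mixture
    """
    T = noise.shape[1]
    # Spatialize the clean signal: one convolution per microphone channel
    clean = np.stack([fftconvolve(mono_clip, ir)[:T] for ir in irs])
    # Add the recorded multichannel noise (any level adjustment is omitted here)
    return clean + noise[:, :clean.shape[1]]

# Example shapes: a 10-second clip at 16 kHz rendered to 6 microphone channels
mix = spatialize(np.random.randn(160000), np.random.randn(6, 4096), np.random.randn(6, 160000))
```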
For training and testing, the single-channel data of FSD50k were truncated to 10 seconds in length and spatialized into multichannel data by convolving them with multichannel impulse responses from the CHiME-3 dataset. Then, to build noisy datasets, noise data from the CHiME-3 dataset, recorded in four different real venues with the same device configuration, were added to the spatialized FSD50k data. Furthermore, for the cross-validation of source-agnostic audio enhancement performance, clean speech signals of the CHiME-3 dataset were also spatialized using the same set of IRs.

3.2. Experiment Procedure

The SC and MC Conv-TasNets were chosen as baseline models for the performance comparison with IC Conv-TasNet. The comparisons were made using disjoint training and test datasets constructed from the CHiME-3 dataset. Training and testing were repeated for the spatialized FSD50k dataset as well. Their source-signal dependency was then evaluated by testing with the heterogeneous dataset unseen during the training stage (e.g., testing with CHiME-3 when trained on FSD50k, and vice versa). The parameters of the models were as follows: the number of layers in each TCN stack: 8; the number of TCN stacks: 3; the number of encoder features (F): 512; the feature dimension (N): 128; the channel dimension (C): 8; and the number of hidden channels (H): 32. The networks were trained for 200 epochs to minimize the signal-to-distortion ratio (SDR) loss, i.e., the negative SDR. A window of length 256 with 50% overlap was used for the encoder and decoder. Kernels of size 3 and 3×3 were used for the dilated convolutions in the 1-D and 2-D Conv blocks of the TCN, respectively, and the networks were trained with the Adam optimizer at a learning rate of 10⁻³.

4. RESULTS

The evaluation results for the case in which the training and test datasets are drawn from the same type of data are presented in Table 1. The speech enhancement performance for the CHiME-3 dataset exceeds 10 dB SDR for all three models; however, IC Conv-TasNet outperforms the other models by a large margin of over 3 dB. The results for the spatialized FSD50k dataset show somewhat degraded performance for all three models. This is because the FSD50k dataset consists of various source signals, which hinders the learning of source context, unlike the speech dataset. Nevertheless, IC Conv-TasNet still achieves a high SDR of over 15 dB. Its SDR gap over the other models is larger for the spatialized FSD50k dataset, indicating the source-agnostic behavior of IC Conv-TasNet. Because spatial information is the only clue for separating the various source signals in the spatialized FSD50k dataset, we can conclude that IC Conv-TasNet utilizes spatial information better than MC and SC Conv-TasNet.

Table 1: Performance comparison of models trained and tested on CHiME-3 and spatialized FSD50k.

To analyze how the training of IC Conv-TasNet differs between the CHiME-3 and FSD50k datasets, its training losses (negative SDR) with respect to the number of training epochs are depicted in Figure 2. The initial loss is lower for the speech dataset, but its rate of loss reduction is smaller than that for FSD50k. This trend indicates that the model initially had difficulty enhancing various source signals but quickly adapted to the task by learning spatial information as training progressed.

Figure 2: Comparison of training losses of IC Conv-TasNet for the CHiME-3 and spatialized FSD50k datasets.
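For reference, the training objective mentioned in Section 3.2 and plotted in Figure 2 (a negative-SDR loss) can be sketched as follows. This is a plain SDR formulation written for illustration; whether the authors use this exact form or a scale-invariant variant is not stated, so the details are assumptions.

```python
import torch

def neg_sdr_loss(estimate: torch.Tensor, target: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """Negative signal-to-distortion ratio, averaged over the batch.

    estimate, target: (batch, samples) time-domain waveforms.
    Minimizing this loss maximizes SDR = 10 log10(||s||^2 / ||s - s_hat||^2).
    """
    noise = target - estimate
    sdr = 10 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps)
    )
    return -sdr.mean()

# Usage with the optimizer settings of Section 3.2 (Adam, learning rate 1e-3):
# loss = neg_sdr_loss(model(noisy), clean); loss.backward(); optimizer.step()
```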
Figure 3: Spectrograms of (a) a cough signal and (b) a beep signal (left column: noisy signal, middle column: signal enhanced by IC Conv-TasNet, right column: clean signal).

Two examples of audio enhancement by IC Conv-TasNet on FSD50k are shown in Figure 3. The first example, Figure 3(a), shows three spectrograms (noisy, enhanced, and clean signals) of a cough sound. Both the target signal and the noise are spread widely across the entire frequency range, but IC Conv-TasNet can reduce the noise and reconstruct the target signal. Some residual noise remains in the silent regions, but it was mostly imperceptible in an informal listening test. However, there were a few cases with residual noise or wrongly reconstructed signals. For example, residual noise can be identified in the enhanced spectrogram of Figure 3(b).

Table 2: Audio enhancement performance of models trained on FSD50k.

Lastly, the source-agnostic behavior of IC Conv-TasNet was tested under a harsher condition by using different training and test datasets. The performance of IC Conv-TasNet trained on FSD50k but tested on CHiME-3 is shown in Table 2. Even though the model was tested on an unseen type of data (speech), an SDR of 16.3 dB was obtained after enhancement. This performance is only 1.3 dB lower than that of the model trained on CHiME-3 (Table 1) and is even higher than the performance of the two conventional models trained and tested on the same FSD50k dataset. From these results, we can reconfirm that IC Conv-TasNet uses spatial information more actively and can suppress noise for various source signals.

This desirable characteristic of IC Conv-TasNet is obtained at a much lower computational cost, as indicated by the parameter sizes of the models in Table 2. The presented parameter size is the total number of trainable parameters in the DNN model, such as the kernels of the convolution layers. The parameter size of IC Conv-TasNet is much smaller than those of the previous models because of the compact kernel sizes of the 1×1 Conv layers in the TCN (C × 4C vs. N × 4N, for C = 8 and N = 128) and the D-Conv layers (3² × 4C vs. 3 × 4N). These compact sizes are possible because information is aggregated along the channel dimension rather than the feature dimension used in the previous networks. Therefore, the efficient utilization of spatial information plays a key role in enhancing various audio signals with a small-sized model.

5. CONCLUSION

In this study, a source-agnostic DNN model for the audio enhancement task was introduced and tested on datasets including speech and various other audio signals. Conventional audio enhancement models trained for a certain type of source signal can easily be confused by test signals of types unseen during the training stage. Spatial information embedded in multichannel signals can overcome this limitation and realize truly source-agnostic audio enhancement. We employed IC Conv-TasNet, which can extract and exploit spatial relations between multichannel signals. To evaluate the source-agnostic audio enhancement performance, two spatialized audio datasets were constructed from the CHiME-3 speech data and the FSD50k audio data. While the conventional DNN models suffered from the heterogeneity of the audio data when trained and tested on FSD50k, IC Conv-TasNet maintained its audio enhancement performance, showing a decrease of only 2.3 dB in SDR compared to the CHiME-3 case.
Under the harsher condition in which models were trained on FSD50k but tested on CHiME-3, IC Conv-TasNet successfully reduced the noise with only a 1.3 dB decrease in SDR compared to the model trained on the CHiME-3 dataset itself. This outstanding performance demonstrates that IC Conv-TasNet relies on spatial information more than on the contextual information of the source signals and hence can be applied to various audio enhancement tasks without model retraining or additional source data collection.

6. ACKNOWLEDGEMENTS

This work was supported by the National Research Council of Science & Technology (NST) grant by the Korean government (MSIT) (No. CRC21011).

7. REFERENCES

1. Kounovsky, T. & Malek, J. Single channel speech enhancement using convolutional neural network. Proceedings of the 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), pp. 1-5, Donostia, Spain, May 2017.
2. Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M. & Zhong, J. Attention is all you need in speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21-25, Toronto, Canada, June 2021.
3. Cox, H., Zeskind, R. & Owen, M. Robust adaptive beamforming. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(10), 1365-1376 (1987).
4. Gu, R. et al. End-to-end multi-channel speech separation. arXiv:1905.06286 (2019).
5. Lu, X., Tsao, Y., Matsuda, S. & Hori, C. Speech enhancement based on deep denoising autoencoder. Proceedings of Interspeech, pp. 436-440, Lyon, France, August 2013.
6. Venkataramani, S., Casebeer, J. & Smaragdis, P. End-to-end source separation with adaptive front-ends. Proceedings of the 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 684-688, Pacific Grove, CA, USA, October 2018.
7. Williamson, D. S., Wang, Y. & Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(3), 483-492 (2015).
8. Luo, Y. & Mesgarani, N. TasNet: time-domain audio separation network for real-time, single-channel speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696-700, Calgary, Alberta, Canada, April 2018.
9. Luo, Y. & Mesgarani, N. Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8), 1256-1266 (2019).
10. Lea, C., Vidal, R., Reiter, A. & Hager, G. D. Temporal convolutional networks: a unified approach to action segmentation. Proceedings of the European Conference on Computer Vision (ECCV), pp. 47-54, Amsterdam, Netherlands, October 2016.
11. Lee, D., Kim, S. & Choi, J.-W. Inter-channel Conv-TasNet for multichannel speech enhancement. arXiv:2111.04312 (2021).
12. Barker, J., Marxer, R., Vincent, E. & Watanabe, S. The third ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 504-511, Scottsdale, Arizona, USA, December 2015.
13. Agarap, A. F. Deep learning using rectified linear units (ReLU). arXiv:1803.08375 (2018).
14. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, Santiago, Chile, December 2015.
15. Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 829-852 (2022).

1 donghen0115@kaist.ac.kr
2 jwoo@kaist.ac.kr