
Proceedings of the Institute of Acoustics

BINAURAL ASSESSMENT OF LISTENING EFFORT: INTRODUCTION, COMPARISON, AND REALITY

Jan Reimes, HEAD acoustics GmbH, Herzogenrath, Germany
Ossi Raivio, HEAD acoustics GmbH, Herzogenrath, Germany

1 INTRODUCTION

We live in an age of communication plagued by environmental noise. Even though non-speech methods such as text-based messengers are increasing in popularity, the spoken word remains the primary mode of communication. We use two-way and one-way speech communication to stay connected, conduct business, tend to our relationships, and navigate our way through public spaces. These spaces are filled with environmental noise caused by traffic and crowds. Informational and commercial messages from public address (PA) systems are an important method of communication, especially when travelling through unfamiliar spaces. On the other hand, announcements in which we are not interested add to the environmental noise pollution.

To investigate the impact of degradation on speech communication, various speech intelligibility metrics exist. The latest addition to these metrics is listening effort (LE). When investigating the impact of environmental noise and noise-cancelling methods for communication devices, such as mobile telephones or headsets, several studies 1-5 suggest that LE is a more suitable metric than intelligibility. Most of the practical experience gathered with LE has concentrated on headsets and mobile handsets. In order to expand this view, we have chosen to examine LE performance with a PA system.

2 WHY LISTENING EFFORT?

Before introducing listening effort, it is useful to review other methods that are often used to assess speech in noisy environments.

The Articulation Index (AI) is an early, basic method to assess the intelligibility of speech. The calculation method was standardized in ANSI S3.5-1969 6. AI is based on averaged spectra of noise and speech, both idealized as stationary signals. The algorithm processes single-channel, separate spectra of processed speech and noise, which are not always available in real measurement setups. Additionally, there is no comparison of the degraded speech against the clean speech reference. For modern telecommunication devices, this method provides little information.

The Speech Intelligibility Index (SII), as standardized in ANSI S3.5-1997 7, is the successor of the Articulation Index. The SII method represents an energy-based comparison of processed speech and noise, carried out over the two averaged 1/3-octave spectra of the processed speech and the noise-only components. In addition, a simple masking model identifies how strongly the noise interferes with the speech signal in each band. Like the Articulation Index, SII requires separated noise and speech components, which are not always available in real measurement setups. Additionally, the lack of comparison between clean and degraded speech signals limits its informative value on intelligibility.

The Speech Transmission Index (STI) method is part of many certification measurement procedures and is standardized as IEC 60268-16 8. STI was originally developed for analysing room acoustics, for which it works well; for non-linearly processed signals, its prediction quality is limited. Based on an octave-band filter bank representation, the modulation frequencies of each band are analysed. The loss of modulation between the reference and the degraded signal is then determined for each octave band, for modulation frequencies between 0.63 and 12.5 Hz. The final single value is calculated as the average over time, modulation frequencies, and octave bands. STI generally uses modulated noise signals and is therefore not designed to process speech. Several modifications of this method have evaluated its capability to handle real speech signals. All of these approaches originate from the domain of audiology, but none of them has gone beyond the research and development phase; consequently, none has been validated and evaluated on, e.g., publicly available listening test material.

Vol. 43. Pt. 2. 2021

The studies 1-5 have shown that LE provides a way to evaluate a wider range of signal-to-noise ratios (SNR) without reaching negative or positive saturation of intelligibility tests. A further benefit of using LE is that testing can be done with a much more limited corpus, because repetitions of the stimulus do not lead to learning effects in test subjects and eventual corruption of the results. The test design can be combined with additional opinion scales, such as speech quality (SQ) or preferred loudness. Using an optional SQ attribute helps test subjects differentiate between possible noise components of the signal (major impact on LE) and possible speech degradation included in the stimulus (minor to medium impact on LE). Auditory scales and questionnaires from Recommendation ITU-T P.800 9 are used; Table 1 lists the categories for both attributes.
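As an aside, the energy-based band-audibility scheme underlying SII can be sketched in a few lines of Python. The uniform weighting over 18 bands below is an illustrative placeholder, not the frequency-specific band-importance tables of ANSI S3.5-1997; only the clipping of band SNRs to [-15, +15] dB and the linear mapping to audibility follow the standard's general scheme.

```python
import numpy as np

# Illustrative only: 18 equally weighted bands stand in for the
# band-importance tables of ANSI S3.5-1997, which are frequency-specific.
NUM_BANDS = 18
importance = np.full(NUM_BANDS, 1.0 / NUM_BANDS)

def sii_like_index(speech_db, noise_db):
    """Simplified SII-style index from per-band speech and noise levels (dB).

    Band SNRs are clipped to [-15, +15] dB, mapped linearly to an
    audibility value between 0 and 1, and combined with the weights.
    """
    snr = np.asarray(speech_db, float) - np.asarray(noise_db, float)
    audibility = (np.clip(snr, -15.0, 15.0) + 15.0) / 30.0
    return float(np.sum(importance * audibility))

speech = np.full(NUM_BANDS, 60.0)   # flat 60 dB speech spectrum
quiet = np.full(NUM_BANDS, 30.0)    # low noise floor: index ≈ 1.0
loud = np.full(NUM_BANDS, 65.0)     # noise 5 dB above speech: index ≈ 0.33
```

The sketch also makes the method's limitations tangible: it sees only long-term band levels, so any comparison against a clean reference signal is structurally impossible.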

Score   Listening Effort                                    Speech Quality
5       Complete relaxation possible; no effort required    Excellent
4       No appreciable effort required                      Good
3       Attention necessary; moderate effort required       Fair
2       Considerable effort required                        Poor
1       No meaning understood with any feasible effort      Bad

Table 1. Auditory scales for combined assessment 9.

However, auditory evaluations with test subjects require both time and resources, so an instrumental assessment of listening effort is the preferred solution. For this purpose, the specification ETSI TS 103 558 10 provides such a prediction model. It is based on a large set of auditory databases and was developed collaboratively in the ETSI Technical Committee (TC) Speech and Multimedia Transmission Quality (STQ).

3 PREDICTION MODEL

The prediction model for LE is described in detail in ETSI TS 103 558 10. As the standard is publicly available, we only provide a concise summary of the algorithmic parts of the model here. Figure 1 provides an overview of the algorithm and its different stages.

The input signal for the algorithm is a diffuse-field equalised binaural recording that can be considered as a tuple of two single-channel signals: d(k) = ⟨d_L(k), d_R(k)⟩. A monaural input signal is assumed to be presented diotically to the listener, leading to a pseudo-binaural signal d(k) = ⟨d(k), d(k)⟩. The clean speech reference r(k) is always a single-channel signal. The noise-only signal n(k) is a diffuse-field equalised binaural recording of the noise in the degraded signal, without any speech. In some applications, the measurement procedure itself may make it impossible to separate speech and noise components; in this case, the noise is estimated internally by the algorithm. In our contribution, however, n(k) could be measured separately and used in the analysis.

The pre-processing step normalises the input signals with respect to time and level. The delay between the degraded signal d(k) and the reference signal r(k) is calculated by a cross-correlation analysis for both ears. The reference signal r(k) is scaled to a fixed active speech level (ASL) of 79 dB SPL, yielding r_Opt(k), which is assumed to be optimal regarding listening effort 11,12. The transfer functions H(f) between degraded and reference signals are calculated by the H1 methodology (cross-power spectral density) to exclude non-correlated noise components. Here, we also separate the noise from the degraded signal, resulting in the (processed) speech signal p(k) = d(k) − n(k).
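The time alignment and level normalisation of the pre-processing step can be sketched as follows. This is a simplified stand-in: the delay search uses plain FFT cross-correlation on one channel, and the level scaling uses the overall RMS rather than the gated active speech level of ITU-T P.56; the 48 kHz sample rate is an assumption.

```python
import numpy as np

FS = 48000  # assumed sample rate (Hz)

def estimate_delay(degraded, reference):
    """Estimate the delay (in samples) of `degraded` relative to
    `reference` via FFT-based cross-correlation."""
    n = len(degraded) + len(reference) - 1
    nfft = 1 << (n - 1).bit_length()
    spec = np.fft.rfft(degraded, nfft) * np.conj(np.fft.rfft(reference, nfft))
    xcorr = np.fft.irfft(spec, nfft)
    # Map FFT bin indices to signed lags.
    lags = np.concatenate([np.arange(nfft // 2), np.arange(-nfft // 2, 0)])
    return int(lags[np.argmax(xcorr)])

def scale_to_level(x, target_db_spl, ref_pressure=20e-6):
    """Scale a signal (in pascals) to a target RMS level in dB SPL --
    a crude stand-in for scaling to the 79 dB SPL active speech level."""
    rms = np.sqrt(np.mean(x ** 2))
    target_rms = ref_pressure * 10.0 ** (target_db_spl / 20.0)
    return x * (target_rms / rms)

rng = np.random.default_rng(0)
ref = rng.standard_normal(FS)                    # 1 s of noise as stand-in speech
deg = np.concatenate([np.zeros(480), ref])[:FS]  # copy delayed by 10 ms
delay = estimate_delay(deg, ref)                 # → 480 samples
r_opt = scale_to_level(ref, 79.0)                # "optimal" playback level
```

In the real model, this alignment is performed per ear, since the binaural recording can carry a different delay on each channel.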


Figure 1. Flow chart of the prediction algorithm for listening effort (based on ETSI TS 103 558 10).

As in related speech quality methods, a perceptually motivated transformation to the time-frequency domain (time index i, frequency bin j) is applied to the pre-processed signals. The hearing model according to Sottek 13,14 is calculated for the clean speech and degraded signals. The transformation includes an auditory filter bank, a hearing-adequate envelope, and downsampling to a frame resolution of about 8 ms. The frequency resolution is configured for 27 critical bands, with centre frequencies between 70 Hz and 18.3 kHz. In contrast to other hearing-adequate frequency scales, the proposed method covers the whole full-band range up to 20 kHz. This time-frequency representation is calculated for all pre-processed signals, which are denoted with capital letters: D(egraded), N(oise), P(rocessed), and R(eference). The speech power estimate is derived from the product of R_Opt(i, j) and H(f), which corresponds to a compensated reference spectrum R_Comp(i, j).

The binaural processing step addresses the capability of human hearing to improve the SNR compared to monaural listening. The spectral components for the left and right ears are combined by a short-term equalization-cancellation (STEC) 15 model. This extension of the well-known model of Durlach 16 requires the availability of the isolated speech and masker (noise-only) components. As a result of this stage, combined and enhanced hearing model spectra over time are provided, e.g., D_B(i, j) for the degraded signal.

Based on the binaural spectra, several comparisons and single-value metrics are possible, such as comparing the degraded and processed signals to the reference, or using single-ended metrics without a comparison. For the current model, the speech and noise levels, four correlation-based metrics 17,18, and one speech-to-noise distance index 19 are calculated.
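To illustrate what a correlation-based metric over such time-frequency spectra can look like, the sketch below computes the mean short-time correlation between degraded and reference envelopes per band, loosely in the spirit of the cited STOI work 17,18. The frame length and the synthetic 27-band spectra are placeholders; the metrics actually used in the standard differ in detail.

```python
import numpy as np

def band_correlation_metric(deg_spec, ref_spec, frame_len=30):
    """Mean short-time correlation between degraded and reference
    time-frequency envelopes, computed per band over frames of
    `frame_len` time steps (about 240 ms at an 8 ms frame rate)."""
    n_frames = deg_spec.shape[0] // frame_len
    corrs = []
    for m in range(n_frames):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        for j in range(deg_spec.shape[1]):
            d = deg_spec[sl, j] - deg_spec[sl, j].mean()
            r = ref_spec[sl, j] - ref_spec[sl, j].mean()
            denom = np.linalg.norm(d) * np.linalg.norm(r)
            if denom > 0.0:
                corrs.append(float(np.dot(d, r) / denom))
    return float(np.mean(corrs))

rng = np.random.default_rng(3)
R = rng.random((300, 27))                       # 27 bands, as in the hearing model
noisy = R + 0.5 * rng.standard_normal(R.shape)  # envelope corrupted by "noise"
clean_score = band_correlation_metric(R, R)     # ≈ 1.0 for identical spectra
noisy_score = band_correlation_metric(noisy, R) # clearly below 1.0
```

Short-time framing matters here: a single long-term correlation would hide brief masking events that dominate perceived effort.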
The metric extraction step provides seven single values, which are combined into an instrumentally determined Mean Opinion Score (MOS) for LE. In the current model, a random forest regression 19 is used: the decision trees of the regression are trained with the metrics as features and auditory MOS values as targets. These originate from multiple application-specific listening test databases, which are described in detail in Annexes D and E of the corresponding standard 10 and summarised in other sources 20. Finally, MOS LE is scaled by a penalty function for excessively loud speech: levels of 104 dB SPL and above produce a MOS LE of 1.0.

4 TEST ARRANGEMENT

The data was acquired in a semi-anechoic chamber using a head and torso simulator (HATS), simulation systems for background noise and reverberation, and a loudspeaker taking the role of a PA system. The HATS complies with all relevant standards 21,22 .


4.1 Physical Setup

The HATS was placed in the centre of the chamber on a turntable, allowing the HATS to be rotated into eight different positions. In the starting position (0°), the HATS pointed towards the PA loudspeaker.

Figure 2. Physical setup of the test scenario, showing the relative positions of the HATS, the background noise loudspeakers (small white symbols), and the PA loudspeaker (large grey symbol). The eight positions of rotation for the HATS are also shown.

The PA loudspeaker was positioned at approximately 1 m distance from the right ear of the HATS. This distance is labelled "01m" in the result graphs. The loudspeaker was equalised and calibrated to produce a pre-defined signal level measured at the HATS ears. The reason for the equalisation was to minimise the effects of the loudspeaker itself and to keep the variables of the system under better control.

The speech stimulus was the British English test vector from Recommendation ITU-T P.501 23 (B.3.3 in Annex B), consisting of two male and two female voices speaking two sentences each, eight sentences in total. The active speech level of the stimulus according to ITU-T P.56 24 was 82 dB SPL.

Due to the constraints of the room size, it was not possible to move the PA loudspeaker physically, so two other distances, 3 m and 10 m, were simulated by attenuating the signal by 10 dB and 20 dB and by adding a delay of 6 ms and 26 ms, respectively. These values do not represent the exact physical attenuation of the signal and its propagation delay over the said distances, but they provide a reasonable approximation. The two simulated distances are labelled "03m" and "10m" in the result graphs.

4.2 Background Noise and Reverberation Simulation

Figure 3. Block diagram of the test setup. The blocks that are time-synchronised with each other are marked with an asterisk (*). The loudspeaker symbols correspond to those of Fig. 2.


The background noise simulation was performed with a system conforming to ETSI TS 103 224 25. The system consists of eight loudspeakers and amplifiers, a digital equaliser, and suitable computer software. Equalisation of the setup is done by recording the impulse responses from all eight loudspeakers to each microphone of an eight-microphone array placed on the HATS. The necessary filters are generated by inverting the impulse responses, and a test recording is made to adjust the filters to compensate for any errors in the inversion process. The equalisation procedure preserves the noise field's overall signal level, spectrum, and coherence within predefined limits. The background noises used in the tests are listed in Table 2.

To simulate various reverberant environments beyond the low-reverberation (dry) chamber itself, the original speech signal is split into two paths. The first branch consists of an equaliser, a delay path, an amplifier, and the PA loudspeaker itself (direct path). The second branch consists of the reverberation simulation, eight amplifiers, and eight loudspeakers (reverberant path). The reverberation simulation utilises the same equaliser, amplifiers, and loudspeakers as the background noise simulation and is implemented in the same software. The system complies with ETSI TS 103 557 26.

The simulation requires impulse responses of the environment to be simulated (reverberation scenarios). These are recorded using an eight-microphone array. We used three scenarios that are available as part of the standard; they are listed in Table 3. When simulating the additional distances for the PA loudspeaker, the reverberation simulation parameters were not modified. Despite being a radical simplification, this approach is not without merit and provides a practical first approximation.
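The filter-inversion idea behind the equalisation can be sketched with a regularised spectral inversion. The regularisation constant and filter length below are illustrative choices, not values from TS 103 224; the real system additionally refines the filters with a test recording, as described above.

```python
import numpy as np

def inverse_filter(impulse_response, n_taps=4096, reg=1e-3):
    """Design an equalisation filter by regularised spectral inversion:
    H_inv(f) = conj(H(f)) / (|H(f)|^2 + reg).  The regularisation term
    keeps the gain bounded where the measured response has little energy."""
    H = np.fft.rfft(impulse_response, n_taps)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + reg)
    return np.fft.irfft(H_inv, n_taps)

# Sanity check on a toy response (attenuation by 0.5, delay of 8 samples):
# convolving the response with its inverse yields approximately a unit
# pulse at the combined delay.
ir = np.zeros(64)
ir[8] = 0.5
equalised = np.convolve(ir, inverse_filter(ir))   # peak ≈ 1.0
```

A plain inversion (reg = 0) would be exact for this toy response but blows up on measured responses with deep spectral notches, which is why some regularisation is always needed in practice.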

Label          Description                                                    Level [dB SPL(A)]
None           No background noise.                                           –
Call Centre    HATS and microphone array in a business office                 71.2
               ("Callcenter 1").
Inside Bus     HATS and microphone array in the passenger cabin of a bus.     72.2
Inside Train   HATS and microphone array in the passenger cabin of a train.   68.6
Pink Noise     HATS and microphone array in a periphonic loudspeaker array,   80.0
               playback of equalised, diffuse pink noise.
Sales Counter  HATS and microphone array in a supermarket.                    66.5

Table 2. Background noises used in the tests. We use the labels to refer to them in this contribution. The descriptions and levels are taken from TS 103 224 25; the level reported here is the average of the individual levels of the eight channels.

Label    ID  Description                                                  RT60 [ms]  C50 [dB]  DRR [dB]
None     –   No reverberation.                                            –          –         –
Medium   3   Office room with three desks; walls and ceiling of           544        10.7      2.0
             concrete, plasterboard and bricks; three windows; carpet.
High     6   Hall with three desks; rectangular main room with            1228       8.5       5.3
             adjoining corridor; walls and ceiling mainly of concrete;
             two brick walls; one big window; carpet.
Highest  8   Rectangular open staircase with four floors; concrete        2277       4.6       0.5
             walls; glass doors; window front on one side; tiled floors.

Table 3. Reverberation scenarios used in the tests. We use the labels to refer to them in this contribution. Identifying number, description, reverberation time RT60, clarity index C50, and direct-to-reverberant energy ratio DRR are taken from TS 103 557 26.

5 ANALYSIS AND RESULTS

The measurement runs were optimised by separately recording the background noise and the speech (with and without reverberation). The signals acquired in this way were mixed with each other, preserving the correct signal-to-noise ratio. MOS LE was then calculated individually for each of the eight sentences in the speech stimulus; we report the average of the eight values.

5.1 Impact of Angle of Rotation

First, we consider the behaviour of MOS LE as a function of the HATS rotation angle. As the instrumental algorithm uses binaural signals and contains a specific binaural processing step, we expect to see a strong dependency of the MOS LE values on the rotation angle in several cases. The test setup for Figs. 4 and 5 used the "Call Centre" background noise as an example.

In Fig. 4(a), we can observe that there is almost no relation between the angle and the MOS LE value, and all four reverberation scenarios provide the same performance. At the distance of ca. 1 m, the PA loudspeaker provides sufficient level for all angles of rotation, and the direct sound dominates completely over all reverberation components. In Fig. 4(b), the four reverberation scenarios start to separate from each other for a simulated distance of 3 m, where the PA loudspeaker is attenuated by 10 dB: the balance between direct and reverberant signal levels is no longer completely dominated by the direct path. This effect becomes even more visible when a 10 m distance is simulated. In Fig. 4(c), the four reverberation scenarios have separated so much from each other that a more detailed analysis becomes possible. First, it is worth noticing that when the reverberation simulation is inactive (blue marker), the MOS LE values are at their minimum. This can be understood when we consider the reverberation as an additional amount of signal energy brought into the measurement chamber. However, with decreased direct-path sound, only MOS LE values up to 3.0 can be achieved. The second feature that draws our attention is the roughly sinusoidal shape of the curves, more pronounced with no reverberation and less so for the highest reverberation scenario. Again, considering what we simulate, this result is plausible: reverberation is a very diffuse sound field, so adding it weakens the relation between the HATS rotation angle and the MOS LE value.

Fig. 5 shows the same set of data, but split differently in the variable space: the four reverberation scenarios are presented as subfigures (a) to (d). The same effects can be seen here from a different point of view. As expected, the shortest distance between the HATS and the PA loudspeaker provides the highest MOS LE in all reverberation scenarios. The case with no reverberation and 10 m distance shows a sinusoidal curve with the largest variance, with the highest peaks located at 45°–90° and 225°–270°, which indicates that first the right and then the left ear of the HATS points directly towards the PA loudspeaker. This angle dependency becomes less pronounced when the amount of direct sound is diminished and the amount of diffuse reverberation is increased. The asymmetry of the curves (cf. the values at 135° and 225°) can be explained by the asymmetry of the physical setup, as the HATS was not positioned on the axis of the PA loudspeaker.

5.2 Impact of Distance

As a second parameter, we study the effect of the distance between the HATS and the PA loudspeaker on the MOS LE values. The results shown in Fig. 6 were measured with the HATS facing the PA loudspeaker (0° angle in Fig. 2) and different background noises. In addition to the three distances already introduced, we also show results for the situation where the PA loudspeaker (direct path) was muted (value "None" on the abscissa). This means that only the reverberant signal part was present, without any direct-path sound. In the case of no reverberation and a muted direct path, there is no signal present at all, so we do not have any MOS LE values to report either (abscissa value "None" in Fig. 6(a)).

Our results show that increasing distance leads to lower MOS LE values. Even though this result might seem rather trivial, it is important for the plausibility of the results. However, we can gain some further insight. The sets of curves are sorted according to the level of the background noise (see Table 2): no noise produces the best results, pink noise the worst, and the results from the other four noise scenarios fall between these two extremes.
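Recall from Section 4.1 how the larger distances on the abscissa were produced: the direct-path signal was attenuated and delayed rather than physically moved. That approximation can be sketched as follows; the 48 kHz sample rate is an assumption.

```python
import numpy as np

FS = 48000  # assumed sample rate (Hz)

def simulate_distance(x, atten_db, delay_ms):
    """Approximate a more distant PA loudspeaker by attenuating and
    delaying the direct-path signal, as done for the simulated 3 m
    (10 dB, 6 ms) and 10 m (20 dB, 26 ms) conditions."""
    gain = 10.0 ** (-atten_db / 20.0)
    delay = int(round(delay_ms * 1e-3 * FS))
    return np.concatenate([np.zeros(delay), gain * x])

x = np.ones(100)
y = simulate_distance(x, 20.0, 26.0)   # 1248 samples of leading silence, gain 0.1
```

Note that only the direct path passes through this processing; the reverberant path is left unchanged, which is exactly the asymmetry that explains the "None" curiosity discussed for Fig. 6(d) below.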


Figure 4. Measured MOS LE as a function of the angle of HATS rotation, for (a) 1 m, (b) 3 m, and (c) 10 m distance. The curve colours represent the reverberation scenario (see Table 3).

Figure 5. Measured MOS LE as a function of the angle of HATS rotation, for (a) no, (b) medium, (c) high, and (d) highest reverberation. The curve colours represent the distance between the PA loudspeaker and the HATS (see Section 4.1).


Figure 6. Measured MOS LE as a function of the distance between the PA loudspeaker and the HATS, for (a) no, (b) medium, (c) high, and (d) highest reverberation. "None" denotes the case where the direct path was muted. The curve colours represent the background noises used (see Table 2).

An interesting detail is shown in Fig. 6(d). When comparing the MOS LE values of the no-background-noise scenario at a distance of 10 m and with the direct path muted ("None"), the latter is slightly higher. This result can be understood by studying which signals are present in the two cases. With "10m", we have a rather low direct signal level combined with a high reverberation level. At "None", practically no direct signal is present anymore, so we would expect MOS LE to be lower. However, the delay between the direct sound and the reverberation must be considered, too. When we simulate the 10 m distance, we add 26 ms of delay to the direct sound (PA loudspeaker), but we do not delay the reverberation. What we see here is the relatively weak direct sound acting as an additional disturbing factor, while the reverberation takes the role of the "main message carrier". When the disturbing direct sound is removed, the MOS LE value increases slightly (by approx. 0.1).

6 SUMMARY

In this contribution we have provided a short introduction to listening effort and to the instrumental model defined in ETSI TS 103 558. We simulated a PA system using background noise and reverberation simulation, and analysed the acquired data with the instrumental LE method. Analysed as functions of the various parameters, the results are shown to be plausible, even though PA systems were not strictly within the scope of the standard. We have not run additional subjective tests to evaluate the correlation between instrumental and auditory LE scores in this field of application. However, the extensive subjective testing performed during the development phase of the instrumental prediction model gives us the necessary confidence that the observed results are meaningful and that LE is a suitable metric to evaluate PA systems.


7 REFERENCES

1. J. Rennies et al., 'Listening effort and speech intelligibility in listening situations affected by noise and reverberation', J. Acoust. Soc. Am., vol. 136(5), pp. 2642-2653 (2014).
2. J. Reimes, 'Listening effort vs. speech intelligibility in car environments', in Fortschritte der Akustik - DAGA 2015, pp. 394-397, Berlin, Germany (2015).
3. H. F. Schepker et al., 'Perceived listening effort and speech intelligibility in reverberation and noise for hearing-impaired listeners', Int. J. Audiol., vol. 55(12), pp. 738-747 (2016).
4. A. Pusch et al., 'Binaural listening effort in noise and reverberation', in Fortschritte der Akustik - DAGA 2018, pp. 543-546, Berlin, Germany (2018).
5. J. Rennies and G. Kidd, 'Binaural listening effort in noise and reverberation', in Fortschritte der Akustik - DAGA 2018, pp. 615-616, Berlin, Germany (2018).
6. ANSI S3.5-1969 (R1986), Methods for the calculation of the articulation index. American National Standards Institute (1969).
7. ANSI S3.5-1997, Methods for the Calculation of the Speech Intelligibility Index. American National Standards Institute (1997).
8. IEC 60268-16, Sound system equipment - Part 16: Objective rating of speech intelligibility by speech transmission index (IEC 60268-16:2011) (May 2012).
9. ITU-T Recommendation P.800, Methods for subjective determination of transmission quality (Aug. 1996).
10. ETSI TS 103 558 V1.3.1, Methods for objective assessment of listening effort (July 2021).
11. ITU-T, Handbook on Telephonometry. ITU (1992).
12. ITU-T, Practical Procedures for Subjective Testing. ITU (2011).
13. R. Sottek, Modelle zur Signalverarbeitung im menschlichen Gehör. PhD thesis, RWTH Aachen University, Aachen, Germany (1993).
14. R. Sottek, 'A hearing model approach to time-varying loudness', Acta Acust. united Ac., vol. 102, pp. 725-744 (July/Aug. 2016).
15. R. Wan, N. I. Durlach, and H. S. Colburn, 'Application of a short-time version of the equalization-cancellation model to speech intelligibility experiments with speech maskers', J. Acoust. Soc. Am., vol. 136, no. 2, pp. 768-776 (2014).
16. N. I. Durlach, 'Binaural signal detection: Equalization and cancellation theory', in Foundations of Modern Auditory Theory, vol. 2 (1972).
17. C. H. Taal et al., 'A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech', in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214-4217, Dallas, Texas, USA (Mar. 2010).
18. C. H. Taal et al., 'An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech', IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 7, pp. 2125-2136 (2011).
19. L. Breiman, 'Random forests', Machine Learning, vol. 45, no. 1, pp. 5-32 (2001).
20. J. Reimes, 'Assessment of Listening Effort for various Telecommunication Scenarios', in 14th ITG Conference on Speech Communication, pp. 219-223, Kiel, Germany (Sept. 2021).
21. ITU-T Recommendation P.58, Head and torso simulator for telephonometry (June 2021).
22. ITU-T Recommendation P.57, Artificial ears (June 2021).
23. ITU-T Recommendation P.501, Test signals for use in telephony and other speech-based applications (Apr. 2020).
24. ITU-T Recommendation P.56, Objective measurement of active speech level (Dec. 2011).
25. ETSI TS 103 224 V1.3.1, A sound field reproduction method for terminal testing including a background noise database (July 2017).
26. ETSI TS 103 557 V1.1.1, Methods for reproducing reverberation for communication device measurements (Dec. 2018).
