
Predicting Speech Intelligibility for People with Hearing Loss: The Clarity Challenges

Michael A. Akeroyd, Jennifer Firth, Holly Griffiths, Graham Naylor, Eszter Porter, School of Medicine, University of Nottingham, NG7 2RD, UK

Jon Barker, Department of Computer Science, University of Sheffield, S1 4DP, UK

Trevor J. Cox¹, Simone Graetzer, Lara Harris, Zuzanna Podwinska, Acoustics Research Centre, University of Salford, M5 4WT, UK

John F. Culling, Rhoddy Viveros-Munoz, School of Psychology, University of Cardiff, CF10 3AT, UK

ABSTRACT

Objective speech intelligibility metrics are used to reduce the need for laborious listening tests during the design of audio systems, room acoustics and signal processing algorithms. Historically, speech intelligibility metrics have been developed using young adults with so-called ‘normal hearing’. They therefore do not work well for those with different hearing characteristics. One of the most common causes of aural diversity is sensorineural hearing loss. While hearing aids can partially restore perception, results are mixed. This has led to the Clarity Project, which is running an open series of Enhancement Challenges to improve the processing of speech-in-noise for hearing aids. To enable this, better objective metrics of speech intelligibility are needed that work for signals produced by hearing aids and for diverse listeners. For this reason, Clarity is also running Prediction Challenges to improve speech intelligibility metrics. Competitors are given a set of audio signals produced by hearing aid algorithms, and challenged to predict how many words a listener with a particular hearing characteristic will correctly identify. This paper outlines the running of the first prediction challenge, including a preliminary analysis of the entries.

¹ t.j.cox@salford.ac.uk

1. INTRODUCTION

Imagine a psychoacoustic experiment where listeners audition short examples of speech-in-noise and report what words they heard. Speech intelligibility is then the percentage of words in the target sentence that the listener correctly identified. Such experiments can assess how changes to the speech-in-noise affect a listener’s ability to participate in a conversation. The signal might be altered in various ways, but of most interest to the Clarity project are the effects of background noise, room reverberation, and processing by a hearing aid. Psychoacoustic experiments usually use young adults as participants, selected for ‘normal hearing’. Consequently, they exclude aural diversity, whether that arises from hearing or cognitive differences.

Objective speech intelligibility metrics are used as a convenient substitute for laborious psychoacoustic experiments: they allow a computer to estimate the speech intelligibility that a listener would score in a listening test. Metrics have been standardised and used in the specification of audio systems and rooms. They have also been used in the design of signal processing, for example to evaluate the likely performance of hearing aid algorithms, and as target functions for optimising audio processing, for example in the development of machine-learning speech enhancement.

Objective measures for speech intelligibility are developed by drawing on the results of psychoacoustic tests and are consequently biased by the listeners used in those experiments. Nearly all of the metrics overlook key factors such as hearing loss, non-native listening, and differences in cognitive resources. A notable exception is the hearing-aid speech perception index (HASPI), which includes an auditory model that incorporates peripheral hearing loss [1]. Other issues with current objective measures include: (i) most are monaural, but listening is usually binaural; and (ii) metrics can be over-fitted to the corpus and listening scenarios used during development.

To overcome some of these issues, the Clarity Project is running a series of challenges to improve objective speech intelligibility metrics. Open signal processing challenges have driven innovation in other areas of speech processing (e.g. NIST, CHiME and Blizzard [2]). These follow the Defense Advanced Research Projects Agency (DARPA) Common Task Method, which has a number of proven benefits, including drawing in many more researchers, encouraging collaboration across disciplines, and bringing more diverse approaches to a problem than a traditional research project achieves [3].

In this paper, the first Clarity Prediction Challenge (CPC1) is outlined. It focussed on listeners with typical age-related hearing loss and on speech-in-noise processed by hearing aids. The results from the challenge are briefly presented. The paper finishes by looking forward to the next prediction challenge in 2023.


2. METHODOLOGY

The first Clarity Prediction Challenge ran between November 2021 and April 2022. Entrants were tasked with predicting the intelligibility of a set of speech-in-noise sentences auditioned by listeners with quantified hearing loss. The speech-in-noise had been processed by a variety of experimental hearing aid processors, which had been personalised for the specific listener. The ground-truth data had been established through psychoacoustic experiments.

The challenge ran as follows.

1. Initially, entrants were provided with a training set, which they used to develop their speech intelligibility algorithms. Much information was given in the training set: clean speech and target text; hearing aid input and output signals; listener hearing characteristics; the ground truth of listener responses in the psychoacoustic tests; and comprehensive metadata on how the signals were created.

2. About a month before the challenge deadline, the held-back test set was released. This had much more limited information. Crucially, it did not include the ground truth, so entrants provided their estimates of the speech intelligibility without knowing the correct answers. The test set contained just the hearing aid output signals and the hearing characteristics of the listeners. The clean target speech and the text of the target speech were provided for those building ‘intrusive’ or double-ended speech intelligibility metrics.

3. After the deadline, the Clarity team compared the entrants’ predicted speech intelligibility for the test set to the held-back ground truth from the listening tests.

Two tracks were run as part of the challenge:

● Track 1 was a closed set task. The same listeners and hearing aid processors were in the training set (4812 responses) and test set (2421 responses).

● Track 2 was an open set task, with the test data including responses from one hearing aid processor and five listeners not in the training set. 22 of the 27 listeners and 9 of the 10 hearing aid processors were in both the training and test sets (3545 responses).

A little more detail of the materials is given in the following subsections.

2.1. Speech-in-noise signals and hearing aid processing

The target sentences were 7-10 words in length. They were a subset of 1,500 utterances from the Clarity speech corpus [4]. The interfering noises were recordings of a variety of non-impulsive domestic sounds such as washing machines, vacuum cleaners and kettles. The speech-in-noise was auralised into a random set of typical living rooms using a geometric room acoustic model and an HRTF (head related transfer function) database that includes measurements for hearing aid microphones [5]. The positions of the target talker, interfering noise source and listener were randomised between sentences, but were constant within each sentence. The level of the interfering noise was adjusted to obtain a specific speech-weighted better-ear signal-to-noise ratio (SNR) between -6 and +6 dB at the front hearing aid microphone.
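
As a rough illustration of the SNR-setting step, the sketch below scales a stereo interferer so that the better-ear SNR hits a target drawn from the -6 to +6 dB range. It is a simplification: it uses plain broadband RMS rather than the speech weighting used in the challenge, and the function and signal names are hypothetical.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

def scale_noise_for_target_snr(speech_lr, noise_lr, target_snr_db):
    """Scale a stereo interferer so the better-ear SNR hits a target value.

    speech_lr, noise_lr: arrays of shape (n_samples, 2), one column per ear.
    Uses broadband RMS; the challenge used a speech-weighted SNR, so this
    is only an illustrative sketch of the principle.
    """
    snr_per_ear = [20 * np.log10(rms(speech_lr[:, ch]) / rms(noise_lr[:, ch]))
                   for ch in (0, 1)]
    better_ear_snr = max(snr_per_ear)         # SNR at the more favourable ear
    gain_db = better_ear_snr - target_snr_db  # extra gain needed on the noise
    return noise_lr * 10 ** (gain_db / 20)

# Example usage: draw a target SNR uniformly from the -6 to +6 dB range.
# rng = np.random.default_rng()
# noise_scaled = scale_noise_for_target_snr(speech, noise, rng.uniform(-6, 6))
```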

The speech-in-noise was processed by ten experimental hearing aid processors from the entrants to the Clarity Enhancement Challenge [6]. A variety of approaches were used, including beamforming, source separation and signal amplification. For more details see reference [7], which includes short technical reports for each hearing aid processor.


2.2. Listeners and perceptual tests

Ethical approval was obtained from Nottingham Audiology Services and NHS UK (IRAS Project ID: 276060).

The listeners were characterised by bilateral pure-tone audiograms. Exclusion criteria included: use of a hearing intervention other than acoustic hearing aids; and a diagnosis of Meniere’s disease, hyperacusis or severe tinnitus. Hearing loss severity, defined as the average loss in dB HL between 2 and 8 kHz inclusive, was mild (15–35 dB) for 1 listener, moderate (35–56 dB) for 9 listeners and severe (>56 dB) for 17 listeners, with an overall range of 35 dB to 76 dB.
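
The severity definition above is simple enough to state in code. The following sketch assumes an audiogram sampled at standard octave frequencies; the function name and interface are illustrative rather than part of the challenge tooling.

```python
import numpy as np

def hearing_loss_severity(freqs_hz, thresholds_db_hl):
    """Classify severity as defined in the text: the average loss in dB HL
    over audiogram frequencies from 2 to 8 kHz inclusive."""
    freqs = np.asarray(freqs_hz, dtype=float)
    thresholds = np.asarray(thresholds_db_hl, dtype=float)
    band = (freqs >= 2000) & (freqs <= 8000)
    avg_loss = thresholds[band].mean()
    if avg_loss <= 35:
        label = "mild"       # 15-35 dB
    elif avg_loss <= 56:
        label = "moderate"   # 35-56 dB
    else:
        label = "severe"     # >56 dB
    return avg_loss, label

# e.g. hearing_loss_severity([250, 500, 1000, 2000, 4000, 8000],
#                            [20, 25, 30, 45, 60, 70])
# -> (58.33..., 'severe')
```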

Hearing abilities for most listeners were also characterised by the DTT (digit-triplet test) [8]; the GHABP (Glasgow Hearing Aid Benefit Profile) questionnaire [9]; and the SSQ12 (Speech, Spatial and Qualities of Hearing questionnaire, 12-question version) [10].

The listening tests were conducted in a quiet room in the participants’ homes, using Clarity’s Listen@Home software running on Lenovo 10e Chromebook tablets with Sennheiser PC-8 headsets. The software presented the signals in blocks (one hearing aid processor per block). Listeners were asked to repeat what they heard; this was recorded and later transcribed, and the results presented below are for human transcription. 27 listeners completed the tests, providing a total of 7233 responses. The intelligibility score was the percentage of words correct in a listener’s response to a sentence of speech-in-noise.
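
The scoring step amounts to counting how many target words appear in the transcribed response. A minimal sketch is given below; it assumes simple punctuation-stripping tokenisation and counts each target word at most as often as it occurs in the response, which may differ in detail from the exact scoring used in the challenge.

```python
import re
from collections import Counter

def words_correct_percentage(target: str, response: str) -> float:
    """Percentage of target words found in the transcribed response."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    target_words = tokenize(target)
    response_counts = Counter(tokenize(response))

    hits = 0
    for word in target_words:
        if response_counts[word] > 0:   # match each response word at most once
            hits += 1
            response_counts[word] -= 1
    return 100.0 * hits / len(target_words)

# words_correct_percentage("the cat sat on the mat",
#                          "a cat sat on a mat")  # -> 66.7 (4 of 6 words)
```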

3. RESULTS

Nine teams entered the challenge, submitting a total of fifteen speech intelligibility algorithms. In addition, there were: (i) a baseline algorithm that combined a hearing loss model [11] with the speech intelligibility metric MBSTOI [12]; (ii) predictions using HASPI [1]; and (iii) a simple algorithm (‘prior’) that just guessed the mean intelligibility of the training set for every sentence. Tabulated results for each entrant and a more detailed description of the methods can be found in [13] and [14]. Entrant algorithms were anonymised and are referred to by codenames of the form Exxx.
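
Of the three reference systems, the ‘prior’ is simple enough to state exactly: it ignores the audio and the listener, and predicts the training-set mean for every test sentence. A sketch along those lines (class and method names are illustrative):

```python
import numpy as np

class PriorBaseline:
    """Predicts the mean training-set intelligibility for every sentence.

    Matches the description of the 'prior' reference system: it uses no
    signal or listener information, so any useful metric must beat it.
    """
    def fit(self, train_scores):
        self.mean_score = float(np.mean(train_scores))
        return self

    def predict(self, n_sentences):
        return np.full(n_sentences, self.mean_score)

# prior = PriorBaseline().fit(train_scores)
# predictions = prior.predict(len(test_sentences))
```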

3.1 Track 1 (closed set)

Figure 1 shows the results, with one box-whisker graph per speech intelligibility algorithm. The y-axis is the predicted percent-correct score from the objective speech intelligibility algorithm, and the x-axis the observed percent-correct score (i.e. the ground-truth score achieved by the listeners in the psychoacoustic experiments). The plots are ordered from the lowest average root mean square error (RMSE) (top left) to the largest (bottom right), calculated over the test set. The RMSE is shown in each title, along with Spearman’s rank correlation coefficient (r_s). ‘Intr’ means an intrusive system that uses the clean speech or text alongside the hearing aid output audio to predict the speech intelligibility. ‘Non-intr’ is a blind system that only uses the output audio from the hearing aid. To create the box-whisker plots, the data were put into classes based on the observed percent-correct. The first and last classes are 0-5% and 95-100%; the other classes are 10% wide: 5-15%, 15-25%, …, 85-95%. The x-axis shows the median observed percent-correct within each class.
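
The two headline statistics reported in each panel, RMSE and Spearman’s rank correlation, can be computed as in the following sketch (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_predictions(predicted, observed):
    """RMSE and Spearman's rank correlation between predicted and observed
    percent-correct scores, as reported in each panel of Figure 1."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    r_s, _ = spearmanr(predicted, observed)
    return rmse, r_s
```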


Figure 1. Predicted vs observed percent-correct for all speech intelligibility algorithms in Track 1. One plot per algorithm, ordered according to RMSE score (smallest top left, largest bottom right). The black dotted line is predicted = observed. Blue titles are intrusive systems (labelled ‘Intr’); black titles are non-intrusive systems (‘Non-intr’).

The worst performing systems (bottom row) have little variation in the predicted speech intelligibility across the test set. In general, systems are less successful in predicting the speech intelligibility for low observed scores. This might be because of the skewed distribution of the data. For example, the test set has 3.9 times more 100% correct observed scores than 0% correct. Similar distributions in the training set would bias learning algorithms towards getting the high values right.

As intrusive (double-ended) metrics have access to the clean reference speech or text, they would be expected to perform better than non-intrusive (blind, single-ended) algorithms. Indeed, the intrusive approaches tend to perform slightly better, but the difference is not marked. For example, the difference between the best intrusive and best non-intrusive entries is not significant (Wilcoxon signed-rank test on absolute prediction error: p = 0.263, z = 1.12, x̃₁ = 9.92, x̃₂ = 9.62, n₁ = n₂ = 2421).
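
For readers wanting to reproduce this kind of comparison, the test is a paired one on per-sentence absolute errors. A sketch using scipy, with randomly generated stand-in data in place of the actual prediction errors:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Illustrative stand-ins: per-sentence absolute prediction errors for two
# systems evaluated on the same 2421 test sentences (paired samples).
abs_err_intrusive = np.abs(rng.normal(10, 5, 2421))
abs_err_non_intrusive = np.abs(rng.normal(10, 5, 2421))

# Paired Wilcoxon signed-rank test on the error differences; a p-value
# above 0.05, as in the text, indicates no significant difference.
statistic, p_value = wilcoxon(abs_err_intrusive, abs_err_non_intrusive)
```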

Overall, reviewing the different approaches, there is no clear conclusion about the best way to predict speech intelligibility. One observation is that simpler algorithms did just as well as the more complex ones.


Surprisingly, the listener characteristics were less useful than expected. Some speech intelligibility algorithms used the listener’s audiogram as an input to the prediction and others did not, and the choice did not obviously separate high- and low-performing systems. This might be because audibility was not an issue for the subjects during the listening tests, for two reasons: (i) the hearing aid processors all had amplification stages to ensure audibility; and (ii) the listeners had a volume control and so could ensure audibility. The additional listener data (such as the digit-triplet test) were little used.

The best system (E030) used the listener ID as a proxy for the hearing characteristics. This might have allowed the prediction algorithm to compensate for different behaviours in the listening tests, for example, whether a subject was likely to give up or to guess words in difficult cases.

3.2 Track 2 (open set)

Track 2 reveals how well the speech intelligibility algorithms could generalise to previously unseen hearing aid processors (systems) and listeners. To enable this, the track 2 test set included one hearing aid processing system and five listeners that were not in the training set. The left panel of Figure 2 shows the RMSE for track 2 compared to track 1 across all samples in the test set, for the 13 speech intelligibility algorithms that were submitted to both tracks. The box-whisker plot is of the mean RMSE values across the test set for each of the 13 algorithms. Given that dealing with unseen listeners and systems is a more difficult problem, the average RMSE for track 2 would be expected to be higher than for track 1, but the difference was not significant. This was demonstrated with a two-sided Wilcoxon rank-sum test comparing the medians (z = 1.23, x̃₁ = 26, x̃₂ = 29, n₁ = n₂ = 13, p = 0.2). This was probably because the track 2 data includes a large number of listeners and systems that were seen in both the training and test sets.
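
The track comparison is a two-sided rank-sum test on the 13 per-algorithm RMSE values, treated as independent samples (in contrast to the paired signed-rank test above). A scipy sketch, with made-up numbers standing in for the real RMSE values:

```python
import numpy as np
from scipy.stats import ranksums

# Illustrative stand-ins: mean RMSE for each of the 13 algorithms entered
# in both tracks (unpaired samples, hence a rank-sum test).
rmse_track1 = np.array([22, 24, 25, 26, 26, 27, 28, 29, 31, 33, 36, 40, 45.0])
rmse_track2 = np.array([23, 25, 27, 28, 29, 29, 30, 31, 33, 35, 38, 42, 47.0])

# Two-sided Wilcoxon rank-sum test comparing the medians.
statistic, p_value = ranksums(rmse_track1, rmse_track2)
```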


Figure 2. RMSE for speech intelligibility prediction. Left: tracks 1 and 2 compared for the 13 systems entered in both tracks. Middle: seen and unseen systems (hearing aid processors) in Track 2. Right: seen and unseen listeners in Track 2.

The other graphs in Figure 2 are just for track 2, and show the change in RMSE between the seen and unseen hearing aid processing systems (middle) and the seen and unseen listeners (right). Again, the box-whisker plots were calculated from the average RMSE values for each speech intelligibility algorithm. The RMSE for the unseen hearing aid processor was significantly bigger than for the seen systems (z = 3.11, x̃₁ = 23, x̃₂ = 38, n₁ = n₂ = 15, p = 0.002). In contrast, the RMSE for the seen and unseen listeners was not significantly different (z = 0.41, x̃₁ = 38, x̃₂ = 36, n₁ = n₂ = 15, p = 0.7).

Superficially, the speech intelligibility prediction algorithms appear to be better at generalising to unseen listeners than to unseen hearing aid processors. This might have arisen because there was only one unseen processing system, and it was difficult to choose one that was representative of the systems in the training set. This was less problematic when partitioning the listeners into seen and unseen.

4. DISCUSSION AND CONCLUSIONS

We undertook the first-ever open challenge in which entrants predicted the speech intelligibility of speech-in-noise that had been processed by hearing aids and auditioned by listeners with hearing loss. The best entrant algorithms outperformed both the baseline system and the current state-of-the-art metric, HASPI. While intrusive algorithms, which had knowledge of both the clean speech and the hearing aid output signal, did better than non-intrusive (blind) approaches, the difference in performance was quite small.

Even for the best algorithms, the prediction errors were quite large: an RMSE of 22.5 ± 0.5 percentage points for track 1 and 23.5 ± 0.9 for track 2, equivalent to getting two words wrong in a nine-word sentence. Consequently, more work is needed to improve speech intelligibility prediction. The next challenge in 2023 should give entrants access to a larger training set, which should help with prediction, but the samples being considered will be more complicated, involving more types of noise interferer and head movements. It is suggested that, to improve predictions, models might need to go beyond signal processing of the audio and quantified measures of hearing, and also characterise the behaviour of diverse listeners in psychoacoustic experiments.

5. ACKNOWLEDGEMENTS

This research was funded by the UK's Engineering and Physical Sciences Research Council (EPSRC) under Grants EP/S031448/1, EP/S031308/1, EP/S031324/1 and EP/S030298/1. We are grateful to Amazon, the Hearing Industry Research Consortium, the Royal National Institute for the Deaf (RNID), and Honda for their support.

6. REFERENCES

1. Kates, J.M. & Arehart, K.H., The hearing-aid speech perception index (HASPI) version 2. Speech Communication, 131, 35-46 (2021).
2. Barker, J.P., Akeroyd, M.A., Cox, T.J., Culling, J., Graetzer, S., Naylor, G. & Porter, E., Open challenges for driving hearing device processing: lessons learnt from automatic speech recognition. Proc. SPIN (2020).
3. Liberman, M. & Wayne, C., Human Language Technology. AI Magazine, 41, 22-35 (2020).
4. Graetzer, S., Akeroyd, M.A., Barker, J., Cox, T.J., Culling, J.F., Naylor, G., Porter, E. & Viveros-Muñoz, R., Dataset of British English speech recordings for psychoacoustics and speech processing research: The Clarity speech corpus. Data in Brief, 41, 107951 (2022).
5. Denk, F., Ernst, S.M., Heeren, J., Ewert, S.D. & Kollmeier, B., The Oldenburg Hearing Device (OlHeaD) HRTF Database. University of Oldenburg, Tech. Rep. (2018).
6. Graetzer, S.N., Barker, J., Cox, T.J., Akeroyd, M., Culling, J.F., Naylor, G., Porter, E. & Viveros-Munoz, R., Clarity-2021 challenges: Machine learning challenges for advancing hearing aid processing. Proc. Interspeech, 2, 686-690 (2021).
7. Proc. ISCA Clarity Workshop on Machine Learning Challenges for Hearing Aids (Clarity-2021). https://claritychallenge.github.io/clarity2021-workshop/ (2021).
8. Van den Borre, E., Denys, S., van Wieringen, A. & Wouters, J., The digit triplet test: A scoping review. International Journal of Audiology, 60, 946-963 (2021).
9. Whitmer, W.M., Howell, P. & Akeroyd, M.A., Proposed norms for the Glasgow hearing-aid benefit profile (GHABP) questionnaire. International Journal of Audiology, 53, 345-351 (2014).
10. Noble, W., Jensen, N.S., Naylor, G., Bhullar, N. & Akeroyd, M.A., A short form of the Speech, Spatial and Qualities of Hearing scale suitable for clinical use: The SSQ12. International Journal of Audiology, 52, 409-412 (2013).
11. Nejime, Y. & Moore, B.C., Simulation of the effect of threshold elevation and loudness recruitment combined with reduced frequency selectivity on the intelligibility of speech in noise. The Journal of the Acoustical Society of America, 102, 603-615 (1997).
12. Andersen, A.H., de Haan, J.M., Tan, Z.H. & Jensen, J., Refinement and validation of the binaural short time objective intelligibility measure for spatially diverse conditions. Speech Communication, 102, 1-13 (2018).
13. Barker, J., Akeroyd, M.A., Cox, T.J., Culling, J.F., Firth, J., Graetzer, S., Griffiths, H., Harris, L., Viveros-Munoz, R., Naylor, G., Podwinska, Z. & Porter, E., The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction. Submitted to Proc. Interspeech (2022).
14. Proc. Clarity Workshop on Speech Intelligibility Prediction for Hearing Aids (Clarity-2022). http://claritychallenge.org/.
