Volume: 44, Part: 2

Virtual sound source perception challenges of binaural audio systems with head-tracking

Vedran Planinec (1), Kristian Jambrošić (2), Petar Franček (3), Marko Horvat (4)
Faculty of Electrical Engineering and Computing, University of Zagreb
Unska 3, 10000 Zagreb, Croatia

(1) vedran.planinec@fer.hr
(2) kristian.jambrosic@fer.hr
(3) petar.francek@fer.hr
(4) marko.horvat@fer.hr

ABSTRACT
With the recent leaps in spatial audio technology, binaural head-tracking can revolutionize the way music and audio are experienced. Moreover, binaural head-tracking is frequently used as an audio reproduction system in laboratory research on noise-related perception. Although this technology has improved significantly in recent years, binaural signal processing often suffers from problems such as inadequate sound source externalization and unacceptably high system response times, which limit the usability of the technology during fast head rotation. In this paper, the results of an experiment are presented in which test subjects determined the direction of a virtual sound source in the horizontal plane under multiple parameter changes. The experiment was conducted in a controlled environment, in a listening room with known acoustical properties. The varied parameters include the hardware head-tracker (a commercial unit and a head-tracker based on a simple embedded system), various software solutions for real-time binaural synthesis, and the Head-Related Transfer Functions used. The problem of externalizing a virtual sound source reproduced via headphones is discussed, and possible solutions to the problem are given. The results of the experiment and the recorded data are presented.

1. INTRODUCTION
Spatial audio is a term that is increasingly being mentioned in the professional audio industry as "the future of listening audio" [1].
The stereo audio format has long been the leading method of listening to a wide array of audio content, and spatial audio might just be the solution people are looking for in their listening experience. One of the biggest industries that could benefit from spatial audio is the gaming industry, especially the part that develops virtual/augmented reality hardware and content. To experience spatial audio properly, binaural head-tracking is used as a part of the audio reproduction system. Although this technology has evolved rapidly in recent years, binaural signal processing often suffers from problems such as inadequate sound source externalization and unacceptably high system response times [2], which limit the usability of the technology during fast head rotation. To better understand what makes the perception of virtual sound sources realistic and precise, an experiment was designed in which different hardware and software parameters were varied to determine whether any of the combinations yielded statistically significant improvements. The externalization of a virtual sound source is potentially one of the bigger problems to solve [3], particularly how different reverberation times affect the feeling of externalization. Although it would be logical to set the reverberation time to match the room the listener is in, the assumption is made that a greater reverberation time added to a virtual sound source could make it be perceived as more externalized. This assumption applies only to augmented reality, because for virtual reality the room properties are negligible. In this paper, the results of the mentioned experiment are presented, in which test subjects determined the direction of a virtual sound source in the horizontal plane under multiple parameter changes, as well as the level of externalization of a speech sound source.
The experiment was conducted in a controlled environment, in a room with known acoustical properties.

2. LISTENING EXPERIMENT
2.1. Experimental setup
To set up the audio reproduction system with head-tracking, the following equipment and software were required: a DAW (Digital Audio Workstation), a scene rotator, a stereo encoder, a binaural decoder, an Ambisonics reverb, headphones, and a head-tracker (see Figure 1).

Figure 1: Head-tracker mounted on Sennheiser HD650 headphones

The experiment was implemented using the IEM Plug-in Suite [4] in the Reaper DAW. The IEM Suite template for 5th-order Ambisonics production was used for the required binaural rendering (see Figure 2).

Figure 2: Listening experiment in Reaper DAW with the IEM template for 5th-order Ambisonics

Two different head-trackers were used: a commercial Waves NX head-tracker and an iPhone 12 smartphone. The variation in binaural decoding and Head-Related Transfer Functions (HRTFs) was achieved by using two different binaural decoders, each with a different set of HRTFs. The first one is the IEM Binaural Decoder plug-in, which uses the Magnitude Least Squares binaural rendering method [5] and a set of HRTFs recorded with the Neumann KU 100 dummy head. The second one is the Ho-DirAC Binaural plug-in [6], which is based on the Higher-Order Directional Audio Coding method [7] and uses a set of HRTFs measured on a GRAS 45BB KEMAR head and torso with normal pinna. These two setups were chosen because they have been widely used in previous research on binaural head-tracking synthesis. To check the influence of the binaural rendering method and the HRTF database on the precision of sound source localization and externalization, these parameters should have been varied independently, which is planned for future research. The azimuthal direction of the virtual sound source was encoded using the IEM Stereo Encoder plug-in.
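Ambisonic encoding of this kind reduces, for a source in the horizontal plane, to multiplying the source signal by spherical-harmonic gains evaluated at the source azimuth. The sketch below illustrates the idea at first order with the ACN/SN3D convention; it is a simplified illustration, not the IEM plug-in's actual implementation, which operates at 5th order.

```python
import math

def first_order_gains(azimuth_deg: float) -> dict:
    """First-order Ambisonic (ACN/SN3D) encoding gains for a source
    in the horizontal plane (elevation = 0)."""
    theta = math.radians(azimuth_deg)
    return {
        "W": 1.0,              # omnidirectional component
        "Y": math.sin(theta),  # left-right component
        "Z": 0.0,              # up-down component (zero in the horizontal plane)
        "X": math.cos(theta),  # front-back component
    }

# A source straight ahead (0 degrees) feeds only the W and X channels.
gains = first_order_gains(0.0)
```

Rotating the sound scene to compensate for head rotation, as the Scene Rotator does, then amounts to a linear transformation of these channel signals rather than re-encoding the sources.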
The IEM Scene Rotator plug-in (see Figure 3) received head-rotation data via the OSC protocol and controlled the rotation of the DAW project. Additional software was required to transfer the positional data: the Nxosc software for the Waves NX tracker, and the iOS application GyrOSC for the iPhone smartphone. IEM Reverb was used as the Ambisonics reverb plug-in.

Figure 3: IEM Suite Scene Rotator plug-in

Two sounds were selected for the experiment: the sound of knocking on wood for the localization part and recorded speech for the externalization part (explained in more detail in subchapter 2.2).

2.2. The experimental procedure
The experiment was divided into two parts. A total of 15 listeners took part in the entire experiment, individually. They were seated in the testing room, and all the instructions were read to them inside that room, so that they could experience the reverberation of the testing room as a potentially influential factor in the externalization part of the experiment.

The first part of the experiment focused on comparing the accuracy and precision of virtual sound source localization for all four possible combinations of head-trackers and binaural decoders and was implemented in two stages. The listening experiment setup can be seen in Figure 4.

Figure 4: Listening experiment setup in the FER Auralization Laboratory in Zagreb, Croatia

A headphone set with a mount for a head-tracking device was placed on the listener's head. The head-tracker was calibrated several times throughout the experiment, and the calibration was tested by reproducing the sound of knocking on wood from the frontal direction. The listeners were then asked to confirm that they indeed heard the sound coming from that direction.
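The head-rotation data described above travel from the tracker software to the Scene Rotator as plain OSC messages over UDP. The sketch below builds a minimal OSC packet in pure Python to illustrate that data path; the address pattern /SceneRotator/yaw and the port number are assumptions for illustration, not details taken from the actual setup.

```python
import socket
import struct

def osc_string(s: str) -> bytes:
    """Null-terminate an OSC string and pad it to a multiple of 4 bytes."""
    b = s.encode("ascii") + b"\x00"
    return b + b"\x00" * (-len(b) % 4)

def osc_message(address: str, *values: float) -> bytes:
    """Build an OSC message with 32-bit big-endian float arguments."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", v) for v in values)
    return osc_string(address) + osc_string(type_tags) + payload

# Hypothetical use: send a yaw angle of 30 degrees to a plug-in
# assumed to listen on localhost, port 9000.
packet = osc_message("/SceneRotator/yaw", 30.0)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 9000))
sock.close()
```

Because OSC runs over UDP, each yaw/pitch/roll update is a small self-contained datagram, which keeps the latency of the tracking path low.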
To localize the sound source and determine its azimuthal direction, the listeners used a paper tape stretched symmetrically over a 240-degree arc around the listening position, with markings of the azimuthal angle in 1-degree resolution. The azimuth of 120° was therefore assigned to the frontal direction. To determine the direction of arrival of the sound, the listeners were seated on a rotating bar stool and were allowed to rotate on it, i.e. to move their body and head.

The first stage of the localization experiment dealt with static localization. A 10-second sound of knocking on wood was reproduced from five fixed azimuthal directions. After each sound sample had ended, the listeners said out loud the azimuth marking the direction they thought the sound came from. To check the consistency of the responses, two of the five fixed azimuthal directions had the same value and were placed at random positions in the sequence.

The second stage of the localization experiment focused on dynamic localization. The listeners were asked to determine the final azimuthal direction of a virtual sound source moving in the horizontal plane around them. The same sound as before was used, but its duration was shortened to 6 seconds. Its direction of arrival changed from a starting to an ending value at a constant angular speed. Three scenarios were defined, each with a different "travel angle" between the initial and the final direction, resulting in different angular speeds for the three cases. When each sound sample ended, the listeners were asked to say out loud the azimuthal direction they thought the sound was coming from at the end.

The second part of the experiment was designed to test whether reverberation and binaural decoding influence the externalization of a virtual audio source. Consequently, two parameters were varied: the binaural decoder and the reverberation time.
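The constant angular speed in the dynamic-localization stage follows directly from the travel angle and the 6-second sample duration. A small worked sketch (the travel angles here are hypothetical, since the exact values used in the experiment are not listed):

```python
def angular_speed(travel_angle_deg: float, duration_s: float = 6.0) -> float:
    """Constant angular speed (degrees per second) of a virtual source
    that sweeps a given travel angle over the sample duration."""
    return travel_angle_deg / duration_s

# Three hypothetical scenarios with different travel angles over 6 s:
speeds = {angle: angular_speed(angle) for angle in (30.0, 60.0, 90.0)}
# e.g. a 30-degree sweep over 6 s corresponds to 5 degrees per second
```

A larger travel angle thus directly stresses the response time of the tracking chain, since the rendered direction must be updated faster.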
The variation in head-trackers was assumed to be negligible for the second part of the experiment, which was to be confirmed by the statistical analysis of the first part. The sound used in this part of the experiment was recorded speech. A dummy head was placed in front of the listener as a visual stimulus mimicking a human speaker (see Figure 5).

Figure 5: Dummy head placed in front of the listener as a visual stimulus

The dummy head was placed on a table directly opposite the frontal position used in the localization experiment. That way the listeners only had to make a 180-degree turn on the rotary chair to be ready for this part of the experiment, and they were not distracted by the setup used in the localization part. Three sound samples were made from the original, "dry" speech sample, differing only in the amount of added reverberation; reverberation times of 0 s, 0.4 s, and 0.8 s were chosen. Binaural decoding was realized using the same two decoders as before. The samples were reproduced to the listeners, who rated the level of externalization of the sound on a scale ranging from 1 ("The sound is in my head") to 9 ("The sound is coming from the mouth of the dummy head").

3. STATISTICAL ANALYSIS OF THE RESULTS
The results of the experiment were processed and then statistically analyzed by means of visual analysis, i.e. box-and-whisker plots, and two-way ANOVA [8], which was recommended by a statistics expert from the Faculty of Electrical Engineering and Computing, University of Zagreb. Normal distribution of the data was checked and confirmed.

3.1. Localization of the sound source
To assess the precision of localization depending on the head-trackers and binaural decoders used in the setup, the raw results had to be converted into a more convenient form. In particular, the azimuthal directions indicated by the listeners, i.e.
the perceived directions of the virtual sound source, were compared to the true directions defined in the IEM Stereo Encoder. The absolute value of the deviation from the true direction was then calculated. As each listener was an independent participant in the experiment, their individual observations (deviations) were averaged over all investigated directions, separately for static and dynamic localization. Thus, 15 averaged observations were acquired for the statistical analysis for each combination of head-tracker and binaural decoder. The results are shown as box-and-whisker plots in Figure 6.

Figure 6: Box-and-whisker plots of the absolute deviations from the true direction of the virtual sound source, given for different combinations of head-tracker and binaural decoder for static localization (left) and dynamic localization (right)

The interquartile range for static localization was expected to be below 10 degrees of absolute deviation from the true direction, and the interquartile range for dynamic localization was expected to be below 15 degrees. The obtained data were analyzed using two-way ANOVA to examine whether the choice of head-tracker and/or binaural decoder as independent factors influences the localization of virtual sound sources. The resulting p-values exceed 0.05 for the main effects and their interaction, both for static and dynamic localization. In particular, the p-values for static localization are 0.43 and 0.93 for the head-tracker and the binaural decoder, respectively, as the main effects, and 0.13 for their interaction. The p-values for dynamic localization are 0.69 and 0.09 for the head-tracker and the binaural decoder, respectively, and 0.36 for their interaction.
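The conversion of raw responses described above — absolute deviation per trial, then one averaged observation per listener — can be sketched as follows (the angle values are made up for illustration, not taken from the recorded data):

```python
from statistics import mean

def per_listener_average(perceived: list, true: list) -> float:
    """Average absolute deviation (degrees) of one listener's responses
    from the true source directions, over all investigated directions."""
    deviations = [abs(p - t) for p, t in zip(perceived, true)]
    return mean(deviations)

# Hypothetical responses of one listener to five static directions
# (azimuths on the 0-240 degree paper-tape scale):
true_azimuths      = [60.0, 120.0, 180.0, 90.0, 120.0]
perceived_azimuths = [63.0, 118.0, 186.0, 95.0, 121.0]
observation = per_listener_average(perceived_azimuths, true_azimuths)
# one such averaged observation per listener enters the ANOVA
```

Averaging per listener keeps the observations statistically independent, since each listener contributes exactly one value per head-tracker/decoder combination.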
Thus, the hypothesis that the choice of head-tracker and binaural decoder does not influence the precision of static and dynamic localization cannot be rejected, at least not for the systems examined in this research.

3.2. Externalization of the sound source
Based on the two-way ANOVA data for the localization of the sound source, the choice of head-tracker was declared unimportant for the purpose of examining the influence of different aspects of the system setup on the externalization of the sound source. Therefore, the choice of the binaural decoder and the reverberation time were examined as the independent factors. As the results of the ANOVA test indicated that the choice of the binaural decoder does not influence perceived externalization, the box-and-whisker plot in Figure 7 shows the results differentiated only by the chosen reverberation times.

Figure 7: Box-and-whisker plot showing the influence of added reverberation on the perceived externalization of the sound source

The two-way ANOVA performed on the obtained data yields p-values of 0.39 and approximately 10^-18 for the binaural decoder and the reverberation time, respectively, as the independent factors, and 0.58 for their interaction. It seems evident that the reverberation time is a major influential factor in the perceived externalization of the sound source, in the sense that reverberation added to an otherwise "dry" sound of a virtual sound source greatly enhances the sense of externalization of that source.
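For a balanced design such as this one (an equal number of averaged observations per factor combination), the two-way ANOVA F statistics can be computed directly from the cell, row, column, and grand means. The following is a self-contained sketch of that computation, not the authors' actual analysis script; p-values would then follow from the F distribution (e.g. scipy.stats.f.sf).

```python
def two_way_anova_f(cells):
    """F statistics for a balanced two-way ANOVA with interaction.

    cells[i][j] is the list of n replicate observations for level i of
    factor A and level j of factor B; every cell must have the same n.
    Returns (F_A, F_B, F_AB).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    grand = sum(x for row in cells for cell in row for x in cell) / (a * b * n)
    cell_m = [[sum(c) / n for c in row] for row in cells]
    a_m = [sum(cell_m[i]) / b for i in range(a)]                       # factor A means
    b_m = [sum(cell_m[i][j] for i in range(a)) / a for j in range(b)]  # factor B means

    # Sums of squares for the main effects, interaction, and error.
    ss_a = b * n * sum((m - grand) ** 2 for m in a_m)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_m)
    ss_ab = n * sum((cell_m[i][j] - a_m[i] - b_m[j] + grand) ** 2
                    for i in range(a) for j in range(b))
    ss_e = sum((x - cell_m[i][j]) ** 2
               for i in range(a) for j in range(b) for x in cells[i][j])

    ms_e = ss_e / (a * b * (n - 1))  # error mean square
    return (ss_a / (a - 1) / ms_e,
            ss_b / (b - 1) / ms_e,
            ss_ab / ((a - 1) * (b - 1)) / ms_e)
```

With factor A as the head-tracker and factor B as the binaural decoder (or decoder and reverberation time for the externalization part), each cell would hold the 15 averaged listener observations for one combination; a large F, i.e. a small p-value, would indicate a significant main effect or interaction.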
The results indicate that the assumption that a greater reverberation time could lead listeners to perceive the sound source as more externalized is confirmed.

4. CONCLUSIONS
The results of the performed research indicate that the precision of localization of the tested binaural audio system with head-tracking is generally better than expected, both for static and dynamic localization. Moreover, the difference in performance between a commercial head-tracker and a smartphone used as a head-tracker is negligible, although some difference was expected for dynamic localization, since the Waves NX tracker uses a Bluetooth connection, whereas the iPhone 12 Pro uses a Wi-Fi connection to send positional data via OSC. The choice of a binaural decoder with a generic set of HRTFs has no significant influence on the precision of localization. As for the externalization of a virtual sound source, added reverberation has a major influence on the sense of externalization of virtual sound sources for augmented reality. Further research is needed with respect to "personalized" binaural decoders that contain individualized HRTFs. The influence of the amount of added reverberation should be looked into more closely as well.

Since the binaural rendering methods and the HRTFs were not varied independently, this is a recognized limitation of the presented research results. Therefore, future research is planned to tackle this problem.

5. ACKNOWLEDGEMENTS
The authors acknowledge financial support from the Croatian Science Foundation (HRZZ IP-2018-01-6308, "Audio Technologies in Virtual Reality Systems for Auralization Applications (AUTAURA)").

6. REFERENCES
1. Fitzpatrick, F.
'3 Ways Spatial Audio Can Transform The Future Of Digital Health', Forbes, 13 January 2022, accessed 27 April 2022, https://www.forbes.com/sites/frankfitzpatrick/2022/01/13/3-ways-spatial-audio-can-transform-the-future-of-digital-health/?sh=5afcc26d208f
2. Jambrošić, K., Krhen, M., Horvat, M. and Jagušt, T., 'Measurement of IMU sensor quality used for head tracking in auralization systems', in: Parizet, E. and Becot, F. (eds.), Proceedings of e-Forum Acusticum 2020, 2020, pp. 2063-2070.
3. Kim, S. M. and Choi, W., 'On the externalization of virtual sound images in headphone reproduction: a Wiener filter approach', J. Acoust. Soc. Am., vol. 117, no. 6, pp. 3657-3665, June 2005, doi: 10.1121/1.1921548. PMID: 16018469.
4. Institute of Electronic Music and Acoustics, 'Plug-in Suite by IEM', accessed 28 April 2022, https://plugins.iem.at/
5. Schoerkhuber, C., Zaunschirm, M. and Hoeldrich, R., 'Binaural Rendering of Ambisonic Signals via Magnitude Least Squares', Fortschritte der Akustik, DAGA, 2018.
6. Politis, A., McCormack, L. and Pulkki, V., 'Enhancement of ambisonic binaural reproduction using directional audio coding with optimal adaptive mixing', in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), October 2017, pp. 379-383.
7. Politis, A., Vilkamo, J. and Pulkki, V., 'Sector-Based Parametric Sound Field Reproduction in the Spherical Harmonic Domain', IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852-866, August 2015, doi: 10.1109/JSTSP.2015.2415762.
8. Pyrczak, F. and Oh, D. M., 'Two-Way ANOVA', in Making Sense of Statistics, 7th ed., New York: Routledge, 2018, p. 252.