Automating the assessment of sound power levels of running vehicles using information extracted from a static video

Marjorie Takai 1
The University of Tokyo, Graduate School of Engineering
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

Hyojin Lee 2
Seoul National University, #220 Liberal Studies
1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea

Miki Yonemura 3, Shinichi Sakamoto 4
The University of Tokyo, Institute of Industrial Science
4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan

1 takai@iis.u-tokyo.ac.jp, 2 leehj@iis.u-tokyo.ac.jp, 3 m-yone@iis.u-tokyo.ac.jp, 4 sakamo@iis.u-tokyo.ac.jp

ABSTRACT

Noise has become a ubiquitous pollutant in big cities, with road traffic noise standing out among the transportation sources. Several countries/regions have developed road noise prediction models based on local measurements, adjusting them to their requirements and goals for evaluating this pollutant. Unlike most developed countries, industrializing countries have different characteristics of noise generation due to differences such as the degree of maintenance (of both vehicle and road conditions) and driving behavior. The data acquisition needed for evaluating the vehicle fleet's sound power emission is expensive and time-consuming. This research proposes using a video camera close to the microphone to automate the data gathering and analysis. Using a Python script, this system extracts the sound pressure, estimates the running speed, assesses the distance from the receiver point, and classifies the vehicle under investigation. In addition, to discard inaccurate data, the expected trajectory and sound pressure are evaluated. The proposed system's performance was compared against manual measurements. The resulting sound levels differ by less than 0.2 dB in most cases, and the automated system acquired 3.4 times the amount of data in the same time interval. After this verification, measurements under different conditions and with different vehicle fleets were carried out in Hokkaido, Oita, and São Paulo.

1. INTRODUCTION

Road Traffic Noise (RTN) is considered the major contributor to noise in cities [1][2], even when compared with other means of transportation, such as trains or airplanes, due to the number and the proximity of these sources surrounding us. In Europe, several countries developed their own RTN prediction models, based on local measurements and adjusted to their requirements and objectives for noise evaluation. Even with the publication of CNOSSOS [3] in 2012 by the European Union as an attempt to standardize the methodology, some countries still use their models locally, like Germany, which published its latest model in 2019 [4]. Thus, the different methods available in the literature are mostly based on locally measured data as the basis for the modeling, integrating several characteristics beyond the interaction of the vehicle and the road in the sound source model.

Unlike most developed countries, industrializing countries have different characteristics of road traffic noise generation: even with the same physical mechanisms of noise generation, other differences, such as legislation, degree of maintenance (of both vehicle and road conditions), and driver behavior (cultural stances), are not addressed in the currently standardized models used in these places, because those models were created in other countries.
This research proposes a method that automates data collection for RTN source modeling, to address the lack of regional characterization of sound emission, whether for a better understanding of understudied places or to update existing databases to accommodate new vehicle technologies and new types of road surfaces.

2. SOUND POWER LEVEL ESTIMATION

The sound pressure level of free-running vehicles at constant speed is measured from a known perpendicular distance d' to a straight road section, with the sound level meter (SLM) positioned at height h. Figure 1 shows a schematic drawing where R represents the reception point (microphone) position, d is the distance between R and the center of the lane, in other words, the minimum distance between R and the moving vehicle, and v is the speed of the vehicle during the measurement. With this information, under the assumption of point source emission, the sound power is estimated according to Equation (1).

Figure 1: (Right) Schematics of the sound power level measurements, showing the position of the microphone (R), the receiving point, relative to the vehicle (S), the sound source, and its ideal sound pressure level during a vehicle passage in front of the microphone. (Left) Example of a measurement setting.

$L_{WA} = L_{A,F,\max} + 20 \log_{10} d + 8$   (1)

Each vehicle passage is characterized by the maximum A-weighted sound pressure level $L_{A,F,\max}$, corresponding to the value when the vehicle is closest to R, its speed, and its vehicle type. Usually, the speed, category, and distance of each event are obtained during the sound pressure measurements, requiring a person to gather this information. In this research, the characterization of each event is done using the information extracted from video recorded by a static video camera placed close to the SLM, as explained in the next section.
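As a minimal illustration, Equation (1) maps directly to a one-line computation. The sketch below is not part of the original system; the function and variable names are illustrative.

```python
import math

def sound_power_level(la_f_max: float, distance: float) -> float:
    """Estimate the A-weighted sound power level L_WA of a vehicle passage
    from Equation (1), under the point-source emission assumption.

    la_f_max -- maximum A-weighted sound pressure level L_A,F,max [dB],
                observed when the vehicle is closest to the receiver R
    distance -- minimum source-receiver distance d [m]
    """
    return la_f_max + 20.0 * math.log10(distance) + 8.0

# Example: a passage peaking at 74.5 dB, measured 10 m from the lane center
print(sound_power_level(74.5, 10.0))  # -> 102.5 dB
```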
3. AUTOMATION OF DATA GATHERING

The data gathering process can be time-consuming, expensive, and prone to human error due to interpretations by the people executing it. In an attempt to improve the data collection, in this research part of the needed data collection is automated using information from a static video camera, processed with a Python script, reducing the amount of resources required, as described below.

3.1. Proposed system

The measurements were performed using a RION NL-52 SLM and a video camera mounted close to each other at 1.5 m from the ground. After tests, it became clear that a camera with at least 60 fps was necessary to obtain enough points to estimate the speed of events. In addition, using a wide-angle lens allowed a compact setting, keeping it portable. After measuring, the video and audio files are processed in order to obtain the speed and category of each event, i.e., its characterization, from the video and the $L_{A,F,\max}$ from the audio. This analysis is done using Python [6] with NumPy [7], OpenCV [8], and TensorFlow [9] as the main libraries, developed from [10], [11].

Figure 2: Basic workflow of the proposed system: measurement (sound and video), analysis (sound processing, trajectory analysis), verification (expected $L_p$, $L_{WA}$ versus speed), and results.

3.2. Video camera characteristics

Using a wide-angle or fisheye lens allows capturing a wider view from the camera standpoint; however, it causes distortions in the image due to the lens characteristics. The image can be undistorted by using a calibration that considers intrinsic (the focal length, for instance) and extrinsic (rotation and translation) parameters. Figure 3 shows a raw image and its undistorted version. Such deformities can decrease the accuracy of further analysis.

Figure 3: Example of (left) an original frame and (right) its undistorted version.

Another property of interest is the relationship between pixels in an image and distances in the 3D world. This is obtained by mapping the real distance and size of a known object. The camera used in this research had the characteristics shown in Figure 4, which allow the estimation of a size at a known distance or of the distance of an object of known size; this is mostly used in the estimation of the vehicle distance d.

3.3. Video processing

Along with the necessary acoustic data, the characterization of the events, in other words, speed and vehicle categorization, was done automatically using the techniques explained next.

Figure 4: Relationship between distance and pixels based on the measurement of a one-meter object with the camera used (panels for 1280 x 720 and 1920 x 1080 resolutions; fitted curves with R² = 0.9999).

3.3.1. Tracking and speed estimation

The first information needed is whether there is a vehicle inside the field of view of the video camera. Under the assumption of a fixed video camera and no abrupt illumination change, a background subtraction technique was used to extract the area that changes during the frame sequence. In this work, the K-nearest-neighbor (KNN) method, as implemented in OpenCV, was used to classify which parts of the image were background (static) or foreground (moving), based on 500 frames.

However, intermediate operations are needed to extract the image of the vehicles properly. In an outdoor environment, the background is not as static as indoors, nor is the lighting condition constant: wind can move the leaves of plants and clouds, for instance, changing the scene modeled as background and making it show up as foreground.

Each original frame was converted to grayscale, then blurred with a Gaussian filter with a kernel of 25 x 25 px. Next, the modeled background was subtracted and a threshold applied to separate values below and above 127 (of 255, the 8-bit maximum value), generating a binary image in which what was identified as background is shown in black and the foreground in white. Still, non-connected parts were not enough to identify most vehicles as one single object, resulting in multiple detections. To correct this, the image was dilated using a structuring element of 3 x 3 px. Finally, only contours with an area greater than 0.5% of the frame area were considered in further analysis. The result of each step of this process is shown in Figure 5. Each identified vehicle passage had its coordinates saved if it passed three checkpoints inside the frame's field of view, used to delimit the start and end of each event.

Figure 5: (a) Example of the image processing used for tracking the moving objects and (b) an example of the resulting coordinates of the trajectory (above) and the data used for speed estimation, the movement on the horizontal plane (below).
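The processing chain described above can be sketched with OpenCV (version 4 assumed) as follows. This is a minimal reconstruction of the described steps, not the authors' actual script: the video file name is illustrative, and the checkpoint logic used to delimit events is omitted.

```python
import cv2

# KNN background subtractor, modeling the background over 500 frames
subtractor = cv2.createBackgroundSubtractorKNN(history=500)

cap = cv2.VideoCapture("measurement.mp4")  # illustrative file name
MIN_AREA_RATIO = 0.005                     # contours > 0.5 % of frame area

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Grayscale conversion and Gaussian blur with a 25 x 25 px kernel
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (25, 25), 0)

    # Background subtraction, then a binary threshold at 127 of 255
    mask = subtractor.apply(blurred)
    _, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

    # Dilation with a 3 x 3 structuring element to merge fragmented parts
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    dilated = cv2.dilate(binary, kernel)

    # Keep only contours larger than 0.5 % of the frame area
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    frame_area = frame.shape[0] * frame.shape[1]
    for c in contours:
        if cv2.contourArea(c) > MIN_AREA_RATIO * frame_area:
            x, y, w, h = cv2.boundingRect(c)
            # The bounding-box center would be stored as the object's
            # coordinate for the trajectory and speed analysis.
            print(x + w / 2, y + h / 2)

cap.release()
```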
After tracking the object, the speed is estimated using the saved coordinates. Figure 5(b) shows an example of a tracked trajectory inside the video camera's field of view (top) and the progression of its x coordinate over time (bottom), the data used to estimate the speed. The y component of the saved points is used to evaluate the object's behavior, because no significant movement is expected in this direction.

For the speed estimation, not only the displacement rate in pixels per second is necessary, but also the distance d from the object to the camera. This was obtained using the known distance between the center of the lane and the video camera. Even when analyzing several lanes, this is done automatically, first using the direction of the moving object and then, if necessary, the position of the trajectory together with the estimated size of the object, using the relationship shown in Figure 4.

3.3.2. Vehicle classification

Another piece of information necessary for this analysis is the type of vehicle under investigation. Among the several state-of-the-art image classification methods in the literature, the method chosen to perform this automatically is a pre-trained neural network (NN), ResNet50 [12]. Instead of training all the weights again, the structure and weights were used as a starting point, substituting the last layer with the desired categories and retraining with a purpose-built database.

This work follows the four-category classification given by the ASJ RTN-Model 2013 [13]; in addition, two more categories were used, special vehicles and other, for a total of six. Motorcycles were not considered due to the small number of events recorded at the time.

The NN was retrained using manually pre-processed data from recorded events, selecting vehicles and dividing them into the six categories. In addition, transformations such as small rotations, scaling, shifts, changes in brightness, and horizontal flips were used to increase the number of samples, a technique known as data augmentation. Each event was classified at three points of the expected trajectory, and only the image identified as foreground (inside the bounding box) was analyzed, at each point, with both the original ResNet50 and the retrained NN, in order to check the categorization accuracy of each event and what was being tracked when it fell into the "other" category of the retrained NN. Since the camera used has a fisheye lens, before each classification the corresponding frame was undistorted and then the area of interest was pre-processed as input for each NN.
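A minimal sketch of this transfer-learning setup in TensorFlow/Keras is given below. The directory layout ("vehicle_crops/"), image size, frozen base, and training parameters are illustrative assumptions, not details reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6  # four ASJ RTN-Model 2013 categories + special vehicles + other

# ResNet50 pre-trained on ImageNet, without its original final layer
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # one option: keep pre-trained weights fixed

# Substitute the last layer with the six desired categories
model = models.Sequential([
    base,
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Data augmentation: small rotations, shifts, scaling, brightness
# changes, and horizontal flips, as described in Section 3.3.2
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    brightness_range=(0.8, 1.2),
    horizontal_flip=True,
    preprocessing_function=tf.keras.applications.resnet50.preprocess_input)

# "vehicle_crops/" is a hypothetical folder with one subfolder per category
train = datagen.flow_from_directory("vehicle_crops/",
                                    target_size=(224, 224),
                                    class_mode="categorical")
model.fit(train, epochs=10)
```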
After manual verification of fewi, orn inter.noise 21-24 AUGUST SCOTTISH EVENT CAMPUS ? O? ? GLASGOW hundred events, it was determined that events with discontinuities (as shown at right), without move- ment, tracked area less than half of the region of interest, expressive deviations on the y axis can be excluded as invalid events and are eliminated at this step automatically by the system.i, orn inter.noise 21-24 AUGUST SCOTTISH EVENT CAMPUS ? O? ? GLASGOWFigure 6 Examples of good (left) non-successful (right) tracking results.Another characteristic considered is the relationship between the expected and the measured sound pressure level considering the sound source as omnidirectional moving at constant speed running at the same lane during the measurement. Two parameters were used: correlation higher than 0.9 and a maximum value of accumulated difference among both curves. Figure 7 exemplifies cases of valid and invalid events.These restrictions are enough for the system to discard most of the invalid data found up to the present moment (~90%), nevertheless, does not consider the presence of external noises nor the ob- struction of the camera’s field of view.‘Height (px) x displacement [px] 8 es38 & Displacement of the center of mass Displacement of the center of mass ny i mm xx E oo 00 a eT 5 seen ae = wo Co $ Fo 00 7 so 75 wo us 0 time unit (67)Figure 7: Examples of valid (left) and invalid (right) of expected sound pressure and measured val- ues.Other event characteristics are used for flagging unusual events, such as the consistency of the vehicle categorization over time at predetermined points, the passage of the object through check- points, length of events, for instance.4. MEASUREMENTSMeasured and expected ls ‘speed -54.2 fawn) time(s} conesThe first important result to check was the calculation of the sound power level calculation and the speed estimation obtained with the proposed system and data gathered and processed using the con- ventional method. The second is the manual annotations and speed obtained using a doppler effect sensor, with manual matching of time events and its characteristics. Two places were used for this, Okita and Yagachi, in Okinawa prefecture, both with an overcast sky and no significant amount of wind (in the height of microphone and sky).The calculation used the same audio data for both methods, the difference was how the character- istics of the event (vehicle type and speed) were obtained, at the place with manual annotations or using data from the video camera. This first verification yielded the results showed in Figure 8 (a)Measures and expected Lp time (5) as i, orn inter.noise 21-24 AUGUST SCOTTISH EVENT CAMPUS ? O? ? GLASGOWand (b), in which the sound power levels different in only two events (due to external noise), with most cases differing less than 0.2 dB; speed showed more differences. The latter is attributed to the differences in where the speed was obtained during the event.In this setting, the proposed method had a satisfactory overall result. For instance, in the Yagachi case, whereas using the manual data acquisition could gather 76 events of small vehicles, the auto- mated system gathered 332 valid events, during the same time interval, considering both measure- ments, as shown in Figure 8 (c) and (d). These events were checked manually to verify their validity. 
4. MEASUREMENTS

The first important result to check was the comparison between the sound power level calculation and the speed estimation obtained with the proposed system and the data gathered and processed with the conventional method, i.e., manual annotations and speeds obtained with a Doppler-effect sensor, with manual matching of the time of each event and its characteristics. Two places were used for this, Okita and Yagachi, in Okinawa Prefecture, both with an overcast sky and no significant wind at the height of the microphone.

The calculation used the same audio data for both methods; the difference was how the characteristics of the event (vehicle type and speed) were obtained, on site with manual annotations or using data from the video camera. This first verification yielded the results shown in Figure 8 (a) and (b), in which the sound power levels differed in only two events (due to external noise), with most cases differing by less than 0.2 dB; the speed showed larger differences, attributed to differences in where along the passage the speed was obtained.

In this setting, the proposed method had a satisfactory overall result. For instance, in the Yagachi case, whereas the manual data acquisition gathered 76 events of small vehicles, the automated system gathered 332 valid events during the same time interval, considering both measurements, as shown in Figure 8 (c) and (d). These events were checked manually to verify their validity. Among the data, less than 5% was invalid, and both approaches yielded similar regression formulations: considering that the sound power level can be estimated as a function of the vehicle running speed V [km/h] as $L_{WA} = a + 30 \log_{10} V$, the manual method resulted in $a = 47.7$ and the valid data of the automated method in $a = 47.2$.

Figure 8: Comparison of results obtained with manual and automated measurements: speed (a) and $L_{WA}$ (b). Number of measurements done in the same time interval in Okita and Yagachi with the automated (c) and manual (d) methods. Sound power level of passenger cars obtained in the São Paulo region (e) and in Hokkaido (f).
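With the slope fixed at 30, fitting this regression reduces to estimating the intercept a by least squares, i.e., averaging $L_{WA} - 30 \log_{10} V$ over all valid events. A small sketch under that assumption, with illustrative numbers:

```python
import numpy as np

def fit_intercept(speeds_kmh, lwa_db):
    """Least-squares estimate of a in L_WA = a + 30 log10(V), with the
    slope fixed at 30 as in the formulation used above."""
    speeds = np.asarray(speeds_kmh, dtype=float)
    lwa = np.asarray(lwa_db, dtype=float)
    return float(np.mean(lwa - 30.0 * np.log10(speeds)))

# Illustrative values only; with the paper's data sets, the manual
# method gave a = 47.7 and the automated valid data a = 47.2.
a = fit_intercept([42.0, 55.0, 61.0], [96.5, 99.8, 101.2])
```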
After this first verification of how the proposed system performs against the manual analysis, it was used in other environments with diverse lighting conditions and vehicle fleets, in an attempt to verify the limits and restrictions of the hypotheses used.

Another set of places was used for testing the performance of the system: two points in São Paulo city and two points in its metropolitan area, as shown in Figure 8. This pilot measurement set was used to verify the performance of the system with a more diverse vehicle fleet, including different models, colors, and types. In addition, it was a good opportunity to sample sound power level emissions under the running conditions common in a different country.

In general, the same conditions applied to this pilot data set: passenger cars and trucks were successfully processed (tracked, identified, speed estimated, and the corresponding sound power level calculated) independently of colors and symbols, whereas very large vehicles presented tracking problems, as did scenes with a moving background, because the hypotheses and methods used in this work require objects that fit inside the camera's field of view and a relatively static background.

5. CONCLUSIONS

The data acquisition needed for evaluating the vehicle fleet's sound power emission can be expensive and time-consuming. This research proposes using a video camera close to the microphone to automate the data gathering and analysis. Using a Python script, the system extracts the sound pressure, estimates the running speed, assesses the distance from the receiver point, and classifies the vehicle under investigation. In addition, to discard inaccurate data, the expected trajectory and sound pressure are evaluated. The proposed system's performance was compared against manual measurements; the resulting sound power levels differ by less than 0.2 dB in most cases in the Yagachi measurement.

The current method/system performs under certain conditions: a relatively static background (in other words, no significant moving shadows or clouds, for instance), no external noise, a distance from the vehicles between 7 and 20 m, no light source facing the camera, and vehicles that fit inside the camera view.

In order to achieve a fully automated data acquisition system, future work aims to include the tracking of vehicles that do not fit completely in the camera's field of view and a more robust method for identifying events that might be affected by external noise sources, identifiable or not in the camera, such as airplanes, insects, and birds.

6. REFERENCES

[1] A. Alexandre, Road Traffic Noise. London: Applied Science Publishers Ltd, 1975.
[2] C. Nugent, N. Blanes, J. Fons, M. Sáinz de la Maza, M. J. Ramos, F. Domingues, A. van Beek, and D. Houthuijs, "Noise in Europe 2014," European Environment Agency, 2014.
[3] S. Kephalopoulos, M. Paviotti, and F. Anfosso-Lédée, Common Noise Assessment Methods in Europe (CNOSSOS-EU). Publications Office of the European Union, 2012, 180 p. doi: 10.2788/31776.
[4] "Richtlinien für den Lärmschutz an Straßen (Guidelines for noise protection on roads)," FGSV, Cologne, Germany, RLS-19, Sep. 2019.
[5] "The World's Cities in 2018," United Nations, Department of Economic and Social Affairs, Population Division, Data Booklet ST/ESA/SER.A/417, 2018. Accessed: Jan. 07, 2020. [Online]. Available: https://www.un.org/en/events/citiesday/assets/pdf/the_worlds_cities_in_2018_data_booklet.pdf
[6] G. Van Rossum and F. L. Drake, Python 3 Reference Manual. Scotts Valley, CA: CreateSpace, 2009.
[7] T. E. Oliphant, A Guide to NumPy, vol. 1. Trelgol Publishing, USA, 2006.
[8] G. Bradski, "The OpenCV library," Dr. Dobb's Journal of Software Tools, 2000.
[9] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283.
[10] M. Takai, "Study on automatic determination of sound power level of vehicles in situ for environmental noise prediction," Master's dissertation, 2018.
[11] M. Takai, H. Lee, and S. Sakamoto, "Automatic power level estimation of running vehicles by using video camera and microphone," 2018 Autumn Architectural Institute of Japan Annual Convention, p. 2.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015.
[13] S. Sakamoto, "Road traffic noise prediction model 'ASJ RTN-Model 2013': Report of the Research Committee on Road Traffic Noise," Acoust. Sci. Technol., vol. 36, no. 2, pp. 49-108, 2015, doi: 10.1250/ast.36.49.