
Citation: Proc. Mtgs. Acoust. 47, 070008 (2022); doi: 10.1121/2.0001601

 


 

Published by the Acoustical Society of America

 


Exploring the use of AI in marine acoustic sensor management

 

Edward Clark and Alan Hunter

 

Department of Mechanical Engineering, University of Bath, Bath, BA2 7AY, UNITED KINGDOM; eprc20@bath.ac.uk; A.J.Hunter@bath.ac.uk

 

Olga Isupova

 

Department of Computer Science, University of Bath, Bath, UNITED KINGDOM; oi260@bath.ac.uk

 

Marcus Donnelly

 

Systems Engineering & Assessment Ltd, Beckington, Somerset, BA11 6TA, UNITED KINGDOM; marcus.donnelly@sea.co.uk

 

Underwater passive acoustic source detection and tracking is important for various marine applications, including marine mammal monitoring and naval surveillance. Performance in these applications depends on the placement and operation of sensing assets, such as autonomous underwater vehicles. Conventionally, these decisions have been made by human operators aided by acoustic propagation modelling tools, situational and environmental data, and experience. However, this is time-consuming and computationally expensive. We consider a ‘toy problem’ of a single autonomous vehicle (agent) searching for a stationary low-frequency source within a reinforcement learning (RL) architecture. We initially choose the observation space to be the agent’s current position. The agent is allowed to explore the environment with a limited action space, taking equal-distance steps in one of n directions. Rewards are received for positive detections of the source. Using OpenAI’s PPO algorithm, an increase in median episode reward of approximately 20 points is seen in the RL environment developed when the agent is given a history of its previous moves and signal-to-noise ratio, compared with the simple state. Future expansion of the RL framework is discussed in terms of the observation and action spaces, the reward function, and the RL architecture.

 

1. INTRODUCTION

 

Underwater passive acoustic source detection and tracking is important for various marine applications. Performance in these applications depends on the placement and operation of the available sensing assets. Currently, decisions are made by human operators using acoustic propagation models, situational and environmental data, and domain expertise. With the advent of numerous different sensing assets, including dipping sonar, sonobuoys, towed arrays and autonomous underwater vehicles (AUVs), a decision aid to guide deployment decisions is an invaluable tool. In this paper we address the question

 

‘How can we use machine learning to provide explainable deployment policies in passive marine acoustic sensor management?’

 

Numerous conventional approaches (i.e. without machine learning) exist for determining sensor performance in a given environment.4 A common method is to compute a localised performance map of the survey area using acoustic propagation simulations. Unfortunately, this approach is computationally and data intensive, which limits its utility for making rapid sensor management decisions. Approaches based on machine learning, however, have the potential to generalise better across a range of environments.14

 

Machine learning has been used extensively in the marine domain.8 Existing applications often focus on interpreting data with models trained on existing datasets. In many application areas such data is challenging or expensive to collect, and the resulting data scarcity can limit the performance of these methods. The same issue affects the development of machine-learning-based sensor management decision tools, since running large numbers of experimental surveys to train models would be prohibitively expensive.


The management of a single sensor can be formulated as a partially observable Markov decision process,3 where the management decision depends only on the currently observed data. In passive acoustic sensor management this could be a subset or all of the available real-time or forecast environmental data together with the current location. The partial observability refers to hidden variables that cannot be forecast, such as complex multi-path effects or un-mapped bathymetry.

 

With a limited state and action space, the number of allowable observations and actions is small, so linear programming can be used to optimise the sensor path.3 This linear programming technique is limited by the size of the domain and by the basic assumption of a fixed detection probability at each location. We build on this idea of assigning each location a detection probability by introducing acoustic modelling to assess whether a detection is made in a specific state.

 

There are multiple methods for optimising a path through an environment, including reinforcement learning3 and genetic algorithms.5,19 Reinforcement learning (RL) is a machine learning technique that exploits the formalisation of the problem as a Markov decision process to optimise the decision made in a given state.15 Figure 1 shows how an agent observes a state, decides on an action, and then receives an updated state and a reward from the environment based on how good that action was. RL has already been used in the marine sector for communication networks, where a reward is given for successful data packet transfer over acoustic modems; the reward allows the system to learn to optimise its operation in a time-varying sound channel.17

 

Other promising machine learning techniques applied to this problem include genetic algorithms.5,19

 

These approaches typically assume a fixed detection range from the sensor, which is unrealistic in the passive detection scenario because the detection probability varies with environmental conditions. The main complexity in the problem addressed here lies in the environmental conditions rather than in the path variability that the genetic algorithm approach5 aims to solve; the complexity of the action space can be increased once the environmental variability has been introduced and understood. Here we construct a framework that can be built on to produce passive acoustic management decisions.

 

So that this framework is as robust as possible, we build it with well-established tools. OpenAI’s Gym2 is a Python library for building and releasing reinforcement learning problems. It separates the agent from the environment, so that authors can release environments independently and others can benchmark their agents and learning algorithms against them. This tool is already being used within the marine domain, for example to train a ship path-planning policy using weather forecasts.1
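To make this concrete, the sketch below shows how the localisation task of Section 2 could be expressed as a Gym environment (classic pre-0.26 API). The class name, default parameter values and the placeholder detection test are illustrative assumptions, not the implementation used to generate the results in this paper.

```python
import numpy as np
import gym
from gym import spaces


class SourceSearchEnv(gym.Env):
    """Illustrative sketch of the localisation task as a Gym environment.

    A stationary low-frequency source is placed in a square domain and the agent
    (an AUV at fixed depth) takes equal-length steps in one of N headings.
    """

    def __init__(self, n_directions=8, step_length=1000.0, domain=25000.0, max_steps=100):
        super().__init__()
        self.n_directions = n_directions
        self.step_length = step_length        # l, metres
        self.domain = domain                  # 25 km x 25 km survey area
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(n_directions)
        # Simple baseline state: the agent's absolute (x, y) position.
        self.observation_space = spaces.Box(0.0, domain, shape=(2,), dtype=np.float32)

    def reset(self):
        self.source = np.random.uniform(0.0, self.domain, size=2)
        self.position = np.random.uniform(0.0, self.domain, size=2)
        self.n_steps = 0
        return self.position.astype(np.float32)

    def step(self, action):
        heading = 2.0 * np.pi * action / self.n_directions
        self.position = self.position + self.step_length * np.array(
            [np.cos(heading), np.sin(heading)])
        self.n_steps += 1
        detected = self._detect()
        # Zero reward for a detection, -1 otherwise (the boundary penalty of
        # Section 3C is omitted from this sketch).
        reward = 0.0 if detected else -1.0
        done = (self.n_steps >= self.max_steps
                or np.linalg.norm(self.position - self.source) < self.step_length
                or np.any(self.position < 0.0) or np.any(self.position > self.domain))
        return self.position.astype(np.float32), reward, done, {}

    def _detect(self):
        # Placeholder so the sketch runs: the real environment queries the acoustic
        # model (transmission loss + passive sonar equation + 10 dB threshold).
        return np.linalg.norm(self.position - self.source) < 5000.0
```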

 

 

Figure 1: The cycle of reinforcement learning. Agents interact with an environment through actions based on their state to receive rewards. Adapted from15

 

2. SIMPLIFIED PROBLEM

 

The learning environment presented here is a basic localisation task. It is trivial to change the parameters to generate more complex environments; however, the parameters selected produce a set of results that can be intuitively and critically analysed. Figure 2 shows the basic concept of the environment. A source is assigned a location within the environment and an agent (the acoustic sensor platform) is then allowed to explore to locate it.

 

 

Figure 2: Reinforcement learning environment. L is the distance the agent travels at each step and dn is the number of directions it can choose to take.

 

Large marine mammals such as fin and blue whales typically vocalise in the 15 - 30 Hz range,10 so we use a single frequency of 20 Hz for all simulations. Source levels for these low-frequency calls are between 188 and 191 dB10 and the calls originate from 15 to 30 metres depth. For simplicity we restrict calls to originate in a single horizontal plane at 30 m depth with a source level of 190 dB.

 

The underwater environment is kept simple but realistic, to show that this method will work in real-world scenarios while maintaining the ability to intuitively interpret the results. A Munk channel sound speed profile11 is used, as shown in Figure 2, with the sub-bottom given a sediment-to-water sound speed ratio of 0.98 (1505 m/s), typical of deep-sea mud. The sediment layer is modelled as flat, starting at 3 km depth. A noise background of 90 dB re 1 µPa at 20 Hz, taken from the Wenz curves, is used across the domain.6,18 The domain is 25 km by 25 km: large enough to capture complex acoustic behaviour but small enough that simulations are not prohibitively long.
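For reference, the scenario described above can be collected into a single configuration object. The dictionary below simply summarises the values quoted in the text; it is an illustrative structure, not a configuration file from the authors’ code.

```python
# Scenario parameters quoted in Section 2 (illustrative summary only).
SCENARIO = {
    "frequency_hz": 20.0,            # single tone representative of fin/blue whale calls
    "source_level_db": 190.0,        # dB re 1 uPa at 1 m
    "source_depth_m": 30.0,
    "noise_level_db": 90.0,          # Wenz-curve background at 20 Hz
    "water_depth_m": 3000.0,         # flat sediment layer starts at 3 km depth
    "sediment_speed_ratio": 0.98,    # sediment/water sound speed ratio (~1505 m/s)
    "domain_m": 25000.0,             # 25 km x 25 km survey area
    "sound_speed_profile": "munk",   # Munk channel profile
}
```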

 

The sensor platform we simulate is an AUV. These platforms typically survey at a speed of 2 m s−1 and have an endurance of up to 72 hours; to keep the domain size down we restrict missions to around half a day, or roughly 100 km of travel. As with the source, the sensor platform is restricted to a horizontal plane, at 300 m depth. The platform moves in one of N directions dn, travelling a distance l at each step. The sensor on the platform is modelled as a basic threshold detector: if the signal-to-noise ratio (SNR) is above a threshold, the sensor will always make a detection.

 

3. METHODS

 

Reinforcement learning depends on an agent interacting with an environment to learn a policy. We implement acoustic propagation modelling in the environment to define the reward function and thus the learnt behaviour. The environment is modelled with realistic parameters, but as a simplified toy problem.

 

A. ACOUSTIC MODELLING

 

The acoustic modelling is used to calculate the transmission loss (TL) from the source to the sensor. We use pyRAM,9 a Python implementation of the Range-dependent Acoustic Model (RAM) parabolic equation solver, to calculate this TL value. RAM is appropriate in this low-frequency domain, outperforming other solvers such as BELLHOP or Kraken7 for range-dependent TL.

 

Once the TL has been calculated, we use the passive sonar equation to determine the SNR. Equation 1 shows how we combine the source level (SL) from the chosen source characteristics, the TL from the acoustic model, and the noise level (N) from the Wenz curves:

SNR = SL − TL − N.    (1)

Once the SNR is calculated, we use a detection threshold of 10 dB to determine whether there is a detection. Figure 3 shows how SNR is mapped to detections within the domain. Even with no complex bathymetry, complex concentric patterns can form from reflection and refraction through the sound speed channel and the ocean bottom.
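The mapping from modelled transmission loss to a binary detection can be written compactly. The sketch below implements Equation 1 and the 10 dB threshold; the function names are illustrative and the TL value is assumed to come from the parabolic equation model.

```python
DETECTION_THRESHOLD_DB = 10.0


def snr_db(sl_db: float, tl_db: float, n_db: float) -> float:
    """Passive sonar equation (Equation 1): SNR = SL - TL - N, all in dB."""
    return sl_db - tl_db - n_db


def is_detection(tl_db: float, sl_db: float = 190.0, n_db: float = 90.0) -> bool:
    """Threshold detector: a detection is declared whenever the SNR exceeds 10 dB."""
    return snr_db(sl_db, tl_db, n_db) >= DETECTION_THRESHOLD_DB


# Example: TL = 85 dB gives SNR = 190 - 85 - 90 = 15 dB, which is a detection.
```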

 

B. REINFORCEMENT LEARNING

 

As already set out, RL is a subfield of machine learning in which we allow an agent to learn an optimal policy in a given environment. The learning phase can be either on-policy or off-policy. Off-policy means the policy learnt is not the one used for exploration; on-policy means that the exploration and final policies are the same. Off-policy learning lends itself to learning a static policy that can be reliably interrogated; on-policy learning can either be used to produce a static policy, or the agent can continue to learn after the ‘training’ phase. Such online learning can lead to the agent learning unexpected behaviour that cannot be replicated.

 

 

Figure 3: Translation of SNR (left) to detection probability (right) at 300m depth of a 100km by 100km domain with flat bathymetry.

 

The policy governs how an agent determines the action to take in a given state. In practice the state is a vector containing information that uniquely identifies the agent’s current situation. During the training phase the agent chooses an action dependent on its state and interacts with the environment, receiving two pieces of information back: an updated state and a reward. The policy is then updated depending on the reward received for the action taken in that state. This cycle of state, action and reward is repeated until one of the termination criteria is met; this is known as an episode. The training phase should be run until the policy converges to a stable average reward. Learning a policy is non-trivial and does not always converge if the problem is not well defined. The algorithm we use to explore this problem is OpenAI’s proximal policy optimisation (PPO), as implemented in the Stable-Baselines3 library.12,13
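With an environment that follows the Gym interface, such as the SourceSearchEnv sketch given earlier, training with PPO from Stable-Baselines3 takes only a few lines. The snippet below uses library-default hyperparameters and a placeholder timestep budget; it is not the configuration used for the results in Section 4.

```python
from stable_baselines3 import PPO

env = SourceSearchEnv()  # the illustrative environment sketched earlier

# "MlpPolicy" maps the state vector to action probabilities with a small
# fully connected network; total_timesteps here is a placeholder budget.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("ppo_source_search")
```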

 

C. REWARD FUNCTION

 

The reward function, along with the termination criteria, sets the learning behaviour of the agent. Using a negative or zero reward at each step gives two advantages: first, it penalises longer episodes, encouraging shorter, more direct paths to the source; second, it fixes the maximum episode reward at 0, providing a limit against which to quantify how good a policy is. Evaluating the reward function is the most computationally expensive step, requiring an acoustic simulation to be run in addition to the policy update. Our framework uses a priori environmental data to calculate the reward.

 

 

Our chosen reward function is shown in Equation 2, where Rn is the reward at step n. The agent is penalised for no detection and given zero reward for a detection. In addition, we introduce a penalisation term p = −100, which scales linearly towards the edge within a border b as the agent’s distance to the edge de decreases. This encourages the agent to stay within the domain without terminating the episode. A demonstration of how the agent might move through the environment is shown in Figure 4.
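Equation 2 is not reproduced here, but the behaviour it describes can be sketched in code. The exact functional form of the boundary penalty below is an assumption based on the description above: zero reward for a detection, -1 otherwise, and a penalty that scales linearly from 0 at the border b to p = -100 at the domain edge.

```python
def step_reward(detected: bool, dist_to_edge: float,
                border: float = 2000.0, p: float = -100.0) -> float:
    """Assumed form of the per-step reward R_n described around Equation 2.

    detected      -- True if the threshold detector fired at this step
    dist_to_edge  -- agent's distance d_e to the nearest domain boundary (m)
    border        -- width b of the penalised border region (illustrative value)
    p             -- penalty reached at the domain edge
    """
    reward = 0.0 if detected else -1.0
    if dist_to_edge < border:
        # Linear scaling: no penalty at the border, full penalty p at the edge.
        reward += p * (border - dist_to_edge) / border
    return reward
```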

 

Three termination criteria are given for ending an episode:

 

  • Reaching n = N total steps, here N = 100
  • Coming within the step length l of the source
  • Exiting the domain boundary

 

Figure 4a shows a high reward path that terminates close to the source at timestep tT rather than tN . In contrast Figure 4b shows how an agent may get a few detections then veer off to the edge of the domain.

 

 

Figure 4: Demonstration of a high (-3) and low (-12) reward sensor path. Yellow crosses are detections. This runs from the episode start t0 to the termination step tT .

 

D. STATE AND TRAINING

 

The simplest state vector (denoted o, for observation) that describes a sensing platform’s situation is its location; this is therefore a reasonable baseline to compare against, as shown in Equation 3. This state is specific to each training domain, so we introduce a new ‘memory’ state intended to allow transfer of learnt behaviour to a new environment. The memory state (Equation 4) includes the SNR values from the acoustic propagation model and relative rather than absolute positions; this should allow the agent to learn a simple local gradient from the last m steps.
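The two observation vectors can be assembled as below. Equation 3 is the agent’s absolute position; the layout and memory length m used for the memory state of Equation 4 are illustrative assumptions consistent with the description above.

```python
import numpy as np


def simple_state(x: float, y: float) -> np.ndarray:
    """Baseline observation o = (x_n, y_n): the agent's absolute position (Equation 3)."""
    return np.array([x, y], dtype=np.float32)


def memory_state(rel_steps, snrs, m: int = 2) -> np.ndarray:
    """Memory observation (assumed layout for Equation 4): the last m relative
    displacements and their SNR values, so a local SNR gradient can be learnt
    independently of the absolute training domain."""
    rel = np.asarray(rel_steps[-m:], dtype=np.float32).ravel()  # m pairs of (dx, dy)
    snr = np.asarray(snrs[-m:], dtype=np.float32)
    return np.concatenate([rel, snr])
```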

 

 

We also use two training scenarios. First, we fix the source in one location between episodes; this should allow the agent to learn to go towards that location every time. Second, we use a changing source location; this forces the agent to learn to explore and navigate unseen conditions to detect and locate the source.

 

 

Figure 5: Training rewards during learning for the simple state with a fixed source location between each episode (in blue) and the memory state model with a memory length of two. Solid lines denote the median episode reward over the previous 50 episodes; the interquartile range is shaded. The minimum nominal score of -100 is shown, corresponding to an agent that takes the maximum 100 steps with no detections and without entering the penalty boundary.

 

4. RESULTS

 

Figure 5 shows the learning behaviour for the two scenarios. In red is an agent using the simple state vector with a fixed source location between each episode; in blue is the memory state vector trained with a changing source location between episodes. The overall better performance of the simple state vector in the fixed training scenario is expected, as it is a simpler task to solve. Both learnt policies converged to a stable median reward after 1500 episodes, showing that this framework can produce stable solutions.
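The learning curves in Figure 5 summarise episode rewards with a trailing median and interquartile range; a minimal way to compute these statistics from a list of episode rewards is sketched below.

```python
import numpy as np


def rolling_quartiles(episode_rewards, window: int = 50):
    """Median, lower and upper quartile of episode rewards over a trailing window,
    as plotted for the learning curves in Figure 5."""
    rewards = np.asarray(episode_rewards, dtype=float)
    medians, lower, upper = [], [], []
    for i in range(window, len(rewards) + 1):
        q1, med, q3 = np.percentile(rewards[i - window:i], [25, 50, 75])
        medians.append(med)
        lower.append(q1)
        upper.append(q3)
    return np.array(medians), np.array(lower), np.array(upper)
```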

 

The behaviour of the simple state vector model with a fixed source location is shown in Figure 6. The agent has learnt to go towards the source, with each of the four episodes reaching the termination criterion within l = 1 km of the source. We also see the expected detection behaviour: no detections far from the source, with the density of detections increasing towards the source. The basic structure of environment definition, agent training and understandable policy is demonstrated to work.

 

Figure 7 demonstrates how this simple state breaks down for a changing source location. While detections are made, they show no relation to the source. Taking as an example the episode starting at E1, which is meant to locate source S1: given its initial proximity, the agent should detect this source, yet instead of navigating towards it the agent continues towards the bottom left of the domain.

 

Implementing a state vector with a short memory of previous steps and the SNR, we see a marked change in Figure 8 compared with Figure 7. There is more exploratory behaviour, as seen from E2. We also see that the agent has learnt to accumulate rewards rather than just heading straight for the source, as seen from E3 to S3. This policy clearly does not perform as well as in the simple fixed scenario, and one must also note that this is a selected sample of four episodes used to visually highlight behaviour patterns.

 

To compare the simple and memory models with a changing source location between episodes, we evaluate the learnt policies over 300 episodes. The histograms of the episode rewards are shown in Figure 9.
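Evaluating a learnt policy over a fixed number of episodes, as done here for Figure 9, can be performed with the evaluation helper in Stable-Baselines3. The snippet below is a sketch: the saved-model name and environment are the illustrative ones used earlier, while the 300-episode count is taken from the text.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

model = PPO.load("ppo_source_search")
env = SourceSearchEnv()  # changing source location between episodes

# Per-episode rewards from 300 evaluation episodes, as histogrammed in Figure 9.
episode_rewards, episode_lengths = evaluate_policy(
    model, env, n_eval_episodes=300, return_episode_rewards=True
)
```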

 

 

Figure 6: Four example sensor paths (from Episode start En looking for source Sn ) using the policy learnt with a simple state model ( o = ( xn , yn ) ), and a fixed source location between each episode during training at location x = 10km, y=10km.

 

 

Figure 7: Four example sensor paths (from Episode start En looking for source Sn ) using the policy learnt with a simple state model ( o = ( xn , yn ) ), and a changing source between each episode during training. (b) shows the paths relative to a centred source.

 

 

Figure 8: Four example sensor paths (from Episode start En looking for source Sn ) using the policy learnt with the memory state model ( m = 2 ), and a changing source between each episode during training. (b) shows the paths relative to a centred source.

 

 

Figure 9: Policy evaluation between the simple state model as in Figure 7 and the memory state model as in Figure 8. The histograms show the rewards of 300 evaluation episodes using the learnt policies.

 

The high density of episodes with a reward of -100 is due to the maximum episode length being reached without a single detection; we would expect a number of these even with a good policy, as the start location of the agent may simply be too distant from the source to make an initial detection. They are far more frequent for the simple state model, reflecting a higher proportion of episodes gaining no reward. The simple state model is also likely to show a spike in the zero to -10 range, as it has learnt to seek out the termination criteria, as already seen in the bottom left of Figure 7a; if an episode starts close to the edge it is likely to head towards the edge and terminate, leaving no time to accumulate further penalisation. The introduction of the memory state leads to a higher density of rewards in the -60 to -20 range, showing that giving the agent more information allows it to learn more optimal behaviour.

 

5. CONCLUSION

 

We have demonstrated a framework for incorporating passive acoustic modelling into reinforcement learning for sensor path optimisation. Within this framework, industry-standard algorithms can be trained and have been shown to converge. By introducing more information to the agent, it can learn a policy that is robust to changes between environments. The memory state model introduced here achieved an approximately 20-point increase in mean reward over 300 episodes compared with a naive location state in the RL environment developed.

 

This work forms a framework that can be built on in future. As agent behaviour can change substantially with the amount of information available, as demonstrated in the results, future work will focus on increasing the action and observation spaces during learning. The simple environment used here allows individuals to form an intuitive idea of a ‘good’ pathway. By introducing more information to the agent, such as environmental parameters and operating conditions, we intend to build more robust policies that generalise across domains. This may require compressing these complex inputs into ‘features’ rather than providing the entire vector to the learning algorithm.16 These path-planning policies can then be utilised in larger deployment-management learning scenarios. The sequential build-up from simple single-sensor path planning to more complex heterogeneous multi-sensor management will provide an understandable tool to aid decision makers in an increasingly complex domain.

 

REFERENCES

 

  1. E. Artusi. Ship path planning based on Deep Reinforcement Learning and weather forecast. In 2021 22nd IEEE International Conference on Mobile Data Management (MDM), pages 258–260, June 2021.
  2. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv:1606.01540 [cs], June 2016.
  3. J. N. Eagle. The Optimal Search for a Moving Target When the Search Path Is Constrained. Operations Research, 32(5):1107–1115, Oct. 1984.
  4. C. Gervaise, Y. Simard, F. Aulanier, and N. Roy. Optimizing passive acoustic systems for marine mammal detection and localization: Application to real-time monitoring north Atlantic right whales in Gulf of St. Lawrence. Applied Acoustics, 178:107949, July 2021.
  5. H. Guo, Z. Mao, W. Ding, and P. Liu. Optimal search path planning for unmanned surface vehicle based on an improved genetic algorithm. Computers & Electrical Engineering, 79:106467, Oct. 2019.
  6. W. A. Kuperman and P. Roux. Underwater Acoustics. In T. D. Rossing, editor, Springer Handbook of Acoustics, Springer Handbooks, pages 157–212. Springer, New York, NY, 2014.
  7. E. T. Küsel and M. Siderius. Comparison of Propagation Models for the Characterization of Sound Pressure Fields. IEEE Journal of Oceanic Engineering, 44(3):598–610, July 2019.
  8. K. Malde, N. O. Handegard, L. Eikvil, and A.-B. Salberg. Machine intelligence and the data-driven future of marine science. ICES Journal of Marine Science, 77(4):1274–1285, July 2020.
  9. marcuskd. pyram: Python adaptation of the Range-dependent Acoustic Model (RAM). https://github.com/marcuskd/pyram, June 2022.
  10. B. S. Miller, S. Calderan, R. Leaper, E. J. Miller, A. Širović, K. M. Stafford, E. Bell, and M. C. Double. Source Level of Antarctic Blue and Fin Whale Sounds Recorded on Sonobuoys Deployed in the Deep Ocean Off Antarctica. Frontiers in Marine Science, 8, 2021.
  11. W. H. Munk. Sound channel in an exponentially stratified ocean, with application to SOFAR. The Journal of the Acoustical Society of America, 55(2):220–226, Feb. 1974.
  12. A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
  13. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], Aug. 2017.
  14. M. Sonnewald, R. Lguensat, D. C. Jones, P. D. Dueben, J. Brajard, and V. Balaji. Bridging observations, theory and numerical simulation of the ocean using machine learning. Environmental Research Letters, 16(7):073008, July 2021.
  15. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, second edition, 2018.
  16. D. F. Van Komen, T. B. Neilsen, D. P. Knobles, and M. Badiey. A feedforward neural network for source range and ocean seabed classification using time-domain features. page 070003, Bruges, Belgium, 2019.
  17. C. Wang, Z. Wang, W. Sun, and D. R. Fuhrmann. Reinforcement Learning-Based Adaptive Transmission in Time-Varying Underwater Acoustic Channels. IEEE Access, 6:2541–2558, 2018.
  18. G. M. Wenz. Acoustic Ambient Noise in the Ocean: Spectra and Sources. The Journal of the Acoustical Society of America, 34(12):1936–1956, Dec. 1962.
  19. E. Yakıcı and M. Karatas. Solving a multi-objective heterogeneous sensor network location problem with genetic algorithm. Computer Networks, 192:108041, June 2021.