
Proceedings of the Institute of Acoustics

 

DAFI: A deep learning-based autofocus improvement metric for synthetic aperture sonar

 

J. Dale, Naval Surface Warfare Center, Panama City, FL, USA
M. Emigh, Naval Surface Warfare Center, Panama City, FL, USA
J. Prater Jr., Naval Surface Warfare Center, Panama City, FL, USA

 

1 INTRODUCTION

 

In synthetic aperture radar (SAR), and especially synthetic aperture sonar (SAS), uncompensated platform motion often leads to defocused(†1) images. When images are highly defocused, it is difficult to detect and identify objects on the seafloor, both for human operators and automated target recognition (ATR) algorithms. As a consequence, the detection rate of ATR algorithms can suffer and post-mission analysis times can increase. To make clear what we mean by focus and defocused SAS images, we provide example SAS image pairs in Figure 1.

 

First, we briefly survey a number of methods that exist in the literature for directly performing autofocus on beamformed SAR/SAS images. Early methods for data-driven SAR autofocus were developed by Brown and Ghiglia¹,², culminating in the popular phase gradient autofocus (PGA) algorithm³,⁴. While SAS and SAR are similar sensing modalities, Callow et al.⁵ point out that PGA requires some tailoring to achieve SAR-like performance on SAS imagery. As such, they developed a method called stripmap phase gradient autofocus (SPGA), which is commonly used in practice for SAS systems.

 

It is not lost upon the community that SAR/SAS autofocus could benefit from deep learning-based approaches. For example, in SAR, SAE-Net⁶ employs a denoising autoencoder to combine image formation with autofocus in a self-supervised paradigm, and the AFnet/PAFnet models⁷ directly learn to correct polynomial phase error. Most relevant to SAS, Gerg and Monga developed a highly effective deep learning model for performing SAS autofocus on low size, weight, and power (SWaP) devices⁸.

 

The proposed DAFI metric, however, is just that: a metric, not an autofocus algorithm. It is designed to be paired with an autofocus algorithm and to provide a second opinion as to whether image quality has actually been improved. Ideally, this will lead both to increased operator confidence in autofocus algorithms as a whole and to generally clearer SAS images. We have found a practical need for such a method and no references in the literature that address the same task; thus, DAFI was developed.

 

The rest of this paper is organized as follows. We describe DAFI in detail in Section 2, including the model architecture and training procedure. Experiments demonstrating the effectiveness of DAFI are then presented in Section 3. Finally, conclusions are drawn in Section 4.

 

2 METHODOLOGY

 

We seek a deep learning model that accepts two realizations of an ATR snippet as input and outputs a confidence indicating whether the latter is of higher quality than the former. When the two inputs are successive iterations of a SAS autofocus algorithm, the output confidence of the network should be a strong predictor of whether the autofocus has improved the quality of the snippet. In this section, we describe DAFI, the model we have developed in order to achieve this goal, and detail our procedure for training it.

 

 

Figure 1: Two examples of defocused (left) and focused (right) SAS image pairs, with key differences circled in red.


 

Figure 2: DAFI architecture, containing around 11.3 million learnable parameters. Most model parameters are contained in the red nodes.

 

 

2.1 Model Architecture

 

The basis for DAFI is the widely used ResNet-18⁹, with full model details given in Figure 2. The only difference between the ResNet-18 block in Figure 2 and the originally published model is the number of input channels, which was previously three (for RGB images) and is now one (high-frequency SAS intensity). We choose this model primarily for its relatively low parameter count and fast training time, which lowers the SWaP requirements should this model be deployed to edge systems. To further reduce model SWaP, we separate the feature extraction components of ResNet-18 (the layers prior to the fully-connected layer) from the classification components (the fully-connected layer) in order to reuse the learnable feature extractor parameters for our paired SAS snippet inputs. Without this separation, the model would have 22.5 million parameters, as opposed to 11.3 million parameters with a shared-weight feature extractor.
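For concreteness, the following is a minimal PyTorch sketch of a shared-weight (Siamese-style) ResNet-18 feature extractor paired with a small fully-connected classifier. The class name, classifier widths, and layer count are our own illustrative assumptions; the paper specifies only that the classifier operates on the two concatenated 512-dimensional feature vectors.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PairedSnippetComparator(nn.Module):
    """Sketch of the shared-extractor design described above (names and
    classifier widths are illustrative, not taken from the paper)."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # One input channel (high-frequency SAS intensity) instead of three (RGB).
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Shared feature extractor: every ResNet-18 layer before the fully-connected layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # Classifier over the concatenated 512 + 512 features.
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, first: torch.Tensor, second: torch.Tensor) -> torch.Tensor:
        f1 = self.features(first).flatten(1)   # (N, 512)
        f2 = self.features(second).flatten(1)  # (N, 512)
        # Logit that the second snippet is better focused than the first.
        return self.classifier(torch.cat([f1, f2], dim=1)).squeeze(1)
```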

 

During training, each of the two snippets is first passed through the feature extractor. The two resulting feature vectors of length 512 are then concatenated and fed to the classifier, a sequence of fully-connected layers. The output is the model's confidence that the second snippet is of higher quality than the first, and the model is trained with a binary cross-entropy loss. This process, including activation and batch normalization, is diagrammed in Figure 2.
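As a hedged illustration of this forward pass and loss, the snippet below reuses the PairedSnippetComparator class sketched above with PyTorch's binary cross-entropy on logits; the optimizer choice and learning rate are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

model = PairedSnippetComparator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder settings
criterion = nn.BCEWithLogitsLoss()  # binary cross entropy on the comparison logit


def training_step(first: torch.Tensor, second: torch.Tensor, label: torch.Tensor) -> float:
    """One optimization step; label = 1 means the second snippet is better focused."""
    optimizer.zero_grad()
    logit = model(first, second)
    loss = criterion(logit, label.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```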

 

2.2 Training

 

To fully utilize our dataset and avoid a time-consuming and subjective labeling process, we adopt a self-supervised paradigm for training DAFI, described in Algorithm 1. First, we degrade one SAS snippet S1 using the phase gradient Φ2 estimated from another snippet S2, according to

 

 

where F is the Fourier transform in the range dimension and G is the inverse of F. This process is very similar to PGA, except we are trying to defocus a snippet rather than focus it.
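The NumPy sketch below illustrates this style of degradation under stated assumptions: the phase term derived from the second snippet is applied multiplicatively to the range-dimension spectrum of the first (range taken here to be axis 0), and whether the estimated gradient is integrated into a phase beforehand is left outside the sketch.

```python
import numpy as np


def defocus_snippet(s1: np.ndarray, phase_from_s2: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Degrade snippet s1 with a phase derived from another snippet.

    Assumptions: axis 0 is the range dimension and the phase is applied as a
    multiplicative factor in the range spectrum; `scale` corresponds to the
    epoch-dependent factor discussed later in Section 2.2.
    """
    spectrum = np.fft.fft(s1, axis=0)                         # F: transform along range
    degraded = spectrum * np.exp(1j * scale * phase_from_s2)  # inject the foreign phase
    return np.fft.ifft(degraded, axis=0)                      # G: inverse transform
```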

 

Then, we present both S1 and S'1 to the model in a random order to simulate two successive iterations of an autofocus algorithm. The sequence (S'1, S1) corresponds to an iteration of autofocus that improves the quality of a snippet, as the defocused snippet is presented first. Conversely, the sequence (S1, S'1) corresponds to the case we want to detect in real use, in which autofocus has lowered the quality of a snippet.
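A minimal sketch of this pairing step is given below; the 0/1 label convention (1 meaning the second snippet in the pair is the better-focused one) is our assumption for illustration.

```python
import random


def make_pair(s1, s1_defocused):
    """Randomly order the clean and defocused versions of a snippet."""
    if random.random() < 0.5:
        # Defocused first: simulates an autofocus iteration that improved the image.
        return (s1_defocused, s1), 1
    # Clean first: simulates an autofocus iteration that degraded the image.
    return (s1, s1_defocused), 0
```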

 

To challenge the model, the phase gradient that is used to defocus each image is multiplied by a scalar function of epoch t, given by

 


At the first epoch, we have λ(0) = 1 and are thus using the phase gradient of S2 directly. As t → ∞, λ(t) → 0.1, so that S1 and S'1 always remain different enough that asking the model to distinguish between them is reasonable. To illustrate this behavior, λ is plotted as a function of epoch t in Figure 3.
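Since Equation 2 is not reproduced here, the snippet below shows one plausible schedule satisfying the two stated properties, λ(0) = 1 and λ(t) → 0.1 as t → ∞; the exponential form and the time constant are assumptions, not the paper's exact function.

```python
import numpy as np


def phase_scale(epoch: int, tau: float = 10.0) -> float:
    """Illustrative decay from 1.0 toward 0.1; tau controls how fast it decays."""
    return 0.1 + 0.9 * np.exp(-epoch / tau)
```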

 

A caveat of this degradation process is that we are not guaranteed to defocus a snippet when applying correction with respect to an estimated phase gradient from another image. If the two sampled images have very similar phase gradients, it is possible that this process will actually focus the original snippet, though we have not observed this to ever happen in practice.

 

The reason we use phase gradients estimated from other snippets, rather than generating them according to some model, is to ensure that the defocused images S'1 are plausible realizations of real data. Physical factors that induce quadratic phase error include uncompensated platform motion and errors in underwater sound speed estimates, which are easier to estimate from other real data than to generate stochastically.

 

 

Figure 3: Phase gradient scale λ by epoch, given in Equation 2. At epoch zero, the scale is λ = 1.0, and it approaches λ = 0.1 as the epoch approaches ∞.

 

 

3 EXPERIMENTS

 

We make two arguments in this section. First, we argue that the proposed DAFI architecture can successfully learn to discriminate between unfocused and focused SAS snippets. Second, we argue that self-supervised pretraining on synthetically defocused SAS snippets leads to better model generalization when finetuned(†2) on a small real dataset than training directly on the real dataset.

 

3.1 Datasets

 

The dataset used in all pretraining experiments presented in this manuscript is composed of a large corpus of SAS snippets extracted around objects of interest detected by an ATR algorithm, as we are most interested in focusing images that contain targets. Following common machine learning practice, this dataset is partitioned into training and validation sets. The full images from which these snippets are extracted were collected in seven geographic regions, five of which are used during training and the other two only for evaluation. We refer to this dataset as the “ATR dataset.”

 

A small test dataset, composed of 577 real unfocused/focused PGA pairs manually sourced from geographic regions not seen during training, was constructed to verify that model performance is as expected on new imagery. These snippets were carefully filtered from 1000 candidates to ensure that the focused snippet is of indisputably higher quality than the unfocused snippet. Although small by modern machine learning standards, we believe that in this domain quality is more valuable than quantity when evaluating models. We refer to this dataset as the “PGA dataset.”

 

To combat overfitting and evaluate robustness to particular train/test splits, we use four-fold cross validation on the PGA dataset. In this scheme, the dataset is divided into four equally-sized parts with each individual part used for testing a model trained on the other three. For each fold, 25% of the training set, which comprises 75% of the entire dataset, is used for validation. Thus, four different evaluation datasets are derived from the PGA dataset.
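The split described above can be sketched with scikit-learn as follows; the random seeds and the index array standing in for the 577 labelled pairs are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

pga_pair_indices = np.arange(577)  # placeholder for the labelled PGA pairs
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

for fold, (trainval_idx, test_idx) in enumerate(kfold.split(pga_pair_indices)):
    # 25% of the 75% training portion is held out for validation.
    train_idx, val_idx = train_test_split(trainval_idx, test_size=0.25, random_state=0)
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} val, {len(test_idx)} test")
```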

 

A concise breakdown of the datasets used in training, validation, and testing, along with some key network hyperparameters, is provided in Table 1.

 

 

Table 1: Size, geographic regions, and network hyperparameters for the datasets used to train and evaluate DAFI.

 

3.2 Training

 

First, we pretrain DAFI on the ATR dataset using self-supervised learning for 50 epochs. As the full training set contains around 400,000 images, we define an epoch as 100,000 randomly chosen samples from the full set. This was done to reduce the overall time between epochs, as we found experimentally that it was unnecessary to iterate over the full training set each epoch. Over the full training session, each image in the full training set is shown to the network 12.5 times in expectation.
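One way to implement this subsampled epoch in PyTorch is with a RandomSampler capped at 100,000 draws, as sketched below; the dataset object and batch size are placeholders.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset standing in for the ~400,000-snippet ATR training set.
atr_train_dataset = TensorDataset(torch.zeros(400_000, 1))

# Each "epoch" draws 100,000 random samples rather than iterating the full set.
sampler = RandomSampler(atr_train_dataset, replacement=True, num_samples=100_000)
loader = DataLoader(atr_train_dataset, batch_size=64, sampler=sampler)

# Over 50 such epochs, each image is drawn 50 * 100,000 / 400,000 = 12.5 times
# in expectation, matching the figure quoted above.
```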

 

After pretraining, we use transfer learning to finetune DAFI on each of the four folds of the PGA dataset, producing four DAFI models from the same pretrained weights, which serve as the initialization for these experiments. Furthermore, we performed a control experiment on each fold of the PGA dataset that is identical to the corresponding finetuning experiment except that: (1) the initial model weights are random instead of pretrained, and (2) the learning rate starts at 10⁻³ and decreases with cosine annealing¹⁰, as the fixed learning rate used for finetuning was not suitable for training from random weights on such a small dataset.
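The two initialization schemes can be sketched as follows, reusing the PairedSnippetComparator class from Section 2.1; the optimizer, the fixed finetuning rate, and the 50-epoch horizon passed to the cosine scheduler are illustrative assumptions.

```python
import torch


def build_run(pretrained_state=None):
    """Return (model, optimizer, scheduler) for a finetuning or control run."""
    model = PairedSnippetComparator()
    if pretrained_state is not None:
        # Transfer-learning run: start from the self-supervised pretrained weights.
        model.load_state_dict(pretrained_state)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed rate (illustrative)
        scheduler = None
    else:
        # Control run: random initialization, 1e-3 start with cosine annealing.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    return model, optimizer, scheduler
```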

 

All models were trained for 50 epochs over their respective datasets, and the final set of weights chosen for each model was that which yielded the lowest validation loss.

 

3.3 Results

 

Each model is evaluated on the test set associated with its fold of the PGA dataset, and receiver operating characteristic (ROC) curves are created. In general, when comparing the ROC curves of two models, the one with the higher true positive rate (TPR, y-axis) at a fixed false positive rate (FPR, x-axis) is preferable at that FPR. ROC curves for each of the four DAFI models trained on the cross-validated PGA dataset are shown in Figure 4, with the full ROC curves in Figure 4a and zoomed-in curves in Figure 4b. First, we note that all models are substantially better than random (black dotted line). In almost all cases, especially at low FPR, we observe the self-supervised models (solid lines) to outperform the fully-supervised models (dotted lines), attesting to the value of the proposed self-supervised pretraining.

 

For a single performance number across all FPRs, the area under the ROC curve (AUC) is used, which can be interpreted as a measure of overall model quality. The AUC of each ROC curve in Figure 4 is reported in Table 2. On all folds except Fold 2, for which the AUCs were nearly equal, the pretrained weights led to substantially higher AUC than random weights.
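The evaluation itself reduces to standard ROC/AUC computation, for example with scikit-learn as sketched below; the labels and scores shown are placeholder values standing in for the 0/1 ground truth of each test pair and the model's confidence that the second snippet is the better-focused one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder inputs for illustration only.
labels = np.array([1, 0, 1, 1, 0])
scores = np.array([0.9, 0.2, 0.7, 0.6, 0.4])

fpr, tpr, thresholds = roc_curve(labels, scores)
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```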

 

 

Figure 4: ROC curves comparing the performance of self-supervised models (solid lines) and corresponding supervised models (dotted lines), with color indicating the fold in a four-fold cross validation scheme.


 

Table 2: Area under the ROC curve for each model in Figure 4. The best model for each fold is bolded.

 

4 CONCLUSION

 

We have proposed DAFI, a deep learning-based autofocus improvement metric that provides a confidence as to which of two SAS snippet realizations is more focused. DAFI leverages self-supervised learning, meaning that far fewer ground truth labels are required to train it than for fully-supervised models. This trait is especially advantageous given the subjective and time-consuming nature of asking an operator to label thousands of image pairs. Moreover, the DAFI model could be retrained using imagery from different sonar sensors, increasing the longevity of this approach.

 

We evaluated DAFI on a gold-standard dataset not used for training and determined that the confidences produced by this model consistently agree with domain experts in deciding which of two images is better focused. The self-supervised DAFI models averaged above 0.97 AUC in a four-fold cross validation experiment on this dataset, a substantial improvement over the tested fully-supervised models which averaged around 0.91 AUC. Thus, we believe the claim that DAFI holds merit for guiding autofocus algorithms is well-founded.

 

As future work, we wish to further reduce the model size of DAFI, thus decreasing its size, weight, and power (SWaP) footprint and increasing its suitability for deployment to embedded autonomous underwater vehicle (AUV) platforms. Ideally, DAFI could be integrated directly into the autofocus process in order to transparently collect higher quality images on future missions.

 

ACKNOWLEDGMENT

 

The authors would like to express gratitude to Dr. Tory Cobb at the Office of Naval Research and Dr. Darshan Bryner at the Naval Surface Warfare Center Panama City Division for guidance with development of this project and preparation of this report.

 

5 REFERENCES

 

  1. W. D. Brown and D. C. Ghiglia, “Some methods for reducing propagation-induced phase errors in coherent imaging systems. I. Formalism,” JOSA A, vol. 5, no. 6, pp. 924–941, 1988.
  2. D. C. Ghiglia and W. D. Brown, “Some methods for reducing propagation-induced phase errors in coherent imaging systems. II. Numerical results,” JOSA A, vol. 5, no. 6, pp. 942–957, 1988.
  3. P. H. Eichel, D. C. Ghiglia, and C. Jakowatz, “Speckle processing method for synthetic aperture radar phase correction,” Optics Letters, vol. 14, no. 1, pp. 1–3, 1989.
  4. D. E. Wahl, P. Eichel, D. Ghiglia, and C. Jakowatz, “Phase gradient autofocus – a robust tool for high resolution SAR phase correction,” IEEE Transactions on Aerospace and Electronic Systems, vol. 30, no. 3, pp. 827–835, 1994.
  5. H. J. Callow, M. P. Hayes, and P. T. Gough, “Stripmap phase gradient autofocus,” in Oceans 2003. Celebrating the Past... Teaming Toward the Future (IEEE Cat. No. 03CH37492), vol. 5. IEEE, 2003, pp. 2414–2421.
  6. W. Pu, “SAE-Net: A deep neural network for SAR autofocus,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
  7. Z. Liu, S. Yang, Q. Gao, Z. Feng, M. Wang, and L. Jiao, “AFnet and PAFnet: Fast and accurate SAR autofocus based on deep learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  8. I. D. Gerg and V. Monga, “Real-time, deep synthetic aperture sonar (SAS) autofocus,” in 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS. IEEE, 2021, pp. 8684–8687.
  9. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  10. I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.

 


(†1) We use “unfocused” to refer to images that have not been focused and “defocused” to refer to images whose focus has been degraded.

(†2) By “finetuned”, we mean initializing a model with weights trained on another dataset instead of weights chosen at random.