Part 3 — coregistration
In a previous post, I explained what coregistration in remote sensing is, the differences with georeferencing and orthorectification, and the various steps to coregister two images.
Here, I briefly outline how deep learning could be integrated into the coregistration step. I suggest two main approaches: estimating optical flow, and identifying points of interest to match between two images.
Optical Flow estimation
In the context of remote sensing, optical flow refers to the apparent displacement of objects between two images taken at different times. Deep learning algorithms can be trained to estimate this flow, which makes it possible to quantify changes in the observed scenes. This approach is particularly useful for monitoring dynamic phenomena such as glacier movements, or, at very high resolution, for cases where differences in projection between the two images result in a non-rigid deformation between them.
Convolutional Neural Networks (CNNs) are often used in this context. In his doctoral thesis (https://www.theses.fr/253581230), Pierre Godet considered two types of existing architectures: FlowNet and PWCNet.
These architectures were trained on synthetic deformation data between radar and optical images. The idea is to start with a pair of images that are supposed to be correctly coregistered, then apply a geometric transformation to one of them (for example, related to relief), and to train a supervised network to learn this deformation.
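To make the idea concrete, here is a minimal sketch of how one such training sample could be produced, using NumPy only. It is not the procedure used in the thesis: the sinusoidal displacement field, the function names (`smooth_random_flow`, `warp_bilinear`, `make_training_sample`) and the parameter values are all assumptions chosen for illustration. A real pipeline would use a physically motivated deformation (relief, for example) and real image pairs.

```python
import numpy as np


def smooth_random_flow(shape, max_shift=3.0, n_waves=3, seed=0):
    """Build a smooth dense displacement field (dy, dx), in pixels.

    Hypothetical stand-in for a realistic deformation: a sum of a few
    low-frequency sinusoids, rescaled so that no shift exceeds max_shift.
    """
    rng = np.random.default_rng(seed)
    h, w = shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    flow = np.zeros((2, h, w))
    for c in range(2):
        for _ in range(n_waves):
            fy, fx = rng.uniform(0.5, 2.0, size=2)  # cycles across the image
            phase = rng.uniform(0.0, 2.0 * np.pi)
            flow[c] += np.sin(2.0 * np.pi * (fy * yy / h + fx * xx / w) + phase)
        flow[c] *= max_shift / max(np.abs(flow[c]).max(), 1e-12)
    return flow


def warp_bilinear(image, flow):
    """Resample a 2D image at (y + dy, x + dx) with bilinear interpolation."""
    h, w = image.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(yy + flow[0], 0, h - 1)  # clamp sampling positions to the image
    x = np.clip(xx + flow[1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = y - y0
    wx = x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def make_training_sample(reference, seed=0):
    """Return (reference, deformed image, ground-truth flow) for supervision."""
    flow = smooth_random_flow(reference.shape, seed=seed)
    moved = warp_bilinear(reference, flow)
    return reference, moved, flow


if __name__ == "__main__":
    image = np.random.default_rng(42).random((64, 64))
    ref, moved, flow = make_training_sample(image, seed=1)
    print(ref.shape, moved.shape, flow.shape)
```

The network would then receive `ref` and `moved` as input and be trained to predict `flow`. In the multimodal case, one of the two images would come from the other sensor.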
The different difficulties encountered are as follows:
- It is necessary to have a database of inputs that is already well aligned: it’s a bit like the snake biting its own tail!
- Most networks take a single modality as input. To register a multimodal pair, such as an optical image and a radar image, a siamese architecture must be used at the input: this is what Pierre Godet did by proposing a multimodal PWCNet.
- Synthesizing a credible and adequate deformation: for errors related to relief projections, Pierre’s results were very promising, better than those of the classic optical flow estimation method, GeFolki. But for estimating realistic glacier movement, this kind of synthetic deformation is not well suited! Laurane Charrier made some promising attempts during her PhD thesis, but did not have time to retrain the network on more realistic glacier flow data.
Charrier, L., Godet, P., Rambour, C., Weissgerber, F., Erdmann, S., & Koeniguer, E. C. (2020, September). Analysis of dense coregistration methods applied to optical and SAR time-series for ice flow estimations. In 2020 IEEE Radar Conference (RadarConf20) (pp. 1–6). IEEE.
https://hal.science/hal-03103824/document
Matching Feature Points
Another technique involves using deep learning to identify and match points of interest between two images, before finding the transformation that allows one to move from one set to another. This generally involves detecting distinctive features in each image, then finding correspondences between these features.
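The last step, finding the transformation from one set of matched points to the other, can be sketched as a least-squares fit. The example below assumes an affine model and NumPy. The choice of an affine transform and the function names `fit_affine` and `apply_affine` are illustrative assumptions; depending on the sensors and the relief, a homography or a non-rigid model may be more appropriate.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst points.

    src, dst: arrays of shape (N, 2) holding N >= 3 matched (x, y) points.
    Returns a 2x3 matrix [A | t] such that dst ~= src @ A.T + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    design = np.hstack([src, np.ones((len(src), 1))])      # (N, 3)
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)  # (3, 2)
    return params.T                                        # (2, 3)


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to an array of (x, y) points."""
    points = np.asarray(points, dtype=float)
    return points @ matrix[:, :2].T + matrix[:, 2]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points_a = rng.uniform(0, 500, size=(20, 2))
    # Hypothetical ground truth: small rotation-like term plus a shift.
    truth = np.array([[0.99, -0.05, 12.0],
                      [0.05, 0.99, -7.0]])
    points_b = apply_affine(truth, points_a) + rng.normal(0, 0.3, size=(20, 2))
    print(np.round(fit_affine(points_a, points_b), 3))
```

A plain least-squares fit like this one is sensitive to wrong matches, so in practice it is usually combined with an outlier rejection step.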
Neural networks such as Siamese networks, or networks based on transfer learning, can be trained to recognize these points of interest despite variations in lighting conditions, viewing angles, or resolution. The detection and matching of feature points between images also benefits greatly from Vision Transformers (ViTs). These recent models, which originate from the transformers used in natural language processing, bring significant improvements in image recognition and analysis.
Vision Transformers are distinguished by their ability to handle image data as a sequence of patches (small pieces of the image), similar to how transformers process sequences of words in a text. This approach allows ViTs to capture long-distance contextual relationships between different parts of an image, which is particularly useful for identifying precise and reliable points of interest for matching in remote sensing images.
In the context of multimodal image registration, ViTs are interesting for three reasons:
- Precise Identification of Feature Points: ViTs can analyze the entire image and identify points of interest with great precision, taking into account not only local features but also broader spatial contexts.
- Robustness to Variations: ViTs, with their global understanding of image data, are particularly effective in maintaining constant performance despite variations in image styles.
- Efficient Matching over Long Distances: Unlike traditional methods that focus primarily on local features, ViTs can effectively match points of interest over greater distances in the image, which is essential for registering images taken from different angles.
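To illustrate the "sequence of patches" view, and what matching without a locality constraint means, here is a toy NumPy sketch. It is not a Vision Transformer: the raw pixel patches stand in for learned embeddings, whereas a real ViT would project each patch, add positional information, and pass the sequence through attention layers. The function names `image_to_patches` and `mutual_nearest_neighbours` are assumptions made for this example.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Cut a 2D image into non-overlapping patches, flattened into a sequence.

    Returns an array of shape (n_patches, patch_size * patch_size),
    ordered row by row across the image.
    """
    h, w = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size).swapaxes(1, 2)
    return patches.reshape(gh * gw, patch_size * patch_size)


def mutual_nearest_neighbours(desc_a, desc_b):
    """Match descriptors by cosine similarity, keeping only mutual best matches.

    Every descriptor of A is compared with every descriptor of B, so a match
    can link patches that are far apart in the two images.
    Returns an array of (index_in_a, index_in_b) pairs.
    """
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + 1e-12)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + 1e-12)
    similarity = a @ b.T                 # (n_a, n_b) all-pairs similarity
    best_b = similarity.argmax(axis=1)   # best candidate in B for each A
    best_a = similarity.argmax(axis=0)   # best candidate in A for each B
    idx_a = np.arange(len(a))
    keep = best_a[best_b] == idx_a       # keep pairs that choose each other
    return np.stack([idx_a[keep], best_b[keep]], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_a = rng.normal(size=(64, 64))
    image_b = np.roll(image_a, shift=16, axis=1)  # same content, shifted
    tokens_a = image_to_patches(image_a, 16)
    tokens_b = image_to_patches(image_b, 16)
    print(mutual_nearest_neighbours(tokens_a, tokens_b)[:5])
```

With learned descriptors in place of raw pixels, the same all-pairs comparison is what lets a match survive a change of viewpoint or sensor.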
Apparently, the web is full of notebooks that make these tools accessible. For example, one of them finds quite improbable correspondences between a guy on a bench and a skier :). This goes to show that the approach is well suited to objects viewed very differently.
Without changing any of the existing settings, I get a first draft of data to match between a radar image and a LIDAR image. Well, of course, there’s still work to be done to correct the faulty correspondences… but out of context and applied from scratch, I think there should be potential for those who have the time to get stuck in!
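One common way to handle faulty correspondences is RANSAC: repeatedly fit a model on a small random subset of matches and keep the model that the largest number of matches agree with. The sketch below is a minimal NumPy version under stated assumptions: an affine model, a residual threshold of 2 pixels, and the hypothetical function names `fit_affine` and `ransac_affine`. It is not what the notebook does, only an illustration of a possible clean-up step.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix such that dst ~= src @ A.T + t."""
    design = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T


def ransac_affine(src, dst, threshold=2.0, n_iter=500, seed=0):
    """Robustly fit an affine transform to matches that contain outliers.

    src, dst: (N, 2) arrays of matched (x, y) points.
    threshold: maximum residual, in pixels, for a match to count as an inlier.
    Returns (2x3 affine matrix, boolean inlier mask of length N).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        # Three matches are the minimum needed to define an affine transform.
        sample = rng.choice(len(src), size=3, replace=False)
        model = fit_affine(src[sample], dst[sample])
        predicted = src @ model[:, :2].T + model[:, 2]
        mask = np.linalg.norm(predicted - dst, axis=1) < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    if best_mask.sum() < 3:
        raise ValueError("not enough consistent matches to fit a transform")
    # Refit on all inliers of the best candidate for a more stable estimate.
    return fit_affine(src[best_mask], dst[best_mask]), best_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.uniform(0, 500, size=(50, 2))
    truth = np.array([[1.0, 0.02, 10.0], [-0.02, 1.0, -4.0]])
    dst = src @ truth[:, :2].T + truth[:, 2]
    dst[40:] += rng.uniform(30, 80, size=(10, 2))  # simulate faulty matches
    model, inliers = ransac_affine(src, dst)
    print(inliers.sum(), "matches kept out of", len(src))
```

The threshold and the number of iterations would need tuning to the pixel size and to the proportion of wrong matches actually observed.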