How to Synthesize Large and High-Resolution SAR Images with Generative AI?
This article is an excerpt from Solène Debuysere’s thesis, presented at the RADAR 2024 Conference, organized by the SEE and supported by the IEEE Aerospace & Electronic Systems Society, where we were awarded the Best Poster Prize for our paper “Synthesizing SAR Images with Generative AI: Expanding to Large-Scale Imagery”.
Well-known generative AI models such as Stable Diffusion are mostly pre-trained on 512x512 images. However, synthetic aperture radar (SAR) images are typically much larger and contain high-resolution details essential for accurate analysis. How can we generate such large images while still taking advantage of Stable Diffusion’s potential, without starting from scratch?
We proposed and compared three different approaches, based on approximately 5,000 labeled radar thumbnails extracted from ONERA airborne images at resolutions of 160 cm, 80 cm, and 40 cm per pixel. These thumbnails were captioned with the BLIP vision-language model to create SAR image-text pairs.
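As an illustration of this captioning step, the sketch below shows how BLIP can be used to produce text descriptions for a folder of SAR thumbnails. The checkpoint name, folder layout, and caption length are assumptions for the example; the thesis pipeline may differ.

```python
# Minimal sketch: caption SAR thumbnails with BLIP to build image-text pairs.
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_sar_thumbnail(path: str) -> str:
    """Return a short text description for one SAR thumbnail."""
    image = Image.open(path).convert("RGB")  # BLIP expects 3-channel input
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Build (filename, caption) pairs for later fine-tuning.
pairs = [(p.name, caption_sar_thumbnail(str(p))) for p in Path("sar_thumbnails").glob("*.png")]
```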
First Approach: A Cascade Latent Diffusion Architecture
Our model uses a cascade latent diffusion process to produce high-resolution images from text. It generates an image and progressively increases its resolution through three levels of detail: low (160 cm per pixel, 512x512), intermediate (80 cm, 1024x1024), and high (40 cm, 2048x2048). A Stable Diffusion model is fine-tuned at each stage with SAR images and their corresponding captions, ensuring a detailed representation of SAR imagery across resolutions.
During inference, we generate SAR images through this cascade. The process starts by creating a latent image from the textual prompt with the Stable Diffusion model fine-tuned on SAR images at 160 cm resolution (SD160). This latent is then upscaled and decoded into a 1024x1024 image by the VAE decoder. To ensure text fidelity and contextual accuracy, we use ControlNet, a conditioning model that aligns image features with those described in the prompt. Next, the model fine-tuned at 80 cm resolution (SD80) further refines the image, raising the resolution to 2048x2048 and adding finer details. Finally, the model fine-tuned at 40 cm resolution (SD40), again guided by ControlNet, produces the highest-quality, highest-resolution result. This multi-resolution approach yields images that faithfully reflect the textual descriptions, with ControlNet ensuring consistency across scales.
Additionally, the same cascade can be used to efficiently enhance the resolution of existing images.
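The sketch below outlines what such a cascade inference could look like with the diffusers library. It is illustrative only: the checkpoint paths (sd160, sd80, sd40, controlnet-sar) are placeholders for the fine-tuned models, and the exact conditioning images, schedulers, and denoising strengths are not specified in the article.

```python
# Hedged sketch of the three-stage cascade inference (placeholders for fine-tuned models).
import torch
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionControlNetImg2ImgPipeline,
    ControlNetModel,
)

prompt = "SAR image of an industrial area with buildings and roads"

# Stage 1: the 160 cm model generates a 512x512 base image from text.
sd160 = StableDiffusionPipeline.from_pretrained("path/to/sd160", torch_dtype=torch.float16).to("cuda")
base = sd160(prompt, height=512, width=512).images[0]

# Stage 2: the 80 cm model, guided by ControlNet, refines an upscaled 1024x1024 version.
controlnet = ControlNetModel.from_pretrained("path/to/controlnet-sar", torch_dtype=torch.float16)
sd80 = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "path/to/sd80", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
up1024 = base.resize((1024, 1024))
mid = sd80(prompt, image=up1024, control_image=up1024, strength=0.5).images[0]

# Stage 3: the 40 cm model, again with ControlNet, produces the final 2048x2048 image.
sd40 = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "path/to/sd40", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
up2048 = mid.resize((2048, 2048))
final = sd40(prompt, image=up2048, control_image=up2048, strength=0.5).images[0]
```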
Stable Diffusion XL (SDXL)
The Stable Diffusion XL (SDXL) model features a UNet architecture three times larger than that of Stable Diffusion 1.5, primarily due to additional attention blocks and an expanded cross-attention context. This expansion is supported by a second text encoder, which enhances SDXL’s ability to generate larger, more detailed, and contextually accurate images from input text. Due to hardware limitations, the model was not fully fine-tuned; instead, low-rank adaptation (LoRA) was used to achieve some degree of adaptability, as sketched below.
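The following sketch shows one way to attach LoRA adapters to the SDXL UNet, following the pattern of the diffusers LoRA training scripts. The rank, target modules, and checkpoint are assumptions, and the training loop itself is omitted; the article does not give the exact configuration used.

```python
# Hedged sketch: inject trainable low-rank adapters into the SDXL UNet attention layers.
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

# Freeze the base UNet; only the LoRA matrices will be trained.
pipe.unet.requires_grad_(False)
lora_config = LoraConfig(
    r=16,                                                  # low-rank dimension (assumed)
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],   # attention projections
)
pipe.unet.add_adapter(lora_config)

trainable = sum(p.numel() for p in pipe.unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in pipe.unet.parameters())
print(f"trainable parameters: {trainable} / {total}")      # only the LoRA weights are updated
```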
While the results demonstrate the model’s capability to generate large, realistic images directly, the texture (specifically, the granular speckle typical of SAR images) appears less faithful to the training data. Further work would involve fine-tuning the full model to assess whether these limitations stem from the constraints of the LoRA method, the base model itself, or the training dataset.
Outpainting
Outpainting, a variant of inpainting, modifies a partially noised latent image so that it blends seamlessly with the surrounding content according to a prompt, with the goal of expanding an image beyond its original boundaries. The task is more complex than inpainting because of its extrapolative nature: content must be generated outside the existing image rather than within it. The input image is placed on a larger, randomly noised canvas whose masked region defines the area to extend.
In our example, an initial 1024x1024-pixel image is expanded horizontally to 1024x2048 pixels using the conditional prompt “city, buildings, roads.” The results confirm the effectiveness of the method: the image is extended without visible artifacts at the junction between the original content and its extension. However, the overall consistency of the final image does not match that achieved by the other methods, and developing a specialized inpainting model would be necessary to advance this concept.
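A minimal sketch of this horizontal extension with a diffusers inpainting pipeline is shown below. The generic inpainting checkpoint and file names are stand-ins for illustration; the article relies on its own SAR fine-tuned models.

```python
# Hedged sketch: extend a 1024x1024 image to 1024x2048 by outpainting the right half.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

original = Image.open("sar_1024.png").convert("RGB")        # 1024x1024 source image

# Paste the original onto a 2048-wide canvas; the right half is the region to synthesize.
canvas = Image.new("RGB", (2048, 1024))
canvas.paste(original, (0, 0))
mask = Image.new("L", (2048, 1024), 0)                      # black = keep as-is
mask.paste(Image.new("L", (1024, 1024), 255), (1024, 0))    # white = generate

result = pipe(
    prompt="city, buildings, roads",
    image=canvas,
    mask_image=mask,
    height=1024,
    width=2048,
).images[0]
result.save("sar_1024x2048_outpainted.png")
```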
Future work
This study relied on a relatively small dataset, which is insufficient for the scale of imagery we target; far more data is needed. Since this study, we have been working to expand the dataset to 100,000 images.