Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

Anonymous authors

Micro-Object · Sim-to-Real · Physics-Informed Machine Learning

Conceptual illustration placeholder
Fig. 1. Concept overview of the physics-informed machine learning network for efficient sim-to-real microscopy data generation.

Abstract

Accurate pose estimation is crucial for controlling optical microrobots in high-precision tasks such as object tracking and autonomous biological studies. However, collecting and labeling large-scale microscopy datasets is costly and time-consuming.

To address this, we introduce a physics-informed deep generative framework that leverages wave optics and depth-aware rendering within a GAN-based architecture. Unlike traditional AI-only approaches, our method realistically simulates complex microscopy effects, such as diffraction and depth-dependent imaging, enabling the efficient generation of synthetic training data.

Our approach achieves:

This framework not only enhances data efficiency but also generalizes to unseen microrobot poses, enabling scalable and robust training without additional real-world data.

Method

Framework overview placeholder
Fig. 2. Workflow of the visualisation rendering system: a virtual optical microscope system is constructed in Isaac Sim based on real-time optical path parameters and predicted robot poses. Using the initial CAD images and depth maps captured by a virtual camera, high-fidelity simulated images are generated by the wave-optics-based visualisation rendering module. The reality gap of the rendered images is further reduced by a PixelGAN-based sim-to-real module.

Pipeline

The pipeline starts by building a virtual optical system using NVIDIA’s Isaac Sim, incorporating real-time optical parameters and simulated microrobot poses. A virtual camera captures initial CAD images and depth maps aligned with real experimental setups. Foreground segmentation via k-means clustering enables tight cropping around the microrobot, reducing computation.
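This section does not specify the clustering settings; a minimal sketch of the k-means foreground cropping step, assuming a 2-D grayscale frame, two intensity clusters, and a hypothetical `margin` padding parameter, could look as follows.

```python
import numpy as np
from sklearn.cluster import KMeans

def crop_foreground(frame: np.ndarray, margin: int = 8) -> np.ndarray:
    """Cluster pixel intensities into two groups and crop tightly around the
    cluster assumed to contain the microrobot (illustrative sketch only)."""
    h, w = frame.shape
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        frame.reshape(-1, 1).astype(np.float32)
    ).reshape(h, w)
    # Assumption: the cluster with the higher mean intensity is the foreground.
    fg = int(frame[labels == 1].mean() > frame[labels == 0].mean())
    ys, xs = np.nonzero(labels == fg)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    return frame[y0:y1, x0:x1]
```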

Next, a wave optics-based image formation process is used:

Finally, a PixelGAN-based sim-to-real module refines the rendered images, reducing the visual gap between simulation and real microscope data.
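The exact wave-optics formulation is not reproduced in this section; the sketch below only illustrates the general idea of depth-dependent image formation by convolving a sharp CAD view with a defocus point spread function derived from a circular pupil. The numerical aperture, wavelength, pixel pitch, and quadratic defocus phase are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def defocus_psf(size: int, na: float, wavelength: float,
                defocus: float, pixel_pitch: float) -> np.ndarray:
    """Illustrative Fourier-optics PSF: circular pupil with a quadratic
    defocus phase, transformed back to the image plane."""
    fx = np.fft.fftfreq(size, d=pixel_pitch)
    FX, FY = np.meshgrid(fx, fx)
    rho2 = FX ** 2 + FY ** 2
    cutoff = na / wavelength                       # coherent cutoff frequency
    pupil = (rho2 <= cutoff ** 2).astype(np.complex128)
    pupil *= np.exp(1j * np.pi * wavelength * defocus * rho2)
    psf = np.abs(np.fft.ifft2(pupil)) ** 2
    return psf / psf.sum()

def render_at_depth(cad_view: np.ndarray, depth_offset: float) -> np.ndarray:
    """Blur a square, sharp CAD view according to its offset from focus."""
    psf = defocus_psf(size=cad_view.shape[0], na=0.3, wavelength=0.55e-6,
                      defocus=depth_offset, pixel_pitch=0.1e-6)
    return np.real(np.fft.ifft2(np.fft.fft2(cad_view) * np.fft.fft2(psf)))
```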

Experiments

Datasets and Implementation Details:

Conceptual illustration placeholder
Fig. 3. Alignment of rendering and experiment image features based on Laplacian of Gaussian (LoG) analysis. LoG values and normalised depth are extracted for each dataset (left). Peak LoG frames (corresponding to the focal plane) are used to segment the datasets. To enable one-to-one pairing, data within each segment is balanced (middle), facilitating aligned image pairs for downstream training (right).

The aligned data used for PixelGAN training consist of 15,820 image pairs, each comprising one physically rendered image and one experimental image, covering 35 optical microrobot pose classes. The resulting pairs are shown in the right part of Fig. 3; each pair contains a rendered image and an experimental image at the same depth. Of these, 70% were allocated to the training set, 15% to the validation set, and 15% to the test set. The model was trained for 100 epochs.
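The LoG-based alignment in Fig. 3 reduces to a simple focus measure: the frame whose LoG response peaks is taken as the focal plane and used to segment each depth sequence before pairing. The sketch below is a minimal illustration, assuming a depth-ordered stack of grayscale frames and an arbitrary `sigma`.

```python
import numpy as np
from scipy import ndimage

def log_sharpness(frame: np.ndarray, sigma: float = 2.0) -> float:
    """Laplacian-of-Gaussian focus measure; larger values indicate that the
    frame lies closer to the focal plane."""
    response = ndimage.gaussian_laplace(frame.astype(np.float32), sigma=sigma)
    return float(np.var(response))

def focal_frame_index(stack: np.ndarray) -> int:
    """Index of the peak-LoG frame in a depth-ordered stack, used to split the
    sequence into segments above and below focus before pairing."""
    return int(np.argmax([log_sharpness(f) for f in stack]))
```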

The code was implemented in PyTorch 1.8.1 and Python 3.8 and run on a single NVIDIA A100 GPU with 80 GB of memory. The CUDA version was 11.4, and inference was performed in float32 precision.

Code

The code will be released after the paper is accepted.

Results

Image Generation Results

Conceptual illustration placeholder
Table 1: Performance comparison of different models. The best results in each column are depicted in boldface.
Conceptual illustration placeholder
Fig. 4: Qualitative evaluation of image generation methods across varying poses and depths, demonstrating the visual fidelity of simulated microscope images compared to real experimental images.
Conceptual illustration placeholder
Fig. 5: Heatmaps of SSIM, PSNR, and MSE across robot poses and depths. The X-axis shows robot posture angles, and the Y-axis indicates height offsets from the focal plane. Each cell reflects performance for a specific pose–depth combination.

We evaluated three methods for generating microrobot images:

  1. GAN-only (using CAD images)
  2. Physics-based rendering
  3. Hybrid: Rendering + GAN

Our hybrid approach, which combines physics-based rendering with a GAN, achieved the best image quality, improving SSIM by 35.6% over the GAN-only method while maintaining fast rendering speeds (0.022 s per image), as shown in Table 1.

Example outputs in Fig. 4 and pose-wise performance heatmaps in Fig. 5 show that the hybrid method consistently outperforms the other approaches across all 35 pose classes, producing the most realistic and accurate synthetic images.
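For reference, the per-pair metrics reported in Table 1 and Fig. 5 can be computed with standard implementations; the sketch below assumes single-channel images normalised to [0, 1].

```python
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def image_quality(generated: np.ndarray, real: np.ndarray) -> dict:
    """SSIM, PSNR, and MSE for one generated/experimental image pair."""
    return {
        "ssim": structural_similarity(real, generated, data_range=1.0),
        "psnr": peak_signal_noise_ratio(real, generated, data_range=1.0),
        "mse": mean_squared_error(real, generated),
    }
```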

Pose Estimation Results

Conceptual illustration placeholder
Table 2: Pose estimation results using models trained on experimental (Exp) and generated (Gen) images.

To assess the value of our synthetic data, we tested pose estimation (pitch and roll) using three model backbones:

All models were trained for 30 epochs and evaluated on 350 real test images covering all 35 poses; these images were strictly excluded from training to avoid data leakage. Results are reported in Table 2.
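The backbone architectures themselves are compared in Table 2; as a sketch of the evaluation protocol only, the snippet below scores a trained pose estimator on the 350 held-out real images, assuming pose estimation is framed as classification over the 35 pitch/roll classes (loader and model definitions omitted).

```python
import torch

@torch.no_grad()
def evaluate_pose_model(model, test_loader, device="cuda"):
    """Classification accuracy on the held-out real test set (illustrative)."""
    model.eval()
    correct, total = 0, 0
    for images, pose_labels in test_loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == pose_labels).sum().item()
        total += pose_labels.numel()
    return correct / total
```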

Findings:

These results demonstrate that our synthetic dataset closely matches real data quality, enabling robust microrobot pose estimation with minimal accuracy loss.

Hybrid Dataset Evaluation

Conceptual illustration placeholder
Table 3. Pose estimation results using CNN trained on hybrid experimental (Exp) and generated (Gen) images.

We tested how mixing real experimental images with our generated images impacts pose estimation (CNN backbone). Training datasets were built with varying ratios of real to generated data (100% Exp, 75/25, 50/50, 25/75, 100% Gen), and the resulting models were evaluated on the same 350 real test images (excluded from training to prevent leakage).
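A minimal sketch of how such ratio-controlled training sets could be assembled is shown below; keeping the total set size fixed and sampling uniformly at random are assumptions for illustration, not necessarily the exact procedure used here.

```python
import random

def mix_datasets(exp_images, gen_images, exp_fraction: float, seed: int = 0):
    """Build a training set with a given experimental/generated ratio while
    keeping the total number of images fixed (illustrative sketch)."""
    rng = random.Random(seed)
    total = len(exp_images)
    n_exp = int(round(exp_fraction * total))
    mixed = rng.sample(exp_images, n_exp) + rng.sample(gen_images, total - n_exp)
    rng.shuffle(mixed)
    return mixed

# e.g. the 75% Exp / 25% Gen configuration from Table 3:
# train_set = mix_datasets(exp_images, gen_images, exp_fraction=0.75)
```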

Results:

These results show that moderate integration of generated images preserves performance, confirming the effectiveness of our synthetic data for training pose estimators.

Generalisability to Unseen Poses

Conceptual illustration placeholder
Table 4. Pose estimation performance on generated images from PixelGAN models trained with different pose sets. Results are averaged over three runs of the CNN-based pose estimation model. PixelGAN-35: trained on all 35 poses; PixelGAN-30: trained on Set B only (30 poses, excluding Set A: P0_R20, P10_R30, P20_R40, P30_R50, and P40_R60). Both models generated images for all 35 poses, which were then used to train the pose estimation model.

To test whether our model can handle unseen microrobot poses, we split the dataset into:

We trained two models:

  1. PixelGAN-35 (trained on all poses: Set A + B)
  2. PixelGAN-30 (trained only on Set B, with Set A unseen)
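The pose-level split used for this experiment follows directly from the identifiers listed in Table 4; the sample structure below (a pose identifier paired with an image) is an assumption for illustration.

```python
# Held-out poses (Set A) from Table 4; Set B is every other pose.
SET_A = {"P0_R20", "P10_R30", "P20_R40", "P30_R50", "P40_R60"}

def split_by_pose(samples):
    """Split (pose_id, image) samples into the unseen Set A and training Set B."""
    set_a = [s for s in samples if s[0] in SET_A]
    set_b = [s for s in samples if s[0] not in SET_A]
    return set_a, set_b
```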

Results:

These findings demonstrate that our model generalises well to unseen, hard-to-simulate poses, maintaining robust performance even without direct training examples.