BGU Campus (~0.0004 km²) • High-density regression task
Forces the model to learn location-based features (building facades, pathways) rather than memorizing specific camera angles. Same GPS → 4 different views.
Images collected across campus
Min-max normalized coordinates (see the sketch after this list)
Train / Validation / Test
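
A minimal sketch of this coordinate handling, assuming NumPy and a local equirectangular approximation for converting degree offsets to meters; the helper names and the Earth-radius constant are illustrative, not taken from the project code:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; accurate enough at campus scale

def fit_minmax(coords_deg):
    """Per-axis min/max of an (N, 2) array of [lat, lon] in degrees."""
    return coords_deg.min(axis=0), coords_deg.max(axis=0)

def normalize(coords_deg, lo, hi):
    """Map GPS coordinates into [0, 1] per axis (the regression targets)."""
    return (coords_deg - lo) / (hi - lo)

def denormalize(coords_norm, lo, hi):
    """Map network outputs back to degrees."""
    return coords_norm * (hi - lo) + lo

def error_meters(pred_deg, true_deg):
    """Approximate metric error using a local equirectangular projection."""
    lat0 = np.radians(true_deg[:, 0])
    dlat = np.radians(pred_deg[:, 0] - true_deg[:, 0]) * EARTH_RADIUS_M
    dlon = np.radians(pred_deg[:, 1] - true_deg[:, 1]) * EARTH_RADIUS_M * np.cos(lat0)
    return np.hypot(dlat, dlon)
```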
| Architecture | Params | Best Error (Best Config) |
|---|---|---|
| EfficientNet-B0 | 5.3M | 5.7m (100ep, LR=0.001) |
| ConvNeXt-Tiny | 28.6M | 9.6m (100ep, LR=0.0001) |
| ResNet18 | 11.7M | 11.3m (100ep, LR=0.0001) |
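
As a rough illustration of how a backbone like EfficientNet-B0 can be turned into a coordinate regressor, here is a minimal PyTorch/torchvision sketch; the two-output sigmoid head is an assumption, not necessarily the project's exact head:

```python
import torch
import torch.nn as nn
from torchvision import models

class CampusLocalizer(nn.Module):
    """ImageNet-pretrained backbone with a 2-output head for normalized (lat, lon)."""

    def __init__(self):
        super().__init__()
        self.backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
        in_features = self.backbone.classifier[1].in_features  # 1280 for B0
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(in_features, 2),
            nn.Sigmoid(),  # keep outputs in the [0, 1] normalized coordinate range
        )

    def forward(self, x):
        return self.backbone(x)

model = CampusLocalizer()
coords = model(torch.randn(1, 3, 224, 224))  # -> tensor of shape (1, 2)
```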
Epochs (30/50/100), Learning Rate (0.001/0.0001), Data Size (50%/100%), Test-Time Augmentation
Error increase with 50% data
For best convergence
Rotations, perspective shifts, color jitter
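
A sketch of a training-time pipeline covering those augmentations, assuming torchvision; the magnitudes are illustrative defaults rather than the tuned values:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),                       # rotations
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # perspective shifts
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),             # ImageNet statistics
])
```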
75% of predictions
95% of predictions
94.5% within 10 m • 98.2% within 20 m
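
These cumulative figures follow directly from the per-image errors; a small sketch of the computation, assuming NumPy and errors already expressed in meters:

```python
import numpy as np

def accuracy_report(errors_m):
    """Percentile and threshold statistics over per-image localization errors (meters)."""
    errors_m = np.asarray(errors_m)
    return {
        "p75_m": float(np.percentile(errors_m, 75)),      # bound covering 75% of predictions
        "p95_m": float(np.percentile(errors_m, 95)),      # bound covering 95% of predictions
        "within_10m_pct": float((errors_m <= 10).mean() * 100),
        "within_20m_pct": float((errors_m <= 20).mean() * 100),
    }
```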
Standing in one location and taking photos in all directions while the GPS reading stabilized taught the model depth and spatial relationships, not just building structures. The model learns that buildings look different from different distances, not only from different angles.
| Model | Regular Test Error | Augmented Test Error | Degradation |
|---|---|---|---|
| Final Model | 5.24m | 20.5m | 3.91× |
| EfficientNet | 5.71m | 24.34m | 4.26× |
| ConvNeXt | 9.62m | 14.35m | 1.49× |
| ResNet18 | 11.34m | 24.38m | 2.15× |
ConvNeXt, despite its higher baseline error, performed best under augmented testing and when trained on half the data. This suggests the architecture generalizes better: it learns location-invariant features rather than memorizing training viewpoints. Choose based on deployment needs: raw accuracy (Final Model) vs. robustness to unseen conditions (ConvNeXt).
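
To make the robustness comparison concrete, a sketch of evaluating a model on the clean test set versus an augmented copy and reporting the degradation factor; it assumes PyTorch DataLoaders, CPU tensors, and the `error_meters` / `denormalize` helpers sketched earlier, with loader names that are hypothetical:

```python
import numpy as np
import torch

@torch.no_grad()
def mean_error_m(model, loader, denorm, err_fn):
    """Mean localization error in meters of `model` over a test DataLoader.

    denorm: maps normalized (N, 2) predictions/targets back to degrees
    err_fn: per-sample metric error, e.g. error_meters from the sketch above
    """
    model.eval()
    errs = []
    for images, targets_norm in loader:
        preds_norm = model(images)
        errs.append(err_fn(denorm(preds_norm.numpy()), denorm(targets_norm.numpy())))
    return float(np.concatenate(errs).mean())

# Hypothetical usage: both loaders share images and labels, but the second applies
# rotation / perspective / color-jitter transforms at test time.
# denorm = lambda c: denormalize(c, lo, hi)
# regular = mean_error_m(model, clean_test_loader, denorm, error_meters)
# augmented = mean_error_m(model, augmented_test_loader, denorm, error_meters)
# print(f"degradation: {augmented / regular:.2f}x")  # e.g. 20.5 / 5.24 ≈ 3.91x
```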
SIFT (Scale-Invariant Feature Transform) extracts keypoints (corners, edges)
with 128-dim descriptors that are rotation/scale invariant. We tried matching query images
to a database using SIFT descriptors to refine CNN predictions.
Result: this added significant computational overhead (FLANN matching against a database of
3,646 images) and made training slower, but did not improve accuracy; the CNN already learns
similar geometric features implicitly. Pure deep learning proved simpler and equally
effective, so we dropped this approach.
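
For reference, a minimal sketch of the kind of SIFT + FLANN matching that was tried, assuming OpenCV (`cv2`); the ratio threshold, index parameters, and database handling are illustrative only:

```python
import cv2

sift = cv2.SIFT_create()
# FLANN with KD-trees, the usual index for SIFT's float descriptors
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})

def match_score(query_gray, db_gray, ratio=0.7):
    """Count Lowe's-ratio-filtered SIFT matches between a query and a database image."""
    _, q_desc = sift.detectAndCompute(query_gray, None)
    _, d_desc = sift.detectAndCompute(db_gray, None)
    if q_desc is None or d_desc is None:
        return 0
    matches = flann.knnMatch(q_desc, d_desc, k=2)
    return sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)

# A query image would be scored against the 3,646-image database and the CNN
# prediction nudged toward the GPS labels of the best-matching images; in the
# experiments described above this refinement did not improve accuracy.
```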