
Campus Image-to-GPS Regression
for Localization and Navigation

Deep Learning for Visual Geo-Localization

📍 Campus Navigation Without Satellites
Liran Attar & Tom Mimran
Ben-Gurion University of the Negev
Department of Computer Science • January 2026
The Challenge
Can a neural network learn to localize from images alone?
GPS-Denied Environments: Tall buildings, indoor spaces, signal jamming
Phone GPS Limitations: ~5m accuracy on average, signal issues
Our Solution: Visual GPS using deep learning
Pipeline: 255×255 input image → CNN → (lat, lon) GPS output
🎯 Target Area

BGU Campus (~0.0004 km²) • High-density regression task

BGU Campus Map
360° Data Collection Strategy
Training the model to be viewpoint-invariant
💡 Why 360° Rotation?

Forces the model to learn location-based features (building facades, pathways) rather than memorizing specific camera angles. Same GPS → 4 different views.

Same GPS point, four views: North (0°) • East (90°) • South (180°) • West (270°)
GPS Coordinate Normalization (Min-Max Scaling)
y_norm = (y - y_min) / (y_max - y_min)
Why this works: GPS coordinates (e.g., lat=31.262, lon=34.801) are numerically very close together across the campus. Without normalization, the meaningful differences between locations (~0.00001°) would yield tiny loss values and negligible gradients, making learning nearly impossible.

Benefit: Mapping to [0,1] ensures gradients are meaningful, loss values are interpretable, and the model can focus on relative positions rather than absolute coordinate magnitudes.

Source: Standard feature scaling from GPSDataManager — computed on the training set only to prevent data leakage.
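A minimal sketch of this scaling, assuming a simple (lat, lon) array; the helper names here are illustrative, not the actual GPSDataManager API:

```python
import numpy as np

def fit_minmax(train_coords):
    """Compute per-axis min/max on the TRAINING split only (prevents leakage)."""
    c = np.asarray(train_coords, dtype=np.float64)  # shape (N, 2): [lat, lon]
    return c.min(axis=0), c.max(axis=0)

def normalize(coords, c_min, c_max):
    """Map raw (lat, lon) degrees into [0, 1]."""
    return (np.asarray(coords) - c_min) / (c_max - c_min)

def denormalize(norm, c_min, c_max):
    """Invert the scaling to recover real GPS coordinates for evaluation."""
    return np.asarray(norm) * (c_max - c_min) + c_min
```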

📸 Dataset Size

3,646

Images collected across campus

🎯 Output Range

[0,1]

Min-max normalized coordinates

📊 Data Split

70/15/15

Train / Validation / Test

Experimental Design
Controlled experiments to isolate key factors
Architecture    | Params | Best Error (Best Config)
EfficientNet-B0 | 5.3M   | 5.7m (100 ep, LR=0.001)
ConvNeXt-Tiny   | 28.6M  | 9.6m (100 ep, LR=0.0001)
ResNet18        | 11.7M  | 11.3m (100 ep, LR=0.0001)
🔬 Variables Tested

Epochs (30/50/100), Learning Rate (0.001/0.0001), Data Size (50%/100%), Test-Time Augmentation

Mean Error Matrix: Models × Experiments

📈 Data Sensitivity

+68%

Error increase with 50% data

⏱️ Optimal Epochs

100

For best convergence

πŸ† Best Under Augmented Test

ConvNeXt

Rotations, perspective shifts, color jitter

Final Model Architecture
EfficientNet-B0 backbone with custom regression head
Input (255×255×3) → EfficientNet-B0 (1280 features) → Dropout(0.3) → Linear(256) → LayerNorm → SiLU → Linear(2) → Scaled Sigmoid → [0,1] output
Scaled Sigmoid (vs Tanh)
Maps output to [0,1] via σ(x)×range + min. Tanh has stronger gradients at center but can saturate at boundaries. → We switched to ScaledSigmoid for stable boundary predictions.
LayerNorm (vs BatchNorm)
Normalizes each sample independently across features. BatchNorm uses batch statistics, which can be unstable with small batches. → Consistent behavior during training and inference with batch size 32.
Dropout 30%
Randomly zeros 30% of neuron outputs during training, preventing co-adaptation. → Critical regularization for a 3,646-image dataset to prevent overfitting.
SiLU Activation
SiLU(x) = x × σ(x). Smooth, self-gated, no "dead neuron" problem like ReLU. → Better gradient flow for regression tasks.
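A minimal PyTorch sketch of this head, assuming torchvision's efficientnet_b0 backbone (the project's actual code may differ in details):

```python
import torch
import torch.nn as nn
from torchvision import models

class ScaledSigmoid(nn.Module):
    """Map raw logits into [lo, hi]: sigmoid(x) * (hi - lo) + lo."""
    def __init__(self, lo=0.0, hi=1.0):
        super().__init__()
        self.lo, self.hi = lo, hi

    def forward(self, x):
        return torch.sigmoid(x) * (self.hi - self.lo) + self.lo

class GPSRegressor(nn.Module):
    """EfficientNet-B0 backbone plus the regression head described above."""
    def __init__(self):
        super().__init__()
        self.backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
        self.backbone.classifier = nn.Identity()  # expose the 1280-d features
        self.head = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(1280, 256),
            nn.LayerNorm(256),
            nn.SiLU(),
            nn.Linear(256, 2),        # normalized (lat, lon)
            ScaledSigmoid(0.0, 1.0),  # keep predictions inside [0, 1]
        )

    def forward(self, x):  # x: (B, 3, 255, 255)
        return self.head(self.backbone(x))
```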
Adam Optimizer
Adaptive Moment Estimation — combines momentum (moving average of gradients) with RMSprop (adaptive learning rates per parameter). Adjusts step size based on gradient history. → Faster convergence than SGD, handles sparse gradients well.
AdamW (= Adam + Decoupled Weight Decay)
Standard Adam folds L2 regularization into the gradient update, where the adaptive learning rates rescale and weaken it. AdamW decouples weight decay (λ=0.001), applying it directly to the weights. → Proper L2 regularization without dampening learning.
Haversine Loss
Calculates great-circle distance on the Earth's surface; MSE treats lat/lon as flat coordinates. → Gradients directly minimize physical distance (meters).
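A minimal differentiable sketch of such a loss, assuming both tensors hold (lat, lon) in degrees (i.e., the model's [0,1] outputs have already been denormalized):

```python
import torch

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters

def haversine_loss(pred_deg, target_deg):
    """Mean great-circle distance in meters between (lat, lon) pairs in degrees."""
    pred, target = torch.deg2rad(pred_deg), torch.deg2rad(target_deg)
    dlat = target[:, 0] - pred[:, 0]
    dlon = target[:, 1] - pred[:, 1]
    a = (torch.sin(dlat / 2) ** 2
         + torch.cos(pred[:, 0]) * torch.cos(target[:, 0]) * torch.sin(dlon / 2) ** 2)
    # Clamp for numerical stability: sqrt(0) has an infinite gradient.
    return (2 * EARTH_RADIUS_M * torch.asin(torch.sqrt(a.clamp(1e-12, 1.0)))).mean()
```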
LR Schedule: ReduceLROnPlateau
Starts at 0.0004, halves when validation loss plateaus (patience=3). → Fast initial learning, fine-grained convergence later.
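In PyTorch, this optimizer/schedule pairing looks roughly as below; model, train_one_epoch, and evaluate are hypothetical placeholders:

```python
import torch

# AdamW with decoupled weight decay, plus plateau-based LR halving
# (lr=0.0004, weight_decay=0.001, patience=3, as quoted above).
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # hypothetical training helper
    val_loss = evaluate(model)         # hypothetical validation helper
    scheduler.step(val_loss)           # halve the LR when val loss plateaus
```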
Final Results
5.24m mean error on test set
5.24m
Mean Error
4.01m
Median Error

P75

5.35m

75% of predictions

P95

10.22m

95% of predictions

✅ Accuracy Thresholds

94.5% within 10m • 98.2% within 20m
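All of these numbers follow directly from the per-image haversine errors; a small sketch of how such a summary might be computed:

```python
import numpy as np

def summarize(errors_m):
    """Summarize per-image localization errors (meters) as reported above."""
    e = np.asarray(errors_m)
    return {
        "mean": e.mean(),
        "median": np.median(e),
        "p75": np.percentile(e, 75),
        "p95": np.percentile(e, 95),
        "within_10m_%": (e <= 10).mean() * 100,
        "within_20m_%": (e <= 20).mean() * 100,
    }
```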

CDF Comparison
Geographic Error Analysis
Where does the model succeed and fail?
Error Heatmap
Green zones: Unique building facades, open pathways → <5m error
Red zones: Repetitive architecture, heavy vegetation → higher error
Prediction Arrows
💡 Why 360° Data Collection Helped

Standing at a single location and photographing in all directions while the GPS reading stabilized taught the model to understand depth and spatial relationships, not just structures. The model learns that buildings look different from different distances, not just from different angles.

Robustness Analysis
How do models handle viewpoint variations?
Model        | Regular | Augmented | Degradation
Final Model  | 5.24m   | 20.50m    | 3.91×
EfficientNet | 5.71m   | 24.34m    | 4.26×
ConvNeXt     | 9.62m   | 14.35m    | 1.49×
ResNet18     | 11.34m  | 24.38m    | 2.15×
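The Augmented column comes from re-testing on perturbed inputs. A sketch of such an evaluation transform, with illustrative magnitudes (the project's exact settings are not shown here):

```python
import torchvision.transforms as T

# Perturbations named earlier: rotations, perspective shifts, color jitter.
# Magnitudes below are assumptions for illustration only.
augmented_eval = T.Compose([
    T.RandomRotation(degrees=15),
    T.RandomPerspective(distortion_scale=0.3, p=1.0),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    T.Resize((255, 255)),
    T.ToTensor(),
])
```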
💡 Key Insight: Generalization vs Accuracy

ConvNeXt, despite its higher baseline error, performed best under augmented testing and with half the data. This indicates the architecture generalizes better — it learns location-invariant features rather than memorizing training viewpoints. Choose based on deployment: accuracy (Final Model) vs robustness to unseen conditions (ConvNeXt).

Augmentation Comparison
Future Research Directions
Extending the Visual GPS system

🌅 Temporal Robustness

Current Limitation: Single temporal snapshot (daytime, dry weather)
Challenge: Performance under different lighting (sunrise, sunset, night), weather (rain, fog), and seasons (summer vs winter vegetation)
Approach: Domain adaptation techniques — learn to map features from different conditions to the same location embedding
Data Needed: Same locations photographed across multiple times/conditions

📊 Uncertainty Quantification

Current Limitation: Model outputs a single point prediction with no confidence measure
Problem: Max error is still ~90m — how do we know when to trust the model?
Solution: Predict a distribution instead of a point: output (x, y, σ), where σ (sigma) represents uncertainty — the spread/variance of the predicted location
Benefit: High σ = "I'm not sure about this prediction" → we can filter unreliable predictions, limiting the worst-case errors
Method: Modify final layer to predict mean + variance, train with negative log-likelihood loss instead of MSE
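A sketch of that loss; PyTorch's built-in torch.nn.GaussianNLLLoss implements the same idea:

```python
import torch

def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood under a Gaussian prediction (up to a constant).

    Predicting log(sigma^2) keeps the variance positive. A large sigma
    downweights the squared error but pays a log-variance penalty, so the
    model cannot claim high uncertainty for free.
    """
    return (0.5 * (log_var + (target - mean) ** 2 / torch.exp(log_var))).mean()
```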
🔬 SIFT Hybridization — Explored & Abandoned

SIFT (Scale-Invariant Feature Transform) extracts keypoints (corners, edges) with 128-dim descriptors that are rotation/scale invariant. We tried matching query images to a database using SIFT descriptors to refine CNN predictions.

Result: This added significant computational overhead (FLANN matching on 3,646 images) and made training slower, but didn't improve accuracy — the CNN already learns similar geometric features implicitly. Pure deep learning proved simpler and equally effective, so we dropped this approach.
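For reference, the abandoned matching step looked roughly like this OpenCV sketch; query_gray and db_gray are hypothetical grayscale images:

```python
import cv2

sift = cv2.SIFT_create()  # keypoints + 128-dim rotation/scale-invariant descriptors
kp_q, des_q = sift.detectAndCompute(query_gray, None)
kp_d, des_d = sift.detectAndCompute(db_gray, None)

# FLANN KD-tree matcher, then Lowe's ratio test to keep reliable matches only.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des_q, des_d, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
```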

Key Takeaways

360° data collection creates viewpoint-invariant models with depth understanding
5.24m mean error achieved with EfficientNet-B0 (with some improvements 😉)
Architectural choices matter: LayerNorm > BatchNorm, ScaledSigmoid, AdamW, Haversine Loss
Trade-off identified: Accuracy (Final Model) vs Robustness (ConvNeXt)
Data is critical: 50% data → 68% more error
🤗 Hugging Face Demo
Thank You!
Questions?