BGU Campus (~0.0004 km²) • High-density regression task
Forces the model to learn location-based features (building facades, pathways) rather than memorizing specific camera angles. Same GPS → 4 different views.
Images collected across campus
Min-max normalized coordinates (see the sketch after this list)
Train / Validation / Test
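
A minimal sketch of this coordinate handling, assuming NumPy and a local equirectangular approximation for converting degree offsets to meters; the helper names and the Earth-radius constant are illustrative, not taken from the project code:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; accurate enough at campus scale

def fit_minmax(coords_deg):
    """Per-axis min/max of an (N, 2) array of [lat, lon] in degrees."""
    return coords_deg.min(axis=0), coords_deg.max(axis=0)

def normalize(coords_deg, lo, hi):
    """Map GPS coordinates into [0, 1] per axis (the regression targets)."""
    return (coords_deg - lo) / (hi - lo)

def denormalize(coords_norm, lo, hi):
    """Map network outputs back to degrees."""
    return coords_norm * (hi - lo) + lo

def error_meters(pred_deg, true_deg):
    """Approximate metric error using a local equirectangular projection."""
    lat0 = np.radians(true_deg[:, 0])
    dlat = np.radians(pred_deg[:, 0] - true_deg[:, 0]) * EARTH_RADIUS_M
    dlon = np.radians(pred_deg[:, 1] - true_deg[:, 1]) * EARTH_RADIUS_M * np.cos(lat0)
    return np.hypot(dlat, dlon)
```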
| Architecture | Params | Best Error (Best Config) |
|---|---|---|
| EfficientNet-B0 | 5.3M | 5.7m (100ep, LR=0.001) |
| ConvNeXt-Tiny | 28.6M | 9.6m (100ep, LR=0.0001) |
| ResNet18 | 11.7M | 11.3m (100ep, LR=0.0001) |
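
As a rough illustration of how a backbone like EfficientNet-B0 can be turned into a coordinate regressor, here is a minimal PyTorch/torchvision sketch; the two-output sigmoid head is an assumption, not necessarily the project's exact head:

```python
import torch
import torch.nn as nn
from torchvision import models

class CampusLocalizer(nn.Module):
    """ImageNet-pretrained backbone with a 2-output head for normalized (lat, lon)."""

    def __init__(self):
        super().__init__()
        self.backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
        in_features = self.backbone.classifier[1].in_features  # 1280 for B0
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(in_features, 2),
            nn.Sigmoid(),  # keep outputs in the [0, 1] normalized coordinate range
        )

    def forward(self, x):
        return self.backbone(x)

model = CampusLocalizer()
coords = model(torch.randn(1, 3, 224, 224))  # -> tensor of shape (1, 2)
```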
Epochs (30/50/100), Learning Rate (0.001/0.0001), Data Size (50%/100%), Test-Time Augmentation
Error increase with 50% data
For best convergence
Rotations, perspective shifts, color jitter
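
A sketch of a training-time pipeline covering those augmentations, assuming torchvision; the magnitudes are illustrative defaults rather than the tuned values:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),                       # rotations
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # perspective shifts
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),             # ImageNet statistics
])
```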
75% of predictions
95% of predictions
94.5% within 10 m • 98.2% within 20 m
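
These cumulative figures follow directly from the per-image errors; a small sketch of the computation, assuming NumPy and errors already expressed in meters:

```python
import numpy as np

def accuracy_report(errors_m):
    """Percentile and threshold statistics over per-image localization errors (meters)."""
    errors_m = np.asarray(errors_m)
    return {
        "p75_m": float(np.percentile(errors_m, 75)),      # bound covering 75% of predictions
        "p95_m": float(np.percentile(errors_m, 95)),      # bound covering 95% of predictions
        "within_10m_pct": float((errors_m <= 10).mean() * 100),
        "within_20m_pct": float((errors_m <= 20).mean() * 100),
    }
```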
Standing in one location and taking photos in all directions while the GPS reading stabilized taught the model depth and spatial relationships, not just building structures. The model learns that buildings look different from different distances, not only from different angles.
| Model | Regular Test Error | Augmented Test Error | Degradation |
|---|---|---|---|
| Final Model | 5.24m | 20.5m | 3.91× |
| EfficientNet | 5.71m | 24.34m | 4.26× |
| ConvNeXt | 9.62m | 14.35m | 1.49× |
| ResNet18 | 11.34m | 24.38m | 2.15× |
ConvNeXt, despite its higher baseline error, performed best under augmented testing and when trained on half the data. This suggests the architecture generalizes better: it learns location-invariant features rather than memorizing training viewpoints. Choose based on deployment needs: raw accuracy (Final Model) vs. robustness to unseen conditions (ConvNeXt).
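
To make the robustness comparison concrete, a sketch of evaluating a model on the clean test set versus an augmented copy and reporting the degradation factor; it assumes PyTorch DataLoaders, CPU tensors, and the `error_meters` / `denormalize` helpers sketched earlier, with loader names that are hypothetical:

```python
import numpy as np
import torch

@torch.no_grad()
def mean_error_m(model, loader, denorm, err_fn):
    """Mean localization error in meters of `model` over a test DataLoader.

    denorm: maps normalized (N, 2) predictions/targets back to degrees
    err_fn: per-sample metric error, e.g. error_meters from the sketch above
    """
    model.eval()
    errs = []
    for images, targets_norm in loader:
        preds_norm = model(images)
        errs.append(err_fn(denorm(preds_norm.numpy()), denorm(targets_norm.numpy())))
    return float(np.concatenate(errs).mean())

# Hypothetical usage: both loaders share images and labels, but the second applies
# rotation / perspective / color-jitter transforms at test time.
# denorm = lambda c: denormalize(c, lo, hi)
# regular = mean_error_m(model, clean_test_loader, denorm, error_meters)
# augmented = mean_error_m(model, augmented_test_loader, denorm, error_meters)
# print(f"degradation: {augmented / regular:.2f}x")  # e.g. 20.5 / 5.24 ≈ 3.91x
```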
SIFT (Scale-Invariant Feature Transform) extracts keypoints (corners, edges)
with 128-dim descriptors that are rotation/scale invariant. We tried matching query images
to a database using SIFT descriptors to refine CNN predictions.
Result: this added significant computational overhead (FLANN matching against a database of
3,646 images) and made training slower, but did not improve accuracy; the CNN already learns
similar geometric features implicitly. Pure deep learning proved simpler and equally
effective, so we dropped this approach.
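
For reference, a minimal sketch of the kind of SIFT + FLANN matching that was tried, assuming OpenCV (`cv2`); the ratio threshold, index parameters, and database handling are illustrative only:

```python
import cv2

sift = cv2.SIFT_create()
# FLANN with KD-trees, the usual index for SIFT's float descriptors
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})

def match_score(query_gray, db_gray, ratio=0.7):
    """Count Lowe's-ratio-filtered SIFT matches between a query and a database image."""
    _, q_desc = sift.detectAndCompute(query_gray, None)
    _, d_desc = sift.detectAndCompute(db_gray, None)
    if q_desc is None or d_desc is None:
        return 0
    matches = flann.knnMatch(q_desc, d_desc, k=2)
    return sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)

# A query image would be scored against the 3,646-image database and the CNN
# prediction nudged toward the GPS labels of the best-matching images; in the
# experiments described above this refinement did not improve accuracy.
```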