Predictive Shipment ETA Using Machine Learning on AWS SageMaker

Client Story
Our client's customer, one of the global leaders in agricultural equipment manufacturing, struggled to predict shipment delivery times accurately across its complex supply chain network. Traditional ETA estimates were unreliable because they could not account for variables such as shipping lanes, carrier service types, geographic factors, and transit status updates. The lack of accurate ETA predictions resulted in:
- Reduced supply chain visibility and planning efficiency
- Difficulty in coordinating multi-stop international shipments (US, Europe, Asia)
- Manual processes for tracking shipments through various transit statuses
- Inability to leverage historical shipment data from 4,400+ shipments for predictive insights
- Challenges in vendor service optimization with multiple carriers
The company needed a scalable, data-driven solution to predict shipment ETAs using historical patterns, geographic data, and real-time transit information, to boost operational decision-making and customer satisfaction.
We leveraged AWS SageMaker to build machine learning models (Linear Regression and XGBoost) that predict shipment estimated time of arrival (ETA) based on historical transit data, improving supply chain visibility and operational efficiency.
The Challenge
Building accurate ETA predictions required overcoming data inconsistencies, handling multi-stop complexity, and scaling ML across international lanes. Key lessons emerged:
Technical Lessons:
- Data Quality Is Critical: Geocoding accuracy significantly affects model performance. We had to manually correct city/country mappings and filter out shipments with missing geographic coordinates, which reduced the raw dataset to ~4,400 usable shipments.
- Feature Engineering Impact: Transforming raw transit data into meaningful features (shipment lanes, distance calculations, business day indicators) was more impactful than model complexity. The 64-feature set, including categorical encodings for lanes and service codes, proved effective.
- Multi-Stop Complexity: The initial implementation excluded multi-stop pickups and deliveries, restricting the training data to single PICKEDUP → DELIVRED flows. Future iterations should incorporate multi-leg shipment patterns.
- Model Selection: Both Linear Regression and XGBoost achieved similar RMSE (~21-24 hours). Simpler linear models may be preferable for interpretability in production logistics environments.
- Endpoint Management: SageMaker endpoint costs require active management. Implementing automatic shutdown for unused endpoints prevented unnecessary charges during development.
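The feature-engineering lesson above can be illustrated with a minimal sketch. It assumes geocoded origin and destination coordinates are already available. The record schema, the function names, the lane key format, and the sample coordinates are hypothetical and are not taken from the project code. The sketch derives three of the feature types named above: a distance, a one-hot encoded lane, and a business-day indicator.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_features(shipment, known_lanes):
    """Turn one raw shipment record (hypothetical schema) into a flat feature dict."""
    lane = f"{shipment['origin_country']}-{shipment['dest_country']}"
    features = {
        "distance_km": haversine_km(*shipment["origin"], *shipment["dest"]),
        # 1.0 if the pickup happened Monday to Friday, else 0.0
        "pickup_business_day": float(shipment["picked_up_at"].weekday() < 5),
    }
    # One-hot encode the lane against a fixed list of known lanes
    for known in known_lanes:
        features[f"lane_{known}"] = float(lane == known)
    return features


# Illustrative record; the values are made up for this example
example = {
    "origin": (41.88, -87.63),  # roughly Chicago
    "dest": (52.52, 13.40),     # roughly Berlin
    "origin_country": "US",
    "dest_country": "DE",
    "picked_up_at": datetime(2024, 1, 8, 9, 0),  # a Monday
}
features = build_features(example, known_lanes=["US-DE", "US-US"])
```

Great-circle distance is only one possible distance measure; the source does not state which calculation the project used.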
Operational Lessons:
- Data Pipeline Design: Building reusable Jupyter notebooks for data cleaning, geocoding, and feature preparation enabled rapid iteration and model retraining.
- Account-Based Routing: A Lambda function that routes requests by account ID provides a flexible, multi-tenant architecture for scaling to additional customers.
- Validation Requirements: Transit status filtering (PICKEDUP, INTRANST, DELIVRED) and temporal consistency checks were essential for training data quality.
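The validation requirements above can be sketched as a small filter. This is an illustration under assumptions, not the project's actual logic: the event schema (a dict with "status" and "timestamp") and the function names are hypothetical, and the specific consistency rules shown here are one plausible reading of "temporal consistency checks".

```python
from datetime import datetime

# Status codes named in the text; all other codes are ignored
VALID_STATUSES = {"PICKEDUP", "INTRANST", "DELIVRED"}


def clean_events(events):
    """Keep only recognised statuses, ordered by timestamp."""
    kept = [e for e in events if e["status"] in VALID_STATUSES]
    return sorted(kept, key=lambda e: e["timestamp"])


def transit_hours(events):
    """Pickup-to-delivery duration in hours, or None if the shipment fails a check."""
    events = clean_events(events)
    pickups = [e["timestamp"] for e in events if e["status"] == "PICKEDUP"]
    deliveries = [e["timestamp"] for e in events if e["status"] == "DELIVRED"]
    # Only single-pickup, single-delivery flows are kept (multi-stop is excluded)
    if len(pickups) != 1 or len(deliveries) != 1:
        return None
    picked, delivered = pickups[0], deliveries[0]
    # Delivery must come strictly after pickup
    if delivered <= picked:
        return None
    # In-transit updates must fall between pickup and delivery
    for e in events:
        if e["status"] == "INTRANST" and not (picked <= e["timestamp"] <= delivered):
            return None
    return (delivered - picked).total_seconds() / 3600.0


# Illustrative events; "EXCEPTN" is a made-up code that gets filtered out
sample_events = [
    {"status": "PICKEDUP", "timestamp": datetime(2024, 1, 1, 8)},
    {"status": "EXCEPTN", "timestamp": datetime(2024, 1, 1, 9)},
    {"status": "INTRANST", "timestamp": datetime(2024, 1, 2, 8)},
    {"status": "DELIVRED", "timestamp": datetime(2024, 1, 3, 8)},
]
label_hours = transit_hours(sample_events)
```

Shipments for which the function returns None would be dropped from the training set; the returned duration serves as the regression target.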
The Solution
We developed an ML-powered predictive ETA solution on AWS that uses historical shipment data to forecast delivery times, with a test RMSE of roughly 21-24 hours.
Architecture Components
- Data Layer: Amazon Redshift stores orders, transit updates, pickup/delivery, and geo-data.
- Data Processing: Python ETL in SageMaker notebooks handles extraction, geocoding, distance calculations, and the engineering of 64 features (lanes, service codes, business days).
- ML Training Pipeline: SageMaker trains Linear Regression (distance/categoricals) and XGBoost (tuned hyperparameters); models/datasets saved to S3.
- Inference Layer: SageMaker endpoints for real-time predictions; Lambda (predictShipmentETA) routes by account ID.
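The account-based routing in the inference layer can be sketched as follows. Only the function name predictShipmentETA and the idea of routing by account ID come from the text; the account-to-endpoint mapping, the endpoint names, the event fields ("accountId", "features"), and the assumption that the endpoint accepts a CSV row and returns a single number as plain text are all hypothetical.

```python
import json

# Hypothetical mapping from account ID to SageMaker endpoint name
ENDPOINTS_BY_ACCOUNT = {
    "acct-001": "eta-linear-acct-001",
    "acct-002": "eta-xgboost-acct-002",
}


def resolve_endpoint(account_id, mapping=ENDPOINTS_BY_ACCOUNT):
    """Look up the endpoint for an account, failing loudly for unknown accounts."""
    try:
        return mapping[account_id]
    except KeyError:
        raise ValueError(f"No ETA model configured for account {account_id!r}")


def to_csv_row(features):
    """Serialise a feature vector as one CSV line."""
    return ",".join(str(float(v)) for v in features)


def lambda_handler(event, context, runtime_client=None):
    """Route a prediction request to the endpoint that serves the caller's account."""
    try:
        endpoint = resolve_endpoint(event["accountId"])
        payload = to_csv_row(event["features"])
    except (KeyError, ValueError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    if runtime_client is None:
        import boto3  # imported lazily so the routing logic is testable offline
        runtime_client = boto3.client("sagemaker-runtime")

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="text/csv",
        Body=payload,
    )
    # Assumption: the model container answers with one number as plain text
    eta_hours = float(response["Body"].read().decode("utf-8").strip())
    return {"statusCode": 200, "body": json.dumps({"etaHours": eta_hours})}
```

Keeping the lookup in a separate function means a new customer can be onboarded by adding a mapping entry, and the optional runtime_client argument lets the handler be exercised with a stub instead of a live endpoint.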
Technologies used:
- AWS Lambda
- Amazon SageMaker
- Amazon Redshift
- Amazon S3
- Python
Team:
- Senior Python Engineer
- 2 Python Engineers
- Data Scientist
- Software Architect
The Results
The ETA prediction solution was deployed to production without on-premises GPU hardware, avoiding $10K-50K in CapEx along with the associated upfront hardware investment and maintenance. Pay-as-you-go SageMaker with auto-shutdown endpoints, serverless Lambda routing, and automated predictions reduce the operational overhead of manual ETA processes and provide elastic scalability at a fraction of traditional infrastructure costs.
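The auto-shutdown of unused endpoints mentioned above could look roughly like this. The sketch assumes the last invocation time of each endpoint is already known (for example from monitoring data); how that is collected, the two-hour idle threshold, and the function names are assumptions. The only AWS call used is the SageMaker client's delete_endpoint.

```python
from datetime import datetime, timedelta, timezone


def idle_endpoints(last_invoked, now, max_idle=timedelta(hours=2)):
    """Return names of endpoints whose last invocation is older than max_idle."""
    return sorted(name for name, ts in last_invoked.items() if now - ts > max_idle)


def shut_down(sagemaker_client, endpoint_names):
    """Delete each named endpoint; the client would be boto3.client("sagemaker")."""
    for name in endpoint_names:
        sagemaker_client.delete_endpoint(EndpointName=name)
    return list(endpoint_names)


# Illustrative timestamps; the endpoint names are made up
now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
last_invoked = {
    "eta-linear-dev": now - timedelta(hours=5),
    "eta-xgboost-dev": now - timedelta(minutes=10),
}
to_delete = idle_endpoints(last_invoked, now)
```

Run on a schedule, a check like this stops a forgotten development endpoint from accruing charges. Deleting an endpoint leaves the trained model artifacts in S3, so the endpoint can be recreated later.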
Performance Metrics:
- Linear Regression model achieved an RMSE of 21.4 hours on the test dataset (80,889 shipments)
- XGBoost model achieved an RMSE of 23.9 hours with 50 training rounds
- Models trained on 304,441 unique shipments across multiple international shipping lanes
- Successfully processes 64 engineered features, including distance, geographic data, and service codes
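For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between actual and predicted transit times, so it is expressed in the same unit as the target (hours here). A minimal sketch, with made-up numbers that are not project data:

```python
import math


def rmse(actual_hours, predicted_hours):
    """Root mean squared error between two equal-length sequences."""
    if not actual_hours or len(actual_hours) != len(predicted_hours):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual_hours, predicted_hours)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: errors of -12, 6, and 24 hours
actual = [48.0, 72.0, 120.0]
predicted = [60.0, 66.0, 96.0]
score = rmse(actual, predicted)  # sqrt((144 + 36 + 576) / 3) = sqrt(252)
```

Because the errors are squared before averaging, a few badly mispredicted shipments raise RMSE more than many small misses do.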
Operational Outcomes:
- Automated ETA prediction capability
- Real-time predictions via AWS Lambda routing to SageMaker endpoints
- Multi-region support covering US, European, and international shipping lanes (60+ lane combinations)
- Account-based model deployment, enabling scalability to additional customers without a linear cost increase
- Production-ready REST API integration with existing shipment management systems
Business Impact:
- Improved supply chain visibility through data-driven ETA forecasting
- Enhanced ability to coordinate multi-stop international shipments
- Foundation for predictive analytics across the agricultural equipment logistics network
- Scalable ML infrastructure on AWS, enabling future model improvements and additional use cases
The solution validated the feasibility of ML-based shipment prediction and established a framework for continuous model refinement as more historical data becomes available.
Let’s discuss your needs
Unreliable shipment ETAs disrupt supply chains, but they don't have to. Whether you're tackling complex international logistics, optimizing carrier performance, or scaling predictive analytics across regions, we can build a custom ML solution on AWS SageMaker for you. At Erbis, we specialize in data-driven supply chain innovations that boost visibility, cut costs, and enable real-time decisions, just like this deployment, which achieved an RMSE under 24 hours while avoiding hefty infrastructure expenses.






