WeatherAI.io

White Paper

Published by Zoomash Ltd | February 2026

Achieving Up to 40% Improvement in Forecast Accuracy Through AI-Enhanced Weather Prediction

This paper outlines the methodology, data pipeline, model architecture, and internal validation framework behind WeatherAI.io's AI-enhanced weather prediction system. Internal benchmarks indicate that forecasts produced by this system are up to 40% more accurate than the same upstream numerical weather prediction (NWP) data without post-processing, as measured by reductions in standard verification error metrics (MAE, RMSE, CRPS) for specific variables and forecast horizons.

Note: The results presented in this paper are based on internal benchmarks conducted by Zoomash Ltd and have not been independently peer-reviewed. They are provided in good faith to document our methodology and substantiate marketing claims. We welcome scrutiny and invite researchers to contact us with questions.

1. Introduction

Weather forecasting has long relied on numerical weather prediction (NWP) — physics-based models that simulate atmospheric behaviour using mathematical equations. While NWP models have improved significantly over the decades, they carry inherent limitations that machine learning post-processing can partially address:

  • Resolution gaps: Global models such as ECMWF IFS and GFS operate at grid spacings of 9–25 km, which can miss localised mesoscale weather events such as sea-breeze convergence, orographic rainfall, and urban heat islands.
  • Initialisation errors: Errors in the initial atmospheric state propagate and amplify through the forecast period, particularly beyond Day 5 (Lorenz, 1963).
  • Parameterisation assumptions: Sub-grid-scale processes — including convection, boundary-layer turbulence, and microphysics — are parameterised rather than explicitly resolved, introducing systematic biases that vary by location and season.
  • Ensemble calibration: While ensemble prediction systems quantify uncertainty, raw ensemble spreads are often underdispersive (Hamill & Colucci, 1997), producing overconfident probability estimates.

WeatherAI.io addresses these limitations by applying statistical and machine learning post-processing to established NWP outputs, correcting systematic biases and improving localised predictions. This approach — sometimes called Model Output Statistics (MOS) in its classical form — has a long history in operational meteorology (Glahn & Lowry, 1972). WeatherAI.io extends classical MOS with modern non-linear methods.

It is important to note that WeatherAI.io is a post-processing system. It does not replace NWP models; it enhances their output. The quality of WeatherAI.io's forecasts is therefore bounded by the quality and availability of the upstream NWP data.

2. Data Sources and Pipeline

WeatherAI.io ingests data from multiple upstream providers and observational networks:

Data Source | Type | Purpose
Global NWP model outputs (via upstream API providers) | Gridded forecast data | Baseline predictions for temperature, wind, pressure, precipitation
Synoptic and mesonet surface observations | Observational (hourly) | Ground truth for model calibration and bias correction
Archived forecast-observation pairs | Historical (3+ years) | Training data for supervised ML models
Satellite-derived products | Remote sensing | Supplementary cloud cover and precipitation estimates
Digital elevation models and land-use data | Static geospatial | Terrain-aware correction factors (elevation, slope, aspect, urban fraction)

Data is quality-controlled, gap-filled where necessary using neighbouring station interpolation, and aligned to a unified spatial-temporal grid. Outlier observations are flagged and excluded using standard range checks and spatial consistency tests (WMO, 2018).
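The two QC steps named above can be sketched as follows. The thresholds and the neighbour-median comparison are illustrative assumptions for a sketch, not the operational configuration:

```python
def range_check(value_c, lo=-80.0, hi=60.0):
    """Flag a 2 m temperature outside plausible physical bounds (illustrative limits)."""
    return lo <= value_c <= hi

def spatial_consistency(value_c, neighbour_values_c, max_dev_c=10.0):
    """Flag an observation that deviates too far from the median of
    neighbouring stations (a simple spatial consistency test)."""
    ranked = sorted(neighbour_values_c)
    mid = len(ranked) // 2
    median = ranked[mid] if len(ranked) % 2 else 0.5 * (ranked[mid - 1] + ranked[mid])
    return abs(value_c - median) <= max_dev_c

def quality_control(obs, neighbours):
    """Return (passed, reason) for one station observation in degrees C."""
    if not range_check(obs):
        return False, "range"
    if not spatial_consistency(obs, neighbours):
        return False, "spatial"
    return True, "ok"
```

Observations failing either check are excluded from both training and verification, so flagged outliers cannot inflate the reported skill.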

3. Methodology

The post-processing pipeline applies four complementary techniques. Each addresses a different source of NWP error:

3.1 Non-Linear Bias Correction

Classical MOS uses linear regression to map NWP outputs to observed values (Glahn & Lowry, 1972). WeatherAI.io extends this with gradient-boosted decision trees (XGBoost; Chen & Guestrin, 2016) trained on historical forecast-versus-observation pairs. Predictor variables include the NWP forecast value, forecast lead time, time of day, season, station elevation, and recent model error trends. This captures non-linear, conditional biases — for example, NWP models may systematically overestimate overnight minimum temperatures in valley locations during clear-sky winter conditions, a pattern that linear MOS often underfits.
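To illustrate how boosted trees capture a conditional bias that linear MOS underfits, the sketch below fits depth-1 regression stumps by gradient boosting on a single predictor (the raw forecast value) to learn a step-shaped error. This is a minimal stand-in for the multi-predictor XGBoost model described above, not the production implementation:

```python
def fit_stump(x, residuals):
    """Find the depth-1 split (threshold plus two leaf means) that
    minimises squared error against the current residuals."""
    best = None
    thresholds = sorted(set(x))
    for lo, hi in zip(thresholds, thresholds[1:]):
        thr = 0.5 * (lo + hi)
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def fit_boosted_bias(x, errors, n_rounds=50, lr=0.1):
    """Boost stumps to model forecast error as a function of the raw
    forecast value (one predictor here; the real system uses many)."""
    base = sum(errors) / len(errors)
    pred = [base] * len(errors)
    stumps = []
    for _ in range(n_rounds):
        resid = [e - p for e, p in zip(errors, pred)]
        thr, lm, rm = fit_stump(x, resid)
        stumps.append((thr, lm, rm))
        pred = [p + lr * (lm if xi <= thr else rm) for p, xi in zip(pred, x)]
    return base, lr, stumps

def predict_error(model, xi):
    base, lr, stumps = model
    return base + sum(lr * (lm if xi <= thr else rm) for thr, lm, rm in stumps)
```

A corrected forecast is then the raw forecast minus the predicted error. The step-shaped bias in this sketch mirrors the valley-inversion example: a correction that applies only under one regime, which a single linear coefficient cannot represent.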

3.2 Temporal Error Learning

Forecast errors are not independent across lead times; errors at Day 1 carry information about likely errors at Day 3. We apply sequence models (LSTM networks; Hochreiter & Schmidhuber, 1997) to recent forecast error sequences to predict and correct errors at subsequent lead times. This is conceptually similar to the "error correction" approach described by Krasnopolsky & Lin (2012), extended with modern sequence-modelling architectures.
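An LSTM is too heavy to sketch here, but the underlying idea — that yesterday's forecast error carries information about today's — can be shown with a first-order autoregressive error predictor standing in for the sequence model (the AR(1) choice and all names are illustrative):

```python
def fit_ar1(error_sequence):
    """Estimate the lag-1 autoregression coefficient of the
    forecast-error series (error = forecast minus observation)."""
    pairs = list(zip(error_sequence, error_sequence[1:]))
    num = sum(prev * curr for prev, curr in pairs)
    den = sum(prev * prev for prev, _ in pairs)
    return num / den

def corrected_forecast(raw_forecast, previous_error, phi):
    """Subtract the predicted persistence of the previous error."""
    return raw_forecast - phi * previous_error
```

When errors are genuinely autocorrelated, subtracting the predicted carry-over shrinks them on average; the LSTM generalises this to longer, non-linear dependence across lead times.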

3.3 Statistical Downscaling

Coarse-resolution NWP grid cells (9–25 km) are refined to point-level station predictions using regression models that incorporate local terrain features (elevation, slope, aspect, distance to coast, urban fraction) as predictors. This approach follows established statistical downscaling methodology (Maraun et al., 2010) and is particularly effective for temperature and wind speed in complex terrain where sub-grid variability is high.
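A minimal version of this terrain-aware regression can be sketched with a single terrain predictor — the elevation offset between station and grid cell — and an ordinary least-squares fit. The coefficient structure (intercept, grid-forecast weight, lapse-rate-style elevation term) is illustrative:

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial
    pivoting (sufficient for the normal equations below)."""
    m = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def fit_downscaler(grid_temps, elev_diffs, station_obs):
    """Least-squares fit of obs ~ a + b*grid_temp + c*elev_diff, where
    elev_diff is station elevation minus grid-cell mean elevation (m)."""
    rows = [[1.0, g, e] for g, e in zip(grid_temps, elev_diffs)]
    A = [[sum(ri[i] * ri[j] for ri in rows) for j in range(3)] for i in range(3)]
    b = [sum(ri[i] * y for ri, y in zip(rows, station_obs)) for i in range(3)]
    return solve3(A, b)  # [a, b, c]
```

In practice the fitted elevation coefficient recovers a lapse-rate-like value where temperature dominates; the full model adds slope, aspect, coastal distance, and urban fraction as further columns.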

3.4 Ensemble Calibration

Raw NWP ensemble spreads tend to be underdispersive — they underestimate forecast uncertainty (Hamill & Colucci, 1997). WeatherAI.io applies Ensemble Model Output Statistics (EMOS; Gneiting et al., 2005) to recalibrate ensemble outputs into well-calibrated predictive distributions. The result is more reliable probability estimates for API consumers who need confidence intervals rather than point forecasts.
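Full EMOS fits its coefficients by minimum-CRPS estimation; the sketch below uses a simpler moment-based fit — regression for the mean, and regression of squared residuals on ensemble variance for the spread — together with the closed-form CRPS of a normal distribution. It illustrates the shape of the method (predictive distribution N(a + b·mean, c + d·variance)), not the production estimator:

```python
import math

def gauss_crps(mu, sigma, y):
    """Closed-form CRPS of a normal predictive distribution at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def fit_emos(means, variances, obs):
    """Moment-based stand-in for EMOS: mean model obs ~ a + b*ens_mean,
    spread model residual^2 ~ c + d*ens_variance."""
    n = len(obs)
    mx, my = sum(means) / n, sum(obs) / n
    b = sum((m - mx) * (y - my) for m, y in zip(means, obs)) / sum((m - mx) ** 2 for m in means)
    a = my - b * mx
    r2 = [(y - a - b * m) ** 2 for m, y in zip(means, obs)]
    vx, ry = sum(variances) / n, sum(r2) / n
    d = sum((v - vx) * (r - ry) for v, r in zip(variances, r2)) / sum((v - vx) ** 2 for v in variances)
    c = ry - d * vx
    return a, b, c, d

def emos_crps(params, ens_mean, ens_var, y):
    a, b, c, d = params
    mu = a + b * ens_mean
    var = max(c + d * ens_var, 1e-3)  # keep the predictive variance positive
    return gauss_crps(mu, math.sqrt(var), y)
```

When the raw spread is underdispersive, the fitted c and d inflate the predictive variance towards the true error variance, which lowers the mean CRPS relative to trusting the raw ensemble spread.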

4. Accuracy Measurement Framework

4.1 Metrics

Forecast accuracy is evaluated using standard meteorological verification metrics as defined by the World Meteorological Organization (WMO) and the Joint Working Group on Forecast Verification Research (JWGFVR):

Metric | Definition | Applied to
MAE | Mean Absolute Error — average absolute difference between forecast and observed value | Temperature (2 m), wind speed (10 m)
RMSE | Root Mean Square Error — penalises large errors more heavily than MAE | Temperature, precipitation amount
CRPS | Continuous Ranked Probability Score — evaluates the full predictive distribution, not just the mean (Gneiting & Raftery, 2007) | Probabilistic forecasts (precipitation, temperature)

Improvement is expressed as percentage reduction in the error metric: ((Baseline − WeatherAI) / Baseline) × 100. For MAE and RMSE, lower values are better, so a reduction represents improvement. For CRPS, the same interpretation applies.
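The definitions above are straightforward to compute; the sketch below reproduces MAE, RMSE, and the percentage-reduction formula, using the Day 5 temperature figures from Section 5.1 as a worked example:

```python
import math

def mae(forecasts, observations):
    """Mean Absolute Error."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)

def rmse(forecasts, observations):
    """Root Mean Square Error."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts))

def improvement_pct(baseline_error, model_error):
    """Percentage reduction: ((Baseline - WeatherAI) / Baseline) * 100."""
    return (baseline_error - model_error) / baseline_error * 100.0
```

With the Day 5 values from Section 5.1, improvement_pct(3.21, 1.93) evaluates to approximately 39.9%, matching the table.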

4.2 Baseline Definition

The baseline in all comparisons is the raw NWP forecast from the same upstream provider, interpolated to station locations using bilinear interpolation, with no post-processing applied. This ensures a fair like-for-like comparison — the only variable is whether the AI post-processing layer is applied.
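Bilinear interpolation of a gridded field to a station location can be sketched as follows, with fx and fy the station's fractional position within its grid cell:

```python
def bilinear(q11, q21, q12, q22, fx, fy):
    """Interpolate between four grid-cell corner values.
    q11 = lower-left, q21 = lower-right, q12 = upper-left,
    q22 = upper-right; fx, fy in [0, 1] measured from the lower-left corner."""
    bottom = q11 * (1.0 - fx) + q21 * fx
    top = q12 * (1.0 - fx) + q22 * fx
    return bottom * (1.0 - fy) + top * fy
```

Both the baseline and the post-processed forecasts start from the same interpolated values, so the comparison isolates the effect of the AI layer.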

4.3 Validation Protocol

To guard against overfitting and ensure results generalise, we apply the following validation protocol:

  • Strict temporal separation: Models are trained on forecast-observation pairs from January 2022 to December 2024 and validated on an independent 12-month holdout period (January–December 2025). No validation data is used during training or hyperparameter tuning.
  • Geographic scope: Validation is conducted across 547 surface weather stations spanning six climate zones (tropical, subtropical, temperate maritime, temperate continental, arid, and subarctic) across Europe, North America, and parts of Asia-Pacific.
  • Seasonal stratification: Results are computed separately for each meteorological season (DJF, MAM, JJA, SON) to ensure improvements are not confined to seasons with lower baseline skill.
  • Significance testing: Reported improvements are statistically significant at p < 0.01 using a paired bootstrap test (Hamill, 1999) with 10,000 resamples, unless otherwise noted.
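The paired bootstrap test in the last point can be sketched as follows: resample the per-case error differences with replacement and report the fraction of resamples in which the mean improvement disappears, a one-sided p-value. Sample sizes and the seed are illustrative:

```python
import random

def paired_bootstrap_p(baseline_errors, model_errors, n_resamples=2000, seed=42):
    """One-sided p-value for H0: the model is no better than the baseline,
    via bootstrap resampling of paired error differences."""
    rng = random.Random(seed)
    diffs = [b - m for b, m in zip(baseline_errors, model_errors)]  # > 0 means improvement
    n = len(diffs)
    hits = 0
    for _ in range(n_resamples):
        mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if mean <= 0.0:
            hits += 1
    return hits / n_resamples
```

Pairing the errors case by case removes the shared day-to-day weather variability, which is what makes the test sensitive enough to detect modest improvements.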

5. Results

The following tables summarise performance on the 12-month holdout validation set (January–December 2025), aggregated across all 547 validation stations. Values represent the mean across stations; 95% confidence intervals (CI) from bootstrap resampling are provided.

5.1 2-Metre Temperature (MAE, °C)

Lead Time | Raw NWP MAE (°C) | WeatherAI MAE (°C) | MAE Reduction
Day 1 (24 h) | 1.82 | 1.21 | 33.5% (CI: 30.8–36.1%)
Day 3 (72 h) | 2.47 | 1.58 | 36.0% (CI: 33.2–38.9%)
Day 5 (120 h) | 3.21 | 1.93 | 39.9% (CI: 36.7–42.8%)
Day 7 (168 h) | 3.88 | 2.53 | 34.8% (CI: 31.5–37.9%)
Day 10 (240 h) | 4.76 | 3.42 | 28.2% (CI: 24.9–31.3%)

5.2 Precipitation Amount (CRPS, mm)

CRPS evaluates the full predictive distribution. Lower values indicate better probabilistic calibration. Improvement is expressed as percentage reduction in CRPS.

Lead Time | Raw NWP CRPS (mm) | WeatherAI CRPS (mm) | CRPS Reduction
Day 1 (24 h) | 1.38 | 0.95 | 31.2% (CI: 27.4–34.8%)
Day 3 (72 h) | 2.14 | 1.30 | 39.3% (CI: 35.1–43.2%)
Day 5 (120 h) | 2.87 | 1.73 | 39.7% (CI: 35.6–43.5%)

5.3 10-Metre Wind Speed (RMSE, m/s)

Lead Time | Raw NWP RMSE (m/s) | WeatherAI RMSE (m/s) | RMSE Reduction
Day 1 (24 h) | 2.12 | 1.42 | 33.0% (CI: 29.8–36.1%)
Day 3 (72 h) | 2.78 | 1.79 | 35.6% (CI: 32.1–39.0%)
Day 5 (120 h) | 3.51 | 2.14 | 39.0% (CI: 35.4–42.5%)

5.4 Seasonal Variation

Improvements are not uniform across seasons. For 2-metre temperature at Day 5, MAE reduction ranges from 34.2% in summer (JJA) to 44.1% in winter (DJF). Winter shows the largest gains because NWP models exhibit stronger systematic biases during stable, cold-air conditions (temperature inversions, frost events) that are well-suited to statistical correction. Conversely, convective summer conditions introduce more stochastic variability that is harder to correct via post-processing.

5.5 Summary of the "Up to 40%" Claim

Across all measured variables and lead times, the error reduction ranges from 28.2% (temperature at Day 10) to 39.9% (temperature at Day 5), with a median reduction of approximately 35.6% across the full validation set. The "up to 40%" marketing claim is specifically supported by:

  • Day 5 temperature MAE reduction: 39.9% (upper CI bound: 42.8%)
  • Day 3 precipitation CRPS reduction: 39.3% (upper CI bound: 43.2%)
  • Day 5 precipitation CRPS reduction: 39.7% (upper CI bound: 43.5%)
  • Day 5 wind speed RMSE reduction: 39.0% (upper CI bound: 42.5%)
  • Winter-season temperature MAE reduction at Day 5: 44.1%

The "up to 40%" figure is therefore a reasonable representation of the upper range of observed improvements for medium-range forecasts (Day 3–5) where NWP systematic biases are most correctable. It is not representative of all variables, all lead times, or all geographic regions.

6. Limitations and Transparency

  • Internal benchmarks: The results presented here are based on internal validation conducted by Zoomash Ltd. They have not been independently verified or peer-reviewed. We intend to submit findings for independent verification and welcome enquiries from researchers.
  • "Up to" qualifier: The 40% figure represents peak improvement in specific forecast variables, lead times, and seasons. Day 1 forecasts show ~33% improvement (baseline accuracy is already higher), and Day 10+ forecasts show ~28% improvement (fundamental predictability limits constrain what post-processing can achieve).
  • Geographic bias: Validation stations are concentrated in Europe and North America, where observational density is highest. Performance in data-sparse regions (sub-Saharan Africa, open ocean, polar regions) is expected to be lower due to fewer training observations and should not be assumed to match the figures reported here.
  • Extreme events: Rare and extreme weather events (e.g., record-breaking heatwaves, Category 4+ hurricanes) have limited representation in training data. While post-processing reduces average errors, skill improvement for tail-risk events is more modest and may not match the headline figures. This is a well-known limitation of supervised learning approaches in meteorology (Herman & Schumacher, 2018).
  • Upstream dependency: WeatherAI.io enhances upstream NWP outputs; it does not replace them. If the upstream NWP provider experiences model degradation, data latency, or outages, downstream forecast quality is directly affected. The improvement percentages reported here assume normal upstream operational conditions.
  • Stationarity assumption: ML models are trained on historical data and assume that the statistical relationship between NWP errors and observational truth remains approximately stationary. Major changes to upstream NWP model versions may require model retraining. We monitor upstream model changes and retrain post-processing models as needed.

7. Context in Published Research

The magnitude of improvement reported here is consistent with published peer-reviewed research on ML-enhanced weather post-processing:

  • Rasp & Lerch (2018): "Neural Networks for Postprocessing Ensemble Weather Forecasts" (Monthly Weather Review, 146(11), 3885–3900). Demonstrated 10–20% CRPS improvement over raw ensembles using neural network-based EMOS for temperature and wind speed.
  • Lam et al. (2023): "Learning skillful medium-range global weather forecasting" (Science, 382(6677), 1416–1421). Google DeepMind's GraphCast demonstrated ML models matching or exceeding ECMWF HRES accuracy at 10-day lead times across multiple variables, establishing ML as competitive with operational NWP.
  • Schultz et al. (2021): "Can Deep Learning Beat Numerical Weather Prediction?" (Philosophical Transactions of the Royal Society A, 379(2194), 20200097). Reviewed ML applications in meteorology, reporting 15–30% improvement in precipitation nowcasting and short-range forecasting.
  • Gneiting et al. (2005): "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation" (Monthly Weather Review, 133(5), 1098–1118). Established the EMOS framework used as the basis for our ensemble calibration approach.
  • Chen & Guestrin (2016): "XGBoost: A Scalable Tree Boosting System" (Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794). Describes the gradient boosting framework used in our bias correction module.

WeatherAI.io's results — 28–40% error reduction depending on variable and lead time — sit at the upper end of published post-processing improvements. This is consistent with what would be expected from combining multiple complementary techniques (bias correction, temporal error learning, downscaling, and ensemble calibration) into a unified pipeline, as each technique targets a different error source.

8. Conclusion

By applying a multi-technique machine learning post-processing pipeline to upstream NWP data, WeatherAI.io achieves forecast error reductions of 28–40% across temperature, precipitation, and wind speed for lead times of 1–10 days, as measured on a 12-month holdout validation set across 547 stations.

The "up to 40% more accurate" claim is specifically supported by Day 3–5 forecast improvements in temperature (MAE), precipitation (CRPS), and wind speed (RMSE), where upstream NWP models exhibit the most correctable systematic biases. We are transparent that this figure represents the upper bound, not the average across all conditions.

These results are based on internal validation and we welcome independent scrutiny. Researchers interested in verification, collaboration, or accessing validation data for academic purposes are encouraged to contact us.

References

  • Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proc. 22nd ACM SIGKDD, 785–794. doi:10.1145/2939672.2939785
  • Glahn, H.R. & Lowry, D.A. (1972). The Use of Model Output Statistics (MOS) in Objective Weather Forecasting. Journal of Applied Meteorology, 11(8), 1203–1211.
  • Gneiting, T., Raftery, A.E., Westveld, A.H. & Goldman, T. (2005). Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation. Monthly Weather Review, 133(5), 1098–1118.
  • Gneiting, T. & Raftery, A.E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378.
  • Hamill, T.M. (1999). Hypothesis Tests for Evaluating Numerical Precipitation Forecasts. Weather and Forecasting, 14(2), 155–167.
  • Hamill, T.M. & Colucci, S.J. (1997). Verification of Eta–RSM Short-Range Ensemble Forecasts. Monthly Weather Review, 125(6), 1312–1327.
  • Herman, G.R. & Schumacher, R.S. (2018). Money Doesn't Grow on Trees, but Forecasts Do. Bulletin of the American Meteorological Society, 99(7), 1405–1414.
  • Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
  • Krasnopolsky, V.M. & Lin, Y. (2012). A Neural Network Nonlinear Multimodel Ensemble to Improve Precipitation Forecasts over Continental US. Advances in Meteorology, 2012, 649450.
  • Lam, R. et al. (2023). Learning skillful medium-range global weather forecasting. Science, 382(6677), 1416–1421. doi:10.1126/science.adi2336
  • Lorenz, E.N. (1963). Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2), 130–141.
  • Maraun, D. et al. (2010). Precipitation downscaling under climate change. Reviews of Geophysics, 48(3), RG3003.
  • Rasp, S. & Lerch, S. (2018). Neural Networks for Postprocessing Ensemble Weather Forecasts. Monthly Weather Review, 146(11), 3885–3900. doi:10.1175/MWR-D-18-0187.1
  • Schultz, M.G. et al. (2021). Can Deep Learning Beat Numerical Weather Prediction? Philosophical Transactions of the Royal Society A, 379(2194), 20200097.
  • WMO (2018). Guide to Meteorological Instruments and Methods of Observation (WMO-No. 8). World Meteorological Organization.

About Zoomash Ltd

Zoomash Ltd is a UK-based technology company registered in England (Company No. 7838145) at 3rd Floor, 86-90 Paul Street, London, Hackney, EC2A 4NE.

WeatherAI.io is its flagship weather intelligence platform.

For technical enquiries: Contact Us

© 2026 Zoomash Ltd. All rights reserved. Document version 1.1, February 2026.