Notes

Short pieces on forecasting methods, production systems, and lessons from running models at scale. Some are original; others first appeared in Foresight and are reposted here with permission. For peer-reviewed work, see Research.

Three cases where compositional modeling beats bottom-up aggregation

If you forecast a total and its parts separately, the parts won't add up to the total. This is well known. Most production systems ignore it anyway. They forecast each component independently, reconcile after the fact, and move on.

That is often fine. Sometimes it is not. Here are three situations where I have found it worth building coherence into the model directly rather than patching it downstream.

1. When the total is more predictable than the parts

Revenue mix by currency is a clean example. Total revenue in USD has strong seasonal patterns and relatively stable trends. The share attributable to EUR, GBP, or AUD individually is noisier. Exchange rate movements and regional booking behavior are hard to forecast well. If you forecast each currency's revenue independently, you get reasonable point estimates that sum to something different from your total revenue forecast. Reconciliation fixes the math but introduces artifacts. The adjusted shares inherit noise from both the component and total forecasts.

A compositional model (Dirichlet, logistic-normal, or something in that family) forecasts the shares directly on the simplex, then multiplies by a separate total forecast. The shares sum to one by construction. No reconciliation needed. The gain is that downstream consumers (treasury, FP&A) get numbers that are internally consistent without a post-processing step that nobody fully understands.
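A minimal sketch of this two-piece construction, with synthetic numbers and an inverse additive log-ratio map standing in for a fitted logistic-normal model (the currency names, means, and total are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of log-ratio coordinates for three currency
# shares (EUR, GBP, AUD) against a residual baseline. In a real model these
# come from a fitted logistic-normal or Dirichlet time series model.
mu = np.array([-0.5, -1.2, -2.0])                       # mean additive log-ratios
draws = rng.multivariate_normal(mu, 0.05 * np.eye(3), size=4000)

# Inverse additive log-ratio: map each draw back onto the simplex.
expd = np.exp(np.column_stack([draws, np.zeros(len(draws))]))
shares = expd / expd.sum(axis=1, keepdims=True)         # rows sum to 1 by construction

# Multiply by a separately forecast total: components cohere automatically.
total_forecast = 1_250_000.0                            # total revenue forecast (USD)
component_forecasts = shares.mean(axis=0) * total_forecast
```

Because every draw lives on the simplex, the component forecasts sum to the total exactly, with no reconciliation pass.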

2. When substitution effects dominate

Consider a platform where customers choose among product categories and the total number of transactions is roughly fixed in the short run. A spike in Category A usually means a dip in Category B. Independent models do not capture this. They can both go up simultaneously, producing a total that overshoots reality.

Compositional models handle substitution naturally because the shares are jointly modeled. If one share increases, the others must decrease. This is not a side effect. It is the point. The constraint encodes the economic reality that these categories compete for the same pool of transactions.

Bottom-up aggregation can approximate this with a reconciliation step, but the reconciliation is doing the work that the compositional model does by construction. And reconciliation methods (MinT, OLS, etc.) optimize a statistical criterion, not the structural constraint. They get you close, but "close" means your shares sum to 1.003 or 0.997, and someone in finance will ask why.
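For contrast, here is what the reconciliation step is doing: a minimal OLS reconciliation over a one-level hierarchy (total = A + B) with made-up base forecasts. MinT generalizes this projection by weighting with an estimated error covariance.

```python
import numpy as np

# Base (incoherent) point forecasts: [total, category A, category B].
y_hat = np.array([1000.0, 620.0, 455.0])   # components overshoot the total by 75

# Summing matrix mapping bottom series [A, B] to [total, A, B].
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# OLS reconciliation: least-squares projection onto the coherent subspace.
beta = np.linalg.lstsq(S, y_hat, rcond=None)[0]   # reconciled bottom forecasts
y_tilde = S @ beta                                # coherent forecasts
```

The reconciled forecasts cohere by construction of the projection, but the adjustment is purely statistical; nothing in it knows that A and B compete for the same transactions.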

3. When you need calibrated prediction intervals for shares

This is where the gap is widest. Bottom-up forecasting with reconciliation gives you point estimates that (roughly) cohere. Getting coherent prediction intervals is harder. You need the joint distribution of all components, including their correlations, to produce intervals for the shares that respect the simplex constraint.

A Bayesian compositional model gives you this directly. The posterior predictive distribution lives on the simplex, so any credible interval you compute for a share is automatically bounded between 0 and 1, and the intervals for all shares are jointly consistent. Try getting that from independent ARIMA models with ad hoc reconciliation.
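A toy illustration, with a static Dirichlet standing in for the posterior predictive of a fitted compositional model (the concentration values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior-predictive draws of shares for three components.
draws = rng.dirichlet(alpha=[40.0, 25.0, 10.0], size=8000)

# Marginal 90% credible intervals per share. Each bound is automatically
# inside (0, 1), and every joint draw still sums to one.
lo, hi = np.quantile(draws, [0.05, 0.95], axis=0)
```

No clipping, no renormalizing: the bounds fall out of working on the simplex in the first place.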

When to skip it

If your components are weakly correlated, you do not care about the shares (only the levels), and nobody downstream needs the parts to add up exactly, independent forecasting with reconciliation is simpler and probably sufficient. The setup cost of a compositional model is not zero. You need priors on the simplex, a sampler that handles the constraint, and stakeholders who understand what "Dirichlet" means or at least trust that it works.

The decision rule I use: if someone will divide your forecasts to compute a share and then make a decision based on that share, model the shares directly.

Two-Part Forecasting for Time-Shifted Metrics
arXiv preprint · Code & Stan models · Supplementary material

The problem

Many sectors face a version of the same forecasting challenge: the date something is recorded does not match the date it happens. A booking is made on January 5 for a trip starting February 10. A purchase order is placed on Monday for delivery on Friday. A trade is executed today for settlement in two days.

Traditional forecasting approaches operate on a single time axis and cannot describe how a metric recorded on one axis materializes on the other. Hierarchical methods reconcile forecasts across organizational levels but do not handle the two-axis structure. Temporal aggregation scales forecasts up or down in granularity but does not distribute one metric across another time dimension.

We introduce a two-part methodology that treats forecasting as a time-shift operator. Part 1 projects total demand on the recording axis. Part 2 translates those forecasts to the consumption axis using a compositional time series model.

Methodology

Part 1: Total bookings. A univariate time series model (we used Prophet, but any reasonable method works) forecasts total daily bookings on the booking-date axis. This gives you projected volume by day, ignoring when those bookings will actually be consumed.

Part 2: Lead-time allocation. B-DARMA models the proportions of bookings falling into each lead-time bucket (0 months out, 1 month out, ..., 12 months out) as a compositional time series. The model captures how the mix of last-minute versus long-term bookings evolves over time, with monthly seasonality via Fourier terms and a linear trend for shifting booking behavior.

Combining the parts. Multiply each month's total bookings by the corresponding lead-time proportions, shift forward by the appropriate offset, and sum across all booking months that align with a given trip month. The result is a forecast on the trip-date axis derived from the booking-date axis.
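The combination step above can be sketched with toy shapes and a flat lead-time mix (the bucket count, totals, and proportions are all illustrative, not the paper's fitted values):

```python
import numpy as np

n_months, n_leads = 6, 4                          # 4 lead-time buckets: 0..3 months out
totals = np.array([100, 120, 90, 110, 130, 105], dtype=float)  # Part 1: booking-axis totals
props = np.full((n_months, n_leads), 0.25)        # Part 2: lead-time mix; each row sums to 1

# Shift each booking month's volume forward by its lead time and sum by trip month.
trip = np.zeros(n_months + n_leads - 1)
for m in range(n_months):
    for k in range(n_leads):
        trip[m + k] += totals[m] * props[m, k]    # bookings in month m for trips in m + k
```

Because each row of proportions sums to one, total volume is conserved when moving from the booking axis to the trip axis.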

Full mathematical derivations, including the additive log-ratio transformations, are in the supplementary material. Technical details on the B-DARMA specification are in Katz, Brusch, and Weiss (2024).
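For reference, the additive log-ratio transform mentioned above maps a J-part composition y_t on the simplex to J-1 unconstrained coordinates (taking the last part as the reference, one common convention), with the inverse mapping back:

```latex
\operatorname{alr}(y_t) = \left( \log\frac{y_{t,1}}{y_{t,J}}, \ldots, \log\frac{y_{t,J-1}}{y_{t,J}} \right) = z_t,
\qquad
y_{t,j} = \frac{\exp(z_{t,j})}{1 + \sum_{k=1}^{J-1} \exp(z_{t,k})}
```

where the reference share is recovered as y_{t,J} = 1 / (1 + Σ exp(z_{t,k})). Modeling happens in z-space; forecasts are mapped back to the simplex.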

Data

We used two anonymized Airbnb datasets spanning January 2014 to December 2019 (pre-COVID). City A is a large metropolitan market with strong seasonal variability. City B is a midsized leisure destination with more moderate seasonality. Each dataset contains daily booking counts, trip dates, and lead times in months. We created 13 monthly lead-time buckets (0 to 12). Training period was 2014 through 2018; test period was all of 2019.

Results

We benchmarked against a bottom-up Prophet approach: separate univariate Prophet forecasts for each lead-time bucket, summed to get totals.

City  Method     Booking-Date MAE  Booking-Date MAPE  Lead-Time Mean L1
A     Two-Part   5,083             4.8%               0.0229
A     Bottom-Up  5,336             5.07%              0.0389
B     Two-Part   1,406             3.07%              0.0300
B     Bottom-Up  1,455             3.15%              0.0499

The two-part approach outperformed bottom-up Prophet on both axes in both markets. The compositional framework captures cross-bucket correlations that independent univariate forecasts miss. The improvement is most visible in the lead-time distributions: normalized L1 distance drops by roughly 40% in both cities.
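One way to compute a normalized L1 distance between an actual and a forecast lead-time distribution; the paper's exact normalization may differ, and this sketch uses the total-variation convention with made-up five-bucket vectors:

```python
import numpy as np

# Illustrative lead-time distributions (each sums to 1).
actual   = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
forecast = np.array([0.28, 0.27, 0.19, 0.16, 0.10])

# Total-variation style normalization: 0 = identical, 1 = fully disjoint.
l1 = 0.5 * np.abs(actual - forecast).sum()
```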

Why this matters in practice

The modularity is the real advantage. Adjusting total forecasts in response to a macro shock or event does not require refitting the lead-time model. Scenario analysis becomes fast. And for short-horizon forecasting, where some future bookings are already known, existing reservations serve as a baseline while the two-part model projects additional bookings that might still materialize.

B-DARMA can also incorporate exogenous covariates on the trip-date side. A Super Bowl indicator or Easter flag can shift proportions in the relevant lead-time bucket, linking trip-date features to booking-date allocations under one framework.

Limitations

Splitting the process into two parts may miss interactions between total demand and lead-time behavior that a unified model could capture. B-DARMA assumes strictly positive proportions, so sparse or zero-valued lead-time buckets require care. And if lead times are static or not important to your decision, the added complexity is not justified.

References

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.
Armstrong, J.S. (2001). Combining Forecasts. In: Principles of Forecasting. Springer.
Hyndman, R.J., Ahmed, R.A., Athanasopoulos, G., & Shang, H.L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579-2589.
Katz, H., Brusch, K.T., & Weiss, R.E. (2024). A Bayesian Dirichlet auto-regressive moving average model for forecasting lead times. International Journal of Forecasting, 40(4), 1556-1567.
Silvestrini, A. & Veredas, D. (2008). Temporal aggregation of univariate and multivariate time series models: A survey. Journal of Economic Surveys, 22(3), 458-497.
Taylor, S.J. & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37-45.
Zheng, T. & Chen, R. (2017). Dirichlet ARMA models for compositional time series. Journal of Multivariate Analysis, 158, 31-46.