Building a Synthetic Data Pipeline with QCBM and TimeGAN

Tags: finance · quantum · genai

Between 2023 and early 2026, the team built finance_lab — a synthetic data pipeline that generates realistic financial time series conditioned on market regimes. The goal was never synthetic data for its own sake. We needed it for three concrete problems: backtesting strategies against rare market events that appear maybe twice in a decade of historical data, running regulatory stress tests that require scenarios outside the historical distribution, and training downstream models without exposing sensitive portfolio data.

Why Synthetic Data Matters

Historical market data is finite. If you want to stress-test a portfolio against a 2008-style crash, you have exactly one 2008. Naive resampling, such as the i.i.d. bootstrap, preserves marginal distributions but destroys the temporal dependencies that make financial data interesting: autocorrelation, volatility clustering, regime persistence. Block-based variants keep some short-range dependence, but they can only rearrange history that already happened. We needed a generative approach that could produce new trajectories faithful to the joint distribution of returns, volumes, and volatility across time.
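To make the resampling point concrete, here is a small sketch on toy data. The stochastic-volatility process, its parameters, and the helper name acf are all illustrative assumptions, not part of finance_lab. It resamples a clustered return series observation by observation and compares the lag-1 autocorrelation of squared returns before and after.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at a single positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)

# Toy returns with volatility clustering: log-volatility follows a slow AR(1).
# The coefficients 0.98 and 0.1 are arbitrary choices for illustration.
n = 5000
log_vol = np.zeros(n)
for t in range(1, n):
    log_vol[t] = 0.98 * log_vol[t - 1] + 0.1 * rng.standard_normal()
returns = 0.01 * np.exp(log_vol) * rng.standard_normal(n)

# i.i.d. bootstrap: draw individual observations with replacement.
boot = rng.choice(returns, size=n, replace=True)

# Squared returns are a standard proxy for volatility clustering.
acf_real = acf(returns**2, lag=1)
acf_boot = acf(boot**2, lag=1)
print(f"lag-1 ACF of squared returns, original:  {acf_real:.3f}")
print(f"lag-1 ACF of squared returns, bootstrap: {acf_boot:.3f}")
```

Every bootstrapped value comes from the original sample, so the marginal distribution is preserved, while the ordering that carries the clustering is discarded.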

Market Regime Labeling

The pipeline starts with regime detection. We fit a Hidden Markov Model to rolling windows of realized volatility, return skewness, and cross-asset correlation. This segments historical data into regimes — roughly corresponding to low-volatility trending markets, high-volatility drawdowns, and transitional periods. The regime labels become conditioning variables for the generative models downstream.
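The feature step described above might look roughly like the following sketch. The function name rolling_features, the 21-day window, the equal-weighted market proxy, and the use of average pairwise correlation are assumptions made for illustration; the post does not specify how finance_lab computes these inputs.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_features(returns, window=21):
    """Rolling regime features from a (T, N) array of asset returns, N >= 2.

    Returns an array of shape (T - window + 1, 3) with columns:
    realized volatility, return skewness, average pairwise correlation.
    """
    r = np.asarray(returns, dtype=float)
    n_assets = r.shape[1]
    # Shape (T - window + 1, N, window): one window per row, per asset.
    w = sliding_window_view(r, window, axis=0)

    # Equal-weighted "market" return inside each window (an assumption).
    mkt = w.mean(axis=1)
    realized_vol = np.sqrt((mkt**2).sum(axis=1))

    centered = mkt - mkt.mean(axis=1, keepdims=True)
    skew = (centered**3).mean(axis=1) / centered.std(axis=1) ** 3

    # Normalize each asset's window so dot products become correlations.
    z = w - w.mean(axis=2, keepdims=True)
    z = z / np.linalg.norm(z, axis=2, keepdims=True)
    corr = z @ z.transpose(0, 2, 1)  # (n_windows, N, N), ones on the diagonal
    avg_corr = (corr.sum(axis=(1, 2)) - n_assets) / (n_assets * (n_assets - 1))

    return np.column_stack([realized_vol, skew, avg_corr])

rng = np.random.default_rng(0)
toy_returns = 0.01 * rng.standard_normal((250, 3))  # placeholder data
features = rolling_features(toy_returns, window=21)
print(features.shape)  # (230, 3)
```

A feature matrix like this is what an HMM would be fit on. One possible choice, not confirmed by the post, is hmmlearn's GaussianHMM with three components, whose predict method returns one regime label per row.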

Regime labeling is not just preprocessing. It is the key design decision that separates this pipeline from naive GAN-based data generation. Without regime conditioning, a generative model will average across market states and produce data that looks plausible in aggregate but fails to capture the fat tails and correlation spikes that matter for risk management.

QCBM: Quantum Circuit Born Machines

Quantum Circuit Born Machines use parameterized quantum circuits as implicit generative models. Measuring an n-qubit circuit yields one of 2^n bitstrings, and the Born-rule probability of each outcome becomes the model's learned distribution over discretized data. We use QCBMs to model the joint distribution of regime-conditioned features at each timestep. A sufficiently expressive circuit can represent complex multimodal distributions, which makes it well suited to capturing the heavy tails and asymmetric dependencies in financial returns.
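As an illustration of the Born-rule idea, here is a minimal statevector simulation in plain NumPy. The ansatz (layers of RY rotations followed by a chain of CNOTs), the qubit count, and the function names are assumptions chosen for brevity; the post does not describe the circuit finance_lab actually uses.

```python
import numpy as np

def apply_ry(state, theta, qubit, n_qubits):
    """Apply an RY(theta) rotation to one qubit of an n-qubit statevector."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    gate = np.array([[c, -s], [s, c]])
    psi = state.reshape([2] * n_qubits)
    psi = np.tensordot(gate, psi, axes=([1], [qubit]))
    return np.moveaxis(psi, 0, qubit).reshape(-1)

def apply_cnot(state, control, target, n_qubits):
    """Flip the target qubit on the part of the state where control is 1."""
    psi = state.reshape([2] * n_qubits).copy()
    idx = [slice(None)] * n_qubits
    idx[control] = 1
    # After fixing the control axis, later axes shift down by one.
    t_axis = target if target < control else target - 1
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=t_axis).copy()
    return psi.reshape(-1)

def qcbm_probs(params, n_qubits):
    """Born-rule probabilities over all 2**n_qubits bitstrings.

    params has shape (n_layers, n_qubits) and holds the RY angles.
    Qubit 0 is the most significant bit of the outcome index.
    """
    state = np.zeros(2**n_qubits, dtype=complex)
    state[0] = 1.0  # start in |00...0>
    for layer in params:
        for q, theta in enumerate(layer):
            state = apply_ry(state, theta, q, n_qubits)
        for q in range(n_qubits - 1):
            state = apply_cnot(state, q, q + 1, n_qubits)
    return np.abs(state) ** 2

rng = np.random.default_rng(0)
n_qubits, n_layers = 3, 2
params = rng.uniform(0, 2 * np.pi, size=(n_layers, n_qubits))
probs = qcbm_probs(params, n_qubits)

# Sampling a bitstring index; each index would map to one bin of the
# discretized feature space (the mapping itself is not shown here).
samples = rng.choice(2**n_qubits, size=1000, p=probs)
print(probs.round(3))
```

The model never writes down a density. It only exposes outcome probabilities through the circuit, which is what "implicit generative model" means in this setting.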

Training a QCBM means optimizing the circuit parameters to minimize the Maximum Mean Discrepancy (MMD), a kernel-based distance between the circuit's output distribution and the empirical data distribution. We run this optimization on the Omega Functions runtime, using adjoint differentiation for gradient computation.
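For two probability vectors p and q over the same discretized outcomes, the squared MMD with kernel matrix K reduces to (p - q)^T K (p - q). The sketch below computes that quantity with a Gaussian kernel. The kernel choice, the bandwidth, and the toy distributions are assumptions; the post does not state which kernel the pipeline uses.

```python
import numpy as np

def gaussian_kernel(points, bandwidth=1.0):
    """Gaussian kernel matrix between all pairs of 1-D outcome locations."""
    d2 = (points[:, None] - points[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2(p, q, kernel):
    """Squared MMD between two pmfs defined on the same outcomes."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(diff @ kernel @ diff)

# Eight bins standing in for a discretized feature (illustrative only).
bins = np.arange(8, dtype=float)
K = gaussian_kernel(bins, bandwidth=1.5)

target = np.array([0.02, 0.05, 0.13, 0.30, 0.30, 0.13, 0.05, 0.02])
model = np.full(8, 1 / 8)  # an untrained model, here simply uniform

loss = mmd2(model, target, K)
print(f"MMD^2 between uniform model and target: {loss:.5f}")
```

In QCBM training, the model vector would be the circuit's Born-rule probabilities, and the optimizer would adjust the circuit parameters to drive this loss toward zero.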

Conditional TimeGAN

QCBMs handle the cross-sectional distribution at each timestep, but temporal dynamics require a sequential model. We use a Conditional TimeGAN — a time-series GAN architecture with an additional regime-conditioning input — to generate full synthetic trajectories. The TimeGAN's embedding network learns a latent representation of the temporal dynamics, and its generator produces new sequences that preserve autocorrelation structure, volatility clustering, and regime transition probabilities.
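One common way to feed a conditioning signal into a sequence generator is to append a one-hot regime vector to the noise at every timestep. The sketch below shows only that input construction. Whether finance_lab conditions this way is an assumption, and the function name and dimensions are hypothetical.

```python
import numpy as np

def conditioned_generator_input(regimes, n_regimes, noise_dim, rng):
    """Build generator input of shape (batch, T, noise_dim + n_regimes).

    regimes is a (batch, T) array of integer regime labels. Each timestep
    receives fresh Gaussian noise plus a one-hot encoding of its regime.
    """
    regimes = np.asarray(regimes)
    one_hot = np.eye(n_regimes)[regimes]  # (batch, T, n_regimes)
    noise = rng.standard_normal(regimes.shape + (noise_dim,))
    return np.concatenate([noise, one_hot], axis=-1)

rng = np.random.default_rng(0)
# Two toy regime paths: calm (0), transitional (1), drawdown (2).
regime_paths = np.array([[0, 0, 1, 2, 2],
                         [2, 2, 2, 1, 0]])
gen_input = conditioned_generator_input(regime_paths, n_regimes=3,
                                        noise_dim=4, rng=rng)
print(gen_input.shape)  # (2, 5, 7)
```

With this layout, requesting a crash scenario amounts to supplying a regime path that spends time in the drawdown state and letting the generator fill in the features.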

The QCBM and TimeGAN components are not alternatives; they work in tandem. The QCBM provides a rich prior over per-timestep feature distributions, and the TimeGAN learns to sequence those distributions into coherent trajectories.

Results and Use

The synthetic data passes standard statistical validation — Kolmogorov-Smirnov tests on marginals, autocorrelation function comparison, and stylized-fact checks (volatility clustering, leverage effect, fat tails). More importantly, portfolios stress-tested against synthetic crash scenarios show risk estimates consistent with what we observe in held-out real crisis periods.
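The checks named above can be sketched in a few lines. The helper names, the lag count, and the Student-t toy samples below are illustrative assumptions. The two series are placeholders, not output from finance_lab, so the printed numbers say nothing about the real pipeline.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def acf(x, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

def excess_kurtosis(x):
    """Fourth standardized moment minus 3; positive values mean fat tails."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.mean(x**4) / np.mean(x**2) ** 2 - 3.0)

rng = np.random.default_rng(0)
real = 0.01 * rng.standard_t(df=4, size=2000)   # stand-in for real returns
synth = 0.01 * rng.standard_t(df=4, size=2000)  # stand-in for synthetic

print("KS statistic on marginals:", round(ks_statistic(real, synth), 4))
# Autocorrelation of absolute returns is a common volatility-clustering check.
acf_gap = np.max(np.abs(acf(np.abs(real), 20) - acf(np.abs(synth), 20)))
print("max ACF gap over 20 lags:", round(float(acf_gap), 4))
print("excess kurtosis real / synthetic:",
      round(excess_kurtosis(real), 2), "/", round(excess_kurtosis(synth), 2))
```

A small KS statistic and a small ACF gap indicate that the marginals and the temporal structure agree. They do not show on their own that risk estimates transfer, which is why the held-out crisis comparison matters more.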

finance_lab is closed source and runs as an internal service. The regime labeling and some of the time-series utilities have been factored into our time_series_base Rust workspace, which we discuss in a separate post.
