SynthDrive

Telematics data: synthetically generated, actuarial-grade

SynthDrive generates policy-level synthetic telematics portfolios—driver variables, vehicle variables, usage and driving-behavior signals, claim counts, and claim amounts—without requiring access to proprietary insurer data.

Usage-Based Insurance (UBI)

The diagram illustrates how a usage-based insurance system works from data collection through to pricing and claims outcomes.

A telematics device captures how, when, and how far a vehicle is driven — along with hard acceleration, harsh braking, and cornering events. That behavioral data replaces demographic proxies in the risk assessment, producing a premium calibrated to actual driving rather than assumed risk, and supporting downstream functions like fault determination and vehicle recovery.

How SynthDrive Works

You specify a portfolio size and a random seed; SynthDrive returns a reproducible dataset with the actuarial structure needed for pricing, fraud, and UBI model development.

SynthDrive has a three-stage generation pipeline. From left to right, a Gaussian copula samples correlated driver, vehicle, and telematics features jointly; a zero-inflated negative binomial model assigns claim counts with an exposure offset, reflected in the characteristic zero-spike frequency distribution shown at center; and a Gamma model draws claim amounts for policies with at least one claim, shown as a right-skewed severity density at right. The three stages feed into a single output portfolio of 100,000 policy-level rows across 52 columns.
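
The three stages can be sketched in a few lines of NumPy. This is a minimal illustration of the technique and not the SynthDrive implementation: the function name `sketch_portfolio`, the three features, the correlation matrix, the marginals, and every coefficient are assumptions chosen only to make the example run.

```python
import math
import numpy as np

def sketch_portfolio(n_policies: int, seed: int) -> dict:
    """Toy three-stage generator; all names and numbers are illustrative assumptions."""
    rng = np.random.default_rng(seed)  # same seed -> same portfolio

    # Stage 1: Gaussian copula. Draw correlated normals, map them to uniforms
    # with the normal CDF, then apply each marginal's inverse CDF.
    corr = np.array([[1.0, -0.3, 0.2],
                     [-0.3, 1.0, 0.5],
                     [0.2, 0.5, 1.0]])
    z = rng.multivariate_normal(np.zeros(3), corr, size=n_policies)
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    u = np.clip(u, 1e-12, 1.0 - 1e-12)            # keep the logs below finite
    driver_age = 18.0 + 62.0 * u[:, 0]            # uniform on [18, 80]
    annual_km = -15000.0 * np.log1p(-u[:, 1])     # exponential, mean 15,000
    harsh_brakes = -4.0 * np.log1p(-u[:, 2])      # exponential, mean 4
    exposure = rng.uniform(0.1, 1.0, size=n_policies)  # policy-years

    # Stage 2: zero-inflated negative binomial counts with an exposure offset.
    # The negative binomial is drawn as a Gamma-Poisson mixture.
    mu = np.exp(-2.3 + 0.00002 * annual_km + 0.05 * harsh_brakes + np.log(exposure))
    r = 1.5                                        # dispersion (assumed)
    n_claims = rng.poisson(rng.gamma(shape=r, scale=mu / r))
    structural_zero = rng.random(n_policies) < 0.3  # zero-inflation probability (assumed)
    n_claims = np.where(structural_zero, 0, n_claims)

    # Stage 3: Gamma aggregate claim amount, only for policies with a claim.
    mean_severity = 2500.0 * np.exp(0.03 * harsh_brakes)
    k = 2.0                                        # per-claim Gamma shape (assumed)
    draw = rng.gamma(shape=k * np.maximum(n_claims, 1), scale=mean_severity / k)
    claim_amount = np.where(n_claims > 0, draw, 0.0)

    return {"driver_age": driver_age, "annual_km": annual_km,
            "harsh_brakes": harsh_brakes, "exposure": exposure,
            "n_claims": n_claims, "claim_amount": claim_amount}

portfolio = sketch_portfolio(n_policies=10_000, seed=42)
```

Calling the function twice with the same size and seed returns identical arrays, which is the reproducibility property described above.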

The methodology behind SynthDrive is described in (Homayounfar, 2026).

What Makes It Unique

Constraint-aware by design. Compositional driving variables are enforced to sum correctly, exposure bounds are hard constraints, and claim amounts are zero whenever claim counts are zero. The dataset cannot be silently invalid.
Frequency-severity decomposition built in. Claim counts follow a zero-inflated negative binomial model with an exposure offset. Severity is drawn from a Gamma model, risk-adjusted per policy. The pipeline matches the standard actuarial GLM framework, not a black-box regressor.
Validated against a public benchmark. GLM coefficients, marginal distributions, and frequency relativities are compared against the So–Boucher–Valdez (2021) public synthetic telematics dataset. The validation report documents what the generator reproduces, what is approximate, and what is not tested.
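
As a rough illustration of what constraint-aware generation means in practice, the sketch below enforces and then checks three constraints of the kind listed above. The function names, the array names (`day_share`, `exposure`, `n_claims`, `claim_amount`), and the exposure bounds are hypothetical and are not taken from the SynthDrive package.

```python
import numpy as np

def enforce_constraints(day_share, exposure, n_claims, claim_amount,
                        exposure_bounds=(1.0 / 365.0, 1.0)):
    """Project raw draws onto the constraints; bounds and names are assumed."""
    # Compositional shares: non-negative and summing to 1 in every row.
    day_share = np.clip(np.asarray(day_share, dtype=float), 0.0, None)
    totals = day_share.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    uniform = 1.0 / day_share.shape[1]          # fallback for all-zero rows
    day_share = np.where(totals > 0, day_share / safe_totals, uniform)

    # Exposure: clipped to hard bounds (in policy-years here).
    exposure = np.clip(np.asarray(exposure, dtype=float), *exposure_bounds)

    # Claim amounts: forced to zero wherever the claim count is zero.
    n_claims = np.asarray(n_claims)
    claim_amount = np.where(n_claims > 0, np.asarray(claim_amount, dtype=float), 0.0)
    return day_share, exposure, claim_amount


def violations(day_share, exposure, n_claims, claim_amount,
               exposure_bounds=(1.0 / 365.0, 1.0), tol=1e-9):
    """Return a description of each broken constraint (empty list means valid)."""
    lo, hi = exposure_bounds
    broken = []
    if not np.allclose(day_share.sum(axis=1), 1.0, atol=tol):
        broken.append("day_share rows must sum to 1")
    if np.any(exposure < lo) or np.any(exposure > hi):
        broken.append("exposure out of bounds")
    if np.any(claim_amount[n_claims == 0] != 0):
        broken.append("claim_amount must be 0 when n_claims is 0")
    return broken
```

Running `violations` on the output of `enforce_constraints` should return an empty list; running it on raw draws shows which constraints would otherwise have been broken silently.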

Use Cases

Build and benchmark UBI pricing models without requesting proprietary insurer data.
Test frequency and severity GLMs, GBMs, or neural claim models on a controlled synthetic portfolio before applying them to real data.
Evaluate synthetic-data algorithms against a structured actuarial baseline with known ground truth.

How It Works

SynthDrive generates each synthetic policy in three steps: a Gaussian copula produces correlated driver, vehicle, and telematics variables; a zero-inflated negative binomial model assigns claim counts scaled to exposure; and a Gamma model draws claim amounts for policies that claim. No neural networks, no GPU.
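
For readers who prefer notation, one standard way to write the frequency and severity steps is given below. The symbols and the parameterization are assumptions made for illustration (exposure e_i, covariates x_i, zero-inflation probability π, dispersion r, Gamma shape α); the paper and the algebraic specification in the repository define the exact form SynthDrive uses.

```latex
\[
\begin{aligned}
\log \mu_i &= \mathbf{x}_i^\top \boldsymbol{\beta} + \log e_i, \\
\Pr(N_i = 0) &= \pi + (1-\pi)\left(\frac{r}{r+\mu_i}\right)^{r}, \\
\Pr(N_i = n) &= (1-\pi)\,\frac{\Gamma(n+r)}{n!\,\Gamma(r)}
  \left(\frac{r}{r+\mu_i}\right)^{r}\left(\frac{\mu_i}{r+\mu_i}\right)^{n},
  \quad n \ge 1, \\
S_i \mid N_i > 0 &\sim \mathrm{Gamma}\!\left(\alpha,\ \nu_i/\alpha\right),
  \quad \log \nu_i = \mathbf{x}_i^\top \boldsymbol{\gamma}, \\
S_i &= 0 \quad \text{if } N_i = 0.
\end{aligned}
\]
```

Here the Gamma distribution is written as (shape, scale), so the conditional mean claim amount is ν_i, and the term log e_i is the exposure offset that scales expected claim counts to the time each policy is on risk.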

Parameters are calibrated from public synthetic telematics datasets (So et al., 2021; Duval et al., 2022).

Open-Source Access

The SynthDrive package is available on GitHub under an open-source license. Source code, documentation, and the formal algebraic specification are included in the repository.

Further Details

For research collaborations or licensing inquiries, contact us.

References

2026

SSRN

SynthDrive: An Actuarially Structured Synthetic Telematics Generator for Motor Insurance Research

Kambiz Homayounfar

Jun 2026

Access to real telematics data for academic and independent insurance research is severely restricted by proprietary and contractual barriers, and the only widely cited public benchmark—the So–Boucher–Valdez (2021) dataset—was generated by an opaque SMOTE-plus-neural-network pipeline that is hard to calibrate or extend. SynthDrive v0.1 is an open-source Python package that addresses this gap with a transparent three-stage pipeline: a Gaussian copula for correlated feature generation, a zero-inflated negative binomial model with an exposure offset for claim frequency, and a Gamma model for aggregate severity. All stages enforce actuarial and physical constraints, including compositional day-of-week proportions, monotone hard-event threshold sequences, and claim-amount consistency. Validated against the So–Boucher–Valdez (SBV) seed dataset at 100,000 policies, SynthDrive reproduces marginal distributions and generalized linear model (GLM) coefficient directions within acceptable tolerance. The package requires no GPU hardware, ships with a built-in validation report and a formal algebraic specification, and is provided with full data provenance disclosure: SynthDrive is calibrated from a public synthetic seed, not from real insurer data.

2022

T&F

How much telematics information do insurers need for claim classification?

Francis Duval, Jean-Philippe Boucher, and Mathieu Pigeon

North American Actuarial Journal, Jun 2022

It has been shown several times in the literature that telematics data collected in motor insurance help to better understand an insured’s driving risk. Insurers who use these data reap several benefits, such as a better estimate of the pure premium, more segmented pricing, and less adverse selection. The flip side of the coin is that collected telematics information is often sensitive and can therefore compromise policyholders’ privacy. Moreover, due to their large volume, this type of data is costly to store and hard to manipulate. These factors, combined with the fact that insurance regulators tend to issue more and more recommendations regarding the collection and use of telematics data, make it important for an insurer to determine the right amount of telematics information to collect. In addition to traditional contract information such as the age and gender of the insured, we have access to a telematics dataset where information is summarized by trip. We first derive several features of interest from these trip summaries before building a claim classification model using both traditional and telematics features. By comparing a few classification algorithms, we find that logistic regression with lasso penalty is the most suitable for our problem. Using this model, we develop a method to determine how much information about policyholders’ driving should be kept by an insurer. Using real data from a North American insurance company, we find that telematics data become redundant after about 3 months or 4000 km of observation, at least from a claim classification perspective.

2021

MDPI

Synthetic dataset generation of driver telematics

Banghee So, Jean-Philippe Boucher, and Emiliano A Valdez

Risks, Jun 2021

This article describes the techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset. The synthetic dataset generated has 100,000 policies that included observations regarding driver’s claims experience, together with associated classical risk variables and telematics-related variables. This work is aimed to produce a resource that can be used to advance models to assess risks for usage-based insurance. It follows a three-stage process while using machine learning algorithms. In the first stage, a synthetic portfolio of the space of feature variables is generated applying an extended SMOTE algorithm. The second stage is simulating values for the number of claims as multiple binary classifications applying feedforward neural networks. The third stage is simulating values for aggregated amount of claims as regression using feedforward neural networks, with number of claims included in the set of feature variables. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualization and data summarization produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work to be valuable.