Drift Monitoring

Data drift occurs when the statistical properties of incoming (production) data diverge from the data the model was trained on. Skyulf provides a DriftCalculator to detect this automatically.

When to use drift detection

Before running predictions on new data batches.
As a scheduled health check in production pipelines.
After a data source change (new CSV, new API feed, schema migration).

Quick example

import polars as pl
from skyulf.profiling import DriftCalculator

# Reference = your training data
reference = pl.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "income": [30000, 45000, 55000, 65000, 70000, 80000],
})

# Current = new production data
current = pl.DataFrame({
    "age": [60, 65, 70, 75, 80, 85],
    "income": [90000, 95000, 100000, 110000, 120000, 130000],
})

calc = DriftCalculator(reference, current)
report = calc.calculate_drift()

print(f"Drifted columns: {report.drifted_columns_count}")
for col, drift in report.column_drifts.items():
    print(f"  {col}: drift={drift.drift_detected}")
    for m in drift.metrics:
        print(f"    {m.metric}: {m.value:.4f} (threshold={m.threshold}, drifted={m.has_drift})")

Drift metrics

The DriftCalculator computes these metrics for each numeric column:

Metric	What it measures	Default threshold
Wasserstein distance	How much "work" to transform one distribution into the other	0.1 (normalized)
KS test (Kolmogorov-Smirnov)	Maximum distance between CDFs; returns a p-value	p < 0.05
PSI (Population Stability Index)	Binned distribution shift	0.2
KL divergence	Information-theoretic divergence	0.1

A column is flagged as "drifted" if any metric exceeds its threshold.

Custom thresholds

report = calc.calculate_drift(thresholds={
    "psi": 0.15,
    "ks": 0.01,
    "wasserstein": 0.05,
    "kl_divergence": 0.2,
})

Schema drift

The report also detects structural changes:

report.missing_columns — columns present in reference but absent in current data.
report.new_columns — columns in current data that were not in the reference.

Report structure

DriftReport(
    reference_rows=1000,
    current_rows=500,
    drifted_columns_count=2,
    column_drifts={
        "age": ColumnDrift(
            column="age",
            metrics=[...],
            drift_detected=True,
            suggestions=["Consider retraining..."],
        ),
    },
    missing_columns=[],
    new_columns=["new_feature"],
)

Data format

DriftCalculator works with Polars DataFrames. If your data is in Pandas:

import polars as pl

reference_pl = pl.from_pandas(reference_pd)
current_pl = pl.from_pandas(current_pd)

Dependencies

Drift calculation requires scipy (installed with skyulf-core by default).

Tips

Run drift detection before prediction to catch issues early.
Log drift reports over time to track gradual distribution shifts.
If drift is detected, consider retraining the model on recent data.
Non-numeric columns are currently skipped — encode them first if you need categorical drift detection.