Drift Monitoring
Data drift occurs when the statistical properties of incoming (production) data diverge from those of the data the model was trained on. Skyulf provides a DriftCalculator to detect this automatically.
When to use drift detection
- Before running predictions on new data batches.
- As a scheduled health check in production pipelines.
- After a data source change (new CSV, new API feed, schema migration).
Quick example
import polars as pl
from skyulf.profiling import DriftCalculator
# Reference = your training data
reference = pl.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "income": [30000, 45000, 55000, 65000, 70000, 80000],
})
# Current = new production data
current = pl.DataFrame({
    "age": [60, 65, 70, 75, 80, 85],
    "income": [90000, 95000, 100000, 110000, 120000, 130000],
})
calc = DriftCalculator(reference, current)
report = calc.calculate_drift()
print(f"Drifted columns: {report.drifted_columns_count}")
for col, drift in report.column_drifts.items():
    print(f" {col}: drift={drift.drift_detected}")
    for m in drift.metrics:
        print(f" {m.metric}: {m.value:.4f} (threshold={m.threshold}, drifted={m.has_drift})")
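A report like the one printed above can gate a prediction step. The following is a minimal sketch, not part of Skyulf: `should_block_predictions` and `max_drifted_columns` are hypothetical names, and the helper reads only the `missing_columns` and `drifted_columns_count` attributes that this page documents. `SimpleNamespace` objects stand in for real reports so the snippet runs on its own.

```python
from types import SimpleNamespace


def should_block_predictions(report, max_drifted_columns=0):
    """Return True when a batch looks unsafe to score.

    Hypothetical helper, not part of Skyulf. It only reads attributes
    that the drift report is documented to expose.
    """
    # Columns the model expects but the new batch lacks are always fatal.
    if report.missing_columns:
        return True
    # Otherwise tolerate up to `max_drifted_columns` drifted columns.
    return report.drifted_columns_count > max_drifted_columns


# Stand-in objects so the sketch runs without Skyulf installed.
clean = SimpleNamespace(missing_columns=[], drifted_columns_count=0)
shifted = SimpleNamespace(missing_columns=[], drifted_columns_count=2)
broken = SimpleNamespace(missing_columns=["income"], drifted_columns_count=0)

print(should_block_predictions(clean))    # False
print(should_block_predictions(shifted))  # True
print(should_block_predictions(broken))   # True
```

In a pipeline, the real report returned by `calc.calculate_drift()` would be passed in place of the stand-ins, and the tolerance would be tuned to how sensitive the model is to shifted inputs.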
Drift metrics
The DriftCalculator computes these metrics for each numeric column:
| Metric | What it measures | Default threshold |
|---|---|---|
| Wasserstein distance | How much "work" is needed to transform one distribution into the other | 0.1 (normalized) |
| KS test (Kolmogorov-Smirnov) | Maximum distance between CDFs; returns a p-value | p < 0.05 |
| PSI (Population Stability Index) | Binned distribution shift | 0.2 |
| KL divergence | Information-theoretic divergence | 0.1 |
A column is flagged as "drifted" if any metric crosses its threshold: a KS p-value below its threshold, or any other metric above its threshold.
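To make "binned distribution shift" concrete, here is a rough, standard-library sketch of PSI. It is not Skyulf's implementation: the equal-width bins taken from the reference range, the `bins` count, and the `eps` floor are all assumptions chosen for illustration. With r_i and c_i as the share of reference and current values in bin i, PSI sums (c_i - r_i) * ln(c_i / r_i) over the bins.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins.

    Illustrative only: bin edges come from the reference range, and
    `bins` / `eps` are assumptions, not Skyulf's settings.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant column

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference_age = [25, 30, 35, 40, 45, 50]
current_age = [60, 65, 70, 75, 80, 85]

print(psi(reference_age, reference_age))      # 0.0
print(psi(reference_age, current_age) > 0.2)  # True
```

The exact PSI value depends heavily on the binning and on `eps` when bins are empty, so values from this sketch should not be expected to match the numbers in a Skyulf report.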
Custom thresholds
report = calc.calculate_drift(thresholds={
    "psi": 0.15,
    "ks": 0.01,
    "wasserstein": 0.05,
    "kl_divergence": 0.2,
})
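The thresholds do not all point in the same direction, which is easy to miss when overriding them. The sketch below shows one plausible decision rule; it is not Skyulf's internal code. `metric_has_drift` is a hypothetical name, the defaults are copied from the metrics table, and both the merge-with-defaults behaviour and the reading of the KS value as a p-value are assumptions.

```python
# Defaults copied from the metrics table; keys follow the thresholds example.
DEFAULT_THRESHOLDS = {
    "wasserstein": 0.1,
    "ks": 0.05,
    "psi": 0.2,
    "kl_divergence": 0.1,
}


def metric_has_drift(metric, value, thresholds=None):
    """Hypothetical decision rule, not Skyulf's internal code."""
    # Assumption: user overrides are merged on top of the defaults.
    merged = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    limit = merged[metric]
    if metric == "ks":
        # For KS the value is treated as a p-value: small means the samples differ.
        return value < limit
    # For the distance-style metrics, large means the samples differ.
    return value > limit


print(metric_has_drift("ks", 0.03))                 # True
print(metric_has_drift("ks", 0.03, {"ks": 0.01}))   # False
print(metric_has_drift("psi", 0.18))                # False
print(metric_has_drift("psi", 0.18, {"psi": 0.15})) # True
```

Lowering the KS threshold makes detection stricter about evidence (fewer flags), while lowering the PSI, Wasserstein, or KL threshold makes detection more sensitive (more flags).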
Schema drift
The report also detects structural changes:
- report.missing_columns: columns present in the reference but absent from the current data.
- report.new_columns: columns in the current data that were not in the reference.
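Conceptually, these two fields are a set difference over column names. The sketch below illustrates the idea with plain lists; `schema_diff` is a hypothetical helper, not a Skyulf function, and the way Skyulf computes the fields internally may differ.

```python
def schema_diff(reference_columns, current_columns):
    """Return (missing, new) column names, preserving input order."""
    ref_set, cur_set = set(reference_columns), set(current_columns)
    # In the reference but not in the current data.
    missing = [c for c in reference_columns if c not in cur_set]
    # In the current data but not in the reference.
    new = [c for c in current_columns if c not in ref_set]
    return missing, new


missing, new = schema_diff(["age", "income"], ["age", "new_feature"])
print(missing)  # ['income']
print(new)      # ['new_feature']
```

With Polars, the column name lists are available as `reference.columns` and `current.columns`.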
Report structure
DriftReport(
    reference_rows=1000,
    current_rows=500,
    drifted_columns_count=2,
    column_drifts={
        "age": ColumnDrift(
            column="age",
            metrics=[...],
            drift_detected=True,
            suggestions=["Consider retraining..."],
        ),
    },
    missing_columns=[],
    new_columns=["new_feature"],
)
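For logging, the fields above can be flattened into a JSON-serializable record. This is a sketch under stated assumptions: `report_to_record` is a hypothetical helper, the record layout is arbitrary, and a `SimpleNamespace` mirrors only the attributes shown in the structure so the snippet runs without Skyulf.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def report_to_record(report, timestamp=None):
    """Flatten a drift report into a JSON-serializable dict (hypothetical helper)."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "timestamp": ts.isoformat(),
        "reference_rows": report.reference_rows,
        "current_rows": report.current_rows,
        "drifted_columns_count": report.drifted_columns_count,
        # Keep only the names of columns whose drift flag is set.
        "drifted_columns": sorted(
            name
            for name, drift in report.column_drifts.items()
            if drift.drift_detected
        ),
        "missing_columns": list(report.missing_columns),
        "new_columns": list(report.new_columns),
    }


# Stand-in for a real DriftReport, limited to the documented attributes.
fake_report = SimpleNamespace(
    reference_rows=1000,
    current_rows=500,
    drifted_columns_count=1,
    column_drifts={
        "age": SimpleNamespace(drift_detected=True),
        "income": SimpleNamespace(drift_detected=False),
    },
    missing_columns=[],
    new_columns=["new_feature"],
)

record = report_to_record(fake_report)
line = json.dumps(record)  # one line per run, ready to append to a log file
print(record["drifted_columns"])  # ['age']
```

Appending one such line per run to a file gives a simple history that can be charted to spot gradual shifts.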
Data format
DriftCalculator works with Polars DataFrames. If your data is in pandas, convert it first (pl.from_pandas requires pyarrow to be installed):
import polars as pl
reference_pl = pl.from_pandas(reference_pd)
current_pl = pl.from_pandas(current_pd)
Dependencies
Drift calculation requires scipy (installed with skyulf-core by default).
Tips
- Run drift detection before prediction to catch issues early.
- Log drift reports over time to track gradual distribution shifts.
- If drift is detected, consider retraining the model on recent data.
- Non-numeric columns are currently skipped; encode them first if you need categorical drift detection.
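When encoding categorical columns for drift detection, the reference and current data need to share one mapping; encoding each frame separately can assign different codes to the same category and produce spurious drift. A minimal standard-library sketch of a shared mapping follows. `build_category_codes` is a hypothetical helper and the city values are made-up illustration data.

```python
def build_category_codes(reference_values, current_values):
    """Map each distinct category to a stable integer code (hypothetical helper)."""
    # Build the mapping from the union so both datasets share the same codes.
    categories = sorted(set(reference_values) | set(current_values))
    return {category: code for code, category in enumerate(categories)}


reference_city = ["Paris", "Rome", "Paris", "Oslo"]
current_city = ["Rome", "Rome", "Lima", "Oslo"]

codes = build_category_codes(reference_city, current_city)
reference_encoded = [codes[v] for v in reference_city]
current_encoded = [codes[v] for v in current_city]

print(codes)              # {'Lima': 0, 'Oslo': 1, 'Paris': 2, 'Rome': 3}
print(reference_encoded)  # [2, 3, 2, 1]
print(current_encoded)    # [3, 3, 0, 1]
```

Integer codes impose an arbitrary order on the categories, so distance-based metrics such as Wasserstein are harder to interpret on encoded columns than frequency-based ones such as PSI.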