Overview
skyulf-core is a standalone ML pipeline library designed for reproducible feature engineering and modeling.
Key idea: explicit learned state
Skyulf-core uses a strict Calculator → Applier pattern:
- A Calculator learns from data and returns a `params` dictionary.
- An Applier takes `params` and transforms data (see the sketch below).
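A minimal sketch of the pattern, assuming a mean-imputation step; the class and function names here are illustrative only, not the actual skyulf-core API:

```python
import pandas as pd

# Illustrative only: these names are not the real skyulf-core API.
class MeanImputeCalculator:
    """Learns per-column means from the training data."""

    def fit(self, df: pd.DataFrame, columns: list[str]) -> dict:
        # Return the learned state as a plain dictionary.
        return {"means": {col: float(df[col].mean()) for col in columns}}


def mean_impute_applier(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Fills missing values using only the explicit params dictionary."""
    out = df.copy()
    for col, mean in params["means"].items():
        out[col] = out[col].fillna(mean)
    return out


# Learn from train, then reuse the exact same params on test.
train = pd.DataFrame({"age": [20.0, None, 40.0]})
test = pd.DataFrame({"age": [None, 30.0]})

params = MeanImputeCalculator().fit(train, ["age"])   # {"means": {"age": 30.0}}
train_t = mean_impute_applier(train, params)
test_t = mean_impute_applier(test, params)
```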
This differs from scikit-learn’s default pattern:
- In scikit-learn, `fit()` mutates the estimator/transformer object (e.g. `self.mean_`, `self.categories_`), and `transform()` uses those hidden internal attributes.
- In Skyulf-core, `fit()` returns the learned state explicitly as a plain `params` dictionary (ideally JSON-serializable; sometimes pickled for complex sklearn objects), and `apply()` uses only that dictionary.
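For contrast, this is what the hidden-state side looks like in plain scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X)                # learned state is written onto the object itself
print(scaler.mean_)          # [2.] -- hidden internal attribute
print(scaler.transform(X))   # transform() reads those attributes, not an explicit params dict
```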
Practically, this makes learned state easier to inspect, persist, and apply consistently across train/test/inference.
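Because `params` is a plain dictionary, persisting it can be as simple as writing JSON (assuming the values are JSON-serializable; params that wrap pickled sklearn objects would need a different route):

```python
import json

params = {"means": {"age": 30.0}}   # example params dict

# Persist the learned state alongside the run...
with open("impute_params.json", "w") as f:
    json.dump(params, f, indent=2)

# ...and load the exact same state later for inference.
with open("impute_params.json") as f:
    params = json.load(f)
```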
Two ways to use the library
1) Pipeline way (recommended)
Use `SkyulfPipeline` to run preprocessing + modeling end-to-end.
2) Component way (low-level)
Call calculators/appliers directly for debugging, testing, or custom scripts.
This is also where you’ll see `StatefulEstimator` used: it’s a small convenience wrapper that
keeps a fitted model artifact in memory and can run `fit_predict()` on a `SplitDataset`.
`SkyulfPipeline` uses the same underlying idea internally.
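A rough sketch of that component-level flow; `StatefulEstimator`, `fit_predict()` and `SplitDataset` are named above, but the import paths and constructor arguments below are assumptions, not the documented API:

```python
# Sketch only: import paths and constructor signatures are assumptions.
import pandas as pd
from skyulf.data import SplitDataset              # module named under "Where things live"
from skyulf.modeling import StatefulEstimator     # assumed import path

train_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0, 1, 0, 1]})
test_df = pd.DataFrame({"x": [5, 6], "y": [1, 0]})

split = SplitDataset(train=train_df, test=test_df)     # assumed constructor
est = StatefulEstimator("random_forest_classifier")    # assumed model spec argument
preds = est.fit_predict(split)                         # fitted artifact stays in memory on `est`
```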
How fit / transform works in Skyulf
At a high level:
- `SkyulfPipeline.fit(data, target_column=...)`
  - Runs preprocessing in order.
  - For each preprocessing step: the Calculator learns params (typically from train only), then the Applier applies those params to train/test/validation.
  - Trains the model and reports metrics.
- `SkyulfPipeline.predict(df)`
  - Applies the already-learned preprocessing params (no re-fitting).
  - Skips steps that only make sense during training (e.g., splitters / resampling).
  - Runs the trained model to produce predictions.
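Putting the two calls together; only `fit(data, target_column=...)` and `predict(df)` are taken from the description above, while the pipeline construction below is an assumed placeholder:

```python
# Sketch only: how steps are declared is an assumption; fit()/predict() follow
# the description above.
import pandas as pd
from skyulf import SkyulfPipeline   # assumed import path

data = pd.DataFrame({"age": [25.0, None, 40.0, 31.0], "churn": [0, 1, 0, 1]})

pipeline = SkyulfPipeline(steps=[...])       # assumed constructor / step config
pipeline.fit(data, target_column="churn")    # learns preprocessing params + trains the model

new_rows = pd.DataFrame({"age": [29.0, None]})
preds = pipeline.predict(new_rows)           # reuses learned params; no re-fitting, no split steps
```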
If you want reproducible “proof-style” checks (sklearn-style X/y split + leakage demonstration), see:
- Validation vs scikit-learn
Why splitting matters (leakage)
Many preprocessing nodes learn from data (means, categories, bin edges, vocabularies…). If you learn those from the full dataset and then evaluate on a test set, you leak information.
The safe pattern is:
- Split first (or provide a `SplitDataset`).
- Fit preprocessing on `train` only.
- Reuse the learned `params` to transform `test` / new inference data (see the sketch below).
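A concrete version of that pattern, using scikit-learn’s `train_test_split` and a plain dict standing in for a calculator’s learned `params`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"age": [22.0, None, 35.0, 41.0, None, 58.0]})

# 1) Split first.
train, test = train_test_split(df, test_size=0.33, random_state=0)

# 2) Fit preprocessing on train only: the learned state is just a dict.
params = {"age_mean": float(train["age"].mean())}

# 3) Reuse the learned params on test / new inference data.
test_filled = test.assign(age=test["age"].fillna(params["age_mean"]))
```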
Where things live
- `skyulf.preprocessing`: feature engineering nodes (imputation, encoding, scaling, …)
- `skyulf.modeling`: estimators (classification/regression + tuning)
- `skyulf.data`: `SplitDataset` for safe train/test/validation flows