Overview

skyulf-core is a standalone ML pipeline library designed for reproducible feature engineering and modeling.

Key idea: explicit learned state

Skyulf-core uses a strict Calculator → Applier pattern:

  • A Calculator learns from data and returns a params dictionary.
  • An Applier takes params and transforms data.

This differs from scikit-learn’s default pattern:

  • In scikit-learn, fit() mutates the estimator/transformer object (e.g. self.mean_, self.categories_), and transform() uses those hidden internal attributes.
  • In Skyulf-core, a Calculator's fit() returns the learned state explicitly as a plain params dictionary (ideally JSON-serializable; sometimes pickled for complex sklearn objects), and an Applier's apply() uses only that dictionary.

Practically, this makes learned state easier to inspect, persist, and apply consistently across train/test/inference.
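The pattern above can be sketched in plain Python. This is a minimal illustration, not the skyulf-core API: the class names MeanImputerCalculator and MeanImputerApplier, the row-of-dicts data layout, and the "fill_values" key are all assumptions made for the example.

```python
import json
from statistics import mean


# Hypothetical names for illustration only; not the skyulf-core API.
class MeanImputerCalculator:
    """Learns one fill value per column and returns it as a plain dict."""

    def fit(self, rows, columns):
        fill_values = {}
        for col in columns:
            observed = [r[col] for r in rows if r[col] is not None]
            fill_values[col] = mean(observed)
        return {"fill_values": fill_values}


class MeanImputerApplier:
    """Holds no state: everything it needs arrives in `params`."""

    def apply(self, rows, params):
        fills = params["fill_values"]
        return [
            {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
            for r in rows
        ]


train = [{"age": 20.0}, {"age": None}, {"age": 40.0}]
new_data = [{"age": None}, {"age": 31.0}]

params = MeanImputerCalculator().fit(train, columns=["age"])

# The learned state is ordinary data, so it survives a JSON round trip.
restored = json.loads(json.dumps(params))

imputed = MeanImputerApplier().apply(new_data, restored)
print(params)   # {'fill_values': {'age': 30.0}}
print(imputed)  # [{'age': 30.0}, {'age': 31.0}]
```

Because the applier reads only from the dictionary, the same params can be saved after training and reloaded at inference time without keeping the calculator object around.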

Two ways to use the library

1) Pipeline way (high-level)

Use SkyulfPipeline to run preprocessing + modeling end-to-end.

2) Component way (low-level)

Call calculators/appliers directly for debugging, testing, or custom scripts.

This is also where you’ll see StatefulEstimator used: it’s a small convenience wrapper that keeps a fitted model artifact in memory and can run fit_predict() on a SplitDataset. SkyulfPipeline uses the same underlying idea internally.

How fit / transform works in Skyulf

At a high level:

  • SkyulfPipeline.fit(data, target_column=...)

    • Runs preprocessing in order.
    • For each preprocessing step: Calculator learns params (typically from train only), then Applier applies those params to train/test/validation.
    • Trains the model and reports metrics.
  • SkyulfPipeline.predict(df)

    • Applies the already-learned preprocessing params (no re-fitting).
    • Skips steps that only make sense during training (e.g., splitters / resampling).
    • Runs the trained model to produce predictions.
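The control flow described above can be sketched with a toy pipeline. Everything here is hypothetical (TinyPipeline, fit_scaler, apply_scaler, and the threshold "model" are stand-ins, not skyulf-core classes or signatures); the point is only that fit() learns params from train and reuses them, while predict() never re-fits.

```python
# Hypothetical sketch of the control flow; not the skyulf-core API.

def fit_scaler(train_values):
    """Calculator role: learn min and max from the training values only."""
    return {"min": min(train_values), "max": max(train_values)}


def apply_scaler(values, params):
    """Applier role: rescale using only the learned params."""
    span = (params["max"] - params["min"]) or 1.0
    return [(v - params["min"]) / span for v in values]


class TinyPipeline:
    def __init__(self):
        self.learned = {}      # step name -> params dict
        self.threshold = None  # stand-in for a trained model

    def fit(self, train_x, train_y, test_x, test_y):
        # Learn preprocessing params from train, then apply them to both splits.
        self.learned["scaler"] = fit_scaler(train_x)
        train_scaled = apply_scaler(train_x, self.learned["scaler"])
        test_scaled = apply_scaler(test_x, self.learned["scaler"])

        # Toy "model": midpoint between the two class means of the scaled feature.
        pos = [x for x, y in zip(train_scaled, train_y) if y == 1]
        neg = [x for x, y in zip(train_scaled, train_y) if y == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

        preds = [int(x > self.threshold) for x in test_scaled]
        accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
        return {"test_accuracy": accuracy}

    def predict(self, values):
        # Reuse the stored params; nothing is learned from `values`.
        scaled = apply_scaler(values, self.learned["scaler"])
        return [int(x > self.threshold) for x in scaled]


pipe = TinyPipeline()
metrics = pipe.fit(
    train_x=[0.0, 2.0, 8.0, 10.0], train_y=[0, 0, 1, 1],
    test_x=[1.0, 9.0], test_y=[0, 1],
)
predictions = pipe.predict([3.0, 7.0, 12.0])
print(metrics, predictions)
```

Note that 12.0 lies outside the training range and is scaled with the stored min and max anyway, which is what "no re-fitting" means in practice.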

If you want reproducible “proof-style” checks (sklearn-style X/y split + leakage demonstration), see:

  • Validation vs scikit-learn

Why splitting matters (leakage)

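Many preprocessing nodes learn from data (means, categories, bin edges, vocabularies…). If you learn those from the full dataset and then evaluate on a test set, information about the test rows has already shaped the transformations, so the evaluation is no longer a fair estimate of performance on unseen data.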

The safe pattern is:

  • Split first (or provide a SplitDataset).
  • Fit preprocessing on train only.
  • Reuse the learned params to transform test / new inference data.
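A small numeric sketch of the difference, using only the standard library. The data and the center() helper are made up for illustration; the learned statistic here is a mean, but the same reasoning applies to categories, bin edges, or vocabularies.

```python
from statistics import mean

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value ends up in the test set
train, test = values[:4], values[4:]   # split first

leaky_mean = mean(values)  # learned from train + test: already "knows" the outlier
safe_mean = mean(train)    # learned from train only


def center(xs, mu):
    """Apply a learned mean to any split."""
    return [x - mu for x in xs]


print(leaky_mean, safe_mean)        # 22.0 2.5
print(center(test, leaky_mean))     # [78.0]
print(center(test, safe_mean))      # [97.5]
```

With the leaky mean, the test outlier has pulled the learned statistic toward itself, so the transformed test value looks less extreme than it would to a model that had only ever seen the training data.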

Where things live

  • skyulf.preprocessing: feature engineering nodes (imputation, encoding, scaling, …)
  • skyulf.modeling: estimators (classification/regression + tuning)
  • skyulf.data: SplitDataset for safe train/test/validation flows