API: preprocessing

The preprocessing package contains Calculator/Applier nodes and the FeatureEngineer orchestrator.

`skyulf.preprocessing`

`BaseApplier`

Bases: ABC

Source code in skyulf-core/skyulf/preprocessing/base.py

class BaseApplier(ABC):
    @abstractmethod
    def apply(self, df: pd.DataFrame | SkyulfDataFrame | tuple, params: dict[str, Any]) -> Any:
        """
        Applies the transformation using fitted parameters.

        The return type is intentionally `Any` because the concrete shape
        depends on the input: passing a `DataFrame` returns a `DataFrame`;
        passing an `(X, y)` tuple returns a tuple; splitters return
        `SplitDataset`. Encoding every case as a union forces callers to
        defensively narrow on every use, which is worse than `Any` here.
        """

`apply(df, params)` `abstractmethod`

Applies the transformation using fitted parameters.

The return type is intentionally `Any` because the concrete shape depends on the input: passing a `DataFrame` returns a `DataFrame`; passing an `(X, y)` tuple returns a tuple; splitters return `SplitDataset`. Encoding every case as a union forces callers to defensively narrow on every use, which is worse than `Any` here.
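
The input-dependent return shape described above can be illustrated with a minimal sketch. It redeclares a local stand-in for `BaseApplier` and uses plain lists instead of DataFrames so that it runs without skyulf installed; `ShiftApplier` and its `offset` parameter are hypothetical and not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseApplier(ABC):
    """Local stand-in mirroring the abstract interface documented here."""

    @abstractmethod
    def apply(self, df: Any, params: dict[str, Any]) -> Any: ...


class ShiftApplier(BaseApplier):
    """Hypothetical applier: subtracts a fitted offset from every value."""

    def apply(self, df: Any, params: dict[str, Any]) -> Any:
        offset = params["offset"]
        if isinstance(df, tuple):
            # (X, y) in -> (X, y) out; only X is transformed, y passes through.
            X, y = df
            return [v - offset for v in X], y
        # Plain container in -> plain container out.
        return [v - offset for v in df]


applier = ShiftApplier()
plain = applier.apply([1.0, 2.0, 3.0], {"offset": 1.0})
paired = applier.apply(([1.0, 2.0], [0, 1]), {"offset": 1.0})
```

A caller that knows which shape it passed in also knows which shape comes back, which is the argument the docstring makes for `Any` over a wide union.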

Source code in skyulf-core/skyulf/preprocessing/base.py

@abstractmethod
def apply(self, df: pd.DataFrame | SkyulfDataFrame | tuple, params: dict[str, Any]) -> Any:
    """
    Applies the transformation using fitted parameters.

    The return type is intentionally `Any` because the concrete shape
    depends on the input: passing a `DataFrame` returns a `DataFrame`;
    passing an `(X, y)` tuple returns a tuple; splitters return
    `SplitDataset`. Encoding every case as a union forces callers to
    defensively narrow on every use, which is worse than `Any` here.
    """

`BaseCalculator`

Bases: ABC

Source code in skyulf-core/skyulf/preprocessing/base.py

class BaseCalculator(ABC):
    @abstractmethod
    def fit(
        self, df: pd.DataFrame | SkyulfDataFrame | tuple, config: dict[str, Any]
    ) -> Mapping[str, Any]:
        """
        Calculates parameters from the training data.
        Returns a Mapping of fitted parameters (typically a TypedDict
        ``*Artifact`` declared in ``preprocessing._artifacts``). The return
        type is ``Mapping`` rather than ``Dict`` so concrete TypedDict
        subclasses are valid LSP-substitutable returns.
        """

    def infer_output_schema(
        self, input_schema: SkyulfSchema, config: dict[str, Any]
    ) -> SkyulfSchema | None:
        """Best-effort prediction of the output schema from config alone.

        Override this in concrete Calculators when the output columns/dtypes
        can be derived purely from ``input_schema`` and ``config`` (i.e.
        without seeing data). Examples:

        * Scalers — pass through (output == input).
        * Drop columns by name — drop the configured names.
        * One-hot — adds K columns per categorical (K is data-dependent →
          return ``None``).

        Default returns ``None`` to signal "unknown / data-dependent";
        callers should fall back to runtime introspection.
        """
        return None

`fit(df, config)` `abstractmethod`

Calculates parameters from the training data. Returns a `Mapping` of fitted parameters (typically a TypedDict `*Artifact` declared in `preprocessing._artifacts`). The return type is `Mapping` rather than `Dict` so concrete TypedDict subclasses are valid LSP-substitutable returns.
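
As a rough illustration of the `Mapping` return type, the sketch below narrows `fit` to a TypedDict in a subclass. Type checkers generally treat a TypedDict as compatible with `Mapping[str, ...]` but not with `dict[str, Any]`, which is why the override is accepted. `MeanArtifact` and `MeanCalculator` are hypothetical names, and `BaseCalculator` is redeclared locally so the example runs on its own.

```python
from abc import ABC, abstractmethod
from collections.abc import Mapping
from typing import Any, TypedDict


class MeanArtifact(TypedDict):
    """Hypothetical artifact shape, invented for this example."""

    type: str
    mean: float


class BaseCalculator(ABC):
    """Local stand-in mirroring the abstract interface documented here."""

    @abstractmethod
    def fit(self, df: Any, config: dict[str, Any]) -> Mapping[str, Any]: ...


class MeanCalculator(BaseCalculator):
    # Narrowing the return type to a TypedDict is a valid override because
    # the base class promises only a read-only Mapping.
    def fit(self, df: Any, config: dict[str, Any]) -> MeanArtifact:
        values = list(df)
        return {"type": "mean", "mean": sum(values) / len(values)}


artifact = MeanCalculator().fit([2.0, 4.0, 6.0], {})
```

At runtime the artifact is an ordinary `dict`; the TypedDict only affects static checking.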

Source code in skyulf-core/skyulf/preprocessing/base.py

@abstractmethod
def fit(
    self, df: pd.DataFrame | SkyulfDataFrame | tuple, config: dict[str, Any]
) -> Mapping[str, Any]:
    """
    Calculates parameters from the training data.
    Returns a Mapping of fitted parameters (typically a TypedDict
    ``*Artifact`` declared in ``preprocessing._artifacts``). The return
    type is ``Mapping`` rather than ``Dict`` so concrete TypedDict
    subclasses are valid LSP-substitutable returns.
    """

`infer_output_schema(input_schema, config)`

Best-effort prediction of the output schema from config alone.

Override this in concrete Calculators when the output columns/dtypes can be derived purely from input_schema and config (i.e. without seeing data). Examples:

* Scalers: pass through (output == input).
* Drop columns by name: drop the configured names.
* One-hot: adds K columns per categorical (K is data-dependent → return `None`).

Default returns None to signal "unknown / data-dependent"; callers should fall back to runtime introspection.
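
The three cases listed above can be sketched as plain functions. `SkyulfSchema` is not documented on this page, so the example assumes a hypothetical stand-in, a `dict` mapping column name to dtype string; the function names and the `columns` config key are likewise assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical stand-in for SkyulfSchema: column name -> dtype string.
Schema = dict[str, str]


def scaler_schema(input_schema: Schema, config: dict[str, Any]) -> Optional[Schema]:
    # Scalers keep every column and dtype, so the output equals the input.
    return dict(input_schema)


def drop_columns_schema(input_schema: Schema, config: dict[str, Any]) -> Optional[Schema]:
    # The dropped names come from config alone, so no data is needed.
    to_drop = set(config.get("columns", []))
    return {col: dtype for col, dtype in input_schema.items() if col not in to_drop}


def one_hot_schema(input_schema: Schema, config: dict[str, Any]) -> Optional[Schema]:
    # The number of new columns depends on the categories seen in the data,
    # so the honest answer before fitting is "unknown".
    return None


schema = {"age": "float64", "city": "object"}
scaled = scaler_schema(schema, {})
dropped = drop_columns_schema(schema, {"columns": ["city"]})
encoded = one_hot_schema(schema, {"columns": ["city"]})
```

A caller receiving `None` would then fall back to inspecting the transformed data at runtime, as the docstring recommends.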

Source code in skyulf-core/skyulf/preprocessing/base.py

def infer_output_schema(
    self, input_schema: SkyulfSchema, config: dict[str, Any]
) -> SkyulfSchema | None:
    """Best-effort prediction of the output schema from config alone.

    Override this in concrete Calculators when the output columns/dtypes
    can be derived purely from ``input_schema`` and ``config`` (i.e.
    without seeing data). Examples:

    * Scalers — pass through (output == input).
    * Drop columns by name — drop the configured names.
    * One-hot — adds K columns per categorical (K is data-dependent →
      return ``None``).

    Default returns ``None`` to signal "unknown / data-dependent";
    callers should fall back to runtime introspection.
    """
    return None

`CustomBinningCalculator`

Bases: BaseCalculator

Apply user-supplied bin edges to selected columns.
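
A minimal sketch of the idea, using only the standard library: the supplied edges are sorted once and shared by every selected column, as in the `fit` listing below. How the applier assigns values to bins is not shown on this page, so `assign_bin` assumes half-open intervals `[edge_i, edge_i+1)` and returns `None` for out-of-range values; both function names are hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def fit_custom_bins(columns: list[str], bins: list[float]) -> dict[str, list[float]]:
    """Map every selected column to the same sorted list of edges."""
    if not bins:
        return {}
    sorted_bins = sorted(bins)
    return {col: sorted_bins for col in columns}


def assign_bin(value: float, edges: list[float]) -> Optional[int]:
    """Index of the interval [edges[i], edges[i + 1]) containing value.

    The interval convention is an assumption made for this example.
    """
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


edge_map = fit_custom_bins(["age"], [65, 0, 18])
edges = edge_map["age"]
bins_for_ages = [assign_bin(age, edges) for age in (0, 17, 18, 70)]
```

Passing the edges unsorted is harmless here because fitting sorts them before they are stored.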

Source code in skyulf-core/skyulf/preprocessing/bucketing.py

@NodeRegistry.register("CustomBinning", CustomBinningApplier)
@node_meta(
    id="CustomBinning",
    name="Custom Binning",
    category="Preprocessing",
    description="Bin data using custom edges.",
    params={"bins": [], "columns": []},
)
class CustomBinningCalculator(BaseCalculator):
    """Apply user-supplied bin edges to selected columns."""

    @fit_method
    def fit(self, X: Any, _y: Any, config: dict[str, Any]) -> GeneralBinningArtifact:
        if user_picked_no_columns(config):
            return cast(GeneralBinningArtifact, {})

        X = _to_pandas_for_fit(X)
        columns = resolve_columns(X, config, detect_numeric_columns)
        bins = config.get("bins")

        bin_edges_map: dict[str, list[float]] = {}
        if bins:
            sorted_bins = sorted(bins)
            for col in columns:
                if col in X.columns:
                    bin_edges_map[col] = sorted_bins

        artifact: dict[str, Any] = {
            "type": "general_binning",  # Reuses GeneralBinningApplier.
            "bin_edges": bin_edges_map,
        }
        artifact.update(_passthrough_artifact_options(config))
        return cast(GeneralBinningArtifact, artifact)

`FeatureEngineer`

Orchestrates a sequence of feature engineering steps.

Source code in skyulf-core/skyulf/preprocessing/pipeline.py

class FeatureEngineer:
    """
    Orchestrates a sequence of feature engineering steps.
    """

    # Resampling steps (SMOTE/undersampling) must only ever run on the train
    # split -- applying them to test/validation would fabricate synthetic rows
    # or delete real held-out rows purely to balance classes, corrupting any
    # metrics later computed on that "held-out" data. Kept as a single shared
    # constant (rather than duplicated literals) so `transform()`, `_run_step`,
    # and `_collect_step_metrics` can't drift out of sync with each other.
    _RESAMPLING_TYPES = {"Oversampling", "Undersampling"}

    def __init__(
        self,
        steps_config: Sequence[PreprocessingStepConfig | dict[str, Any]],
    ):
        # `Sequence` (covariant) accepts list[dict] or list[PreprocessingStepConfig].
        self.steps_config = steps_config
        self.fitted_steps: list[dict[str, Any]] = []

    def transform(self, data: pd.DataFrame | SkyulfDataFrame) -> pd.DataFrame | SkyulfDataFrame:
        """
        Apply fitted transformations to new data.
        """
        current_data = data

        for step in self.fitted_steps:
            name = step["name"]
            transformer_type = step["type"]
            applier = step["applier"]
            artifact = step["artifact"]

            # Skip splitters and resamplers during inference/transform
            if transformer_type in [
                "TrainTestSplitter",
                "feature_target_split",
                *self._RESAMPLING_TYPES,
            ]:
                continue

            logger.debug(f"Applying step: {name} ({transformer_type})")
            current_data = applier.apply(current_data, artifact)

        return current_data

    def fit_transform(self, data: pd.DataFrame | SkyulfDataFrame | Any, node_id_prefix="") -> Any:
        """
        Runs the pipeline on data.
        Returns: (transformed_data, metrics_dict)
        """
        self.fitted_steps = []  # Reset fitted steps
        current_data = data
        metrics: dict[str, Any] = {}

        for i, step in enumerate(self.steps_config):
            name = step["name"]
            transformer_type = step["transformer"]
            params = step.get("params", {})

            logger.info(f"Running step {i}: {name} ({transformer_type})")
            logger.debug(f"FeatureEngineer running step {i}: {name} ({transformer_type})")
            logger.debug(f"current_data type: {type(current_data)}")

            # Snapshot before for shape-delta + Winsorize value-clipping metrics
            rows_before, cols_before = get_data_stats(current_data)
            data_before = current_data

            calculator, applier = self._get_transformer_components(transformer_type)
            step_node_id = f"{node_id_prefix}_{name}"

            current_data, fitted_params, transformer_inst = self._run_step(
                transformer_type=transformer_type,
                name=name,
                calculator=calculator,
                applier=applier,
                step_node_id=step_node_id,
                current_data=current_data,
                params=params,
            )

            if transformer_inst is not None:
                # Add node-level performance metrics directly into `metrics` dictionary
                metrics["fit_time"] = getattr(transformer_inst, "fit_time", 0.0)
                metrics["peak_memory_bytes"] = getattr(transformer_inst, "peak_memory_bytes", 0)
                metrics["rows_in"] = getattr(transformer_inst, "rows_in", 0)
                metrics["rows_out"] = getattr(transformer_inst, "rows_out", 0)

            logger.debug(f"Step {i} complete. New data type: {type(current_data)}")

            rows_after, cols_after = get_data_stats(current_data)
            self._collect_step_metrics(
                transformer_type=transformer_type,
                fitted_params=fitted_params,
                data_before=data_before,
                current_data=current_data,
                params=params,
                rows_before=rows_before,
                cols_before=cols_before,
                rows_after=rows_after,
                cols_after=cols_after,
                name=name,
                metrics=metrics,
            )

        return current_data, metrics

    # ------------------------------------------------------------------
    # Step execution
    # ------------------------------------------------------------------

    def _run_step(
        self,
        *,
        transformer_type: str,
        name: str,
        calculator: Any,
        applier: Any,
        step_node_id: str,
        current_data: Any,
        params: dict[str, Any],
    ) -> tuple:  # Returns (data, params, transformer)
        """Execute one pipeline step. Returns (new_data, fitted_params, transformer).

        Splitters change the data structure (DataFrame -> SplitDataset / (X, y)),
        so they bypass StatefulTransformer and return ``None`` as the transformer;
        everything else goes through the standard fit_transform wrapper and is
        appended to fitted_steps.
        """
        transformer = StatefulTransformer(
            calculator,
            applier,
            step_node_id,
            apply_on_test=transformer_type not in self._RESAMPLING_TYPES,
            apply_on_validation=transformer_type not in self._RESAMPLING_TYPES,
        )
        fitted_params: dict[str, Any] = {}

        if transformer_type == "TrainTestSplitter":
            logger.debug("Handling TrainTestSplitter")
            if isinstance(current_data, (pd.DataFrame, SkyulfDataFrame, tuple)):
                params = calculator.fit(current_data, params)
                current_data = applier.apply(current_data, params)
            else:
                logger.debug(f"Skipping TrainTestSplitter. current_data is {type(current_data)}")
                logger.warning(
                    "Attempting to split an already split dataset. Skipping TrainTestSplitter."
                )
            return current_data, fitted_params, None

        if transformer_type == "feature_target_split":
            logger.debug("Handling feature_target_split")
            params = calculator.fit(current_data, params)
            current_data = applier.apply(current_data, params)
            return current_data, fitted_params, None

        logger.debug("Handling standard transformer via StatefulTransformer")
        current_data = transformer.fit_transform(current_data, params)
        fitted_params = transformer.params
        self.fitted_steps.append(
            {
                "name": name,
                "type": transformer_type,
                "applier": applier,
                "artifact": fitted_params,
            }
        )
        return current_data, fitted_params, transformer

    # ------------------------------------------------------------------
    # Metrics collection
    # ------------------------------------------------------------------

    # Transformer-type groups, kept as class constants so dispatch is data-driven.
    _IMPUTATION_TYPES = {"SimpleImputer", "KNNImputer", "IterativeImputer"}
    _FEATURE_SELECTION_TYPES = {
        "feature_selection",
        "UnivariateSelection",
        "ModelBasedSelection",
        "VarianceThreshold",
    }
    _SCALING_TYPES = {"StandardScaler", "MinMaxScaler", "RobustScaler", "MaxAbsScaler"}
    _OUTLIER_TYPES = {"IQR", "Winsorize", "ZScore", "EllipticEnvelope"}
    _BUCKETING_TYPES = {
        "GeneralBinning",
        "EqualWidthBinning",
        "EqualFrequencyBinning",
        "CustomBinning",
        "KBinsDiscretizer",
    }
    _FEATURE_GEN_TYPES = {"FeatureMath", "FeatureGenerationNode"}
    _ROW_DROP_TYPES = {
        "DropMissingRows",
        "Deduplicate",
        "IQR",
        "ZScore",
        "EllipticEnvelope",
        "Winsorize",
    }
    _ENCODER_TYPES = {
        "OneHotEncoder",
        "LabelEncoder",
        "OrdinalEncoder",
        "TargetEncoder",
        "HashEncoder",
        "DummyEncoder",
    }

    def _collect_step_metrics(
        self,
        *,
        transformer_type: str,
        fitted_params: dict[str, Any],
        data_before: Any,
        current_data: Any,
        params: dict[str, Any],
        rows_before: int,
        cols_before: Any,
        rows_after: int,
        cols_after: Any,
        name: str,
        metrics: dict[str, Any],
    ) -> None:
        """Aggregate per-step metrics into the running metrics dict."""
        try:
            if fitted_params:
                self._metrics_from_fitted_params(
                    transformer_type, fitted_params, data_before, current_data, metrics
                )
        except Exception as e:
            logger.warning(f"Failed to retrieve metrics for step {name}: {e}")

        if transformer_type in self._RESAMPLING_TYPES:
            self._metrics_resampling(current_data, params, metrics)

        if rows_after > 0 or cols_after:
            self._metrics_shape_change(
                transformer_type,
                data_before,
                current_data,
                params,
                rows_before,
                cols_before,
                rows_after,
                cols_after,
                metrics,
            )

    @staticmethod
    def _copy_present_keys(
        fitted_params: dict[str, Any], metrics: dict[str, Any], keys: tuple[str, ...]
    ) -> None:
        """Copy each key from fitted_params into metrics, skipping keys that aren't present."""
        for key in keys:
            if key in fitted_params:
                metrics[key] = fitted_params[key]

    # Each rule maps a set of transformer types to the fitted_params/metrics key
    # to copy over when that transformer type produced it. Keys are the same on
    # both sides for every current rule.
    _OUTLIER_METRIC_RULES: tuple[tuple[frozenset[str], str], ...] = (
        (frozenset({"IQR", "Winsorize"}), "bounds"),
        (frozenset({"ZScore"}), "stats"),
        (frozenset({"EllipticEnvelope"}), "contamination"),
    )

    def _apply_outlier_metrics(
        self, transformer_type: str, fitted_params: dict[str, Any], metrics: dict[str, Any]
    ) -> None:
        """Populate outlier-detection related metrics (warnings/bounds/stats/contamination)."""
        if transformer_type in self._OUTLIER_TYPES and "warnings" in fitted_params:
            metrics["warnings"] = fitted_params["warnings"]
        for types, key in self._OUTLIER_METRIC_RULES:
            if transformer_type in types and key in fitted_params:
                metrics[key] = fitted_params[key]

    def _apply_feature_gen_metrics(
        self,
        fitted_params: dict[str, Any],
        data_before: Any,
        current_data: Any,
        metrics: dict[str, Any],
    ) -> None:
        """Populate operations count/list and newly generated feature columns metrics."""
        if "operations" in fitted_params:
            metrics["operations_count"] = len(fitted_params["operations"])
            metrics["operations"] = fitted_params["operations"]
        new_cols = self._diff_generated_columns(data_before, current_data)
        if new_cols is not None:
            metrics["generated_features"] = new_cols

    def _metrics_from_fitted_params(
        self,
        transformer_type: str,
        fitted_params: dict[str, Any],
        data_before: Any,
        current_data: Any,
        metrics: dict[str, Any],
    ) -> None:
        if transformer_type in self._IMPUTATION_TYPES:
            self._copy_present_keys(
                fitted_params, metrics, ("missing_counts", "total_missing", "fill_values")
            )

        if transformer_type in self._FEATURE_SELECTION_TYPES:
            self._copy_present_keys(
                fitted_params,
                metrics,
                (
                    "feature_scores",
                    "p_values",
                    "feature_importances",
                    "variances",
                    "ranking",
                    "selected_columns",
                ),
            )

        if transformer_type in self._SCALING_TYPES:
            self._copy_present_keys(
                fitted_params,
                metrics,
                (
                    "mean",
                    "scale",
                    "var",
                    "min",
                    "data_min",
                    "data_max",
                    "center",
                    "max_abs",
                    "columns",
                ),
            )

        self._apply_outlier_metrics(transformer_type, fitted_params, metrics)

        if transformer_type in self._BUCKETING_TYPES:
            self._copy_present_keys(fitted_params, metrics, ("bin_edges", "n_bins"))

        if transformer_type in self._FEATURE_GEN_TYPES:
            self._apply_feature_gen_metrics(fitted_params, data_before, current_data, metrics)

    @staticmethod
    def _columns_diff_if_dataframes(before: Any, after: Any):
        """Return the column-set difference if both objects are DataFrames, else None."""
        if isinstance(before, pd.DataFrame | SkyulfDataFrame) and isinstance(
            after, pd.DataFrame | SkyulfDataFrame
        ):
            return list(set(after.columns) - set(before.columns))
        return None

    @classmethod
    def _diff_generated_columns_split_dataset(
        cls, data_before: SplitDataset, current_data: SplitDataset
    ):
        """Diff columns for the SplitDataset case, handling both DataFrame and (X, y) train shapes."""
        before_train, after_train = data_before.train, current_data.train
        diff = cls._columns_diff_if_dataframes(before_train, after_train)
        if diff is not None:
            return diff
        if isinstance(before_train, tuple) and isinstance(after_train, tuple):
            x_before, _ = before_train
            x_after, _ = after_train
            return cls._columns_diff_if_dataframes(x_before, x_after)
        return None

    @classmethod
    def _diff_generated_columns(cls, data_before: Any, current_data: Any):
        """Return the list of newly added columns between two pipeline data objects.

        Handles plain DataFrames, SplitDatasets of DataFrames, and (X, y) tuple variants.
        Returns None if the structures don't allow a meaningful diff.
        """
        diff = cls._columns_diff_if_dataframes(data_before, current_data)
        if diff is not None:
            return diff

        if isinstance(data_before, SplitDataset) and isinstance(current_data, SplitDataset):
            return cls._diff_generated_columns_split_dataset(data_before, current_data)
        return None

    @staticmethod
    def _extract_y_from_split_dataset(current_data: SplitDataset, params: dict[str, Any]):
        """Pull the target Series out of a SplitDataset's train split (tuple or DataFrame form)."""
        if isinstance(current_data.train, tuple):
            _, y_res = current_data.train
            return y_res
        if isinstance(current_data.train, (pd.DataFrame, SkyulfDataFrame)):
            target_col = params.get("target_column")
            if target_col and target_col in current_data.train.columns:
                return current_data.train[target_col]
        return None

    @classmethod
    def _extract_y_for_resampling(cls, current_data: Any, params: dict[str, Any]):
        """Pull the target Series out of whatever shape the resampler produced."""
        if isinstance(current_data, SplitDataset):
            return cls._extract_y_from_split_dataset(current_data, params)
        if isinstance(current_data, tuple):
            _, y_res = current_data
            return y_res
        if isinstance(current_data, pd.DataFrame | SkyulfDataFrame):
            target_col = params.get("target_column")
            if target_col and target_col in current_data.columns:
                return current_data[target_col]
        return None

    def _metrics_resampling(
        self, current_data: Any, params: dict[str, Any], metrics: dict[str, Any]
    ) -> None:
        try:
            y_res: Any = self._extract_y_for_resampling(current_data, params)
            if y_res is None:
                return
            if hasattr(y_res, "to_pandas"):
                y_res = y_res.to_pandas()
            counts = y_res.value_counts().to_dict()
            metrics["class_counts"] = {str(k): int(v) for k, v in counts.items()}
            metrics["total_samples"] = int(len(y_res))
        except Exception as e:
            logger.warning(f"Failed to calculate resampling metrics: {e}")

    @staticmethod
    def _to_pandas_if_needed(obj: Any) -> Any:
        """Convert obj to pandas via to_pandas() if it supports that, otherwise return as-is."""
        return obj.to_pandas() if hasattr(obj, "to_pandas") else obj

    @classmethod
    def _count_diff_cells(cls, a: Any, b: Any, types: tuple[type, ...]) -> int:
        """Count differing cells between two objects of the given types with matching shape."""
        a = cls._to_pandas_if_needed(a)
        b = cls._to_pandas_if_needed(b)
        if isinstance(a, types) and isinstance(b, types) and a.shape == b.shape:
            return int(a.ne(b).sum().sum())
        return 0

    @classmethod
    def _count_winsorize_tuple_diffs(cls, d1: tuple, d2: tuple) -> int:
        """Count differing cells across the X/y halves of two (X, y) tuple pairs."""
        diffs = cls._count_diff_cells(d1[0], d2[0], (pd.DataFrame,))
        diffs += cls._count_diff_cells(d1[1], d2[1], (pd.DataFrame, pd.Series))
        return diffs

    @classmethod
    def _count_winsorize_diffs(cls, d1: Any, d2: Any) -> int:
        """Count cells that differ between two data objects, for Winsorize clipping metric."""
        d1 = cls._to_pandas_if_needed(d1)
        d2 = cls._to_pandas_if_needed(d2)

        if isinstance(d1, pd.DataFrame) and isinstance(d2, pd.DataFrame):
            return cls._count_diff_cells(d1, d2, (pd.DataFrame,))

        if isinstance(d1, tuple) and isinstance(d2, tuple) and len(d1) == 2 and len(d2) == 2:
            return cls._count_winsorize_tuple_diffs(d1, d2)
        return 0

    def _metrics_winsorize_clipped(
        self, data_before: Any, current_data: Any, metrics: dict[str, Any]
    ) -> None:
        try:
            clipped_count = 0
            if isinstance(data_before, (pd.DataFrame, SkyulfDataFrame)) and isinstance(
                current_data, (pd.DataFrame, SkyulfDataFrame)
            ):
                clipped_count = self._count_winsorize_diffs(data_before, current_data)
            elif isinstance(data_before, SplitDataset) and isinstance(current_data, SplitDataset):
                clipped_count += self._count_winsorize_diffs(data_before.train, current_data.train)
                clipped_count += self._count_winsorize_diffs(data_before.test, current_data.test)
                clipped_count += self._count_winsorize_diffs(
                    data_before.validation, current_data.validation
                )
            metrics["values_clipped"] = clipped_count
        except Exception as e:
            logger.warning(f"Failed to calculate values_clipped for Winsorize: {e}")

    def _metrics_shape_change(
        self,
        transformer_type: str,
        data_before: Any,
        current_data: Any,
        params: dict[str, Any],
        rows_before: int,
        cols_before: Any,
        rows_after: int,
        cols_after: Any,
        metrics: dict[str, Any],
    ) -> None:
        if transformer_type in self._ROW_DROP_TYPES:
            dropped = rows_before - rows_after
            metrics[f"{transformer_type}_rows_removed"] = dropped
            metrics[f"{transformer_type}_rows_remaining"] = rows_after
            metrics[f"{transformer_type}_rows_total"] = rows_before
            metrics["rows_removed"] = dropped
            metrics["rows_total"] = rows_before
            if transformer_type == "Winsorize":
                self._metrics_winsorize_clipped(data_before, current_data, metrics)

        if transformer_type == "MissingIndicator":
            new_cols_set = cols_after - cols_before
            metrics["missing_indicators_created"] = len(new_cols_set)
            metrics["missing_indicators_columns"] = list(new_cols_set)

        if transformer_type in {"DropMissingColumns", "feature_selection"}:
            dropped_cols_set = cols_before - cols_after
            metrics["dropped_columns"] = list(dropped_cols_set)
            metrics["dropped_columns_count"] = len(dropped_cols_set)

        if transformer_type in self._ENCODER_TYPES:
            new_cols_set = cols_after - cols_before
            metrics["new_features_count"] = len(new_cols_set)
            metrics["encoded_columns_count"] = len(params.get("columns", []))
            if "categories_count" in params:
                metrics["categories_count"] = params["categories_count"]
            if "classes_count" in params:
                metrics["classes_count"] = params["classes_count"]

    def _get_transformer_components(self, type_name: str):
        try:
            return (
                NodeRegistry.get_calculator(type_name)(),
                NodeRegistry.get_applier(type_name)(),
            )
        except ValueError:
            raise ValueError(f"Unknown transformer type: {type_name}") from None

`fit_transform(data, node_id_prefix='')`

Runs the pipeline on data. Returns: (transformed_data, metrics_dict)
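
The shape of `steps_config` can be read off the loop in the listing below: each step needs a `name` and a `transformer` key, while `params` is optional and defaults to an empty dict. The sketch reproduces only that parsing, without importing skyulf. `describe_steps` is a hypothetical helper, and the `strategy` parameter is an illustrative value rather than a documented option.

```python
from typing import Any

steps_config: list[dict[str, Any]] = [
    # "strategy" is an illustrative parameter, not a documented option.
    {"name": "impute", "transformer": "SimpleImputer", "params": {"strategy": "mean"}},
    # "params" may be omitted entirely.
    {"name": "scale", "transformer": "StandardScaler"},
]


def describe_steps(
    steps: list[dict[str, Any]], node_id_prefix: str = ""
) -> list[tuple[str, str, dict[str, Any]]]:
    """Resolve (node id, transformer type, params) the way the loop reads them."""
    rows = []
    for step in steps:
        name = step["name"]
        transformer_type = step["transformer"]
        params = step.get("params", {})
        rows.append((f"{node_id_prefix}_{name}", transformer_type, params))
    return rows


resolved = describe_steps(steps_config, node_id_prefix="fe")
unprefixed = describe_steps(steps_config)
```

Because the node id is built as `f"{node_id_prefix}_{name}"`, the default empty prefix yields ids with a leading underscore, such as `_impute`.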

Source code in skyulf-core/skyulf/preprocessing/pipeline.py

def fit_transform(self, data: pd.DataFrame | SkyulfDataFrame | Any, node_id_prefix="") -> Any:
    """
    Runs the pipeline on data.
    Returns: (transformed_data, metrics_dict)
    """
    self.fitted_steps = []  # Reset fitted steps
    current_data = data
    metrics: dict[str, Any] = {}

    for i, step in enumerate(self.steps_config):
        name = step["name"]
        transformer_type = step["transformer"]
        params = step.get("params", {})

        logger.info(f"Running step {i}: {name} ({transformer_type})")
        logger.debug(f"FeatureEngineer running step {i}: {name} ({transformer_type})")
        logger.debug(f"current_data type: {type(current_data)}")

        # Snapshot before for shape-delta + Winsorize value-clipping metrics
        rows_before, cols_before = get_data_stats(current_data)
        data_before = current_data

        calculator, applier = self._get_transformer_components(transformer_type)
        step_node_id = f"{node_id_prefix}_{name}"

        current_data, fitted_params, transformer_inst = self._run_step(
            transformer_type=transformer_type,
            name=name,
            calculator=calculator,
            applier=applier,
            step_node_id=step_node_id,
            current_data=current_data,
            params=params,
        )

        if transformer_inst is not None:
            # Add node-level performance metrics directly into `metrics` dictionary
            metrics["fit_time"] = getattr(transformer_inst, "fit_time", 0.0)
            metrics["peak_memory_bytes"] = getattr(transformer_inst, "peak_memory_bytes", 0)
            metrics["rows_in"] = getattr(transformer_inst, "rows_in", 0)
            metrics["rows_out"] = getattr(transformer_inst, "rows_out", 0)

        logger.debug(f"Step {i} complete. New data type: {type(current_data)}")

        rows_after, cols_after = get_data_stats(current_data)
        self._collect_step_metrics(
            transformer_type=transformer_type,
            fitted_params=fitted_params,
            data_before=data_before,
            current_data=current_data,
            params=params,
            rows_before=rows_before,
            cols_before=cols_before,
            rows_after=rows_after,
            cols_after=cols_after,
            name=name,
            metrics=metrics,
        )

    return current_data, metrics

`transform(data)`

Apply fitted transformations to new data.
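
The replay logic can be sketched without skyulf as follows. Each fitted step stores its applier and artifact, and steps whose type is a splitter or resampler are skipped so that new data is never split or rebalanced. The skip list is copied from the listing below; `AddApplier`, `DuplicateRowsApplier`, the `"Shift"` step type and the `replay` function are hypothetical stand-ins.

```python
from typing import Any

RESAMPLING_TYPES = {"Oversampling", "Undersampling"}
SKIPPED_TYPES = {"TrainTestSplitter", "feature_target_split", *RESAMPLING_TYPES}


class AddApplier:
    """Hypothetical applier: adds a fitted amount to every value."""

    def apply(self, data: list[float], artifact: dict[str, Any]) -> list[float]:
        return [v + artifact["amount"] for v in data]


class DuplicateRowsApplier:
    """Hypothetical oversampler stand-in: would fabricate extra rows."""

    def apply(self, data: list[float], artifact: dict[str, Any]) -> list[float]:
        return data + data


fitted_steps = [
    {"name": "shift", "type": "Shift", "applier": AddApplier(), "artifact": {"amount": 10}},
    {"name": "balance", "type": "Oversampling", "applier": DuplicateRowsApplier(), "artifact": {}},
]


def replay(steps: list[dict[str, Any]], data: Any) -> Any:
    """Apply each fitted step in order, skipping splitters and resamplers."""
    current = data
    for step in steps:
        if step["type"] in SKIPPED_TYPES:
            continue
        current = step["applier"].apply(current, step["artifact"])
    return current


result = replay(fitted_steps, [1, 2])
```

The row count is unchanged after replay: the oversampling step was recorded during fitting but is never applied to new data.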

Source code in skyulf-core/skyulf/preprocessing/pipeline.py

def transform(self, data: pd.DataFrame | SkyulfDataFrame) -> pd.DataFrame | SkyulfDataFrame:
    """
    Apply fitted transformations to new data.
    """
    current_data = data

    for step in self.fitted_steps:
        name = step["name"]
        transformer_type = step["type"]
        applier = step["applier"]
        artifact = step["artifact"]

        # Skip splitters and resamplers during inference/transform
        if transformer_type in [
            "TrainTestSplitter",
            "feature_target_split",
            *self._RESAMPLING_TYPES,
        ]:
            continue

        logger.debug(f"Applying step: {name} ({transformer_type})")
        current_data = applier.apply(current_data, artifact)

    return current_data

`GeneralBinningCalculator`

Bases: BaseCalculator

Master calculator that handles mixed strategies and per-column overrides.
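
The `equal_width_bins` and `equal_frequency_bins` config keys in the listing below suggest two ways of deriving edges. The per-column helper is not shown on this page, so the following is only a sketch of the two general techniques using the standard library, not the package's implementation; both function names are hypothetical.

```python
from statistics import quantiles


def equal_width_edges(values: list[float], n_bins: int) -> list[float]:
    """Split the value range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


def equal_frequency_edges(values: list[float], n_bins: int) -> list[float]:
    """Place interior edges at quantiles so bins hold similar row counts."""
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [min(values), *cuts, max(values)]


skewed = [1, 2, 3, 4, 100]
width_edges = equal_width_edges(skewed, 2)
freq_edges = equal_frequency_edges(skewed, 2)
```

On skewed data the difference is visible: the equal-width split puts its interior edge at 50.5, leaving four of the five values in the first bin, while the equal-frequency split puts it at the median, 3.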

Source code in skyulf-core/skyulf/preprocessing/bucketing.py

@NodeRegistry.register("GeneralBinning", GeneralBinningApplier)
@node_meta(
    id="GeneralBinning",
    name="General Binning",
    category="Preprocessing",
    description="Bin continuous data into intervals.",
    params={"n_bins": 5, "strategy": "uniform", "columns": []},
)
class GeneralBinningCalculator(BaseCalculator):
    """Master calculator that handles mixed strategies and per-column overrides."""

    @fit_method
    def fit(self, X: Any, _y: Any, config: dict[str, Any]) -> GeneralBinningArtifact:  # pylint: disable=arguments-differ
        if user_picked_no_columns(config):
            return cast(GeneralBinningArtifact, {})

        X = _to_pandas_for_fit(X)
        columns = resolve_columns(X, config, detect_numeric_columns)

        defaults = {
            "default_n_bins": config.get("n_bins", 5),
            "n_bins": config.get("equal_width_bins", config.get("n_bins", 5)),
            "q_bins": config.get("equal_frequency_bins", config.get("n_bins", 5)),
            "duplicates": config.get("duplicates", "drop"),
        }

        valid_cols = [c for c in columns if c in X.columns]
        bin_edges_map: dict[str, list[float]] = {}
        custom_labels_map: dict[str, Any] = {}

        for col in valid_cols:
            _fit_one_column_into_maps(X, col, config, defaults, bin_edges_map, custom_labels_map)

        artifact: dict[str, Any] = {
            "type": "general_binning",
            "bin_edges": bin_edges_map,
            "custom_labels": custom_labels_map,
        }
        artifact.update(_passthrough_artifact_options(config))
        return cast(GeneralBinningArtifact, artifact)
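
The fallback chain in the `defaults` dict is easy to misread, so here is a small sketch that reproduces only that part. `resolve_bin_defaults` is a hypothetical name used for illustration; the key names and default values are taken from the `fit` body above.

```python
from typing import Any


def resolve_bin_defaults(config: dict[str, Any]) -> dict[str, Any]:
    """Reproduce the `defaults` dict built inside `fit` (illustration only)."""
    base = config.get("n_bins", 5)
    return {
        "default_n_bins": base,
        # Strategy-specific keys win; otherwise both fall back to `n_bins`.
        "n_bins": config.get("equal_width_bins", base),
        "q_bins": config.get("equal_frequency_bins", base),
        "duplicates": config.get("duplicates", "drop"),
    }


only_n_bins = resolve_bin_defaults({"n_bins": 8})
with_override = resolve_bin_defaults({"n_bins": 8, "equal_frequency_bins": 4})
empty = resolve_bin_defaults({})
```

With only `n_bins` set, both the equal-width and equal-frequency counts inherit it; setting `equal_frequency_bins` changes `q_bins` alone.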

`KBinsDiscretizerCalculator`

Bases: GeneralBinningCalculator

Thin wrapper around `GeneralBinningCalculator` with the `kbins` strategy.

Source code in skyulf-core/skyulf/preprocessing/bucketing.py

@NodeRegistry.register("KBinsDiscretizer", KBinsDiscretizerApplier)
@node_meta(
    id="KBinsDiscretizer",
    name="K-Bins Discretizer",
    category="Preprocessing",
    description="Bin continuous data into intervals using sklearn KBinsDiscretizer.",
    params={"n_bins": 5, "encode": "ordinal", "strategy": "quantile", "columns": []},
)
class KBinsDiscretizerCalculator(GeneralBinningCalculator):
    """Thin wrapper around :class:`GeneralBinningCalculator` with ``kbins`` strategy."""

    def fit(  # pylint: disable=arguments-differ
        self,
        df: pd.DataFrame | SkyulfDataFrame | tuple[Any, ...] | Any,
        config: dict[str, Any],
    ) -> GeneralBinningArtifact:
        new_config = config.copy()
        new_config["strategy"] = "kbins"
        if "n_bins" in config:
            new_config["kbins_n_bins"] = config["n_bins"]
        if "strategy" in config and config["strategy"] != "kbins":
            new_config["kbins_strategy"] = config["strategy"]
        return super().fit(df, new_config)  # pylint: disable=no-value-for-parameter
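
To make the config remapping concrete, the sketch below lifts the same four statements out of `fit` into a standalone function. `remap_kbins_config` is a hypothetical name; nothing here calls into the library.

```python
from typing import Any


def remap_kbins_config(config: dict[str, Any]) -> dict[str, Any]:
    """Rewrite a KBinsDiscretizer config into the keys GeneralBinning reads."""
    new_config = config.copy()  # shallow copy: the caller's dict is not modified
    new_config["strategy"] = "kbins"
    if "n_bins" in config:
        new_config["kbins_n_bins"] = config["n_bins"]
    if "strategy" in config and config["strategy"] != "kbins":
        # The user-facing strategy (e.g. "quantile") moves to `kbins_strategy`.
        new_config["kbins_strategy"] = config["strategy"]
    return new_config


original = {"n_bins": 5, "strategy": "quantile", "columns": ["age"]}
remapped = remap_kbins_config(original)
```

The user's `strategy` value is preserved under `kbins_strategy`, while the top-level `strategy` key is overwritten so the parent calculator dispatches to its kbins path.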

`SchemaMismatchError`

Bases: ValueError

Raised when an actual frame schema violates an expected `SkyulfSchema`.

Carries structured details so callers can render a precise message instead of a generic `KeyError` deep inside a transformer:

Attributes:

Name	Type	Description
`missing`	`list[str]`	Expected columns absent from the actual frame.
`unexpected`	`list[str]`	Actual columns not present in the expected schema.
`dtype_mismatches`	`dict[str, tuple[str, str]]`	`{column: (expected_dtype, actual_dtype)}` for shared columns whose dtype labels differ (only when dtype checking is requested).
`order_mismatch`	`bool`	`True` when the shared columns appear in a different relative order than expected (only when order checking is requested).

Source code in skyulf-core/skyulf/core/schema.py

class SchemaMismatchError(ValueError):
    """Raised when an actual frame schema violates an expected ``SkyulfSchema``.

    Carries structured details so callers can render a precise message instead
    of a generic ``KeyError`` deep inside a transformer:

    Attributes:
        missing: Expected columns absent from the actual frame.
        unexpected: Actual columns not present in the expected schema.
        dtype_mismatches: ``{column: (expected_dtype, actual_dtype)}`` for
            shared columns whose dtype labels differ (only when dtype checking
            is requested).
        order_mismatch: ``True`` when the shared columns appear in a different
            relative order than expected (only when order checking is requested).
    """

    def __init__(
        self,
        message: str,
        *,
        missing: list[str] | None = None,
        unexpected: list[str] | None = None,
        dtype_mismatches: dict[str, tuple[str, str]] | None = None,
        order_mismatch: bool = False,
    ) -> None:
        super().__init__(message)
        self.missing = missing or []
        self.unexpected = unexpected or []
        self.dtype_mismatches = dtype_mismatches or {}
        self.order_mismatch = order_mismatch
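
A caller might use the structured attributes roughly as follows. This is a sketch under stated assumptions: the class is redefined locally as a stand-in so the snippet runs on its own, and `summarize` is a hypothetical helper, not part of the library.

```python
from typing import Optional


class SchemaMismatchError(ValueError):
    """Local stand-in mirroring the constructor shown above."""

    def __init__(
        self,
        message: str,
        *,
        missing: Optional[list[str]] = None,
        unexpected: Optional[list[str]] = None,
        dtype_mismatches: Optional[dict[str, tuple[str, str]]] = None,
        order_mismatch: bool = False,
    ) -> None:
        super().__init__(message)
        self.missing = missing or []
        self.unexpected = unexpected or []
        self.dtype_mismatches = dtype_mismatches or {}
        self.order_mismatch = order_mismatch


def summarize(err: SchemaMismatchError) -> list[str]:
    """Hypothetical helper: one readable line per kind of discrepancy."""
    lines: list[str] = []
    if err.missing:
        lines.append("missing: " + ", ".join(err.missing))
    if err.unexpected:
        lines.append("unexpected: " + ", ".join(err.unexpected))
    for col, (expected, actual) in err.dtype_mismatches.items():
        lines.append(f"dtype {col}: expected {expected}, got {actual}")
    if err.order_mismatch:
        lines.append("column order differs")
    return lines


try:
    raise SchemaMismatchError(
        "input schema mismatch",
        missing=["age"],
        dtype_mismatches={"price": ("float64", "string")},
    )
except SchemaMismatchError as err:
    report = summarize(err)
```

Because the class derives from `ValueError`, existing `except ValueError` handlers still catch it; only callers that want the details need to name the subclass.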

`SkyulfSchema` `dataclass`

Immutable schema description.

Attributes:

Name	Type	Description
`columns`	`tuple[str, ...]`	Ordered list of column names.
`dtypes`	`dict[str, str]`	Mapping of column name → string dtype label (engine-agnostic; e.g. `"int64"`, `"float64"`, `"string"`, `"category"`, `"datetime"`, `"bool"`, or `"unknown"`). A column may be present in `columns` but absent from `dtypes` when its type is unknown.

Source code in skyulf-core/skyulf/core/schema.py

@dataclass(frozen=True)
class SkyulfSchema:
    """Immutable schema description.

    Attributes:
        columns: Ordered list of column names.
        dtypes: Mapping of column name → string dtype label
            (engine-agnostic; e.g. ``"int64"``, ``"float64"``, ``"string"``,
            ``"category"``, ``"datetime"``, ``"bool"``, or ``"unknown"``).
            A column may be present in ``columns`` but absent from
            ``dtypes`` when its type is unknown.
    """

    columns: tuple[str, ...]
    dtypes: dict[str, str] = field(default_factory=dict)

    # ---- Constructors -----------------------------------------------------

    @classmethod
    def from_columns(
        cls, columns: Iterable[str], dtypes: dict[str, str] | None = None
    ) -> "SkyulfSchema":
        cols = tuple(columns)
        return cls(columns=cols, dtypes=dict(dtypes or {}))

    @classmethod
    def from_dataframe(cls, df: Any) -> "SkyulfSchema":
        """Best-effort schema extraction from a Pandas/Polars/Wrapper frame."""
        raw_cols = getattr(df, "columns", None)
        cols = list(raw_cols) if raw_cols is not None else []
        dtypes = _extract_pandas_dtypes(df)
        if not dtypes:
            dtypes = _extract_polars_dtypes(df)
        return cls(columns=tuple(cols), dtypes=dtypes)

    # ---- Mutations (return new instances) ---------------------------------

    def drop(self, names: Iterable[str]) -> "SkyulfSchema":
        drop_set = set(names)
        new_cols = tuple(c for c in self.columns if c not in drop_set)
        new_dtypes = {k: v for k, v in self.dtypes.items() if k not in drop_set}
        return replace(self, columns=new_cols, dtypes=new_dtypes)

    def add(self, name: str, dtype: str = "unknown") -> "SkyulfSchema":
        if name in self.columns:
            return self
        new_dtypes = dict(self.dtypes)
        new_dtypes[name] = dtype
        return replace(self, columns=self.columns + (name,), dtypes=new_dtypes)

    def rename(self, mapping: dict[str, str]) -> "SkyulfSchema":
        new_cols = tuple(mapping.get(c, c) for c in self.columns)
        if len(set(new_cols)) != len(new_cols):
            seen: set[str] = set()
            collisions: set[str] = set()
            for c in new_cols:
                if c in seen:
                    collisions.add(c)
                seen.add(c)
            raise ValueError(
                "Schema rename() would produce duplicate column name(s): "
                f"{sorted(collisions)}. Rename mapping: {mapping}"
            )
        new_dtypes: dict[str, str] = {}
        for k, v in self.dtypes.items():
            new_dtypes[mapping.get(k, k)] = v
        return replace(self, columns=new_cols, dtypes=new_dtypes)

    def with_dtype(self, name: str, dtype: str) -> "SkyulfSchema":
        if name not in self.columns:
            return self
        new_dtypes = dict(self.dtypes)
        new_dtypes[name] = dtype
        return replace(self, dtypes=new_dtypes)

    # ---- Queries ----------------------------------------------------------

    def has(self, name: str) -> bool:
        return name in self.columns

    def column_list(self) -> list[str]:
        return list(self.columns)

    def __contains__(self, item: object) -> bool:
        return item in self.columns

    def __len__(self) -> int:
        return len(self.columns)

    # ---- Contract validation ---------------------------------------------

    def assert_compatible(
        self,
        actual: "SkyulfSchema",
        *,
        check_dtypes: bool = False,
        check_order: bool = False,
        where: str = "input",
    ) -> None:
        """Validate that ``actual`` satisfies this (expected) schema.

        ``self`` is the expected schema (e.g. what an Applier was fitted on);
        ``actual`` is the schema observed at apply time. Raises
        :class:`SchemaMismatchError` describing every discrepancy. Presence of
        the expected columns is always checked; dtype and column-order checks
        are opt-in to keep the default contract permissive and non-breaking.

        Args:
            actual: The schema observed at runtime.
            check_dtypes: Also compare dtype labels for shared columns.
            check_order: Also require shared columns in the same relative order.
            where: Label used in the error message (e.g. ``"input"``).
        """
        missing, unexpected = _presence_diff(self, actual)
        dtype_mismatches = _dtype_mismatches(self, actual) if check_dtypes else {}
        order_mismatch = _check_order(self, actual, check_order, missing)

        if missing or unexpected or dtype_mismatches or order_mismatch:
            raise SchemaMismatchError(
                _format_mismatch(where, missing, unexpected, dtype_mismatches, order_mismatch),
                missing=missing,
                unexpected=unexpected,
                dtype_mismatches=dtype_mismatches,
                order_mismatch=order_mismatch,
            )
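
The "mutations return new instances" convention can be seen with a trimmed local copy of the class. `MiniSchema` below is a stand-in that reproduces only the `drop` and `add` logic shown above, so the snippet runs without the package installed.

```python
from dataclasses import dataclass, field, replace
from typing import Iterable


@dataclass(frozen=True)
class MiniSchema:
    """Trimmed local copy of the `drop`/`add` logic (illustration only)."""

    columns: tuple[str, ...]
    dtypes: dict[str, str] = field(default_factory=dict)

    def drop(self, names: Iterable[str]) -> "MiniSchema":
        drop_set = set(names)
        return replace(
            self,
            columns=tuple(c for c in self.columns if c not in drop_set),
            dtypes={k: v for k, v in self.dtypes.items() if k not in drop_set},
        )

    def add(self, name: str, dtype: str = "unknown") -> "MiniSchema":
        if name in self.columns:
            return self  # adding an existing column is a no-op
        return replace(
            self,
            columns=self.columns + (name,),
            dtypes={**self.dtypes, name: dtype},
        )


base = MiniSchema(columns=("id", "age"), dtypes={"id": "int64", "age": "float64"})
derived = base.drop(["id"]).add("age_bin", "category")
```

Each call returns a fresh instance, so `base` can be kept as the fitted contract while `derived` describes the output of a step.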

`assert_compatible(actual, *, check_dtypes=False, check_order=False, where='input')`

Validate that `actual` satisfies this (expected) schema.

`self` is the expected schema (e.g. what an Applier was fitted on); `actual` is the schema observed at apply time. Raises `SchemaMismatchError` describing every discrepancy. Presence of the expected columns is always checked; dtype and column-order checks are opt-in to keep the default contract permissive and non-breaking.

Parameters:

Name	Type	Description	Default
`actual`	`SkyulfSchema`	The schema observed at runtime.	required
`check_dtypes`	`bool`	Also compare dtype labels for shared columns.	`False`
`check_order`	`bool`	Also require shared columns in the same relative order.	`False`
`where`	`str`	Label used in the error message (e.g. `"input"`).	`'input'`

Source code in skyulf-core/skyulf/core/schema.py

def assert_compatible(
    self,
    actual: "SkyulfSchema",
    *,
    check_dtypes: bool = False,
    check_order: bool = False,
    where: str = "input",
) -> None:
    """Validate that ``actual`` satisfies this (expected) schema.

    ``self`` is the expected schema (e.g. what an Applier was fitted on);
    ``actual`` is the schema observed at apply time. Raises
    :class:`SchemaMismatchError` describing every discrepancy. Presence of
    the expected columns is always checked; dtype and column-order checks
    are opt-in to keep the default contract permissive and non-breaking.

    Args:
        actual: The schema observed at runtime.
        check_dtypes: Also compare dtype labels for shared columns.
        check_order: Also require shared columns in the same relative order.
        where: Label used in the error message (e.g. ``"input"``).
    """
    missing, unexpected = _presence_diff(self, actual)
    dtype_mismatches = _dtype_mismatches(self, actual) if check_dtypes else {}
    order_mismatch = _check_order(self, actual, check_order, missing)

    if missing or unexpected or dtype_mismatches or order_mismatch:
        raise SchemaMismatchError(
            _format_mismatch(where, missing, unexpected, dtype_mismatches, order_mismatch),
            missing=missing,
            unexpected=unexpected,
            dtype_mismatches=dtype_mismatches,
            order_mismatch=order_mismatch,
        )
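
The private helpers `_presence_diff` and `_check_order` are not shown on this page, so the following is only a guess at the kind of comparison they perform, written from the attribute descriptions of `SchemaMismatchError`. Both function names and bodies are hypothetical.

```python
def presence_diff(expected: list[str], actual: list[str]) -> tuple[list[str], list[str]]:
    """Hypothetical presence check: which columns are missing or unexpected."""
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    unexpected = [c for c in actual if c not in expected_set]
    return missing, unexpected


def relative_order_differs(expected: list[str], actual: list[str]) -> bool:
    """Hypothetical order check: compare only the columns both sides share."""
    shared = set(expected) & set(actual)
    return [c for c in expected if c in shared] != [c for c in actual if c in shared]


expected_cols = ["id", "age", "city"]
actual_cols = ["age", "id", "zip"]

missing, unexpected = presence_diff(expected_cols, actual_cols)
order_changed = relative_order_differs(expected_cols, actual_cols)
```

Restricting the order comparison to shared columns keeps a missing or extra column from also being reported as an ordering problem.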

`from_dataframe(df)` `classmethod`

Best-effort schema extraction from a Pandas/Polars/Wrapper frame.

Source code in skyulf-core/skyulf/core/schema.py

@classmethod
def from_dataframe(cls, df: Any) -> "SkyulfSchema":
    """Best-effort schema extraction from a Pandas/Polars/Wrapper frame."""
    raw_cols = getattr(df, "columns", None)
    cols = list(raw_cols) if raw_cols is not None else []
    dtypes = _extract_pandas_dtypes(df)
    if not dtypes:
        dtypes = _extract_polars_dtypes(df)
    return cls(columns=tuple(cols), dtypes=dtypes)

`StatefulTransformer`

Fits + applies one pipeline step.

Accepts anything satisfying `skyulf.core.protocols.CalculatorProtocol` / `skyulf.core.protocols.ApplierProtocol` (structural typing): a `BaseCalculator`/`BaseApplier` subclass, or any duck-typed object exposing matching `fit`/`apply` methods, works without subclassing.

Source code in skyulf-core/skyulf/preprocessing/base.py

class StatefulTransformer:
    """Fits + applies one pipeline step.

    Accepts anything satisfying :class:`~skyulf.core.protocols.CalculatorProtocol` /
    :class:`~skyulf.core.protocols.ApplierProtocol` (structural typing) — a
    ``BaseCalculator``/``BaseApplier`` subclass, or any duck-typed object
    exposing matching ``fit``/``apply`` methods, works without subclassing.
    """

    def __init__(
        self,
        calculator: CalculatorProtocol,
        applier: ApplierProtocol,
        node_id: str,
        apply_on_test: bool = True,
        apply_on_validation: bool = True,
    ):
        self.calculator = calculator
        self.applier = applier
        self.node_id = node_id
        self.apply_on_test = apply_on_test
        self.apply_on_validation = apply_on_validation
        self.params: dict[str, Any] = {}  # Store params in memory instead of ArtifactStore
        # Profiling metrics
        self.fit_time: float = 0.0
        self.peak_memory_bytes: int = 0
        self.rows_in: int = 0
        self.rows_out: int = 0

    def fit_transform(
        self,
        dataset: SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple,
        config: dict[str, Any],
    ) -> SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple:
        self.rows_in, _ = get_data_stats(dataset)
        tracemalloc.start()
        start = time.time()

        result = self._fit_transform_inner(dataset, config)

        self.fit_time = time.time() - start

        if tracemalloc.is_tracing():
            _, peak = tracemalloc.get_traced_memory()
            self.peak_memory_bytes = peak
            tracemalloc.stop()

        self.rows_out, _ = get_data_stats(result)
        return result

    def _fit_transform_inner(
        self,
        dataset: SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple,
        config: dict[str, Any],
    ) -> SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple:
        # Check for DataFrame-like (Pandas, Polars, Wrapper)
        if (
            hasattr(dataset, "shape")
            and hasattr(dataset, "columns")
            and not isinstance(dataset, tuple)
        ):
            # Fit on the whole dataframe (be careful about leakage!)
            # ty can't narrow a Union through hasattr — cast once for both calls.
            frame = cast(Any, dataset)
            # Calculator.fit returns Mapping (TypedDicts allowed); cast to Dict
            # for storage so Appliers continue to receive a concrete Dict.
            self.params = cast(dict[str, Any], self.calculator.fit(frame, config))
            return self.applier.apply(frame, self.params)

        # If dataset is a tuple (e.g. from FeatureTargetSplitter), pass it through.
        # This allows nodes like TrainTestSplitter to accept (X, y) tuples.
        if isinstance(dataset, tuple):
            self.params = cast(dict[str, Any], self.calculator.fit(dataset, config))
            return self.applier.apply(dataset, self.params)

        # 1. Calculate on Train
        self.params = cast(dict[str, Any], self.calculator.fit(dataset.train, config))

        # 2. Apply to all splits
        return self._apply_to_split_dataset(dataset, self.params)

    def _apply_guarded(self, data: Any, params: dict[str, Any]) -> Any:
        """Apply the applier to `data` and raise if it produces a nested SplitDataset."""
        result = self.applier.apply(data, params)
        if isinstance(result, SplitDataset):
            raise TypeError(
                "Applier returned SplitDataset inside StatefulTransformer, which is not supported."
            )
        return result

    def _apply_to_split_dataset(
        self, dataset: SplitDataset, params: dict[str, Any]
    ) -> SplitDataset:
        """Apply the applier to each split (train/test/validation) of a SplitDataset."""
        new_train = self._apply_guarded(dataset.train, params)

        new_test = dataset.test
        if self.apply_on_test:
            new_test = self._apply_guarded(dataset.test, params)

        new_val = dataset.validation
        if self.apply_on_validation and dataset.validation is not None:
            new_val = self._apply_guarded(dataset.validation, params)

        return SplitDataset(train=new_train, test=new_test, validation=new_val)

    def transform(
        self, dataset: SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple
    ) -> SplitDataset | pd.DataFrame | SkyulfDataFrame | tuple:
        # Use stored params
        params = self.params

        if isinstance(dataset, pd.DataFrame):
            return self.applier.apply(dataset, params)

        if isinstance(dataset, tuple):
            return self.applier.apply(dataset, params)

        # 2. Apply
        # ty can't narrow SplitDataset out of the SkyulfDataFrame branch of this
        # Union via isinstance alone (mirrors the `frame = cast(Any, dataset)`
        # note in `_fit_transform_inner` above).
        return self._apply_to_split_dataset(cast(SplitDataset, dataset), params)
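
The fit-on-train, apply-to-every-split flow can be sketched with duck-typed objects. Everything below is a stand-in: `Split` approximates `SplitDataset`, plain lists replace DataFrames, and `fit_on_train_apply_everywhere` is a hypothetical function that condenses the split branch of `_fit_transform_inner`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Split:
    """Hypothetical stand-in for SplitDataset (train/test/validation)."""

    train: list[float]
    test: list[float]
    validation: Optional[list[float]] = None


class MeanCalculator:
    """Duck-typed calculator: learns the mean of whatever it is fitted on."""

    def fit(self, data: list[float], config: dict[str, Any]) -> dict[str, Any]:
        return {"mean": sum(data) / len(data)}


class CenterApplier:
    """Duck-typed applier: subtracts the fitted mean from each value."""

    def apply(self, data: list[float], params: dict[str, Any]) -> list[float]:
        return [x - params["mean"] for x in data]


def fit_on_train_apply_everywhere(
    calculator: Any, applier: Any, dataset: Split, config: dict[str, Any]
) -> Split:
    params = calculator.fit(dataset.train, config)  # statistics come from train only
    new_val = None
    if dataset.validation is not None:
        new_val = applier.apply(dataset.validation, params)
    return Split(
        train=applier.apply(dataset.train, params),
        test=applier.apply(dataset.test, params),
        validation=new_val,
    )


out = fit_on_train_apply_everywhere(
    MeanCalculator(), CenterApplier(), Split(train=[1.0, 3.0], test=[10.0]), {}
)
```

The test value is centered with the train mean (2.0), not its own, which is the leakage-avoiding behavior the split branch is built for. Neither toy class inherits from a base class; matching method names are enough.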

`validate_schema(expected, actual, *, check_dtypes=False, check_order=False, where='input')`

Validate a live DataFrame against an `expected` schema.

Thin convenience wrapper: builds a `SkyulfSchema` from `actual` (Pandas/Polars/wrapper frame), or uses `actual` as-is when it is already a `SkyulfSchema`, and delegates to `SkyulfSchema.assert_compatible`. Raises `SchemaMismatchError` on any violation.

Source code in skyulf-core/skyulf/core/schema.py

def validate_schema(
    expected: SkyulfSchema,
    actual: Any,
    *,
    check_dtypes: bool = False,
    check_order: bool = False,
    where: str = "input",
) -> None:
    """Validate a live DataFrame against an ``expected`` schema.

    Thin convenience wrapper: builds a :class:`SkyulfSchema` from ``actual``
    (Pandas/Polars/wrapper frame) and delegates to
    :meth:`SkyulfSchema.assert_compatible`. Raises
    :class:`SchemaMismatchError` on any violation.
    """
    actual_schema = (
        actual if isinstance(actual, SkyulfSchema) else SkyulfSchema.from_dataframe(actual)
    )
    expected.assert_compatible(
        actual_schema,
        check_dtypes=check_dtypes,
        check_order=check_order,
        where=where,
    )
