API: preprocessing.base

skyulf.preprocessing.base

BaseApplier

Bases: ABC

Source code in skyulf-core/skyulf/preprocessing/base.py
class BaseApplier(ABC):
    @abstractmethod
    def apply(self, df: Union[pd.DataFrame, SkyulfDataFrame, tuple], params: Dict[str, Any]) -> Any:
        """
        Applies the transformation using fitted parameters.

        The return type is intentionally `Any` because the concrete shape
        depends on the input: passing a `DataFrame` returns a `DataFrame`;
        passing an `(X, y)` tuple returns a tuple; splitters return
        `SplitDataset`. Encoding every case as a union forces callers to
        defensively narrow on every use, which is worse than `Any` here.
        """

apply(df, params) abstractmethod

Applies the transformation using fitted parameters.

The return type is intentionally Any because the concrete shape depends on the input: passing a DataFrame returns a DataFrame; passing an (X, y) tuple returns a tuple; splitters return SplitDataset. Encoding every case as a union forces callers to defensively narrow on every use, which is worse than Any here.
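The input-dependent return shape described above can be sketched roughly as follows. This is an illustration only: `Frame` is a hypothetical stand-in for a DataFrame, `ShiftApplier` is an invented example class, and `BaseApplier` is re-declared locally so the snippet runs without skyulf installed.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple, Union

# Hypothetical stand-in for a DataFrame: column name -> list of values.
Frame = Dict[str, List[float]]


class BaseApplier(ABC):
    # Local copy of the abstract interface so the sketch is self-contained.
    @abstractmethod
    def apply(self, df: Any, params: Dict[str, Any]) -> Any: ...


class ShiftApplier(BaseApplier):
    """Hypothetical applier: subtracts a fitted offset from each column."""

    def apply(
        self, df: Union[Frame, Tuple[Frame, List[int]]], params: Dict[str, Any]
    ) -> Any:
        # Remember the input shape so the output can mirror it.
        is_tuple = isinstance(df, tuple)
        X, y = df if is_tuple else (df, None)
        offsets = params["offsets"]
        X_out = {
            col: [v - offsets.get(col, 0.0) for v in vals]
            for col, vals in X.items()
        }
        # Frame in -> frame out; (X, y) in -> (X, y) out.
        return (X_out, y) if is_tuple else X_out


params = {"offsets": {"a": 1.0}}
frame_out = ShiftApplier().apply({"a": [1.0, 2.0]}, params)
pair_out = ShiftApplier().apply(({"a": [1.0, 2.0]}, [0, 1]), params)
```

A caller that knows which shape it passed in can use the result directly, without narrowing a union first.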

Source code in skyulf-core/skyulf/preprocessing/base.py
@abstractmethod
def apply(self, df: Union[pd.DataFrame, SkyulfDataFrame, tuple], params: Dict[str, Any]) -> Any:
    """
    Applies the transformation using fitted parameters.

    The return type is intentionally `Any` because the concrete shape
    depends on the input: passing a `DataFrame` returns a `DataFrame`;
    passing an `(X, y)` tuple returns a tuple; splitters return
    `SplitDataset`. Encoding every case as a union forces callers to
    defensively narrow on every use, which is worse than `Any` here.
    """

BaseCalculator

Bases: ABC

Source code in skyulf-core/skyulf/preprocessing/base.py
class BaseCalculator(ABC):
    @abstractmethod
    def fit(
        self, df: Union[pd.DataFrame, SkyulfDataFrame, tuple], config: Dict[str, Any]
    ) -> Mapping[str, Any]:
        """
        Calculates parameters from the training data.
        Returns a Mapping of fitted parameters (typically a TypedDict
        ``*Artifact`` declared in ``preprocessing._artifacts``). The return
        type is ``Mapping`` rather than ``Dict`` so concrete TypedDict
        subclasses are valid LSP-substitutable returns.
        """

    def infer_output_schema(
        self, input_schema: SkyulfSchema, config: Dict[str, Any]
    ) -> Optional[SkyulfSchema]:
        """Best-effort prediction of the output schema from config alone.

        Override this in concrete Calculators when the output columns/dtypes
        can be derived purely from ``input_schema`` and ``config`` (i.e.
        without seeing data). Examples:

        * Scalers — pass through (output == input).
        * Drop columns by name — drop the configured names.
        * One-hot — adds K columns per categorical (K is data-dependent →
          return ``None``).

        Default returns ``None`` to signal "unknown / data-dependent";
        callers should fall back to runtime introspection.
        """
        return None

fit(df, config) abstractmethod

Calculates parameters from the training data and returns a Mapping of fitted parameters (typically a TypedDict *Artifact declared in preprocessing._artifacts). The return type is Mapping rather than Dict so that an override can declare a concrete TypedDict return type and still be a valid substitute under the Liskov substitution principle (LSP).
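As a minimal sketch of that return-type choice, a concrete calculator might narrow the return annotation to its own TypedDict. `MeanArtifact` and `MeanCalculator` are hypothetical names, the frame is modeled as a plain dict of lists, and `BaseCalculator` is re-declared locally so the snippet runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Mapping, TypedDict


class MeanArtifact(TypedDict):
    # Hypothetical artifact shape; real artifacts live in preprocessing._artifacts.
    means: Dict[str, float]


class BaseCalculator(ABC):
    # Local copy of the abstract interface so the sketch is self-contained.
    @abstractmethod
    def fit(self, df: Any, config: Dict[str, Any]) -> Mapping[str, Any]: ...


class MeanCalculator(BaseCalculator):
    """Hypothetical calculator: learns the mean of each configured column."""

    def fit(self, df: Dict[str, List[float]], config: Dict[str, Any]) -> MeanArtifact:
        # Default to every column when the config names none.
        cols = config.get("columns", list(df))
        return {"means": {c: sum(df[c]) / len(df[c]) for c in cols}}


artifact = MeanCalculator().fit(
    {"a": [1.0, 3.0], "b": [10.0, 20.0]}, {"columns": ["a"]}
)
```

Type checkers generally accept a TypedDict where a read-only `Mapping` is expected but reject it where a mutable `Dict` is expected, which is the reason the base signature uses `Mapping`.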

Source code in skyulf-core/skyulf/preprocessing/base.py
@abstractmethod
def fit(
    self, df: Union[pd.DataFrame, SkyulfDataFrame, tuple], config: Dict[str, Any]
) -> Mapping[str, Any]:
    """
    Calculates parameters from the training data.
    Returns a Mapping of fitted parameters (typically a TypedDict
    ``*Artifact`` declared in ``preprocessing._artifacts``). The return
    type is ``Mapping`` rather than ``Dict`` so concrete TypedDict
    subclasses are valid LSP-substitutable returns.
    """

infer_output_schema(input_schema, config)

Best-effort prediction of the output schema from config alone.

Override this in concrete Calculators when the output columns/dtypes can be derived purely from input_schema and config (i.e. without seeing data). Examples:

  • Scalers — pass through (output == input).
  • Drop columns by name — drop the configured names.
  • One-hot — adds K columns per categorical (K is data-dependent → return None).

Default returns None to signal "unknown / data-dependent"; callers should fall back to runtime introspection.
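The override-or-fall-back contract could look roughly like the sketch below. `Schema` is a hypothetical stand-in for SkyulfSchema (whose real fields are not shown on this page), and `resolve_schema` and both calculator classes are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class Schema:
    # Hypothetical stand-in for SkyulfSchema: column name -> dtype string.
    columns: Dict[str, str]


class DropColumnsCalculator:
    """Output schema is derivable from config alone, so predict it."""

    def infer_output_schema(
        self, input_schema: Schema, config: Dict[str, Any]
    ) -> Optional[Schema]:
        drop = set(config.get("columns", []))
        kept = {n: t for n, t in input_schema.columns.items() if n not in drop}
        return Schema(kept)


class OneHotCalculator:
    """Number of output columns depends on the data, so report 'unknown'."""

    def infer_output_schema(
        self, input_schema: Schema, config: Dict[str, Any]
    ) -> Optional[Schema]:
        return None


def resolve_schema(
    calc: Any, input_schema: Schema, config: Dict[str, Any],
    introspect: Callable[[], Schema],
) -> Schema:
    # Caller-side fallback: use the prediction if there is one,
    # otherwise inspect the data at runtime.
    predicted = calc.infer_output_schema(input_schema, config)
    return predicted if predicted is not None else introspect()


schema_in = Schema({"id": "int64", "price": "float64"})
static = resolve_schema(
    DropColumnsCalculator(), schema_in, {"columns": ["id"]},
    introspect=lambda: schema_in,
)
dynamic = resolve_schema(
    OneHotCalculator(), schema_in, {},
    introspect=lambda: Schema({"color_red": "uint8"}),
)
```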

Source code in skyulf-core/skyulf/preprocessing/base.py
def infer_output_schema(
    self, input_schema: SkyulfSchema, config: Dict[str, Any]
) -> Optional[SkyulfSchema]:
    """Best-effort prediction of the output schema from config alone.

    Override this in concrete Calculators when the output columns/dtypes
    can be derived purely from ``input_schema`` and ``config`` (i.e.
    without seeing data). Examples:

    * Scalers — pass through (output == input).
    * Drop columns by name — drop the configured names.
    * One-hot — adds K columns per categorical (K is data-dependent →
      return ``None``).

    Default returns ``None`` to signal "unknown / data-dependent";
    callers should fall back to runtime introspection.
    """
    return None

apply_method(fn)

Decorator that handles unpack/pack boilerplate around an Applier's apply.

The decorated method is written with signature (self, X, y, params) instead of (self, df, params). The wrapper:

  1. Calls unpack_pipeline_input(df) to get (X, y, is_tuple).
  2. Invokes the user's method with the unpacked X and y.
  3. If the method returns a 2-tuple (X_out, y_out), that pair is packed; otherwise the result is treated as X_out and the original y is reused.
  4. Calls pack_pipeline_output to restore the original input shape.

Useful for ~50 Appliers that share the same boilerplate. Skip it for splitters (which return SplitDataset directly) or analyzers that don't transform the frame.
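The four steps above can be exercised with a small, self-contained sketch. The decorator body mirrors the source listing on this page, but `unpack_pipeline_input` and `pack_pipeline_output` are simplified stand-ins whose behavior is assumed from the description, frames are modeled as lists of rows, and both applier classes are hypothetical.

```python
import functools
from typing import Any, Callable, Dict


def unpack_pipeline_input(df: Any):
    """Assumed behavior: split an (X, y) tuple, or pass a bare frame through."""
    if isinstance(df, tuple):
        X, y = df
        return X, y, True
    return df, None, False


def pack_pipeline_output(X: Any, y: Any, is_tuple: bool) -> Any:
    """Assumed behavior: return (X, y) only if the input was a tuple."""
    return (X, y) if is_tuple else X


def apply_method(fn: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(fn)
    def wrapper(self: Any, df: Any, params: Dict[str, Any]) -> Any:
        X, y, is_tuple = unpack_pipeline_input(df)
        result = fn(self, X, y, params)
        if isinstance(result, tuple) and len(result) == 2:
            X_out, y_out = result
        else:
            X_out, y_out = result, y
        return pack_pipeline_output(X_out, y_out, is_tuple)

    return wrapper


class ScaleApplier:
    """Hypothetical X-only transform: returns just X, so y is reused."""

    @apply_method
    def apply(self, X, y, params):
        return [[v * params["factor"] for v in row] for row in X]


class DropMissingApplier:
    """Hypothetical row filter: returns (X_out, y_out) to keep rows aligned."""

    @apply_method
    def apply(self, X, y, params):
        keep = [i for i, row in enumerate(X) if None not in row]
        X_out = [X[i] for i in keep]
        y_out = None if y is None else [y[i] for i in keep]
        return X_out, y_out


rows = [[1.0, 2.0], [None, 4.0]]
scaled = ScaleApplier().apply(([[1.0], [2.0]], [0, 1]), {"factor": 2.0})
filtered = DropMissingApplier().apply((rows, [0, 1]), {})
frame_only = DropMissingApplier().apply(rows, {})
```

Because the wrapper treats any 2-tuple result as `(X_out, y_out)`, a decorated method should not return a bare 2-tuple as its transformed `X`.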

Source code in skyulf-core/skyulf/preprocessing/base.py
def apply_method(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that handles unpack/pack boilerplate around an Applier's `apply`.

    The decorated method is written with signature ``(self, X, y, params)``
    instead of ``(self, df, params)``. The wrapper:

    1. Calls ``unpack_pipeline_input(df)`` to get ``(X, y, is_tuple)``.
    2. Invokes the user's method with the unpacked ``X`` and ``y``.
    3. If the method returns a 2-tuple ``(X_out, y_out)``, that pair is
       packed; otherwise the result is treated as ``X_out`` and the
       original ``y`` is reused.
    4. Calls ``pack_pipeline_output`` to restore the original input shape.

    Useful for ~50 Appliers that share the same boilerplate. Skip it for
    splitters (which return ``SplitDataset`` directly) or analyzers that
    don't transform the frame.
    """

    @functools.wraps(fn)
    def wrapper(self: Any, df: Any, params: Dict[str, Any]) -> Any:
        X, y, is_tuple = unpack_pipeline_input(df)
        result = fn(self, X, y, params)
        if isinstance(result, tuple) and len(result) == 2:
            X_out, y_out = result
        else:
            X_out, y_out = result, y
        return pack_pipeline_output(X_out, y_out, is_tuple)

    return wrapper

fit_method(fn)

Decorator that handles unpack boilerplate around a Calculator's fit.

The decorated method is written as (self, X, y, config) and may ignore y for X-only fits. No packing is done — fit returns a params dict, not a frame.

The TypeVar _NodeParams preserves the specific TypedDict return type (see preprocessing._artifacts) so callers see the concrete shape.
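A short usage sketch, under the same caveats: `unpack_pipeline_input` is a simplified stand-in with assumed behavior, `_NodeParams` is declared here as an unbounded TypeVar (the real definition may be bounded), and `MinMaxArtifact` and `MinMaxCalculator` are hypothetical.

```python
import functools
from typing import Any, Callable, Dict, List, TypedDict, TypeVar

# Assumption: unbounded here; the library's own TypeVar may carry a bound.
_NodeParams = TypeVar("_NodeParams")


def unpack_pipeline_input(df: Any):
    """Assumed behavior: split an (X, y) tuple, or pass a bare frame through."""
    if isinstance(df, tuple):
        X, y = df
        return X, y, True
    return df, None, False


def fit_method(fn: Callable[..., _NodeParams]) -> Callable[..., _NodeParams]:
    @functools.wraps(fn)
    def wrapper(self: Any, df: Any, config: Dict[str, Any]) -> _NodeParams:
        # Only unpacking is needed: fit returns parameters, not a frame.
        X, y, _ = unpack_pipeline_input(df)
        return fn(self, X, y, config)

    return wrapper


class MinMaxArtifact(TypedDict):
    # Hypothetical artifact shape for illustration.
    lo: float
    hi: float


class MinMaxCalculator:
    """Hypothetical X-only fit: y is accepted but ignored."""

    @fit_method
    def fit(self, X: Dict[str, List[float]], y: Any, config: Dict[str, Any]) -> MinMaxArtifact:
        values = X[config["column"]]
        return {"lo": min(values), "hi": max(values)}


frame = {"a": [3.0, 1.0, 2.0]}
from_frame = MinMaxCalculator().fit(frame, {"column": "a"})
from_pair = MinMaxCalculator().fit((frame, [0, 1, 0]), {"column": "a"})
```

Both call styles produce the same parameters, since the target is dropped before the fit logic runs.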

Source code in skyulf-core/skyulf/preprocessing/base.py
def fit_method(fn: Callable[..., _NodeParams]) -> Callable[..., _NodeParams]:
    """Decorator that handles unpack boilerplate around a Calculator's `fit`.

    The decorated method is written as ``(self, X, y, config)`` and may
    ignore ``y`` for X-only fits. No packing is done — `fit` returns a
    params dict, not a frame.

    The TypeVar ``_NodeParams`` preserves the specific TypedDict return type
    (see ``preprocessing._artifacts``) so callers see the concrete shape.
    """

    @functools.wraps(fn)
    def wrapper(self: Any, df: Any, config: Dict[str, Any]) -> _NodeParams:
        X, y, _ = unpack_pipeline_input(df)
        return fn(self, X, y, config)

    return wrapper  # type: ignore[return-value]