Preprocessing Nodes

This page documents all preprocessing node types supported by Skyulf's NodeRegistry (resolved by FeatureEngineer at runtime).

Step schema

Every preprocessing step in the pipeline config uses:

{
  "name": "...",
  "transformer": "TransformerType",
  "params": { ... }
}

The params object is passed to the node's Calculator fit().

Splitters

Example step:

{"name": "split", "transformer": "TrainTestSplitter", "params": {"test_size": 0.2, "random_state": 42, "target_column": "target"}}

TrainTestSplitter

Splits a DataFrame (or (X, y) tuple) into SplitDataset(train, test, validation).

Config (params):

  • test_size: float (default 0.2)
  • validation_size: float (default 0.0)
  • random_state: int (default 42)
  • shuffle: bool (default True)
  • stratify: bool (default False)
  • target_column: str (required only when splitting a DataFrame and using stratify)

Learned params: none (passes through config).

feature_target_split

Splits a DataFrame into (X, y) (or applies the split to each split of a SplitDataset).

Config:

  • target_column: str (required)

Learned params: none.

Cleaning

Example step:

{
  "name": "clean_text",
  "transformer": "TextCleaning",
  "params": {
    "columns": ["free_text"],
    "operations": [
      {"op": "trim", "mode": "both"},
      {"op": "case", "mode": "lower"},
      {"op": "regex", "mode": "collapse_whitespace"}
    ]
  }
}

TextCleaning

Applies a list of string operations.

Config:

  • columns: list[str] (optional; auto-detects text-like columns)
  • operations: list[dict]; each entry is one of:
      • { "op": "trim", "mode": "both"|"leading"|"trailing" }
      • { "op": "case", "mode": "lower"|"upper"|"title"|"sentence" }
      • { "op": "remove_special", "mode": "keep_alphanumeric"|"keep_alphanumeric_space"|"letters_only"|"digits_only", "replacement": "" }
      • { "op": "regex", "mode": "collapse_whitespace"|"extract_digits"|"custom", "pattern": "...", "repl": "..." }

Learned params:

  • columns
  • operations

ValueReplacement

Replaces values in selected columns.

Config:

  • columns: list[str]
  • Either:
      • mapping: dict (global mapping) or dict[col -> mapping]
      • to_replace + value
      • replacements: list of {old, new} (converted into a mapping)

Learned params:

  • columns, mapping, to_replace, value

AliasReplacement

Normalizes common textual aliases (boolean/country/custom).

Config:

  • columns: list[str] (optional; auto-detects text-like columns)
  • alias_type: boolean | country | custom (also supports legacy mode)
  • custom_map: dict[str, str] (also supports legacy custom_pairs)

Learned params:

  • columns, alias_type, custom_map

InvalidValueReplacement

Replaces invalid numeric values.

Config:

  • columns: list[str]
  • rule: negative | zero | negative_to_nan | custom_range (also supports legacy mode)
  • replacement: any (default NaN)
  • min_value / max_value: used by custom_range

Learned params:

  • columns, rule, replacement, min_value, max_value

Drop & Missing
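
Example step (illustrative; the column names are assumptions):

```json
{"name": "drop_missing_rows", "transformer": "DropMissingRows", "params": {"subset": ["age", "income"], "how": "any"}}
```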

Deduplicate

Config:

  • subset: list[str] | None
  • keep: first | last | none (none maps to pandas keep=False, dropping all duplicated rows)

Learned params:

  • subset, keep

DropMissingColumns

Config:

  • missing_threshold: float (percentage; if > 0, drops columns with missing% >= threshold)
  • columns: list[str] (explicit columns to drop)

Learned params:

  • columns_to_drop, threshold

DropMissingRows

Config:

  • subset: list[str] | None
  • how: any | all (ignored if threshold provided)
  • threshold: int | None (min non-null values)

Learned params:

  • subset, how, threshold

MissingIndicator

Adds {col}_missing indicator columns.

Config:

  • columns: list[str] (optional; defaults to all columns with any missing values)

Learned params:

  • columns

Imputation

Example step:

{"name": "impute", "transformer": "SimpleImputer", "params": {"strategy": "median", "columns": ["age"]}}

SimpleImputer

Config:

  • strategy: mean | median | most_frequent | constant (also accepts mode)
  • columns: list[str] (optional; numeric auto-detection for mean/median)
  • fill_value: any (used for constant)

Learned params:

  • columns
  • strategy
  • fill_values: dict[col -> value]
  • missing_counts: dict[col -> count]
  • total_missing: int

KNNImputer

Config:

  • columns: list[str] (numeric)
  • n_neighbors: int (default 5)
  • weights: uniform | distance

Learned params:

  • columns
  • imputer_object (sklearn object; pickled in pipeline)
  • n_neighbors, weights

IterativeImputer

Config:

  • columns: list[str] (numeric)
  • max_iter: int (default 10)
  • estimator: BayesianRidge | DecisionTree | ExtraTrees | KNeighbors

Learned params:

  • columns
  • imputer_object (sklearn object; pickled in pipeline)
  • estimator

Encoding

Example step:

{"name": "encode", "transformer": "OneHotEncoder", "params": {"columns": ["city"], "drop_original": true, "handle_unknown": "ignore"}}

OneHotEncoder

Config:

  • columns: list[str] (optional; auto-detects categorical columns)
  • drop_first: bool (default False)
  • max_categories: int (default 20)
  • handle_unknown: ignore | error (default ignore)
  • drop_original: bool (default True)
  • include_missing: bool (default False)

Learned params:

  • columns
  • encoder_object (sklearn OneHotEncoder)
  • feature_names: list[str]
  • drop_original, include_missing

DummyEncoder

Config:

  • columns: list[str]
  • drop_first: bool

Learned params:

  • columns
  • categories: dict[col -> list[str]]
  • drop_first

OrdinalEncoder

Config:

  • columns: list[str]
  • handle_unknown: str (default use_encoded_value)
  • unknown_value: int/float (default -1)

Learned params:

  • columns
  • encoder_object (sklearn OrdinalEncoder)
  • categories_count

LabelEncoder

Encodes either target or selected feature columns.

Config:

  • columns: optional list[str]
      • if omitted, encodes the provided target y
      • if provided, encodes those feature columns (and also the target if included)

Learned params:

  • encoders: dict[col or "target" -> sklearn LabelEncoder]
  • classes_count

TargetEncoder

Requires a target series (y).

Config:

  • columns: list[str]
  • smooth: auto or numeric
  • target_type: auto or explicit type

Learned params:

  • columns
  • encoder_object (sklearn TargetEncoder)

HashEncoder

Config:

  • columns: list[str]
  • n_features: int (default 10)

Learned params:

  • columns, n_features

Scaling

All scaling nodes accept columns (optional; numeric columns are auto-detected) and store their fitted statistics as numeric arrays in the learned params.
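
Example step (illustrative; the column names are assumptions):

```json
{"name": "scale", "transformer": "StandardScaler", "params": {"columns": ["age", "income"], "with_mean": true, "with_std": true}}
```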

StandardScaler

Config:

  • columns: list[str]
  • with_mean: bool (default True)
  • with_std: bool (default True)

Learned params:

  • columns, mean, scale, var, with_mean, with_std

MinMaxScaler

Config:

  • columns: list[str]
  • feature_range: tuple (default (0, 1))

Learned params:

  • columns, min, scale, data_min, data_max, feature_range

RobustScaler

Config:

  • columns: list[str]
  • quantile_range: tuple (default (25.0, 75.0))
  • with_centering: bool (default True)
  • with_scaling: bool (default True)

Learned params:

  • columns, center, scale, quantile_range, with_centering, with_scaling

MaxAbsScaler

Config:

  • columns: list[str]

Learned params:

  • columns, scale, max_abs

Outliers
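
Example step (illustrative; the column name is an assumption):

```json
{"name": "remove_outliers", "transformer": "IQR", "params": {"columns": ["income"], "multiplier": 1.5}}
```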

IQR

Filters rows outside per-column IQR bounds.

Config:

  • columns: list[str]
  • multiplier: float (default 1.5)

Learned params:

  • bounds: dict[col -> {lower, upper}]
  • warnings

ZScore

Config:

  • columns: list[str]
  • threshold: float (default 3.0)

Learned params:

  • stats: dict[col -> {mean, std}]
  • threshold, warnings

Winsorize

Clips values into per-column percentile bounds.

Config:

  • columns: list[str]
  • lower_percentile: float (default 5.0)
  • upper_percentile: float (default 95.0)

Learned params:

  • bounds: dict[col -> {lower, upper}]

ManualBounds

Filters rows outside user-provided bounds.

Config:

  • bounds: dict[col -> {lower, upper}]

Learned params:

  • bounds

EllipticEnvelope

Learns a per-column EllipticEnvelope model and filters outliers.

Config:

  • columns: list[str]
  • contamination: float (default 0.01)

Learned params:

  • models: dict[col -> sklearn model]
  • contamination, warnings

Transformations
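
Example step (illustrative; the column name is an assumption):

```json
{"name": "transform", "transformer": "PowerTransformer", "params": {"columns": ["income"], "method": "yeo-johnson", "standardize": true}}
```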

PowerTransformer

Config:

  • columns: list[str]
  • method: yeo-johnson | box-cox
  • standardize: bool

Learned params:

  • columns, lambdas, method, standardize, scaler_params

SimpleTransformation

Config:

  • transformations: list of {column, method, clip_threshold?}
  • methods include log, square_root, cube_root, reciprocal, square, exponential

Learned params:

  • transformations (passes through)

GeneralTransformation

Config:

  • transformations: list of {column, method, clip_threshold?}
  • methods include power transforms (box-cox, yeo-johnson) and the simple methods

Learned params:

  • transformations with fitted lambdas/scaler_params where applicable

Bucketing (Binning)
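
Example step (illustrative; the column name and bin count are assumptions):

```json
{"name": "bin_age", "transformer": "GeneralBinning", "params": {"columns": ["age"], "strategy": "equal_width", "n_bins": 5, "drop_original": false}}
```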

GeneralBinning

Creates binned features with configurable strategies.

Config:

  • columns: list[str] (numeric)
  • strategy: equal_width | equal_frequency | kmeans | custom | kbins
  • n_bins and strategy-specific keys:
      • equal_width_bins, equal_frequency_bins, duplicates
      • kbins_n_bins, kbins_strategy
      • custom_bins: dict[col -> edges]
      • custom_labels: dict[col -> labels]
  • output formatting:
      • output_suffix, drop_original, label_format, missing_strategy, missing_label, include_lowest, precision

Learned params:

  • bin_edges (dict[col -> edges])
  • output formatting settings

CustomBinning

Config:

  • columns: list[str]
  • bins: list[float] (shared edges)
  • plus output formatting keys (same as GeneralBinning)

Learned params:

  • bin_edges

KBinsDiscretizer

Wrapper around GeneralBinning with a KBins-style interface.

Config:

  • columns: list[str]
  • n_bins: int
  • strategy: uniform | quantile | kmeans

Learned params:

  • bin_edges

Casting
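
Example step (illustrative; the column names and dtype strings are assumptions):

```json
{"name": "cast", "transformer": "Casting", "params": {"column_types": {"age": "int64", "score": "float64"}, "coerce_on_error": true}}
```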

Casting

Config:

  • Either:
      • column_types: dict[col -> dtype]
      • columns + target_type
  • coerce_on_error: bool (default True)

Learned params:

  • type_map, coerce_on_error

Feature Generation
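
Example step (illustrative; the column names are assumptions):

```json
{"name": "poly", "transformer": "PolynomialFeatures", "params": {"columns": ["age", "income"], "degree": 2, "interaction_only": false, "include_bias": false}}
```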

PolynomialFeatures

(Alias: PolynomialFeaturesNode)

Config:

  • columns: list[str]
  • auto_detect: bool
  • degree: int
  • interaction_only: bool
  • include_bias: bool
  • include_input_features: bool
  • output_prefix: str

Learned params:

  • columns, degree, interaction_only, include_bias, include_input_features, output_prefix, feature_names

FeatureGeneration

(Aliases: FeatureMath, FeatureGenerationNode)

Config:

  • operations: list[dict]
  • epsilon: float (default 1e-9)
  • allow_overwrite: bool

Learned params:

  • operations, epsilon, allow_overwrite

Feature Selection
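
Example step (illustrative; the threshold value is an assumption):

```json
{"name": "select", "transformer": "VarianceThreshold", "params": {"threshold": 0.01, "drop_columns": true}}
```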

VarianceThreshold

Config:

  • threshold: float (default 0.0)
  • columns: list[str] (optional; numeric)
  • drop_columns: bool (default True)

Learned params:

  • candidate_columns, selected_columns, variances, threshold, drop_columns

CorrelationThreshold

Config:

  • threshold: float (default 0.95)
  • correlation_method: pearson | spearman | kendall (default pearson)
  • columns: list[str] (numeric)
  • drop_columns: bool

Learned params:

  • columns_to_drop, threshold, method, drop_columns

UnivariateSelection

Config:

  • target_column: str (if y not passed as tuple)
  • problem_type: auto | classification | regression
  • method: select_k_best | select_percentile | select_fpr | select_fdr | select_fwe | generic_univariate_select
  • selector parameters (depending on method): k, percentile, alpha, mode, param
  • scoring: score_func (e.g., f_classif, mutual_info_classif, …)
  • drop_columns: bool

Learned params:

  • selected_columns, candidate_columns, scores, pvalues (when available), plus selector config

ModelBasedSelection

Config:

  • target_column: str
  • problem_type: auto | classification | regression
  • method: select_from_model | rfe
  • estimator: auto | logistic_regression | random_forest | linear_regression
  • For select_from_model: threshold, max_features
  • For RFE: n_features_to_select, step
  • drop_columns: bool

Learned params:

  • selected_columns, candidate_columns, and method-specific metadata

feature_selection

A higher-level facade node that dispatches to the selection implementations.

Resampling
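
Example step (illustrative; the parameter values are assumptions):

```json
{"name": "oversample", "transformer": "Oversampling", "params": {"method": "smote", "target_column": "target", "sampling_strategy": "auto", "random_state": 42, "k_neighbors": 5}}
```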

Oversampling

Config:

  • method: smote | adasyn | borderline_smote | svm_smote | kmeans_smote | smote_tomek
  • target_column: required if y is not provided as tuple
  • sampling_strategy: auto or dict
  • random_state: int
  • method-specific keys: k_neighbors, m_neighbors, kind, out_step, cluster_balance_threshold, density_exponent, n_jobs

Learned params: none (passes through config).

Undersampling

Config:

  • method: random_under_sampling | nearmiss | tomek_links | edited_nearest_neighbours
  • target_column: required if y not provided as tuple
  • sampling_strategy, random_state, replacement, version, n_neighbors, kind_sel, n_jobs

Learned params: none.

Inspection
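
Example step (illustrative):

```json
{"name": "snapshot", "transformer": "DataSnapshot", "params": {"n_rows": 10}}
```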

DatasetProfile

Captures basic dataset stats without modifying data.

Config: none.

Learned params:

  • profile: rows/columns/dtypes/missing/numeric_stats

DataSnapshot

Captures the first N rows without modifying data.

Config:

  • n_rows: int (default 5)

Learned params:

  • snapshot: list[dict]