Preprocessing Nodes

This page documents the preprocessing node types supported by FeatureEngineer.

Step schema

Every preprocessing step in the pipeline config uses:

{
  "name": "...",
  "transformer": "TransformerType",
  "params": { ... }
}

The params object is passed to the fit() method of the node’s Calculator.
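As a rough illustration of the schema, the sketch below checks that a step dict carries the three keys listed above. The helper validate_step is hypothetical and is not part of FeatureEngineer; it only shows the expected shape of a step.

```python
import json

# The three keys named in the step schema.
REQUIRED_KEYS = {"name", "transformer", "params"}


def validate_step(step: dict) -> dict:
    """Hypothetical helper: check that a step dict matches the step schema."""
    missing = REQUIRED_KEYS - step.keys()
    if missing:
        raise ValueError(f"step is missing keys: {sorted(missing)}")
    if not isinstance(step["params"], dict):
        raise TypeError("params must be a dict")
    return step


step = validate_step({
    "name": "impute",
    "transformer": "SimpleImputer",
    "params": {"strategy": "median", "columns": ["age"]},
})
print(json.dumps(step, indent=2))
```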

Splitters

Example step:

{"name": "split", "transformer": "TrainTestSplitter", "params": {"test_size": 0.2, "random_state": 42, "target_column": "target"}}

TrainTestSplitter

Splits a DataFrame (or (X, y) tuple) into SplitDataset(train, test, validation).

Config (params):

  • test_size: float (default 0.2)
  • validation_size: float (default 0.0)
  • random_state: int (default 42)
  • shuffle: bool (default True)
  • stratify: bool (default False)
  • target_column: str (required only when splitting a DataFrame and using stratify)

Learned params: none (passes through config).
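The interaction of test_size, validation_size, shuffle, and random_state can be pictured with the plain-Python sketch below. It assumes both sizes are fractions of the full row count and ignores stratify; the actual node may compute the split differently. split_indices is a hypothetical name.

```python
import random


def split_indices(n_rows, test_size=0.2, validation_size=0.0,
                  random_state=42, shuffle=True):
    """Return (train, test, validation) row indices.

    Assumption: test_size and validation_size are fractions of n_rows.
    """
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator makes the shuffle reproducible.
        random.Random(random_state).shuffle(indices)
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * validation_size)
    test = indices[:n_test]
    validation = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, test, validation


train, test, validation = split_indices(10, test_size=0.2, validation_size=0.1)
print(len(train), len(test), len(validation))
```

With the same random_state the three index lists are identical across runs, which is the point of exposing that parameter.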

feature_target_split

Splits a DataFrame into (X, y) (or applies the split to each SplitDataset split).

Config:

  • target_column: str (required)

Learned params: none.

Cleaning

Example step:

{
  "name": "clean_text",
  "transformer": "TextCleaning",
  "params": {
    "columns": ["free_text"],
    "operations": [
      {"op": "trim", "mode": "both"},
      {"op": "case", "mode": "lower"},
      {"op": "regex", "mode": "collapse_whitespace"}
    ]
  }
}

TextCleaning

Applies a list of string operations.

Config:

  • columns: list[str] (optional; auto-detects text-like columns)
  • operations: list[dict], where each entry is one of:
      • { "op": "trim", "mode": "both"|"leading"|"trailing" }
      • { "op": "case", "mode": "lower"|"upper"|"title"|"sentence" }
      • { "op": "remove_special", "mode": "keep_alphanumeric"|"keep_alphanumeric_space"|"letters_only"|"digits_only", "replacement": "" }
      • { "op": "regex", "mode": "collapse_whitespace"|"extract_digits"|"custom", "pattern": "...", "repl": "..." }

Learned params:

  • columns
  • operations
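The operations list is applied in order. The sketch below mimics that for a single string, covering only the three operations used in the example step plus the upper and title case modes. apply_operations is a hypothetical function, and the exact semantics of each mode in the real node are assumed rather than confirmed.

```python
import re

# Mode tables for the operations covered by this sketch.
TRIM = {"both": str.strip, "leading": str.lstrip, "trailing": str.rstrip}
CASE = {"lower": str.lower, "upper": str.upper, "title": str.title}


def apply_operations(value: str, operations: list) -> str:
    """Apply each operation to the string, in list order."""
    for op in operations:
        kind, mode = op["op"], op["mode"]
        if kind == "trim":
            value = TRIM[mode](value)
        elif kind == "case":
            value = CASE[mode](value)
        elif kind == "regex" and mode == "collapse_whitespace":
            # Replace any run of whitespace with a single space.
            value = re.sub(r"\s+", " ", value)
        else:
            raise ValueError(f"operation not covered by this sketch: {op}")
    return value


operations = [
    {"op": "trim", "mode": "both"},
    {"op": "case", "mode": "lower"},
    {"op": "regex", "mode": "collapse_whitespace"},
]
cleaned = apply_operations("  Hello   WORLD \n", operations)
print(repr(cleaned))
```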

ValueReplacement

Replaces values in selected columns.

Config:

  • columns: list[str]
  • One of:
      • mapping: dict (global mapping) or dict[col -> mapping]
      • to_replace + value
      • replacements: list of {old, new} (converted into a mapping)

Learned params:

  • columns, mapping, to_replace, value

AliasReplacement

Normalizes common textual aliases (boolean/country/custom).

Config:

  • columns: list[str] (optional; auto-detects text-like columns)
  • alias_type: boolean | country | custom (the legacy key mode is also accepted)
  • custom_map: dict[str, str] (also supports legacy custom_pairs)

Learned params:

  • columns, alias_type, custom_map

InvalidValueReplacement

Replaces invalid numeric values.

Config:

  • columns: list[str]
  • rule: negative | zero | negative_to_nan | custom_range (the legacy key mode is also accepted)
  • replacement: any (default NaN)
  • min_value / max_value: used by custom_range

Learned params:

  • columns, rule, replacement, min_value, max_value

Drop & Missing

Deduplicate

Config:

  • subset: list[str] | None
  • keep: first | last | none (none is mapped to False, so every duplicated row is dropped)

Learned params:

  • subset, keep

DropMissingColumns

Config:

  • missing_threshold: float (percentage; if > 0, drops columns with missing% >= threshold)
  • columns: list[str] (explicit columns to drop)

Learned params:

  • columns_to_drop, threshold
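The threshold rule can be sketched as follows. This assumes missing_threshold is expressed on a 0 to 100 scale and that a missing value is represented by None; find_columns_to_drop is a hypothetical name and the table is a plain dict of columns rather than a DataFrame.

```python
def find_columns_to_drop(table: dict, missing_threshold: float) -> list:
    """Return columns whose missing percentage is >= missing_threshold."""
    if missing_threshold <= 0:
        # A non-positive threshold disables threshold-based dropping.
        return []
    to_drop = []
    for name, values in table.items():
        if not values:
            continue
        missing_pct = 100.0 * sum(v is None for v in values) / len(values)
        if missing_pct >= missing_threshold:
            to_drop.append(name)
    return to_drop


table = {
    "age": [31, None, 45, 52],           # 25% missing
    "notes": [None, None, None, "ok"],   # 75% missing
    "city": ["Oslo", "Rome", "Lima", "Kyiv"],
}
columns_to_drop = find_columns_to_drop(table, missing_threshold=50.0)
print(columns_to_drop)
```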

DropMissingRows

Config:

  • subset: list[str] | None
  • how: any | all (ignored if threshold provided)
  • threshold: int | None (min non-null values)

Learned params:

  • subset, how, threshold

MissingIndicator

Adds {col}_missing indicator columns.

Config:

  • columns: list[str] (optional; defaults to all columns with any missing values)

Learned params:

  • columns

Imputation

Example step:

{"name": "impute", "transformer": "SimpleImputer", "params": {"strategy": "median", "columns": ["age"]}}

SimpleImputer

Config:

  • strategy: mean | median | most_frequent | constant (also accepts mode)
  • columns: list[str] (optional; numeric auto-detection for mean/median)
  • fill_value: any (used for constant)

Learned params:

  • columns
  • strategy
  • fill_values: dict[col -> value]
  • missing_counts: dict[col -> count]
  • total_missing: int
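To make the learned params concrete, here is a minimal sketch of a median fit that produces fill_values, missing_counts, and total_missing. It assumes the counts are taken on the fitting data and that None marks a missing value; fit_median_imputer and apply_imputer are hypothetical names.

```python
from statistics import median


def fit_median_imputer(table: dict, columns: list) -> dict:
    """Learn one median per column, skipping missing (None) entries."""
    fill_values, missing_counts = {}, {}
    for col in columns:
        observed = [v for v in table[col] if v is not None]
        fill_values[col] = median(observed)
        missing_counts[col] = len(table[col]) - len(observed)
    return {
        "columns": columns,
        "strategy": "median",
        "fill_values": fill_values,
        "missing_counts": missing_counts,
        "total_missing": sum(missing_counts.values()),
    }


def apply_imputer(table: dict, learned: dict) -> dict:
    """Fill missing entries with the learned value; other columns pass through."""
    out = dict(table)
    for col in learned["columns"]:
        fill = learned["fill_values"][col]
        out[col] = [fill if v is None else v for v in table[col]]
    return out


table = {"age": [20, None, 40, 30, None]}
learned = fit_median_imputer(table, ["age"])
imputed = apply_imputer(table, learned)
print(learned["fill_values"], imputed["age"])
```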

KNNImputer

Config:

  • columns: list[str] (numeric)
  • n_neighbors: int (default 5)
  • weights: uniform | distance

Learned params:

  • columns
  • imputer_object (sklearn object; pickled in pipeline)
  • n_neighbors, weights

IterativeImputer

Config:

  • columns: list[str] (numeric)
  • max_iter: int (default 10)
  • estimator: BayesianRidge | DecisionTree | ExtraTrees | KNeighbors

Learned params:

  • columns
  • imputer_object (sklearn object; pickled in pipeline)
  • estimator

Encoding

Example step:

{"name": "encode", "transformer": "OneHotEncoder", "params": {"columns": ["city"], "drop_original": true, "handle_unknown": "ignore"}}

OneHotEncoder

Config:

  • columns: list[str] (optional; auto-detects categorical columns)
  • drop_first: bool (default False)
  • max_categories: int (default 20)
  • handle_unknown: ignore | error (default ignore)
  • drop_original: bool (default True)
  • include_missing: bool (default False)

Learned params:

  • columns
  • encoder_object (sklearn OneHotEncoder)
  • feature_names: list[str]
  • drop_original, include_missing

DummyEncoder

Config:

  • columns: list[str]
  • drop_first: bool

Learned params:

  • columns
  • categories: dict[col -> list[str]]
  • drop_first

OrdinalEncoder

Config:

  • columns: list[str]
  • handle_unknown: str (default use_encoded_value)
  • unknown_value: int/float (default -1)

Learned params:

  • columns
  • encoder_object (sklearn OrdinalEncoder)
  • categories_count

LabelEncoder

Encodes either target or selected feature columns.

Config:

  • columns: optional list[str]
      • if omitted, encodes the provided target y
      • if provided, encodes those feature columns (and also target if included)

Learned params:

  • encoders: dict[col or "target" -> sklearn LabelEncoder]
  • classes_count

TargetEncoder

Requires a target series (y).

Config:

  • columns: list[str]
  • smooth: auto or numeric
  • target_type: auto or explicit type

Learned params:

  • columns
  • encoder_object (sklearn TargetEncoder)

HashEncoder

Config:

  • columns: list[str]
  • n_features: int (default 10)

Learned params:

  • columns, n_features
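Hash encoding needs no fitted vocabulary, which is why only columns and n_features are stored. The sketch below shows the general idea of mapping a category to one of n_features buckets. The choice of MD5 and the single-bucket output are assumptions for illustration; the node's actual hashing scheme is not documented here. hash_bucket is a hypothetical name.

```python
import hashlib


def hash_bucket(value: str, n_features: int = 10) -> int:
    """Map a category string to a stable bucket index in [0, n_features)."""
    # hashlib gives the same digest across processes, unlike built-in hash().
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


cities = ["Oslo", "Rome", "Lima", "Oslo"]
buckets = [hash_bucket(city) for city in cities]
print(buckets)
```

Unseen categories at transform time still land in a valid bucket, at the cost of possible collisions between distinct categories.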

Scaling

All scaling nodes accept columns (optional; numeric columns are auto-detected when omitted) and store their learned statistics as numeric arrays.

StandardScaler

Config:

  • columns: list[str]
  • with_mean: bool (default True)
  • with_std: bool (default True)

Learned params:

  • columns, mean, scale, var, with_mean, with_std

MinMaxScaler

Config:

  • columns: list[str]
  • feature_range: tuple (default (0, 1))

Learned params:

  • columns, min, scale, data_min, data_max, feature_range
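The learned names match the attributes of scikit-learn's MinMaxScaler, so the sketch below follows scikit-learn's formulas for a single column: scale = (max - min) / (data_max - data_min) and min = feature_range min - data_min * scale. Whether the node stores exactly these quantities is an assumption; fit_min_max and transform are hypothetical names.

```python
def fit_min_max(values: list, feature_range=(0, 1)) -> dict:
    """Learn min-max statistics for one numeric column."""
    lo, hi = feature_range
    data_min, data_max = min(values), max(values)
    # Treat a zero range (constant column) as 1 to avoid division by zero.
    data_range = (data_max - data_min) or 1.0
    scale = (hi - lo) / data_range
    return {
        "data_min": data_min,
        "data_max": data_max,
        "scale": scale,
        "min": lo - data_min * scale,
        "feature_range": feature_range,
    }


def transform(values: list, learned: dict) -> list:
    """Apply x * scale + min to every value."""
    return [v * learned["scale"] + learned["min"] for v in values]


learned = fit_min_max([10, 20, 30])
scaled = transform([10, 20, 30], learned)
print(learned, scaled)
```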

RobustScaler

Config:

  • columns: list[str]
  • quantile_range: tuple (default (25.0, 75.0))
  • with_centering: bool (default True)
  • with_scaling: bool (default True)

Learned params:

  • columns, center, scale, quantile_range, with_centering, with_scaling

MaxAbsScaler

Config:

  • columns: list[str]

Learned params:

  • columns, scale, max_abs

Outliers

IQR

Filters rows outside per-column IQR bounds.

Config:

  • columns: list[str]
  • multiplier: float (default 1.5)

Learned params:

  • bounds: dict[col -> {lower, upper}]
  • warnings
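The per-column bounds follow the usual rule lower = Q1 - multiplier * IQR and upper = Q3 + multiplier * IQR. The sketch below computes them with the standard library and then filters rows. The quantile interpolation method is an assumption (linear, inclusive), and fit_iqr_bounds and filter_rows are hypothetical names.

```python
from statistics import quantiles


def fit_iqr_bounds(table: dict, columns: list, multiplier: float = 1.5) -> dict:
    """Learn {lower, upper} bounds per column from the quartiles."""
    bounds = {}
    for col in columns:
        q1, _, q3 = quantiles(table[col], n=4, method="inclusive")
        iqr = q3 - q1
        bounds[col] = {"lower": q1 - multiplier * iqr,
                       "upper": q3 + multiplier * iqr}
    return bounds


def filter_rows(table: dict, bounds: dict) -> dict:
    """Keep only rows where every bounded column lies within its bounds."""
    n_rows = len(next(iter(table.values())))
    keep = [
        i for i in range(n_rows)
        if all(b["lower"] <= table[c][i] <= b["upper"] for c, b in bounds.items())
    ]
    return {col: [values[i] for i in keep] for col, values in table.items()}


table = {"income": [1, 2, 3, 4, 5, 6, 7, 8, 100]}
bounds = fit_iqr_bounds(table, ["income"])
filtered = filter_rows(table, bounds)
print(bounds, filtered)
```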

ZScore

Config:

  • columns: list[str]
  • threshold: float (default 3.0)

Learned params:

  • stats: dict[col -> {mean, std}]
  • threshold, warnings
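A z-score filter keeps rows where |x - mean| / std does not exceed the threshold. The sketch below does this for one column and assumes the population standard deviation; the node may use the sample standard deviation instead. fit_zscore and keep_inliers are hypothetical names.

```python
from statistics import fmean, pstdev


def fit_zscore(values: list) -> dict:
    """Learn the mean and (population) standard deviation of a column."""
    return {"mean": fmean(values), "std": pstdev(values)}


def keep_inliers(values: list, stats: dict, threshold: float = 3.0) -> list:
    """Drop values whose absolute z-score is above the threshold."""
    if stats["std"] == 0:
        # A constant column has no outliers under this rule.
        return list(values)
    return [
        v for v in values
        if abs(v - stats["mean"]) / stats["std"] <= threshold
    ]


values = [8, 9, 10, 11, 12] * 4 + [50]
stats = fit_zscore(values)
kept = keep_inliers(values, stats, threshold=3.0)
print(stats, len(kept))
```

Note that on very small samples no point can reach a z-score of 3, so a threshold of 3.0 only becomes effective once a column has a reasonable number of rows.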

Winsorize

Clips values into per-column percentile bounds.

Config:

  • columns: list[str]
  • lower_percentile: float (default 5.0)
  • upper_percentile: float (default 95.0)

Learned params:

  • bounds: dict[col -> {lower, upper}]
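Unlike the filtering nodes, winsorizing keeps every row and clips extreme values to the learned bounds. A minimal single-column sketch follows; the linear-interpolation percentile is an assumption about how the bounds are computed, and percentile and winsorize are hypothetical names.

```python
def percentile(sorted_values: list, pct: float) -> float:
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_values) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def winsorize(values: list, lower_percentile=5.0, upper_percentile=95.0):
    """Return (clipped values, learned bounds) for one column."""
    ordered = sorted(values)
    bounds = {
        "lower": percentile(ordered, lower_percentile),
        "upper": percentile(ordered, upper_percentile),
    }
    clipped = [min(max(v, bounds["lower"]), bounds["upper"]) for v in values]
    return clipped, bounds


values = list(range(1, 22))  # 1, 2, ..., 21
clipped, bounds = winsorize(values)
print(bounds, clipped[0], clipped[-1])
```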

ManualBounds

Filters rows outside user-provided bounds.

Config:

  • bounds: dict[col -> {lower, upper}]

Learned params:

  • bounds

EllipticEnvelope

Learns a per-column EllipticEnvelope model and filters outliers.

Config:

  • columns: list[str]
  • contamination: float (default 0.01)

Learned params:

  • models: dict[col -> sklearn model]
  • contamination, warnings

Transformations

PowerTransformer

Config:

  • columns: list[str]
  • method: yeo-johnson | box-cox
  • standardize: bool

Learned params:

  • columns, lambdas, method, standardize, scaler_params

SimpleTransformation

Config:

  • transformations: list of {column, method, clip_threshold?}
  • methods include log, square_root, cube_root, reciprocal, square, exponential

Learned params:

  • transformations (passes through)

GeneralTransformation

Config:

  • transformations: list of {column, method, clip_threshold?}
  • methods include power transforms (box-cox, yeo-johnson) and the simple methods

Learned params:

  • transformations with fitted lambdas/scaler_params where applicable

Bucketing (Binning)

GeneralBinning

Creates binned features with configurable strategies.

Config:

  • columns: list[str] (numeric)
  • strategy: equal_width | equal_frequency | kmeans | custom | kbins
  • n_bins and strategy-specific keys:
      • equal_width_bins, equal_frequency_bins, duplicates
      • kbins_n_bins, kbins_strategy
      • custom_bins: dict[col -> edges]
      • custom_labels: dict[col -> labels]
  • output formatting:
      • output_suffix, drop_original, label_format, missing_strategy, missing_label, include_lowest, precision

Learned params:

  • bin_edges (dict[col -> edges])
  • output formatting settings

CustomBinning

Config:

  • columns: list[str]
  • bins: list[float] (shared edges)
  • plus output formatting keys (same as GeneralBinning)

Learned params:

  • bin_edges
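Binning with shared edges amounts to locating each value between two consecutive edges. The sketch below assumes right-closed intervals (lower, upper], with include_lowest optionally closing the first interval on the left, similar to pandas.cut; whether the node follows that convention is an assumption. assign_bin is a hypothetical name.

```python
from bisect import bisect_left


def assign_bin(value: float, edges: list, include_lowest: bool = False):
    """Return the 0-based bin index for value, or None if it falls outside.

    Assumption: bins are right-closed, i.e. (edges[i], edges[i + 1]].
    """
    if include_lowest and value == edges[0]:
        return 0
    index = bisect_left(edges, value) - 1
    if index < 0 or index >= len(edges) - 1:
        return None
    return index


bins = [0, 18, 65, 120]
ages = [5, 18, 40, 70, 130]
assigned = [assign_bin(age, bins) for age in ages]
print(assigned)
```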

KBinsDiscretizer

Wrapper around GeneralBinning with a KBins-style interface.

Config:

  • columns: list[str]
  • n_bins: int
  • strategy: uniform | quantile | kmeans

Learned params:

  • bin_edges

Casting

Casting

Config:

  • Either:
      • column_types: dict[col -> dtype]
      • or columns + target_type
  • coerce_on_error: bool (default True)

Learned params:

  • type_map, coerce_on_error

Feature Generation

PolynomialFeatures

(Alias: PolynomialFeaturesNode)

Config:

  • columns: list[str]
  • auto_detect: bool
  • degree: int
  • interaction_only: bool
  • include_bias: bool
  • include_input_features: bool
  • output_prefix: str

Learned params:

  • columns, degree, interaction_only, include_bias, include_input_features, output_prefix, feature_names

FeatureGeneration

(Aliases: FeatureMath, FeatureGenerationNode)

Config:

  • operations: list[dict]
  • epsilon: float (default 1e-9)
  • allow_overwrite: bool

Learned params:

  • operations, epsilon, allow_overwrite

Feature Selection

VarianceThreshold

Config:

  • threshold: float (default 0.0)
  • columns: list[str] (optional; numeric)
  • drop_columns: bool (default True)

Learned params:

  • candidate_columns, selected_columns, variances, threshold, drop_columns
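The learned params can be pictured with the sketch below, which computes a per-column variance and keeps columns whose variance is strictly greater than the threshold. Population variance and the strict comparison follow scikit-learn's VarianceThreshold; that the node behaves identically is an assumption. fit_variance_threshold is a hypothetical name.

```python
from statistics import pvariance


def fit_variance_threshold(table: dict, columns: list, threshold: float = 0.0) -> dict:
    """Learn per-column variances and the columns that pass the threshold."""
    variances = {col: pvariance(table[col]) for col in columns}
    selected = [col for col in columns if variances[col] > threshold]
    return {
        "candidate_columns": columns,
        "selected_columns": selected,
        "variances": variances,
        "threshold": threshold,
    }


table = {"constant": [1, 1, 1, 1], "varied": [1, 2, 3, 4]}
learned = fit_variance_threshold(table, ["constant", "varied"])
print(learned["variances"], learned["selected_columns"])
```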

CorrelationThreshold

Config:

  • threshold: float (default 0.95)
  • correlation_method: pearson | spearman | kendall (default pearson)
  • columns: list[str] (numeric)
  • drop_columns: bool

Learned params:

  • columns_to_drop, threshold, method, drop_columns
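One common way to derive columns_to_drop is a greedy pass over column pairs: whenever the absolute correlation of a pair exceeds the threshold, the later column is dropped. The sketch below implements that with Pearson correlation only. The pair ordering and tie-breaking used by the node are assumptions, and pearson and find_correlated_columns are hypothetical names.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation of two equal-length, non-constant columns."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def find_correlated_columns(table: dict, threshold: float = 0.95) -> list:
    """Greedily mark the later column of each highly correlated pair."""
    names = list(table)
    dropped = []
    for i, left in enumerate(names):
        if left in dropped:
            continue
        for right in names[i + 1:]:
            if right in dropped:
                continue
            if abs(pearson(table[left], table[right])) > threshold:
                dropped.append(right)
    return dropped


table = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
    "c": [5, 1, 4, 2, 3],
}
columns_to_drop = find_correlated_columns(table)
print(columns_to_drop)
```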

UnivariateSelection

Config:

  • target_column: str (if y not passed as tuple)
  • problem_type: auto | classification | regression
  • method: select_k_best | select_percentile | select_fpr | select_fdr | select_fwe | generic_univariate_select
  • selector parameters (depending on method): k, percentile, alpha, mode, param
  • scoring: score_func (e.g., f_classif, mutual_info_classif, …)
  • drop_columns: bool

Learned params:

  • selected_columns, candidate_columns, scores, pvalues (when available), plus selector config

ModelBasedSelection

Config:

  • target_column: str
  • problem_type: auto | classification | regression
  • method: select_from_model | rfe
  • estimator: auto | logistic_regression | random_forest | linear_regression
  • For select_from_model: threshold, max_features
  • For RFE: n_features_to_select, step
  • drop_columns: bool

Learned params:

  • selected_columns, candidate_columns, and method-specific metadata

feature_selection

A higher-level facade node that dispatches to the selection implementations.

Resampling

Oversampling

Config:

  • method: smote | adasyn | borderline_smote | svm_smote | kmeans_smote | smote_tomek
  • target_column: required if y is not provided as tuple
  • sampling_strategy: auto or dict
  • random_state: int
  • method-specific keys: k_neighbors, m_neighbors, kind, out_step, cluster_balance_threshold, density_exponent, n_jobs

Learned params: none (passes through config).

Undersampling

Config:

  • method: random_under_sampling | nearmiss | tomek_links | edited_nearest_neighbours
  • target_column: required if y not provided as tuple
  • sampling_strategy, random_state, replacement, version, n_neighbors, kind_sel, n_jobs

Learned params: none.

Inspection

DatasetProfile

Captures basic dataset stats without modifying data.

Config: none.

Learned params:

  • profile: rows/columns/dtypes/missing/numeric_stats

DataSnapshot

Captures the first N rows without modifying data.

Config:

  • n_rows: int (default 5)

Learned params:

  • snapshot: list[dict]
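The snapshot shape (list[dict], one dict per row) can be sketched as below for a plain dict of columns. take_snapshot is a hypothetical name, and the real node presumably works on a DataFrame; the input table is left untouched.

```python
def take_snapshot(table: dict, n_rows: int = 5) -> list:
    """Return the first n_rows rows as a list of {column: value} dicts."""
    columns = list(table)
    n_available = min((len(values) for values in table.values()), default=0)
    return [
        {col: table[col][i] for col in columns}
        for i in range(min(n_rows, n_available))
    ]


table = {"age": [31, 27, 45], "city": ["Oslo", "Rome", "Lima"]}
snapshot = take_snapshot(table, n_rows=2)
print(snapshot)
```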