Preprocessing Nodes
This page documents the preprocessing node types supported by FeatureEngineer.
Step schema
Every preprocessing step in the pipeline config uses:
{
  "name": "...",
  "transformer": "TransformerType",
  "params": { ... }
}
The params object is passed into the node's Calculator fit().
Splitters
Example step:
{"name": "split", "transformer": "TrainTestSplitter", "params": {"test_size": 0.2, "random_state": 42, "target_column": "target"}}
TrainTestSplitter
Splits a DataFrame (or (X, y) tuple) into SplitDataset(train, test, validation).
Config (params):
- test_size: float (default 0.2)
- validation_size: float (default 0.0)
- random_state: int (default 42)
- shuffle: bool (default True)
- stratify: bool (default False)
- target_column: str (required only when splitting a DataFrame and using stratify)
Learned params: none (passes through config).
feature_target_split
Splits a DataFrame into (X, y) (or applies the split to each SplitDataset split).
Config:
- target_column: str (required)
Learned params: none.
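Example step (the target column name is a placeholder):
{"name": "xy_split", "transformer": "feature_target_split", "params": {"target_column": "target"}}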
Cleaning
Example step:
{
  "name": "clean_text",
  "transformer": "TextCleaning",
  "params": {
    "columns": ["free_text"],
    "operations": [
      {"op": "trim", "mode": "both"},
      {"op": "case", "mode": "lower"},
      {"op": "regex", "mode": "collapse_whitespace"}
    ]
  }
}
TextCleaning
Applies a list of string operations.
Config:
- columns: list[str] (optional; auto-detects text-like columns)
- operations: list[dict], where each dict is one of:
  - { "op": "trim", "mode": "both"|"leading"|"trailing" }
  - { "op": "case", "mode": "lower"|"upper"|"title"|"sentence" }
  - { "op": "remove_special", "mode": "keep_alphanumeric"|"keep_alphanumeric_space"|"letters_only"|"digits_only", "replacement": "" }
  - { "op": "regex", "mode": "collapse_whitespace"|"extract_digits"|"custom", "pattern": "...", "repl": "..." }
Learned params:
columns, operations
ValueReplacement
Replaces values in selected columns.
Config:
- columns: list[str]
- Either:
  - mapping: dict (a global mapping, or dict[col -> mapping])
  - to_replace + value
  - replacements: a list of {old, new} pairs (converted into a mapping)
Learned params:
columns, mapping, to_replace, value
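Example step (column name and mapping values are placeholders):
{"name": "replace_status", "transformer": "ValueReplacement", "params": {"columns": ["status"], "mapping": {"unknown": "other", "n/a": "other"}}}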
AliasReplacement
Normalizes common textual aliases (boolean/country/custom).
Config:
- columns: list[str] (optional; auto-detects text-like columns)
- alias_type: boolean | country | custom (also supports the legacy mode key)
- custom_map: dict[str, str] (also supports the legacy custom_pairs key)
Learned params:
columns, alias_type, custom_map
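Example step (the column name is a placeholder):
{"name": "normalize_bools", "transformer": "AliasReplacement", "params": {"columns": ["is_active"], "alias_type": "boolean"}}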
InvalidValueReplacement
Replaces invalid numeric values.
Config:
- columns: list[str]
- rule: negative | zero | negative_to_nan | custom_range (also supports the legacy mode key)
- replacement: any (default NaN)
- min_value / max_value: used by custom_range
Learned params:
columns, rule, replacement, min_value, max_value
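Example step (illustrative; out-of-range ages are replaced by the default NaN):
{"name": "fix_invalid_ages", "transformer": "InvalidValueReplacement", "params": {"columns": ["age"], "rule": "custom_range", "min_value": 0, "max_value": 120}}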
Drop & Missing
Deduplicate
Config:
- subset: list[str] | None
- keep: first | last | none (none is mapped to False)
Learned params:
subset, keep
DropMissingColumns
Config:
- missing_threshold: float (percentage; if > 0, drops columns with missing % >= threshold)
- columns: list[str] (explicit columns to drop)
Learned params:
columns_to_drop, threshold
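Example step (drops columns that are at least 60% missing; the threshold is illustrative):
{"name": "drop_sparse", "transformer": "DropMissingColumns", "params": {"missing_threshold": 60.0}}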
DropMissingRows
Config:
- subset: list[str] | None
- how: any | all (ignored if threshold is provided)
- threshold: int | None (minimum number of non-null values)
Learned params:
subset, how, threshold
MissingIndicator
Adds {col}_missing indicator columns.
Config:
- columns: list[str] (optional; defaults to all columns with any missing values)
Learned params:
columns
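Example step (the column name is a placeholder):
{"name": "flag_missing", "transformer": "MissingIndicator", "params": {"columns": ["income"]}}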
Imputation
Example step:
{"name": "impute", "transformer": "SimpleImputer", "params": {"strategy": "median", "columns": ["age"]}}
SimpleImputer
Config:
- strategy: mean | median | most_frequent | constant (also accepts mode)
- columns: list[str] (optional; numeric auto-detection for mean/median)
- fill_value: any (used for constant)
Learned params:
- columns
- strategy
- fill_values: dict[col -> value]
- missing_counts: dict[col -> count]
- total_missing: int
KNNImputer
Config:
- columns: list[str] (numeric)
- n_neighbors: int (default 5)
- weights: uniform | distance
Learned params:
- columns
- imputer_object (sklearn object; pickled in the pipeline)
- n_neighbors, weights
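Example step (columns and neighbor count are illustrative):
{"name": "impute_knn", "transformer": "KNNImputer", "params": {"columns": ["age", "income"], "n_neighbors": 5, "weights": "distance"}}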
IterativeImputer
Config:
- columns: list[str] (numeric)
- max_iter: int (default 10)
- estimator: BayesianRidge | DecisionTree | ExtraTrees | KNeighbors
Learned params:
- columns
- imputer_object (sklearn object; pickled in the pipeline)
- estimator
Encoding
Example step:
{"name": "encode", "transformer": "OneHotEncoder", "params": {"columns": ["city"], "drop_original": True, "handle_unknown": "ignore"}}
OneHotEncoder
Config:
- columns: list[str] (optional; auto-detects categorical columns)
- drop_first: bool (default False)
- max_categories: int (default 20)
- handle_unknown: ignore | error (default ignore)
- drop_original: bool (default True)
- include_missing: bool (default False)
Learned params:
- columns
- encoder_object (sklearn OneHotEncoder)
- feature_names: list[str]
- drop_original, include_missing
DummyEncoder
Config:
- columns: list[str]
- drop_first: bool
Learned params:
- columns
- categories: dict[col -> list[str]]
- drop_first
OrdinalEncoder
Config:
- columns: list[str]
- handle_unknown: str (default use_encoded_value)
- unknown_value: int/float (default -1)
Learned params:
- columns
- encoder_object (sklearn OrdinalEncoder)
- categories_count
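Example step (the column name is a placeholder):
{"name": "encode_ordinal", "transformer": "OrdinalEncoder", "params": {"columns": ["size"], "handle_unknown": "use_encoded_value", "unknown_value": -1}}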
LabelEncoder
Encodes either target or selected feature columns.
Config:
- columns: optional list[str]
  - if omitted, encodes the provided target y
  - if provided, encodes those feature columns (and also the target, if included)
Learned params:
- encoders: dict[col or "target" -> sklearn LabelEncoder]
- classes_count
TargetEncoder
Requires a target series (y).
Config:
- columns: list[str]
- smooth: auto or numeric
- target_type: auto or an explicit type
Learned params:
- columns
- encoder_object (sklearn TargetEncoder)
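Example step (the column name is a placeholder; y must be available at this point in the pipeline):
{"name": "encode_city", "transformer": "TargetEncoder", "params": {"columns": ["city"], "smooth": "auto"}}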
HashEncoder
Config:
- columns: list[str]
- n_features: int (default 10)
Learned params:
columns, n_features
Scaling
All scaling nodes accept columns (optional; numeric columns are auto-detected) and store the fitted numeric arrays as learned params.
StandardScaler
Config:
- columns: list[str]
- with_mean: bool (default True)
- with_std: bool (default True)
Learned params:
columns, mean, scale, var, with_mean, with_std
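Example step (the columns are placeholders):
{"name": "scale", "transformer": "StandardScaler", "params": {"columns": ["age", "income"], "with_mean": true, "with_std": true}}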
MinMaxScaler
Config:
- columns: list[str]
- feature_range: tuple (default (0, 1))
Learned params:
columns, min, scale, data_min, data_max, feature_range
RobustScaler
Config:
- columns: list[str]
- quantile_range: tuple (default (25.0, 75.0))
- with_centering: bool (default True)
- with_scaling: bool (default True)
Learned params:
columns, center, scale, quantile_range, with_centering, with_scaling
MaxAbsScaler
Config:
- columns: list[str]
Learned params:
columns, scale, max_abs
Outliers
IQR
Filters rows outside per-column IQR bounds.
Config:
- columns: list[str]
- multiplier: float (default 1.5)
Learned params:
- bounds: dict[col -> {lower, upper}]
- warnings
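Example step (column and multiplier are illustrative):
{"name": "drop_outliers", "transformer": "IQR", "params": {"columns": ["income"], "multiplier": 1.5}}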
ZScore
Config:
- columns: list[str]
- threshold: float (default 3.0)
Learned params:
- stats: dict[col -> {mean, std}]
- threshold, warnings
Winsorize
Clips values into per-column percentile bounds.
Config:
- columns: list[str]
- lower_percentile: float (default 5.0)
- upper_percentile: float (default 95.0)
Learned params:
- bounds: dict[col -> {lower, upper}]
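Example step (the percentiles are illustrative):
{"name": "winsorize_income", "transformer": "Winsorize", "params": {"columns": ["income"], "lower_percentile": 1.0, "upper_percentile": 99.0}}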
ManualBounds
Filters rows outside user-provided bounds.
Config:
- bounds: dict[col -> {lower, upper}]
Learned params:
bounds
EllipticEnvelope
Learns a per-column EllipticEnvelope model and filters outliers.
Config:
- columns: list[str]
- contamination: float (default 0.01)
Learned params:
- models: dict[col -> sklearn model]
- contamination, warnings
Transformations
PowerTransformer
Config:
- columns: list[str]
- method: yeo-johnson | box-cox
- standardize: bool
Learned params:
columns, lambdas, method, standardize, scaler_params
SimpleTransformation
Config:
- transformations: list of {column, method, clip_threshold?}
  - methods include log, square_root, cube_root, reciprocal, square, exponential
Learned params:
transformations (passes through)
GeneralTransformation
Config:
- transformations: list of {column, method, clip_threshold?}
  - methods include the power transforms (box-cox, yeo-johnson) and the simple methods
Learned params:
transformations, with fitted lambdas / scaler_params where applicable
Bucketing (Binning)
GeneralBinning
Creates binned features with configurable strategies.
Config:
- columns: list[str] (numeric)
- strategy: equal_width | equal_frequency | kmeans | custom | kbins
- n_bins, plus strategy-specific keys:
  - equal_width_bins, equal_frequency_bins, duplicates
  - kbins_n_bins, kbins_strategy
  - custom_bins: dict[col -> edges]
  - custom_labels: dict[col -> labels]
- output formatting: output_suffix, drop_original, label_format, missing_strategy, missing_label, include_lowest, precision
Learned params:
- bin_edges: dict[col -> edges]
- the output formatting settings
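Example step (illustrative equal-width binning; column and bin count are placeholders):
{"name": "bin_age", "transformer": "GeneralBinning", "params": {"columns": ["age"], "strategy": "equal_width", "n_bins": 5, "drop_original": false}}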
CustomBinning
Config:
- columns: list[str]
- bins: list[float] (shared edges)
- plus the output formatting keys (same as GeneralBinning)
Learned params:
bin_edges
KBinsDiscretizer
Wrapper around GeneralBinning with a KBins-style interface.
Config:
- columns: list[str]
- n_bins: int
- strategy: uniform | quantile | kmeans
Learned params:
bin_edges
Casting
Casting
Config:
- Either:
  - column_types: dict[col -> dtype], or
  - columns + target_type
- coerce_on_error: bool (default True)
Learned params:
type_map, coerce_on_error
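Example step (the column names and dtypes are illustrative):
{"name": "cast_types", "transformer": "Casting", "params": {"column_types": {"age": "int64", "signup_date": "datetime64[ns]"}, "coerce_on_error": true}}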
Feature Generation
PolynomialFeatures
(Alias: PolynomialFeaturesNode)
Config:
- columns: list[str]
- auto_detect: bool
- degree: int
- interaction_only: bool
- include_bias: bool
- include_input_features: bool
- output_prefix: str
Learned params:
columns, degree, interaction_only, include_bias, include_input_features, output_prefix, feature_names
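Example step (columns and degree are illustrative):
{"name": "poly", "transformer": "PolynomialFeatures", "params": {"columns": ["age", "income"], "degree": 2, "interaction_only": false, "include_bias": false}}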
FeatureGeneration
(Aliases: FeatureMath, FeatureGenerationNode)
Config:
- operations: list[dict]
- epsilon: float (default 1e-9)
- allow_overwrite: bool
Learned params:
operations, epsilon, allow_overwrite
Feature Selection
VarianceThreshold
Config:
- threshold: float (default 0.0)
- columns: list[str] (optional; numeric)
- drop_columns: bool (default True)
Learned params:
candidate_columns, selected_columns, variances, threshold, drop_columns
CorrelationThreshold
Config:
- threshold: float (default 0.95)
- correlation_method: pearson | spearman | kendall (default pearson)
- columns: list[str] (numeric)
- drop_columns: bool
Learned params:
columns_to_drop, threshold, method, drop_columns
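Example step (the threshold is illustrative):
{"name": "drop_correlated", "transformer": "CorrelationThreshold", "params": {"threshold": 0.9, "correlation_method": "pearson", "drop_columns": true}}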
UnivariateSelection
Config:
- target_column: str (if y is not passed as a tuple)
- problem_type: auto | classification | regression
- method: select_k_best | select_percentile | select_fpr | select_fdr | select_fwe | generic_univariate_select
- selector parameters (depending on method): k, percentile, alpha, mode, param
- scoring: score_func (e.g., f_classif, mutual_info_classif, …)
- drop_columns: bool
Learned params:
selected_columns, candidate_columns, scores, pvalues (when available), plus the selector config
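Example step (illustrative; keeps the 10 best features by f_classif):
{"name": "select_kbest", "transformer": "UnivariateSelection", "params": {"target_column": "target", "problem_type": "classification", "method": "select_k_best", "k": 10, "score_func": "f_classif", "drop_columns": true}}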
ModelBasedSelection
Config:
- target_column: str
- problem_type: auto | classification | regression
- method: select_from_model | rfe
- estimator: auto | logistic_regression | random_forest | linear_regression
- for select_from_model: threshold, max_features
- for rfe: n_features_to_select, step
- drop_columns: bool
Learned params:
selected_columns, candidate_columns, and method-specific metadata
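Example step (an illustrative RFE configuration):
{"name": "select_rfe", "transformer": "ModelBasedSelection", "params": {"target_column": "target", "method": "rfe", "estimator": "random_forest", "n_features_to_select": 10}}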
feature_selection
A higher-level facade node that dispatches to the selection implementations.
Resampling
Oversampling
Config:
- method: smote | adasyn | borderline_smote | svm_smote | kmeans_smote | smote_tomek
- target_column: required if y is not provided as a tuple
- sampling_strategy: auto or dict
- random_state: int
- method-specific keys: k_neighbors, m_neighbors, kind, out_step, cluster_balance_threshold, density_exponent, n_jobs
Learned params: none (passes through config).
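Example step (an illustrative SMOTE configuration):
{"name": "balance", "transformer": "Oversampling", "params": {"method": "smote", "target_column": "target", "sampling_strategy": "auto", "random_state": 42, "k_neighbors": 5}}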
Undersampling
Config:
- method: random_under_sampling | nearmiss | tomek_links | edited_nearest_neighbours
- target_column: required if y is not provided as a tuple
- sampling_strategy, random_state, replacement, version, n_neighbors, kind_sel, n_jobs
Learned params: none.
Inspection
DatasetProfile
Captures basic dataset stats without modifying data.
Config: none.
Learned params:
- profile: rows / columns / dtypes / missing / numeric_stats
DataSnapshot
Captures the first N rows without modifying data.
Config:
- n_rows: int (default 5)
Learned params:
- snapshot: list[dict]