Modeling Nodes
This page documents modeling configuration for SkyulfPipeline.
Common config shape
SkyulfPipeline expects a modeling block like:
{
  "type": "logistic_regression",
  "node_id": "model_node",      # optional
  "execution_mode": "merge",    # optional; "merge" (default) or "parallel"
  "params": { ... }             # optional; estimator hyperparameters
}
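As a rough illustration, the block above could be written as a Python dict and sanity-checked before it is handed to the pipeline. The `validate_modeling_config` helper is hypothetical (it is not part of Skyulf) and only checks the fields and values documented on this page; treating `type` as required is an assumption based on it being the only field not marked optional.

```python
# Hypothetical helper, not part of Skyulf: checks only what this page documents.
ALLOWED_EXECUTION_MODES = {"merge", "parallel"}


def validate_modeling_config(config: dict) -> dict:
    """Return a copy of config with the documented execution_mode default filled in."""
    if not isinstance(config.get("type"), str):
        raise ValueError("'type' must be a string naming a supported model")

    mode = config.get("execution_mode", "merge")  # "merge" is the documented default
    if mode not in ALLOWED_EXECUTION_MODES:
        raise ValueError(f"execution_mode must be one of {sorted(ALLOWED_EXECUTION_MODES)}")

    # "params" is optional, but when present it should be a dict of hyperparameters.
    if "params" in config and not isinstance(config["params"], dict):
        raise ValueError("'params' must be a dict")

    return {**config, "execution_mode": mode}


config = validate_modeling_config(
    {"type": "logistic_regression", "node_id": "model_node", "params": {"C": 1.0}}
)
```

Here `config["execution_mode"]` ends up as `"merge"` because the field was omitted, while an unsupported value such as `"serial"` raises a `ValueError`.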
execution_mode (v0.4.0+)
When a training node has 2+ incoming connections:
| Value | Behavior |
|---|---|
| "merge" (default) | Combine all upstream DataFrames into one before training |
| "parallel" | Each incoming branch runs as a separate job |
This field is set via the Merge/Parallel toggle on training nodes in the canvas UI.
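One way to picture the two modes is as a planning step that turns incoming branches into training jobs. The sketch below is a hypothetical model of that behavior, not Skyulf's scheduler; `plan_training_jobs` and the branch names are made up for illustration.

```python
def plan_training_jobs(incoming_branches: list[str], execution_mode: str = "merge") -> list[list[str]]:
    """Return one list of input branches per training job."""
    if execution_mode == "merge":
        # All upstream outputs are combined and feed a single training job.
        return [list(incoming_branches)]
    if execution_mode == "parallel":
        # Each upstream branch gets its own training job.
        return [[branch] for branch in incoming_branches]
    raise ValueError(f"unknown execution_mode: {execution_mode!r}")


branches = ["scaled_features", "encoded_features"]
merged_jobs = plan_training_jobs(branches, "merge")
parallel_jobs = plan_training_jobs(branches, "parallel")
```

With two incoming branches, `merged_jobs` contains a single job that receives both inputs, while `parallel_jobs` contains two jobs with one input each.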
The sklearn wrapper supports both:
- Nested params (preferred): { "params": {"C": 1.0} }
- Flat params (legacy): { "C": 1.0, "type": "..." }
Example (RandomForestClassifier):
{
  "type": "random_forest_classifier",
  "params": {"n_estimators": 50, "random_state": 42}
}
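To show how the nested and flat forms can describe the same model, here is a minimal sketch that extracts hyperparameters from either layout. It is not the wrapper's actual code: `extract_hyperparameters` is a hypothetical helper, and the set of keys treated as metadata is an assumption based on the fields shown on this page.

```python
# Keys assumed to be config metadata rather than hyperparameters
# (an assumption based on the fields documented on this page).
RESERVED_KEYS = {"type", "node_id", "execution_mode", "params"}


def extract_hyperparameters(config: dict) -> dict:
    """Return estimator hyperparameters from a nested or flat (legacy) config."""
    if "params" in config:
        # Nested form (preferred): hyperparameters live under "params".
        return dict(config["params"])
    # Flat form (legacy): everything that is not metadata is a hyperparameter.
    return {key: value for key, value in config.items() if key not in RESERVED_KEYS}


nested = {"type": "random_forest_classifier", "params": {"n_estimators": 50, "random_state": 42}}
flat = {"type": "random_forest_classifier", "n_estimators": 50, "random_state": 42}
```

Both `nested` and `flat` yield the same hyperparameter dict under this sketch, which is why the nested form is easier to reason about: it never has to guess which keys are metadata.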
Classification
logistic_regression
Backed by sklearn.linear_model.LogisticRegression.
Defaults:
- max_iter=1000, solver=lbfgs
- random_state=42
Learned params:
- fitted sklearn estimator (stored in-memory and pickled when saving the pipeline)
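The defaults listed for each model are starting values. The sketch below assumes that user-supplied `params` override them key by key, which is the usual meaning of a default but is not spelled out on this page; `resolve_params` is a hypothetical helper rather than a Skyulf function.

```python
# Defaults documented above for logistic_regression.
LOGISTIC_REGRESSION_DEFAULTS = {"max_iter": 1000, "solver": "lbfgs", "random_state": 42}


def resolve_params(defaults: dict, user_params: dict) -> dict:
    """Overlay user-supplied hyperparameters on top of the node defaults."""
    # Later keys win, so anything in user_params replaces the default value.
    return {**defaults, **user_params}


resolved = resolve_params(LOGISTIC_REGRESSION_DEFAULTS, {"C": 0.5, "max_iter": 200})
```

Under that assumption, `resolved` keeps `solver` and `random_state` from the defaults, takes `max_iter=200` from the user, and adds `C=0.5`.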
random_forest_classifier
Backed by sklearn.ensemble.RandomForestClassifier.
Defaults include:
- n_estimators=50, max_depth=10
- min_samples_split=5, min_samples_leaf=2
- n_jobs=-1, random_state=42
Learned params:
- fitted sklearn estimator
svc
Backed by sklearn.svm.SVC.
Defaults:
- C=1.0, kernel=rbf, gamma=scale
- probability=True, random_state=42
k_neighbors_classifier
Backed by sklearn.neighbors.KNeighborsClassifier.
Defaults:
- n_neighbors=5, weights=uniform
- algorithm=auto, n_jobs=-1
decision_tree_classifier
Backed by sklearn.tree.DecisionTreeClassifier.
Defaults:
- max_depth=None, min_samples_split=2
- criterion=gini, random_state=42
gradient_boosting_classifier
Backed by sklearn.ensemble.GradientBoostingClassifier.
Defaults:
- n_estimators=100, learning_rate=0.1
- max_depth=3, random_state=42
adaboost_classifier
Backed by sklearn.ensemble.AdaBoostClassifier.
Defaults:
- n_estimators=50, learning_rate=1.0
- random_state=42
xgboost_classifier
Backed by xgboost.XGBClassifier.
Defaults:
- n_estimators=100, max_depth=6
- learning_rate=0.3, n_jobs=-1
- random_state=42
gaussian_nb
Backed by sklearn.naive_bayes.GaussianNB.
Defaults:
- var_smoothing=1e-9
Regression
ridge_regression
Backed by sklearn.linear_model.Ridge.
Defaults:
- alpha=1.0, solver=auto
- random_state=42
lasso_regression
Backed by sklearn.linear_model.Lasso.
Defaults:
- alpha=1.0, selection=cyclic
- random_state=42
elasticnet_regression
Backed by sklearn.linear_model.ElasticNet.
Defaults:
- alpha=1.0, l1_ratio=0.5
- selection=cyclic, random_state=42
random_forest_regressor
Backed by sklearn.ensemble.RandomForestRegressor.
Defaults include:
- n_estimators=50, max_depth=10
- min_samples_split=5, min_samples_leaf=2
- n_jobs=-1, random_state=42
svr
Backed by sklearn.svm.SVR.
Defaults:
- C=1.0, kernel=rbf, gamma=scale
k_neighbors_regressor
Backed by sklearn.neighbors.KNeighborsRegressor.
Defaults:
- n_neighbors=5, weights=uniform
- algorithm=auto, n_jobs=-1
decision_tree_regressor
Backed by sklearn.tree.DecisionTreeRegressor.
Defaults:
- max_depth=None, min_samples_split=2
- criterion=squared_error, random_state=42
gradient_boosting_regressor
Backed by sklearn.ensemble.GradientBoostingRegressor.
Defaults:
- n_estimators=100, learning_rate=0.1
- max_depth=3, random_state=42
adaboost_regressor
Backed by sklearn.ensemble.AdaBoostRegressor.
Defaults:
- n_estimators=50, learning_rate=1.0
- random_state=42
xgboost_regressor
Backed by xgboost.XGBRegressor.
Defaults:
- n_estimators=100, max_depth=6
- learning_rate=0.3, n_jobs=-1
- random_state=42
Hyperparameter tuning
hyperparameter_tuner
This mode wraps a base model and runs a search over its hyperparameters.
Config:
- type: hyperparameter_tuner
- base_model: dict with a supported base model type (e.g., logistic regression)
- tuning options such as:
  - strategy: grid | random | halving_grid | halving_random | optuna (availability depends on installed packages)
  - search_space: dict of parameter → list/range
  - metric: e.g., accuracy, f1, roc_auc, rmse, r2
  - cv_enabled, cv_type, cv_folds, random_state
Learned params:
- a tuple (best_model, tuning_result), where best_model is a fitted estimator.
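To make the config keys concrete, here is a hedged sketch of a tuner config together with the candidates an exhaustive `grid` strategy would enumerate. Placing the tuning options at the top level of the block is an assumption, as are the specific search-space values; `expand_grid` is an illustrative helper, not Skyulf's search implementation.

```python
from itertools import product

# Illustrative config; the exact nesting of the tuning options is an assumption.
tuner_config = {
    "type": "hyperparameter_tuner",
    "base_model": {"type": "logistic_regression"},
    "strategy": "grid",
    "search_space": {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]},
    "metric": "accuracy",
    "cv_enabled": True,
    "cv_type": "k_fold",
    "cv_folds": 5,
    "random_state": 42,
}


def expand_grid(search_space: dict) -> list[dict]:
    """Enumerate every parameter combination, as an exhaustive grid search would."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(tuner_config["search_space"])
```

Three values of `C` and two solvers give six candidates. The `random` and halving strategies exist precisely because this count grows multiplicatively with every added parameter.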
Cross-validation
StatefulEstimator.cross_validate() can perform CV on the train split and returns aggregated fold metrics.
Five CV methods are supported:
| Key | Strategy | Notes |
|---|---|---|
| k_fold | K-Fold | Default. Shuffled. |
| stratified_k_fold | Stratified K-Fold | Preserves class distribution (classification). Falls back to K-Fold for regression. |
| shuffle_split | Shuffle Split | Random 80/20 splits; samples may repeat across folds. |
| time_series_split | Time Series Split | Expanding window. Auto-sorts by datetime column if cv_time_column is set. |
| nested_cv | Nested CV | Outer loop evaluates generalization; inner 3-fold loop checks HP stability. With advanced tuning, post-tuning eval auto-downgrades to stratified_k_fold/k_fold since the inner loop already ran during the search. |
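The expanding-window idea behind time_series_split can be sketched in plain Python. The fold sizing below follows the same arithmetic as scikit-learn's `TimeSeriesSplit` with default settings, but it is a simplified illustration and not Skyulf's code; `expanding_window_splits` is a hypothetical name.

```python
def expanding_window_splits(n_samples: int, n_folds: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs where the training window only grows."""
    test_size = n_samples // (n_folds + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of folds")

    splits = []
    first_test_start = n_samples - n_folds * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))                          # everything before the test block
        test = list(range(test_start, test_start + test_size))   # the next block in time
        splits.append((train, test))
    return splits


splits = expanding_window_splits(n_samples=6, n_folds=3)
```

Every training index precedes every test index in each fold, which is why the rows must be in chronological order before splitting.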
Config keys: cv_enabled, cv_type, cv_folds, cv_time_column.
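As an illustration of these keys and of the stratified fallback described in the table, the sketch below resolves which strategy would effectively run for a given task. `resolve_cv_type` and its `task` argument are hypothetical and only model the documented rule.

```python
def resolve_cv_type(cv_config: dict, task: str) -> str:
    """Return the CV strategy that would run for a 'classification' or 'regression' task."""
    cv_type = cv_config.get("cv_type", "k_fold")  # k_fold is the documented default
    if cv_type == "stratified_k_fold" and task == "regression":
        # Stratification needs class labels, so regression falls back to plain K-Fold.
        return "k_fold"
    return cv_type


cv_config = {
    "cv_enabled": True,
    "cv_type": "stratified_k_fold",
    "cv_folds": 5,
}

classification_cv = resolve_cv_type(cv_config, "classification")
regression_cv = resolve_cv_type(cv_config, "regression")
```

The same config therefore yields `stratified_k_fold` for a classifier and `k_fold` for a regressor.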
See the Cross-Validation Guide for details.
Note: SkyulfPipeline performs modeling through the same building blocks (a calculator + applier); StatefulEstimator
is the lightweight wrapper exposed for low-level usage.