FAQ & Comparison

Frequently asked questions and how Skyulf compares to other ML platforms.


General

What is Skyulf?

Skyulf is a self-hosted, privacy-first MLOps platform that combines:

  • A Python library (skyulf-core) for reproducible ML pipelines.
  • A FastAPI backend for data management, pipeline execution, and model serving.
  • A React-based visual ML Canvas for building pipelines without writing code.

Who is Skyulf for?

  • Data Scientists who want a visual pipeline builder with proper leakage prevention.
  • ML Engineers who need a self-hosted alternative to cloud ML platforms.
  • Teams who need reproducible, auditable ML workflows without vendor lock-in.
  • Students/Researchers who want to learn ML engineering best practices.

Is Skyulf free?

Yes. Skyulf is open-source. See the LICENSE for details.

Can I use skyulf-core without the web platform?

Absolutely. skyulf-core is a standalone PyPI package. Install it with pip install skyulf-core and use it like any Python library. The web platform is optional.


How Skyulf differs from other tools

Skyulf vs. MLflow

| Aspect | Skyulf | MLflow |
| --- | --- | --- |
| Focus | End-to-end pipeline (preprocessing + training + deploy) | Experiment tracking and model registry |
| Pipeline building | Visual canvas + config-driven | Code-only (no visual builder) |
| Preprocessing | 30+ built-in nodes (imputation, encoding, scaling, outliers, feature selection, resampling) | None; you bring your own preprocessing |
| Leakage prevention | Calculator/Applier pattern enforces train-only statistics | Not addressed |
| Deployment | Built into the platform | Separate deployment step |
| Self-hosted | Yes | Yes |

Summary: MLflow is great for experiment tracking. Skyulf covers the full pipeline from raw data to deployed model, including preprocessing.

Skyulf vs. Kubeflow / ZenML

| Aspect | Skyulf | Kubeflow / ZenML |
| --- | --- | --- |
| Infrastructure | Single machine, Docker optional | Kubernetes required (Kubeflow) or multi-runtime (ZenML) |
| Setup complexity | pip install or docker-compose up | Significant infrastructure setup |
| Visual builder | Drag-and-drop React Flow canvas | DAG visualizations (read-only) |
| Target audience | Small-to-medium teams, individuals | Enterprise orchestration at scale |
| Preprocessing | Built-in node library | BYO preprocessing code |

Summary: Kubeflow/ZenML excel at large-scale orchestration. Skyulf is simpler to set up and includes preprocessing out of the box.

Skyulf vs. scikit-learn Pipelines

| Aspect | Skyulf | scikit-learn Pipeline |
| --- | --- | --- |
| Config format | JSON-compatible dicts (serializable, storable) | Python objects (code-defined) |
| State management | Explicit params dict (inspectable, portable) | Hidden in instance attributes (self.) |
| Leakage safety | Enforced by architecture (Calculator learns on train only) | Manual responsibility (fit on train, transform on test) |
| Visual builder | Yes (ML Canvas) | No |
| Model variety | 20 models + hyperparameter tuning | Full scikit-learn ecosystem |
| EDA/Profiling | Built-in analyzer + visualizer | None |

Summary: scikit-learn is the gold standard for ML in Python. Skyulf wraps scikit-learn models and adds config-driven pipelines, leakage prevention, and a visual interface.
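
The Calculator/Applier split and the explicit params dict from the table can be illustrated with a minimal sketch. The class names `ToyScalerCalculator` and `ToyScalerApplier` and their method signatures are hypothetical stand-ins, not skyulf-core's actual API. The sketch only shows the idea that statistics are learned from the training split and carried in a JSON-serializable dict.

```python
import json
import statistics


class ToyScalerCalculator:
    """Hypothetical calculator: learns statistics from training data only."""

    def fit(self, train_values):
        mean = statistics.fmean(train_values)
        # Population standard deviation; fall back to 1.0 for constant columns.
        std = statistics.pstdev(train_values) or 1.0
        return {"mean": mean, "std": std}


class ToyScalerApplier:
    """Hypothetical applier: stateless, uses only the params dict it is given."""

    def apply(self, values, params):
        return [(v - params["mean"]) / params["std"] for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 100.0]  # never seen by the calculator

params = ToyScalerCalculator().fit(train)

# The learned state is plain data, so it survives a JSON round trip.
restored = json.loads(json.dumps(params))

scaled_train = ToyScalerApplier().apply(train, restored)
scaled_test = ToyScalerApplier().apply(test, restored)
```

Because the applier never computes statistics itself, test rows cannot influence the mean or standard deviation, which is the property the "Leakage safety" row describes.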

Skyulf vs. AutoML (Auto-sklearn, FLAML, H2O)

| Aspect | Skyulf | AutoML tools |
| --- | --- | --- |
| Approach | Manual or semi-automated pipeline building | Fully automated model selection |
| Control | Full control over every preprocessing and modeling step | Black-box optimization |
| Tuning | Configurable (grid, random, Optuna, halving) | Automatic (built-in) |
| Transparency | Every step inspectable, every parameter visible | Results-focused, less transparent |
| Use case | When you need to understand and control your pipeline | When you want the fastest time-to-result |

Summary: AutoML tools optimize for speed. Skyulf optimizes for transparency and control.


Technical FAQ

What Python version is required?

Python 3.9 or higher. We recommend 3.10, 3.11, or 3.12.
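
A quick preflight check can confirm both the interpreter version and whether the package is installed. This is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are made up for illustration, and the only Skyulf-specific value is the distribution name `skyulf-core` mentioned above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # minimum version stated in this FAQ


def python_is_supported(version_info=None):
    """Return True if the interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name="skyulf-core"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", python_is_supported())
print("skyulf-core version:", installed_version())
```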

Does Skyulf support GPU training?

Not directly. Models use scikit-learn (CPU) and XGBoost (which supports GPU if configured). There is no built-in PyTorch/TensorFlow integration.

Can I add my own preprocessing nodes?

Yes. Implement a Calculator and Applier, decorate with @node_meta and @NodeRegistry.register. See Extending Skyulf-Core.
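
Registration of this kind usually follows a decorator-registry pattern. The sketch below uses a toy registry to show the general shape only: `ToyRegistry`, `ClipCalculator`, `ClipApplier`, and every signature here are hypothetical, and the real arguments of `@node_meta` and `@NodeRegistry.register` are the ones documented in Extending Skyulf-Core.

```python
class ToyRegistry:
    """Hypothetical stand-in for a node registry (not skyulf-core's NodeRegistry)."""

    _nodes = {}

    @classmethod
    def register(cls, key, applier):
        """Class decorator: map a string key to a (calculator, applier) pair."""
        def decorator(calculator_cls):
            if key in cls._nodes:
                raise ValueError(f"node key already registered: {key}")
            cls._nodes[key] = (calculator_cls, applier)
            return calculator_cls
        return decorator

    @classmethod
    def get(cls, key):
        return cls._nodes[key]


class ClipApplier:
    """Applies learned bounds; holds no state of its own."""

    def apply(self, values, params):
        lo, hi = params["lower"], params["upper"]
        return [min(max(v, lo), hi) for v in values]


@ToyRegistry.register("toy_clip", applier=ClipApplier)
class ClipCalculator:
    """Learns clipping bounds from the training data."""

    def fit(self, train_values):
        return {"lower": min(train_values), "upper": max(train_values)}


# A config can now refer to the node by its string key.
calculator_cls, applier_cls = ToyRegistry.get("toy_clip")
params = calculator_cls().fit([2, 5, 9])
clipped = applier_cls().apply([-1, 4, 12], params)
```

Looking nodes up by string key is what lets a JSON-compatible config name a node without importing its class.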

Can I add my own models?

Yes. Implement a BaseModelCalculator and BaseModelApplier, register with @NodeRegistry.register, and use the model key in your config. See Extending Skyulf-Core.

Does Skyulf handle feature engineering?

Yes. The preprocessing system includes 30+ nodes: imputation (Simple, KNN, Iterative), encoding (OneHot, Ordinal, Label, Target, Hash), scaling (Standard, MinMax, Robust, MaxAbs), outlier detection (IQR, ZScore, Winsorize, EllipticEnvelope), feature generation (Polynomial, Math), feature selection (Variance, Correlation, Univariate, Model-based), and more.

What data formats are supported?

  • skyulf-core library: Pandas DataFrames and Polars DataFrames (auto-detected).
  • Web platform: CSV upload via the data ingestion API, and database sources (PostgreSQL, etc.) via the ingestion endpoint.

How does the hybrid Polars/Pandas engine work?

Skyulf auto-detects whether your data is Polars or Pandas. Simple operations (scaling, imputation) run natively in Polars for speed. Complex operations (feature selection, some sklearn-backed nodes) temporarily bridge to Pandas/NumPy via Apache Arrow (near zero-copy). See Engine Mechanics.
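
The detect-then-bridge idea can be sketched roughly as follows. The function names `detect_engine` and `to_pandas_for_sklearn` are hypothetical and do not come from skyulf-core; the only library call relied on is `polars.DataFrame.to_pandas()`, which is a real Polars method. Detection here inspects the module that defines the object's type, which is one possible approach and not necessarily the one Skyulf uses.

```python
def detect_engine(data):
    """Guess the DataFrame library from the module that defines the object's type."""
    root_module = type(data).__module__.split(".")[0]
    if root_module in ("polars", "pandas"):
        return root_module
    raise TypeError(f"unsupported data type: {type(data).__name__}")


def to_pandas_for_sklearn(data):
    """Return a Pandas frame, converting only when the input is Polars."""
    if detect_engine(data) == "polars":
        # polars.DataFrame.to_pandas() performs the conversion via Apache Arrow.
        return data.to_pandas()
    return data
```

With both libraries installed, passing a Polars DataFrame should yield a Pandas DataFrame, while a Pandas DataFrame is returned unchanged.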

Is there an API for programmatic access?

Yes. The backend exposes a REST API with endpoints for data upload, pipeline execution, model deployment, and inference. See Platform Walkthrough.

How do I run multiple experiments in parallel? (v0.4.0+)

Connect two or more training nodes to your dataset, each with its own preprocessing path. A Run All Experiments button then appears in the toolbar; clicking it queues all branches at once and returns a separate job_id for each. You can also click Train on an individual node to run just that branch.

What's the difference between Merge and Parallel?

  • Merge: Combines data from multiple upstream branches into a single DataFrame before training. Use when you have parallel preprocessing paths feeding one model.
  • Parallel: Each incoming branch becomes a separate experiment job. Use when you want independent experiments.

Training nodes with 2+ inputs show a toggle to switch between modes. See the Multi-Path Pipelines guide.

How do I copy-paste nodes on the canvas? (v0.4.0+)

Select one or more nodes, press Ctrl+C (Cmd+C on Mac) to copy, then Ctrl+V (Cmd+V on Mac) to paste. Nodes are pasted with a position offset. Internal edges between selected nodes are preserved.
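
The paste behavior described above (new nodes at an offset, edges kept only when both ends are in the selection) can be sketched as a small algorithm. This is an illustration in Python, not the canvas's actual implementation: the node and edge dict shapes, the `paste_nodes` name, the ID scheme, and the 40-pixel offset are all assumptions.

```python
import itertools

_id_counter = itertools.count(1)  # hypothetical ID scheme for pasted nodes


def paste_nodes(nodes, edges, selected_ids, offset=(40, 40)):
    """Duplicate the selected nodes plus the edges that lie entirely inside the selection."""
    selected = set(selected_ids)
    id_map = {}
    new_nodes = []
    for node in nodes:
        if node["id"] not in selected:
            continue
        new_id = f"{node['id']}-copy-{next(_id_counter)}"
        id_map[node["id"]] = new_id
        new_nodes.append({
            **node,
            "id": new_id,
            "x": node["x"] + offset[0],
            "y": node["y"] + offset[1],
        })
    # Keep an edge only if both of its endpoints were copied.
    new_edges = [
        {"source": id_map[e["source"]], "target": id_map[e["target"]]}
        for e in edges
        if e["source"] in id_map and e["target"] in id_map
    ]
    return new_nodes, new_edges


nodes = [
    {"id": "load", "x": 0, "y": 0},
    {"id": "scale", "x": 200, "y": 0},
    {"id": "train", "x": 400, "y": 0},
]
edges = [
    {"source": "load", "target": "scale"},
    {"source": "scale", "target": "train"},
]

# Copy "load" and "scale"; the edge into "train" is dropped because "train" is not selected.
pasted_nodes, pasted_edges = paste_nodes(nodes, edges, ["load", "scale"])
```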