Architecture
Skyulf is split into three components with strict boundaries:
- `skyulf-core` (the focus of this documentation set)
  - A standalone Python ML library.
  - Implements a strict Calculator → Applier pattern for every node.
  - Depends on Pandas/NumPy/Scikit-Learn (and a few optional ML utilities).
- `backend`
  - FastAPI + Celery orchestration layer.
  - Handles ingestion, jobs, and persistence, and exposes REST APIs.
- `frontend`
  - React + TypeScript UI (the ML canvas).
  - Builds pipeline configs and talks to the backend.
The Calculator → Applier Pattern
`skyulf-core` separates learning from transformation:
- Calculator: `fit(data, config) -> params`
  - Learns statistics / encoders / models.
  - Returns a serializable `params` dictionary.
- Applier: `apply(data, params) -> transformed_data`
  - Stateless transformer.
  - Applies learned parameters.
This makes pipelines easier to persist and safer to run in production:
- Learning happens only on the training data.
- The learned state is explicit.
- Applying is pure and repeatable.
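To make the contract concrete, here is a minimal sketch of a hypothetical mean-imputation node. Only the `fit(data, config) -> params` and `apply(data, params)` signatures come from the pattern above; the class names, the `columns` config key, and the `means` params key are illustrative.

```python
import pandas as pd


class MeanImputerCalculator:
    """Learns per-column means from the data it is fitted on."""

    def fit(self, data: pd.DataFrame, config: dict) -> dict:
        columns = config.get("columns", list(data.columns))
        # Plain Python floats keep the params dictionary serializable.
        return {"means": {col: float(data[col].mean()) for col in columns}}


class MeanImputerApplier:
    """Stateless: fills missing values using previously learned means."""

    def apply(self, data: pd.DataFrame, params: dict) -> pd.DataFrame:
        return data.fillna(params["means"])
```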
Hybrid Engine (Polars & Pandas)
Skyulf employs a Hybrid Engine architecture to maximize performance:
- Polars: Used for high-performance data ingestion (ETL) and stateless transformations (Scaling, Imputation, Encoding) where possible.
- Pandas/Numpy: Used for stateful learning (Calculators) and compatibility with Scikit-Learn models.
The system automatically detects the input data type (`pd.DataFrame` or `pl.DataFrame`) and dispatches to the appropriate optimized path.
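A simplified sketch of what such dispatch can look like, using standard scaling as the example; the function and its `params` keys are illustrative and do not reflect `skyulf-core`'s actual internals.

```python
import pandas as pd
import polars as pl


def apply_standard_scaling(data, params):
    """Dispatch a stateless transform to an engine-specific path."""
    center, scale = params["center"], params["scale"]
    if isinstance(data, pl.DataFrame):
        # Polars path: vectorized column expressions.
        return data.with_columns(
            [(pl.col(c) - center[c]) / scale[c] for c in center]
        )
    if isinstance(data, pd.DataFrame):
        # Pandas path: keeps downstream Scikit-Learn compatibility.
        out = data.copy()
        for c in center:
            out[c] = (out[c] - center[c]) / scale[c]
        return out
    raise TypeError(f"Unsupported frame type: {type(data).__name__}")
```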
Node Registry
Skyulf uses a Registry Pattern to decouple the pipeline orchestrator from specific node implementations.
- Registration: Nodes self-register using the `@NodeRegistry.register("NodeName", ApplierClass)` decorator on the Calculator class.
- Discovery: The pipeline dynamically looks up the Calculator and Applier classes by name at runtime.
- Extensibility: New nodes can be added simply by creating a new file and decorating the class; no changes to `pipeline.py` are required.
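A minimal sketch of the idea. The `register` decorator signature matches the usage above; the internal storage and the `get` lookup method are assumptions.

```python
class NodeRegistry:
    """Maps node names to (Calculator, Applier) class pairs."""

    _nodes: dict[str, tuple[type, type]] = {}

    @classmethod
    def register(cls, name: str, applier_cls: type):
        def decorator(calculator_cls: type):
            cls._nodes[name] = (calculator_cls, applier_cls)
            return calculator_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> tuple[type, type]:
        # The orchestrator resolves node names like this at runtime.
        return cls._nodes[name]


class StandardScalerApplier:
    ...


# Adding a node is just a decorated Calculator class in a new file:
@NodeRegistry.register("StandardScaler", StandardScalerApplier)
class StandardScalerCalculator:
    ...
```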
Data Catalog
To decouple data loading from the execution engine, Skyulf uses a Data Catalog pattern.
- Interface: `DataCatalog` (in `skyulf-core`) defines the contract for loading data by identifier.
- Implementation: `FileSystemCatalog` (in the `backend`) implements this interface to load files from the local filesystem.
- Usage: The `PipelineEngine` is injected with a catalog instance. Nodes request data by ID (or path), and the catalog handles the retrieval.
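A sketch of the contract, assuming an abstract `load` method; the method name, constructor, and CSV format are illustrative, only the class names come from the description above.

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataCatalog(ABC):
    """Contract for resolving a dataset identifier to a DataFrame."""

    @abstractmethod
    def load(self, dataset_id: str) -> pd.DataFrame:
        ...


class FileSystemCatalog(DataCatalog):
    """Backend-side implementation reading from the local filesystem."""

    def __init__(self, root: str) -> None:
        self.root = root

    def load(self, dataset_id: str) -> pd.DataFrame:
        # Treat the identifier as a path relative to the catalog root.
        return pd.read_csv(f"{self.root}/{dataset_id}")
```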
Pipeline Data Flow
At runtime, `SkyulfPipeline` orchestrates:
- Preprocessing: `FeatureEngineer`
  - Executes a list of steps (each step is a transformer).
  - Some steps change the data structure (e.g., splitters) and are handled specially.
- Modeling: `StatefulEstimator`
  - Trains a model on the train split.
  - Optionally evaluates on the test/validation splits.
High-level flow:
    Raw DataFrame
    └─ FeatureEngineer.fit_transform(...) -> DataFrame or SplitDataset
       └─ (optionally) SplitDataset.train / test / validation
          └─ StatefulEstimator.fit_predict(...) -> predictions
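Expressed as driver code, the same flow might look like the sketch below. The import path, constructor arguments, and the `preprocessing_steps` / `raw_df` / `model_config` placeholders are assumptions; only the `fit_transform` and `fit_predict` calls come from the flow above.

```python
# Hypothetical import path and constructor arguments.
from skyulf_core import FeatureEngineer, StatefulEstimator

engine = FeatureEngineer(steps=preprocessing_steps)
result = engine.fit_transform(raw_df)         # DataFrame or SplitDataset

estimator = StatefulEstimator(model_config)
predictions = estimator.fit_predict(result)   # trains on the train split
```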
Avoiding Data Leakage
If you split first (or provide a `SplitDataset`), calculators should learn only on the train split.
This prevents statistics from the test/validation splits from leaking into the learned parameters.
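Reusing the hypothetical imputer classes from the Calculator → Applier sketch above, and given a `SplitDataset` named `split`, the safe pattern looks like this:

```python
# Fit only on the train split; apply the same learned params everywhere.
params = MeanImputerCalculator().fit(split.train, config={"columns": ["age"]})

train = MeanImputerApplier().apply(split.train, params)
test = MeanImputerApplier().apply(split.test, params)  # no test statistics leak in
```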
See the User Guide section “SplitDataset & Leakage” for recommended patterns.