
Architecture

Skyulf is split into three pieces with strict boundaries:

  1. skyulf-core (the focus of this documentation set)
       • A standalone Python ML library.
       • Implements a strict Calculator → Applier pattern for every node.
       • Depends on Pandas/Numpy/Scikit-Learn (and a few optional ML utilities).
  2. backend
       • FastAPI + Celery orchestration layer.
       • Handles ingestion, jobs, persistence, and exposes REST APIs.
  3. frontend
       • React + TypeScript UI (ML canvas).
       • Builds pipeline configs and talks to the backend.

The Calculator → Applier Pattern

Skyulf-core separates learning from transformation:

  • Calculator: fit(data, config) -> params
       • Learns statistics / encoders / models.
       • Returns a serializable params dictionary.
  • Applier: apply(data, params) -> transformed_data
       • Stateless transformer.
       • Applies learned parameters.
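
A minimal sketch of the pattern, using a mean-imputation node as an example (the class names and method bodies here are illustrative, not the exact skyulf-core API):

    import pandas as pd

    class MeanImputeCalculator:
        """Learns per-column means from the training data."""
        def fit(self, data: pd.DataFrame, config: dict) -> dict:
            cols = config.get("columns", list(data.columns))
            # params is plain data, so it can be serialized (e.g. to JSON)
            return {"means": {c: float(data[c].mean()) for c in cols}}

    class MeanImputeApplier:
        """Stateless: fills missing values with previously learned means."""
        def apply(self, data: pd.DataFrame, params: dict) -> pd.DataFrame:
            return data.fillna(value=params["means"])

Because params holds only plain values, the learned state can be stored alongside the pipeline config and re-applied later without re-fitting.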

This makes pipelines easier to persist and safer to run in production:

  • Learning happens on train.
  • The learned state is explicit.
  • Applying is pure and repeatable.

Hybrid Engine (Polars & Pandas)

Skyulf employs a Hybrid Engine architecture to maximize performance:

  • Polars: Used for high-performance data ingestion (ETL) and stateless transformations (Scaling, Imputation, Encoding) where possible.
  • Pandas/Numpy: Used for stateful learning (Calculators) and compatibility with Scikit-Learn models.

The system automatically detects the input data type (pd.DataFrame or pl.DataFrame) and dispatches to the appropriate optimized path.
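
A simplified sketch of what such dispatch can look like for a stateless min-max scaling step (the helper function below is hypothetical; the real engine's internals may differ):

    import pandas as pd
    import polars as pl

    def apply_min_max(data, params):
        """Dispatch to the Polars or Pandas path based on the input type."""
        if isinstance(data, pl.DataFrame):
            # Polars path: columnar expressions for stateless transforms.
            return data.with_columns(
                [(pl.col(c) - lo) / (hi - lo) for c, (lo, hi) in params.items()]
            )
        if isinstance(data, pd.DataFrame):
            # Pandas path: kept for Scikit-Learn compatibility.
            out = data.copy()
            for c, (lo, hi) in params.items():
                out[c] = (out[c] - lo) / (hi - lo)
            return out
        raise TypeError(f"Unsupported input type: {type(data)!r}")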

Node Registry

Skyulf uses a Registry Pattern to decouple the pipeline orchestrator from specific node implementations.

  • Registration: Nodes self-register using the @NodeRegistry.register("NodeName", ApplierClass) decorator on the Calculator class.
  • Discovery: The pipeline dynamically looks up the Calculator and Applier classes by name at runtime.
  • Extensibility: New nodes can be added simply by creating a new file and decorating the class; no changes to pipeline.py are required.
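
The mechanics of such a registry can be sketched as follows (illustrative only; the real NodeRegistry in skyulf-core may store additional metadata):

    class NodeRegistry:
        """Maps node names to (Calculator, Applier) class pairs."""
        _nodes = {}

        @classmethod
        def register(cls, name, applier_cls):
            def decorator(calculator_cls):
                cls._nodes[name] = (calculator_cls, applier_cls)
                return calculator_cls
            return decorator

        @classmethod
        def get(cls, name):
            return cls._nodes[name]

    class StandardScalerApplier:
        def apply(self, data, params):
            out = data.copy()
            for c in params["mean"]:
                out[c] = (out[c] - params["mean"][c]) / params["std"][c]
            return out

    @NodeRegistry.register("StandardScaler", StandardScalerApplier)
    class StandardScalerCalculator:
        def fit(self, data, config):
            cols = config["columns"]
            return {"mean": data[cols].mean().to_dict(),
                    "std": data[cols].std().to_dict()}

    # The orchestrator resolves nodes by name, never by import path:
    calculator_cls, applier_cls = NodeRegistry.get("StandardScaler")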

Data Catalog

To decouple data loading from the execution engine, Skyulf uses a Data Catalog pattern.

  • Interface: DataCatalog (in skyulf-core) defines the contract for loading data by identifier.
  • Implementation: FileSystemCatalog (in backend) implements this interface to load files from the local filesystem.
  • Usage: The PipelineEngine is injected with a catalog instance. Nodes request data by ID (or path), and the catalog handles the retrieval.
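
A sketch of the shape of this pattern (signatures here are assumptions for illustration; see the actual DataCatalog and FileSystemCatalog classes for the real contract):

    from abc import ABC, abstractmethod
    from pathlib import Path
    import pandas as pd

    class DataCatalog(ABC):
        """Contract: resolve a dataset identifier into a DataFrame."""
        @abstractmethod
        def load(self, identifier: str) -> pd.DataFrame: ...

    class FileSystemCatalog(DataCatalog):
        """Resolves identifiers to files under a base directory."""
        def __init__(self, base_dir: str):
            self.base_dir = Path(base_dir)

        def load(self, identifier: str) -> pd.DataFrame:
            path = self.base_dir / identifier
            if path.suffix == ".parquet":
                return pd.read_parquet(path)
            return pd.read_csv(path)

Because the engine depends only on the interface, the storage backend can later be swapped (for example, an object-store-backed catalog) without touching node code.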

Pipeline Data Flow

At runtime, SkyulfPipeline orchestrates:

  1. Preprocessing: FeatureEngineer
       • Executes a list of steps (each step is a transformer).
       • Some steps change the data structure (e.g., splitters) and are handled specially.
  2. Modeling: StatefulEstimator
       • Trains a model on the train split.
       • Optionally evaluates on test/validation.

High-level flow:

Raw DataFrame
  └─ FeatureEngineer.fit_transform(...)  -> DataFrame or SplitDataset
        └─ (optionally) SplitDataset.train / test / validation
              └─ StatefulEstimator.fit_predict(...) -> predictions
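
The same flow written as hypothetical calling code (feature_engineer, stateful_estimator, and the catalog identifier are placeholders):

    raw_df = catalog.load("my_dataset.csv")                 # via the Data Catalog
    dataset = feature_engineer.fit_transform(raw_df)        # DataFrame or SplitDataset
    predictions = stateful_estimator.fit_predict(dataset)   # trains, then predicts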

Avoiding Data Leakage

If you split first (or provide a SplitDataset), calculators should learn only on the train split. This prevents leakage of statistics from test/validation.
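
For example, with the Calculator → Applier sketch from earlier on this page and a hypothetical SplitDataset named split:

    # Learn parameters on the train split only...
    params = MeanImputeCalculator().fit(split.train, config)

    # ...then apply the same learned params to every split.
    applier = MeanImputeApplier()
    train_df = applier.apply(split.train, params)
    test_df = applier.apply(split.test, params)          # no test statistics leak in
    val_df = applier.apply(split.validation, params)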

See the User Guide section “SplitDataset & Leakage” for recommended patterns.