Skip to content

Skyulf Architecture & Data Flow

1. The Dual-Engine Strategy: Polars & Pandas

Skyulf uses a hybrid approach to data processing to balance performance (Ingestion) with compatibility (Machine Learning).

A. Ingestion & Preview (Polars)

  • Engine: Polars
  • Why: Polars is significantly faster than Pandas for reading large files (CSV, Parquet) and performing initial scans. It uses lazy evaluation and multi-threading.
  • Where:
    • backend/data_ingestion/: Reading uploaded files.
    • backend/services/data_service.py: Generating data previews and samples for the UI.
  • Format: Data is kept in Polars DataFrames or converted to Python dictionaries (to_dicts()) for JSON API responses.

B. Machine Learning Core (Pandas/Numpy)

  • Engine: Pandas & Numpy
  • Why: The vast majority of the Python ML ecosystem (Scikit-Learn, XGBoost, LightGBM) is built around Numpy arrays and Pandas DataFrames.
  • Where:
    • skyulf-core/: The actual ML pipeline execution.
    • backend/ml_pipeline/execution/: The orchestration layer that runs the core library.
  • Format: Data is converted to Pandas DataFrames before entering the SkyulfPipeline.

C. The Bridge: Apache Arrow

  • Technology: Apache Arrow
  • Role: Arrow is the in-memory columnar format that both Polars and Pandas (2.0+) support.
  • Benefit: It allows for zero-copy (or near zero-copy) conversion between Polars and Pandas. When we load data with Polars and then hand it to Scikit-Learn, we aren't serializing/deserializing text; we are just passing memory pointers. This makes the "switch" extremely efficient.

2. Future Architecture: The AI Hub

Skyulf is evolving from a Tabular ML tool into a multi-modal AI Hub.

The "Node" Abstraction

The core architecture (Graph -> Nodes -> Artifacts) is agnostic to the data type. * Today: Nodes process pd.DataFrame. * Tomorrow: Nodes will process ImageBatch, TextCorpus, or HuggingFaceDataset.

Planned Engines

We will introduce specialized engines alongside PandasEngine and PolarsEngine: 1. TorchEngine: For Deep Learning workflows (PyTorch). 2. HuggingFaceEngine: For NLP pipelines (Tokenizers, Transformers). 3. LlamaIndexEngine: For RAG and LLM orchestration.

Artifact Store Evolution

The ArtifactStore will evolve to handle: * Large Blobs: Storing images/audio directly or via S3 references. * Model Weights: Managing .pt, .onnx, and .gguf files efficiently. * Vector Indices: Storing FAISS/ChromaDB indices for RAG.

See ROADMAP.md for the detailed timeline.