Skyulf Engine Mechanics

This guide explains how Skyulf handles different dataframes (Polars and Pandas) under the hood, ensuring both high performance and compatibility with the Python ML ecosystem (Scikit-Learn).

1. What is `SkyulfDataFrame`?

You might see SkyulfDataFrame in our type hints. It is not a new class. It is a Type Alias (Protocol) that means: "This variable can be either a polars.DataFrame or a pandas.DataFrame."

# Conceptually:
SkyulfDataFrame = Union[pl.DataFrame, pd.DataFrame]

This ensures our code knows how to handle both formats without forcing you to convert everything manually.

2. Engine Detection: How do we know which one it is?

We don't ask you to specify the engine. We detect it automatically using the get_engine(df) utility.

The Logic (`skyulf.engines`)

When a DataFrame enters a Calculator or Applier:

1. We check isinstance(df, polars.DataFrame).
   - If True: We tag it as Engine.POLARS.
2. If not, we check isinstance(df, pandas.DataFrame).
   - If True: We tag it as Engine.PANDAS.
3. Otherwise, we raise an error.

from skyulf.engines import get_engine

def fit(self, df: SkyulfDataFrame, ...):
    engine = get_engine(df)

    if engine.name == "polars":
        ...  # Run optimized Polars logic
    else:
        ...  # Run standard Pandas logic

3. The Hybrid Architecture

We want the speed of Polars but the ecosystem of Scikit-Learn. To achieve this, we use a hybrid strategy.

Path A: Pure Polars (The Fast Path)

For operations that Polars supports natively, we stay in Polars. No conversion step is needed, so this path avoids extra copies and is extremely fast.

- Examples: StandardScaler, MinMaxScaler, SimpleImputing (mean/median), LogTransformation, Encoding.
- Mechanism: We construct Polars expressions (e.g., pl.col(c).mean()) and apply them lazily or eagerly.

Path B: The "Hybrid Bridge" (Compatibility Path)

For complex algorithms that only exist in Scikit-Learn (e.g., IsolationForest, RFE, PolynomialFeatures), we temporarily bridge to Pandas/Numpy.

The Workflow:

1. Input: Receive pl.DataFrame.
2. Bridge: Convert to pd.DataFrame (using Arrow for speed).
3. Compute: Run Scikit-Learn (e.g., sklearn_model.fit(df_pandas)).
4. Return:
   - If the result is small (parameters), we just store them.
   - If the result is data (transformed rows), we convert the result back to pl.DataFrame.

Why Arrow? (The "Zero-Copy" Magic)

A common fear with hybrid systems is: "Won't converting data back and forth double my memory usage and kill performance?"

This is where Apache Arrow comes in.

- Polars is built on top of the Arrow memory format.
- Pandas (2.0+) supports Arrow backends, and even older Pandas can ingest Arrow very efficiently.

When we call df_polars.to_pandas(), the data is not serialized into Python objects one value at a time (which would be slow). The conversion goes through Arrow memory buffers instead. This is often zero-copy (or near zero-copy), meaning the data stays in the same place in RAM and Pandas just "views" it, although some dtypes and conversion options still force a copy. This makes the hybrid bridge far lighter than round-tripping through CSV or Python objects.

The `SklearnBridge` Utility

To keep our code clean, we use a utility called SklearnBridge in skyulf.engines.sklearn_bridge.

Instead of writing if polars: to_numpy() everywhere, SklearnBridge handles the standardization:

- Standardizes Input: Accepts Polars DF, Pandas DF, or Tuples (X, y).
- Handles Targets (The "Flattening" Problem):
  - The Issue: DataFrames treat a single column as a 2D structure (a list of lists, like [[1], [2], [3]], shape (N, 1)).
  - The Need: Scikit-Learn models often expect the target y to be a simple 1D array (like [1, 2, 3], shape (N,)).
  - The Fix: The Bridge automatically detects this and "flattens" (ravels) the array so Scikit-Learn doesn't throw a shape mismatch error.
- Safe Output: Ensures the result is always a NumPy array ready for .fit().
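
The flattening step can be sketched in a few lines of NumPy. `to_sklearn_sketch` and `_as_array` are hypothetical helpers written for this guide, not the actual `SklearnBridge` code, and the sketch assumes the target is passed explicitly as part of an (X, y) tuple.

```python
import numpy as np
import pandas as pd


def _as_array(obj):
    """Convert a frame, series, or array-like to a NumPy array."""
    # Polars and Pandas objects both expose .to_numpy().
    if hasattr(obj, "to_numpy"):
        return obj.to_numpy()
    return np.asarray(obj)


def to_sklearn_sketch(data):
    """Hypothetical stand-in for SklearnBridge.to_sklearn."""
    # Accept either a bare feature frame or an (X, y) tuple.
    X, y = data if isinstance(data, tuple) else (data, None)
    X_matrix = _as_array(X)
    if y is None:
        return X_matrix, None
    y_vector = _as_array(y)
    # A single-column frame arrives as shape (N, 1); ravel it to (N,).
    if y_vector.ndim == 2 and y_vector.shape[1] == 1:
        y_vector = y_vector.ravel()
    return X_matrix, y_vector


X = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0]})
y = pd.DataFrame({"label": [0, 1, 0]})  # single-column frame, shape (3, 1)
X_matrix, y_vector = to_sklearn_sketch((X, y))
```

Relying on `.to_numpy()` keeps the helper free of engine-specific branches, since both Polars and Pandas provide that method.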

# Inside your complex Calculator
from sklearn.ensemble import IsolationForest

from skyulf.engines.sklearn_bridge import SklearnBridge

def fit(self, df, ...):
    # Doesn't matter if df is Pandas or Polars
    X_matrix, y_vector = SklearnBridge.to_sklearn(df)

    # IsolationForest is unsupervised: fit() accepts y but ignores it
    model = IsolationForest()
    model.fit(X_matrix, y_vector)

This isolates the "compatibility boilerplate" from your ML logic. It also lets us support Scikit-Learn-only features without rewriting Scikit-Learn from scratch in Polars.

4. Calculator vs. Applier in the Hybrid World

Calculator (`fit`)

- Polars Engine: Calculates stats (min, max, mean) using fast Polars aggregations. Returns a simple Python dictionary (e.g., {'mean': 5.2}).
- Pandas Engine: Calculates stats using .mean(). Returns the same dictionary structure.
- Result: The params dictionary is engine-agnostic. It doesn't care where it came from.

Applier (`apply`)

- Receives the agnostic params.
- Checks the current dataframe's engine.
  - If Polars: Uses pl.col("A") - params["mean"].
  - If Pandas: Uses df["A"] - params["mean"].

This means you can train on Pandas and predict on Polars (or vice versa)!

Summary

Feature	Polars Path	Pandas Path
Speed	🚀 High	🐢 Standard
Memory	Efficient (Rust)	High Overhead
Compatibility	Native ops (Expr)	Full Scikit-Learn support
Usage	Default for ETL/Serving	Default for Training complex models

Skyulf handles the switching for you. You just pass the data.