Quickstart Guide

This guide demonstrates how to create a simple end-to-end pipeline using Skyulf Core.

1. Setup

First, import the necessary modules.

import numpy as np
import pandas as pd
from skyulf import SkyulfPipeline

2. Create Dummy Data

We'll create a synthetic dataset with some missing values and a categorical feature.

def create_dummy_data(n: int = 200) -> pd.DataFrame:
    np.random.seed(42)
    df = pd.DataFrame({
        'age': np.random.randint(18, 80, n),
        'income': np.random.normal(50000, 15000, n),
        'city': np.random.choice(['New York', 'London', 'Paris'], n),
        'is_customer': np.random.choice([0, 1], n),
    })
    # Introduce missing values in the first 11 rows of 'income' (.loc slicing is inclusive)
    df.loc[0:10, 'income'] = np.nan
    return df

data = create_dummy_data()
print(data.head())

Output:

   age        income      city  is_customer
0   56           NaN     Paris            0
1   69           NaN  New York            1
2   46           NaN     Paris            0
3   32           NaN     Paris            0
4   60           NaN  New York            0
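
As a quick sanity check, you can count the missing values per column with plain pandas (this is not a Skyulf API); given the code above, only 'income' should contain NaNs.

# Count missing values per column -- rows 0 through 10 of 'income' were set to NaN, so expect 11
print(data.isna().sum())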

3. Define Pipeline Configuration

Skyulf pipelines are defined using a JSON-compatible dictionary, which makes them easy to serialize and store (a JSON round-trip sketch follows the configuration below).

config = {
    'preprocessing': [
        # 1. Split Data into Train/Test
        {
            'name': 'split_data',
            'transformer': 'TrainTestSplitter',
            'params': {
                'test_size': 0.2,
                'target_column': 'is_customer',
            },
        },
        # 2. Impute Missing Income
        {
            'name': 'impute_income',
            'transformer': 'SimpleImputer',
            'params': {
                'columns': ['income'],
                'strategy': 'mean',
            },
        },
        # 3. Encode City (Categorical)
        {
            'name': 'encode_city',
            'transformer': 'OneHotEncoder',
            'params': {'columns': ['city']},
        },
        # 4. Scale Numeric Features
        {
            'name': 'scale_features',
            'transformer': 'StandardScaler',
            'params': {'columns': ['age', 'income']},
        },
    ],
    'modeling': {
        'type': 'random_forest_classifier',
        'params': {'n_estimators': 50, 'max_depth': 5},
    },
}
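
Because the configuration holds only plain Python types, it round-trips cleanly through the standard json module. The snippet below is a minimal sketch; the file name pipeline_config.json is arbitrary.

import json

# Write the configuration to disk and read it back unchanged
with open('pipeline_config.json', 'w') as f:
    json.dump(config, f, indent=2)

with open('pipeline_config.json') as f:
    restored = json.load(f)

assert restored == config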

4. Run Pipeline

Initialize the pipeline with the configuration and fit it on the data. The fit call returns a nested dictionary of metrics, which you can index directly (see the sketch after the output below).

pipeline = SkyulfPipeline(config)
metrics = pipeline.fit(data, target_column='is_customer')
print(metrics)

Output:

{
    'preprocessing': {...},
    'modeling': {
        'accuracy': 0.85,
        'f1_score': 0.82,
        ...
    }
}
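
Individual scores can be pulled straight out of the nested metrics dictionary. The sketch below assumes the dictionary shape shown above; keys other than those printed are not guaranteed.

# Read individual scores from the nested metrics dictionary
model_metrics = metrics['modeling']
print(f"Accuracy: {model_metrics['accuracy']:.2f}")
print(f"F1 score: {model_metrics['f1_score']:.2f}")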

5. Save and Load

Pipelines can be saved to disk and reloaded for inference.

import os

artifact_path = 'my_model.pkl'
pipeline.save(artifact_path)

# Load back
loaded = SkyulfPipeline.load(artifact_path)

# Predict on new data
new_data = pd.DataFrame({
    'age': [25, 40],
    'income': [60000, np.nan],
    'city': ['London', 'Paris'],
})

predictions = loaded.predict(new_data)
print(predictions)

# Cleanup
if os.path.exists(artifact_path):
    os.remove(artifact_path)

Output:

0    0
1    1
dtype: int64
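
Since predict returns a pandas Series aligned with the rows of new_data (as the output above shows), the results can be joined back onto the input frame for inspection. This is plain pandas; the column name 'prediction' is just an illustration.

# Attach predictions to the corresponding input rows
results = new_data.copy()
results['prediction'] = predictions
print(results)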