Data Validation Expectations

skyulf.profiling.expect is a lightweight data-validation helper: a small subset of what Great Expectations offers, with no extra dependencies. Each expect_* function checks a single condition on a DataFrame and raises ExpectationError with a precise message when that condition is violated.

It is engine-agnostic: Pandas frames are used directly; Polars (or any frame exposing to_pandas()) is converted first.
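The conversion helper itself is private and not shown on this page. As a rough sketch of the behaviour described above, the dispatch could look like the following. The names `as_pandas` and `FakeFrame` are hypothetical, and raising `TypeError` for unsupported inputs is an assumption, not documented behaviour.

```python
from typing import Any

import pandas as pd


def as_pandas(df: Any) -> pd.DataFrame:
    """Hypothetical stand-in for the library's private conversion helper."""
    if isinstance(df, pd.DataFrame):
        return df  # pandas frames are used directly
    if hasattr(df, "to_pandas"):
        return df.to_pandas()  # Polars-style frames are converted first
    # Assumption: the real helper may handle this case differently.
    raise TypeError(f"Unsupported frame type: {type(df).__name__}")


class FakeFrame:
    """Illustrative only: any object exposing to_pandas() qualifies."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def to_pandas(self) -> pd.DataFrame:
        return pd.DataFrame(self._data)


converted = as_pandas(FakeFrame({"age": [21, 35, 40]}))
print(list(converted.columns))
```

Because the check is duck-typed on `to_pandas()`, a frame type needs no explicit registration to be accepted, at the cost of one conversion per call for non-pandas inputs.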

When to use it

These are manual assertions — they are not wired into profiling or CI automatically. You call them yourself in two main places:

  1. In tests / CI — guard a dataset contract so a bad upstream change fails the build.
  2. In a pipeline — assert preconditions before an expensive step, so you get a clear error instead of a deep traceback later.

Available expectations

| Function | Checks |
| --- | --- |
| expect_columns_exist(df, columns) | Every name in columns is present. |
| expect_no_nulls(df, columns=None) | The given columns (default: all) have no nulls. |
| expect_value_range(df, column, *, minimum=None, maximum=None, inclusive=True) | All non-null values fall within [minimum, maximum]; either bound may be omitted. |
| expect_unique(df, columns) | The combination of columns has no duplicate rows. |

Example: a dataset contract in CI

import pandas as pd
from skyulf.profiling.expect import (
    expect_columns_exist,
    expect_no_nulls,
    expect_value_range,
    expect_unique,
    ExpectationError,
)


def validate_customers(df: pd.DataFrame) -> None:
    """Raises ExpectationError if the customers frame breaks its contract."""
    expect_columns_exist(df, ["customer_id", "age", "signup_date"])
    expect_unique(df, ["customer_id"])
    expect_no_nulls(df, ["customer_id", "signup_date"])
    expect_value_range(df, "age", minimum=0, maximum=120)

Wire it into a test so CI enforces it:

def test_customers_contract():
    df = pd.read_parquet("data/customers.parquet")
    validate_customers(df)  # raises ExpectationError on violation → test fails

Example: a pipeline guard

from skyulf.profiling.expect import expect_no_nulls

def run(df):
    # Fail fast with a clear message before an expensive fit.
    expect_no_nulls(df, ["target"])
    ...

API reference

skyulf.profiling.expect

Lightweight data-validation expectations (no Great Expectations dependency).

Each expect_* function checks a single condition on a DataFrame and raises ExpectationError with a precise message when the condition is violated. Pure-Python and engine-agnostic: Pandas frames are used directly; Polars (or any frame exposing to_pandas()) is converted first.

Example

import pandas as pd
from skyulf.profiling.expect import expect_no_nulls, expect_value_range

df = pd.DataFrame({"age": [21, 35, 40]})
expect_no_nulls(df)
expect_value_range(df, "age", minimum=0, maximum=120)

ExpectationError

Bases: ValueError

Raised when a data-validation expectation is not met.
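Because ExpectationError derives from ValueError, an existing `except ValueError` handler will also catch it. The snippet below illustrates this with a local copy of the class so that it runs without the library installed; `guarded_step` and the error message are made up for the example.

```python
class ExpectationError(ValueError):
    """Local copy of the documented class, so the snippet runs standalone."""


def guarded_step() -> str:
    try:
        # Stand-in for a failing expect_* call.
        raise ExpectationError("Null values found in columns: {'target': 3}")
    except ValueError as exc:  # also catches ExpectationError
        return f"validation failed: {exc}"


message = guarded_step()
print(message)
```

If validation failures need to be told apart from other value errors, catching ExpectationError explicitly is the safer choice.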

Source code in skyulf-core/skyulf/profiling/expect.py
class ExpectationError(ValueError):
    """Raised when a data-validation expectation is not met."""

expect_columns_exist(df, columns)

Assert that every name in columns is present in df.

Source code in skyulf-core/skyulf/profiling/expect.py
def expect_columns_exist(df: Any, columns: Sequence[str]) -> None:
    """Assert that every name in ``columns`` is present in ``df``."""
    frame = _as_pandas(df)
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise ExpectationError(f"Expected columns are missing: {missing}")

expect_no_nulls(df, columns=None)

Assert that the given columns (default: all) contain no null values.

Source code in skyulf-core/skyulf/profiling/expect.py
def expect_no_nulls(df: Any, columns: Optional[Sequence[str]] = None) -> None:
    """Assert that the given columns (default: all) contain no null values."""
    frame = _as_pandas(df)
    cols = _resolve_columns(frame, columns)
    null_counts = {c: int(frame[c].isnull().sum()) for c in cols}
    offenders = {c: n for c, n in null_counts.items() if n > 0}
    if offenders:
        raise ExpectationError(f"Null values found in columns: {offenders}")
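To see what the error message reports, the counting logic from the source above can be reproduced on a small frame. This is a standalone illustration using only pandas; the sample data is invented.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "signup_date": ["2024-01-01", None, None],
    }
)

# Same two steps as in expect_no_nulls: count nulls, keep only offenders.
null_counts = {c: int(frame[c].isnull().sum()) for c in frame.columns}
offenders = {c: n for c, n in null_counts.items() if n > 0}

message = f"Null values found in columns: {offenders}"
print(message)
# Null values found in columns: {'signup_date': 2}
```

The message therefore names every offending column together with its null count, and omits columns that are clean.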

expect_unique(df, columns)

Assert that the combination of columns has no duplicate rows.

Source code in skyulf-core/skyulf/profiling/expect.py
def expect_unique(df: Any, columns: Sequence[str]) -> None:
    """Assert that the combination of ``columns`` has no duplicate rows."""
    frame = _as_pandas(df)
    expect_columns_exist(frame, columns)
    duplicated = frame.duplicated(subset=list(columns), keep=False)
    dup_count = int(duplicated.sum())
    if dup_count:
        raise ExpectationError(
            f"Expected unique values for {list(columns)} but found {dup_count} duplicate rows"
        )
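One detail worth noting in the source above is `keep=False`: the reported count includes every row that belongs to a duplicate group, not only the repeats after the first occurrence. A small pandas-only comparison, with invented sample data:

```python
import pandas as pd

frame = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3]})

# keep=False flags every member of a duplicate group (as expect_unique does).
all_members = int(frame.duplicated(subset=["customer_id"], keep=False).sum())

# The pandas default, keep="first", flags only the later repeats.
extras_only = int(frame.duplicated(subset=["customer_id"]).sum())

print(all_members, extras_only)
# 5 3
```

So a message reporting "5 duplicate rows" for this frame means five rows share a key with at least one other row.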

expect_value_range(df, column, *, minimum=None, maximum=None, inclusive=True)

Assert that all values in column fall within [minimum, maximum].

minimum / maximum are optional (open-ended on the unset side). Null values are ignored. Set inclusive=False for a strict comparison.

Source code in skyulf-core/skyulf/profiling/expect.py
def expect_value_range(
    df: Any,
    column: str,
    *,
    minimum: Optional[float] = None,
    maximum: Optional[float] = None,
    inclusive: bool = True,
) -> None:
    """Assert that all values in ``column`` fall within ``[minimum, maximum]``.

    ``minimum`` / ``maximum`` are optional (open-ended on the unset side).
    Null values are ignored. Set ``inclusive=False`` for a strict comparison.
    """
    frame = _as_pandas(df)
    expect_columns_exist(frame, [column])
    series = frame[column].dropna()
    _check_lower_bound(series, column, minimum, inclusive)
    _check_upper_bound(series, column, maximum, inclusive)
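The two bound-checking helpers are not shown on this page. The sketch below is one plausible way the lower-bound check could honour `inclusive` and an unset bound. The function name `check_lower_bound` and the wording of its error message are assumptions, and ExpectationError is redefined locally so the snippet runs on its own.

```python
from typing import Optional

import pandas as pd


class ExpectationError(ValueError):
    """Local copy so the sketch runs standalone."""


def check_lower_bound(
    series: pd.Series, column: str, minimum: Optional[float], inclusive: bool
) -> None:
    """Hypothetical helper; the library's own implementation may differ."""
    if minimum is None:
        return  # open-ended on the unset side
    # inclusive=True allows values equal to the bound; strict mode does not.
    bad = series < minimum if inclusive else series <= minimum
    if bad.any():
        raise ExpectationError(
            f"{int(bad.sum())} value(s) in '{column}' violate minimum {minimum}"
        )


ages = pd.Series([0, 21, 35, None]).dropna()  # nulls are dropped before checking

check_lower_bound(ages, "age", 0, inclusive=True)  # passes: 0 >= 0

try:
    check_lower_bound(ages, "age", 0, inclusive=False)  # fails: 0 is not > 0
    strict_failed = False
except ExpectationError:
    strict_failed = True

print(strict_failed)
```

An upper-bound check would mirror this with the comparisons reversed.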